Analyzing GitHub Pull Request Data with BigQuery
I really enjoyed Felipe Hoffa’s post on Analyzing GitHub issues and comments with BigQuery .
Which got me wondering about my favorite subject ever, The Art of Closing. I wonder what the stats are for the top 15 projects on GitHub in terms of pull requests opened vs. pull requests closed. This post will use the GitHub Archive dataset.
Top 15 repositories with the most pull requests
First let’s find the top 15 repos with the most pull requests from 2015. Let’s make sure to check the payload action is ”opened”.
SELECT
repo.name,
COUNT(*) c
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"opened"')
GROUP BY
repo.name
ORDER BY
c DESC
LIMIT
15
repo_name | c |
---|---|
openmicroscopy/snoopys-sandbox | 11656 |
brianchandotcom/liferay-portal | 10803 |
Homebrew/homebrew | 9519 |
caskroom/homebrew-cask | 6833 |
apache/spark | 6667 |
saltstack/salt | 6636 |
mozilla-b2g/gaia | 6609 |
jlord/patchwork | 6155 |
GoogleCloudPlatform/kubernetes | 5937 |
jsdelivr/jsdelivr | 5747 |
rust-lang/rust | 5559 |
cms-sw/cmssw | 5507 |
code-dot-org/code-dot-org | 5267 |
docker/docker | 5083 |
NixOS/nixpkgs | 4873 |
Okay that’s a lot of pull requests. Let’s find the projects will the most unique number of pull request authors.
SELECT
repo.name,
COUNT(*) c,
COUNT(DISTINCT actor.id) authors,
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"opened"')
GROUP BY
repo.name
ORDER BY
authors DESC
LIMIT
15
repo_name | c | authors |
---|---|---|
jlord/patchwork | 6155 | 5396 |
octocat/Spoon-Knife | 3966 | 3741 |
deadlyvipers/dojo_rules | 4847 | 3076 |
Homebrew/homebrew | 9519 | 2186 |
udacity/create-your-own-adventure | 2709 | 2167 |
caskroom/homebrew-cask | 6833 | 1517 |
borisyankov/DefinitelyTyped | 2694 | 1127 |
rails/rails | 3100 | 1012 |
LarryMad/recipes | 1086 | 989 |
laravel/framework | 2736 | 891 |
docker/docker | 5083 | 882 |
rdpeng/ProgrammingAssignment2 | 922 | 866 |
apache/spark | 6667 | 851 |
JetBrains/swot | 951 | 836 |
rust-lang/rust | 5559 | 835 |
Now let’s see what the merge vs. close numbers look like for those projects.
SELECT
repo.name,
COUNT(*) c,
COUNT(DISTINCT actor.id) authors,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed,
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"closed"')
GROUP BY
repo.name
ORDER BY
authors DESC
LIMIT
15
repo_name | c | authors | merged | closed |
---|---|---|---|---|
deadlyvipers/dojo_rules | 1636 | 1022 | 0 | 1636 |
octocat/Spoon-Knife | 1103 | 944 | 0 | 1103 |
jlord/patchwork | 6595 | 705 | 4905 | 1690 |
LarryMad/recipes | 588 | 532 | 0 | 588 |
apache/spark | 6653 | 468 | 0 | 6653 |
Homebrew/homebrew | 9548 | 451 | 5 | 9543 |
udacity/create-your-own-adventure | 2765 | 301 | 1946 | 819 |
rdpeng/ProgrammingAssignment2 | 341 | 284 | 0 | 341 |
docker/docker | 5250 | 254 | 3979 | 1271 |
NixOS/nixpkgs | 4707 | 249 | 3438 | 1269 |
odoo/odoo | 3412 | 233 | 712 | 2700 |
borisyankov/DefinitelyTyped | 2529 | 221 | 2173 | 356 |
mozilla-b2g/gaia | 7197 | 215 | 5251 | 1946 |
rails/rails | 3254 | 212 | 2090 | 1164 |
caskroom/homebrew-cask | 6928 | 210 | 3044 | 3884 |
Oh that is super weird. After looking into a few of the repos with 0 merged, it seems they aren’t really using GitHub for merges.
Calculating the merge ratio
So let’s exclude those and try again, this time we can even calculate the merge ratio.
SELECT
repo.name,
COUNT(*) c,
COUNT(DISTINCT actor.id) authors,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed,
ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"closed"')
GROUP BY
repo.name
HAVING
merged > 10
ORDER BY
authors DESC
LIMIT
15
repo_name | c | authors | merged | closed | merge_ratio |
---|---|---|---|---|---|
jlord/patchwork | 6595 | 705 | 4905 | 1690 | 74.37 |
udacity/create-your-own-adventure | 2765 | 301 | 1946 | 819 | 70.38 |
docker/docker | 5250 | 254 | 3979 | 1271 | 75.79 |
NixOS/nixpkgs | 4707 | 249 | 3438 | 1269 | 73.04 |
odoo/odoo | 3412 | 233 | 712 | 2700 | 20.87 |
borisyankov/DefinitelyTyped | 2529 | 221 | 2173 | 356 | 85.92 |
mozilla-b2g/gaia | 7197 | 215 | 5251 | 1946 | 72.96 |
rails/rails | 3254 | 212 | 2090 | 1164 | 64.23 |
caskroom/homebrew-cask | 6928 | 210 | 3044 | 3884 | 43.94 |
cms-sw/cmssw | 5475 | 205 | 4312 | 1163 | 78.76 |
symfony/symfony | 2587 | 185 | 1387 | 1200 | 53.61 |
facebook/react-native | 1563 | 185 | 494 | 1069 | 31.61 |
robbyrussell/oh-my-zsh | 731 | 185 | 307 | 424 | 42.0 |
githubteacher/github-for-developers-sept-2015 | 404 | 181 | 301 | 103 | 74.5 |
nightscout/cgm-remote-monitor | 1096 | 178 | 419 | 677 | 38.23 |
Using the diff data
Sweet now let’s see on average what the size of the diffs are for these projects’ pull requests.
SELECT
repo.name,
COUNT(*) c,
COUNT(DISTINCT actor.id) authors,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END) AS merged,
SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN 1 ELSE 0 END) AS closed,
ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio,
AVG(JSON_EXTRACT(payload, '$.pull_request.additions')) AS avg_additions,
AVG(JSON_EXTRACT(payload, '$.pull_request.deletions')) AS avg_deletions,
AVG(JSON_EXTRACT(payload, '$.pull_request.changed_files')) AS avg_changed_files,
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"closed"')
GROUP BY
repo.name
HAVING
merged > 10
ORDER BY
authors DESC
LIMIT
15
repo_name | c | authors | merged | closed | merge_ratio | avg_additions | avg_deletions | avg_changed_files |
---|---|---|---|---|---|---|---|---|
jlord/patchwork | 6595 | 705 | 4905 | 1690 | 74.37 | 47.45595147839272 | 172.14268385140258 | 175.20257771038666 |
udacity/create-your-own-adventure | 2765 | 301 | 1946 | 819 | 70.38 | 30.39746835443038 | 13.116455696202532 | 6.742133815551537 |
docker/docker | 5250 | 254 | 3979 | 1271 | 75.79 | 214.36685714285716 | 115.88342857142857 | 8.139619047619048 |
NixOS/nixpkgs | 4707 | 249 | 3438 | 1269 | 73.04 | 339.9751434034417 | 40.72678988740174 | 5.380072232844699 |
odoo/odoo | 3412 | 233 | 712 | 2700 | 20.87 | 1626.0741500586166 | 1907.4182297772568 | 128.01992966002345 |
borisyankov/DefinitelyTyped | 2529 | 221 | 2173 | 356 | 85.92 | 887.0581257413997 | 864.4827995255041 | 2.8730723606168445 |
mozilla-b2g/gaia | 7197 | 215 | 5251 | 1946 | 72.96 | 415.85396693066554 | 138.59233013755733 | 10.55578713352786 |
rails/rails | 3254 | 212 | 2090 | 1164 | 64.23 | 54.88414259373079 | 29.18561770129072 | 6.880762138905962 |
caskroom/homebrew-cask | 6928 | 210 | 3044 | 3884 | 43.94 | 8.448469976905312 | 4.0329099307159355 | 3.315675519630485 |
cms-sw/cmssw | 5475 | 205 | 4312 | 1163 | 78.76 | 2160.7702283105023 | 713.1713242009132 | 37.51086757990868 |
facebook/react-native | 1563 | 185 | 494 | 1069 | 31.61 | 189.86756238003838 | 86.54638515674984 | 10.595649392194497 |
robbyrussell/oh-my-zsh | 731 | 185 | 307 | 424 | 42.0 | 54.0328317373461 | 11.285909712722297 | 1.987688098495212 |
symfony/symfony | 2587 | 185 | 1387 | 1200 | 53.61 | 142.36722071897952 | 168.96366447622728 | 28.32006184770004 |
githubteacher/github-for-developers-sept-2015 | 404 | 181 | 301 | 103 | 74.5 | 18.217821782178216 | 0.7574257425742574 | 2.517326732673267 |
nightscout/cgm-remote-monitor | 1096 | 178 | 419 | 677 | 38.23 | 519.7043795620438 | 246.9434306569343 | 8.777372262773723 |
Well that’s not all that interesting…
Can we prove you should always keep your pull requests small?
We know that it is always better to make a small pull request to have it merged. Let’s see if we can prove that with data!
SELECT
repo.name,
COUNT(*) c,
COUNT(DISTINCT actor.id) authors,
ROUND(100*SUM(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN 1 ELSE 0 END)/COUNT(*),2) AS merge_ratio,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.additions') END) AS merged_avg_additions,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.deletions') END) AS merged_avg_deletions,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('true') THEN JSON_EXTRACT(payload, '$.pull_request.changed_files') END) AS merged_avg_changed_files,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.additions') END) AS closed_avg_additions,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.deletions') END) AS closed_avg_deletions,
AVG(CASE WHEN JSON_EXTRACT(payload, '$.pull_request.merged') IN ('false') THEN JSON_EXTRACT(payload, '$.pull_request.changed_files') END) AS closed_avg_changed_files,
FROM
[githubarchive:year.2015]
WHERE
type IN ( 'PullRequestEvent')
AND JSON_EXTRACT(payload, '$.action') IN ('"closed"')
GROUP BY
repo.name
HAVING
merge_ratio > 5
ORDER BY
authors DESC
LIMIT
15
repo_name | c | authors | merge_ratio | merged_avg_additions | merged_avg_deletions | merged_avg_changed_files | closed_avg_additions | closed_avg_deletions | closed_avg_changed_files |
---|---|---|---|---|---|---|---|---|---|
jlord/patchwork | 6595 | 705 | 74.37 | 9.3565749235474 | 0.033231396534148826 | 1.0014271151885832 | 158.03431952662723 | 671.6674556213018 | 680.798224852071 |
udacity/create-your-own-adventure | 2765 | 301 | 70.38 | 7.863309352517986 | 0.6747173689619733 | 1.8144912641315518 | 83.94017094017094 | 42.67887667887668 | 18.45054945054945 |
docker/docker | 5250 | 254 | 75.79 | 176.0874591605931 | 90.80949987434029 | 5.965317919075145 | 334.2045633359559 | 194.38001573564122 | 14.946498819826909 |
NixOS/nixpkgs | 4707 | 249 | 73.04 | 137.50581733566025 | 31.841768470040723 | 2.91564863292612 | 888.5090622537431 | 64.79826635145784 | 12.056737588652481 |
odoo/odoo | 3412 | 233 | 20.87 | 200.9129213483146 | 195.0870786516854 | 7.095505617977528 | 2001.8944444444444 | 2358.966296296296 | 159.90814814814814 |
borisyankov/DefinitelyTyped | 2529 | 221 | 85.92 | 390.8085595950299 | 482.45467096180397 | 2.1339162448228257 | 3916.13202247191 | 3196.3567415730336 | 7.384831460674158 |
mozilla-b2g/gaia | 7197 | 215 | 72.96 | 398.6246429251571 | 86.15311369262997 | 6.51628261283565 | 462.3448098663926 | 280.09198355601234 | 21.45580678314491 |
rails/rails | 3254 | 212 | 64.23 | 23.657416267942583 | 11.615789473684211 | 2.6382775119617223 | 110.95274914089347 | 60.732817869415804 | 14.49828178694158 |
caskroom/homebrew-cask | 6928 | 210 | 43.94 | 8.201708278580815 | 5.042706964520368 | 3.9244415243101183 | 8.641864057672503 | 3.241503604531411 | 2.8385684860968072 |
cms-sw/cmssw | 5475 | 205 | 78.76 | 994.2810760667903 | 619.9148886827459 | 8.133812615955472 | 6485.7067927773005 | 1058.9337919174548 | 146.43078245915734 |
symfony/symfony | 2587 | 185 | 53.61 | 63.3914924297044 | 72.16582552271089 | 8.235760634462869 | 233.65 | 280.84583333333336 | 51.534166666666664 |
facebook/react-native | 1563 | 185 | 31.61 | 204.29757085020242 | 88.43522267206478 | 8.024291497975709 | 183.19925163704397 | 85.67352666043031 | 11.783910196445277 |
robbyrussell/oh-my-zsh | 731 | 185 | 42.0 | 49.74267100977199 | 10.824104234527688 | 1.6612377850162867 | 57.139150943396224 | 11.620283018867925 | 2.224056603773585 |
githubteacher/github-for-developers-sept-2015 | 404 | 181 | 74.5 | 4.700996677740863 | 0.4186046511627907 | 1.1727574750830565 | 57.71844660194175 | 1.7475728155339805 | 6.446601941747573 |
nightscout/cgm-remote-monitor | 1096 | 178 | 38.23 | 173.24582338902147 | 58.885441527446304 | 4.985680190930788 | 734.1299852289512 | 363.3338257016248 | 11.124076809453472 |