15.9.17

Building a corpus of large GitHub projects.

I needed a large, reasonably random corpus of source code today...

This can be done using the GitHub API... but note that unauthenticated requests will get you rate limited (or blocked) after a short period of time. A sketch of an authenticated, paginated variant follows the script.

Anyway... here's the script.

# pip install GitPython

import git
import subprocess
import json
import urllib.request

# Make sure you run this from an empty directory.  The DU calculation is hacky and
# just measures local dir size.

orgs = [
    "twitter",
    "airbnb",
    "apache",
    "microsoft",
    "adobe",
    "ibm",
    "square",
    "esri",
    "yelp",
    "shopify",
    "sap",
    "guardian",
    "gilt",
    "cfpb"
]


def du(path):
    """disk usage in human readable format (e.g. '2,1GB')"""
    return subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8')

def get_repos(org):
    """Return the clone URLs for the first page of an organization's public repos."""
    repos = []
    uri = "https://api.github.com/users/" + org + "/repos"
    print("Getting", uri)
    values = json.loads(urllib.request.urlopen(uri).read())
    for r in values:
        repos.append(r["git_url"])
    return repos

for organization in orgs:
    all_repos = get_repos(organization)
    for r in all_repos:
        print(r)
        git.Git().clone(r)   # clones into the current working directory
        print(du("./"))      # running total of how much has been downloaded so far
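
Since the unauthenticated API caps out quickly, here is a rough sketch of an authenticated, paginated version of get_repos. The GITHUB_TOKEN environment variable name and the per_page=100 page size are my own assumptions, not part of the original script.

import json
import os
import urllib.request

def get_repos_authenticated(org, token=None):
    """Sketch: page through an org's public repos, sending a personal access
    token to raise the API rate limit. Assumes a token in GITHUB_TOKEN."""
    token = token or os.environ.get("GITHUB_TOKEN")
    repos = []
    page = 1
    while True:
        uri = ("https://api.github.com/users/" + org +
               "/repos?per_page=100&page=" + str(page))
        req = urllib.request.Request(uri)
        if token:
            req.add_header("Authorization", "token " + token)
        values = json.loads(urllib.request.urlopen(req).read())
        if not values:
            break  # an empty page means we've seen everything
        for r in values:
            repos.append(r["git_url"])
        page += 1
    return repos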
