15.9.17

Building a corpus of large GitHub projects.

I needed a large, reasonably random corpus of source code today...

This can be done using the GitHub API... but note that unauthenticated requests will get you rate limited (or blocked) after a short period of time. A sketch of an authenticated, paginated variant follows the script.

Anyway... here's the script.

# pip install GitPython

import git
import subprocess
import json
import urllib.request

# Make sure you run this from an empty directory.  The DU calculation is hacky and
# just measures local dir size.

orgs = [
    "twitter",
    "airbnb",
    "apache",
    "microsoft",
    "adobe",
    "ibm",
    "square",
    "esri",
    "yelp",
    "shopify",
    "sap",
    "guardian",
    "gilt",
    "cfpb"
]


def du(path):
    """disk usage in human readable format (e.g. '2,1GB')"""
    return subprocess.check_output(['du','-sh', path]).split()[0].decode('utf-8')

def get_repos(org):
    """Return the clone URLs for the first page of an organization's public repos."""
    repos = []
    uri = "https://api.github.com/users/" + org + "/repos"
    print("Getting", uri)
    values = json.loads(urllib.request.urlopen(uri).read())
    for r in values:
        repos.append(r["git_url"])
    return repos

for organization in orgs:
    all_repos = get_repos(organization)
    for r in all_repos:
        print(r)
        git.Git().clone(r)   # clones into the current working directory
        print(du("./"))      # running total of how much has been downloaded so far
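
Since the unauthenticated API caps out quickly, here is a rough sketch of an authenticated, paginated version of get_repos. The GITHUB_TOKEN environment variable name and the per_page=100 page size are my own assumptions, not part of the original script.

import json
import os
import urllib.request

def get_repos_authenticated(org, token=None):
    """Sketch: page through an org's public repos, sending a personal access
    token to raise the API rate limit. Assumes a token in GITHUB_TOKEN."""
    token = token or os.environ.get("GITHUB_TOKEN")
    repos = []
    page = 1
    while True:
        uri = ("https://api.github.com/users/" + org +
               "/repos?per_page=100&page=" + str(page))
        req = urllib.request.Request(uri)
        if token:
            req.add_header("Authorization", "token " + token)
        values = json.loads(urllib.request.urlopen(req).read())
        if not values:
            break  # an empty page means we've seen everything
        for r in values:
            repos.append(r["git_url"])
        page += 1
    return repos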
