Anyway, I was going to write a Python script to iteratively extract domain names from ~60 URLs, and then I realized that John Kurkowski had built a wonderful little Python module to do exactly this, based on this Stack Overflow question: http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219
So, in order to run John's Python script, I had to set up pip. Once I had pip working, it was easy to go off and grab the tldextract package:
$ pip install tldextract
But the really amazing thing is that John also deployed this as a JSON service on App Engine. So you don't really need pip at all: just a shell script with a bunch of calls like this:
curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/a"
curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/b"
...
The results:
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
{"domain": "bbc", "subdomain": "www", "tld": "co.uk"}
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/b"
{"domain": "google", "subdomain": "docs", "tld": "com"}
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/a"
{"domain": "google", "subdomain": "docs", "tld": "com"}
Damn, that was easy. THANKS JOHN KURKOWSKI!
By the way, the source is here: https://github.com/john-kurkowski/tldextract, and it looks really nice.