Anyway, I was going to write a Python script to iteratively extract domain names from ~60 URLs, and then I realized that John Kurkowski had built a wonderful little Python module to do exactly this, based on this Stack Overflow question: http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url/569219#569219
So, in order to run John's Python script, I had to set up pip. Once I had pip working, it was easy to go off and grab the tldextract package:
$ pip install tldextract
But the really amazing thing is that John also deployed this as a JSON service on App Engine. So you don't really need pip at all: just a shell script with a bunch of calls like this:
curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/a"
curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/b"
...
The results:
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
{"domain": "bbc", "subdomain": "www", "tld": "co.uk"}
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/b"
{"domain": "google", "subdomain": "docs", "tld": "com"}
doolittle-5:Development Jpeerindex$ curl "http://tldextract.appspot.com/api/extract?url=http://docs.google.com/a"
{"domain": "google", "subdomain": "docs", "tld": "com"}
Damn, that was easy. THANKS JOHN KURKOWSKI!
By the way, the source is here: https://github.com/john-kurkowski/tldextract, and it looks really nice.