Wednesday, October 5, 2011

web scraping and microformats

This proved very helpful to me. It's a concise explanation of how to extract data from the structured markup of a website in Python. It's a no-brainer, but sometimes you need someone to show you the no-brainer before it becomes one for you.

The bottom line is that HTML is a string. It's also a tree. You can parse it either way. Microformats and the semantic web aren't only useful to big spidering companies like the goog; they're useful to all of us.
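
To make the string-versus-tree point concrete, here is a minimal sketch of both approaches against a tiny hCard. It is my own illustration, not code from the explanation mentioned above. It assumes Python 3 and only the standard library, and the sample markup, the example.com URL, and the names name_from_string, HCardParser, and card_from_tree are all made up for the example. Note that the standard library parser doesn't build a tree object; it reports each element as it walks the nested markup, which is enough here.

```python
import re
from html.parser import HTMLParser

# Hypothetical hCard markup, invented for this example.
HTML = """
<div class="vcard">
  <span class="fn">Ada Lovelace</span>
  <a class="url" href="http://example.com/ada">home page</a>
  <span class="tel">555-0100</span>
</div>
"""


def name_from_string(html):
    """Treat the HTML as a plain string and grab the 'fn' text with a regex.

    Quick to write, but brittle: it breaks if the class attribute holds
    more than one value or the element contains nested tags.
    """
    match = re.search(r'class="fn"[^>]*>([^<]*)<', html)
    return match.group(1).strip() if match else None


class HCardParser(HTMLParser):
    """Treat the HTML as nested elements and collect a few hCard fields."""

    FIELDS = ("fn", "url", "tel")

    def __init__(self):
        super().__init__()
        self.card = {}
        self._field = None  # field whose text we are waiting to read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for field in self.FIELDS:
            if field in classes:
                if field == "url" and attrs.get("href"):
                    # For links, the value lives in href, not the link text.
                    self.card["url"] = attrs["href"]
                else:
                    self._field = field

    def handle_data(self, data):
        # Store the first non-blank text that follows a matching start tag.
        if self._field and data.strip():
            self.card[self._field] = data.strip()
            self._field = None


def card_from_tree(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.card


if __name__ == "__main__":
    print(name_from_string(HTML))
    print(card_from_tree(HTML))
```

The regex version is fine for a one-off. The parser version keys on the microformat class names rather than the exact text of the tags, so it should survive small changes to the page better.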