Wednesday, October 5, 2011

web scraping and microformats

This proved very helpful to me. It's a concise explanation of how to extract data from a website's structured markup in Python. It's a no-brainer, but sometimes you need someone to show you the no-brainer before it becomes one for you.

The bottom line is that HTML is a string. It's also a tree. You can parse it either way. Microformats and the semantic web aren't only useful to big spidering companies like the goog; they're useful to all of us.
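To make the string-versus-tree point concrete, here is a rough sketch of both approaches using only the Python standard library. The hCard snippet, the name Jane Doe, and the function names are all made up for illustration. The tree version also assumes the markup is well-formed XHTML; real pages are often messier and would probably need a more forgiving parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hCard snippet, invented for this example.
SNIPPET = """<div class="vcard">
  <span class="fn">Jane Doe</span>
  <a class="url" href="http://example.com/">home page</a>
  <span class="tel">555-0100</span>
</div>"""


def name_from_string(markup):
    """Treat the markup as a string: grab the text right after class="fn"."""
    match = re.search(r'class="fn"[^>]*>([^<]*)<', markup)
    return match.group(1).strip() if match else None


def card_from_tree(markup):
    """Treat the markup as a tree: visit every element and read its classes.

    Assumes well-formed XHTML, since ElementTree is an XML parser.
    """
    root = ET.fromstring(markup)
    card = {}
    for element in root.iter():
        classes = (element.get("class") or "").split()
        if "fn" in classes:
            card["name"] = (element.text or "").strip()
        if "tel" in classes:
            card["tel"] = (element.text or "").strip()
        if "url" in classes:
            card["url"] = element.get("href")
    return card


if __name__ == "__main__":
    print(name_from_string(SNIPPET))
    print(card_from_tree(SNIPPET))
```

The regex version is quick but brittle: it depends on the exact attribute quoting and breaks as soon as the name is wrapped in nested tags. The tree version keys off the microformat class names instead, which is the part the markup's author actually promised to keep stable.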

1 comment:

  1. Hello Dude,

    A microformat is a web-based approach to semantic markup which seeks to re-use existing HTML or XHTML tags to convey metadata and other attributes in web pages and other contexts that support HTML, such as RSS. Thanks a lot....

    Web Harvester