This proved very helpful to me. It's a short, concise explanation of how to extract data from the structured markup of a website in Python. It's a no-brainer, but sometimes you need someone to show you the no-brainer before it becomes one for you.
The bottom line is that HTML is a string. It's also a tree. You can parse it either way: match patterns in the raw text, or walk the nested tags. Microformats and the semantic web aren't only useful to big spidering companies like the goog; they're useful to all of us.
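
To make the string-versus-tree point concrete, here is a minimal sketch that uses only the Python standard library. The sample markup, the `fn` class name (borrowed loosely from the hCard microformat), and the helper names `names_by_regex`, `names_by_tree`, and `ClassTextParser` are all assumptions made up for illustration; they don't come from the explanation mentioned above.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modeled on an hCard microformat.
SAMPLE = (
    '<div class="vcard">'
    '<span class="fn">Ada Lovelace</span> '
    '<a class="url" href="http://example.com/ada">home</a>'
    '</div>'
)


def names_by_regex(html):
    """String view: pattern-match the raw text. Quick, but brittle."""
    return re.findall(r'<span class="fn">(.*?)</span>', html)


class ClassTextParser(HTMLParser):
    """Tree view: follow tag nesting and collect the text inside
    every element that carries the wanted class.

    Limitation of this sketch: an unclosed void tag such as <br>
    inside a matching element would throw off the depth counter.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1              # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def names_by_tree(html):
    parser = ClassTextParser("fn")
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    print(names_by_regex(SAMPLE))
    print(names_by_tree(SAMPLE))
```

Both functions find the name in the sample, but the regex only works while the markup looks exactly as expected. Add a second class or a nested tag, as in `<span class="fn n"><b>Ada</b> L</span>`, and the pattern misses it while the nesting-aware version still gets the text. For real pages, a dedicated parsing library is probably the better choice than either of these.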