Wednesday, April 29, 2009

Screenscraping HTML With TagSoup and XPath

So, long story short: we're trying to use an app that doesn't work as advertised, so I had to build a quick n' dirty tool to query one of that app's pages for data and act on it.

HOWEVER: that page is, of course, HTML, and everyone who's worked in and around web development knows how well-formed that often is (HA!), even if the data I want sits in an HTML table.

I'd like to turn the page into a DOM (somewhat reliably - though it doesn't have to be perfect for my uses) and search it with XPath, etc.

TagSoup kept coming up in my searches, and I quickly found a way to use it to turn the page into a DOM and pull out the bits I care about quite effortlessly with XPath.

Kudos to the author of TagSoup, and thanks for the TagSoup -> DOM writeup. Check the writeup link for more info, but it really boils down to this:

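import java.net.URL;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// TagSoup's Parser is a SAX XMLReader that tolerates malformed HTML
URL url = new URL(whatever);
XMLReader reader = new Parser();
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);

// an identity transform copies the SAX events into a DOM tree
Transformer transformer = TransformerFactory.newInstance().newTransformer();

DOMResult result = new DOMResult();
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())),
        result);

// here we go - a DOM built from arbitrary HTML
return result.getNode();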
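
For the XPath half, here is a rough sketch of how the table cells might be pulled out of that DOM using the standard javax.xml.xpath API. The class name, the helper names (selectText, parseWellFormed), the sample markup, and the XPath expression are all made up for illustration and are not from the real tool. To keep the sketch runnable without TagSoup on the classpath, it parses a small well-formed snippet with the JDK's DocumentBuilder as a stand-in; in practice the Node returned by the TagSoup code would be passed to selectText instead, assuming the elements come out lowercase and without namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTableDemo {

    // Returns the trimmed text of every node matched by the XPath expression.
    public static List<String> selectText(Node root, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList matches = (NodeList) xpath.evaluate(expression, root, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            texts.add(matches.item(i).getTextContent().trim());
        }
        return texts;
    }

    // Hypothetical stand-in for the TagSoup step: parses markup that is
    // already well-formed, using only the JDK's (non-namespace-aware) parser.
    public static Node parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; a real page would come from the TagSoup code.
        String markup = "<html><body><table>"
                + "<tr><td>job-1</td><td>FAILED</td></tr>"
                + "<tr><td>job-2</td><td>OK</td></tr>"
                + "</table></body></html>";
        Node root = parseWellFormed(markup);

        // Select the second cell of every table row and print its text.
        for (String status : selectText(root, "//table//tr/td[2]")) {
            System.out.println(status);
        }
    }
}
```

Since the page only has to be parsed "well enough", a loose expression like //table//tr/td[2] is probably more forgiving than a full path from the root, which would break as soon as TagSoup repairs the markup differently than expected.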
