Wednesday, April 29, 2009
Screenscraping HTML With TagSoup and XPath
So, long story short: we're trying to use an app that doesn't work as advertised, so I had to build a quick n' dirty tool that queries one of that app's pages for certain things and acts on what it finds.
HOWEVER: that page is, of course, HTML, and everyone who's worked in and around web development knows how well-formed that often is (HA!), even if the data I want is in an HTML table.
I'd like to turn the page into a DOM (somewhat reliably - though it doesn't have to be perfect for my uses) and search it with XPath, etc.
TagSoup came up in my searches, and I quickly found a way to use it to turn the page into a DOM and pull out the bits I care about quite effortlessly with XPath.
Kudos to the author of TagSoup, and thanks for the TagSoup -> DOM writeup. Check the writeup link for more info and the imports, but it really boils down to this:
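// "whatever" is the address of the page to scrape
URL url = new URL(whatever);
// TagSoup's Parser is a SAX XMLReader that copes with malformed HTML
XMLReader reader = new Parser();
// disable namespace processing and namespace-prefix reporting
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);
// a Transformer created without a stylesheet is an identity transform:
// it copies the parser's SAX events straight into the DOMResult
Transformer transformer = TransformerFactory.newInstance().newTransformer();
DOMResult result = new DOMResult();
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
// here we go - a DOM built from arbitrary HTML
return result.getNode();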
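For the XPath half of the story, here is a rough sketch of how the returned Node could be queried with the JDK's javax.xml.xpath API. Everything named here is made up for illustration: the class XPathTableDemo, the helpers selectText and sampleDom, and the little status table are my assumptions, not part of the real page or of TagSoup. To keep it runnable without TagSoup on the classpath, sampleDom builds a stand-in DOM from a well-formed string using the JDK parser; in the real tool you would pass in the Node returned above instead. The queries use // descendant steps so they don't depend on exactly how the parser nests the table.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTableDemo {

    // Evaluate an XPath expression against a DOM node and return the
    // trimmed text content of every node that matched.
    public static List<String> selectText(Node root, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(expression, root, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                texts.add(hits.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    // Stand-in for the TagSoup-built DOM: a tiny, hypothetical status table
    // parsed with the JDK's own XML parser so the sketch runs without TagSoup.
    public static Node sampleDom() {
        String html = "<html><body><table>"
                + "<tr><td>job-a</td><td>OK</td></tr>"
                + "<tr><td>job-b</td><td>FAILED</td></tr>"
                + "</table></body></html>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample DOM", e);
        }
    }

    public static void main(String[] args) {
        Node dom = sampleDom();
        // First cell of every row whose second cell reads FAILED
        System.out.println(selectText(dom, "//table//tr[td[2]='FAILED']/td[1]")); // prints [job-b]
        // Every cell in the table, in document order
        System.out.println(selectText(dom, "//table//td")); // prints [job-a, OK, job-b, FAILED]
    }
}
```

The unprefixed element names in these queries only match elements that are not in a namespace, which I assume is why the snippet above switches the namespace features off. If you leave them on, expect to need a NamespaceContext and prefixed names in the XPath.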