Wednesday, April 29, 2009

Screenscraping HTML With TagSoup and XPath

So, long story short: we have something we are trying to use that doesn't work as advertised, and so I had to build a quick n' dirty tool I could use to query one of that app's pages for things and act on that.

HOWEVER: that page is in, of course, HTML, and everyone that's worked in and around web development knows how well-formed that often is (HA!), even if the data I want is in an HTML table.

I'd like to turn the page into a DOM (somewhat reliably - though it doesn't have to be perfect for my uses) and search it with XPath, etc.

TagSoup kept coming up in my searches, and I quickly found a way to use it to turn the page into a DOM and pull out the bits I care about quite effortlessly with XPath.

Kudos to the author of TagSoup, and thanks for the TagSoup -> DOM writeup. Check the writeup for more info, but it really boils down to this:

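import java.net.URL;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

URL url = new URL(whatever);
XMLReader reader = new Parser();
// disable namespace processing (keeps element names unqualified)
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);

// a no-stylesheet Transformer just copies its input to its output
Transformer transformer = TransformerFactory.newInstance().newTransformer();

DOMResult result = new DOMResult();
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);

// here we go - a DOM built from arbitrary HTML
return result.getNode();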
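
For the XPath half, here is a minimal sketch of pulling the cells out of a table once you have a DOM node in hand. The class name TableScraper, the table id "status" and the sample markup are all made up for illustration, and parseWellFormed is only a stand-in that uses the JDK's own parser on already well-formed markup so the example runs without TagSoup on the classpath. In the real tool the node would come from the TagSoup code above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TableScraper {

    // Returns the trimmed text of every <td> inside the table with the given id,
    // in document order. tableId is spliced into the expression, so it must not
    // contain a quote character.
    static List<String> cellTexts(Node dom, String tableId) {
        String expr = "//table[@id='" + tableId + "']//td";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList cells = (NodeList) xpath.evaluate(expr, dom, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                texts.add(cells.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    // Hypothetical stand-in for the TagSoup step: only handles well-formed markup.
    static Node parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; a real one would be fetched and run through TagSoup.
        String page = "<html><body><table id='status'>"
                + "<tr><td>queue</td><td> 12 </td></tr>"
                + "<tr><td>errors</td><td>0</td></tr>"
                + "</table></body></html>";
        System.out.println(cellTexts(parseWellFormed(page), "status"));
    }
}
```

Run as-is, this prints [queue, 12, errors, 0]. The plain element names in the expression (table, td) rely on the DOM not being namespace-qualified, which is what the two setFeature calls are there for.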

Sunday, April 26, 2009

Transcendent Man

Prepare to evolve!

Looks like Kurzweil's movie, Transcendent Man, started showing this weekend - unfortunately, the "worldwide" premiere seems to be all in NYC, though. Sigh.

Oh well, at least on the site for the movie Transcendent Man, you can sign up for upcoming dates in your neck of the woods, if you don't happen to live in the Big Apple.

Saturday, April 25, 2009

Denver Has a Maker Group!

Denver now has a Makers group! Its first meeting was this Thursday at Club Workshop. The next one is tentatively May 21st. They are going to try for the third Thursday of the month.

Turnout was HUGE, especially given this was the first meeting, and a little out of the way, even if right off of I-25 - i.e., not right in DTC, and not right in downtown Denver. I got there right about 7, and the parking lot for Club Workshop already appeared to be full, and people were parking on the street or in other parking lots.

The talk was by John Maushammer, on his Pong Watch. He gave an overview of how he decided on components - things like size, battery life, being rechargeable, the CPU, cost, etc. He detailed how he went about designing the watch case: he first prototyped in wood, designing it in CAD and then having a sort of low-tech 3D printer (forget the brand - but it was something he got from eBay and said was much like a Dremel, but hooked up to a computer) carve it out of the wood, then aluminum (and later, plastic, I think, for the face of the watch).

He talked about how he created the board himself from a kit, and how he soldered it. He talked about how the instructions he needed barely fit in the thousand instructions the CPU permits. The code was in C, so you don't know ahead of time how many instructions it will compile to. I forget the CPU type. In fact, I didn't have any way to take notes except by Blackberry, so all this is from memory.

It was all very interesting. He of course brought in the pieces so people could come up and see them. The down side was that there was no PA system, so it was a bit hard to hear him at times, especially when people were coughing or talking. I had to leave early too - didn't stick around to mingle or take a look at all the things brought, since I just happened to be exhausted that day as it was.

A bunch of Denver Mad Scientists showed up, too. Since people were invited to also bring in their projects and/or talk about what they were working on, one of the Denver Mad Scientists talked about what they do. They are known for many things, but the most famous is having the first robot battles. They were also the first to have pumpkin guns on the Front Range.

One guy jumped up and talked about his experiences with using a laser product to engrave wood. Another jumped up to talk about his net gun.

Before the talks, the folks from Club Workshop walked us through a short intro to what they do, and that sounds incredibly interesting, too. They offer all kinds of classes in all sorts of things. They even sound open to starting up classes based on interest. Someone during the meeting asked if anyone knew anything about patent lawyers, and the guy from Club Workshop (forget his name, but I think he owns it) asked if there was interest in a class on filing patents...

And the classes sound very - get this - affordable, so if/when I'm ready to tackle some of these things, I know right where I'm going. I'd really like to learn to weld, and they offer that. They also offer a yearly membership, in which a few classes seem to be included.

Tuesday, April 21, 2009

PDFTK - The PDF Toolkit

I recently found that I wanted to split a very large PDF document into two smaller documents, and copy the table of contents, or at least the parts relevant to the second half, into the second document, too. That's so I wouldn't have to go looking back and forth between the two documents. You can imagine similar scenarios for an index, too - you may want to copy this to the first document.

So, how does one do that? Well, I started searching around for open source tools, and at first my keywords didn't seem to be turning up anything fruitful. Add in "linux" to the search, and voila, I quickly came upon pdftk.

Splitting a file into two is a two-step process. You first write the first part, by giving it a page range. Let's say your doc is 500 pages and you want to split it into two 250-page documents.

pdftk orig.pdf cat 1-250 output part1.pdf

Then you do the second part this way:

pdftk orig.pdf cat 251-500 output part2.pdf

Now you have two documents. In my case, I wanted to add the contents to the second part, as well.

I couldn't find a way to do that in one step - say by giving two page ranges - but I did just accomplish it by writing a temp file. Say the relevant parts I wanted to add to the part2.pdf were pages 10-20 of the orig doc. I would save those off this way:

pdftk orig.pdf cat 10-20 output contents.pdf

Then, I merged the contents.pdf and part2.pdf this way:

pdftk contents.pdf part2.pdf cat output final-part2.pdf

And I was done. Not bad, not bad at all.
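
For what it's worth, pdftk's cat operation does accept several page ranges in one go, so the temp file can likely be skipped. Below is a small sketch of scripting that from Java (to match the TagSoup snippet in the post above). The class PdftkCat and its methods are names I made up, the file names and ranges are just the ones from this example, and run assumes pdftk is installed and on the PATH.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PdftkCat {

    // Builds the argument list: pdftk <input> cat <range>... output <output>
    static List<String> catCommand(String input, String output, String... pageRanges) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftk");
        cmd.add(input);
        cmd.add("cat");
        cmd.addAll(Arrays.asList(pageRanges));
        cmd.add("output");
        cmd.add(output);
        return cmd;
    }

    // Runs the command and returns its exit code (assumes pdftk is on the PATH).
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Contents pages 10-20 followed by the second half, written in one step.
        List<String> cmd = catCommand("orig.pdf", "final-part2.pdf", "10-20", "251-500");
        System.out.println(String.join(" ", cmd));
    }
}
```

The main method only prints the command line (pdftk orig.pdf cat 10-20 251-500 output final-part2.pdf), which can just as well be typed straight into a shell. Call run(cmd) to actually execute it.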

Sunday, April 19, 2009

Hacking Your Own Printer

This week, the laser printer we have at home decided that it would no longer print - halfway through a 74 page document. The last 20 pages didn't come out. Argh. Send the last 20 pages again. Still nothing. Try a test page. Still nothing.

Double check the lights on the panel and look up what they mean.

So, the toner light on the laser printer we have at home has been lit for....well, a long time. Now, the status light also went red, which means "I'm not printing; get a new cartridge".

Which is weird, because the document looked just fine - even the last page. I shake up the toner cartridge, stick it back in. Still no go.

Go online, price the toner replacement at the local big box retailers (this would be the first time - I've been using the cartridge that came with the printer). I see that I can get THREE of the high-capacity replacements (7500 pages vs. 3500 pages) for less than one of the regular-capacity ones at a local big box.

So, I order some, but...would still like to print. Does the printer supply an override of some kind to let the USER and not the PRINTER decide when it's time to change the toner?

So I google some, and find that a strategically-placed piece of electrical tape lets some people print 500-1000 more pages just fine with the same model. I take a look at the cartridge, and I can see what they are talking about, and after a bit of rummaging, I find some electrical tape at the house. A few seconds later the cartridge is back in, and the status light is no longer red!

Clear out the printer queue, and voila, I have my complete document printed out.

It's not that I'm poor, but I just abhor waste and inefficiency. 500-1000 more pages is substantial. This toner came with the printer and lasted a long time - but getting it to last even longer is just fine with me.

Friday, April 17, 2009

H+ Magazine

It looks like RU Sirius' podcasts went dark for a reason. Apparently, he's been working on big things: h+ magazine.

Okay, what is h+ magazine, you ask? Well, it deals with transhumanism. And if you don't know what transhumanism/extropianism is, well, it's hard to explain in a soundbite, at least for me. Probably best just to send you off to the Wikipedia entry on transhumanism.

Maybe one way to sum it up, though, is this: fundies are still battling issues culturally that were settled scientifically 150 years ago (like evolution), and they are so busy fighting something that's already lost that they just have no idea what is in store for them.

So far, it looks like H+ magazine is free in PDF form, and they have plans to generate a dead-tree version soon.
