Saturday, March 22, 2008

Obfuscated Proxy Lists

A couple of weeks ago I blogged about a proxy list site that had tried to make it difficult for bots to harvest proxy IP:port information from its pages. Defeating the obfuscation turned out to be easier than scraping the HTML would have been. For me, at least.

I also threatened to start my own proxy list. I'm still threatening. I'm basically HTML-impaired when it comes to this Web crap. Making a Web page with a WYSIWYG editor (I've been using Kompozer under Windows - it sucks ass on Linux) is more maddening than learning Word 2.0 was back in the early '90s. After a frustrating afternoon trying to figure out tables and wrapping (I said I was HTML-impaired), I gave up and went back to harvesting more data for the proxy database, so the list is on the back burner for now. But it is coming.

I found some interesting stuff, and after a few days of hacking I now have 5,000+ entries in the database, with a success rate of a little more than 10% - which is, truthfully, better than I expected.

These proxy list maintainers have gone to great lengths to keep their data "proprietary", but none of their methods are very effective (as illustrated below).



One list displays IP addresses as GIF images to prevent page-scraping. This is not a big deal: GOCR translates them back to ASCII nicely. You have to watch for zeroes that come back as letter O's, ones that show up as letter l's, 8's that turn into B's, and so on, but it's a finite set of substitutions, easily handled with a short sed script (sketched below).
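For example - a minimal sketch, and the exact confusion set depends on the font in the GIFs, so treat these substitutions as a starting point:

    # fixups.sed - undo common GOCR misreads in IP:port text.
    # Safe here because valid output contains only digits, dots, and colons.
    s/[Oo]/0/g
    s/[Il|]/1/g
    s/B/8/g
    s/[[:space:]]//g

Then harvesting is just something like giftopnm ip.gif | gocr - | sed -f fixups.sed (GOCR wants a PNM on its input).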

Another site silently changes the URL you submit, prepending a fixed string to the URL displayed in the browser. This place had thousands of entries, most of them dead (very common with these sites), but some good ones. A couple of other sites used the same trick.

(Of course, I have to track the dead sites to prevent them from being checked again, so they go into the database as well.)

Another common trick is escaping the content and using JavaScript to unescape it at render time. Some sites stop there, but others run a trivial XOR pass over the data to further obfuscate matters. Rhino is a great tool for this: to send the unobfuscated content to standard output (and from there to wherever you want it), you simply replace document.write() with print(). Simplicity is Schweet.
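The whole pipeline fits in a few lines of shell. A sketch, assuming the page has been saved locally and the payload lives in a single script block (the file names are made up):

    # Pull out the script block, strip the tags, swap document.write
    # for Rhino's print() builtin, then let Rhino run the unescaper.
    sed -n '/<script/,/<\/script>/p' page.html \
        | sed -e 's/<[^>]*>//g' -e 's/document\.write/print/g' \
        > /tmp/deob.js
    rhino /tmp/deob.js    # emits the plain HTML, IP:port pairs and all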

Then there's a site in Belarus that plays cookie games. One cookie is delivered where you'd expect it, in a Set-Cookie: header, but another is stashed away in a META tag inside the gzip'd HTML content. You need both before you can start page-scraping.
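Roughly, the dance looks like this - a sketch with curl, where the URL and the META tag's layout are hypothetical (each site hides its second cookie its own way):

    # First request: capture the Set-Cookie header, decompress the body,
    # and fish the second cookie out of its META tag.
    URL='http://proxy.example.by/list.html'
    curl -s -D headers.txt --compressed -o page.html "$URL"
    C1=$(sed -n 's/^Set-Cookie: \([^;]*\).*/\1/p' headers.txt | head -1)
    C2=$(sed -n 's/.*<meta name="cookie" content="\([^"]*\)".*/\1/p' page.html)

    # Second request: present both cookies and scrape for real.
    curl -s --compressed -H "Cookie: $C1; $C2" "$URL"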

That took a few hours to figure out. The result was the same: hundreds of dead addresses, but a few good ones. They all went into the database and I scheduled a once-a-day page scrape.

But my favorite so far has been a PHP-based site that tries to limit your query against their database to 100 rows. Apparently these folks never heard of Input Validation: if you hack the HTTP request packet, you can get every row out of their database. (You have to conclude that if they're that sloppy, there's probably some good SQL injection mischief to be had with the site, but that's not my goal here - it's more valuable to me alive than dead.)
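The "hack" is embarrassingly simple - resubmit their own form with a bigger limit. A sketch with a made-up host and parameter names (the real ones come straight out of the site's form):

    # The browser form always sends limit=100; the server never re-checks it,
    # so ask for everything.
    curl -s 'http://proxylist.example.com/query.php' \
         --data 'sort=ip&limit=1000000'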

That was pretty cool, and I admit I got a chuckle out of it, but the bad news is that after about two thousand rows the data is over seven years old (they've been around for a long time regardless of - in spite of? - their insecure programming techniques) and not very useful.

Still, into the database it goes.

There are a few refinements I need to make to the whole process before I put the list up. "Good" proxies have to be re-checked periodically. Timed-out proxies need to be re-checked until the port is known to be closed (a rough sketch of that check follows). And the database needs to get a little fatter.
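The re-check itself is nothing fancy. A sketch, assuming one ip:port per line in a file (the file name and test URL are arbitrary):

    # Anything that can fetch a known page within 10 seconds is "good";
    # everything else goes back in the queue until the port is known closed.
    while read PROXY; do
        if curl -s -m 10 -x "$PROXY" -o /dev/null 'http://www.example.com/'; then
            echo "$PROXY good"
        else
            echo "$PROXY recheck"
        fi
    done < proxies.txt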

And I'm going to have to take some remedial HTML classes.





