I thought I had it nailed, but today I noticed I had missed one minor utility, gifsicle.
From the manpage...
gifsicle is a powerful command-line program for creating, editing, manipulating, and getting information about GIF images and animations.

I used it mostly on this site, one of those "dicey .ru domains" I've warned you about in the past...
It's not all that obvious at first glance, but the address/ports above are GIFs. A number of proxy lists do this to prevent scraping, but it's ineffective, and mostly it just pisses off users who would rather simply cut and paste the information.
Since gifsicle was nowhere to be found on the hard drive, this site hadn't been scraped since March. All gifsicle did was scale up the image for further processing by gocr, which converted the image back into text.
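For the curious, here is a rough sketch of that scale-then-OCR step in Python. It is not my actual script. The scale factor, the file names, and the helper names (`gifsicle_cmd`, `gocr_cmd`, `parse_proxy`, `ocr_gif`) are all made up for illustration, and it assumes both `gifsicle` and `gocr` are on the PATH and that your gocr install can read GIFs directly (as far as I know that depends on the netpbm converters being present).

```python
import re
import subprocess

# Assumed scale factor; tune it until gocr stops misreading digits.
SCALE = 3


def gifsicle_cmd(src, dst, scale=SCALE):
    """Build the gifsicle command that writes a scaled-up copy of src to dst."""
    return ["gifsicle", "--scale", str(scale), src, "-o", dst]


def gocr_cmd(image):
    """Build the gocr command that prints the recognized text to stdout."""
    return ["gocr", "-i", image]


def parse_proxy(text):
    """Pull the first address:port pair out of OCR output, or return None."""
    m = re.search(r"(\d{1,3}(?:\.\d{1,3}){3})\s*:\s*(\d{1,5})", text)
    return f"{m.group(1)}:{m.group(2)}" if m else None


def ocr_gif(src, dst="scaled.gif"):
    """Scale the GIF with gifsicle, OCR it with gocr, return 'ip:port' or None."""
    subprocess.run(gifsicle_cmd(src, dst), check=True)
    result = subprocess.run(
        gocr_cmd(dst), check=True, capture_output=True, text=True
    )
    return parse_proxy(result.stdout)
```

Calling `ocr_gif("proxy.gif")` on one of those images would, OCR gods willing, hand back something like an `ip:port` string. The regex tolerates stray spaces around the colon because OCR output tends to be sloppy.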
Once fixed, I ran my kidscript to see what I was missing.
I wasn't missing Jack Shit. Same old crap. Nothing that wasn't already in the database. And fewer than 100 proxies total.
On top of the GIF trick, this guy requires a cookie, which is a pain but not hard to pull off. But for this crap? Dude, you're gonna get scraped, and that guy will put your proxies in a list that will get scraped, so what is the point? You're not guarding Fort Knox here.
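The cookie part, roughly, in Python. This is a sketch and not what my script does: it assumes the site hands out the cookie with an ordinary Set-Cookie header on the first visit (if it sets it from JavaScript, this won't cut it), and `landing_url`, `list_url`, and both function names are hypothetical.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_with_cookie(opener, landing_url, list_url):
    """Hit the landing page to pick up the cookie, then fetch the list page."""
    with opener.open(landing_url, timeout=30) as resp:
        resp.read()  # body is ignored; we only want the Set-Cookie header
    with opener.open(list_url, timeout=30) as resp:
        return resp.read()
```

Both requests go through the same opener, so whatever cookie the first response sets rides along on the second.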
Why do I bother?
Because I love doing this shit.
It's an obsession.