I've been hacking away at proxy lists from around the globe for the past two weeks and am rapidly nearing the 15,000 mark.
And out of all that, there are only 819 "good" proxies, which works out to about a 5.5% success rate.
That's not very good, but it gets much, much worse. A cursory recheck of those 819 "good" proxies found only about 150 still active. I'm starting to wonder if anyone (besides me) even bothers with proxies anymore. The public lists are nearly worthless, except for a few decent standbys that turn up a handful of new addresses every day (out of hundreds; the casual proxy hunter would give up after a few pages).
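For what it's worth, the checking is nothing fancy. A minimal sketch of the idea, assuming curl and a proxies.txt with one host:port per line (the file names and the test URL are my own):

    # Fetch a known page through each proxy with a short timeout,
    # and file each address as live or dead by the status code.
    while read -r proxy; do
        code=$(curl -s -x "http://$proxy" -m 10 -o /dev/null \
                    -w '%{http_code}' http://example.com/)
        if [ "$code" = "200" ]; then
            echo "$proxy" >> live.txt
        else
            echo "$proxy" >> dead.txt
        fi
    done < proxies.txt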
Luckily, the dead proxies are good for one thing and one thing only: if you know an address is dead, you don't waste time checking it over and over. All the lists have the same data. A dead proxy in one list is a dead proxy in a hundred lists.
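So the first pass over any new list is a skip step. Something like this, assuming dead.txt holds one known-dead host:port per line:

    # Drop every address that already appears in dead.txt (whole-line,
    # fixed-string matches), so known-dead proxies never get rechecked.
    grep -vxFf dead.txt new_list.txt > worth_checking.txt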
And the deadest proxies are PlanetLab proxies.
PlanetLab is "a global research network that supports the development of new network services." One of their research projects, CoDeeN, runs thousands of proxy servers across the globe. They've been online since 2003. And they've been abused ever since. In fact they have published some excellent research on why running a public proxy is bad (duh).
Since they have been hit so hard in the past, they have learned how not to be abused. Primarily, they don't take requests from outside the academic networks they operate in, which is why all the PlanetLab proxies that show up in the proxy lists appear to be dead. If you're some non-academic schmuck (like myself) you'll get nowhere. (That's not entirely true: if you can find an open CoDeeN node, you'll find it works for GET requests but not POST requests, which breaks many Web sites.)
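You can see the GET-versus-POST behavior for yourself with a couple of curl probes. A sketch, where the node address is a placeholder and the port is just a common proxy default:

    # GET through an open node usually comes back fine...
    PROXY=planetlab1.example.edu:3128
    curl -s -x "http://$PROXY" -m 15 -o /dev/null \
         -w 'GET:  %{http_code}\n' http://example.com/
    # ...while the same request as a POST gets refused or swallowed.
    curl -s -x "http://$PROXY" -m 15 -o /dev/null \
         -w 'POST: %{http_code}\n' -d 'q=test' http://example.com/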
However, the network is operational and they have some pretty graphs (like this one) you can watch in Near Real Time™.
Most proxy lists use PlanetLab/CoDeeN proxies as "filler". Almost all of the proxies from .edu domains in my database are PlanetLab proxies. I try to avoid them if I can but they'll all be in there sooner or later.
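When I do want to weed them out ahead of time, reverse DNS is a decent heuristic. A sketch (the .edu rule of thumb is my own, and it only works when the PTR records cooperate):

    # Drop any proxy whose reverse DNS lands in .edu -- around here,
    # that's almost always a PlanetLab node.
    while read -r proxy; do
        host "${proxy%%:*}" | grep -qi '\.edu\.$' || echo "$proxy"
    done < candidates.txt > non_edu.txt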
I am nowhere near exhausting the supply of lists. Today I did a Google search for all occurrences of "proxy list" in the .ru domain (Russia). About 38,000 hits in all. I browse the cached results first (after all, they're Rooskies) and if I find anything promising, I make sure my anti-virus is running and dive right in.
Sometimes, if the site is dead, I cut & paste from the cached page and do an ad hoc run. Most are dupes, but there can be surprises.
And I've been getting better at sed/grep/cut/tr-ing the data out of these pages. I've been using html2text and links2 for the most part because they work direct-to-text, but sometimes you have to play games with JavaScript and cookies.
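A typical extraction run looks something like this; the URL and the regex are illustrative, and plenty of pages need more massaging than one grep:

    # Dump the page to plain text, pull out anything shaped like an
    # IP:port pair, and collapse duplicates before they hit the database.
    links2 -dump 'http://proxylist.example.ru/list.html' \
        | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{1,5}' \
        | sort -u >> candidates.txt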
Now that the data is flooding in, I'm almost ready to work on my own list, like I have threatened many times before.
Stay tuned.
UPDATE! Since I wrote this yesterday, the address count in the database has gone over 20,000.
I pulled a Google trick and harvested 80 sites at once, for a total of over 8000 addresses. It was quite an exercise. When the dust settled, out of those 8000 addresses there were only 750 unique IPs. Most of those were already in the database.
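The whittling-down itself is nothing exotic. Roughly, assuming harvest.txt is the raw haul and database.txt is the master list (both names are my own convention):

    # 8000-odd raw lines collapse to the unique addresses...
    sort -u harvest.txt > unique.txt
    # ...and most of those turn out to be in the database already.
    grep -vxFf database.txt unique.txt > actually_new.txt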