I started this project on March 15. This morning, a month later, there were 122,000 address/port combinations in the database. Lately I've been doing a lot of ad hoc Google harvesting from proxy forums, terrorist Web sites (hell, I don't know, it's all in Arabic), Chinese blogs, and tons of miscellaneous sites (it was quite a surprise - although I'm not sure why I didn't expect it - to find hundreds of proxy list sites right here on Blogspot).
I had to go into ad hoc mode because Google told me I was violating their Terms of Service by running an automated harvester (one that worked really, really well, if I do say so myself). Oops. Sorry about that, Google.
Pickin's were gettin' slim. I have all the best providers feeding into the database on an hourly basis, but I was getting a lot of dupes, and 122,000 seemed like it might be the top end.
And then I found two random text files on two random servers. The first had 73,000+ unique entries (I checked!), the second had 52,000 and change. Absolutely jaw-dropping data, with very few dupes, at least so far. In text files.
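For the curious, a uniqueness check like this does not need to be fancy. Here is a minimal Python sketch of the idea, assuming each line of the text file is a plain `address:port` pair. The function names and the in-memory `known` set are made up for illustration; the real check ran against the database.

```python
import ipaddress

def parse_entry(line):
    """Return (address, port) for an 'a.b.c.d:port' line, or None if malformed."""
    host, sep, port = line.strip().rpartition(":")
    if not sep:
        return None
    try:
        addr = ipaddress.IPv4Address(host)
        port_num = int(port)
    except ValueError:
        return None
    if not 1 <= port_num <= 65535:
        return None
    return (str(addr), port_num)

def split_new_and_dupes(lines, known):
    """Split valid entries into ones not in `known` and ones already seen."""
    fresh, dupes = set(), set()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue  # skip junk lines rather than guess at them
        (dupes if entry in known else fresh).add(entry)
    return fresh, dupes
```

Because both results are sets, repeats inside the file collapse on their own, so `len(fresh)` is the count of genuinely new address/port combinations.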
I am now circling back to the SOCKS project. Sadly, I lost the last version of sockcheck.c I wrote when the BOT House drive died (thank you, Hitachi). I did manage to find an earlier copy, and I have run the most likely SOCKS servers (port 1080) through it. I found the same results as I did last year: there are very few SOCKS servers out there, and they're mostly Windows ISA servers.
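This is not the lost sockcheck.c (and it is Python, not C), but the basic idea of a SOCKS check can be sketched in a few lines. The sketch below assumes a SOCKS5 server and only performs the method-negotiation step from RFC 1928: send a greeting offering "no authentication" and see what comes back. The function names and result labels are hypothetical.

```python
import socket

# Version 5, one method offered, method 0x00 = no authentication (RFC 1928).
SOCKS5_GREETING = b"\x05\x01\x00"

def classify_socks5_reply(reply):
    """Interpret the 2-byte method-selection reply from a SOCKS5 server."""
    if len(reply) != 2 or reply[0] != 0x05:
        return "not socks5"
    if reply[1] == 0x00:
        return "open"                  # server accepted no-auth
    if reply[1] == 0xFF:
        return "no acceptable method"  # SOCKS5, but it wants authentication
    return "unexpected method"

def probe_socks5(host, port=1080, timeout=5.0):
    """Connect, send the greeting, and classify whatever comes back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(SOCKS5_GREETING)
            reply = b""
            while len(reply) < 2:
                chunk = sock.recv(2 - len(reply))
                if not chunk:
                    break  # peer closed before sending a full reply
                reply += chunk
    except OSError:
        return "unreachable"
    return classify_socks5_reply(reply)
```

An "open" result only means the server agreed to talk without authentication; a fuller check would go on to send a CONNECT request and confirm that traffic is actually relayed.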
I have also found some unique sources of proxy data that don't fit into the "proxy list" category. That is, these sites are collecting proxy data without trying. The data is a side effect of their logging; they're not deliberately collecting open proxy addresses. I'm not sure what I'm going to do with this, but it's an interesting find and there is a lot of data.
I also found that Wikipedia has an open proxy project (page defacers use proxies, so Wikipedia bans the addresses) that has some harvestable data. They only disclose the address, not the port. Getting the port is trivial with (say) nmap, but so far every address I've tested is already in my database.
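As a rough illustration of what "getting the port" involves, here is a small Python sketch that tries a TCP connection on a handful of ports. It is a stand-in for nmap, not a replacement for it, and the candidate port list is an assumption of mine rather than anything Wikipedia or nmap provides.

```python
import socket

# Assumed list of ports that proxies commonly listen on; adjust to taste.
COMMON_PROXY_PORTS = (80, 1080, 3128, 8000, 8080, 8888)

def open_ports(host, ports=COMMON_PROXY_PORTS, timeout=2.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass  # refused, filtered, or timed out: treat as closed
    return found
```

An open port is only a candidate. Each address/port pair found this way would still need an actual proxy check before going into the database.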
The project is wrapping up nicely. We are now entering into the Month Of Figuring Out What The Hell To DO With All This Data.