Tuesday, April 15, 2008

The Month of Proxy Lists

I started this project on March 15. This morning, a month later, there were 122,000 address/port combinations in the database. Lately I've been doing a lot of ad hoc Google harvesting from proxy forums, terrorist Web sites (Hell, I don't know, it's all in Arabic), Chinese blogs, and tons of miscellaneous sites (it was quite a surprise - although I'm not sure why I didn't expect it - to find hundreds of proxy list sites right here on Blogspot).

I had to go into ad hoc mode because Google told me I was violating their Terms of Service by running an automated harvester (that worked really, really well, if I do say so myself). Oops. Sorry about that, Google.

Pickin's were gettin' slim. I have all the best providers feeding into the database on an hourly basis. I was getting a lot of dupes, and 122,000 seemed like it might be the top end.

And then I found two random text files on two random servers. The first had 73,000+ unique entries (I checked!), the second had 52K and change. Absolutely jaw-dropping data, with very few dupes against the database, at least so far. In text files.
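For illustration, the "I checked!" step could be sketched like this in Python. This is a minimal sketch, not the actual tooling: it assumes one `address:port` entry per line, and the names `parse_entry` and `new_entries` are made up for the example.

```python
import ipaddress


def parse_entry(line):
    """Return (address, port) for a line like '10.0.0.1:8080', or None if malformed."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    host, sep, port_text = line.rpartition(":")
    if not sep:
        return None
    try:
        address = str(ipaddress.IPv4Address(host))
        port = int(port_text)
    except ValueError:
        # Covers bad addresses (AddressValueError is a ValueError) and bad ports.
        return None
    if not 1 <= port <= 65535:
        return None
    return address, port


def new_entries(lines, known):
    """Return entries from lines that are neither in `known` nor repeated in the file."""
    seen = set(known)
    fresh = []
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and entry not in seen:
            seen.add(entry)
            fresh.append(entry)
    return fresh
```

Feeding it an open text file and the set of (address, port) pairs already in the database gives the count of genuinely new combinations as `len(new_entries(f, known))`.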

I am now circling back to the SOCKS project. Sadly, the last version of sockcheck.c I wrote was lost when the BOT House drive died (thank you, Hitachi). I did manage to find an earlier copy, and I have run the most likely SOCKS servers (port 1080) through it. I got the same results as last year: there are very few SOCKS servers out there, and they're mostly Windows ISA servers.
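A SOCKS check of this kind can be sketched in a few lines. This is not sockcheck.c (that version is gone); it is a hypothetical Python sketch that only tries the SOCKS5 greeting from RFC 1928 and ignores SOCKS4. The names `accepts_no_auth` and `check_socks5` are invented for the example.

```python
import socket

# SOCKS5 greeting: version 5, one method offered, method 0x00 (no authentication).
SOCKS5_GREETING = b"\x05\x01\x00"


def accepts_no_auth(reply):
    """True if the reply is a SOCKS5 method selection choosing 'no authentication'."""
    return reply == b"\x05\x00"


def check_socks5(host, port=1080, timeout=5.0):
    """Connect, send the greeting, and report whether an open SOCKS5 server answered."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(SOCKS5_GREETING)
            reply = conn.recv(2)
    except OSError:
        # Refused, unreachable, or timed out: treat as "not a usable SOCKS5 server".
        return False
    return accepts_no_auth(reply)
```

A reply of `05 FF` means the server speaks SOCKS5 but wants authentication, so this sketch counts it as closed.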

I have also found some unique sources of proxy data that don't fit into the "proxy list" category. That is, these sites are collecting proxy data without trying. The data is a side effect of logging; they're not deliberately collecting open proxy addresses. I'm not sure what I'm going to do with this, but it's an interesting find and there is a lot of data.

I also found that Wikipedia has an open proxy project (since page defacers use proxies they ban the addresses) that has some harvestable data. They only disclose the address, not the port. Getting the port is trivial with (say) nmap, but so far every address I've tested is already in my database.
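The port lookup does not strictly need nmap. A rough stand-in is a plain TCP connect probe, sketched below in Python. The port list is an assumption (a handful of ports commonly used by proxies), and `open_ports` is a made-up name for the example.

```python
import socket

# Assumed list of ports where proxies commonly listen; adjust to taste.
COMMON_PROXY_PORTS = (80, 1080, 3128, 8000, 8080, 8888)


def open_ports(host, ports=COMMON_PROXY_PORTS, timeout=2.0):
    """Return the ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            # Closed, filtered, or timed out: skip this port.
            pass
    return found
```

An open port only says something is listening there; whether it is actually a proxy still takes a real request through it.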

The project is wrapping up nicely. We are now entering the Month Of Figuring Out What The Hell To Do With All This Data.

Friday, April 11, 2008

4/11 Power Outage

This morning at about 1:00AM we had our first extended power outage since October '07.

It was also the first power outage since the BOT House server was rebuilt in January. Everything shut down gracefully, which was also a first. I'm finally starting to get this stuff right. The server got about a half hour of standby power before it shut itself down. Whatever residual power was left over actually kept the telephone, cable modem, switch, and wireless access point running until the power came back on. Since BOT House is the router, they were all pretty much useless, but they kept running all the same.

I shut down EXP /// manually because it has a wimpy little UPS. It was never really meant for anything other than power spikes, which we get a lot of around here in late Spring and early Summer. So it was the first to go.

At any rate I got the place back online at about 5:30AM after I woke up. Now we're all juiced up and ready for the next power outage.

In other news, the Proxy Project will have over 100,000 entries by tonight. Sometimes I scare myself.