(Or, how I learned to stop worrying and love /etc/hosts)
Well, anyone reading this has probably worked out that we go through bouts of speed issues. We finally get one issue fixed, then a couple of months later everything is as slow as molasses again. Unfortunately, there are always a few problems fixing speed issues. (Most of this is obvious if you think about it, but I'm stating it for the record).
1) Timezone. Our office hours are squarely in the quietest part of the day. This is great - we get to monitor the backup fairly closely, since we put the backup in the quiet part of the day. It also sucks - different usage patterns actually get stuck on different bottlenecks in the whole system. It also means that (save for overtime) it's "make a change", followed by "wait a day", followed by "rinse and repeat". Sometimes (like last night) there's nothing quite as effective as staying up all night watching stats.
2) Multiple bottlenecks. I partially touched on this in (1), but only one aspect of it. Not only do the bottlenecks often shift during the day, but as soon as you nail one, there's another one a couple of percent behind it. It's not uncommon to remove a bottleneck, see usage increase a few percent because it's faster, then instantly hit a different bottleneck. Isn't it the way?
3) Telling cause from effect. For example, I can look at the pretty graphs we've got now, and go "wow, so many active network connections! No wonder it's slow, too many concurrent users!". Except that's not the whole story. Half of the time, the number of concurrent requests is so high because something has slowed the requests down. If you have 100 requests per second, and you can finish each request in 0.01 seconds, there will be (on average) one request active at any given time. If the time per request increases for an unrelated reason, the number of concurrent requests rises with it: at 0.1 seconds per request there would be ten requests active at any given time. So the high connection count in my example wasn't causing the problem, it was a side-effect of it. It only gets worse as you add more statistics and graphs to second-guess.
4) It's a live system. This one is obvious. There's only so much you can do to a live production system before people aren't happy!
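The arithmetic in point 3 is the relationship usually called Little's law: average requests in flight equals arrival rate times time per request. Here is a minimal sketch of it. The function name is mine, and the 1.0 second case is an extra illustration that isn't from our graphs.

```python
def average_concurrency(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of requests active at once (Little's law: L = rate * time)."""
    return requests_per_second * seconds_per_request


if __name__ == "__main__":
    # Same arrival rate throughout; only the time per request changes.
    for seconds in (0.01, 0.1, 1.0):
        active = average_concurrency(100, seconds)
        print(f"{seconds} s per request -> about {active:g} requests active at once")
```

The point is that the connection count can climb tenfold without a single extra visitor turning up.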
In the past few weeks I'd gotten all of our graphs to look really good - server load, memory usage, CPU usage... everything except the response times! I've added a bunch of graphs (using munin), quite a few of them custom ones to monitor MySQL and Apache. Response time is measured from Apache's "%D" log format option, split into total / showthread / forumdisplay / adxmlrpc, and averaged over 1000 when it's sampled.
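For the curious, the response time bookkeeping can be sketched roughly like this. It is not the actual munin plugin, just an illustration under a few assumptions: that "%D" (which Apache reports in microseconds) is the last field on each log line, that "averaged over 1000" means the most recent 1000 requests in each bucket, and that a request belongs to a bucket if the script name appears anywhere in the line. The function names are made up.

```python
from collections import defaultdict, deque

# Bucket names from the post; matching is a plain substring test (an assumption).
SCRIPTS = ("showthread", "forumdisplay", "adxmlrpc")
WINDOW = 1000  # assumed meaning of "averaged over 1000": the last 1000 requests


def make_tracker(window: int = WINDOW):
    """One bounded queue of recent response times (microseconds) per bucket."""
    return defaultdict(lambda: deque(maxlen=window))


def record(tracker, log_line: str) -> None:
    """Record one log line, assuming %D is the final whitespace-separated field."""
    parts = log_line.rsplit(None, 1)
    if len(parts) != 2 or not parts[1].isdigit():
        return  # skip lines that don't end in a number
    micros = int(parts[1])
    tracker["total"].append(micros)
    for name in SCRIPTS:
        if name in parts[0]:
            tracker[name].append(micros)


def averages_ms(tracker) -> dict:
    """Average response time per bucket, converted to milliseconds."""
    return {name: sum(times) / len(times) / 1000.0
            for name, times in tracker.items() if times}
```

Feeding it lines as they are written and calling averages_ms() whenever the graph is sampled gives one number per bucket.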
On a whim, I ssh'd into our servers and ran tcpdump. For those that don't know, tcpdump is a tool for monitoring packets going over a network connection (despite the name, not just TCP ones). It doesn't handle huge volumes of traffic terribly quickly, and wading through the output isn't fun. Running 'tcpdump port not http and port not mysql and port not ssh' quickly removed all HTTP (i.e. traffic to you guys), MySQL (traffic to the DB server), and SSH (traffic back to me) data. HTTP and MySQL traffic account for the vast bulk of the data the servers use, and I didn't need tcpdump telling me that it was sending me packets.
What was left quickly provided the clue to how to solve the problem. There were DNS requests. A lot of them. Each of the CGT servers doubles as a DNS server, so you'd expect lots of DNS traffic. What you don't expect is a lot of DNS queries leaving the server. I saw many, many queries for "server2.cgnetworks.com" and "ads.cgsociety.org". DNS queries are pretty quick individually. Normally your browser says "Where's forums.cgsociety.org?", gets a response after half a second, and goes on its merry way, contentedly caching the DNS information for the next page you load.
Of course, Apache and PHP weren't caching DNS lookups for the DB server, nor for the ads server. Each time a new DB connection was needed (which is fairly frequent in most PHP/MySQL setups), the connection was only made after a relatively lengthy lookup. The same was true of the XML-RPC call the ad system uses to pick ads. The ads system also managed to get configured to do reverse DNS (RDNS) lookups for every query, which is ultimately useless and wasteful. (Just to stop anyone from suggesting we not have ads: we get a lot of financial support from selling advertising on the site. As much as everybody wants free content without ads, we have to balance that with the reality of being able to afford to exist!)
So, on to the solution. The subtitle gives it away - the hosts file (/etc/hosts on *nix systems) is a "short-cut" to DNS resolution. You can put name/IP pairs in there, and anything that would previously have used DNS to do the lookup gets its information from there first. Putting information about the ad and DB servers into the hosts file quickly eliminates the DNS queries, and drops the query time. The lower response time drops the number of required Apache processes, dropping the RAM usage, which stops the response time run-away. Volume used goes up, concurrent connections go down, I get a vertical line pointing downwards on my "response time" graphs, and everybody is a little bit happier than they were the day before.
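To make the "hosts file first, DNS second" idea concrete, here is a toy sketch of the lookup order. It is not how the system resolver is actually implemented (on most Linux boxes the order comes from /etc/nsswitch.conf), and the addresses are placeholders from the 192.0.2.0/24 documentation range, not our real ones. Only the two hostnames come from the tcpdump output above; the function names are invented.

```python
def parse_hosts(text: str) -> dict:
    """Map each hostname or alias to the address on its line, hosts-file style."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)  # first entry for a name wins
    return table


def resolve(name: str, hosts: dict, dns_lookup):
    """Answer from the hosts table if possible; only fall back to DNS on a miss."""
    address = hosts.get(name.lower())
    return address if address is not None else dns_lookup(name)


# Placeholder addresses, purely for illustration.
EXAMPLE_HOSTS = """
# address      hostname
192.0.2.10     server2.cgnetworks.com
192.0.2.20     ads.cgsociety.org
"""
```

With entries like these in place, every DB connection and ad lookup is answered locally and never generates a query on the wire, which is exactly what made those packets disappear from tcpdump.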
It's not the end, of course. I still want it faster. Early in the new year we'll rinse-and-repeat, and hopefully get the performance to increase some more.
Merry Christmas!