I was at home strumming the guitar when a message came in saying my wife was on her way back. I grabbed my phone to head out and meet her at the subway station near our place, but noticed that notifications had piled up in Slack. We use Slack for team chat, so I opened it without thinking. In the channel that lights up whenever something goes wrong with the servers, several new messages were waiting. My heart skipped a beat. Looking more closely, I saw that a few different errors had fired over and over.
The first error said, “too many open files.” The second said something like “unable to find server blabla.”
The second message even implied it could not find the database server. If the DB really were unreachable, the service should have been dead. Worried, I hit our product directly, and it loaded just fine. Data was coming through too. In a sense, the incident already seemed “over.”
Still, I could not just shrug and go back to the guitar, so I started digging through the logs again. I combed through the WAS (web application server) logs, syslog, anything that looked remotely suspicious, and googled whatever I found, but nothing meaningful turned up.
Looking at the sequence of errors, “too many open files” appeared first, and the DB lookup failure only came afterward, which suggested the second error was a symptom of the first: a process that has exhausted its file descriptors cannot open a new socket, so it fails to reach the database even when the database is perfectly healthy. On a hunch I also checked the other servers behind the load balancer, and they were all fine. That made it hard to believe the database host itself had really gone down.
Since “too many open files” had shown up, I checked the open-file limit with ulimit, and there it was. Wait, 1024? Why was it set that low?
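That 1024 is the soft limit on open file descriptors per process, and on a busy server every socket counts against it, so a web application can run into the wall while the machine otherwise looks healthy. Besides running ulimit -n in a shell, the same limit can be read from inside the process; here is a minimal Python sketch of that check, purely as an illustration of what the number means, not the tooling I was actually using that night.

```python
import resource

# RLIMIT_NOFILE caps how many file descriptors this process may hold open.
# Sockets, log files, and pipes all count against it, so a busy server can
# hit a low soft limit long before anything else looks wrong.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")  # e.g. soft=1024, hard=4096
```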
Then it clicked. About a month earlier we had switched the WAS, which meant the Linux service account it ran under had changed too, and I had forgotten to raise the ulimit for the new account.
Oh.
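For reference: a process can bump its own soft limit up to the hard limit when it starts, which works as a stopgap, but the durable fix, the one I had skipped, is configuring the limit for the service account itself (for example in /etc/security/limits.conf, or with LimitNOFILE= in a systemd unit) so it survives restarts and redeploys. Below is a hedged Python sketch of the stopgap only, not what we actually deployed.

```python
import resource

# Stopgap only: raise this process's soft open-file limit to its hard limit.
# An unprivileged process may freely raise its soft limit up to the hard limit;
# the lasting fix is to configure the limit for the service account itself.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

print("RLIMIT_NOFILE is now:", resource.getrlimit(resource.RLIMIT_NOFILE))
```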