As I alluded to on Tuesday, we had a security incident at work over the July Fourth weekend, so we’ve been spending all week cleaning up after that. There’s a statement about it on my company’s blog. I won’t link to it here, since I don’t necessarily want this post to show up in searches or referral logs related to it. (I’m not going to post anything I shouldn’t be posting, and I don’t know enough about the details to say much anyway, but just in case…)
On Tuesday, we were told to just stay home and chill. They didn’t give us any real info on what happened, so that was basically just a “snow day,” as I mentioned in my last post. By Wednesday, they had made enough progress with the initial mitigation that we could start working on our disaster recovery. Since then, it’s been a lot of “hurry up and wait” work, where servers are getting restored or rebuilt, scanned, and released, and then we can do final setup and testing. I was in the office Wednesday and Thursday, but I’m hoping I can get away with working from home today.
There are a lot of people working very hard on this. I’m primarily a programmer, with only limited admin responsibilities. There are only two servers that I’m “officially” responsible for (out of hundreds total across the company), so mostly I’m just waiting on other stuff, answering questions, and helping out where I can. I feel a little guilty, knowing that some people are sleeping in their offices while I’m going home and sleeping in my own bed, but of course there’s not much I can do to help those folks, other than to stay out of their way.
I had a bunch of other stuff to say about this, but I’ll just say that I think we’re doing a pretty good job of handling this thing. Everyone has been calm and professional through it all, at least from my limited vantage point. There’s still a lot of work to do, but we’re making steady progress.
Purely by coincidence, I read an article in Communications of the ACM last week that talked about the best way to handle outages in IT. The article is more about unintentional errors that cause outages, rather than security incidents, but there’s a lot in common. I like the idea that “DevOps celebrates mistakes.” I hope that, when we’re back up & running, we’ll get a good postmortem on this that we can learn from.