The Great Transmitter Outage
How one false move brought down the entire cloud service (but there is a silver lining)
While taking part in a group flight with MyAir last night, a few of the pilots involved began to notice strange behaviour from Transmitter - the application I wrote some time ago that helps people see each other on a map in group flights. Some aircraft were appearing and disappearing, and one in particular was appearing multiple times.
While looking at the source code, I decided to simplify a small part of the back-end processing, and made a small mistake. Within moments the web server fell to its knees, and no matter how quickly I attempted to rectify it, the weight of a hundred people reporting their position every second brought the server down again.
After five minutes of fire-fighting - while also listening to a chorus of pilots reporting the same problem repeatedly over the radio (because many pilots, it turns out, only report - they don’t actually listen) - I took the entire system offline in order to figure out what on earth was going on.
There’s a somewhat famous t-shirt about debugging. The shirt starts with “It’s never done that before” and ends with “I’m not sure how it ever worked in the first place”. The maxim proved itself.
Somewhere along the road - during one of the many small changes that have been implemented in Transmitter - some validation code was missed. A single function in the code that removes unwanted input. It might not sound important until you think in terms of the real world. What if a small piece of metal got into the gearbox of a car? A piece of metal that wasn’t supposed to be there? The engine would lunch itself pretty quickly - and that’s exactly what happened to Transmitter.
A user (who shall remain nameless, because the situation should never have happened) inadvertently entered a backslash in his name within the Transmitter client. He didn’t even notice he had done it. That one backslash pulled a card from the bottom of a house of cards in just the right place (I believe that’s called Sod’s law).
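For anyone wondering what a validation function of that kind looks like, here is a minimal sketch in Python. It is not Transmitter’s actual code: the function name, the length limit, and the exact set of characters removed are all assumptions made for illustration.

```python
import re

# Hypothetical limit; the real limit used by Transmitter is not stated here.
MAX_NAME_LENGTH = 32

# Assumed set of unwanted input: backslashes plus ASCII control characters.
_UNWANTED = re.compile(r"[\\\x00-\x1f\x7f]")


def sanitize_pilot_name(raw: str) -> str:
    """Return a copy of `raw` that is safe to pass on to later processing."""
    # Drop every unwanted character rather than trying to escape it.
    cleaned = _UNWANTED.sub("", raw)
    # Collapse repeated spaces and trim the ends.
    cleaned = re.sub(r" +", " ", cleaned).strip()
    # Cap the length so one oversized name cannot bloat every position report.
    return cleaned[:MAX_NAME_LENGTH]


if __name__ == "__main__":
    # A name typed with a stray backslash comes out clean.
    print(sanitize_pilot_name("Jon\\athan"))
```

The point is where the check sits: it runs once, at the moment the input arrives, so nothing further down the line ever sees the stray character.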
Two things came out of the next five minutes. Firstly, I really needed to review the code, and secondly, the web server needed more headroom in case anything similar should ever happen again. The only problem with that? It would mean taking “VirtualFlight.Online” offline for several minutes.
The server got its upgrade early this morning while four pilots were in the air. Within two minutes it was back up - with double the RAM, and double the processing power it had previously.
Anyway.
Apologies to all who might have been using Transmitter at about 20:00 UTC last night. The service was back up and running by 21:00, and then upgraded this morning by about 08:00. Long may the uninterrupted tracking of flights continue!