AWS going AWOL last week is exactly why less is more in cloud server land

Column Serverless is the new hotness. Like so much corporate IT, it’s a complete misnomer. There are just as many servers as before, but your tasks – or microservices, if you need four more syllables – have no idea which ones they’re using. Same meat, different stew.

On Wednesday last week, Amazon decided to take serverless at face value. Its US-EAST-1 region fell out of the cloud like a rain of frogs, leaving hundreds of services crippled or dead. Adobe Spark, Roku, Flickr, iRobot, and many more stuttered and fell. The firm passed the buck as quickly as possible, but to nobody’s satisfaction.

All at once, we were back to the pre-PC days where everything lived on the mainframe and you did what IBM told you to do, and if you didn’t like that then good luck living outside the ecosystem.

This is one of the cloud’s less impressive aspects. Amazon is famously reluctant to disclose what goes on in incidents like this – as a friend pointed out, Microsoft is much more open but much less reliable; take yer pick – so we won’t know what happened. More importantly, we won’t know whether it’ll happen again in US-EAST-1, or in one of Amazon’s 20-odd other regions. Why Amazon couldn’t move load elsewhere is also a mystery: cloud is supposed to be all about agile loads, right? Apparently not. Which makes it hard to manage risk.

Managing risk is central to business. It’s not sexy, it’s not easy to understand, but it’s what keeps you alive. AWS and GCP and Azure and all the other clouds will talk to you forever about scalability, super-duper services, management tools, you name it – but ask for figures on resiliency and you’ll get a lot of generalities. Amazon has plenty to say about designing your systems for resiliency, but rather less on the statistics for when you’ll need it. Amazon’s CTO, Werner Vogels, says: “Everything fails, all the time.” Not, you might think, a useful metric. But you have to take him at his word.

There is a solution, of course, which takes this problem out of the hands of the cloud providers, leaving them free to engineer for resiliency or not as they please and accept the commercial consequences.

Back to serverless

Serverless has jobs and data floating around in virtual space, launched through APIs and communicating through messages. It’s a good way of thinking, just as it was in 2002, when it was called grid computing. IBM, HP, and Fujitsu were poised – poised, I tell you – to win, and win big, in this new distributed infrastructure world. They failed, as big companies who have to protect existing revenue and environments will always fail, and AWS won because it didn’t care about that.

That’s out of order

But a part of grid computing that hasn’t made it into serverless is the thing that would have kept Adobe Sparking and Flickr flickering – the digital dial tone. The grid worked between providers – it was seen initially as a way to utilise spare capacity by common interfaces – so when you used the grid, you didn’t necessarily know which company was doing what part of the work. You heard the dial tone, sent in your work, and that was that.

It’s not too much of a reach to see that replicated across the different serverless offerings from the different cloud providers. The decision logic could even be on-premises as a control plane that configures routes rather than carrying load itself, and it could decide on many metrics – price, latency, historic or instantaneous availability – and take on the risk of any particular component going wonky. Or this might be something independent brokers do, in conjunction with the in-cloud routing needed for client traffic. It’s an interesting model to play around with.
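To make the idea concrete, here is a minimal sketch of what such decision logic might look like: a broker that scores each provider on price, latency, and historic availability, and routes the next job to the current winner. Everything in it – provider names, metrics, weights – is hypothetical, not any real cloud’s API.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    price_per_hour: float   # USD per compute-hour; lower is better
    latency_ms: float       # observed round-trip latency; lower is better
    availability: float     # historic fraction of successful calls; higher is better

def score(p: Provider, w_price=0.4, w_latency=0.3, w_avail=0.3) -> float:
    # Invert the "lower is better" metrics so every term rewards higher values,
    # then combine with (hypothetical) weights the operator can tune.
    return (w_price * (1.0 / p.price_per_hour)
            + w_latency * (1.0 / p.latency_ms)
            + w_avail * p.availability)

def pick(providers: list) -> Provider:
    # Route the next job to whichever provider currently scores best.
    return max(providers, key=score)

providers = [
    Provider("cloud-a", price_per_hour=0.10, latency_ms=20, availability=0.9995),
    Provider("cloud-b", price_per_hour=0.08, latency_ms=45, availability=0.9990),
    Provider("cloud-c", price_per_hour=0.12, latency_ms=15, availability=0.9999),
]
best = pick(providers)
```

A real broker would refresh those metrics continuously and weigh instantaneous health alongside history, but the shape of the decision – score, compare, route – stays this simple.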

Digital dial tone isn’t the cure for everything. It implies new compromises because you’re taking some of the management and traffic out of a cloud provider’s own infrastructure with all that this implies for throughput and latency.

If your company lives by real-time analytics of huge data sets, you’re not going to be able to slice and dice them willy-nilly across a sea of subservient serverless compute and storage. Your risk analysis will look very different from, say, consumer digital content storage. But if your service architecture is built to autoscale down to zero, to summon another fashionable trope, then scaling up from cold may well not care about the paths its components take.

The point isn’t that cloud providers haven’t achieved very high degrees of reliability. They have. It’s that you as their client don’t have the tools to easily decide how much trade-off is good for you, or how much risk you’re happy to cede to a company in short-term resilience or long-term lock-in. Yes, you can do multi-cloud already, but not as-a-service – and as-a-service is where the magic happens.

The final irony would be if a third party with nothing to lose implemented all this on top of the existing cloud, forcing the rigour of a much more competitive environment on the incumbents who’d once used the same moves to overthrow the previous generation. Expect quite the fight if that starts looking likely. As one long-term student of technology’s battles for survival said: “Too bad they won’t live – but then again, who does?” ®
