Today was a really bad day for us. We just released a few new products and have been throwing a lot of targeted advertising and press efforts behind them. The Amazon Web Services outage yesterday took our site down at a really inopportune time. We lost business, visitors, and a lot of the momentum we’ve worked towards and invested heavily to create. Are there more reliable web servers out there? Or is AWS still the best game in town that we just have to live with?
My understanding is that only the US-EAST-1 region was affected. It's possible to set up cross-region replication, which might have helped to reduce the impact of this issue. (Is there anyone here using cross-region replication, and were you affected?)
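For anyone who has not set it up before, here is a rough sketch of what enabling S3 cross-region replication could look like from Python. It assumes you pass in two boto3 S3 clients (one per region) and that an IAM role S3 can assume for replication already exists; the bucket names and role ARN in the usage note are made up. It uses the older prefix-based rule format, so treat it as a starting point rather than a recommended configuration.

```python
def enable_replication(src_client, dst_client, src_bucket, dst_bucket, role_arn):
    """Turn on versioning for both buckets, then replicate src_bucket into dst_bucket.

    src_client and dst_client are expected to be boto3 S3 clients created
    for the source and destination regions respectively.
    """
    versioning = {"Status": "Enabled"}
    # S3 replication requires versioning on both the source and destination.
    src_client.put_bucket_versioning(
        Bucket=src_bucket, VersioningConfiguration=versioning
    )
    dst_client.put_bucket_versioning(
        Bucket=dst_bucket, VersioningConfiguration=versioning
    )

    config = {
        "Role": role_arn,  # assumption: a role that S3 may assume to copy objects
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",  # empty prefix = every object (older rule format)
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{dst_bucket}"},
            }
        ],
    }
    src_client.put_bucket_replication(
        Bucket=src_bucket, ReplicationConfiguration=config
    )
    return config
```

You would call it with something like boto3.client("s3", region_name="us-east-1") and boto3.client("s3", region_name="eu-west-1"). Replication set up this way only covers objects written after it is enabled, and the application still has to be pointed at the replica bucket when the primary region is unavailable.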
This particular outage affected my company's apps as well, as we only use US-EAST-1. As far as I can recall, this was the first time in three years that we had an outage caused by AWS.
Firstly - I feel your pain. Having a cloud provider go down makes you feel helpless.
Amazon AWS is in fact a lot of services, so you really need to be more specific to get a good answer. Some AWS services have no equivalent at any other company, so if you start using them you are locked in.
My (not well informed) understanding of the incident is that just one of them (S3) failed in one area. When I have designed systems for AWS in the past I have
a) had multiple installations in different Amazon availability zones in the same region - so that if one goes down then another in the same region can be spun up and take over. They typically "share" the same data though.
b) had multiple installations in different Amazon Regions (typically Dublin, Ireland and one of the US regions). The data however is usually a few hours out of sync - e.g. updated at midnight. I could bring up a new server in minutes, but it would not be up to date.
c) had off-Amazon backups of data with which I could bring up the service again in a few weeks (this was pre-devops days :-) should the whole of Amazon go belly up.
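The a/b/c layering above amounts to an ordered list of fallbacks. A minimal sketch of that idea, with made-up tier names and health checks standing in for whatever monitoring you actually have:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Tier:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health check for this tier
    caveat: str  # what you give up by falling back to this tier


def pick_tier(tiers: Sequence[Tier]) -> Optional[Tier]:
    """Return the first healthy tier, in the order given (most preferred first)."""
    for tier in tiers:
        try:
            if tier.is_healthy():
                return tier
        except Exception:
            # A health check that raises is treated as an unhealthy tier.
            continue
    return None


# Example: the primary and its same-region standby are both down.
tiers = [
    Tier("primary availability zone", lambda: False, "none"),
    Tier("standby availability zone", lambda: False, "shares data with primary"),
    Tier("second region", lambda: True, "data a few hours stale"),
    Tier("off-Amazon backup", lambda: True, "slow to restore"),
]
chosen = pick_tier(tiers)
```

In this example chosen is the "second region" tier, and its caveat field is the reminder that the data there is behind.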
We also invested heavily in CDNs which could cache and deliver content even if the main site went down. Though this may not be relevant to a dynamic app.
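The CDN behaviour described here (keep serving cached content while the origin is down) can be illustrated with a small in-process cache. This is only a sketch of the idea, not how any particular CDN implements it; the fetch callable and TTL are placeholders.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh content when possible, and stale content if the origin fails."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # hypothetical function that requests the origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, path: str) -> bytes:
        cached = self._store.get(path)
        now = self._clock()
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh, no origin request needed
        try:
            body = self._fetch(path)
        except Exception:
            if cached is not None:
                return cached[1]  # origin is down: serve the stale copy
            raise  # nothing cached, so the failure has to surface
        self._store[path] = (now, body)
        return body
```

Pages that were requested at least once before the outage keep working; anything never cached still fails, which is why this helps static content much more than a dynamic app.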
Is Amazon unreliable? No - it is one of the best. There may be other choices.
Should you have a disaster recovery plan? Yes. But a DR plan costs money, so you might just need to put up with the outage.
Should I assume that any part of my system may fail at any time? YES
PS Someone mentioned Heroku went down. I thought Heroku actually ran on Amazon AWS so I am not surprised.
The company I'm at now was also affected by the outage on AWS and Heroku (which uses AWS). We are looking into some disaster recovery options, such as using Dokku/Docker as an alternative to Heroku, backing up our S3 assets so we could deploy elsewhere if needed, and keeping local database dumps so we don't have to rely on RDS snapshots.
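As an illustration of the "back up our S3 assets" option, here is a minimal sketch that mirrors a bucket to a local directory. It assumes client is a boto3 S3 client with read access to the bucket; the bucket name and destination directory are whatever you choose. It does no incremental sync or key sanitising, so it is a starting point only.

```python
import os


def mirror_bucket(client, bucket: str, dest_dir: str) -> int:
    """Download every object in `bucket` into `dest_dir`; return how many were copied."""
    copied = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page when there is nothing to list.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # "folder" placeholder object, nothing to download
            target = os.path.join(dest_dir, *key.split("/"))
            parent = os.path.dirname(target)
            if parent:
                os.makedirs(parent, exist_ok=True)
            client.download_file(bucket, key, target)
            copied += 1
    return copied
```

Run on a schedule from a machine outside AWS, something like this gives you a copy of the assets that does not depend on S3 being reachable.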
AWS is still the best game in town but having a DR plan that works will protect your company in case of a severe outage and will let you more easily migrate to other platforms if AWS continues to have issues.
Try Siteground. It has always been reliable for me; they have never had an issue with servers going down, and they are the best-rated provider out there.
I agree with George. If your services are that critical for your business, then you should always have another zone in the mix. Not only does it let your visitors reach the node nearest to them, but if the nearest one is down, the traffic gets redirected to a node in another zone. It is part of the business continuity process. If you don't have one, please set one up soon.
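The routing rule described here (nearest node first, otherwise another node that is still up) is simple to state in code. A toy sketch, with invented latency figures; in practice this decision is usually made by DNS or a load balancer rather than by your own code:

```python
from typing import Dict, Iterable, Optional


def route(latency_ms: Dict[str, float], healthy: Iterable[str]) -> Optional[str]:
    """Pick the lowest-latency node that is currently healthy, or None if all are down."""
    up = set(healthy)
    candidates = [(ms, node) for node, ms in latency_ms.items() if node in up]
    return min(candidates)[1] if candidates else None


# Hypothetical latencies measured from one visitor's location.
latencies = {"us-east-1": 20.0, "us-west-2": 70.0, "eu-west-1": 90.0}
normal = route(latencies, ["us-east-1", "us-west-2", "eu-west-1"])
during_outage = route(latencies, ["us-west-2", "eu-west-1"])
```

Here normal is "us-east-1" and during_outage is "us-west-2": the visitor gets a slower response instead of an error.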
If the outage happens, it will affect every provider in that region. I hope it is not isolated to AWS.
There are several other platforms, including Microsoft Azure and Google Cloud Platform. Also it's possible to have your services running in multiple zones on AWS so if there's a failure in one, customers can still access your services.
We still have to see the full 411 on the incident, so before that it is hard to speculate which mitigation would have worked. Full cross region (not just zone) replication is very likely to have done the job. But as a startup you may not want to spend resources on that just now.
My guess is that something went wrong on the internal resource allocation system somewhere, triggering a corrective action that escalated. By AWS design this is isolated from other regions.
Generally, it does not get more reliable than AWS. Maybe Google. But there have not been enough outages to draw any statistically meaningful comparison.
I wouldn't spend too much time engineering against the specific failure. Like with the EC2 outage a few years ago, Amazon is not going to stop until they have totally eliminated the root cause.
Do cross region fail-over if you can afford it.
Your only solace is that it is not hard to explain the fall-out of the event.
One thing though. In my opinion it's overkill to host a website with a cloud provider. By website I mean the front page of the business, not the services it offers. Our website is on shared hosting, while our backend is in Azure.
Hosting a landing website with a cloud vendor complicates things a lot, because these types of contingencies need to be accounted for.
Something like AWS only makes sense if you have a high-traffic website or a high-load web service that needs a CDN.
That said, no matter which cloud vendor you choose, outages will happen; that is how costs are kept low on our side. You can host your own infrastructure and limit outages to things that are more or less in your control, but such a setup is really costly. That's why cloud offerings are so logical to use: lower cost for a relatively big infrastructure, with minimal downtime and complexity on your part.
My 2 cents.
Hello, I use Heroku, and yesterday it was down too. I am in the European region, but I also had several buckets on AWS S3 storage and they were unavailable as well. The engineers were working on some problem between AWS and Heroku. My web app has not been released yet, so it was not too bad.
IMO AWS is still the best option out there. Multiple other competitors have had serious outages in the last year as well.
To reduce your risk, it would be best to invest in a replication and disaster recovery plan. In this case you could replicate your assets to a separate region and have a failover to switch to if US-EAST-1 goes down. AWS provides a tremendous number of tools to prevent issues, but depending on your needs it can get costly. Losing revenue, as you know, can cost even more.
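One way to weigh the cost of a DR setup against lost revenue is a simple break-even calculation. The sketch below makes a strong simplifying assumption (the DR setup prevents the outage entirely), and both inputs are numbers you would have to estimate yourself.

```python
def dr_break_even_hours(dr_cost_per_year: float, revenue_lost_per_hour: float) -> float:
    """Hours of outage per year above which the DR setup costs less than the downtime.

    Assumes, optimistically, that the DR setup avoids the outage completely.
    """
    if revenue_lost_per_hour <= 0:
        raise ValueError("revenue_lost_per_hour must be positive")
    return dr_cost_per_year / revenue_lost_per_hour
```

For example, with a placeholder DR cost of 12000 per year and 3000 of revenue lost per hour of downtime, dr_break_even_hours(12000, 3000) returns 4.0: if you expect more than four hours of outage a year, the DR setup pays for itself under these assumptions.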