Six Ways Agencies Can Manage the Risk of Cloud Crashes
When part of the Amazon EC2 cloud crashed on April 21, 2011, the federal government received a small but unwelcome lesson in the potential downsides of cloud hosting.
The crash of a key data center seized headlines by sending some high-profile commercial websites such as Quora and Reddit temporarily offline. Though less publicized, the U.S. Department of Energy was also a casualty. Its OpenEI website, which promotes collaboration in clean energy research, went down for almost two days.
The Amazon incident is a reminder that even the largest and most sophisticated cloud providers can encounter outages and other mishaps. Through careful planning, government agencies can embrace the cloud while reducing their exposure to unexpected downtime—and the resulting costs for the taxpayer.
There are various facets to a successful plan. Here are six of the most important:
- Incorporate failover for all points in the system. Every server image should be deployable in multiple regions and data centers, so your system can keep running even if one or more of them suffers an outage. Less visible components of your system require similar attention. For example, consider making failover arrangements for your DNS service so that you can still reach and update your records even if some of its servers are unavailable.
- Develop the right architecture for your software. Architectural nuances can make a huge difference to your system’s failover response. If you have a database in one region, for instance, it offers you no real protection to have server instances in another region unless updates to the database in one region carry over to the other. A carefully architected system will keep the database in sync with a copy of the database elsewhere, allowing for seamless failover.
- Take care in negotiating service level agreements. For fully managed services in particular, your SLA should provide reasonable compensation for the business losses you may suffer from a service outage. If you simply receive prorated credit for your hosting costs during downtime, that won’t compensate for the costs of a large system failure—and it doesn’t give your provider a great incentive to ensure steady service, either. This point is especially crucial in negotiating for SaaS, where you have to depend completely on your provider for failover.
- Design, implement and test a disaster recovery strategy. One component of such a strategy is the ability to draw on resources such as failover instances at a secondary provider, in case your main provider runs into trouble. Adequate provisions for data recovery and backup servers are also essential. Once a strategy is in place, you can run simulations to make sure your plans will actually work in a crisis. Without such periodic testing, your disaster recovery plan is likely to fail.
- In coding your software, plan for worst case scenarios. In every part of your code, it’s best to assume that the resources it needs to work might become unavailable at any time, and that any part of the environment could go haywire. One technique to guard against such events is to simulate potential failures during testing, such as an unreachable database or a stalled request, and confirm that the software responds correctly to a cloud outage. Otherwise, it might not recognize a failure in the system and could end up corrupting your agency’s data.
- Keep your risks in perspective, and plan accordingly. It can be extremely expensive to eliminate the risk of a temporary failure resulting from an outage or other problem in the cloud. In cases where even a brief downtime would incur massive costs or impair vital government services, multiple redundancies and split-second failover can be worth the investment. If a website or service is not so critical, your agency might prefer to invest fewer resources and accept the risk that it could go down for several minutes or hours.
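To make the failover and worst-case-coding points above more concrete, here is a minimal Python sketch of code that assumes any region can fail and falls back to a synchronized copy elsewhere. Everything in it is hypothetical and for illustration only: the region labels `us-east` and `us-west`, the `RegionUnavailable` exception, the helper names, and the in-memory dictionary standing in for a replicated database. A real system would call its provider's actual endpoints instead.

```python
class RegionUnavailable(Exception):
    """Raised when a (hypothetical) regional endpoint cannot serve a request."""


def make_fetcher(region, data, healthy=True):
    """Build a stand-in for a regional data service.

    Setting healthy=False simulates an outage in that region, which is
    one simple way to test failure handling before a real crisis.
    """
    def fetch(key):
        if not healthy:
            raise RegionUnavailable(f"{region} is down")
        return data[key]
    return fetch


def fetch_with_failover(fetchers, key):
    """Try each (region, fetch) pair in order and return the first success."""
    errors = {}
    for region, fetch in fetchers:
        try:
            return region, fetch(key)
        except RegionUnavailable as exc:
            # Record the failure and move on to the next region.
            errors[region] = str(exc)
    # Fail loudly rather than return stale or partial data.
    raise RuntimeError(f"all regions failed: {errors}")


# Assumption: both regions hold a synchronized copy of the same data.
replica = {"report-42": "clean energy dataset"}

fetchers = [
    ("us-east", make_fetcher("us-east", replica, healthy=False)),  # simulated outage
    ("us-west", make_fetcher("us-west", replica)),
]

region, value = fetch_with_failover(fetchers, "report-42")
print(f"served from {region}: {value}")  # served from us-west: clean energy dataset
```

The sketch only works because both regions read the same replicated data, which is the architectural point made earlier: spare servers in a second region are of little use unless the data they need is kept in sync there. Raising an error when every region fails, instead of guessing, is one way to keep the software from quietly corrupting data during an outage.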
In light of the Amazon experience, should government agencies shy away from cloud hosting? Hardly. An organization is no less vulnerable if it hosts its own data, and it’s likely to be less adept at dealing with an outage than a skilled cloud provider will be. Meanwhile, the cost of preparing for such contingencies is considerably lower with cloud hosting.
Problems occur in all walks of life, and cloud computing is no exception. As long as your agency plans ahead, it should be able to weather a crisis in the cloud.