Resilient Applications Are the Key To Availability In the Cloud
Historically, high-availability applications have been implemented through high-availability infrastructure, giving applications always-on, resilient resources at the network, server and storage layers. High-availability infrastructure ramped up dramatically in the virtualization era, enabling virtual machines that could readily stay online for years at a time without interruption of service. Ultra-reliable infrastructure allowed many applications to evolve with resiliency in some areas, but with a dependency on some core system, such as a shared central database without which everything built on top fails.
The Cloud is No Place for Pets
Those crucial pieces are often referred to as pets. Pets are system components that require a lot of love to configure and that you repair at any cost when they get sick. The premise of the cloud is built around cattle: dozens or hundreds of equivalent resources that are only special in their sameness. When one dies, you replace it. In the cloud, infrastructure is built to be significantly less reliable, which in turn makes it economical to distribute even a small application across many servers or data centers.
But many workloads have gotten comfortable running on highly reliable infrastructure, creating franken-pets that are elaborate and difficult to fail over. Because these failure scenarios are rare, failover is not tested regularly, which exacerbates the problem. Paradoxically, the more pet-like these central components become, the more dependencies are placed on top of them. They devolve from architectural foundations to boat anchors, putting both your resilience and your budget at risk in the cloud.
Simplified Infrastructure Enables Resilient Applications
The underlying premise of the cloud is to have access to more infrastructure, but in a less reliable fashion. Infrastructure will fail, and the more complex it is, the more complex those failures will become. Instead, the cloud simplifies infrastructure and lends itself well to simplified application design built on top of it, achieving resilience through horizontal rather than vertical scaling.
While servers are more likely to fail in the cloud, it’s easier than ever to pool dozens of servers into a single, self-healing group and to scale up and down that group as needed to match demand in real time. This simultaneously allows for cost savings in idle periods and rapid healing in the event of application or infrastructure issues.
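The self-healing, demand-matched pool described above can be sketched in plain Python. This is a toy simulation for illustration, not real AWS tooling: the `SelfHealingPool` class, its instance IDs and the target of 100 requests per instance are all invented assumptions.

```python
import itertools
import math

class SelfHealingPool:
    """Toy model of a self-healing server group: unhealthy members are
    replaced automatically and the pool resizes to match demand."""

    _ids = itertools.count(1)

    def __init__(self, size):
        # Map of instance id -> healthy flag; every instance is identical.
        self.instances = {self._new_id(): True for _ in range(size)}

    def _new_id(self):
        return f"i-{next(self._ids):04d}"

    def health_check(self):
        # Cattle, not pets: discard sick instances and launch fresh ones.
        for instance_id in [i for i, ok in self.instances.items() if not ok]:
            del self.instances[instance_id]
            self.instances[self._new_id()] = True

    def scale_to_demand(self, current_requests, requests_per_instance):
        # Grow or shrink the pool so each instance carries a target load.
        desired = max(1, math.ceil(current_requests / requests_per_instance))
        while len(self.instances) < desired:
            self.instances[self._new_id()] = True
        while len(self.instances) > desired:
            self.instances.pop(next(iter(self.instances)))

pool = SelfHealingPool(size=4)
pool.scale_to_demand(current_requests=850, requests_per_instance=100)
print(len(pool.instances))  # 9: enough capacity for 850 requests at 100 each
```

The key design point is that no instance is special: any member can be discarded and replaced without ceremony, which is what makes both the healing and the scale-down cheap.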
Lift-and-Shift Can Expose Problems
As great as it sounds to scale applications on demand, many traditional applications simply aren’t built to take advantage of it. Migrating those applications as they are can significantly reduce availability. Before beginning a migration, it is crucial to identify key scale-out areas, reviewing core databases, load-balanced pools and even distributed clusters to model how they will behave in the cloud. In some cases, it may be better to leave critical system components in the data center as part of a hybrid architecture; in others, it may make sense to re-architect those components as part of the migration effort.
The Well-Architected Framework Can Guide You
AWS has organized its best practices and community experience into the Well-Architected Framework, which serves as a guidebook for building more resilient applications. The Well-Architected Framework identifies core activities around availability that can guide a cloud migration, calling attention to areas that may be ill-suited for migration in their current state. The framework identifies five key principles for reliability:
- Test Recovery Procedures. Don’t assume that because your servers have run without interruption for years, they will continue to do so in the cloud. Instead, expect failure and plan accordingly. As part of your migration process, engage in a little Chaos Engineering — deliberately break key components to see how your application responds. Ensure you know how to rebuild each part of your system, not only in terms of rebuilding resources, but in terms of managing any impact to data integrity and workflows.
- Automatically Recover From Failure. Once you understand how to recover from failure, start automating the process. Lean on autoscaling groups, queues, availability zones, health checks and other resources to identify problems. Use scale-out plans, CloudWatch events and Lambdas to dynamically take action based on those problems.
- Scale Horizontally. Vertical scaling — loading up servers with more and more resources — is expensive in the cloud. As a rule of thumb, vertical scaling improves performance but impedes resilience. When your application is properly designed, horizontal scaling — adding more, smaller resources to a pool — improves performance just as well while also improving resilience. Re-tool your components to scale horizontally whenever possible. For components that must scale vertically, monitor each instance exhaustively. For horizontally scaled components, monitor instances in aggregate, letting health checks take care of killing off any resource that is not behaving optimally.
- Stop Guessing Capacity. Sizing decisions are often made before the application is ever placed under load. While some level of upfront sizing is unavoidable, it should be minimized. Instead, focus on making potential scale points capable of flexing within a wide range so your application runs as lean as possible at all times. Sizing for peak load is always expensive, but particularly so in the cloud, where, dollar for dollar, compute time is at a slight premium to the data center.
- Manage Change in Automation. Defining infrastructure in code is essential in the cloud, as it is the key that allows you to hit all the other points of the Well-Architected Framework efficiently and effectively. Utilize tools such as Terraform, CloudFormation, Ansible and Chef to automate your entire cloud operations, from provisioning to management and sustainment. Logging into servers or consoles, augmenting automated processes with manual configuration steps (e.g., for backups) and even directly using CLIs are all anti-patterns that point toward shortcomings in your code. Though it can seem like an impediment to progress, a strong commitment to infrastructure entirely described through code will pay for itself faster than any other investment of time you can make.
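The first two principles above can be exercised together in a miniature drill: break something on purpose, then confirm the automated recovery path restores the desired state. The sketch below is a simplified simulation, not AWS API code; `run_chaos_experiment` and `recover` are hypothetical names invented for this example.

```python
import itertools
import random

_replacement_ids = itertools.count(1)

def recover(pool, desired_size):
    # Automated recovery: replace missing capacity with fresh,
    # identically configured instances until the pool is whole again.
    while len(pool) < desired_size:
        pool.add(f"i-new-{next(_replacement_ids)}")

def run_chaos_experiment(pool, desired_size, seed=0):
    # Chaos step: deliberately fail a random instance, then verify the
    # recovery routine heals the pool back to its desired size.
    victim = random.Random(seed).choice(sorted(pool))
    pool.discard(victim)
    recover(pool, desired_size)
    assert len(pool) == desired_size, "pool failed to heal"
    return victim

pool = {"i-a", "i-b", "i-c"}
killed = run_chaos_experiment(pool, desired_size=3)
print(f"killed {killed}, pool healed to {len(pool)} instances")
```

The same drill, run against real infrastructure rather than a simulation, is what keeps failover paths from rotting: if the experiment fails, you have found a recovery gap on your schedule instead of during an outage.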
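In practice, “stop guessing capacity” usually means target tracking: instead of pre-sizing for peak, continuously recompute the fleet size from an observed per-instance metric. A minimal sketch of that calculation follows; the function name and the size bounds are illustrative assumptions, not an AWS API.

```python
import math

def desired_capacity(current_instances, observed_metric, target_metric,
                     min_size=1, max_size=20):
    """Target-tracking style sizing: scale the fleet so the per-instance
    metric lands near its target, clamped to the group's bounds."""
    desired = math.ceil(current_instances * observed_metric / target_metric)
    return max(min_size, min(max_size, desired))

# A fleet of 4 averaging 80% CPU against a 50% target grows to 7 instances.
print(desired_capacity(4, observed_metric=80, target_metric=50))  # 7

# During an idle period the same calculation shrinks the fleet, saving money.
print(desired_capacity(7, observed_metric=10, target_metric=50))  # 2
```

Because the formula is proportional, the fleet converges toward the target metric from either direction, which is what lets the application run lean at all times rather than being sized for a guessed peak.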