AWS Reliability – A Core Pillar of Your Architecture


In this post I’m going to give an overview of the AWS Well-Architected Framework, then I’ll do a deep dive on the Reliability pillar—one of the five core pillars that should underpin your AWS architecture.


An Overview of the AWS Well-Architected Framework

The AWS Well-Architected Framework comprises five pillars. Designed by AWS, this set of best practice principles helps customers compare their AWS environments against established best practices and identify areas for improvement. The framework is based on the extensive experience gained by AWS solutions architects over the years, across tens of thousands of AWS deployments.

The framework guides AWS customers through a series of questions that enable them to better understand how well their architecture aligns with best practice. The ultimate goal is to help AWS customers build environments that are secure, high-performing, resilient and efficient.

AWS offers a free-to-use Well-Architected Tool, which guides customers through the questions in relation to their specific AWS workloads and then provides a plan for how best to architect the cloud environment using established best practices.

What Are the AWS Framework Pillars?

Let’s look at the five pillars of the AWS Well-Architected Framework to understand at a high level what each pillar is about.


  1. Operational Excellence: the operational excellence pillar focuses on the day-to-day operations of a customer’s AWS infrastructure, including change management, deployment automation, monitoring, responding to events and defining standardized operating models.
  2. Security: the security pillar focuses on protecting systems and data. This includes identity and access management, security information and event management, data confidentiality and integrity and systems access.
  3. Reliability: the reliability pillar focuses on architecting for failure. Rapid recovery from failure is essential for modern businesses.
  4. Performance Efficiency: the performance efficiency pillar focuses on the efficient use of IT and computing resources.
  5. Cost Optimization: the cost optimization pillar focuses on avoiding unnecessary expenditure on AWS resources.

AWS Well-Architected Framework – Reliability Pillar

The Reliability pillar of the Well-Architected Framework looks at how well a system can recover from infrastructure or service failures. It also considers how a system can automatically scale to meet demand and how disruptions, such as misconfigurations or intermittent network issues, can be mitigated.

The Reliability pillar is based on five foundational architectural principles:

Principle | Description
Test recovery procedures | It is much easier to simulate failure scenarios in the cloud. Automation can be used to simulate failure and to assist with the formulation of recovery plans and procedures.
Automatically recover from failure | By defining business-level KPIs, it is possible to monitor systems and trigger automation when a KPI threshold is breached. This enables automated recovery processes to repair or work around the failure.
Scale horizontally | Single large resources should be replaced by multiple smaller resources to minimize the impact of a failure.
Stop guessing capacity | A common cause of IT system failure is insufficient capacity. In the cloud, resource utilization can be monitored and additional resources can be added and removed automatically.
Automate change management | All infrastructure changes should be made via automation.
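
To make "stop guessing capacity" a little more concrete, here is a minimal sketch of a proportional scaling rule: size the fleet so that the average per-instance metric returns to a target value. This is an illustration only, not the algorithm AWS Auto Scaling actually uses, and the function name, parameters and numbers are my own assumptions.

```python
import math


def desired_capacity(current_instances: int, observed_metric: float,
                     target_metric: float, min_size: int = 1,
                     max_size: int = 10) -> int:
    """Hypothetical proportional scaling rule.

    Scales the fleet so the average per-instance metric (for example CPU %)
    moves back towards the target, clamped to the allowed fleet size.
    """
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("need at least one instance and a positive target")
    raw = math.ceil(current_instances * observed_metric / target_metric)
    return max(min_size, min(max_size, raw))


# Four instances averaging 90% CPU against a 60% target: scale out.
print(desired_capacity(4, 90.0, 60.0))  # 6
# Four instances averaging 20% CPU against a 60% target: scale in.
print(desired_capacity(4, 20.0, 60.0))  # 2
```

The minimum and maximum sizes matter as much as the rule itself: they stop a noisy metric from scaling the fleet to zero or running up an unexpected bill.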

Foundations of AWS Reliability

Limit Management 

In order to build a reliable application infrastructure, it is essential to understand the potential limitations of that infrastructure. The ability to monitor when those limits are being reached is equally important, so corrective action can be taken. Limits could be CPU or RAM capacity in an instance, network throughput on a particular connection, number of connections available to a database and so on.

EC2 is a good example: the number of instances you can run per region is capped by default (historically 20 On-Demand instances, and now expressed as a vCPU-based quota), and many other services have limits of their own. A good place to find out which service limits currently apply is AWS Trusted Advisor. You can also use Amazon CloudWatch to set alarms for when limits are being approached, as well as for metrics such as EBS volume capacity, provisioned IOPS and network I/O.
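
As a rough illustration of the idea, the sketch below compares current usage against known limits and reports anything at or above a warning threshold. The function name, the 80% threshold and all of the figures are assumptions for the example; in practice the numbers would come from your monitoring data rather than hard-coded dictionaries.

```python
def limits_near_exhaustion(usage, limits, threshold=0.8):
    """Return (name, utilisation) pairs at or above the threshold, highest first."""
    alerts = []
    for name, limit in limits.items():
        if limit <= 0:
            continue  # skip limits that are unset or meaningless
        utilisation = usage.get(name, 0) / limit
        if utilisation >= threshold:
            alerts.append((name, round(utilisation, 2)))
    return sorted(alerts, key=lambda item: item[1], reverse=True)


# Illustrative numbers only, not real account data.
limits = {"running_instances": 20, "db_connections": 100, "ebs_volume_gib": 500}
usage = {"running_instances": 17, "db_connections": 40, "ebs_volume_gib": 490}

print(limits_near_exhaustion(usage, limits))
# [('ebs_volume_gib', 0.98), ('running_instances', 0.85)]
```

Alerting well before 100% gives you time to request a limit increase or add capacity before users notice.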

Further reading: How to Change or Upgrade an EC2 Instance Type

Networking

It is important to consider future growth requirements when architecting IP address-based networks. Amazon VPC (Virtual Private Cloud) enables customers to build out complex network architectures. It is recommended to utilize private address ranges (as defined by RFC1918) for VPC CIDR blocks. Be sure to select ranges that will not conflict with ranges in use elsewhere in your network topology.

When allocating CIDR blocks, it is important to: 

  • Allow IP address space for multiple VPCs per region
  • Consider connections between AWS accounts—other parts of the business may operate AWS resources in separate AWS accounts, but need to interconnect with shared services
  • Allow subnet space for multiple subnets spread across multiple Availability Zones within a VPC (each subnet lives in a single Availability Zone)
  • Leave unused CIDR block space within your VPC
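
The points above can be sanity-checked before anything is deployed. Here is a minimal sketch using Python's standard ipaddress module: it carves one subnet per Availability Zone out of a VPC block, counts how much space is left unallocated, and checks for clashes with ranges used elsewhere. The CIDR blocks, zone names and function names are all hypothetical.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix):
    """Assign one subnet per zone from the start of the VPC block.

    Returns the plan and the number of same-sized subnets left unallocated.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    candidates = list(vpc.subnets(new_prefix=new_prefix))
    if len(zones) > len(candidates):
        raise ValueError("not enough address space for one subnet per zone")
    plan = dict(zip(zones, candidates))
    return plan, len(candidates) - len(zones)


def conflicting_ranges(vpc_cidr, other_cidrs):
    """List the ranges in use elsewhere that overlap the proposed VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [c for c in other_cidrs if vpc.overlaps(ipaddress.ip_network(c))]


vpc_cidr = "10.20.0.0/16"                           # hypothetical VPC block
zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]  # example zone names
plan, spare = plan_subnets(vpc_cidr, zones, new_prefix=20)

for zone, subnet in plan.items():
    print(zone, subnet)
print("spare /20 blocks:", spare)
print("private range:", ipaddress.ip_network(vpc_cidr).is_private)
print("conflicts:", conflicting_ranges(vpc_cidr, ["10.0.0.0/16", "192.168.0.0/24"]))
```

Running a check like this against every VPC and on-premises range you know about is a cheap way to catch overlaps before they block a peering or VPN connection later.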

It is also important to consider how resilient your network topology will be to failure, misconfiguration, traffic spikes and DDoS attacks.

You need to consider how you will connect the rest of your network with your AWS resources. Will you use VPNs? If so, how will these terminate in your VPC and how will you ensure that they are resilient and have sufficient throughput? You may wish to use AWS Direct Connect, in which case the same question applies: how will you ensure the resilience of this connection? Perhaps you will require multiple connections back to separate locations outside of the AWS cloud.

Key AWS services for network topology include: 

  • Amazon Virtual Private Cloud: for creation of subnets and IP address allocation
  • Amazon EC2: compute service where any required VPN appliances will run
  • Amazon Route 53: Amazon’s DNS service
  • AWS Global Accelerator: a network acceleration service that directs traffic to optimal AWS network endpoints
  • Elastic Load Balancing: layer 4 and layer 7 load balancing that distributes traffic across healthy targets and works with Auto Scaling to cope with increases and decreases in demand
  • AWS Shield: distributed denial of service (DDoS) mitigation, provided free of charge at the standard level, with an optional paid subscription for an enhanced level of protection

Infrastructure High Availability

First of all, you need to decide exactly what ‘high availability’ means for your application. How much downtime can you tolerate for scheduled and unscheduled maintenance? And what budget do you have available to achieve the level of availability you desire?

There is a big difference in how you will approach the architecture of, say, an internal application that requires 99% availability, versus a mission-critical customer-facing application that requires ‘five nines’ (99.999%) availability or higher.

If you are looking to achieve five 9s availability, then every single component of your architecture will need to be able to achieve that level of availability to avoid single points of failure. This will mean adding in a lot of redundancy to the solution, which will of course add to the cost.
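
To see why every component matters and why redundancy helps, here is a minimal sketch of the standard availability arithmetic. It assumes component failures are independent, which real systems only approximate, and the function names and figures are illustrative.

```python
import math


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return math.prod(components)


def redundant_availability(component, copies):
    """Availability of a group that is up if at least one copy is up."""
    return 1 - (1 - component) ** copies


# Three tiers in series, each available 99.9% of the time.
print(f"{serial_availability([0.999, 0.999, 0.999]):.4%}")  # 99.7003%

# Two redundant copies of a component that is 99% available on its own.
print(f"{redundant_availability(0.99, 2):.4%}")  # 99.9900%
```

A chain of dependencies is always less available than its weakest link, while redundant copies push availability up, which is why single points of failure are so costly at this level.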

Five 9s availability allows for only around five minutes of downtime per year (roughly 5.26 minutes). This is virtually impossible to achieve without a high degree of automated deployment and automated recovery from failure, because human intervention simply won’t be able to keep up. Any changes to the environment need to be thoroughly tested in a full-scale non-production environment, which in itself will significantly add to the overall infrastructure cost.
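
The downtime figure is simple arithmetic, and it is worth working through for each availability target you are considering. A minimal sketch follows, assuming an average year of 365.25 days; the function name is my own.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def downtime_minutes_per_year(availability_percent):
    """Maximum downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours)")
```

At 99% the budget is about 3.65 days a year; at 99.999% it shrinks to about 5.26 minutes, which leaves no room for manual recovery.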

The table below lists common sources of service interruption, which need to be considered in any high availability design:

Category | Description
Hardware | Failure of any hardware component e.g. storage, server, network
Deployment | Failure of any automated or manual deployments to application code, hardware, network or configuration
Load | Saturated load on any component of the application or of the overall infrastructure itself
Data | Corrupt data accepted into the system that cannot be processed
Expired credentials | Expiration of a certificate or credentials e.g. SSL certificate expiry
Dependency | Failure of a dependent service
Infrastructure | Power supply or HVAC failure impacting hardware availability
Identifier exhaustion | Exceeding available capacity, hitting throttling limits, etc.

Application High Availability

There’s no point designing a five 9s availability infrastructure if the application itself cannot achieve that level of availability. Here are four things to consider when designing a highly available application:

  1. Fault Isolation Zones: in AWS terms this can mean architecting your application to leverage multiple Regions and Availability Zones. Regions are geographic locations around the globe that contain two or more Availability Zones (AZ). Availability Zones are physically separate datacentres within a region with isolated power, network and cooling. So, in theory, no two Availability Zones should fail at the same time.
  2. Redundant Components: component redundancy starts right down at the hardware level, with redundant power supplies, hard drives and network interfaces. It then extends up the stack to the server level e.g. multiple web servers, multi-AZ or multi-region databases and so on.
  3. Microservices: service-oriented architecture, where software applications are broken down into smaller, independent service units. Read more about this in our blog post: AWS Microservices.
  4. Recovery Oriented Computing: Recovery Oriented Computing (ROC) focuses on having the right monitoring in place to detect all types of failure, and then automating the procedures that recover from them.
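
The detect-then-recover loop behind Recovery Oriented Computing can be sketched in a few lines. The example below is a simplified illustration, not a production pattern: the health check and recovery action are passed in as plain callables, and FlakyService is a hypothetical stand-in for a real component.

```python
import time
from typing import Callable


def recover_until_healthy(is_healthy: Callable[[], bool],
                          recover: Callable[[], None],
                          max_attempts: int = 3,
                          wait_seconds: float = 0.0) -> bool:
    """Run the recovery action until the health check passes.

    Returns True once healthy, or False if the failure needs escalating.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        recover()
        time.sleep(wait_seconds)  # give the component time to come back
    return is_healthy()


class FlakyService:
    """Hypothetical component that becomes healthy after two restarts."""

    def __init__(self):
        self.restarts = 0

    def is_healthy(self):
        return self.restarts >= 2

    def restart(self):
        self.restarts += 1


service = FlakyService()
print(recover_until_healthy(service.is_healthy, service.restart))  # True
print(service.restarts)  # 2
```

Capping the number of attempts is deliberate: if automated recovery keeps failing, the right response is to escalate to a person rather than restart forever.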

Operational Considerations for Reliability

  1. Deployment: where possible, deployments should be automated using a deployment methodology (e.g. Blue-Green, Canary, Feature Toggles or Fault Isolation Zone deployments) to decrease the risk of failure.
  2. Testing: testing should be carried out to match availability goals—one of the most effective testing methods is canary testing, which runs constantly and simulates customer behaviour.
  3. Monitoring and Alerting: deep monitoring of both your infrastructure and your application is essential to meet availability goals. You need to know the status and availability of each component of the infrastructure and application, as well as the overall user experience being delivered.
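
As a rough sketch of canary testing, the example below runs a probe repeatedly and compares the observed success rate with an availability goal. The probe here is a hypothetical stand-in that replays canned results; a real canary would exercise the application the way a customer does, for example by requesting a page and checking the response.

```python
from typing import Callable


def canary_success_rate(probe: Callable[[], bool], runs: int) -> float:
    """Run the probe a number of times and return the fraction that succeeded."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    successes = 0
    for _ in range(runs):
        try:
            if probe():
                successes += 1
        except Exception:
            pass  # an exception from the probe counts as a failed run
    return successes / runs


def meets_goal(success_rate: float, goal: float) -> bool:
    """True if the observed success rate is at or above the availability goal."""
    return success_rate >= goal


# Canned results standing in for real probe outcomes: 98 passes, 2 failures.
results = iter([True] * 98 + [False] * 2)
rate = canary_success_rate(lambda: next(results), runs=100)

print(rate)                    # 0.98
print(meets_goal(rate, 0.99))  # False
```

Because the canary measures what a user would experience, it catches failures that component-level monitoring can miss.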

So, we’ve touched on a number of the different elements of the AWS Reliability pillar to get you thinking about the architecture of your AWS infrastructure and applications. AWS has a great white paper, which goes into a lot more detail and lists out some hypothetical examples to illustrate some of the concepts. 

If you need any help, either in reviewing your current AWS infrastructure against the AWS Well-Architected Framework or in designing highly available AWS systems, then Logicata is more than happy to help. Our AWS Managed Services ensure continuous improvement of your application infrastructure against the Well-Architected Framework. Please reach out to us for more information.
