AWS Reliability – A Core Pillar of Your Architecture

In this post I’m going to give an overview of the AWS Well Architected Framework, then take a deep dive into the Reliability pillar, which is one of the 5 core pillars that should underpin your AWS architecture.

AWS Well Architected Framework

The AWS Well Architected Framework is a set of best practice principles, designed by AWS, that helps customers compare their AWS environments against best practice and identify areas for improvement.  The Framework is based on the extensive experience gleaned by AWS Solutions Architects over the years, in tens of thousands of AWS deployments.  It guides AWS customers through a series of questions that show how well their architecture aligns with best practice.  The ultimate goal is to help AWS customers build environments that are secure, high performance, resilient and efficient.

AWS offers a free-to-use Well Architected Tool which guides customers through the questions in relation to their specific AWS workloads, and then provides a plan on how best to architect the cloud environment using established best practices.

AWS Well Architected Framework Pillars

So let’s look at the 5 pillars of the AWS Well Architected Framework, and understand at a high level what each pillar is about.

  1. Operational Excellence – The operational excellence pillar focuses on day to day operations of a customer’s AWS infrastructure, including change management, deployment automation, monitoring, responding to events and defining standardized operating models.
  2. Security – The security pillar focuses on protecting systems & data.  This includes Identity & access management, security information & event management, data confidentiality and integrity, and systems access.
  3. Reliability – The reliability pillar focuses on architecting for failure.  Rapid recovery from failure is essential for modern businesses.
  4. Performance Efficiency – The performance efficiency pillar focuses on the efficient use of IT and computing resources.
  5. Cost Optimization – The Cost Optimization pillar focuses on avoiding unnecessary expenditure on AWS resources.

AWS Well Architected - Reliability Pillar

The Reliability Pillar of the Well Architected Framework looks at how well a system can recover from infrastructure or service failures.  It also considers how a system can automatically scale to meet demand, and how disruptions such as misconfigurations or intermittent network issues can be mitigated.

The Reliability Pillar is based on 5 foundational architectural principles:

Principle | Description
Test recovery procedures | It is much easier to simulate failure scenarios in the cloud – automation can be used to simulate failure, to assist with the formulation of recovery plans and procedures.
Automatically recover from failure | By defining business level KPIs, it is possible to monitor systems and trigger automation when a KPI threshold is breached. This enables automated recovery processes to repair or work around the failure.
Scale horizontally | Single large resources should be replaced by multiple smaller resources to minimize the impact of a failure.
Stop guessing capacity | A common cause of IT system failure is insufficient capacity. In the cloud, resource utilization can be monitored and additional resources can be added and removed automatically.
Automate change management | All infrastructure changes should be made via automation.
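
To make the "stop guessing capacity" principle a little more concrete, here is a minimal Python sketch of a proportional scaling rule: given the current instance count and the average utilization, it estimates how many instances would bring utilization back towards a target. The function name, parameters and the min/max bounds are my own illustrative assumptions. This is similar in spirit to target-based scaling, but it is not AWS's actual algorithm.

```python
import math


def desired_capacity(current_instances, current_utilization,
                     target_utilization, min_size=1, max_size=10):
    """Estimate the instance count that brings average utilization near the target.

    Hypothetical helper: assumes load spreads evenly across instances, so
    utilization is inversely proportional to the number of instances.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")

    # Total load stays the same, so scale the count by the utilization ratio.
    needed = math.ceil(current_instances * current_utilization / target_utilization)

    # Never go below the floor or above the ceiling chosen for the group.
    return max(min_size, min(max_size, needed))


scale_out = desired_capacity(4, current_utilization=90, target_utilization=60)
scale_in = desired_capacity(4, current_utilization=30, target_utilization=60)
```

With these example numbers, 4 instances running at 90% would grow to 6, and 4 instances running at 30% would shrink to 2. The bounds stop a bad metric from scaling the group to zero or to an unaffordable size.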

Foundations of AWS Reliability

Limit Management 

In order to build a reliable application infrastructure, it is essential to understand the potential limitations of that infrastructure, and to be able to monitor when those limits are being approached, so that corrective action can be taken.  Limits could be CPU or RAM capacity in an instance, network throughput on a particular connection, the number of connections available to a database and so on.  AWS also applies its own service limits: for example, EC2 instances are limited to 20 per region by default, although the exact figure varies by account and instance type.  There are many other service limits, and the best place to find out the service limits that currently apply is AWS Trusted Advisor.  You can also use Amazon CloudWatch to set alerts for when limits are being approached, for metrics such as EBS volume capacity, provisioned IOPS and network I/O.
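
The idea of alerting before a limit is reached can be sketched in a few lines of Python. This is only an illustration of the logic: the limit names and numbers are made up, and in practice the usage figures would come from your monitoring system rather than a hard-coded dictionary.

```python
def limits_at_risk(usage, limits, threshold=0.8):
    """Return {name: fraction_used} for every limit at or above the threshold.

    `usage` and `limits` are plain dictionaries keyed by a limit name.
    Both the names and the 80% default threshold are illustrative assumptions.
    """
    at_risk = {}
    for name, limit in limits.items():
        if limit <= 0:
            raise ValueError(f"limit for {name!r} must be positive")
        fraction = usage.get(name, 0) / limit
        if fraction >= threshold:
            at_risk[name] = round(fraction, 2)
    return at_risk


# Hypothetical figures, for illustration only.
usage = {"ec2_instances": 17, "db_connections": 120, "ebs_volume_gib": 450}
limits = {"ec2_instances": 20, "db_connections": 500, "ebs_volume_gib": 500}

warnings = limits_at_risk(usage, limits)
```

Here `warnings` flags the EC2 instance count (85% used) and the EBS volume (90% used), while the database connections are left alone at 24%.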

Networking

It is important to consider future growth requirements when architecting IP address based networks.  Amazon VPC (Virtual Private Cloud) enables customers to build out complex network architectures.  It is recommended to utilise private address ranges (as defined by RFC 1918) for VPC CIDR blocks.  Be sure to select ranges that will not conflict with ranges in use elsewhere in your network topology.  When allocating CIDR blocks, it is important to:

  • Allow IP address space for multiple VPCs per region.
  • Consider connections between AWS accounts – other parts of the business may operate AWS resources in separate AWS accounts but need to interconnect with shared services.
  • Allow for subnets that span multiple availability zones within a VPC.
  • Leave unused CIDR block space within your VPC.
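
The CIDR planning points above can be tried out with Python's standard `ipaddress` module before anything is created in AWS. The sketch below carves one subnet per availability zone out of a VPC range, keeps the remainder as spare space, and checks a candidate range against ranges already in use. The function names, the AZ labels and all of the address ranges are illustrative assumptions.

```python
import ipaddress


def plan_vpc(vpc_cidr, az_names, subnet_prefix):
    """Allocate one subnet per AZ and return (allocated, spare) subnets."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = list(vpc.subnets(new_prefix=subnet_prefix))
    if len(az_names) > len(subnets):
        raise ValueError("VPC range is too small for the requested subnets")
    allocated = dict(zip(az_names, subnets))
    spare = subnets[len(az_names):]  # unused space left for future growth
    return allocated, spare


def conflicts(candidate_cidr, existing_cidrs):
    """Return the existing ranges that overlap the candidate range."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return [cidr for cidr in existing_cidrs
            if candidate.overlaps(ipaddress.ip_network(cidr))]


# Hypothetical VPC range and AZ labels.
allocated, spare = plan_vpc("10.20.0.0/16", ["az-a", "az-b", "az-c"], 20)

# Hypothetical ranges already used elsewhere in the network.
in_use = ["10.0.0.0/16", "10.20.128.0/17", "192.168.0.0/24"]
clashes = conflicts("10.20.0.0/16", in_use)
```

In this example a /16 split into /20 subnets gives 16 blocks, so three are allocated and 13 stay free, and the overlap check reports `10.20.128.0/17` as a clash that would need resolving before the VPC could be connected to the rest of the network.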
 
It is also important to consider how your network topology will be resilient to failure, misconfiguration, traffic spikes and DDoS attacks.
 

You need to consider how you will connect the rest of your network with your AWS resources.  Will you use VPNs?  If so, how will these terminate in your VPC, and how will you ensure that they are resilient and have sufficient throughput?  You may wish to use AWS Direct Connect – again, how will you ensure the resilience of this connection?  Perhaps you will require multiple connections back to separate locations outside of the AWS cloud.

Key AWS services for network topology include:

  • Amazon Virtual Private Cloud – for creation of subnets and IP address allocation.
  • Amazon EC2 – compute service where any required VPN appliances will run.
  • Amazon Route 53 – Amazon’s DNS service.
  • AWS Global Accelerator – a network acceleration service that directs traffic to optimal AWS network endpoints.
  • Elastic Load Balancing – Layer 4 and Layer 7 load balancing that distributes traffic across healthy targets and works with Auto Scaling to cope with increases and decreases in demand.
  • AWS Shield – Distributed Denial of Service mitigation provided both free of charge and with an optional additional subscription for an enhanced level of protection.

Infrastructure High Availability

First of all, you need to decide exactly what high availability means for your application.  How much downtime can you tolerate for scheduled and unscheduled maintenance?  And what budget do you have available to achieve the level of availability you desire?

There is a big difference in how you will approach the architecture of, say, an internal application that requires 99% availability, versus a mission critical customer facing application that requires 5 nines (99.999%) availability or higher.

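If you are looking to achieve 5 nines availability, then every single component that your application depends on will need to achieve at least 5 nines availability, and single points of failure must be avoided.  The availability of components that depend on one another multiplies together, so the overall figure can never be better than that of the weakest component.  This will mean adding in a lot of redundancy to the solution, which will of course add to the cost.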

5 nines availability only allows for roughly 5 minutes of downtime per year (0.001% of the 525,600 minutes in a year is about 5.26 minutes).  This is virtually impossible to achieve without a high degree of automated deployment and automated recovery from failure – human intervention simply won’t be able to keep up.  Any changes to the environment need to be thoroughly tested in a full scale non-production environment, which in itself will significantly add to the overall infrastructure cost.
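
The arithmetic behind these availability figures is easy to check. The Python sketch below converts an availability percentage into minutes of downtime per year, and shows how availability combines when components sit in a chain (all must be up) or are duplicated (at least one must be up). The function names are my own, and the redundancy formula assumes that failures are independent, which is an idealization.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_pct):
    """Minutes of downtime per year allowed by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def serial_availability(*components_pct):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for pct in components_pct:
        result *= pct / 100
    return result * 100


def redundant_availability(component_pct, copies):
    """Availability when at least one of N independent copies must be up."""
    return (1 - (1 - component_pct / 100) ** copies) * 100


five_nines = downtime_minutes_per_year(99.999)
two_nines = downtime_minutes_per_year(99)
chain = serial_availability(99.999, 99.999, 99.999)
pair = redundant_availability(99, copies=2)
```

Five nines works out at about 5.26 minutes per year, while 99% allows 5,256 minutes, which is more than three and a half days. Three five-nines components in a chain only deliver about 99.997% between them, which is why each dependency has to exceed the overall target. Going the other way, two independent copies of a 99% component give 99.99%, which is the case for redundancy.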

The table below lists common sources of service interruption which need to be considered in any high availability design:

Category | Description
Hardware | Failure of any hardware component, e.g. storage, server, network.
Deployment | Failure of any automated or manual deployments to application code, hardware, network or configuration.
Load | Saturated load on any component of the application, or of the overall infrastructure itself.
Data | Corrupt data accepted into the system that cannot be processed.
Expired credentials | Expiration of a certificate or credentials, e.g. SSL certificate expiry.
Dependency | Failure of a dependent service.
Infrastructure | Power supply or HVAC failure impacting hardware availability.
Identifier exhaustion | Exceeding available capacity, hitting throttling limits, etc.

 

Application High Availability

There’s no point designing a 5 nines availability infrastructure if the application itself cannot achieve 5 nines availability.  Here are 4 things to consider when designing a highly available application:

  1. Fault Isolation Zones – in AWS terms this can mean architecting your application to leverage multiple Regions and Availability Zones.  Regions are geographic locations around the globe that contain 2 or more Availability Zones (AZ).  Availability Zones are physically separate datacentres within a region with isolated power, network and cooling.  So in theory no 2 availability zones should fail at the same time.
  2. Redundant Components – component redundancy starts right down at the hardware level with redundant power supplies, hard drives and network interfaces. But then it extends up the stack to the server level, eg multiple web servers, multi AZ or multi region databases and so on.
  3. Microservices – read more on our blog post about AWS Microservices.
  4. Recovery Oriented Computing – Recovery Oriented Computing, or ROC, focuses on having the right monitoring in place to detect all types of failure, and on automating the procedures needed to recover from them.
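
The detect-then-recover loop at the heart of Recovery Oriented Computing can be sketched as follows. This is a toy, in-memory simulation: the component names, the `is_healthy` and `restart` callbacks and the retry limit are all hypothetical stand-ins for real health checks and recovery automation.

```python
def recover_unhealthy(components, is_healthy, restart, max_attempts=3):
    """Restart failed components and return those that could not be recovered."""
    unrecovered = []
    for name in components:
        attempts = 0
        # Keep trying the automated fix until it works or we run out of attempts.
        while not is_healthy(name) and attempts < max_attempts:
            restart(name)
            attempts += 1
        if not is_healthy(name):
            unrecovered.append(name)  # needs escalation to a human
    return unrecovered


# Simulated environment: True means healthy.
state = {"web-1": True, "web-2": False, "db-1": False}
repairable = {"web-2"}  # pretend a restart only fixes this component


def is_healthy(name):
    return state[name]


def restart(name):
    if name in repairable:
        state[name] = True


escalate = recover_unhealthy(list(state), is_healthy, restart)
```

In this simulation `web-2` is repaired automatically, while `db-1` is still unhealthy after three attempts and ends up in `escalate`. Capping the number of attempts matters: automation that retries forever can hide a failure rather than recover from it.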

Operational Considerations for Reliability

  1. Deployment – Deployments should be automated where possible, using a methodology that decreases the risk of failure, such as Blue-Green, Canary, Feature Toggle or Fault Isolation Zone deployments.
  2. Testing – Testing should be carried out to match availability goals – one of the most effective testing methods is canary testing which runs constantly and simulates customer behaviour.
  3. Monitoring and Alerting – Deep monitoring of both your infrastructure and your application is essential to meet availability goals.  You need to know the status and availability of each component of the infrastructure and application, and the overall user experience being delivered.
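
As an illustration of how deployment, testing and monitoring come together, here is a minimal Python sketch of a canary deployment gate: it compares the error rate of a small canary release against the existing baseline and decides whether to promote, roll back or keep waiting. The function name, the thresholds and the three verdict strings are assumptions of mine, not part of any AWS service.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=2.0, min_requests=100, error_floor=0.001):
    """Return 'wait', 'promote' or 'rollback' for a canary release."""
    # Too little canary traffic to judge either way.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Allow the canary some headroom over the baseline; the floor stops a
    # perfect (zero error) baseline from blocking every release.
    allowed_rate = max(baseline_rate * max_ratio, error_floor)

    return "promote" if canary_rate <= allowed_rate else "rollback"


too_early = canary_verdict(10, 10_000, 1, 50)
healthy = canary_verdict(10, 10_000, 1, 1_000)
broken = canary_verdict(10, 10_000, 30, 1_000)
```

With a baseline error rate of 0.1%, the first call returns "wait" because only 50 canary requests have been seen, the second returns "promote" because the canary matches the baseline, and the third returns "rollback" because a 3% error rate is far beyond the allowed headroom.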
 
So we’ve touched on a number of the different elements of the AWS Reliability pillar to get you thinking about the architecture of your AWS infrastructure and applications.  AWS have a great whitepaper which goes into a lot more detail and lists some hypothetical examples to illustrate the concepts – you can read the whitepaper here.

If you need any help, either in reviewing current AWS infrastructure against the AWS Well Architected Framework, or in designing highly available AWS systems, then Logicata will be pleased to help.  Our AWS Managed Services ensure continuous improvement of your application infrastructure against the Well Architected Framework.  Please reach out to us for more information.
Karl Robinson

Director and Co-Founder of Logicata, an AWS Managed Services Provider. Over 20 years’ experience in the internet & cloud industry. Closet geek, AWS & Azure certified.