AWS Reliability – A Core Pillar of Your Architecture

In this post I’m going to give an overview of the AWS Well Architected Framework, then give a deep dive on the Reliability pillar, which is one of the 5 core pillars that should underpin your AWS architecture.

AWS Well Architected Framework

The AWS Well Architected Framework is a set of best practice principles, designed by AWS, to help customers compare their AWS environments against those best practices and identify areas for improvement.  The Framework is based on the experience gleaned by AWS Solutions Architects over the years, in tens of thousands of AWS deployments.  It guides AWS customers through a series of questions that help them understand how well their architecture aligns with best practice.  The ultimate goal is to help AWS customers build environments that are secure, high performance, resilient and efficient.

AWS offers a free to use Well Architected Tool which guides customers through the questions in relation to their specific AWS workloads, and then provides a plan on how best to architect the cloud environment using established best practices.

AWS Well Architected Framework Pillars

So let’s look at the 5 pillars of the AWS Well Architected Framework, and understand at a high level what each pillar is about.

  1. Operational Excellence – The operational excellence pillar focuses on day to day operations of a customer’s AWS infrastructure, including change management, deployment automation, monitoring, responding to events and defining standardized operating models.
  2. Security – The security pillar focuses on protecting systems and data.  This includes identity and access management, security information and event management, data confidentiality and integrity, and systems access.
  3. Reliability – The reliability pillar focuses on architecting for failure.  Rapid recovery from failure is essential for modern businesses.
  4. Performance Efficiency – The performance efficiency pillar focuses on the efficient use of IT and computing resources.
  5. Cost Optimization – The cost optimization pillar focuses on avoiding unnecessary expenditure on AWS resources.

AWS Well Architected – Reliability Pillar

The Reliability Pillar of the Well Architected Framework looks at how well a system can recover from infrastructure or service failures.  It also considers how a system can automatically scale to meet demand, and how disruptions such as misconfigurations or intermittent network issues can be mitigated.

The Reliability Pillar is based on 5 foundational architectural principles:

  1. Test recovery procedures – It is much easier to simulate failure scenarios in the cloud.  Automation can be used to simulate failure, to assist with the formulation of recovery plans and procedures.
  2. Automatically recover from failure – By defining business level KPIs, it is possible to monitor systems and trigger automation when a KPI threshold is breached.  This enables automated recovery processes to repair or work around the failure.
  3. Scale horizontally – Single large resources should be replaced by multiple smaller resources to minimize the impact of a failure.
  4. Stop guessing capacity – A common cause of IT system failure is insufficient capacity.  In the cloud, resource utilization can be monitored and additional resources can be added and removed automatically.
  5. Automate change management – All infrastructure changes should be made via automation.

Foundations of AWS Reliability

Limit Management 

To build a reliable application infrastructure, it is essential to understand the limits of that infrastructure and to monitor when those limits are being approached, so that corrective action can be taken.  Limits could be CPU or RAM capacity in an instance, network throughput on a particular connection, the number of connections available to a database and so on.  EC2 instances, for example, are limited to 20 per region by default.  There are many other service limits, and the best place to find out which limits currently apply is AWS Trusted Advisor.  You can also use Amazon CloudWatch to set alarms for when limits are being approached, for metrics such as EBS volume capacity, provisioned IOPS and network I/O.
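
To make the idea of "monitoring when limits are being approached" concrete, here is a minimal sketch in plain Python.  It does not call any AWS API: the `QuotaUsage` type, the `approaching_limits` function, the sample figures and the 80% threshold are all assumptions made up for illustration.  In practice the usage and limit numbers would come from your monitoring tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """One monitored limit (hypothetical structure, not an AWS type)."""
    name: str     # illustrative label, e.g. "ec2-instances"
    used: float   # current consumption
    limit: float  # the limit that currently applies


def approaching_limits(usages, threshold=0.8):
    """Return (name, utilisation) pairs at or above the threshold, worst first."""
    flagged = []
    for usage in usages:
        if usage.limit <= 0:
            raise ValueError(f"limit for {usage.name!r} must be positive")
        utilisation = usage.used / usage.limit
        if utilisation >= threshold:
            flagged.append((usage.name, utilisation))
    # Sort so the limit closest to exhaustion is reported first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Sample numbers are invented for the example.
    sample = [
        QuotaUsage("ec2-instances", used=17, limit=20),
        QuotaUsage("db-connections", used=120, limit=500),
        QuotaUsage("ebs-volume-gib", used=950, limit=1000),
    ]
    for name, utilisation in approaching_limits(sample):
        print(f"{name}: {utilisation:.0%} of limit")
```

The threshold is deliberately set below 100% so that there is still time to request a limit increase or add capacity before the limit is actually hit.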


Network Topology

It is important to consider future growth requirements when architecting IP address based networks.  Amazon Virtual Private Cloud (VPC) enables customers to build out complex network architectures.  It is recommended to utilise private address ranges (as defined by RFC1918) for VPC CIDR blocks.  Be sure to select ranges that will not conflict with ranges in use elsewhere in your network topology.  When allocating CIDR blocks, it is important to:

  • Allow IP address space for multiple VPCs per region.
  • Consider connections between AWS accounts – other parts of the business may operate AWS resources in separate AWS accounts but need to interconnect with shared services.
  • Allow for multiple subnets spread across Availability Zones within a VPC (each subnet lives in a single Availability Zone).
  • Leave unused CIDR block space within your VPC.
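
The CIDR planning points above can be sketched with Python's standard `ipaddress` module.  This is only an illustration under stated assumptions: the `plan_subnets` function, the example ranges and the Availability Zone names are hypothetical, and the sketch allocates one subnet per Availability Zone while leaving the rest of the VPC range unused for future growth.

```python
import ipaddress

# The three private ranges defined by RFC1918.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def plan_subnets(vpc_cidr, az_names, subnet_prefix, reserved=()):
    """Carve one subnet per AZ out of vpc_cidr and return (plan, spare).

    reserved lists ranges already used elsewhere in the network topology
    (other VPCs, on-premises networks) that the VPC must not overlap.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    if not any(vpc.subnet_of(private) for private in RFC1918):
        raise ValueError(f"{vpc} is not inside an RFC1918 private range")
    for other in reserved:
        if vpc.overlaps(ipaddress.ip_network(other)):
            raise ValueError(f"{vpc} overlaps reserved range {other}")

    candidates = list(vpc.subnets(new_prefix=subnet_prefix))
    if len(candidates) < len(az_names):
        raise ValueError("not enough subnets of that size for every AZ")

    plan = dict(zip(az_names, candidates))   # one subnet per AZ
    spare = candidates[len(az_names):]       # unused space kept for growth
    return plan, spare


if __name__ == "__main__":
    # Example values are invented for illustration.
    plan, spare = plan_subnets(
        "10.20.0.0/16",
        ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
        subnet_prefix=20,
        reserved=["10.0.0.0/16", "192.168.0.0/24"],
    )
    for az, subnet in plan.items():
        print(az, subnet)
    print(f"{len(spare)} /20 blocks left unallocated")
```

Running a check like this before creating a VPC is a cheap way to catch overlapping ranges, which are painful to fix once VPCs or accounts need to be interconnected.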

It is also important to consider how your network topology will be resilient to failure, misconfiguration, traffic spikes and DDoS attacks.

You need to consider how you will connect the rest of your network with your AWS resources.  Will you use VPNs?  If so, how will these terminate in your VPC, and how will you ensure that they are resilient and have sufficient throughput?  You may wish to use AWS Direct Connect – again, how will you ensure the resilience of this connection?  Perhaps you will require multiple connections back to separate locations outside of the AWS cloud.

Key AWS services for network topology include:

  • Amazon Virtual Private Cloud – for creation of subnets and IP address allocation.
  • Amazon EC2 – compute service where any required VPN appliances will run.
  • Amazon Route 53 – Amazon’s DNS service.
  • AWS Global Accelerator – a network acceleration service that directs traffic to optimal AWS network endpoints.
  • Elastic Load Balancing – load balancing at Layer 4 or Layer 7 that distributes traffic across targets and works with Auto Scaling to cope with increases and decreases in demand.
  • AWS Shield – Distributed Denial of Service mitigation provided both free of charge and with an optional additional subscription for an enhanced level of protection.

Infrastructure High Availability

First of all, you need to decide exactly what high availability means for your application.  How much downtime can you tolerate for scheduled and unscheduled maintenance?  And what budget do you have available to achieve the level of availability you desire?

There is a big difference in how you will approach the architecture of, say, an internal application that requires 99% availability, versus a mission critical customer facing application that requires 5 nines (99.999%) availability or higher.

If you are looking to achieve 5 nines availability, then every single component that your application depends on will need to achieve at least 5 nines availability, because the availabilities of components that depend on one another multiply together and any weaker component becomes a single point of failure.  This will mean adding a lot of redundancy to the solution, which will of course add to the cost.

5 nines availability allows for only a little over 5 minutes of downtime per year (roughly 5 minutes 15 seconds).  This is virtually impossible to achieve without a high degree of automated deployment and automated recovery from failure – human intervention simply won’t be able to keep up.  Any changes to the environment need to be thoroughly tested in a full scale non production environment, which in itself will significantly add to the overall infrastructure cost.
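
The arithmetic behind these availability figures is easy to check.  The sketch below is illustrative only: the function names are my own, and the `serial` and `redundant` calculations assume that component failures are independent, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Maximum downtime per year allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial(*availabilities):
    """Availability of a chain where every component must be up.

    Assumes independent failures: the individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant(availability, copies):
    """Availability of several redundant copies where one is enough.

    Assumes independent failures: the system is down only if all copies are.
    """
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    print(f"99%     -> {downtime_minutes_per_year(0.99) / 60:.1f} hours/year")
    print(f"99.999% -> {downtime_minutes_per_year(0.99999):.2f} minutes/year")

    # Three 5 nines components in a chain fall short of 5 nines overall.
    chain = serial(0.99999, 0.99999, 0.99999)
    print(f"chain of three: {chain:.6f}")

    # Two redundant 99% components do better than either one alone.
    print(f"two redundant 99% copies: {redundant(0.99, 2):.4f}")
```

This is why redundancy matters so much: chaining dependencies lowers overall availability, while running redundant copies in separate failure domains raises it.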

Common sources of service interruption that need to be considered in any high availability design include:

  • Hardware – Failure of any hardware component, eg storage, server, network.
  • Deployment – Failure of any automated or manual deployments to application code, hardware, network or configuration.
  • Load – Saturated load on any component of the application, or of the overall infrastructure itself.
  • Data – Corrupt data accepted into the system that cannot be processed.
  • Expired credentials – Expiration of a certificate or credentials, eg SSL certificate expiry.
  • Dependency – Failure of a dependent service.
  • Infrastructure – Power supply or HVAC failure impacting hardware availability.
  • Identifier exhaustion – Exceeding available capacity, hitting throttling limits, etc.

Application High Availability

There’s no point designing a 5 nines availability infrastructure if the application itself cannot achieve 5 nines availability.  Here are 4 things to consider when designing a highly available application:

  1. Fault Isolation Zones – in AWS terms this can mean architecting your application to leverage multiple Regions and Availability Zones.  Regions are geographic locations around the globe that contain 2 or more Availability Zones (AZ).  Availability Zones are physically separate datacentres within a region with isolated power, network and cooling.  So in theory no 2 availability zones should fail at the same time.
  2. Redundant Components – component redundancy starts right down at the hardware level with redundant power supplies, hard drives and network interfaces. But then it extends up the stack to the server level, eg multiple web servers, multi AZ or multi region databases and so on.
  3. Microservices – read more on our blog post about AWS Microservices.
  4. Recovery Oriented Computing – Recovery Oriented Computing, or ROC, focuses on having the right monitoring in place to detect all types of failure, and then automating recovery procedures to automatically recover from a failure.
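
The "detect, then automatically recover" loop behind Recovery Oriented Computing can be sketched in a few lines of Python.  This is a toy illustration rather than a real monitoring system: the `Component` type, the `run_recovery_cycle` function and the fake web server are all hypothetical, and in a real environment the health check and recovery action would call your own monitoring and automation tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Component:
    """A monitored component (hypothetical structure for illustration)."""
    name: str
    is_healthy: Callable[[], bool]  # health check supplied by the caller
    recover: Callable[[], None]     # automated recovery action


def run_recovery_cycle(components: Iterable[Component],
                       max_attempts: int = 3) -> List[str]:
    """Check every component once and try to recover the unhealthy ones.

    Returns the names of components that are still unhealthy after
    max_attempts recovery attempts, so that they can be escalated.
    """
    unrecovered = []
    for component in components:
        attempts = 0
        while not component.is_healthy() and attempts < max_attempts:
            component.recover()
            attempts += 1
        if not component.is_healthy():
            unrecovered.append(component.name)
    return unrecovered


if __name__ == "__main__":
    # A fake web server that comes back after one restart.
    state = {"up": False}

    def restart():
        state["up"] = True

    web = Component("web", is_healthy=lambda: state["up"], recover=restart)
    # A fake database whose recovery action never succeeds.
    db = Component("db", is_healthy=lambda: False, recover=lambda: None)

    print("needs human attention:", run_recovery_cycle([web, db]))
```

Capping the number of attempts matters: a recovery action that keeps failing should be escalated rather than retried forever.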

Operational Considerations for Reliability

  1. Deployment – Deployments should be automated where possible using a deployment methodology that decreases the risk of failure, such as Blue-Green, Canary, Feature Toggles or Fault Isolation Zone deployments.
  2. Testing – Testing should be carried out to match availability goals.  One of the most effective testing methods is canary testing, which runs constantly and simulates customer behaviour.
  3. Monitoring and Alerting – Deep monitoring of both your infrastructure and your application is essential to meet availability goals.  You need to know the status and availability of each component of the infrastructure and application, and the overall user experience being delivered.

So we’ve touched on a number of the different elements of the AWS Reliability pillar to get you thinking about the architecture of your AWS infrastructure and applications.  AWS has a great whitepaper which goes into a lot more detail and works through some hypothetical examples to illustrate the concepts – you can read the whitepaper here.
If you need any help, either in reviewing current AWS infrastructure against the AWS Well Architected Framework, or in designing highly available AWS systems, then Logicata will be pleased to help.  Our AWS Managed Services ensure continuous improvement of your application infrastructure against the Well Architected Framework.  Please reach out to us for more information.
