The AWS Well-Architected Framework provides a consistent approach for customers and partners to evaluate architectures and implement designs that scale over time. This comprehensive guide covers all six pillars and essential design principles for building robust cloud solutions.
The Six Pillars
Operational Excellence
Support development to deliver business value and continuously improve processes
Security
Protect data, systems, and assets using cloud technologies
Reliability
Perform intended functions correctly and consistently when expected
Performance Efficiency
Use computing resources efficiently and maintain efficiency as demand changes
Cost Optimization
Run systems to deliver business value at the lowest price point
Sustainability
Minimize environmental impacts through efficiency and reduced resource consumption
Operational Excellence
The ability to support development in delivering business value, run workloads effectively, gain insights into operations, and continuously improve supporting processes and procedures.
Security
The security pillar describes how to use cloud technologies to protect data, systems, and assets, and to improve your security posture.
Reliability
The ability of a workload to perform its intended function correctly and consistently when it's expected to.
Performance Efficiency
The ability to use computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve.
Cost Optimization
The ability to run systems to deliver business value at the lowest price point.
Sustainability
The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload, maximizing the benefits from the provisioned resources and minimizing the total resources required.
General Design Principles
Stop Guessing Your Capacity Needs
With cloud computing, you can use as much or as little capacity as you need and scale up and down automatically, eliminating expensive idle resources or performance issues from insufficient capacity.
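As a rough illustration of scaling capacity to demand, the sketch below computes how many instances would bring a utilization metric back to a target value. It assumes the metric falls proportionally as instances are added; the function name `desired_capacity` and the size limits are hypothetical, and this is not the exact algorithm any AWS service uses.

```python
import math


def desired_capacity(current_instances: int, current_metric: float,
                     target_metric: float, min_size: int = 1,
                     max_size: int = 20) -> int:
    """Estimate the instance count that returns the metric to its target.

    Assumes load is spread evenly, so the per-instance metric scales
    inversely with the number of instances (an illustrative simplification).
    """
    if current_instances < 1 or target_metric <= 0:
        raise ValueError("need at least one instance and a positive target")
    raw = current_instances * current_metric / target_metric
    # Round up so the fleet is never undersized, then clamp to the limits.
    return max(min_size, min(max_size, math.ceil(raw)))
```

With 4 instances at 90% CPU and a 60% target, the sketch suggests scaling out to 6; at 30% CPU it suggests scaling in to 2.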
Test Systems at Production Scale
Create a production-scale test environment on demand, complete your testing, and then shut down the resources. You only pay for the test environment when it's running.
Automate with Architectural Experimentation in Mind
Automation allows you to create and replicate workloads at low cost, track changes, audit impacts, and revert to previous parameters when necessary.
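The "track changes" and "revert to previous parameters" ideas can be sketched as an append-only history of parameter sets. The class `ParameterHistory` and the parameter names in the usage note are hypothetical; real workloads would typically rely on version control and an infrastructure-as-code tool rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ParameterHistory:
    """Append-only record of every parameter set applied to a workload."""
    versions: list = field(default_factory=list)

    def apply(self, params: dict) -> int:
        """Store a copy of params and return its 1-based version number."""
        self.versions.append(dict(params))
        return len(self.versions)

    def current(self) -> dict:
        """Return a copy of the most recently applied parameters."""
        return dict(self.versions[-1])

    def revert_to(self, version: int) -> int:
        """Re-apply an earlier version as a new entry, keeping the audit trail."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        return self.apply(self.versions[version - 1])
```

Reverting adds a new entry instead of deleting the failed one, so the history still shows what was tried and when it was rolled back.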
Consider Evolutionary Architectures
The cloud enables systems to evolve over time through on-demand automation and testing, allowing businesses to benefit from innovations as a standard practice.
Drive Architectures Using Data
Collect data on how architectural choices affect workload behavior to make fact-based decisions. Your cloud infrastructure is code, so use data to inform improvements over time.
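One way to make such a fact-based decision is to compare a tail-latency percentile across two architectural variants. The sketch below uses the nearest-rank percentile definition; the function names and the variant labels in the usage note are illustrative assumptions.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples or not 1 <= pct <= 100:
        raise ValueError("need samples and 1 <= pct <= 100")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


def pick_variant(latencies_ms: dict, pct: int = 95) -> str:
    """Return the variant name with the lowest latency at the given percentile."""
    return min(latencies_ms, key=lambda name: percentile(latencies_ms[name], pct))
```

Calling `pick_variant({"cache": [...], "no_cache": [...]})` with measured latencies returns the name of the variant with the better p95, which is a more defensible basis for a choice than intuition.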
Improve Through Game Days
Regularly schedule Game Days to test architecture and processes by simulating production events, helping identify improvements and build organizational experience.
Architecture Team Roles
Technology architecture teams typically include roles such as Technical Architect (infrastructure), Solution Architect (software), Data Architect, Network Architect, and Security Architect. These teams often use TOGAF or a similar architecture capability framework.
Governance and Compliance
Rules come from two different sources, and both need to be tracked:
- Internal Governance: Your organization's own policies, standards, and best practices
- External Compliance: Regulatory requirements, industry standards, and legal obligations
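The two rule sources above can be kept separate in automated checks, so a finding always states who imposed the rule. In this sketch the resource is a plain dictionary and both rules (`require_owner_tag`, `require_encryption`) are hypothetical examples, not actual policies from any standard.

```python
from typing import Callable, Optional

# A rule inspects a resource and returns a finding, or None if it passes.
Rule = Callable[[dict], Optional[str]]


def require_owner_tag(resource: dict) -> Optional[str]:
    """Example internal rule: every resource names an owning team."""
    if "owner" not in resource.get("tags", {}):
        return "missing 'owner' tag"
    return None


def require_encryption(resource: dict) -> Optional[str]:
    """Example external rule: data must be encrypted at rest."""
    if not resource.get("encrypted", False):
        return "encryption at rest is disabled"
    return None


INTERNAL_GOVERNANCE = [require_owner_tag]
EXTERNAL_COMPLIANCE = [require_encryption]


def audit(resource: dict) -> dict:
    """Run both rule sets and group findings by rule source."""
    report = {}
    for source, rules in (("internal", INTERNAL_GOVERNANCE),
                          ("external", EXTERNAL_COMPLIANCE)):
        findings = []
        for rule in rules:
            finding = rule(resource)
            if finding is not None:
                findings.append(finding)
        report[source] = findings
    return report
```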
Monitoring vs Observability
Monitoring tracks failure modes you already anticipate (Known-Unknowns). You define a check in advance, such as "notify me if CPU exceeds 90%", and the answer is either yes or no.
Observability, on the other hand, is the ability to debug issues you did not anticipate (Unknown-Unknowns). A sufficiently instrumented system lets you find out that CPU is high because microservice X is sending too many queries to database Y, and that 30% of these queries are creating deadlocks.
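The contrast can be sketched in a few lines. The first function is a fixed yes/no check; the other two answer questions that were not planned in advance by slicing structured events along arbitrary fields. The event fields (`service`, `database`, `deadlock`) are assumptions made for this example.

```python
from collections import Counter


def cpu_alert(cpu_percent: float, threshold: float = 90.0) -> bool:
    """Monitoring: a predefined question with a yes/no answer."""
    return cpu_percent > threshold


def top_callers(events: list, database: str) -> list:
    """Observability: which services query this database the most?"""
    counts = Counter(e["service"] for e in events if e["database"] == database)
    return counts.most_common()


def deadlock_ratio(events: list, service: str, database: str) -> float:
    """Observability: what share of one service's queries end in deadlock?"""
    matching = [e for e in events
                if e["service"] == service and e["database"] == database]
    if not matching:
        return 0.0
    return sum(1 for e in matching if e["deadlock"]) / len(matching)
```

The alert only says that something is wrong. The two queries work because every event carries enough context to be grouped and filtered after the fact.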
The Three Pillars of Observability
Metrics
Numerical data stored as time-series
Logs
Text records of what happened at a specific moment
Traces
The journey of a request through the system
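To make the three signal types concrete, the sketch below emits one of each for a single request. The record shapes and the function `record_request` are illustrative assumptions rather than the format of any particular telemetry library; the shared `trace_id` is what lets a log line be tied back to its trace.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricPoint:
    name: str          # metric name
    value: float       # numeric sample
    timestamp: float   # position in the time series


@dataclass
class LogRecord:
    timestamp: float
    level: str
    message: str
    trace_id: str      # links the log line to a trace


@dataclass
class Span:
    trace_id: str
    name: str
    start: float
    end: float
    parent: Optional[str] = None  # name of the calling span, if any


def record_request(name: str, start: float, end: float,
                   trace_id: Optional[str] = None):
    """Emit a metric, a log record, and a span describing one request."""
    trace_id = trace_id or uuid.uuid4().hex
    duration_ms = (end - start) * 1000
    metric = MetricPoint("request_duration_ms", duration_ms, end)
    log = LogRecord(end, "INFO", f"{name} finished in {duration_ms:.0f} ms", trace_id)
    span = Span(trace_id, name, start, end)
    return metric, log, span
```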
Runbook vs Playbook
- Runbook: Like a recipe - the outcome is known. If you can write a runbook for something (the step-by-step instructions are clear), ideally it should be automated as a bash script or an AWS Systems Manager Automation document.
- Playbook: The outcome is uncertain. It's more of a guide than a solution, helping with investigation rather than providing definitive answers.
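Because a runbook's steps and outcome are known, it can be expressed as an ordered list of actions that stops at the first failure. The sketch below shows that shape in Python instead of bash; the runner `run_runbook` and the step names in the usage note are hypothetical.

```python
from typing import Callable

# A step is a label plus an action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list) -> list:
    """Execute steps in order, stop at the first failure, return a log."""
    log = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # later steps assume earlier ones succeeded
    return log
```

For example, `run_runbook([("drain traffic", drain), ("restart service", restart), ("verify health", check)])` either completes all three steps or stops and reports where it failed. A playbook has no equivalent, because the next step depends on what the investigation finds.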
Security Considerations
The Customer Support Dilemma
When a customer comes with an issue like "I can't open my account", one solution is to build an interface for support staff. However, developing, maintaining, and securing this interface creates its own ecosystem with hidden and significant costs.
Related measures, such as disabling SSH access or replacing Bastion Hosts with a VPN or AWS PrivateLink, follow the principle of protecting developers from themselves: they remove risky access paths instead of relying on people to use them carefully.
AWS Identity Types
- Human Identities: Your administrators, developers, operators, and end users require an identity to access your AWS environments and applications
- Machine Identities: Your service applications, operational tools, and workloads require an identity to make requests to AWS services
Threat Modeling
Threat Modeling means thinking "How would a hacker break this system?" before writing any code.
In the Backend/Cloud world, a common way to structure this is the STRIDE model, which asks one question per threat category:
Spoofing
Can someone impersonate an admin?
Tampering
Can someone intercept and modify packets?
Repudiation
Can someone say "I didn't do it"?
Information Disclosure
Can data leak?
Denial of Service
Can someone lock up the system?
Elevation of Privilege
Can a regular user become an admin?
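The six questions can be turned into a checklist that is filled in before any code is written. In this sketch the questions are taken from the list above, while the function `open_threats` and the mitigation texts in the usage note are hypothetical.

```python
STRIDE_QUESTIONS = {
    "Spoofing": "Can someone impersonate an admin?",
    "Tampering": "Can someone intercept and modify packets?",
    "Repudiation": "Can someone say \"I didn't do it\"?",
    "Information Disclosure": "Can data leak?",
    "Denial of Service": "Can someone lock up the system?",
    "Elevation of Privilege": "Can a regular user become an admin?",
}


def open_threats(mitigations: dict) -> list:
    """List every STRIDE question that has no recorded mitigation yet."""
    unknown = set(mitigations) - set(STRIDE_QUESTIONS)
    if unknown:
        raise KeyError(f"not a STRIDE category: {sorted(unknown)}")
    return [f"{category}: {question}"
            for category, question in STRIDE_QUESTIONS.items()
            if not mitigations.get(category, "").strip()]
```

Calling `open_threats({"Spoofing": "MFA on the admin console"})` returns the five questions that still need an answer for the component under review.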
Storage Types in AWS
Object Storage (Amazon S3)
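Object storage holds data as objects in a flat namespace: in S3, there are no files or folders, only Objects. Each object has a Key (its name) and a Value (its content), plus metadata. Access is via a REST API over HTTP(S), so objects can be reached from any location that has network access and the right permissions.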
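The "no folders" point is easier to see with a small in-memory imitation of prefix-and-delimiter listing. This sketch does not call AWS; it only mimics how a listing request with a prefix and a delimiter groups flat keys into folder-like common prefixes. The function name `list_keys` and the sample keys are hypothetical.

```python
def list_keys(store: dict, prefix: str = "", delimiter: str = "/"):
    """Return (keys, common_prefixes) directly under a prefix.

    The store is a flat key -> value mapping. "Folders" appear only
    because keys happen to share a prefix that ends in the delimiter.
    """
    keys, common = [], set()
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll up everything below the next delimiter into one entry.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common)
```

With keys such as `logs/2024/a.txt` and `index.html`, listing the root yields `index.html` as a key and `logs/` as a common prefix, even though no `logs` folder object exists anywhere in the store.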
Block Storage (Amazon EBS)
Block storage is like the SSD you attach to a computer: raw disk space. The operating system formats it with a file system and reads or writes data in small blocks. Access is direct block I/O. By default, a volume is attached to one EC2 instance at a time; Multi-Attach on Provisioned IOPS (io1/io2) volumes is the exception, and it works only within a single Availability Zone.
File Storage (Amazon EFS / FSx)
A classic folder tree structure. The access protocol is NFS for Linux (Amazon EFS) or SMB for Windows (Amazon FSx for Windows File Server). Multiple EC2 instances can mount the same file system simultaneously and read and write the same files.
Blame-Free Culture
In a blame-free culture, if an error causes the system to crash, the culprit is not person X but the process that allowed person X to make that error. The result is Psychological Safety: people can report mistakes without fear of punishment. The goal is to ensure people don't hide their mistakes.
This approach encourages:
- Open discussion of failures
- Learning from mistakes
- Improving processes rather than punishing individuals
- Building trust within teams
- Creating an environment where innovation can thrive
Conclusion
The AWS Well-Architected Framework provides a comprehensive foundation for building secure, high-performing, resilient, and efficient infrastructure for applications. By understanding and implementing these six pillars and design principles, organizations can make informed decisions about their cloud architectures and continuously improve their systems over time.