Azure Well-Architected Framework — Solution Architecture in the cloud
How does Azure’s Well-Architected Framework differ from that of AWS? Well, not by much
AWS and Azure share the same pillars in their Well-Architected Frameworks: a set of guiding tenets used to evaluate architectures and improve the quality of a workload. The five pillars are Reliability, Operational excellence, Security, Cost optimization, and Performance efficiency.
In this article, I’ll introduce these guiding principles and the Azure products used. However, the same principles can be used to guide your solution design on any cloud, just by replacing the Azure products with other cloud products.
The Azure Well-Architected Framework helps you design, build, and continuously improve a secure, reliable, and efficient application. It is a set of guiding tenets for building high-quality solutions on Azure.
It consists of five pillars:
- Cost optimization — Identify inefficiency and waste in cloud spending to ensure you’re spending money where you can make the greatest use of it.
- Operational excellence — Enable faster development and deployment cycles using DevOps. Detect failures and problems before they happen by having a good monitoring apparatus. And of course, Automation! This will help you reduce your MTTR (Mean Time to Repair/Resolve).
- Performance efficiency — Match resource capacity to demand. Design with performance and scalability in mind.
- Reliability — A good design anticipates failure at all levels. Design a system that can recover from a failure within the time that your stakeholders and customers require. A reliable system has a low MTTD (Mean Time to Detect/Discover) and a high MTBF (Mean Time between Failures) and MTTF (Mean Time to Failure).
- Security — No need to stress why this is critical. The cloud provides protections against a variety of threats, such as network intrusion and DDoS attacks. Think about security throughout the entire lifecycle of your application, from design and implementation to deployment and operations. And don’t forget to protect your data integrity!
As you enter the cloud world, it is important to understand that while you offload some responsibilities to the cloud provider, others still fall on your shoulders. This is known as the shared responsibility model: the amount of responsibility you offload to the provider depends on the type of service you choose.
Also worth remembering: even when you offload responsibilities to the cloud, know the limitations of the products you choose. Recognizing these limitations early in the solution design cycle will help you avoid poor design choices.
Now let’s go through each of these pillars and take a deep dive.
Cost optimization
Cost optimization is getting the maximum bang for your buck, plain and simple. This is, in fact, one of the main reasons companies choose the cloud. They are moving away from CapEx (capital expenditure — upfront server/storage/network costs) toward OpEx (operational expenditure — a deductible expense on the cloud services used). Under the CapEx model, assets depreciate in value over time, whereas under OpEx there are no assets to depreciate. To further optimize costs in your organization’s architecture, you can use the following principles:
Maximize efficiency of cloud spend
Cloud infrastructure can handle fluctuating resource usage, something on-premises infrastructure struggles with: unless you over-provision for peak load, you can’t handle the fluctuation. Over-provisioning is cost-ineffective, and even it can’t absorb unforeseeable growth, e.g. Zoom during the Covid pandemic.
However, even in the cloud it’s still possible to incur unwarranted costs. Since the cloud has a pay-as-you-go service model, avoidable expenses are typically the result of provisioning more capacity than your demand requires. Identify and eliminate waste across your environment. Waste can show up in several ways:
- A virtual machine that’s always 90 percent idle — right size your VM!
- Paying for a license included in a virtual machine when a license is already owned — Get more information about Azure Hybrid Benefit and BYOL.
- Retaining infrequently accessed data on a storage medium optimized for frequent access — Use lifecycle management policy!
- Running a development environment overnight that is used only during business hours — Devise a strategy (Tagging + Automation) to automatically shutdown the machines overnight.
- Manually repeating the build of a non-production environment — This is a time-wasting activity. Use automation, e.g. Azure Resource Manager templates!
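The tagging-plus-automation strategy from the list above can be sketched as a small decision function. This is a minimal illustration with made-up tag names and business hours, not a real Azure Automation runbook; adapt it to your own tagging convention.

```python
from datetime import time

# Hypothetical tag names and business hours; adjust to your own convention.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)

def should_stop(tags: dict, now: time) -> bool:
    """Return True if a dev VM tagged for auto-shutdown is outside business hours."""
    if tags.get("environment") != "dev":
        return False
    if tags.get("auto-shutdown", "").lower() != "enabled":
        return False
    return not (BUSINESS_START <= now <= BUSINESS_END)

# A scheduled job (e.g. an Azure Automation runbook) could evaluate should_stop()
# for each VM and deallocate the ones that return True.
print(should_stop({"environment": "dev", "auto-shutdown": "Enabled"}, time(23, 0)))  # True
```

Keeping the rule tag-driven means teams opt in per resource instead of maintaining a central list of machines.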
As you evaluate your cost, take the opportunity to optimize environments. Drive the balance between a well-running application and the most cost-effective size. Identify and decommission unused resources. For workloads that don’t require high reliability and performance disks, you can use the reduced-cost standard storage. If your database has unpredictable bursts or spikes in activity, use elastic pools. The more databases you can add to a pool, the greater your savings become.
However, elastic pools pay off only in specific situations, such as customer databases that are spread across time zones and have periods of intermittent usage.
Plan and estimate costs
Identify the business and technical requirements of your project. Use cost estimation tools (such as the Azure Pricing Calculator) to produce a more precise estimate of the resources your project will require.
This estimate should involve identifying any current resources to move or redevelop, understanding business objectives that might affect sizing and capacity planning, and selecting the appropriate services for the project. You should also consider the subscription and billing model, such as pay-as-you-go or an enterprise agreement. Other important steps include:
- Use Azure Policy to set up an organizational framework to enable the control, reporting, and attribution of costs throughout your environment.
- Enable your users to view reports and billing as needed by creating roles that allow them to view services such as Azure Cost Management.
- Identify both the initial and ongoing training needs for your organization and include this as part of the cost of your project.
Provision with optimization
This can be achieved in multiple forms:
- Move from IaaS to PaaS services. PaaS services (Azure App Service, Azure SQL Database etc.) typically cost less than IaaS, and they typically reduce your operational costs.
- Use spot instances for low-priority workloads. This can provide you with significant savings!
- Use discounts when they’re available, such as reserved instances (available for Windows and Linux virtual machines, Azure SQL Database, Azure Synapse Analytics, Azure Cosmos DB, Azure Storage) and bring-your-own-license (BYOL) offers. Reservations can reduce your resource costs by up to 72 percent compared to pay-as-you-go prices!
- Carefully evaluate the workload requirements for your application, and select the SKU or tier that matches your resource requirements.
- Use services that provide a consumption billing model. With consumption models, you pay only for the transactions, CPU time, or run time your application uses. Examples include Azure Functions, Logic Apps, API Management, and Azure SQL Database in the serverless tier.
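The reservation discount mentioned above is easy to put in numbers. The sketch below uses made-up placeholder rates, not real Azure prices; consult the Azure Pricing Calculator for actual figures.

```python
# Back-of-envelope comparison with made-up hourly rates (not real Azure prices).
HOURS_PER_MONTH = 730

def monthly_cost_payg(hourly_rate: float) -> float:
    """Pay-as-you-go: you pay the full rate for every hour the resource runs."""
    return hourly_rate * HOURS_PER_MONTH

def monthly_cost_reserved(hourly_rate: float, discount: float) -> float:
    """A reservation applies a discount to the pay-as-you-go rate (up to ~72%)."""
    return hourly_rate * HOURS_PER_MONTH * (1 - discount)

payg = monthly_cost_payg(0.10)                 # at a placeholder $0.10/hour
reserved = monthly_cost_reserved(0.10, 0.72)   # with the maximum 72% discount
print(f"pay-as-you-go: ${payg:.2f}/month, reserved: ${reserved:.2f}/month")
```

Even at these toy rates, the gap compounds quickly across dozens of always-on VMs, which is why stable baseline workloads are the usual reservation candidates.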
Use monitoring and analytics to gain cost insights
Take advantage of cost management tools (Azure Cost Management, Azure Advisor, Billing Reports) and regularly review billing statements to better understand where money is being spent.
Identify and track down any cost anomalies that might show up on billing statements or through alerts. Configure alerts that are based on spending. These alerts can provide immediate visibility into spending that might be exceeding your budget. You can then take steps to address these costs. There are three types of cost alerts:
- Budget alerts notify you when spending, based on usage or cost, reaches or exceeds the amount defined in the alert condition of the budget.
- Credit alerts notify you when your Azure credit monetary commitments are consumed. Monetary commitments are for organizations with enterprise agreements.
- Department spending quota alerts notify you when department spending reaches a fixed threshold of the quota. You configure spending quotas in the Azure Enterprise Agreement portal.
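The budget-alert behaviour described above boils down to comparing spend against fractions of a budget. Here is a minimal sketch with hypothetical thresholds; Azure Cost Management implements this for you when you define a budget, so this is purely illustrative.

```python
# A toy version of budget-alert logic; thresholds are hypothetical.
def budget_alerts(spend: float, budget: float, thresholds=(0.5, 0.9, 1.0)):
    """Return the alert thresholds (fractions of budget) that spend has crossed."""
    return [t for t in thresholds if spend >= budget * t]

print(budget_alerts(950.0, 1000.0))   # 95% of budget consumed: crosses 0.5 and 0.9
print(budget_alerts(1100.0, 1000.0))  # over budget: crosses all three thresholds
```

Alerting at graduated thresholds (50%, 90%, 100%) gives teams time to react before the budget is actually exhausted.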
Operational excellence
Operational excellence ensures you have full visibility into how your application is running and that you deliver the best experience for your users. It includes making your development and release practices more agile, which allows your business to adjust quickly to changes. You can use the following principles to improve operational excellence in your architecture:
Design, build, and orchestrate with modern practices
DevOps is at the center of modern architectures. DevOps is as much cultural as it is technical. Creating a culture of sharing, collaboration, and transparency will bring operational excellence to your organization. Azure DevOps is a suite of products and tools that teams adopting DevOps practices can use to plan, develop, deliver, and operate their solutions.
Azure Boards is a part of Azure DevOps that helps teams plan and track work through modern agile tools like Kanban boards, backlogs, dashboards and scrum boards.
Azure Pipelines, also a part of Azure DevOps, is a cloud service that is used to automatically build and test your code project and make it available to others. Azure Pipelines combines continuous integration (CI) and continuous delivery (CD) to constantly and consistently test and build your code and ship it to any target.
Microservice architectures are often adopted for new applications that are adopting DevOps practices. A microservice is small enough that a single, small team of developers can write and maintain it (Also interesting: two-pizza teams). Microservice architectures are technology agnostic, but you often see containers or serverless technologies used for their implementation.
Note: GitHub Actions can also be used to build CI/CD capabilities in your GitHub repositories. With GitHub Actions, you can build workflows that are custom automated processes to build, test, package, release, and deploy code.
Use monitoring and analytics to gain operational insights
Have a thorough monitoring, logging, and instrumentation system. Doing this correctly will help you identify performance issues and cost inefficiencies, correlate events, and gain a greater ability to troubleshoot issues. Gathering data points from components at every layer will help alert you when values are outside acceptable ranges and help you track spending over time.
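The "values outside acceptable ranges" idea can be sketched as a simple threshold rule over gathered data points. The sample data and limits below are made up; in practice you would configure an equivalent rule in Azure Monitor rather than hand-roll it.

```python
# A minimal threshold-alert sketch over collected metric samples (made-up data).
def out_of_range(samples, low, high):
    """Return the samples that fall outside the acceptable [low, high] range."""
    return [s for s in samples if s < low or s > high]

cpu_percent = [35, 42, 97, 38, 99, 41]   # stand-in for data points from a VM
anomalies = out_of_range(cpu_percent, low=0, high=90)
print(anomalies)  # the values that should trigger an alert
```

Real monitoring adds time windows and aggregation (e.g. "average over 5 minutes above 90%") so a single spike doesn't page anyone, but the core comparison is the same.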
When it comes to monitoring and analytics on Azure, there are three specific areas of focus:
- Deep infrastructure monitoring — When you’re designing a monitoring strategy, it’s important to include every component in the application chain, so you can correlate events across services and resources. Services that support Azure Monitor can be easily configured to send their data to a Log Analytics workspace. Log Analytics acts as a central hub for monitoring data; Log Analytics receives monitoring data from your Azure resources, and makes that data available to consumers for analysis or visualization.
- Deep application monitoring — Application Insights provides telemetry collection, query, and visualization capabilities, which allow you to better detect and diagnose issues that occur within your web apps and services. It is an application performance management (APM) tool that can take your monitoring capabilities even further by looking deep into your applications to identify performance issues, usage trends, and the overall availability of the services you develop and depend on.
- Core monitoring — provides fundamental, required monitoring across Azure resources. Azure provides services to give you visibility into four key core monitoring areas: activity logging, the health of services, metrics and diagnostics, and recommendations on best practices.
Every change that is submitted to the Azure platform is tracked in the Azure Activity Log, which gives you the ability to trace any action that is taken on your resources. Azure Service Health identifies any issues with Azure services that might affect your application.
Azure Monitor enables core monitoring for Azure services by allowing the collection, aggregation, and visualization of metrics, activity logs, and diagnostic logs.
Azure Advisor helps by keeping an eye out for potential performance, cost, high availability, or security issues within your resources.
Use automation to reduce effort and error
Automate as much of your architecture as possible. Lack of automation results in increased operational costs. Automate deployments by using infrastructure as code (IaC), automate application testing, and build new environments as needed. IaC is a key DevOps practice, and it is often used in conjunction with continuous delivery.
Infrastructure as code (IaC) is the management of infrastructure — such as networks, virtual machines, load balancers, and connection topology — in a descriptive model, using a versioning system. On Azure, Azure Resource Manager (ARM) templates are an example of IaC.
Important: IaC evolved to solve the problem of environment/configuration drift. Without IaC, teams must maintain the settings of individual deployment environments. Over time, each environment becomes a snowflake that is increasingly unique, and cannot be reproduced automatically. The administration and infrastructure maintenance of these snowflake environments involves manual processes that are hard to track and contribute to errors. IaC makes all the deployments reproducible.
Further automation opportunities include configuration automation of VMs, using either custom images or post-deployment scripting with Azure Automation Desired State Configuration (DSC). Operational activities should also be automated: automating these tasks with Azure Automation reduces manual workloads, for example starting and stopping a machine at particular times.
Test, Test, Test
Testing is one of the fundamental components of DevOps and agile development in general. Testing should occur on both application code and infrastructure code, and they should both be subject to the same quality controls. Include testing in your application deployment and your ongoing operations. A robust testing plan can uncover issues with infrastructure deployments that can affect the user experience.
The different types of testing you should consider include unit testing, smoke testing, integration testing, load testing, security testing, acceptance testing, fault injection, etc.
Important: A main tenet of a DevOps practice for achieving system reliability is the shift-left principle. If your process for developing and deploying an application is depicted as a series of steps listed from left to right, testing should be shifted as far as possible toward the beginning of the process (the left), not left until the very end (the right).
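Shifting left in its simplest form means a unit test written alongside the code and run in CI on every commit, instead of waiting for end-of-cycle QA. The function below is a hypothetical example used only to show the shape of such a test.

```python
# A hypothetical function and its unit test, written together ("shifted left")
# so the test runs in the CI pipeline on every commit.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    assert apply_discount(100.0, 25) == 75.0   # normal case
    assert apply_discount(19.99, 0) == 19.99   # boundary: no discount

test_apply_discount()  # in practice a runner such as pytest discovers and runs this
print("tests passed")
```

A test runner like pytest would collect `test_apply_discount` automatically; an Azure Pipelines or GitHub Actions step then runs it before any deployment stage.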
Performance efficiency
Performance efficiency includes scaling resources, identifying and optimizing potential bottlenecks, and optimizing your application code for peak performance. Use the following patterns to improve performance efficiency:
Scale up and scale out — Use Autoscaling
Compute resources can be scaled up and scaled out. Scaling up means adding more capability to an existing resource, for example a larger VM size. Scaling out means adding more instances of a similar resource.
The advantage of scaling out is that you can conceivably scale out forever if you have more machines to add to the architecture. Scaling out requires some type of load distribution. This might be in the form of a load balancer that distributes requests across available servers. Or it might be a service discovery mechanism for identifying active servers to send requests to.
Autoscaling is the process of dynamically allocating resources to match performance requirements. As the volume of work grows, an application might need more resources to maintain the desired performance levels and satisfy service-level agreements (SLAs). As demand slackens and the additional resources are no longer needed, they can be deallocated to minimize costs.
You can also use a throttling mechanism to limit the number of requests from a source. Serverless platforms and containers can also be used for scaling purposes: for example, scaling and performance are handled automatically in Azure Functions (serverless), and autoscaling can be enabled for containers in AKS.
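The autoscaling loop described above reduces to a small decision rule: add an instance under load, remove one when demand slackens, and stay within bounds. The thresholds below are made up; real platforms (VM scale sets, AKS, App Service) let you declare equivalent rules instead of coding them.

```python
# A toy autoscaling decision rule with made-up CPU thresholds.
def desired_instances(current: int, avg_cpu: float,
                      scale_out_above: float = 70.0,
                      scale_in_below: float = 30.0,
                      minimum: int = 1, maximum: int = 10) -> int:
    """Return the instance count the next evaluation cycle should target."""
    if avg_cpu > scale_out_above:
        return min(current + 1, maximum)   # scale out, capped at the maximum
    if avg_cpu < scale_in_below:
        return max(current - 1, minimum)   # scale in, never below the minimum
    return current                         # within the comfort band: do nothing

print(desired_instances(3, 85.0))  # scale out under load
print(desired_instances(3, 10.0))  # scale in when demand slackens
```

The gap between the scale-out and scale-in thresholds is deliberate: without it, a metric hovering near a single threshold would cause "flapping", repeatedly adding and removing instances.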
Optimize network and storage performance
Review the network and storage performance to ensure that their levels are within acceptable limits. These performance levels can affect the response time of your application.
You can use Azure Traffic Manager as a DNS-based load balancer to distribute traffic within and across Azure regions. Use a CDN to cache content closer to the user. If you have very low latency requirements for connectivity from on-premises to Azure, use ExpressRoute.
Optimize virtual machine storage by choosing the right disk for your workload:
- Use Local SSD storage for temporary and high-performance disk requirements
- Use Standard storage HDD when your application isn’t sensitive to inconsistent latency or lower levels of throughput (dev/test workloads).
- Use Standard storage SSD and Premium storage SSD for higher throughput requirements. Premium storage SSD is recommended for most production workloads; it is the more reliable of the two and provides higher IOPS.
Use polyglot persistence whenever possible. It’s important to know that different data stores are designed for certain use cases or might be more accessible because of cost.
Adding a messaging layer between services can benefit performance and scalability. Data partitioning can help improve scalability, reduce contention, and optimize performance. Use caching in your architecture to help improve performance.
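Caching in miniature: Python's standard `functools.lru_cache` memoizes an expensive lookup so repeated requests are served from memory. A distributed cache such as Azure Cache for Redis plays the same role across processes and machines; the counter below is only there to make the effect visible.

```python
from functools import lru_cache

CALLS = {"count": 0}  # instrumentation to show how often the "database" is hit

@lru_cache(maxsize=128)
def get_product(product_id: int) -> dict:
    CALLS["count"] += 1               # stands in for a slow database query
    return {"id": product_id, "name": f"product-{product_id}"}

get_product(7)
get_product(7)                        # second call is served from the cache
print(CALLS["count"])                 # the backing store was queried only once
```

The trade-off is staleness: a cache entry can outlive the underlying data, so real deployments pair caching with an expiry policy or explicit invalidation.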
Identify performance bottlenecks in your application
It’s important to be able to track the way in which users utilize your system, trace resource utilization, and generally monitor the health and performance of your system. You can use this information as a diagnostic aid to detect and correct issues. Look across all layers of your application and identify and remediate performance bottlenecks.
Reliability
Designing for reliability includes maintaining uptime through small-scale incidents and temporary conditions like partial network outages, and recovering from data loss and larger-scale disasters. In short, it’s about ensuring that your systems are available to end users and can recover from any failure.
Build a highly available architecture
A highly available service is a service that absorbs fluctuations in availability, load, and temporary failures in dependent services and hardware. The application remains online and available (or maintains the appearance of it) while performing acceptably. You should evaluate high availability for your architecture by taking the following steps:
- Determine the service-level agreement of your application — A service-level agreement (SLA) is an agreement between a service provider and a service consumer in which the provider commits to a standard of service based on measurable metrics and defined responsibilities. Service-level objectives (SLOs) are measurable target values used to gauge performance, reliability, or availability, and they determine whether your overall SLA is being met. For example, an SLA of 99.9 percent allows roughly 8.76 hours of cumulative downtime per year, while 99.99 percent allows only about 52.6 minutes.
- Evaluate the HA capabilities of the application — Focus on single points of failure and critical components that would have a large impact on the application if they were unreachable, misconfigured, or started behaving unexpectedly.
- Evaluate the HA capabilities of dependent applications — If you are committing an uptime to your customers of 99.9%, but a service your application depends on only has an uptime commitment of 99%, this could put you at risk of not meeting your SLA to your customers.
Important: To improve availability, you can use Azure products/features such as availability sets, availability zones, load balancing (Azure Traffic Manager, Azure Load Balancer, Azure Application Gateway), and the HA features of PaaS services (Azure SQL Database, Azure App Service, and Azure Service Bus).
Add redundancy to components of the architecture so that you’re less likely to experience an outage. Identify your service-level agreement (SLA) and use a load balancer to spread requests across many instances of a service, detecting failed instances and preventing requests from being routed to them.
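The SLA arithmetic behind the evaluation steps above is simple enough to compute directly: an SLA percentage translates into an annual downtime allowance, and serial dependencies multiply into a composite SLA that is always lower than its weakest link.

```python
# SLA arithmetic: downtime allowance and composite SLA of serial dependencies.
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Annual downtime a given SLA permits."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

def composite_sla(*sla_percents: float) -> float:
    """Serial dependencies multiply: the chain is weaker than any single link."""
    result = 1.0
    for s in sla_percents:
        result *= s / 100
    return result * 100

print(round(allowed_downtime_minutes(99.9), 1))  # minutes/year at 99.9% (~8.76 hours)
print(round(composite_sla(99.9, 99.0), 3))       # a 99% dependency drags you below 99.9%
```

This is why the dependency check matters: promising 99.9% while relying on a 99% service is arithmetically impossible without redundancy around that dependency.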
Build an architecture that can recover from failure
Understand your RPO and RTO:
- Recovery point objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume. Examples are “30 minutes of data,” “four hours of data,” and so on. RPO is about limiting and recovering from data loss, not data theft.
- Recovery time objective (RTO): The maximum duration of acceptable downtime, where “downtime” is defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.
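In practice, RPO constrains how often you must take backups: the worst case is a disaster striking just before the next backup, so the backup interval must not exceed the RPO. A minimal sketch with made-up numbers:

```python
# RPO as a constraint on backup frequency (made-up intervals for illustration).
def max_data_loss_minutes(backup_interval_minutes: int) -> int:
    """Worst case: the disaster strikes just before the next backup runs."""
    return backup_interval_minutes

def meets_rpo(backup_interval_minutes: int, rpo_minutes: int) -> bool:
    return max_data_loss_minutes(backup_interval_minutes) <= rpo_minutes

print(meets_rpo(60, rpo_minutes=240))  # hourly backups satisfy a 4-hour RPO
print(meets_rpo(60, rpo_minutes=30))   # a 30-minute RPO needs more frequent backups
```

RTO drives a different lever: not how often you back up, but how quickly you can restore, fail over, or rebuild.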
The Azure products that can be used for such disaster recovery include Azure Site Recovery, Azure Storage, Azure Backup, Azure SQL Database, Azure Cosmos DB, etc.
Security
Sensitive data, such as data about customers, financial information about your organization, or critical line-of-business data, has always been a target of rogue actors. Securing the infrastructure the data resides on, along with the identities used to access it, is critically important as well. The following are some well-established strategies to improve your security posture:
Defense in depth
Defense in depth consists of seven layers: physical security, identity and access, perimeter, network, compute, application, and data. Each layer focuses on a different area where attacks can happen, creating depth of protection if one layer fails or is bypassed by an attacker. Each layer has its own security controls, technologies, and capabilities.
You should also be aware of the term Zero Trust model. It states that you should never assume trust but instead continually validate trust. It relies on verifiable user and device trust claims to grant access to organizational resources.
Identity Management and Infrastructure Protection
Use identity as a security layer for internal and external applications, take advantage of single sign-on (SSO) and multi-factor authentication (MFA) to provide identity security, and consider replicating on-premises identities to Azure Active Directory using Azure AD Connect, combined with Conditional Access policies.
Use RBAC, management groups, Privileged Identity Management, and managed identities for further security.
Encryption and Network Security
Identify and classify your data as confidential/restricted, private, or public. This helps you get a better picture of where sensitive data is stored and where encryption is or is not already in place. Encrypt all data at rest and in transit. All Azure products that store data support encryption. You should also encrypt your secrets (Azure Key Vault) and your backups.
Use Azure Security Center to identify internet-facing resources that don’t have network security groups associated with them, as well as resources that are not secured behind a firewall. Use web application firewall (WAF) included in Application Gateway to provide advanced security for your HTTP-based services. Isolate applications between environments, tiers, and services using Network Security Groups.