Architecting Storage Infrastructure in Azure
Get to know the Storage Infrastructure in Azure — for newbies or the AZ-300 / AZ-303 exam
This article introduces the storage infrastructure available in Azure and guides you through the architectural decisions. It includes a number of capability matrices comparing products to help you make your storage decisions. We start with Azure Storage and all its components, then data migration, move on to Azure disks, and finally finish with a comparison of products available for bulk data ingestion, batch processing, analytical data stores, and real-time streaming ingestion. So, let’s get started!
Azure Storage includes four services: Azure Blobs, Azure Files, Azure Queues, and Azure Tables. You need a storage account on Azure to use these services. A storage account is a container that groups a set of Azure Storage services together. It is an Azure resource and is included in a resource group.
Note: Deleting the storage account deletes all of the data stored inside it.
Azure Queue is used to store and retrieve messages. Queue messages can be up to 64 KB in size, and a queue can contain millions of messages. Queues are generally used to store lists of messages to be processed asynchronously.
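Because queue messages are capped at 64 KB, a client can validate message size before attempting to enqueue. A minimal sketch (the limit comes from the text above; the helper name is hypothetical):

```python
# Hypothetical client-side guard: Azure Queue messages must be <= 64 KB,
# so validate the encoded size before attempting to enqueue.
MAX_QUEUE_MESSAGE_BYTES = 64 * 1024

def can_enqueue(message: str) -> bool:
    """Return True if the UTF-8 encoded message fits the 64 KB limit."""
    return len(message.encode("utf-8")) <= MAX_QUEUE_MESSAGE_BYTES
```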
Azure Storage supports three kinds of blobs: block blobs (for text and binary objects such as documents and media files), append blobs (optimized for append operations such as logging), and page blobs (for random-access files such as virtual hard disks).
Important: A storage account by itself has no financial cost; however, the settings you choose for the account do influence the cost of services in the account. Geo-redundant storage costs more than locally-redundant storage. Premium performance and the Hot access tier increase the cost of blobs.
A storage account includes the following information:
- Subscription: The Azure subscription that will be billed for the services in the account.
- Location: The datacenter that will store the services in the account.
- Performance: Standard allows you to have any data service (Blob, File, Queue, Table) and uses magnetic disk drives. Premium introduces additional account kinds for specific data types: for example, storing unstructured object data as block blobs or append blobs, and specialized file storage used to create premium file shares. Premium storage accounts use solid-state drives (SSDs) for storage.
- Replication: At a minimum, Azure will automatically maintain three copies of your data within the data center associated with the storage account. This is called locally-redundant storage (LRS). There are other options as well, such as Zone-redundant storage (ZRS), geo-redundant storage (GRS) and Read-access geo-redundant storage (RA-GRS). These will be discussed later under data redundancy.
- Access tier: Controls how quickly you will be able to access the blobs in this storage account. Hot gives quicker access than Cool, but at increased cost. This applies only to blobs, and serves as the default value for new blobs.
- Secure transfer required: A security feature that determines the supported protocols for access. Enabled requires HTTPS, while disabled allows HTTP.
- Virtual networks: A security feature that allows inbound access requests only from the virtual network(s) you specify.
You can create a storage account with the Azure CLI command `az storage account create`. Note how its `--sku` value is the combination of “performance” and “replication” — for example, `Standard_LRS` or `Premium_LRS`.
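The sku value is simply the performance tier and replication option joined by an underscore. A minimal sketch of that naming rule (the helper and validation sets are illustrative, not an Azure API):

```python
# Sketch: an Azure storage account sku name combines the performance tier
# with the replication option, joined by an underscore, e.g. "Standard_LRS".
VALID_PERFORMANCE = {"Standard", "Premium"}
VALID_REPLICATION = {"LRS", "ZRS", "GRS", "RAGRS"}

def make_sku(performance: str, replication: str) -> str:
    if performance not in VALID_PERFORMANCE or replication not in VALID_REPLICATION:
        raise ValueError("unknown performance tier or replication option")
    return f"{performance}_{replication}"
```

Note that RA-GRS is written without the hyphen in sku names, giving `Standard_RAGRS`.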
While storage accounts support both deployment models (Classic and Resource Manager), it is recommended that you use Resource Manager for all new resources, as it supports resource groups, which the classic model doesn’t.
There are three kinds of storage accounts:
- StorageV2 (general purpose v2): Current offering that supports all storage types and all of the latest features.
- Storage (general purpose v1): a legacy kind that supports all storage types but may not support all features.
- Blob storage: a legacy kind that allows only block blobs and append blobs.
General-purpose v2 should be used for any new storage accounts.
Data in Azure is replicated to ensure that it’s always available, even if a datacenter or region becomes inaccessible or a specific piece of hardware fails. You have four replication options:
- Locally redundant storage (LRS) — Locally redundant storage replicates data and stores three copies across fault domains, or racks of hardware, within a single datacenter facility in one region.
- Zone-redundant storage (ZRS) — Zone-redundant storage replicates your data across three storage clusters in a region. Each cluster is physically separated from the other two.
- Geographically redundant storage (GRS) — Geographically redundant, or geo-redundant, storage provides multiple levels of replication. Your data is replicated three times within the primary region, and this set is then replicated to a secondary region. Keep in mind that your data in the secondary region is inaccessible until the primary region has failed over to the secondary region. At that point, the secondary region becomes the active (primary) region, and your data becomes accessible. GRS provides the highest level of durability: at least 99.99999999999999 percent (that’s 16 9s!).
- Read-access geo-redundant storage (RA-GRS) — It’s similar to GRS, but it also gives you read access to the data in the secondary location. When you use RA-GRS, your application needs to know which endpoint it’s interacting with: the secondary endpoint has “-secondary” appended to the account name. Microsoft handles the failover to the secondary. After the failover and DNS endpoint updates are complete, the storage account is set to LRS, and you’re responsible for reverting the replication settings from LRS to RA-GRS or GRS after the primary region becomes available again.
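The “-secondary” naming convention means the secondary read endpoint can be derived from the account name. A small sketch, assuming the standard blob endpoint shape:

```python
# Sketch: RA-GRS exposes a read-only secondary endpoint whose account name
# has "-secondary" appended, e.g.
#   primary:   https://myaccount.blob.core.windows.net
#   secondary: https://myaccount-secondary.blob.core.windows.net

def secondary_blob_endpoint(account_name: str) -> str:
    return f"https://{account_name}-secondary.blob.core.windows.net"
```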
Handle failed writes
RA-GRS replicates writes across locations. If the primary location fails, you can direct read operations toward the secondary location. However, this secondary location is read-only. If a long-lasting outage (more than a few seconds) occurs at the primary location, your application must run in read-only mode. You can achieve read-only mode in several ways:
- Temporarily return an error from all write operations until write capability is restored.
- Buffer write operations, perhaps by using a queue, and enact them later when the write location becomes available.
- Write updates to a different storage account in another location. Merge these changes into the storage account at the primary location when it becomes available.
Hint: To prevent an application from retrying operations that have failed, you can implement the Circuit Breaker pattern.
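A minimal sketch of the Circuit Breaker pattern in Python. The class, thresholds, and timings are illustrative, not part of any Azure SDK: after a few consecutive failures the breaker “opens” and short-circuits further calls for a cooldown period instead of hammering a failed endpoint.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    short-circuit calls for `reset_after` seconds instead of retrying."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: not retrying yet")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```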
An application that uses the Azure Storage client library can set the LocationMode of a read request to one of the following values:
- PrimaryOnly: The read request fails if the primary location is unavailable. This is the default behavior.
- PrimaryThenSecondary: Try the primary location first, and then try the secondary location if the primary location is unavailable. Fail if the secondary location is also unavailable.
- SecondaryOnly: Try only the secondary location, and fail if it’s not available.
- SecondaryThenPrimary: Try the secondary location first, and then try the primary location.
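The PrimaryThenSecondary behavior can be simulated to show the fallback shape. This is not the actual .NET storage client; `read_fn` and `fake_read` are hypothetical stand-ins for a real storage read:

```python
# Sketch of the "PrimaryThenSecondary" read strategy: try the primary
# endpoint first, fall back to the secondary, and fail only if both fail.

def read_primary_then_secondary(read_fn, primary, secondary):
    try:
        return read_fn(primary)
    except ConnectionError:
        return read_fn(secondary)  # raises if the secondary is also down

def fake_read(endpoint, down=()):
    """Stand-in for a storage read; endpoints listed in `down` fail."""
    if endpoint in down:
        raise ConnectionError(f"{endpoint} unavailable")
    return f"data from {endpoint}"
```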
Use Advanced Threat Protection to detect anomalies in account activity. It notifies you of potentially harmful attempts to access your account.
Azure Storage accounts provide several high-level security benefits for the data in the cloud:
- Protect the data at rest — All data written to Azure Storage is automatically encrypted by Storage Service Encryption (SSE) with a 256-bit Advanced Encryption Standard (AES) cipher. For virtual machines (VMs), Azure lets you encrypt virtual hard disks (VHDs) by using Azure Disk Encryption.
- Protect the data in transit — Enable TLS
- Support browser cross-domain access — Using CORS
- Control who can access data — Azure Storage supports Azure Active Directory and role-based access control (RBAC) for both resource management and data operations.
- Audit storage access — Audit Azure Storage access by using the built-in Storage Analytics service.
Each storage account has two access keys that are used to securely access the storage account. These access keys provide full access to anything in the storage account, similar to a root password on a computer. Storage accounts offer a separate authentication mechanism called shared access signatures that support expiration and limited permissions for scenarios where you need to grant limited access.
A shared access signature is a string that contains a security token that can be attached to a URI. Use a service-level shared access signature to allow access to specific resources in a storage account. Use an account-level shared access signature to allow access to anything that a service-level shared access signature can allow, plus additional resources and abilities.
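Mechanically, a SAS token is attached to the resource URI as a query string. A sketch of that step (the token below is a made-up placeholder, not a real signature):

```python
# Sketch: a shared access signature is appended to the resource URI as a
# query string. Use "&" when the URI already carries query parameters.

def apply_sas(resource_uri: str, sas_token: str) -> str:
    sep = "&" if "?" in resource_uri else "?"
    return f"{resource_uri}{sep}{sas_token.lstrip('?')}"
```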
Important: Access keys are critical to providing access to your storage account and as a result, should not be given to any system or person that you do not wish to have access to your storage account. Access keys are the equivalent of a username and password to access your computer.
Azure Files enables you to set up highly available network file shares that can be accessed by using the standard Server Message Block (SMB) protocol. This means that multiple VMs can share the same files with both read and write access. Azure Files can be used to add to or replace a company’s existing on-premises NAS devices or file servers. Because Azure Files stores files in a storage account, you can choose between standard or premium performance storage accounts:
- Standard performance: Double-digit ms latency, 10,000 IOPS, 300-MBps bandwidth
- Premium performance: Single-digit ms latency, 100,000 IOPS, 5-GBps bandwidth
The following table compares the different characteristics of each storage option:
Depending upon the amount of data to transfer and the strength of your network connection, you can make use of different Azure services/utilities, as shown below:
The utilities mentioned in the above figure are described below:
You can use Storage Explorer to manage data stored in Azure storage accounts. It lets you access blob, table, file, and data lake storage, and manage data across multiple Azure storage accounts and Azure subscriptions.
You can even use Storage Explorer to access and manage data stored in Azure Cosmos DB and Data Lake. In order to avoid data costs, you can use a locally based emulator. Storage Explorer supports two emulators: Azure Storage Emulator and Azurite.
- Azure Storage Emulator uses a local instance of Microsoft SQL Server 2012 Express LocalDB. It emulates Azure Table, Queue, and Blob storage.
- Azurite, which is based on Node.js, is an open-source emulator that supports most Azure Storage commands through an API.
There are three main roles for disks:
- OS disk. The OS disk, as the name suggests, is the disk with the OS on it. It has a maximum capacity of 2,048 GB.
- Data disk. You can add one or more data virtual disks to each virtual machine to store data. The number of data disks you can add depends on the virtual machine size. Each data disk has a maximum capacity of 32,767 GB.
- Temporary disk. Each virtual machine contains a single temporary disk, which is used for short-term storage applications such as page files and swap files. The contents of temporary disks are lost during maintenance events.
Most disks that you use with virtual machines in Azure are managed disks. A managed disk is a virtual hard disk for which Azure manages all the required physical infrastructure. Virtual hard disks in Azure are stored as page blobs in an Azure Storage account as mentioned earlier, but you don’t have to create storage accounts, blob containers, and page blobs yourself or maintain this infrastructure later. Managed disks are scalable, highly available (99.999%), have support for encryption and Azure backup.
An unmanaged disk, like a managed disk, is stored as a page blob in an Azure Storage account. The difference is that with unmanaged disks, you create and maintain this storage account manually. This means you have to keep track of IOPS limits within a storage account and ensure that you don’t overprovision its throughput. Unmanaged disks don’t support any of the scalability and management features of managed disks.
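With unmanaged disks, that IOPS bookkeeping is on you. A sketch of the check, assuming the commonly documented ceiling of 20,000 IOPS per standard storage account (treat the exact number as an assumption to verify against current limits):

```python
# Sketch: with unmanaged disks you must track aggregate IOPS per storage
# account yourself. 20,000 IOPS is the commonly documented limit for a
# standard storage account; confirm against current Azure documentation.
STANDARD_ACCOUNT_IOPS_LIMIT = 20_000

def fits_in_account(disk_iops: list[int]) -> bool:
    """True if the disks' combined IOPS stay within the account limit."""
    return sum(disk_iops) <= STANDARD_ACCOUNT_IOPS_LIMIT
```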
Ephemeral OS disks
An ephemeral OS disk is a virtual disk that saves data on the local virtual machine storage, so it has lower read-and-write latency than a managed disk. Ephemeral OS disks incur no storage costs, and they work well for applications that are tolerant of individual virtual machine failures.
To choose the right disk type, it’s critical to understand its performance. Performance is expressed in two key measures:
- Input/output operations per second (IOPS). IOPS measure the rate at which the disk can complete a mix of read and write operations. Higher performance disks have higher IOPS values.
- Throughput. Throughput measures the rate at which data can be moved onto the disk from the host computer and off the disk to the host computer. It is generally measured in MB/s.
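The two measures are linked by the I/O size: throughput is roughly IOPS multiplied by the size of each operation. A quick sketch of the arithmetic (helper name is illustrative):

```python
# Throughput and IOPS are linked by the I/O size:
#   throughput ≈ IOPS × I/O size
# e.g. 5,000 IOPS at 64 KiB per operation is 312.5 MiB/s.

def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    return iops * io_size_kib / 1024
```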
Virtual disks that you can choose for an Azure virtual machine are based on SSDs of several types or HDDs.
Ultra SSD — Ultra SSDs provide the highest disk performance (high throughput, high IOPS, and low latency) available in Azure. Ultra disks can have capacities from 4 GB up to 64 TB. You can adjust their IOPS and throughput values while they’re running and without detaching them. They don’t support disk snapshots, virtual machine images, scale sets, Azure Disk Encryption, Azure Backup, or Azure Site Recovery, and are only available in certain regions and with certain VM sizes; in addition, the VM must be in an availability zone. They are a good fit for top-tier databases and SAP HANA.
Premium SSD — Premium SSDs don’t have the current limitations of ultra disks and still provide high throughput and IOPS with low latency, although you can’t adjust their performance without detaching them from the virtual machine. You can only use premium SSDs with virtual machine sizes that support premium storage.
With premium SSDs, these performance figures are guaranteed. There’s no such guarantee for standard tier disks. They are a good fit for mission-critical workloads in medium and large organizations.
Standard SSD — Standard SSDs aren’t as fast as premium or ultra SSDs, but they still have latencies in the range of 1 millisecond to 10 milliseconds and up to 6,000 IOPS. They’re available to attach to any virtual machine, no matter what size it is.
They are a good fit for web servers, lightly used enterprise applications, and test servers.
Standard HDD — Standard HDDs are slower, and their speeds are more variable than SSDs, but latencies are under 10 ms for write operations and 20 ms for reads. As with standard SSDs, you can use standard HDDs with any virtual machine. They are a good fit when you want to minimize costs for less critical workloads and development or test environments.
Choose a data storage approach
Each data set has different requirements, and it’s your job to figure out which storage solution is best. The key factors to consider in deciding on the optimal storage solution are: how to classify your data, how your data will be used, and how you can get the best performance for your application.
Classify your data
Application data can be classified in one of three ways: structured, semi-structured, and unstructured. Structured data is what you see in your favorite SQL databases, otherwise also known as relational data. This is the data that adheres to a strict schema, so all of the data has the same fields or properties.
Semi-structured data is less organized than structured data and is not stored in a relational format; instead, it contains tags that make the organization and hierarchy of the data apparent — for example, key/value pairs. It is also referred to as non-relational or NoSQL data. The most common formats for semi-structured data are XML, JSON, and YAML.
Unstructured data is basically all data where there is no usable/discernable structure, for example, media (photos, videos). The video file itself may have an overall structure and come with semi-structured metadata, but the data that comprises the video itself is unstructured.
Determine operational needs
Once you’ve identified the kind of data you’re dealing with (structured, semi-structured, or unstructured), the next step is to determine how you’ll use the data. What are the main operations you’ll be completing on each data type, and what are the performance requirements?
For example, in an e-commerce application, update operations would need to happen just as quickly as read operations, so that users don’t put an item in their shopping carts when that item has just sold out. For unstructured data like the photos and videos used in the application, updates are not critical; instead, a whole file is typically read by its ID, and fast retrieval times are the important requirement.
Also consider whether you need transactional operations (grouping a series of data updates together, because a change to one piece of data needs to result in a change to another). You’ll mostly need transactional support in the case of structured data. If you do, you’ll need to choose between an OLTP and an OLAP database.
OLTP (Online Transaction Processing) databases are optimized for small, relatively simple transactions, whereas OLAP (Online Analytical Processing) databases handle large and complex workloads. However, OLAP databases have longer response times than OLTP databases and are not recommended for daily operations; their typical use case is analysis of historical data.
So, to summarize: to select the correct storage solution, weigh these four factors: data classification, operations, latency and throughput, and transactional support. For example, use Cosmos DB if you have semi-structured data with a high volume of read and write operations, transactional support required, and low latency.
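The shape of that four-factor reasoning can be sketched as a decision function. The rules below are a deliberately simplified illustration, not a complete product-selection guide:

```python
# Hypothetical decision sketch over the factors from the text. A real
# product choice involves many more considerations; this only illustrates
# the shape of the reasoning.

def suggest_store(classification, transactional, low_latency_required):
    if classification == "structured" and transactional:
        return "Azure SQL Database"
    if classification == "semi-structured" and transactional and low_latency_required:
        return "Azure Cosmos DB"
    if classification == "unstructured":
        return "Azure Blob storage"
    return "needs further analysis"
```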
Azure Storage Infrastructure products
Azure database products
Azure offers a choice of fully managed relational, NoSQL, and in-memory databases, where scalability, availability, and security are automated.
Capabilities of each product mentioned above:
Comparison between other data-related technologies
Big Data batch processing and bulk data ingestion
Analytical data store
General feature comparison
Scalability feature comparison
Security features comparison
Real Time Streaming Ingestion
You gotta know!
- Azure Data Lake Storage Gen2 authenticates through Azure Active Directory OAuth 2.0 bearer tokens.
- It’s highly recommended that you periodically rotate your access keys to ensure they remain private, just like changing your passwords. If you are using the key in a server application, you can use an Azure Key Vault to store the access key for you. Key Vaults include support to synchronize directly to the Storage Account and automatically rotate the keys periodically. Using a Key Vault provides an additional layer of security, so your app never has to work directly with an access key.
- Along with role-based access control (RBAC), Azure Data Lake Storage Gen2 provides access control lists (ACLs) that are POSIX-compliant and that restrict access to only authorized users, groups, or service principals.
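The rotation advice above can be backed by a simple staleness check. A sketch (the 90-day window is an arbitrary example policy, not an Azure requirement):

```python
from datetime import datetime, timedelta

# Sketch: flag an access key as due for rotation once it is older than a
# chosen policy window. 90 days is an arbitrary example, not an Azure rule.
ROTATION_WINDOW = timedelta(days=90)

def rotation_due(last_rotated: datetime, now: datetime) -> bool:
    """True if the key is older than the rotation policy window."""
    return now - last_rotated > ROTATION_WINDOW
```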