Windows Azure Storage (WAS) is a cloud-based storage system that provides durable storage for customers' data. The underlying storage infrastructure supports storing data in the form of tables, blobs & queues. Typical usage of these abstractions is incoming/outgoing data stored as blobs, queues used for workflow between applications & persistent state stored in tables or blobs.
WAS (at the time of this paper) is used inside Microsoft for various applications such as social media search, serving game content etc. WAS provides strong consistency & even goes on to claim support for all three properties of the CAP theorem. It provides a global namespace so that data can be accessed from multiple geographical locations, & with its multi-tenant architecture it delivers all of this at reduced cost. In this post we will take a look at how WAS achieves durability amidst the well-known failures associated with a distributed system.
High-level architecture
Let us start by exploring WAS from a bird's-eye view. We will look into the globally partitioned namespace that allows data stored in WAS to be accessed from multiple locations & see the core components that form the high-level architecture of WAS.
Global partitioned namespace
Clients using WAS can access their storage through a single namespace regardless of the global location from where they are accessing their data. WAS makes use of DNS as part of the namespace & divides the namespace into three components: account name, partition name & object name. A typical namespace looks as below:
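Per the original paper, the namespace roughly takes the form http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName, where <service> is blob, table or queue.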
The account name is part of the DNS host name & is used to locate the primary cluster & data center where the customer's data resides. The partition name is used to distribute access to data across multiple storage nodes & the object name identifies an individual object within the partition. WAS maps all three data abstractions onto this namespace as below:
- Blobs use the full blob name as the partition name & have no separate object name
- Tables: each row's primary key is made up of the partition name & the object name
- Queues use the queue name as the partition name & each message in the queue maps to an object name
WAS architectural components
The two main components that make up the WAS system are storage stamps & the location service.
A storage stamp is a cluster of racks of storage nodes & this is the component that actually stores the data. A typical stamp is kept at around 70% utilization, with the remaining capacity held in reserve to keep performance up & to keep serving data when nodes or racks within the stamp fail.
The Location Service (LS) manages the storage stamps & the account namespace across stamps by allocating storage stamps to accounts. It is also responsible for scaling the service by provisioning additional storage stamps in existing as well as new regions. When a new account signs up for storage, the LS chooses a stamp as the primary & stores the account metadata in that stamp so the stamp can start processing requests for that account. It also updates DNS to route requests for that account to the chosen stamp.
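To make the flow concrete, here is a minimal sketch of that account-creation path. The Stamp class, register_account & the utilization metric are hypothetical stand-ins, not actual WAS interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Stamp:
    vip: str                                   # virtual IP fronting the stamp
    utilization: float = 0.0                   # fraction of capacity in use
    accounts: set = field(default_factory=set)

    def register_account(self, name: str):
        self.accounts.add(name)

class LocationService:
    def __init__(self, stamps, dns):
        self.stamps = stamps                   # stamps managed by the LS
        self.dns = dns                         # hostname -> stamp VIP

    def create_account(self, account_name: str) -> Stamp:
        # Pick a lightly loaded stamp as the primary for this account.
        primary = min(self.stamps, key=lambda s: s.utilization)
        # Store account metadata in the stamp so it can start taking traffic.
        primary.register_account(account_name)
        # Update DNS so requests for this account resolve to the chosen stamp.
        self.dns[f"{account_name}.blob.core.windows.net"] = primary.vip
        return primary

ls = LocationService([Stamp("10.0.0.1", 0.7), Stamp("10.0.0.2", 0.4)], dns={})
ls.create_account("myaccount")                 # lands on the less utilized stamp
```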
Layers within a storage stamp
From the high-level architecture, you must have noticed that a single storage stamp is made up of three layers: the stream layer, the partition layer & the front-end layer. Let us get introduced to them at a high level & then dive into each of them individually in the upcoming sections.
The stream layer is what actually stores the bits on disk & is responsible for replicating them across multiple nodes for fault tolerance. It acts as a distributed file system layer (akin to GFS) & treats files as streams, i.e. ordered lists of storage chunks (known as extents in WAS). Both the partition server & the stream server are located on the same storage node in a stamp. The stream layer, however, does not understand higher-level data abstractions such as blobs & this is where the partition layer comes into the picture.
The partition layer is responsible for building the data abstractions that we talked about. It also supports a scalable object-level namespace & transactions on objects, & it caches data to reduce unnecessary I/O. Living up to its name, it also scales the stamp by partitioning data objects & provides load balancing across partition servers.
Front-end servers are stateless servers that are the first point of contact for incoming requests. They are responsible for authenticating & authorizing requests & then routing each request to the right partition server. The system maintains a map from partition name ranges to partition servers & front-end servers cache this map in memory.
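A toy version of that lookup might look as below, assuming the cached map is just a sorted list of non-overlapping (low key, high key, partition server) entries; the names here are illustrative, not WAS internals:

```python
import bisect

class PartitionMap:
    """Hypothetical in-memory partition map cached by a front-end server."""
    def __init__(self, ranges):
        self.ranges = sorted(ranges)                  # sorted by low key
        self.lows = [low for low, _, _ in self.ranges]

    def route(self, partition_name):
        # Find the last range whose low key <= partition_name.
        i = bisect.bisect_right(self.lows, partition_name) - 1
        if i >= 0:
            low, high, server = self.ranges[i]
            if partition_name <= high:                # inclusive bounds for simplicity
                return server                         # PS serving this range
        raise KeyError(f"no range partition covers {partition_name!r}")

pmap = PartitionMap([("a", "f", "ps-1"), ("g", "m", "ps-2"), ("n", "z", "ps-3")])
assert pmap.route("hello") == "ps-2"
```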
Two-layer replication
WAS has two different replication engines, one for the stream layer & another for the partition layer.
Intra-stamp replication, done by the stream layer, synchronously replicates data written to a stamp across the primary & replica nodes. Being synchronous in nature, it blocks the client request until the replication is acknowledged by all replica nodes. Hence it is important to make sure that intra-stamp replication is done with minimal latency.
Inter-stamp replication, done by the partition layer, asynchronously replicates data across stamps. This replication copies data across two different geographical locations to prevent data loss in the event of a geo-disaster. It can also be used to migrate data across geo-regions. Due to its asynchronous nature, it does not impact the latency of client requests.
Stream layer
The stream layer is used only by the partition layer & is not exposed to external users. As discussed in the previous section, the stream layer performs mutation operations on files it calls streams. All these mutation operations are append-only, resulting in an immutable storage system. Each stream writes to storage units called extents & it only writes to one unsealed extent at a time, the last one in the stream. Consider the following figure, where E4 is the only unsealed extent whereas the remaining extents are sealed & hence cannot be modified. If you are familiar with LSM trees, you will find the design of streams pretty similar to that of SSTables & memtables.
The minimum unit of data read or written is called a block & the client has control over the size of a block. A client read specifies an offset into a stream & the entire contents of the block are read in order to verify the checksum. Extents are the unit of replication in the stream layer, with a default replication policy of three replicas. A stream is what the partition layer interacts with & underneath it is an ordered list of pointers to extents. Now let's take a look into the internal workings of the various components of the stream layer.
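Here is a minimal sketch of the stream → extent → block hierarchy & the append-to-last-unsealed-extent rule; the sizes & class names are purely illustrative:

```python
class Extent:
    def __init__(self, target_size):
        self.blocks = []                 # appended blocks (bytes)
        self.size = 0
        self.sealed = False
        self.target_size = target_size

    def append(self, block: bytes):
        assert not self.sealed, "sealed extents are immutable"
        self.blocks.append(block)
        self.size += len(block)
        if self.size >= self.target_size:
            self.sealed = True           # no further appends allowed

class Stream:
    """An ordered list of extents; only the last extent is unsealed."""
    def __init__(self, target_extent_size=4):
        self.target_extent_size = target_extent_size
        self.extents = [Extent(target_extent_size)]

    def append(self, block: bytes):
        if self.extents[-1].sealed:
            # Allocate a fresh extent once the previous one is sealed.
            self.extents.append(Extent(self.target_extent_size))
        self.extents[-1].append(block)

s = Stream()
for b in (b"aa", b"bb", b"c"):
    s.append(b)
# b"aa" + b"bb" reach the 4-byte target and seal the first extent; b"c" lands in a new one.
assert s.extents[0].sealed and not s.extents[1].sealed
```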
Stream manager & extent nodes
The Stream Manager (SM) & Extent Nodes (EN) form the core of the stream layer architecture. The SM manages the stream namespace, maps extents to streams & manages the allocation of extents to extent nodes. The SM itself is a Paxos cluster & performs maintenance of extent nodes with tasks such as health monitoring, creating & assigning new extents to ENs, garbage collection & re-replication in case of node failures. The SM syncs with ENs on a regular basis to determine which extents they are storing. If a particular extent is replicated on fewer nodes than the desired threshold, the SM triggers lazy re-replication to bring the replication factor back to the desired level.
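A simplified sketch of that replica check could look like this, assuming a hypothetical schedule_rereplication hook (not an actual SM interface):

```python
DESIRED_REPLICAS = 3

def check_replication(extent_to_ens, healthy_ens, schedule_rereplication):
    """extent_to_ens maps extent id -> set of ENs believed to hold a replica."""
    for extent_id, ens in extent_to_ens.items():
        live = ens & healthy_ens               # drop replicas on failed nodes
        if not live:
            continue                           # no surviving replica; handled elsewhere
        if len(live) < DESIRED_REPLICAS:
            # Lazily copy from a surviving replica onto new ENs, prioritised
            # behind serving live traffic.
            schedule_rereplication(extent_id,
                                   source=next(iter(live)),
                                   copies_needed=DESIRED_REPLICAS - len(live))
```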
Extent nodes store the extents assigned to them by the SM & each EN knows which extents it is storing along with the peer replicas for a given extent. An EN can communicate with other ENs to replicate a block. If an extent is no longer referenced by any stream, the SM notifies the ENs to reclaim the space associated with that extent.
Immutability
Immutability plays a major role in simplifying the overall design of WAS & the stream layer is no exception. Each stream can only be appended to & no in-place modification is allowed. Appends are atomic regardless of whether it is a single-block append or an operation containing multiple blocks. Each extent also has a target size & once that size is reached, the extent is marked as sealed. A sealed extent can no longer be appended to & optimizations such as erasure coding can be applied to it.
Replication flow
The stream layer provides strong consistency guarantees by ensuring that once a record append is acknowledged back to the client, any read of that record from any replica returns the same data. Likewise, any read from any replica of a sealed extent yields the same result. The stream layer has to account for node & power failures along with disk corruption & software bugs that can corrupt data. It also has to support software & hardware upgrades with no downtime. We will look at how the stream layer uses replication to account for these failures.
During the creation of a stream, the SM assigns three replicas for its first extent (one primary & two secondary) spread across different fault & upgrade domains. This information is sent to the client & therefore the client knows which ENs hold the three replicas & which replica is the primary. Clients perform a write by sending the request to the primary replica & the primary is responsible for replicating the write to the other replicas. When an extent is sealed, a similar process is repeated for the newly allocated extent. As the primary is responsible for replicating writes, it can decide the order in which appends are performed & therefore the order of appends across replicas remains the same. The last acknowledged append position is considered to be the commit length of the replica.
This covers the happy path for replication, but we do know that failure is the norm & not an exception in distributed systems. In case one of the replica nodes goes down during an append, a write failure is returned back to the client. The client then reaches out to the SM, which seals the underlying extent & allocates a new extent with new replicas. With this flow, certain replicas may have performed the append even though a failure was returned to the client, resulting in a mismatch of data across the three replicas. To account for this, during sealing the SM asks all replicas for their commit length & chooses the smallest one, ensuring that all replicas have identical data up to this commit length.
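A toy version of the sealing step, assuming each replica is just a list of appended records (the appends beyond the chosen commit length were never acknowledged to the client, so discarding them is safe):

```python
def seal_extent(replicas):
    """replicas: dict mapping EN id -> list of appended records (its log)."""
    # Ask every replica for its commit length and take the minimum, so the
    # sealed length only covers appends present on all replicas.
    commit_length = min(len(log) for log in replicas.values())
    for en, log in replicas.items():
        del log[commit_length:]          # drop appends beyond the sealed length
    return commit_length

replicas = {"en1": ["r1", "r2", "r3"], "en2": ["r1", "r2"], "en3": ["r1", "r2", "r3"]}
assert seal_extent(replicas) == 2        # the unacknowledged "r3" is discarded
```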
WAS also improves space utilization by erasure coding sealed extents for blobs. This brings the storage overhead down from 3x to 1.3x-1.5x.
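The paper uses Reed-Solomon codes for this. As a back-of-the-envelope check with hypothetical fragment counts (the paper does not fix N & M here), splitting a sealed extent into N data fragments plus M code fragments gives an overhead of (N + M) / N:

```python
# Storage overhead of erasure coding vs. keeping three full replicas (3.0x).
def overhead(n_data, n_code):
    return (n_data + n_code) / n_data

print(overhead(12, 4))   # ~1.33x
print(overhead(10, 5))   # 1.5x
```

Those hypothetical parameter choices land right in the 1.3x-1.5x range quoted above.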
Load balancing
A read request to an EN is sent along with a read deadline, i.e. the maximum time within which the read should be fulfilled. If the EN decides that the read cannot be fulfilled within the deadline, it immediately returns an error so that the client can try the same request at another EN holding a replica. WAS also ensures that an individual disk does not become a performance bottleneck: a custom I/O scheduling mechanism is used to avoid piling new requests onto a spindle that is already serving long-latency I/O.
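A minimal sketch of that deadline-driven retry, with the extent nodes modeled as plain callables (all names here are hypothetical):

```python
def read_with_deadline(extent_nodes, offset, length, deadline_s=0.2):
    for en in extent_nodes:
        try:
            # Each EN either serves the read within the deadline or fails fast
            # so the client can immediately try the next replica.
            return en(offset, length, deadline_s)
        except TimeoutError:
            continue
    raise TimeoutError("no replica could serve the read within its deadline")

def busy_en(offset, length, deadline_s):
    raise TimeoutError            # pretend this EN's disk queue is too long

def healthy_en(offset, length, deadline_s):
    return b"x" * length          # pretend read from disk

print(read_with_deadline([busy_en, healthy_en], offset=0, length=4))
```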
Journaling
With the durability guarantee that the stream layer provides, it needs to ensure that there is no data loss after the client has been sent a confirmation that the data was stored successfully. To do this, WAS makes sure all three replicas acknowledge the write before responding, so that another replica can take over in case of a failure.
But performing the write to disk on all three replicas means that the latency is dominated by the slowest replica. To improve this, the stream layer reserves a dedicated drive on each EN for journaling. When an EN performs a write, it issues two requests in parallel: one appending the operation to the journal & another queuing the write to the data disk while keeping the data in memory. If either of them succeeds, the write is considered successful. If a read request comes in & the data has not yet been written to the data disk but has been appended to the journal, the read is served from memory.
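The sketch below models that journaling path, assuming hypothetical write methods & using sleeps to stand in for disk latencies:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

class ExtentNode:
    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.memory = {}               # appends not yet confirmed on the data disk

    def _write_journal(self, key, data):
        time.sleep(0.001)              # dedicated journal drive: fast sequential append
        return "journal"

    def _write_data_disk(self, key, data):
        time.sleep(0.05)               # data disk may be busy with other traffic
        self.memory.pop(key, None)     # durable on the data disk, drop from memory
        return "data-disk"

    def append(self, key, data):
        self.memory[key] = data        # reads are served from memory in the meantime
        futures = [self.pool.submit(self._write_journal, key, data),
                   self.pool.submit(self._write_data_disk, key, data)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Acknowledge the append as soon as either write lands.
        return next(iter(done)).result()

en = ExtentNode()
print(en.append("block-1", b"payload"))   # usually the journal write wins
```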
Journaling ends up improving the overall latency of write operations without compromising durability.
Partition layer
The partition layer is what understands the abstractions of queues, tables & blobs. It provides the data model for the different abstractions along with a scalable namespace & load balancing across partition servers. Let us dive deeper into the various components that make up the partition layer.
Data model
The partition layer provides an internal data structure called the Object Table (OT), which is broken down into range partitions & mapped to multiple partition servers. A range partition is a contiguous set of rows from the OT, described by a low key & a high key, with no overlap between peer range partitions.
The partition layer uses OTs for everything it stores: the account table holds metadata & configuration for storage accounts, blob tables store blob objects, entity tables store table rows & message tables store queue messages. Each of these tables has a fixed schema, which is stored in a schema table. OTs support all the well-known data types along with two custom types: dictionary & blob. The dictionary type is used for storing data with a flexible schema & the blob type is used for storing blobs. As blobs are large, they are not stored directly in the row; instead the row stores a pointer to the blob's storage location along with the metadata needed to access it.
Architecture
The core architecture of the partition layer consists of a partition manager (PM), partition servers (PS) & a lock service. The architecture diagram for the partition layer is as below:
The PM is responsible for splitting the OTs into range partitions & assigning those range partitions to PSs. The PM stores this partitioning information in a partition map table & guarantees that each range partition is assigned to exactly one PS. A storage stamp has multiple instances of the PM running & the PM that successfully acquires the lock from the lock service takes control of the partition layer.
A PS is responsible for serving requests for a set of range partitions. It stores all persistent state in streams & also maintains an in-memory cache of it. As we saw earlier, range partitions are non-overlapping & each one is served by exactly one PS, which simplifies supporting concurrent transactions without compromising on consistency guarantees. The lock service is based on the Paxos protocol, is used for leader election of the PM & also serves leases to each PS. If a partition server fails, all the range partitions mapped to it are reassigned to other PSs & the mapping is updated in the partition map table by the PM.
Data structure for RangePartition
A PS maintains a variety of both in-memory & persistent data structures that help it serve range partitions. It uses an LSM-tree-like design for its persistent data structures. Each range partition maps to its own set of streams in the stream layer. Following is an overview of these data structures:
Let us start by exploring the persistent data structures. The metadata stream is kind of the brain of a range partition: the PM assigns a range partition to a PS by providing the name of the range partition's metadata stream & the PS can get all the information about the range partition from it. The metadata stream is also used to keep track of pending split/merge operations on the range partition. The commit log stream keeps track of the recent operations applied to the range partition. The row data stream stores the checkpoints of row data & their index, & the blob data stream stores the blob data bits.
The in-memory data structures are described in the top section of the overview diagram. The memory table is the in-memory version of the commit log & the first point of lookup for any read operation. The index cache stores the checkpoint indexes of the row data stream & the row data cache stores row data pages; the two caches are kept separate to increase the amount of useful data that can be kept in memory. The row data cache is the second point of lookup if the memory table cannot fulfill the read request. If both the memory table & the row data cache miss, we need to search the persisted checkpoints, which can be a costly operation. So a bloom filter for each checkpoint is also kept in memory to avoid scanning a checkpoint that cannot contain the data.
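Putting that lookup order together, a simplified read path might look as below; the bloom filter is modeled as a plain callable & everything here is an illustrative stand-in:

```python
def read_row(key, memory_table, row_data_cache, checkpoints):
    # 1. Freshest data lives in the memory table (in-memory view of the commit log).
    if key in memory_table:
        return memory_table[key]
    # 2. Then the row data cache of recently read checkpoint pages.
    if key in row_data_cache:
        return row_data_cache[key]
    # 3. Otherwise consult persisted checkpoints (ordered newest first here),
    #    skipping any whose bloom filter says the key cannot be present.
    for bloom, checkpoint_rows in checkpoints:
        if not bloom(key):
            continue                       # avoid a pointless checkpoint scan
        if key in checkpoint_rows:
            row_data_cache[key] = checkpoint_rows[key]   # warm the cache
            return checkpoint_rows[key]
    return None

cp = (lambda k: k.startswith("user"), {"user:1": {"name": "a"}})
print(read_row("user:1", {}, {}, [cp]))    # {'name': 'a'}
```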
Data flow
To process a write request, the PS appends the operation to the commit log & then applies the resulting row to the memory table for quick lookup. Once either the memory table or the commit log reaches its size threshold, the PS writes the contents of the memory table into a checkpoint in the row data stream. The PS also performs regular compaction to consolidate checkpoints & keep their number low. This parallels what any LSM-based storage engine such as LevelDB does. Before serving a range partition, a PS needs to finish loading the partition by reading its metadata stream to find the checkpoints & replaying the transactions in the commit log to bring it up to date.
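And a matching sketch of the write path with a checkpoint threshold (the threshold & field names are made up for illustration):

```python
class RangePartition:
    def __init__(self, memtable_limit=2):
        self.commit_log = []             # stands in for the commit log stream
        self.memory_table = {}
        self.checkpoints = []            # stands in for checkpoints in the row data stream
        self.memtable_limit = memtable_limit

    def write(self, key, row):
        self.commit_log.append((key, row))   # durable append first
        self.memory_table[key] = row         # then apply to the memory table
        if len(self.memory_table) >= self.memtable_limit:
            self._checkpoint()

    def _checkpoint(self):
        # Flush the memory table into a new checkpoint and reset the log.
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table.clear()
        self.commit_log.clear()

rp = RangePartition()
rp.write("k1", "v1"); rp.write("k2", "v2"); rp.write("k3", "v3")
assert len(rp.checkpoints) == 1 and rp.memory_table == {"k3": "v3"}
```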
Load balancing
At the scale at which WAS operates, it is expected that some range partitions will suffer from hot spots while others see lower than usual traffic. Load balancing is one of the major goals of the partition layer in order to ensure scalability as well as efficient resource utilization. It does this by balancing load across PSs & by splitting or merging range partitions.
Load is tracked for each PS as well as for each range partition served by the PS. Various load metrics are propagated from the PS back to the PM by piggybacking on heartbeat responses. If a range partition is turning into a hot spot, the PM will split it & spread the resulting partitions across different PSs. If a PS is facing increased load, the PM will reassign some of its range partitions to other PSs.
The PM does the reassignment by sending an offload command to the PS for a specific range partition, which results in checkpointing the current state of that range partition. The PM then reassigns it to another PS & updates the partition map table. To split a range partition, the PM instructs the PS to split it based on either load or size & the PS is responsible for picking the split key. The PS does this by checkpointing the range partition & then performing a MultiModify operation to split it into two or more range partitions. The PS then reports this back to the PM, which updates the mapping. Merging two or more adjacent range partitions that are seeing reduced traffic happens in a similar fashion.
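A rough sketch of the split step, assuming (as the paper describes for MultiModify) that the child partitions initially just point at the parent's extents instead of copying data; the dictionaries & field names are hypothetical:

```python
def split_range_partition(parent, split_key):
    low, high = parent["low"], parent["high"]
    # The children reference the same extents as the parent (a pointer copy),
    # which is why the split itself is fast.
    child_a = {"low": low, "high": split_key, "extents": parent["extents"]}
    child_b = {"low": split_key, "high": high, "extents": parent["extents"]}
    # The PS reports both children back to the PM, which updates the
    # partition map table so front-ends route to the right child.
    return child_a, child_b

parent = {"low": "a", "high": "z", "extents": ["E1", "E2", "E3"]}
print(split_range_partition(parent, split_key="m"))
```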
Replication
An account in WAS has one or more secondary stamps for geo-replication. This is done to handle the failure of a data center in a region. Every request for an account is executed on the primary stamp first & once the operation commits successfully there, the partition layer asynchronously replicates it to the secondary stamp. Note that recent changes might not be fully replicated if the primary data center goes down immediately after performing the operation. The location service is responsible for the switchover in case the primary goes down & the URI used to access storage does not change.
Conclusion
WAS provides scalable storage along with data abstractions supporting blobs, tables & messaging. If we draw a parallel with Facebook's Tectonic file system, WAS also aims to mix different workloads to get the most out of its storage & compute resources. It provides ease of use for end customers, which is proven by the fact that the initial version of the social media ingestion engine for Bing took only two months to launch by leveraging WAS. It is a brilliant system that makes use of efficient concepts such as tiered storage, erasure coding, journaling, LSM-based storage, separation of storage from compute & geo-replication.
Hope you gained something from reading this & happy learning!