As part of the previous post, we saw how Facebook makes use of Haystack for object storage. But with the increase in the amount of blobs (binary large objects) being stored, Haystack was proving to be inefficient at storing older blobs. The efficiency of a storage system is measured by its replication factor, & this is where distinguishing between blobs that need more replication & blobs that can get by with less makes all the difference. f4 is a system built by Facebook that categorizes such rarely accessed blobs as warm & uses a special warm storage system for them, in turn improving the efficiency of the overall storage system. As part of this post, we will look deeper into the design of f4 & also understand why it is more efficient than Haystack as an object store.
Case for warm storage
One thing that differentiates the blobs stored in f4 from those stored in Haystack is that they cover much more than just photos. A blob can be used to store media files, documents, heap dumps & even source code. Blobs follow a simple rule of immutability: they are created once with no support for modification.
Haystack solved the problem of minimizing the number of disk seeks needed to retrieve a blob & performed 3-way replication to achieve fault-tolerance. With the increase in the size as well as the variety of data stored as blobs, there is a percentage of data that is created & then very rarely accessed (consider a video that you uploaded to Facebook 5 years ago; how often do you access that content on a daily basis?). In simple words, a warm blob is one which receives less than a threshold number of requests. Any blob starts as hot when it is created & then transitions to warm over time as interest in its underlying content dies down. Consider the following data plot that shows the decreasing number of requests for content as its age increases:
This is where 3-way replication turns out to be costly when we are dealing with data at Facebook’s scale. f4 solves this by achieving the same fault-tolerance but with a lower replication factor using an erasure-coding technique. Any new blob first lands in Haystack & when the request rate for that blob drops below the threshold, it is migrated to warm storage. The overall architecture abstracts this from the client, which ends up interacting with just a single blob-storage system.
Blob storage design
The blob storage system is built in such a manner that it abstracts away the concept of hot/warm storage from the client. Requests enter the system through a router layer where create requests are sent to the hot layer whereas delete requests are sent to the host where the blob is currently stored. For read requests, the router extracts the logical volume id & maps it to the physical volume where the blob is stored.
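To make the routing rules concrete, here is a minimal sketch of how such a router tier could dispatch the three request types. The class, the `volume:sequence` blob-id format & the host names are assumptions for illustration, not f4’s actual implementation:

```python
# Hypothetical routing sketch; names & blob-id format are made up for illustration.
from dataclasses import dataclass

@dataclass
class VolumeLocation:
    physical_hosts: list      # hosts currently holding the physical volume
    is_warm: bool = False     # True once the volume has been migrated to f4

class Router:
    def __init__(self, volume_map):
        self.volume_map = volume_map        # logical volume id -> VolumeLocation

    def route_create(self, blob):
        # Creates always land in hot storage (Haystack), on an unlocked volume.
        return ("haystack-hot-tier", blob)

    def route_read(self, blob_id):
        # Extract the logical volume id from the blob id & map it to a host
        # holding the physical volume that stores the blob.
        logical_volume = blob_id.split(":")[0]
        location = self.volume_map[logical_volume]
        return (location.physical_hosts[0], blob_id)

    def route_delete(self, blob_id):
        # Deletes go to wherever the blob currently lives, hot or warm.
        logical_volume = blob_id.split(":")[0]
        return (self.volume_map[logical_volume].physical_hosts, blob_id)

router = Router({"v42": VolumeLocation(["store-host-7"], is_warm=True)})
print(router.route_read("v42:1001"))        # ('store-host-7', 'v42:1001')
```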
The controller is responsible for making sure all things run smoothly by provisioning new store machines, ensuring logical volumes have enough free space etc. The transformer tier takes care of computation over the blob that was read, such as resizing or cropping it. Blob storage also consists of a caching layer which prevents read requests for popular blobs from reaching the storage system.
Blobs are aggregated together in units known as logical volumes, which keeps the amount of file-system metadata that needs to be tracked per blob small. Volumes are created in an unlocked state where they support CRD (create, read, delete) operations. Once they reach a size threshold, they are transitioned to a locked state supporting only read & delete operations. A single volume consists of a data file, an index file & a journal file. The data & index files are the same as those generated by Haystack whereas the journal file keeps track of delete operations. For a locked volume, the journal file is the only one that allows write operations.
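Below is a toy model of a volume & its three files to make the unlocked → locked transition concrete. The size threshold & the in-memory representation are assumptions; only the roles of the data, index & journal files come from the design described above:

```python
# Toy volume model; the threshold & structures are illustrative assumptions.
class Volume:
    LOCK_THRESHOLD_BYTES = 100 * 2**30      # hypothetical cap that locks the volume

    def __init__(self, volume_id):
        self.volume_id = volume_id
        self.locked = False
        self.data_file = bytearray()         # append-only blob payloads (as in Haystack)
        self.index_file = {}                 # blob id -> (offset, length) for fast lookup
        self.journal_file = []               # delete markers; only writes allowed after locking
        self.size = 0

    def create(self, blob_id, payload: bytes):
        if self.locked:
            raise RuntimeError("locked volumes accept only read & delete operations")
        self.index_file[blob_id] = (len(self.data_file), len(payload))
        self.data_file.extend(payload)
        self.size += len(payload)
        if self.size >= self.LOCK_THRESHOLD_BYTES:
            self.locked = True               # transition: CRD -> read/delete only

    def read(self, blob_id):
        if blob_id in self.journal_file:     # deleted blobs are filtered out
            return None
        offset, length = self.index_file[blob_id]
        return bytes(self.data_file[offset:offset + length])

    def delete(self, blob_id):
        # Deletes are recorded in the journal file even after the volume is locked.
        self.journal_file.append(blob_id)
```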
For hot storage, a logical volume is replicated twice within a datacenter but on different racks. This provides fault-tolerance against rack failure but is still subject to a datacenter outage. To overcome this, a third replica is stored in another datacenter, which overall provides better tolerance against failures. However, this leads to a higher replication factor, which means higher storage cost. Even though incurring this cost seems alright for hot storage, an application where most of the content eventually ends up as warm needs to optimize its storage cost. This is where f4 as a warm storage system comes into the picture.
f4 design
This is the core of the paper, i.e. the warm storage system. The main goal of warm storage is to improve storage efficiency without compromising the fault-tolerance that Haystack provided. f4 introduces the abstraction of a cell which is responsible for storing a number of locked volumes with low storage overhead. A cell serves only read & delete operations.
Erasure coding
The low storage overhead is the result of a technique that f4 implements called erasure coding. Erasure coding is a very interesting concept where instead of storing multiple replicas, parity blocks are stored alongside the original data. These parity blocks can be used to reconstruct the data in case some of the blocks are lost. f4 uses Reed-Solomon coding to implement erasure coding. So n blocks of data are encoded with k parity blocks, which in turn means that the system can tolerate up to k block failures. The n + k blocks together form a stripe. (Note: I will cover erasure coding in a separate post in the future.)
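As a back-of-the-envelope illustration of why this saves space, here is the effective replication factor of a Reed-Solomon(n, k) stripe; n = 10, k = 4 is the combination the f4 paper reports, but treat the numbers as an illustration rather than a full accounting of the system’s overhead:

```python
def effective_replication_factor(n_data: int, k_parity: int) -> float:
    """Physical bytes stored per logical byte for one erasure-coded stripe."""
    return (n_data + k_parity) / n_data

# Reed-Solomon(10, 4): any 4 of the 14 blocks in a stripe can be lost
# & the 10 data blocks can still be reconstructed.
print(f"Reed-Solomon(10, 4): {effective_replication_factor(10, 4):.1f}x")  # 1.4x in one cell
print("3-way replication  : 3.0x")  # what plain triple replication pays for the same data
```

A second full copy of the cell in another datacenter would double this to 2.8x, which is what motivates the XOR scheme discussed in the replication section below.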
Cell architecture
Now let us look at the cell architecture of f4 in more detail & see what components end up forming a cell.
The name node is responsible for maintaining the mapping between data blocks & parity blocks. It also stores the mapping to the actual storage nodes that hold these blocks. Storage nodes are responsible for serving the requests from the router tier. They expose two APIs: an Index API that provides location information about a volume & a File API that provides access to data. Storage nodes keep the information about the blobs they manage in memory using a custom data structure.
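The post only says this in-memory structure is custom, so the sketch below merely guesses at the minimum a storage node needs per blob to answer the Index API without touching disk; the field names are assumptions:

```python
from collections import namedtuple

# Assumed fields; the real structure is a custom, memory-efficient layout.
BlobIndexEntry = namedtuple("BlobIndexEntry", ["volume_id", "offset", "length"])

class StorageNodeIndex:
    def __init__(self):
        self._entries = {}                    # blob id -> BlobIndexEntry

    def load_volume(self, volume_id, index_records):
        # index_records: (blob_id, offset, length) tuples read once from the
        # volume's index file at startup, so lookups never hit the disk index.
        for blob_id, offset, length in index_records:
            self._entries[blob_id] = BlobIndexEntry(volume_id, offset, length)

    def locate(self, blob_id):
        """Index API: does the blob exist & where inside which volume does it live?"""
        return self._entries.get(blob_id)
```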
Blobs in f4 are encrypted with a per-blob key, & this comes in handy during delete operations: only the encryption key is deleted as part of processing the delete request. This makes the blob unreadable immediately, & the space it occupies can be reclaimed later.
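Here is a toy illustration of this delete-by-key-deletion idea. The XOR "cipher" & the in-process key store are stand-ins for whatever f4 actually uses; the point is only that dropping the per-blob key makes the stored bytes unreadable without rewriting the locked volume:

```python
import os
from itertools import cycle

class EncryptedBlobStore:
    """Toy sketch: deleting a blob means deleting its encryption key."""

    def __init__(self):
        self.key_store = {}    # blob id -> per-blob key (kept outside the volume)
        self.volume = {}       # blob id -> encrypted bytes, never rewritten in place

    @staticmethod
    def _xor_cipher(data: bytes, key: bytes) -> bytes:
        # Stand-in for a real cipher, purely for illustration.
        return bytes(b ^ k for b, k in zip(data, cycle(key)))

    def put(self, blob_id, payload: bytes):
        key = os.urandom(32)
        self.key_store[blob_id] = key
        self.volume[blob_id] = self._xor_cipher(payload, key)

    def read(self, blob_id) -> bytes:
        key = self.key_store.get(blob_id)
        if key is None:
            raise KeyError(f"{blob_id} has been deleted")
        return self._xor_cipher(self.volume[blob_id], key)

    def delete(self, blob_id):
        # Only the key is removed; the encrypted bytes stay in the locked
        # volume & become unreadable, to be reclaimed later.
        self.key_store.pop(blob_id, None)
```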
Reads are handled in two different manners. In the normal case, the blob is read from the storage node that holds it through the File API. For scenarios where there is a failure, the blob is reconstructed from the remaining blocks of its stripe using erasure coding. Note that this reconstruction is done outside the storage nodes, so they are free to serve normal storage operations.
As erasure-code reconstruction is a CPU heavy operation, the router layer sends failure-case reads to a backoff node, which exposes a File API similar to the storage nodes. A backoff node rebuilds only the missing blob that was requested, which is much quicker than rebuilding an entire block. When a failure impacts a complete block, such as a disk failure, the reconstruction is handled by rebuilder nodes, which are CPU heavy nodes that rebuild the full block. Rebuilder nodes detect failures by probing storage blocks & report them to a coordinator node. Coordinator nodes are responsible for maintenance tasks such as block rebuilding and rebalancing.
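The two read paths could be wired together roughly as below. The node classes are stand-ins rather than f4’s real RPC interfaces; the only behaviour taken from the design is that normal reads hit the storage node’s File API & failure reads fall back to a backoff node that reconstructs just the requested blob:

```python
class StorageNode:
    def __init__(self, blobs):
        self.blobs = blobs                          # blob id -> bytes

    def file_api_read(self, blob_id) -> bytes:
        if blob_id not in self.blobs:
            raise IOError("block unavailable")      # e.g. disk or host failure
        return self.blobs[blob_id]

class BackoffNode:
    """Rebuilds only the requested blob from the surviving blocks of its stripe."""
    def __init__(self, reconstruct):
        self.reconstruct = reconstruct              # callable doing the erasure decode

    def file_api_read(self, blob_id) -> bytes:
        return self.reconstruct(blob_id)

def read_blob(storage_node, backoff_node, blob_id) -> bytes:
    try:
        # Normal case: served directly via the storage node's File API.
        return storage_node.file_api_read(blob_id)
    except IOError:
        # Failure case: fall back to a backoff node; rebuilding whole blocks
        # is left to the rebuilder nodes in the background.
        return backoff_node.file_api_read(blob_id)
```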
Replication & fault-tolerance
A single cell in f4 lives in a single datacenter & hence is not tolerant to a datacenter failure. The initial approach by Facebook was to replicate each cell across datacenters. Though this makes a cell fault-tolerant, it doubles the storage cost to guard against a failure mode that is not so common. Hence f4 came up with an optimization where instead of replicating the complete cell, it replicates the XOR encoding of blocks across cells. So if two blocks are situated in two different datacenters, their XOR is stored in a third datacenter.
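A minimal sketch of the XOR idea follows (block sizes & placement details are simplified): two blocks living in two datacenters have their XOR stored in a third, & any one of the three sites can be lost & recovered from the other two:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks (real blocks would be padded to a fixed size)."""
    return bytes(x ^ y for x, y in zip(a, b))

block_dc1 = b"block stored in datacenter 1"
block_dc2 = b"block stored in datacenter 2"
xor_dc3 = xor_blocks(block_dc1, block_dc2)      # stored in datacenter 3

# If datacenter 1 is lost, its block is recovered from the surviving two sites.
recovered = xor_blocks(block_dc2, xor_dc3)
assert recovered == block_dc1
```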
f4 thus provides fault-tolerance against four levels of failure, i.e. disk, host, rack & datacenter failure. It does this by distributing the data & parity blocks of a stripe across racks, which insulates the underlying blob from disk, host & rack failures. As we have seen above, f4 uses XOR encoding across datacenters to tolerate a datacenter failure. This way it achieves the required fault-tolerance while improving storage efficiency.
Conclusion
Similar to Haystack, simplicity of design is at the core of f4. Under the hood, f4 is built on top of HDFS & leverages erasure coding to improve storage efficiency. This kind of hierarchical storage architecture has proven to be effective in scaling storage systems & we saw a similar approach in Kora where older data was pushed to S3.
It is interesting to see the improvements made to a system like Haystack that we saw in the previous post. Though this is not the end & there is a final iteration coming up. Stay tuned.
References