In the last two posts we saw how Facebook created Haystack as its blob-storage solution & then further improved it with a tiered-storage solution called f4. Facebook also has a data warehousing system that runs a large number of HDFS nodes. These are two main specialized systems that fulfill separate needs at Facebook.
In this post we will take a look at a centralized solution that fulfills both of the above needs, known as Tectonic. Tectonic is an exabyte-scale distributed filesystem which provides performance comparable to both the blob-storage & the data warehouse systems. This reduces the complexity of maintaining & improving multiple specialized systems & improves the resource utilization of the overall system. A single Tectonic cluster spans an entire datacenter due to its exabyte scale, so tenants using blob-storage & data warehouse can be hosted on the same cluster.
Need for a centralized system
Blob-storage & data warehouse are both specialized systems that solve a specific problem. Each has its own requirements, but both fall short in terms of efficient resource utilization in their own way.
Problem with blob-storage
Blob-storage needs to fulfill read/write operations with minimal latency as it is often used by interactive applications. A spike in latency ends up degrading the experience of a user who is trying to upload a photo or view a video in the Facebook application. Also, the blobs being stored are often small in size.
To meet this requirement, extra machines need to be provisioned as blob-storage ends up becoming IOPS-bound. The IOPS demand per terabyte of storage is uneven, as IOPS are mostly concentrated on a small set of blobs. As we saw in the f4 post, blobs are stored either in Haystack as hot blobs or in warm f4 storage. Because blobs in f4 storage are accessed less frequently, its IOPS capacity ends up getting wasted. This is where blob-storage falls short: IOPS under-utilization.
Problem with data warehouse
A data warehouse stores a large amount of data as the underlying use cases are mostly in the analytics domain. Therefore both read & write operations work with larger amounts of data compared to blob-storage. HDFS clusters are used for storing the data, which results in a lot of operational overhead. The data being stored often needs to be partitioned across multiple HDFS clusters, which means maintaining a large amount of metadata & also introduces a bin-packing problem.
Considering the various challenges of these specialized systems, along with the operational overhead of maintaining them, the Tectonic filesystem was developed to solve both use cases while making the most of the underlying resources. Having a single system makes it easier to shift resources to where they are needed, which is not possible with separately deployed specialized systems such as blob-storage & data warehouse.
Architecture
In this section we take a look at the architecture & implementation details of the Tectonic filesystem. This will help us understand how Tectonic solves the problems introduced by specialized storage systems.
High-level overview
Tectonic is deployed in a datacenter as a unit called a cluster. Each cluster is resilient to host, rack & power failures. Clusters can be replicated across datacenters to provide protection against a datacenter failure.
Each cluster consists of storage nodes known as the chunk store, metadata nodes & nodes which perform background operations such as garbage collection, health checks etc. The client library interacts with both storage & metadata nodes, but metadata nodes do not communicate with chunk nodes & vice-versa.
Due to the exabyte scale of Tectonic, a single cluster can cater to multiple tenants. This means a single cluster can serve requests for blob-storage tenants as well as data warehouse tenants. Tenants do not share data among themselves & each tenant owns its own namespace. Applications under these tenants interact with Tectonic through a filesystem API which can be configured at runtime.
Underneath, Tectonic is powered by a microservice-based architecture where both the chunk store & the metadata store run services to read/write data or metadata. Clients interact with these services to communicate with the Tectonic filesystem. Background services run on separate nodes & take care of maintenance work. Let us dive deeper into each of these components individually.
Chunk store
The chunk store is where the data resides in the Tectonic filesystem. As the name suggests, the unit of storage in the chunk store is a chunk. Multiple chunks form a block & multiple blocks make up a Tectonic file. So the storage hierarchy looks as below:
Chunks -> Blocks -> File
The number of chunks in the store grows with the number of nodes allocated to it & therefore it can scale to exabytes. The chunk store does not have any knowledge of the higher-level abstractions of blocks & files. These abstractions are handled by the client using the metadata store. This simplifies the design of the chunk store & gives the client control over what unit it wants to use for storage based upon its specific use case.
Chunks are stored on nodes, each running XFS, & the nodes expose typical IO APIs to get, put, append or delete a chunk. A block is a logical unit of storage expressed as an array of bytes & it abstracts away the raw primitives of data storage. Tectonic provides block-level durability either by RS-encoding or by replication.
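The hierarchy above can be sketched as a small data model. This is a minimal illustration with hypothetical names, not Tectonic's actual data structures: a file points to blocks, a block points to chunks, and only the chunk level is known to the chunk store.

```python
# Sketch of Tectonic's storage hierarchy (hypothetical names): a file is a
# list of blocks, and each block is a list of chunks. The chunk store only
# knows about chunks; blocks & files are abstractions backed by metadata.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    chunk_id: str
    disk: str          # physical location: an individual disk on a storage node

@dataclass
class Block:
    block_id: str
    chunks: List[Chunk] = field(default_factory=list)

@dataclass
class File:
    path: str
    blocks: List[Block] = field(default_factory=list)

f = File("/tenant-a/photos/1.jpg",
         blocks=[Block("b0", [Chunk("c0", "node1:disk3"),
                              Chunk("c1", "node2:disk7")])])
chunk_count = sum(len(b.chunks) for b in f.blocks)
```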
Metadata store
The metadata store is responsible for storing the filesystem hierarchy & the metadata which maps blocks to individual chunks. Metadata is divided into naming, file & block layers, where each layer is hash-partitioned. Underneath, ZippyDB takes the responsibility of storing this data in key-value format. It is a good example of leveraging a well-maintained abstraction for a distributed-systems problem instead of building something from the ground up. ZippyDB takes over the responsibility of replication & fault-tolerance.
The name layer maps each directory to its sub-directories or files, the file layer maps a file to its list of blocks & the block layer maps each block to a list of chunk locations (individual disks). The three layers are hash-partitioned by directory id, file id & block id respectively.
At scale, such metadata stores can be impacted by hotspots. This is where Tectonic's layered metadata comes in handy & avoids hotspot issues: directory listings are served by one layer (the name layer) while file & block lookups are served by different layers, each partitioned independently.
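The three layers can be sketched as independent key-value maps, each sharded by its own key. This is a simplified illustration with a hypothetical key layout, not ZippyDB's actual schema:

```python
# Sketch of the three metadata layers as key-value maps (hypothetical layout).
# Each layer is hash-partitioned independently by its own key, so a hot
# directory listing lands on a different shard space than hot block lookups.
import hashlib

name_layer = {"dir:/tenant-a/photos": ["file:1.jpg"]}        # dir -> entries
file_layer = {"file:1.jpg": ["block:b0", "block:b1"]}        # file -> blocks
block_layer = {"block:b0": ["node1:disk3", "node2:disk7"]}   # block -> chunk disks

def shard_for(key: str, num_shards: int = 1024) -> int:
    # Deterministic hash partitioning of a layer key onto a shard.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

# Resolving a file to physical disks walks file layer -> block layer.
blocks = file_layer["file:1.jpg"]
disks = block_layer[blocks[0]]
```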
Tectonic allows blocks, files & directories to be sealed (made immutable), which helps the metadata store as it can cache the associated metadata to improve performance. As the underlying value is immutable, there is no risk of inconsistency while caching metadata. Tectonic also guarantees strong consistency for metadata operations & relies on ZippyDB for fulfilling these guarantees.
Client library
The client library is responsible for orchestrating the chunk & metadata store services. It performs read/write operations at chunk granularity by first contacting the metadata store for chunk information & then communicating with the chunk store for the actual data. For read operations, it identifies the disks on which the chunks are located & then reads from those disks. For write operations, it communicates with the metadata store to first allocate the required chunk locations & then writes to those locations.
Tectonic allows a single writer per file, which means clients can write directly to storage nodes without tackling the complexity of serializing writes from multiple writers. It does this by assigning a write token to each file, which it validates at the time of the operation.
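The single-writer scheme can be sketched as follows. This is a hypothetical simplification of the token mechanism: opening a file for append issues a fresh token that invalidates the previous writer's token, and a write is accepted only if its token is still current.

```python
# Sketch of single-writer enforcement via a per-file write token (hypothetical
# API). A new writer taking over a file invalidates the old writer's token,
# so stale writers are rejected without any write serialization machinery.
import itertools

_token_counter = itertools.count(1)
current_token = {}  # file path -> latest valid token

def open_for_append(path: str) -> int:
    token = next(_token_counter)
    current_token[path] = token   # any older token for this file is now stale
    return token

def write(path: str, token: int, data: bytes) -> bool:
    if current_token.get(path) != token:
        return False              # stale writer rejected
    return True                   # (real system would append to storage here)

t1 = open_for_append("/f")
t2 = open_for_append("/f")        # a second writer takes over the file
ok_new = write("/f", t2, b"x")
ok_old = write("/f", t1, b"y")
```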
Background services
Background services make sure the Tectonic system runs smoothly & perform maintenance tasks. They rebalance data across storage nodes, maintain consistency within the metadata store, repair lost or corrupted data & publish metrics around filesystem usage. These services operate on the same layers as the metadata store.
The garbage collector is one such service that cleans up inconsistencies in metadata & removes objects marked for deletion through a lazy deletion process. The rebalancer & repair service work together to relocate or delete chunks: the rebalancer identifies these chunks whereas the repair service handles the actual data movement.
Multi-tenancy
Tectonic faces two major challenges in supporting multiple tenants on its filesystem. First, each tenant should get its fair share of resources & no single tenant should cause a noisy-neighbor problem. Second, tenants expect performance similar to that of their specialized system. In this section we will see how Tectonic overcomes these two challenges in a multi-tenant system.
Storage capacity is managed at tenant granularity & each tenant gets a specified quota with strict isolation guarantees. There is a manual process in place to reconfigure this quota, & tenants are responsible for tracking their own storage consumption.
Ephemeral resources such as IOPS & metadata query capacity are managed within each tenant at the granularity of groups of applications known as TrafficGroups. Applications under a TrafficGroup have the same type of resource & latency requirements. Each TrafficGroup is assigned a TrafficClass that captures its latency requirement & decides which requests should get spare resources. TrafficClasses are categorized as GOLD, SILVER or BRONZE depending on whether the application is latency-sensitive, normal or a background application. Any surplus of resources within a tenant is shared among its TrafficGroups in descending order of TrafficClass. If there is any surplus remaining, it is shared with TrafficGroups of other tenants.
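The surplus-sharing order can be sketched as below. The allocation function and its inputs are hypothetical; the only property taken from the text is that leftover capacity goes to TrafficGroups in descending order of TrafficClass.

```python
# Sketch of surplus resource distribution within a tenant (assumed policy):
# spare IOPS are offered to TrafficGroups in descending TrafficClass order,
# GOLD before SILVER before BRONZE.
CLASS_RANK = {"GOLD": 0, "SILVER": 1, "BRONZE": 2}

def distribute_surplus(surplus: int, groups: list) -> dict:
    """groups: list of (name, traffic_class, extra_demand).
    Returns the extra IOPS granted to each group."""
    grants = {}
    for name, tclass, demand in sorted(groups, key=lambda g: CLASS_RANK[g[1]]):
        grant = min(surplus, demand)   # higher classes are satisfied first
        grants[name] = grant
        surplus -= grant
    return grants

grants = distribute_surplus(100, [("bg-scan", "BRONZE", 80),
                                  ("photo-serve", "GOLD", 70),
                                  ("batch", "SILVER", 40)])
# GOLD is fully satisfied, SILVER gets the remainder, BRONZE gets nothing here.
```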
The client library makes use of a rate-limiter to ensure tenants use their fair share of resources. For each request, the client library first checks if there is spare capacity in its own TrafficGroup, if not then in TrafficGroups under the same tenant & then with other tenants. If the client finds spare capacity, it routes the request to the backend; otherwise it delays or rejects the request.
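The client-side admission check can be sketched as a three-step lookup. The counters and return labels here are hypothetical; the lookup order (own group, then same tenant, then other tenants, else delay/reject) is from the text.

```python
# Sketch of the client-side rate-limiter decision (hypothetical counters):
# spare capacity is searched in the caller's own TrafficGroup first, then in
# sibling groups of the same tenant, then across tenants.
def admit(own_group: str, tenant_groups: dict, other_tenants: dict) -> str:
    if tenant_groups.get(own_group, 0) > 0:
        return "own-group"
    if any(v > 0 for g, v in tenant_groups.items() if g != own_group):
        return "same-tenant"
    if any(v > 0 for v in other_tenants.values()):
        return "other-tenant"
    return "delay-or-reject"      # no spare capacity anywhere

decision = admit("photo-serve",
                 tenant_groups={"photo-serve": 0, "batch": 5},
                 other_tenants={"warehouse": 10})
```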
Metadata & storage nodes also need to manage resources & for doing so they use a weighted round-robin scheduler that skips a TrafficGroup if it has exceeded its quota. Requests from a higher TrafficClass can also jump the queue to improve the latency of latency-sensitive applications. To make sure tenant boundaries are respected, Tectonic uses token-based authorization where client requests are authorized by an authorization service that issues a token. Each layer in the system verifies the token & the token's payload describes which resource it authorizes access to. This way Tectonic serves multiple tenants on a single cluster while ensuring optimal resource utilization.
Tenant specific optimizations
A Tectonic cluster can support around ten tenants, where each tenant can have different requirements around storage & query performance. Giving clients control over reading data at the most granular level, i.e. chunks, allows them to perform operations based on their requirements. Clients also have control over configurations such as durability on a per-call basis. These two mechanisms permit optimizations for the two specific classes of tenants we have looked at till now.
Optimization for Data warehouse
Data is usually written once but read multiple times in a typical data warehouse application. As the amount of data being written is usually large, applications prefer a lower write time. For this requirement, Tectonic makes use of the write-once-read-many pattern to improve write time.
Applications first buffer writes up to the size of a block & then the block is RS-encoded for durability, with its chunks spread across fault domains. RS-encoding instead of plain replication saves storage space & disk IO as the amount of data written to disk is lower. This pattern also allows applications to write the blocks of a file in parallel, which in turn decreases the overall write time. Once all blocks are written, the file metadata is updated. There is no scope for inconsistency as the file becomes visible only after it is completely written.
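The buffer-then-encode step can be sketched as below. Real Tectonic uses Reed-Solomon codes such as RS(9, 6); here a single XOR parity chunk stands in for the parity computation, which is enough to show the idea of splitting a buffered block into data chunks plus redundancy before writing.

```python
# Sketch of the write-once pattern: the client buffers a full block, splits it
# into data chunks & derives parity before writing. A single XOR parity chunk
# is a simplified stand-in for real RS encoding.
def encode_block(block: bytes, num_data: int) -> list:
    size = -(-len(block) // num_data)   # ceiling division for chunk size
    chunks = [block[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(num_data)]
    parity = bytearray(size)
    for c in chunks:                    # XOR of all data chunks
        for i, b in enumerate(c):
            parity[i] ^= b
    return chunks + [bytes(parity)]

chunks = encode_block(b"warehouse-block-data", num_data=4)
# 4 data chunks plus 1 parity chunk; any single lost data chunk can be
# rebuilt by XOR-ing the parity with the surviving data chunks.
```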
For writes, Tectonic makes use of a hedged quorum where it first sends reservation requests instead of sending the data chunks. Only once it has received confirmation of a reservation does it send the data chunks over the network. This avoids sending chunks to nodes which have insufficient resources. For example, the client library sends reservation requests to 19 nodes for an RS(9, 6) encoded block, which is 4 more than the 15 required nodes. The client then writes the 9 data & 6 parity chunks to the first 15 nodes that respond to the reservation request. This form of hedged quorum contributes to a better tail latency for large writes in a data warehouse.
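The reservation step above can be sketched as picking the fastest responders. The node names and latency values are made up for illustration; the flow (reserve on 19, write to the first 15 that confirm) is from the text.

```python
# Sketch of the hedged reservation step for an RS(9, 6) block: reserve space
# on 15 + 4 = 19 nodes, then write the 15 chunks to whichever 15 nodes
# confirm first, dodging the slowest or busiest nodes.
def pick_write_targets(responses: list, needed: int = 15) -> list:
    """responses: (node, latency_ms) pairs for nodes that confirmed a reservation."""
    fastest = sorted(responses, key=lambda r: r[1])[:needed]
    return [node for node, _ in fastest]

latencies = [5, 9, 3, 40, 7, 6, 8, 2, 11, 4, 12, 10, 13, 50, 14, 15, 16, 17, 1]
responses = [(f"node{i}", ms) for i, ms in enumerate(latencies)]  # 19 nodes
targets = pick_write_targets(responses)
# The 4 slowest responders (node13 at 50ms, node3 at 40ms, ...) are dropped.
```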
Optimization for Blob-storage
Blob-storage poses a challenge for a typical filesystem because a huge number of small objects need to be indexed for quick retrieval. Tectonic stores multiple blobs in a single log-structured file & the blob metadata maps a blob id to the location of the blob within the file. As blob-storage is latency-sensitive, Tectonic's client library confirms a write after a subset of storage nodes acknowledge it. The acknowledging nodes span separate failure domains, so the write is tolerant to failures even with a partial quorum.
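The log-structured layout can be sketched as an append-only byte log with an offset index. The class and its format are hypothetical; it only illustrates the mapping of blob id to (offset, length) within a shared file.

```python
# Sketch of a log-structured blob file (hypothetical format): blobs are
# appended to one shared file & an index maps each blob id to its
# (offset, length), so a read is a single seek + sized read.
class BlobLog:
    def __init__(self):
        self.log = bytearray()
        self.index = {}                      # blob_id -> (offset, length)

    def put(self, blob_id: str, data: bytes) -> None:
        self.index[blob_id] = (len(self.log), len(data))
        self.log += data                     # append-only write

    def get(self, blob_id: str) -> bytes:
        off, length = self.index[blob_id]
        return bytes(self.log[off:off + length])

log = BlobLog()
log.put("photo-1", b"jpegdata")
log.put("photo-2", b"mp4data")
blob = log.get("photo-2")
```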
Blobs are also RS-encoded, similar to the way it was done in f4. But instead of encoding after each blob append, the client library encodes the complete block after it is sealed. Encoding on every blob append would result in more IOPS, as you end up writing small amounts of data to multiple disks, making the overall operation IO-inefficient.
In this way Tectonic supports fast read/write operations for blobs while making the most of the underlying resources.
Conclusion
The journey from Haystack to f4 to Tectonic teaches a very important lesson: solve the problem at hand correctly first & optimize later. It is very difficult to think of a Tectonic-like system while building Haystack, as you don't know how the system is going to evolve. It is only when you deploy it in production & study the areas where the system falls short that you can come up with a better solution.
Stepping back & looking at systems like f4 & the data warehouse makes you realize their shortcomings & the common patterns on which both are built. This is the first series of technical publications where you get a chance to witness the journey of a product along with its various weaknesses & strengths. Happy learning!
References