Photo storage is one of the key functionalities of any social media platform & Facebook is no exception. But at the scale at which Facebook operates, retrieving a photo from storage can be analogous to finding a needle in a haystack. At the time of the paper, Facebook was storing 260 billion images & growing by around 1 billion new photos each week. So whenever a user asks for a specific photo, a cost-effective technique is required to retrieve it with minimal latency, & this led to the creation of Haystack.
Existing storage system
Facebook generates & stores 4 different-sized images for each photo uploaded to the social media platform. To store & retrieve these images, Facebook built an object store known as Haystack. But before jumping into the internals of Haystack, let us try to understand what led to its creation & why the existing solution failed at scale.
Initially they used network-attached storage accessed over NFS. But that results in around 3 disk operations to retrieve a photo: finding the inode that maps to the given filename, reading the inode from disk & then finally reading the actual image file from disk. This degrades the throughput & latency of photo retrieval, which leads to an unpleasant user experience. Even using a CDN didn't resolve the latency problem, as CDNs only take care of hot keys, whereas a social media application should be prepared to serve older content that has not been queried in a long time.
The solution replacing the existing system needed to decrease retrieval latency & at the same time remain fault-tolerant, cost-effective & simple enough to reason about, as it facilitates such a core functionality of the social media application. Haystack ends up checking all the boxes, & that is why Facebook built it to replace their existing photo-serving system.
Haystack Design
With Haystack, Facebook uses a CDN to serve popular images, while Haystack helps retrieve older content using fewer disk seeks & less memory spent on storing filesystem metadata. The Haystack architecture consists of 3 major components:
- Haystack Store: Encapsulates the storage & manages filesystem-level metadata for photos.
- Haystack Directory: Maintains the mapping of logical volumes to physical volumes.
- Haystack Cache: Acts as an internal CDN & avoids unnecessary calls to the Haystack Store.
Core Components
In this section we will dive deeper into the three Haystack components, starting with the Haystack Directory, which performs four main operations:
- Providing the mapping from logical to physical volumes
- Balancing write load across logical volumes & read load across physical volumes
- Deciding whether a request should be fulfilled by the CDN or by the Cache
- Identifying read-only logical volumes
Whenever a machine reaches its capacity, its volumes are marked read-only & it serves only read traffic. Newly provisioned machines are write-enabled & support upload (write) operations. If a machine goes down, its mapping is removed from the directory, & an entry is added when a new machine is provisioned.
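The Directory's bookkeeping described above can be sketched as a small in-memory structure. This is a toy illustration, not the paper's implementation; the class & method names are our own assumptions.

```python
import random

class HaystackDirectory:
    """Toy sketch of the Haystack Directory's mapping (names are assumptions)."""

    def __init__(self):
        # logical volume id -> list of physical machines holding a replica
        self.logical_to_physical = {}
        # logical volume ids that still accept uploads
        self.write_enabled = set()

    def add_machine(self, logical_id, machine_id):
        # newly provisioned machines come up write-enabled
        self.logical_to_physical.setdefault(logical_id, []).append(machine_id)
        self.write_enabled.add(logical_id)

    def mark_read_only(self, logical_id):
        # volume reached capacity: serve only read traffic from now on
        self.write_enabled.discard(logical_id)

    def remove_machine(self, machine_id):
        # machine went down: drop it from every logical volume's mapping
        for machines in self.logical_to_physical.values():
            if machine_id in machines:
                machines.remove(machine_id)

    def pick_write_volume(self):
        # balance uploads across the still-writable logical volumes
        return random.choice(sorted(self.write_enabled))
```

Here the random pick stands in for whatever load-balancing policy the real directory applies.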
The Haystack Cache is a distributed hash table with the photo id as the key, serving requests both from the CDN & directly from users. If the cache does not contain the key, it fetches the photo from the Haystack Store & then returns it as part of the response.
For the queried object, the cache stores the image against the key only if the request came directly from a user (not from the CDN) & the photo was fetched from a write-enabled store. If the request came from the CDN, the CDN will cache the photo anyway, so the next request won't even reach the cache; & we also want to reduce the amount of read traffic hitting write-enabled stores. If the request is fulfilled by a read-only store, it can be directed to that store next time too, as it doesn't impact write traffic. Caching photos from write-enabled stores also maps to a user behavior where an image is most heavily read immediately after it is uploaded (verifying if the filter ended up working as we expected :P).
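The two conditions above boil down to a simple admission rule. A minimal sketch (the function name is ours, not the paper's):

```python
def should_cache(request_from_cdn: bool, store_is_write_enabled: bool) -> bool:
    """Sketch of the Cache's admission rule described above.

    Cache the photo only when the request came directly from a user (the CDN
    caches its own hits anyway, so CDN-originated keys would never be asked
    for again) AND the photo was served by a write-enabled store (those are
    the machines whose read traffic we most want to shield, since freshly
    uploaded photos are read most heavily right after upload).
    """
    return (not request_from_cdn) and store_is_write_enabled
```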
The Haystack Store exposes a very simple interface for querying photos. A request needs to specify the photo id along with the logical volume & the physical machine from which the photo needs to be read. If the photo is not found, the store returns an error. All the metadata needed for the request is initially queried from the Haystack Directory.
A store machine consists of physical volumes, each containing a superblock followed by a sequence of needles. Each needle represents a photo, & to locate needles the store machine keeps an in-memory mapping from photo id to needle offset. Once a needle is mapped to the requested key, the store can retrieve the photo in just one disk seek.
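The one-seek lookup can be sketched as follows. Note this is a heavy simplification: a real needle carries a header, cookie, flags & checksum, while here we keep only a photo id & the data size to show the idea of resolving all metadata in memory before touching the volume.

```python
import io
import struct

# Simplified needle layout: photo_id (8 bytes) + data size (4 bytes) + data.
HEADER = struct.Struct("<QI")

def append_needle(volume, index, photo_id, data):
    """Append a needle to the volume file & record it in the in-memory index."""
    offset = volume.seek(0, io.SEEK_END)          # needles are append-only
    volume.write(HEADER.pack(photo_id, len(data)))
    volume.write(data)
    index[photo_id] = (offset, len(data))         # in-memory mapping

def read_photo(volume, index, photo_id):
    """Resolve metadata purely in memory, then do a single seek + read."""
    offset, size = index[photo_id]                # no disk access for metadata
    volume.seek(offset + HEADER.size)             # the one disk seek
    return volume.read(size)
```

An `io.BytesIO` stands in for the on-disk volume file in the test below; the same calls work against a real file opened in binary mode.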
Journey of a request
Now that we have gone through the various components of the Haystack system, let us see how they work together to fulfill a user request. A request to retrieve a photo is first directed to the Haystack Directory, which translates it into a URL of the below format:
http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩
The first part of the URL describes which CDN to contact for the photo. In case the CDN doesn't contain the photo, the URL is modified to remove the CDN information & the Cache machine is contacted. The Cache acts in the same manner: if the photo is not found, it removes the cache information & contacts the logical volume on the machine specified by the machine id. The store machine uses this metadata to look up its in-memory mapping. If the deleted flag is not set, it performs the disk seek to fetch the complete needle & returns the photo.
When the app server gets a request to upload a photo, it reaches out to the directory to get a write-enabled logical volume. It then assigns a unique id to the photo & uploads it to the physical volumes mapped to that logical volume. The store machine appends the needle for this image to its volume file & updates its in-memory mapping. Haystack does not support modifying a needle in place, so any form of update results in a new needle. This essentially means one photo can be mapped to multiple needles, & Haystack picks the needle with the highest offset so that it retrieves the photo with the latest changes. Deleting a photo is done by setting the deleted flag in the in-memory mapping as well as in the volume file.
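The append-only update & delete semantics can be sketched like this. Needles are modeled as plain dicts in a Python list rather than the on-disk format, so this is only an illustration of the "highest offset wins" rule:

```python
class StoreVolume:
    """Toy sketch of a store machine's append-only volume (names are ours)."""

    def __init__(self):
        self.needles = []   # the append-only volume file
        self.index = {}     # photo_id -> offset of its newest needle

    def upload(self, photo_id, data):
        # an update is just a new needle; the in-memory index always points
        # at the needle with the highest offset, i.e. the latest version
        self.needles.append({"id": photo_id, "data": data, "deleted": False})
        self.index[photo_id] = len(self.needles) - 1

    def delete(self, photo_id):
        # deletion only sets a flag; space is reclaimed later by compaction
        self.needles[self.index[photo_id]]["deleted"] = True

    def read(self, photo_id):
        needle = self.needles[self.index[photo_id]]
        return None if needle["deleted"] else needle["data"]
```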
Handling failures & Optimizations
The store machine uses an index file to reconstruct its in-memory mapping for a faster reboot. Without it, a new machine would need to read through all of its physical volumes, which would be time-consuming.
Index files are updated asynchronously whenever there is a new upload, & they are not updated at all for delete operations. This allows for better response times on writes, but it also means the index file cannot be considered a source of truth. Needles without an index entry are known as orphans, & they are easy to detect: the last record in the index file maps to the last non-orphan needle, so any needle appearing after it in the volume is an orphan. During restart, these orphan needles are appended to the index file.
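The reboot-time recovery described above can be sketched as follows. Both the index file & the volume are reduced to append-ordered lists of `(photo_id, offset)` pairs, which is a simplification of the real on-disk formats:

```python
def rebuild_mapping(index_records, volume_needles):
    """Sketch of rebooting a store machine from its asynchronously written
    index file. Needles appended after the index's last record ("orphans")
    are recovered by scanning only the tail of the volume.

    Both arguments are lists of (photo_id, offset) pairs in append order.
    """
    mapping = dict(index_records)
    last_indexed = index_records[-1][1] if index_records else -1
    # everything in the volume past the last indexed offset is an orphan
    orphans = [(pid, off) for pid, off in volume_needles if off > last_indexed]
    for pid, off in orphans:
        mapping[pid] = off
        index_records.append((pid, off))  # append orphans back to the index
    return mapping, orphans
```

Because orphans can only sit at the tail of the volume, the scan is bounded by the write lag of the index file rather than the size of the volume.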
Being the backend for such a mission-critical application, Haystack needs to be prepared to handle failures. It runs a background task called pitchfork that regularly checks the health of store machines. If a machine consistently fails these health checks, the process marks all the logical volumes on that machine as read-only.
The store machine also performs compaction to reclaim the space used by deleted & superseded needles, freeing it up for new photos.
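Since updates append new needles & deletes only set a flag, compaction amounts to copying the live needles into a fresh volume. A minimal sketch, reusing the dict-per-needle simplification from earlier rather than the on-disk format:

```python
def compact(needles):
    """Sketch of compaction over an append-ordered list of needle dicts
    (keys: 'id', 'deleted', 'data'). Returns the compacted needle list."""
    latest = {}
    for n in needles:
        latest[n["id"]] = n   # later (higher-offset) needles overwrite earlier ones
    # keep only the newest, non-deleted needle for each photo
    return [n for n in latest.values() if not n["deleted"]]
```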
Conclusion
At the time the paper was written, news feed & albums were the two features that accounted for the majority of photo requests. A system that fulfills such critical functionality needs to be fault-tolerant & simple to understand, & Haystack achieves both goals with a simple design. Understanding user behavior & the underlying system flow is important for building such a system. This knowledge led to decisions such as not caching requests coming from the CDN, or not synchronously updating the index file. The ideas introduced in Haystack can easily be transferred to building a generic object store that aims at fast retrieval times.