Working on File Systems
I’ve been working around storage and file systems within my PhD for the last 3 years, and ended up even creating my own in kernel file system in FreeBSD. From this experience I would like to try and summarize the learning I’ve done throughout as an exercise for myself and hopefully help others in this space.
File systems are difficult as they interact both with chaotic users, and chaotic storage devices. Users often execute actions in sequences not anticipated by developers and devices often lie whether these actions even occur, and data is even saved to disk.
Failures like misdirected writes, SSD bit rot, and code failures can quickly make file systems a nightmare to develop and question whether you made the right choice as a developer to be in this space and your own sanity.
Storage devices are also notorious for “early life failures” where manufacturing inconsistencies can lead to devices failing early in their use. Often storage companies mitigate this by purchasing devices from different factory runs. Imagine buying hundreds of the same factory run of a device only for them to all to fail in the same way due to the same issue. Correlated failures is the death of data.
This chaos is what makes file systems so interesting. It feels like developing on these systems is directing chaos like some form of fire bender.
Currently its expected file systems offer structure. File systems present objects to the user in the form of Inodes, these can be regular files or something like a directory. File systems often are structured in a tree like manner with a root Inode (similar to the root directory “/”) at the top. Data the user does not control or see but is used by the system to organize itself is known as metadata. For example when building larger files its often necessary to build tree like structures to represent a dynamic grow-able structure like a file. This tree’s leaf nodes point directly to data, but its internal nodes are considered metadata.
File systems often sit on top of object storage systems which are agnostic to inter-object structure and focus on intra-object structure. For example a directory is an object in which its data is a list of name-inode pairs. File systems often enforce standards such as POSIX using the API given to them by the underlying object store.
Within storage there are four main points in which developers can design around: Throughput, Latency, Consistency/Fault Tolerance, and Cost.
Throughput is the measurement of how much data can be shoved through before the storage device can’t handle any more. For example depending on the form factor of your SSD, throughput can be be around 850mb/s for SATA3 connections, or close to 2–3GB/s for m.2 connections. This is due to the proximity and type of channels these two different types of SSD’s have. Throughput is also measured in input operations a second as well (IOPS). Having a workload that does hundreds of thousands of small writes a second means you are likely to hit a bottleneck on operations over actual data throughput.
Latency is the measurement of how fast the storage device responds to a request (given that you are not throughput bottle-necked). This is an important metric for types of workloads that do small reads and writes, as often there is some service waiting for this response.
Entire abstractions (the page cache in Linux, and buffer cache in FreeBSD) have been made around reducing latency and load onto disks by caching data into kernel memory. For example, when you read the first part of a file, the kernel reads the file on your behalf and places the data into its cache in memory. If the block is ever read again, then this cached copy is used. Writes are done to this block in memory instead of directly to the device, reducing latency for the user and throughput requirements for the disk.
This cache also allows write to possibly aggregate as small writes can lead to write amplification as the kernel must write out the entire page (4Kb>) to disk rather than just the bytes that were changed. This means small writes are amplified to larger ones by the kernel. This is different for new NVDIMM storage devices, which Ill discuss in later articles.
This abstraction comes at a cost. Users may expect that when they call “write” it means their data is on disk, which is not true, these writes aggregate in memory before being flushed by the kernel or directly by the user using “fsync” or “sync”. This caching layer now introduces a time in which data is volatile leading to the possibility of an inconsistent file system. For example, suppose we have the segment “ABCD” on disk, where each letter represents a block. Suppose now I overwrite “BC” with “EF”, for our file system to be consistent, a reader should only ever read “ABCD” or “AEFD”. Suppose now block E is flushed but the system crashes before flushing block F. The system is now in a state where a reader could potentially read “AECD” which at no point existed in our system. This is one form of inconsistency, and sometimes can be allowed by designers.
Consistency and fault tolerance represents how well your file system stays logically sane. Many operations can leave your file system in an inconsistent state (like the one described above). Another example is moving a file between directory's. This action involves first removing the file within the source and adding it to the target directory. If a crash were to occur between the two operations, the file would be removed but not placed within the target, seemingly having the file disappear from the system, but still consuming space on disk. This state is inconsistent with the history of operations that occurred within the file system.
As a summary of the above: latency is how fast I get data, throughput is how much data I can get, consistency is the upper bound of how corrupt my data is after a crash. Generally reading a small file is latency sensitive, while streaming a movie is throughput sensitive. Lets go over two of the major strategies file systems have used to remain consistent.
Write Ahead Logging and Journaling
File systems have achieved consistency and fault tolerance in many ways and often this defines the file system itself from others. LogFS’ entire file system is structured in a way that operations and data is written and flushed in a sequential manner to underlying storage . This allows the system to quickly see the last actions done and verify consistency if a crash had occurred.
This process of writing an action (possibly paired with the data) and flushing it to a sequential log before executing the action is known as “Write Ahead Logging” (WAL). File systems like ZFS and even key value stores like Redis use WAL’s to help with fault tolerance (ZFS calls it their ZIL — ZFS Intent Log). Persistent logging can be used in multiple ways, for example ZFS logs allocation decisions to disk allowing them to have a simpler in memory disk allocator. When a crash occurs, the system replays the decisions its made in the past to achieve its pre-crashed state (Yay for determinism).
WAL does not come without its own set of flaws, this style of consistency and recovery increases the overall amounts of writes to disk, and if logs are not stored on special hardware like battery backed RAM or new persistent memory then operations must again wait for its data to hit disk before proceeding. Why not just store it to the file system structure to begin with. With these types of systems its necessary to place logs on fast, low latency, persistent hardware in case of a crash.
Journaling is somewhat separate as it works to record only meta data operations. This helps solve the directory example explained above, while also not requiring to be placed on separate mediums as actual data is not handled. Operations that are logged are much smaller, and thus waiting shortened. This does however mean data-corruption can occur, but file system structure is always maintained and recovery protocols are much easier and faster with the assistance of this journal. Journaling is used by systems like FFS and XFS.
Copy on Write
In this style of consistency, writes are never done in place. The block is first copied, a new block allocated and the write is done to this copied block (copy on write). Any editing to the object tree, like in the figure below must also result in a copy on write. This strategy is used by file systems like ZFS, Btrfs, and WAFL.
This provides consistency as due to the tree like nature of this structure, writes occur from the bottom moving up. Meaning to be able to see the write that occurred at 2 in the Figure below, you must be be able to see the parent that was also created in step 3, and its parent in step 4. Finally as we transition from 4 to 5, the root of the object is overwritten in one atomic write. If a crash were to occur before reach step 5, a reader would start from the old root block and this would result in them only seeing the original tree. In this style of system its necessary to track dependent writes, meaning the data blocks should be flushed before the indirect blocks.
The entire file system is this tree structure and we can use this same logic and flush objects to disk before flushing metadata objects that point to them. We then keep an array of roots, and when a new root is added to this array we know that every write its dependent on has been flushed to disk so by reading the root we know the entire file system is consistent.
Copy-on-write (COW) has its cost. Its expensive, and the internals of these file system structures can get complicated and difficult to implement. I’ve had to implement copy on write B-Trees which have papers written on them. They are hard. Copy on write does add strain to the system due to write amplification when writing data blocks as metadata blocks that point to them must also be COW’d and flushed. This hurts smaller writes. For example, say I write 64 bytes to a block. The data block must now flush 4kiB as we have dirtied the entire block. In many file systems it would stop there, but with copy on write, the parent and its parents are flushed as well up to the height of the tree. If the height of our tree is 4, like in the example above then this 64 bytes may have turned into 16KiB of writing. This cost is amortized over larger writes meaning if we write more than a block or many small writes throughout a file, then blocks start sharing the same parents, reducing this cost.
COW also supports file system snapshots just because of COW! When we want a snapshot, we COW our file system (often known as a transaction), and the new root that is eventually created is pinned and named.
These transactions are often pipelined to help with performance, meaning writes are constantly in transit to disk. This however can be confusing for users as it can be difficult to know when their data actually is safe on disk or in the file system structure. ZFS uses a special WAL’s for faster flushing due to this pipeline.
When Data Corrupts
Although these strategies help support when file systems crash, they don’t actually help when it comes to other errors that can occur like misdirected reads or writes, or any form of data corruption. Other features must be used to catch these errors like Merkel Trees. However this comes at the cost of CPU, as check-summing can be very CPU expensive and heavy write workloads can end up consuming entire cores just to checksum!
Its important to realize how cost plays a role. We cannot build systems on top of battery backed RAM (not completely that is) or new 3D-XPoint NVDIMM’s as this is unreasonable for everyone. As cost becomes more of a concern, the main storage load becomes more dependent on spinning disks over faster flash storage. As we are allowed more of a budget we may opt for tiered storage allowing hotter data to be placed on our SSD’s with colder, archived data being placed on hard disk (See ZFS’s L2ARC). This tiering, is done by cloud providers for storage as well, which also has a tier below hard disk — Tape, which is VERY cheap per GB.
Tape: $0.008 / GB
HDD: $0.025 / GB
SSD SATA-3: $0.16 / GB
SSD m.2: $0.17 / GB
NVDIMM: $6.97 / GB
As you jump up tiers, you gain throughput, reduce latency, but increase cost. Code that is optimized for one storage medium, can become a hindrance for others. For example FFS has an optimized allocation strategies for HDD and takes into account velocity on the needle on the spinning splatter and how the outside will rotate faster than the inside as well as placing files within directories close on disk while placing separate directories farther away.
In this article I outlined design points for file systems. I introduced consistency strategies like WAL and COW, and the cost these consistency strategies have on the overall system.
In the future I want to dive further and talk about recent publications within storage and file systems and my thoughts around them as well as explore any other thoughts I have in this space. For example future articles may be around Zoned namespace SSD’s and HDD’s as well as the move to better extract the performance from underlying storage devices by reducing kernel involvement and allowing to directly use storage devices from user space (io_uring, kernel bypass etc). I will probably sporadically talk about distributed file systems and how object stores are created for databases depending on interest.