Designing data-intensive applications — Storage

This is going to be a series of post taken from my learnings of “Designing data-intensive applications” by “Martin Kleppmann

picture from

Bare minimal

We will start by making a simpler version of a database and will continue our analysis by building over it.

echo "$1,$2" >> database
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1

The bash DB takes a key and value pair as $1 and $2 for saving a data pair using the db_set. This writes the “key, value” to a file called “database”.Since this is an append-only implementation for storage if the value needs to be updated it just appends a new line of the updated “key, value” to the file end. The db_get makes use of the tail to get the last occurrence of the key to show the latest value of the key

Write: The db_set being a simple append-only implementation work effectively for a bare minimal implementation even though it does not factor into the real-world issues we run into like concurrency control, reclaiming unused disk space, error handling, and partial writes, cost of write is O(1)

Read: The db_get has a bit of performance problem when a large number of records are written this takes time to lookup for the key scanning the entire file for each read, cost of reading is O(n)

To make effective reads we could go for a data structure called index, lets take a look at an implementation of this with Hash indexes

Hash indexes

The simplest approach to start with is to keep an in-memory hash map with every key mapped to a byte offset in the data file, when we want to read we just find the offset, seek the location, and get the value for the key. (see Bitcask).

Since this shifts the picture to a key-offset paradigm rather than the previous key-value, we can address the issue of disk space overfull, by breaking the logs into segments and removing the duplicate keys(compaction, done in background). The append and merge combo being one sequential write operation works faster than the random writes.

However, we are bottlenecked with the fact that our keys being stored as an in-memory needs to be able to fit in the RAM. This also fails in performance over any range-queries for data.

Delete: requires a special tombstone record to the log as a flag for the merging process to be notified that the key has been removed when compaction is done.

Crash recovery: In such cases, rather than rebuilding the segment to generate the hash map a better idea would be to go for having snapshots of each segment’s hash map to be stored on disk.

Partially written records: Adding a checksum alongside the write could address this issue.

Concurrency: Being sequential in writes, it allows only one write thread but supports multiple read threads concurrently.

SSTable (Sorted s̶e̶g̶m̶e̶n̶t̶e̶d String Table)

This makes the use of the log-structured sequential writes with hash index added with one tweak of having the key-value pairs sorted by the key. When merging n segments this makes use of the famous merge-sort algorithm to have one merged segment with the keys being sorted.

Now to leverage the power of sorting, rather than having all the key-offset maps for all data stored, we can have a sparse index of the key-offset map with one key every few kilobytes of the segment. Even further the segments can be compressed before writing to disk, thus each entry of the sparse maps to the beginning index of one compressed block that gets written to disk.

To maintain the sorted structure in-memory, we can make use of the AVL tree(memtable), which provides key insertion in any order but reads back in sorted order. To handle the case of a DB crash, we can have a backup mechanism of our initial approach working in parallel on disk, keeping a separate unsorted log of every writes that is immediately appended.

The LSM (Log-Structured Merge) Tree is the one which is used in LvelDB and RocksDB, the storage engines built using this are called LSM storage engines. The Lucene indexing of Elasticsearch and Solar uses a similar methodology.


Let us look into the famous B tree design which is used for structuring the storage in the disk (note the previous were in-memory).

Rather than having to break the database into segments of variable size, the B tree breaks down the database into fixed-size blocks or Pages of approx 4kb in size to performs read and write. Each page has an address and one location in it can refer to another page on disk. one page will be made as a root of the B tree and can refer to number of child pages. When there is no space for a new key in a page, it splits into two half-filled pages and updates the parent page. B tree with a n keys will have a depth of O(log n).

Let's take a closer look at the case when the page is full and a new key gets added, not the actual page gets split into two half-filled pages by copying all the references to these 2 pages and the parent page has to be overwritten to update the 2 child references. With lots of moving parts, a single DB crash could throw this design out. To make it resilient to crashes every tree modification is logged to an append-only file prior to applying, this is called the redo log(write-ahead log). Careful concurrency control needs to be added to prevent the tree from being read at an inconsistent state when using multi-thread read.

Rather than have a redo log, there can be a copy on write mechanism that writes the modified page to a different location (data copied), and a new version of the parent page is created pointing to the new location. Making use of abbreviations for the key rather saves some page size allowing for more keys per page.

LSM trees are faster for writes, B trees are faster for reads.

#devloper ,likes art, origami and instrumental music