# Overview

<!-- toc -->

In this tutorial, you will learn how to build a simple LSM-Tree storage engine in the Rust programming language.

## What is LSM, and Why LSM?

A log-structured merge tree (LSM tree) is a data structure for maintaining key-value pairs. It is widely used as the
underlying storage engine of distributed database systems such as [TiDB](https://www.pingcap.com) and
[CockroachDB](https://www.cockroachlabs.com). [RocksDB](http://rocksdb.org), based on
[LevelDB](https://github.com/google/leveldb), is an implementation of an LSM-tree storage engine. It provides a wide
range of key-value access functionality and is used in many production systems.

Generally speaking, an LSM tree is an append-friendly data structure. It is easiest to understand by comparing it with
other key-value data structures such as the RB-tree and the B-tree. In an RB-tree or a B-tree, all data operations are
in place: when you update the value corresponding to a key, the value is overwritten at its original location in memory
or on disk. In an LSM tree, all write operations (insertions, updates, and deletions) are performed somewhere else.
These operations are batched into SST (sorted string table) files and written to the disk. Once written to the disk, a
file is never changed. The operations are applied lazily on disk by a special task called compaction, which merges
multiple SST files and removes unused data.
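
To make the idea of compaction concrete, here is a minimal sketch in which each SST is modeled as an in-memory sorted
vector. The names `SortedRun` and `compact` are hypothetical, and representing a deletion as a `None` value is an
assumption of this sketch rather than the format used later in the tutorial. The sketch also assumes the output is the
oldest data in the tree, so delete markers can be dropped; a real compaction streams the files instead of loading them
into a map.

```rust
use std::collections::BTreeMap;

/// One immutable sorted run: `Some(value)` records a put, `None` records a delete.
/// (Hypothetical in-memory stand-in for an SST file.)
type SortedRun = Vec<(String, Option<String>)>;

/// Merge runs, ordered from newest to oldest, into a single sorted run.
fn compact(runs_newest_first: &[SortedRun]) -> SortedRun {
    let mut merged: BTreeMap<String, Option<String>> = BTreeMap::new();
    for run in runs_newest_first {
        for (key, value) in run {
            // Keep only the first (newest) version seen for each key.
            merged.entry(key.clone()).or_insert_with(|| value.clone());
        }
    }
    // Assumption: nothing older than the output exists, so delete markers
    // and the values they shadow can be discarded.
    merged.into_iter().filter(|(_, value)| value.is_some()).collect()
}
```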

This architectural design makes an LSM tree easy to work with:

1. Data are immutable on persistent storage, which makes it easier to offload background tasks (compaction) to remote
   servers. It is also feasible to store and serve data directly from cloud-native storage systems like S3.
2. An LSM tree can balance read, write, and space amplification by changing the compaction algorithm. The data
   structure itself is highly versatile and can be optimized for different workloads.

In this tutorial, we will learn how to build an LSM-tree-based storage engine in the Rust programming language.

## Prerequisites of this Tutorial

* You should know the basics of the Rust programming language. Reading [the Rust book](https://doc.rust-lang.org/book/)
  is enough.
* You should know the basic concepts of key-value storage engines, i.e., why a somewhat complex design is needed to
  achieve persistence. If you have no prior experience with database or storage systems, you can implement Bitcask
  in [PingCAP Talent Plan](https://github.com/pingcap/talent-plan/tree/master/courses/rust/projects/project-2).
* Knowing the basics of an LSM tree is not required, but we recommend reading something about it, e.g., the
  overall idea of LevelDB. This will familiarize you with concepts like mutable and immutable memtables, SST,
  compaction, WAL, etc.

## Overview of LSM

An LSM storage engine generally contains 3 parts:

1. Write-ahead log to persist temporary data for recovery.
2. SSTs on the disk for maintaining a tree structure.
3. Memtables in memory for batching small writes.

The storage engine generally provides the following interfaces:

* `Put(key, value)`: store a key-value pair in the LSM tree.
* `Delete(key)`: remove a key and its corresponding value.
* `Get(key)`: get the value corresponding to a key.
* `Scan(range)`: get a range of key-value pairs.

To ensure persistence, the engine also provides:

* `Sync()`: ensure all the operations before `sync` are persisted to the disk.

Some engines choose to combine `Put` and `Delete` into a single operation called `WriteBatch`, which accepts a batch
of key-value pairs.
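
The interfaces above could be expressed in Rust roughly as follows. This is only a sketch: the trait `KvEngine`, the
enum `WriteBatchRecord`, and the in-memory `ToyEngine` are hypothetical names chosen for illustration, not the API of
the starter code. `ToyEngine` stores everything in a `BTreeMap` and persists nothing, so its `sync` is a no-op. It
routes `put` and `delete` through a single `write_batch` method to show how the two operations can be combined.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Hypothetical trait mirroring the operations listed above.
trait KvEngine {
    fn put(&mut self, key: &[u8], value: &[u8]);
    fn delete(&mut self, key: &[u8]);
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn scan(&self, range: RangeInclusive<Vec<u8>>) -> Vec<(Vec<u8>, Vec<u8>)>;
    fn sync(&mut self);
}

/// One entry of a write batch: either a put or a delete.
enum WriteBatchRecord {
    Put(Vec<u8>, Vec<u8>),
    Del(Vec<u8>),
}

/// Toy in-memory engine used only to exercise the trait; it persists nothing.
#[derive(Default)]
struct ToyEngine {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl ToyEngine {
    /// Apply a batch of puts and deletes in order.
    fn write_batch(&mut self, batch: Vec<WriteBatchRecord>) {
        for record in batch {
            match record {
                WriteBatchRecord::Put(key, value) => {
                    self.map.insert(key, value);
                }
                WriteBatchRecord::Del(key) => {
                    self.map.remove(&key);
                }
            }
        }
    }
}

impl KvEngine for ToyEngine {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.write_batch(vec![WriteBatchRecord::Put(key.to_vec(), value.to_vec())]);
    }

    fn delete(&mut self, key: &[u8]) {
        self.write_batch(vec![WriteBatchRecord::Del(key.to_vec())]);
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }

    fn scan(&self, range: RangeInclusive<Vec<u8>>) -> Vec<(Vec<u8>, Vec<u8>)> {
        // Note: `BTreeMap::range` panics if the range start is greater than its end.
        self.map.range(range).map(|(k, v)| (k.clone(), v.clone())).collect()
    }

    fn sync(&mut self) {
        // Nothing to flush: this toy keeps everything in memory.
    }
}
```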

In this tutorial, we assume the LSM tree uses the leveled compaction algorithm, which is commonly used in real-world
systems.

## Write Flow



The write flow of LSM contains 4 steps:

1. Write the key-value pair to the write-ahead log, so that it can be recovered after the storage engine crashes.
2. Write the key-value pair to the memtable. After (1) and (2) complete, we can notify the user that the write operation
   is completed.
3. When a memtable is full, we will flush it to the disk as an SST file in the background.
4. We will compact some files in some level into lower levels to maintain a good shape for the LSM tree, so that read
   amplification is low.
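
Steps 1 to 3 can be sketched as follows. This is a simplified model under stated assumptions: the struct `Engine` and
its fields are hypothetical, the write-ahead log and the SST files are plain in-memory vectors, "full" means an
entry-count limit, the flush runs inline instead of in the background, and the log is cleared after a flush. Step 4
(compaction) is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical engine state; the WAL and the SST files are modeled in memory.
struct Engine {
    wal: Vec<(String, String)>,
    memtable: BTreeMap<String, String>,
    /// Number of entries at which the memtable counts as full (assumption).
    memtable_limit: usize,
    /// Flushed memtables, oldest first; each one is sorted by key.
    l0_ssts: Vec<Vec<(String, String)>>,
}

impl Engine {
    fn new(memtable_limit: usize) -> Self {
        Engine {
            wal: Vec::new(),
            memtable: BTreeMap::new(),
            memtable_limit,
            l0_ssts: Vec::new(),
        }
    }

    fn put(&mut self, key: &str, value: &str) {
        // Step 1: append the pair to the write-ahead log.
        self.wal.push((key.to_string(), value.to_string()));
        // Step 2: apply it to the memtable; the write can be acknowledged now.
        self.memtable.insert(key.to_string(), value.to_string());
        // Step 3: flush a full memtable (inline here for simplicity).
        if self.memtable.len() >= self.memtable_limit {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // A BTreeMap iterates in key order, so the flushed file is sorted.
        let sst: Vec<(String, String)> = std::mem::take(&mut self.memtable).into_iter().collect();
        self.l0_ssts.push(sst);
        // Assumption: flushed entries no longer need the log for recovery.
        self.wal.clear();
    }
}
```

With `Engine::new(2)`, the first `put` only touches the log and the memtable; the second `put` fills the memtable and
moves both entries into one sorted file.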

## Read Flow



When we want to read a key:

1. We will first probe all the memtables from latest to oldest.
2. If the key is not found, we will then search the entire LSM tree containing SSTs to find the data.
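
The two steps above can be sketched as a lookup function. The struct `ReadView` is a hypothetical name, SSTs are
modeled as in-memory sorted vectors, and a `None` value stands for a deleted key (an assumption of this sketch). The
SSTs are simply checked one by one from newest to oldest, whereas a leveled tree can narrow the search by key range.

```rust
use std::collections::BTreeMap;

/// Hypothetical read-side view of the engine; `None` as a value marks a deleted key.
struct ReadView {
    /// Memtables ordered from newest to oldest.
    memtables: Vec<BTreeMap<String, Option<String>>>,
    /// SSTs ordered from newest to oldest, each sorted by key.
    ssts: Vec<Vec<(String, Option<String>)>>,
}

impl ReadView {
    fn get(&self, key: &str) -> Option<String> {
        // 1. Probe memtables from newest to oldest; the first hit wins,
        //    even if it is a delete marker.
        for memtable in &self.memtables {
            if let Some(value) = memtable.get(key) {
                return value.clone();
            }
        }
        // 2. Fall back to the SSTs, binary-searching each sorted file.
        for sst in &self.ssts {
            if let Ok(idx) = sst.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
                return sst[idx].1.clone();
            }
        }
        None
    }
}
```

Because the newest source is checked first, a newer delete marker hides an older value for the same key.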

## Tutorial Overview



In this tutorial, we will build the LSM tree structure in 7 days:

* Day 1: Block encoding. SSTs are composed of multiple data blocks. We will implement the block encoding.
* Day 2: SST encoding.
* Day 3: MemTable and Merge Iterators.
* Day 4: Block cache and Engine. To reduce disk I/O and maximize performance, we will use moka-rs to build a block cache
  for the LSM tree. On this day we will get a functional (but not persistent) key-value engine with `get`, `put`,
  `scan`, and `delete` APIs.
* Day 5: Compaction. Now it's time to maintain a leveled structure for SSTs.
* Day 6: Recovery. We will implement WAL and manifest so that the engine can recover after restart.
* Day 7: Bloom filter and key compression. They are widely used optimizations in LSM tree structures.

## Development Guide

We provide starter code (see the `mini-lsm-starter` crate), in which every function body is simply replaced with
`unimplemented!()`. You can start your project based on this starter code. We provide test cases, but they are very
simple. We recommend that you think carefully about your implementation and write your own test cases.

* You can use `cargo x scheck` to run all test cases and run style checks on your codebase.
* You can use `cargo x copy-test dayX` to copy test cases to the starter code.

## About the Author

As of writing (at the end of 2022), Chi is a first-year master's student at Carnegie Mellon University. He has 5 years
of experience with the Rust programming language, having used it since 2018. He has worked on a variety of database
systems, including [TiKV][db1], [AgateDB][db2], [TerarkDB][db3], [RisingLight][db4], and [RisingWave][db5]. In his first
semester at CMU, he worked as a teaching assistant for CMU's [15-445/645 Intro to Database Systems][15445-course]
course, where he built a new SQL processing layer for the [BusTub][bustub] educational database system, added more
query optimization material to the course, and made the course [more challenging than ever before][tweet]. Chi is
interested in exploring how the Rust programming language can fit into the database world. Check out his
[previous tutorial][type-exercise] on building a vectorized expression framework if you are also interested in that
topic.

[db1]: https://github.com/tikv/tikv
[db2]: https://github.com/tikv/agatedb
[db3]: https://github.com/bytedance/terarkdb
[db4]: https://github.com/risinglightdb/risinglight
[db5]: https://github.com/risingwavelabs/risingwave
[15445-course]: https://15445.courses.cs.cmu.edu/fall2022/
[tweet]: https://twitter.com/andy_pavlo/status/1598137241016360961
[type-exercise]: https://github.com/skyzh/type-exercise-in-rust
[bustub]: https://github.com/cmu-db/bustub