@@ -4,11 +4,11 @@
# Mini-LSM Course Overview

## Course Structure

![Course Overview](lsm-tutorial/00-full-overview.svg)

We have three parts (weeks) for this course. In the first week, we will focus on the storage structure and the storage format of an LSM storage engine. In the second week, we will dive deeply into compaction and implement persistence support for the storage engine. In the third week, we will implement multi-version concurrency control.

* [The First Week: Mini-LSM](./week1-overview.md)
* [The Second Week: Compaction and Persistence](./week2-overview.md)
@@ -37,7 +37,7 @@ To ensure persistence,
Some engines choose to combine `Put` and `Delete` into a single operation called `WriteBatch`, which accepts a batch of key-value pairs.
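
As a rough illustration, such a batch API usually boils down to a list of put/delete records; the type below is a sketch and not necessarily the engine's real definition:

```rust
// Sketch only: one possible shape for a write batch record.
pub enum WriteBatchRecord<T: AsRef<[u8]>> {
    Put(T, T), // key, value
    Del(T),    // key
}

// The engine would then accept a whole batch at once, e.g. a hypothetical
// `storage.write_batch(&[Put(b"k1", b"v1"), Del(b"k2")])` call.
```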
In this course, we assume the LSM tree uses a leveled compaction algorithm, which is commonly used in real-world systems.

### Write Path
@@ -59,6 +59,6 @@ When we want to read a key,
1. We will first probe all the mem-tables from the latest to the oldest.
2. If the key is not found, we will then search the entire LSM tree containing SSTs to find the data.
There are two types of read: lookup and scan. Lookup finds one key in the LSM tree, while scan iterates all keys within a range in the storage engine. We will cover both of them throughout the course.
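
To make the lookup path concrete, here is a minimal sketch; the `KvSource` trait is invented purely for illustration and does not exist in the codebase:

```rust
// Anything we can probe for a key; both memtables and SSTs would play this role.
trait KvSource {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
}

// Probe memtables from newest to oldest, then fall back to the SSTs.
// An empty value is treated as a delete tombstone in this sketch.
fn lookup(memtables: &[&dyn KvSource], ssts: &[&dyn KvSource], key: &[u8]) -> Option<Vec<u8>> {
    for source in memtables.iter().chain(ssts.iter()) {
        if let Some(value) = source.get(key) {
            return if value.is_empty() { None } else { Some(value) };
        }
    }
    None
}
```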
{{#include copyright.md}}
@@ -36,13 +36,13 @@ This course will teach you how to build an LSM-tree-based storage engine in the
* You should know the basic concepts of key-value storage engines, i.e., why we need a complex design to achieve persistence. If you have no prior experience with database systems and storage systems, you can implement Bitcask in [PingCAP Talent Plan](https://github.com/pingcap/talent-plan/tree/master/courses/rust/projects/project-2).
* Knowing the basics of an LSM tree is not a requirement, but we recommend you read something about it, e.g., the overall idea of LevelDB. Knowing them beforehand would familiarize you with concepts like mutable and immutable mem-tables, SST, compaction, WAL, etc.
## What should you expect from this course

After taking this course, you should deeply understand how an LSM-based storage system works, gain hands-on experience in designing such systems, and apply what you have learned in your study and career. You will understand the design tradeoffs in such storage systems and find optimal ways to design an LSM-based storage system to meet your workload requirements/goals. This very in-depth course covers all the essential implementation details and design choices of modern storage systems (e.g., RocksDB) based on the author's experience with several LSM-like storage systems, and you will be able to directly apply what you have learned in both industry and academia.
### Structure

The course is extensive, with several parts (weeks). Each week has seven chapters; you can finish each within 2 to 3 hours. The first six chapters of each part will instruct you to build a working system, and the last chapter of each week will be a *snack time* chapter that implements some easy things over what you have built in the previous six days. Each chapter will have required tasks, *check your understanding* questions, and bonus tasks.
### Testing
@@ -50,27 +50,27 @@ We provide a full test suite and some CLI tools for you to validate if your solu
### Solution

The mini-lsm main repo contains a solution that implements all the functionality required in the course. We also have a mini-lsm solution checkpoint repo where each commit corresponds to a chapter in the course.
Keeping such a checkpoint repo up-to-date with the mini-lsm course is challenging because each bug fix or new feature must go through all commits (or checkpoints). Therefore, this repo might not use the latest starter code or incorporate the latest features from the mini-lsm course.
**TL;DR: We do not guarantee the solution checkpoint repo contains a correct solution, passes all tests, or has the correct doc comments.** For a correct implementation and the solution after implementing everything, please look at the solution in the main repo instead. [https://github.com/skyzh/mini-lsm/tree/main/mini-lsm](https://github.com/skyzh/mini-lsm/tree/main/mini-lsm).
If you are stuck on some part of the course or need help determining where to implement functionality, you can refer to this repo for help. You may compare the diff between commits to see what has changed. Some functions need to be modified multiple times throughout the chapters, and this repo shows what exactly is expected to be implemented for each chapter.
You may access the solution checkpoint repo at [https://github.com/skyzh/mini-lsm-solution-checkpoint](https://github.com/skyzh/mini-lsm-solution-checkpoint).
### Feedback
Your feedback is greatly appreciated. We have rewritten the whole course from scratch in 2024 based on the feedback from the students. Please join the [Discord community](https://skyzh.dev/join/discord) and share your learning experience to help us continuously improve the course.
The long story of why we rewrote it: the course was originally planned as general guidance where students would start from an empty directory and implement whatever they wanted based on the specifications we provided. We had minimal tests that checked if the behavior was correct. However, the original course was too open-ended, which caused huge obstacles to the learning experience. Because students did not have an overview of the whole system beforehand and the instructions were vague, it was sometimes hard for them to know why a design decision was made and what they needed to do to achieve a goal. Some parts of the course were so compact that delivering the expected contents within just one chapter was impossible. Therefore, we completely redesigned the course for an easier learning curve and clearer learning goals. The original one-week course is now split into two weeks (the first week on storage format and the second week on deep-dive compaction), with an extra part on MVCC. We hope you find this course interesting and helpful in your study and career. We want to thank everyone who commented in [Feedback after coding day 1](https://github.com/skyzh/mini-lsm/issues/11) and [Hello, when is the next update plan for the course?](https://github.com/skyzh/mini-lsm/issues/7) -- your feedback greatly helped us improve the course.
### License
The source code of this course is licensed under Apache 2.0, while the book is licensed under CC BY-NC-SA 4.0.
### Will this course be free forever?
Yes! Everything publicly available now will be free forever and receive lifetime updates and bug fixes. Meanwhile, we might provide paid code review and office hour services. As of 2024, we do not have plans to finish the DLC part (the *rest of your life* chapters) and have yet to decide whether it will be publicly available.
@@ -86,7 +86,7 @@ Now, you can get an overview of the LSM structure in [Mini-LSM Course Overview](
## About the Author
As of writing (at the beginning of 2024), Chi obtained his master's degree in Computer Science from Carnegie Mellon University and his bachelor's degree from Shanghai Jiao Tong University. He has been working on a variety of database systems, including [TiKV][db1], [AgateDB][db2], [TerarkDB][db3], [RisingWave][db4], and [Neon][db5]. Since 2022, he has worked as a teaching assistant for [CMU's Database Systems course](https://15445.courses.cs.cmu.edu/) for three semesters on the BusTub educational system, where he added a lot of new features and more challenges to the course (check out the redesigned [query execution](https://15445.courses.cs.cmu.edu/fall2022/project3/) project and the super challenging [multi-version concurrency control](https://15445.courses.cs.cmu.edu/fall2023/project4/) project). Besides working on the BusTub educational system, he also maintains the [RisingLight](https://github.com/risinglightdb/risinglight) educational database system. Chi is interested in exploring how the Rust programming language can fit into the database world. Check out his previous course on building a vectorized expression framework [type-exercise-in-rust](https://github.com/skyzh/type-exercise-in-rust) and on building a vector database [write-you-a-vector-db](https://github.com/skyzh/write-you-a-vector-db) if you are also interested in that topic.
[db1]: https://github.com/tikv/tikv
[db2]: https://github.com/tikv/agatedb
@@ -4,13 +4,13 @@
# Mini-LSM v1
This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a new version of this course. We keep the legacy version in this book so that the search engine can keep the pages in the index and users can follow the links to the new version of the course.
## V1 Course Overview

In this course, we will build the LSM tree structure in 7 days:
* Day 1: Block encoding. SSTs are composed of multiple data blocks. We will implement the block encoding.
* Day 2: SST encoding.
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course and this chapter is now part of [Mini-LSM Week 1 Day 3: Blocks](./week1-03-block.md).

</div>
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course and this chapter is now part of [Mini-LSM Week 1 Day 4: Sorted String Table (SST)](./week1-04-sst.md).

</div>
@@ -26,7 +26,7 @@ test cases, write a new module `#[cfg(test)] mod user_tests { /* your test cases
## Task 1 - SST Builder

SST is composed of data blocks and index blocks stored on the disk. Usually, data blocks are lazily loaded -- they will
not be loaded into the memory until a user requests it. Index blocks can also be loaded on-demand, but in this course,
we make the simplifying assumption that all SST index blocks (meta blocks) can fit in memory. Generally, an SST file is
around 256MB in size.
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course and this chapter is now part of [Mini-LSM Week 1 Day 1: Memtable](./week1-01-memtable.md) and [Mini-LSM Week 1 Day 2: Merge Iterator](./week1-02-merge-iterator.md)

</div>
@@ -29,7 +29,7 @@ in part 4, we will compose all these things together to make a real storage engi
## Task 1 - Mem Table
In this course, we use [crossbeam-skiplist](https://docs.rs/crossbeam-skiplist) as the memtable implementation.
A skiplist is like a linked list, where data is stored in list nodes and will not be moved in memory. Instead of using
a single pointer for the next element, the nodes in a skiplist contain multiple pointers and allow the user to "skip some
elements", so that we can achieve `O(log n)` search, insertion, and deletion.
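
As a rough sketch (field and method names are illustrative, not the exact starter-code definitions), a skiplist-backed memtable can look like this:

```rust
use std::sync::Arc;

use bytes::Bytes;
use crossbeam_skiplist::SkipMap;

// A sketch of a skiplist-backed memtable.
pub struct MemTable {
    map: Arc<SkipMap<Bytes, Bytes>>,
}

impl MemTable {
    pub fn create() -> Self {
        Self { map: Arc::new(SkipMap::new()) }
    }

    // `SkipMap::insert` only needs `&self`, so writers do not need an extra mutex here.
    pub fn put(&self, key: &[u8], value: &[u8]) {
        self.map
            .insert(Bytes::copy_from_slice(key), Bytes::copy_from_slice(value));
    }

    pub fn get(&self, key: &[u8]) -> Option<Bytes> {
        self.map.get(key).map(|entry| entry.value().clone())
    }
}
```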
@@ -69,7 +69,7 @@ pub struct MemTableIterator {
```
You will also need to convert the Rust-style iterator API to our storage iterator. In Rust, we use `next() -> Data`. But
in this course, `next` doesn't have a return value, and the data should be fetched by `key()` and `value()`. You will
need to think of a way to implement this.
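
The storage-iterator style being described looks roughly like the following; this is a sketch, and the actual trait in the starter code may differ (for example, `next` may return a `Result`):

```rust
// A sketch of a "storage iterator": `next` advances the cursor and returns no data;
// the current entry is read through `key()` / `value()` while `is_valid()` holds.
pub trait StorageIterator {
    fn key(&self) -> &[u8];
    fn value(&self) -> &[u8];
    fn is_valid(&self) -> bool;
    fn next(&mut self);
}

// Typical usage: a while-loop instead of Rust's `for` over `Iterator`.
fn collect_keys(mut iter: impl StorageIterator) -> Vec<Vec<u8>> {
    let mut keys = Vec::new();
    while iter.is_valid() {
        keys.push(iter.key().to_vec());
        iter.next();
    }
    keys
}
```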
<details>
@@ -95,7 +95,7 @@ the inner iter to the next position.
</details>
In this design, you might have noticed that as long as we have the iterator object, the mem-table cannot be freed from
the memory. In this course, we assume user operations are short, so that this will not cause big problems. See extra
task for possible improvements.
You can also consider using [AgateDB's skiplist](https://github.com/tikv/agatedb/tree/master/skiplist) implementation,
@@ -140,7 +140,7 @@ types. That is `TwoMergeIterator`.
You can implement `TwoMergeIterator` in `two_merge_iter.rs`. Similar to `MergeIterator`, if the same key is found in
both of the iterators, the first iterator takes precedence.

In this course, we explicitly did not use something like `Box<dyn StorageIter>` to avoid dynamic dispatch. This is a
common optimization in LSM storage engines.
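
A partial sketch of what the static-dispatch version looks like, assuming a `StorageIter`-style trait; only the key-selection logic is shown, and a full implementation would also advance and de-duplicate the two iterators in `next`:

```rust
// Static dispatch via two generic parameters rather than `Box<dyn StorageIter>`:
// `A` and `B` are concrete iterator types, so every call is monomorphized.
pub trait StorageIter {
    fn key(&self) -> &[u8];
    fn value(&self) -> &[u8];
    fn is_valid(&self) -> bool;
    fn next(&mut self);
}

pub struct TwoMergeIterator<A: StorageIter, B: StorageIter> {
    a: A,
    b: B,
}

impl<A: StorageIter, B: StorageIter> TwoMergeIterator<A, B> {
    // When both iterators are valid and on the same key, `a` takes precedence
    // (a full implementation would also skip the duplicate entry in `b`).
    fn use_a(&self) -> bool {
        if !self.b.is_valid() {
            return true;
        }
        self.a.is_valid() && self.a.key() <= self.b.key()
    }

    pub fn key(&self) -> &[u8] {
        if self.use_a() { self.a.key() } else { self.b.key() }
    }

    pub fn value(&self) -> &[u8] {
        if self.use_a() { self.a.value() } else { self.b.value() }
    }
}
```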
## Extra Tasks
@@ -150,7 +150,7 @@ common optimization in LSM storage engines.
to think of smart ways of solving this.
* Async iterator. One interesting thing to explore is to see if it is possible to asynchronize everything in the storage
engine. You might find some lifetime-related problems and need to work around them.
* Foreground iterator. In this course we assumed that all operations are short, so that we can hold a reference to the
mem-table in the iterator. If an iterator is held by users for a long time, the whole mem-table (which might be 256MB)
will stay in the memory even if it has been flushed to disk. To solve this, we can provide a `ForegroundIterator` /
`LongIterator` to our user. The iterator will periodically create a new underlying storage iterator so as to allow
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course and this chapter is now part of [Mini-LSM Week 1 Day 5: Read Path](./week1-05-read-path.md) and [Mini-LSM Week 1 Day 6: Write Path](./week1-06-write-path.md)

</div>
@@ -7,7 +7,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course
and this chapter is now part of:
- [Mini-LSM Week 2 Day 1: Compaction Implementation](./week2-01-compaction.md)
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course
and this chapter is now part of:
- [Mini-LSM Week 2 Day 5: Manifest](./week2-05-manifest.md)
@@ -7,7 +7,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course
and this chapter is now part of [Mini LSM Week 1 Day 7: SST Optimizations](./week1-07-sst-optimizations.md).

</div>
@@ -6,7 +6,7 @@
<div class="warning">

This is a legacy version of the Mini-LSM course and we will not maintain it anymore. We now have a better version of this course
and this chapter is now part of [Mini LSM Week 1 Day 7: SST Optimizations](./week1-07-sst-optimizations.md).

</div>
@@ -161,7 +161,7 @@ Now that you have multiple memtables, you may modify your read path `get` functi
* Why do we need a combination of `state` and `state_lock`? Can we only use `state.read()` and `state.write()`?
* Why does the order to store and to probe the memtables matter? If a key appears in multiple memtables, which version should you return to the user?
* Is the memory layout of the memtable efficient / does it have good data locality? (Think of how `Byte` is implemented and stored in the skiplist...) What are the possible optimizations to make the memtable more efficient?
* So we are using `parking_lot` locks in this course. Is its read-write lock a fair lock? What might happen to the readers trying to acquire the lock if there is one writer waiting for existing readers to stop?
* After freezing the memtable, is it possible that some threads still hold the old LSM state and write into these immutable memtables? How does your solution prevent it from happening?
* There are several places where you might first acquire a read lock on state, then drop it and acquire a write lock (these two operations might be in different functions, but they happen sequentially because one function calls the other). How does it differ from directly upgrading the read lock to a write lock? Is it necessary to upgrade instead of acquiring and dropping, and what is the cost of doing the upgrade?
@@ -47,7 +47,7 @@ pub struct MemtableIterator {
Okay, here is the problem: we want to express that the lifetime of the iterator is the same as the `map` in the structure. How can we do that?
This is the first and most tricky Rust language thing that you will ever meet in this course -- self-referential structure. If it is possible to write something like:
```rust,no_run
pub struct MemtableIterator { // <- with lifetime 'this
@@ -123,7 +123,7 @@ In this task, you will need to modify:
src/lsm_iterator.rs
```
We use the `LsmIterator` structure to represent the internal LSM iterators. You will need to modify this structure multiple times throughout the course when more iterators are added into the system. For now, because we only have multiple memtables, it should be defined as:
```rust,no_run
type LsmIteratorInner = MergeIterator<MemTableIterator>;
@@ -163,6 +163,6 @@ We do not provide reference answers to the questions, and feel free to discuss a
## Bonus Tasks
* **Foreground Iterator.** In this course we assumed that all operations are short, so that we can hold a reference to the mem-table in the iterator. If an iterator is held by users for a long time, the whole mem-table (which might be 256MB) will stay in the memory even if it has been flushed to disk. To solve this, we can provide a `ForegroundIterator` / `LongIterator` to our user. The iterator will periodically create a new underlying storage iterator so as to allow garbage collection of the resources.
{{#include copyright.md}}
@@ -30,7 +30,7 @@ src/block/builder.rs
src/block.rs
```
The block encoding format in our course is as follows:
```plaintext
----------------------------------------------------------------------------------------------------
@@ -28,7 +28,7 @@ src/table/builder.rs
src/table.rs
```
SSTs are composed of data blocks and index blocks stored on the disk. Usually, data blocks are lazily loaded -- they will not be loaded into the memory until a user requests it. Index blocks can also be loaded on-demand, but in this course, we make the simplifying assumption that all SST index blocks (meta blocks) can fit in memory (actually, we do not have a designated index block implementation). Generally, an SST file is around 256MB in size.
The SST builder is similar to block builder -- users will call `add` on the builder. You should maintain a `BlockBuilder` inside SST builder and split blocks when necessary. Also, you will need to maintain block metadata `BlockMeta`, which includes the first/last keys in each block and the offsets of each block. The `build` function will encode the SST, write everything to disk using `FileObject::create`, and return an `SsTable` object.
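
As a rough sketch of this flow (the struct layout, encoding, and naming here are simplified and not identical to the starter code):

```rust
// Simplified sketch: buffer entries into the current block, cut a block when it
// reaches the target size, and record per-block metadata for the index.
pub struct BlockMeta {
    pub offset: usize,      // where the encoded block starts in the SST data
    pub first_key: Vec<u8>, // first key in the block
    pub last_key: Vec<u8>,  // last key in the block
}

pub struct SsTableBuilder {
    block_size: usize,
    current: Vec<(Vec<u8>, Vec<u8>)>, // stand-in for a real BlockBuilder
    data: Vec<u8>,                    // encoded data blocks so far
    pub meta: Vec<BlockMeta>,
}

impl SsTableBuilder {
    pub fn new(block_size: usize) -> Self {
        Self { block_size, current: Vec::new(), data: Vec::new(), meta: Vec::new() }
    }

    pub fn add(&mut self, key: &[u8], value: &[u8]) {
        self.current.push((key.to_vec(), value.to_vec()));
        let estimated: usize = self.current.iter().map(|(k, v)| k.len() + v.len() + 4).sum();
        if estimated >= self.block_size {
            self.finish_block();
        }
    }

    fn finish_block(&mut self) {
        if self.current.is_empty() {
            return;
        }
        self.meta.push(BlockMeta {
            offset: self.data.len(),
            first_key: self.current.first().unwrap().0.clone(),
            last_key: self.current.last().unwrap().0.clone(),
        });
        // Naive length-prefixed encoding for illustration; the real block format differs.
        for (k, v) in self.current.drain(..) {
            self.data.extend_from_slice(&(k.len() as u16).to_le_bytes());
            self.data.extend_from_slice(&k);
            self.data.extend_from_slice(&(v.len() as u16).to_le_bytes());
            self.data.extend_from_slice(&v);
        }
    }
}
```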
@@ -81,7 +81,7 @@ For get requests, it will be processed as lookups in the memtables, and then sca
## Test Your Understanding
* Consider the case that a user has an iterator that iterates the whole storage engine, and the storage engine is 1TB large, so that it takes ~1 hour to scan all the data. What would be the problems if the user does so? (This is a good question and we will ask it several times at different points of the course...)
* Another popular interface provided by some LSM-tree storage engines is multi-get (or vectored get). The user can pass a list of keys that they want to retrieve. The interface returns the value of each of the keys. For example, `multi_get(vec!["a", "b", "c", "d"]) -> a=1,b=2,c=3,d=4`. Obviously, an easy implementation is to simply do a single get for each of the keys. How will you implement the multi-get interface, and what optimizations can you do to make it more efficient? (Hint: some operations during the get process will only need to be done once for all keys, and besides that, you can think of an improved disk I/O interface to better support this multi-get interface).
We do not provide reference answers to the questions; feel free to discuss them in the Discord community.
@@ -21,7 +21,7 @@ cargo x scheck
## Task 1: Flush Memtable to SST
At this point, we have all in-memory things and on-disk files ready, and the storage engine is able to read and merge the data from all these structures. Now, we are going to implement the logic to move things from memory to the disk (so-called flush), and complete week 1 of the Mini-LSM course.
In this task, you will need to modify:
@@ -36,7 +36,7 @@ You will need to modify `LSMStorageInner::force_flush_next_imm_memtable` and `Me
* Create an SST file corresponding to a memtable.
* Remove the memtable from the immutable memtable list and add the SST file to L0 SSTs.
We have not explained what L0 (level-0) SSTs are yet. In general, they are the set of SST files directly created as a result of memtable flush. In week 1 of this course, we will only have L0 SSTs on the disk. We will dive into how to organize them efficiently using a leveled or tiered structure on the disk in week 2.
Note that creating an SST file is a compute-heavy and a costly operation. Again, we do not want to hold the `state` read/write lock for a long time, as it might block other operations and create huge latency spikes in the LSM operations. Also, we use the `state_lock` mutex to serialize state modification operations in the LSM tree. In this task, you will need to think carefully how to use these locks to make the LSM state modification race-condition free while minimizing critical sections.
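
A sketch of the locking pattern described above, assuming a hypothetical clone-on-write `LsmStorageState`; the real fields and helpers in the starter code differ:

```rust
use std::sync::Arc;

use parking_lot::{Mutex, RwLock};

// Hypothetical state snapshot, swapped atomically under a short write lock.
#[derive(Clone, Default)]
pub struct LsmStorageState {
    pub imm_memtable_ids: Vec<usize>,
    pub l0_sst_ids: Vec<usize>,
}

pub struct LsmStorageInner {
    pub state: RwLock<Arc<LsmStorageState>>,
    pub state_lock: Mutex<()>, // serializes state-modifying operations (flush, compaction, ...)
}

impl LsmStorageInner {
    pub fn force_flush_next_imm_memtable(&self) {
        // 1. Serialize against other state modifications, but do NOT hold the
        //    state read/write lock while doing the expensive work.
        let _state_lock_guard = self.state_lock.lock();

        // 2. Decide which memtable to flush under a short read lock.
        let flush_id = {
            let guard = self.state.read();
            match guard.imm_memtable_ids.last().copied() {
                Some(id) => id,
                None => return,
            }
        };

        // 3. Build the SST file here, outside any lock (compute-heavy, omitted).
        let _ = flush_id;

        // 4. Install the new state under a short write lock (clone-on-write).
        let mut guard = self.state.write();
        let mut snapshot = guard.as_ref().clone();
        snapshot.imm_memtable_ids.retain(|id| *id != flush_id);
        snapshot.l0_sst_ids.insert(0, flush_id);
        *guard = Arc::new(snapshot);
    }
}
```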
@@ -6,7 +6,7 @@

In the first week of the course, you will build necessary storage formats for the storage engine, the read path and the write path of the system, and have a working implementation of an LSM-based key-value store. There are 7 chapters (days) for this part.
* [Day 1: Memtable](./week1-01-memtable.md). You will implement the in-memory read and write path of the system.
* [Day 2: Merge Iterator](./week1-02-merge-iterator.md). You will extend what you have built in day 1 and implement a `scan` interface for your system.
@@ -17,7 +17,7 @@ In this chapter, you will:
## Task 1: Write Batch Interface
In this task, we will prepare for week 3 of this course by adding a write batch API. You will need to modify:
```
src/lsm_storage.rs
@@ -51,7 +51,7 @@ The format of the SST will be changed to:
We use crc32 as our checksum algorithm. You can use `crc32fast::hash` to generate the checksum for the block after building a block.
Usually, when the user specifies the target block size in the storage options, the size should include both the block content and the checksum. For example, if the target block size is 4096, and the checksum takes 4 bytes, the actual block content target size should be 4092. However, to avoid breaking previous test cases and for simplicity, in our course, we will **still** use the target block size as the target content size, and simply append the checksum at the end of the block.
When you read the block, you should verify the checksum in `read_block` and correctly generate the slices for the block content. You should pass all test cases in previous chapters after implementing this functionality.
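
A sketch of the write and read sides of this, with illustrative helper names (only `crc32fast::hash` is taken from the text above):

```rust
use anyhow::{bail, Result};

// Append a crc32 checksum after the encoded block content.
fn encode_block_with_checksum(block_data: &[u8]) -> Vec<u8> {
    let mut buf = block_data.to_vec();
    buf.extend_from_slice(&crc32fast::hash(block_data).to_be_bytes());
    buf
}

// Split the buffer back into content + checksum and verify before using the block.
fn read_block(data_with_checksum: &[u8]) -> Result<&[u8]> {
    if data_with_checksum.len() < 4 {
        bail!("block is too short to contain a checksum");
    }
    let (content, checksum_bytes) = data_with_checksum.split_at(data_with_checksum.len() - 4);
    let stored = u32::from_be_bytes(checksum_bytes.try_into()?);
    if crc32fast::hash(content) != stored {
        bail!("block checksum mismatch");
    }
    Ok(content)
}
```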
@@ -75,7 +75,7 @@ In leveled compaction, the user can specify a maximum number of levels, which is

In tiered compaction, the engine will dynamically adjust the number of sorted runs by merging them or letting new SSTs be flushed as a new sorted run (a tier) to minimize write amplification. In this strategy, you will usually see the engine merge two equally-sized sorted runs. The number of tiers can be high if the compaction strategy does not choose to merge tiers, therefore making read amplification high. In this course, we will implement RocksDB's universal compaction, which is a kind of tiered compaction strategy.

@@ -69,7 +69,7 @@ In this task, you will need to modify:
src/mvcc/txn.rs
```
In this course, we only guarantee full serializability for `get` requests. You still need to track the read set for scans, but in some specific cases, you might still get a non-serializable result.
To understand why this is hard, let us go through the following example.
@@ -26,7 +26,7 @@ There are a lot of ways to achieve the goal. The user of Mini-LSM can scan all t
Or, they can create column families (we will talk about this in the *rest of your life* chapter). They store each table in a column family, which is a standalone LSM state, and directly remove the SST files corresponding to the column family when the user drops the table.
In this course, we will implement the third approach: compaction filters. Compaction filters can be dynamically added to the engine at runtime. During the compaction, if a key matching the compaction filter is found, we can silently remove it in the background. Therefore, the user can attach a compaction filter of `prefix=table1` to the engine, and all these keys will be removed during compaction.
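
A minimal sketch of such a filter and how a compaction job might consult it; the enum shape is illustrative:

```rust
use bytes::Bytes;

// A compaction filter that drops every key with a given prefix, e.g. `table1`.
pub enum CompactionFilter {
    Prefix(Bytes),
}

// During compaction, silently skip any key matched by an attached filter.
fn should_drop(filters: &[CompactionFilter], key: &[u8]) -> bool {
    filters.iter().any(|filter| match filter {
        CompactionFilter::Prefix(prefix) => key.starts_with(prefix.as_ref()),
    })
}
```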
## Task 1: Compaction Filter
@@ -6,11 +6,11 @@
In this part, you will implement MVCC over the LSM engine that you have built in the previous two weeks. We will add timestamp encoding in the keys to maintain multiple versions of a key, and change some part of the engine to ensure old data are either retained or garbage-collected based on whether there are users reading an old version.
The general approach of the MVCC part in this course is inspired and partially based on [BadgerDB](https://github.com/dgraph-io/badger).
The key of MVCC is to store and access multiple versions of a key in the storage engine. Therefore, we will need to change the key format to `user_key + timestamp (u64)`. And on the user interface side, we will need to have new APIs to help users to gain access to a history version. In summary, we will add a monotonically-increasing timestamp to the key.
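
A sketch of this key encoding with one possible ordering (user key ascending, then timestamp descending so the newest version comes first); the actual key types in the starter code differ:

```rust
use std::cmp::Ordering;

// A versioned key: conceptually `user_key + timestamp (u64)`.
#[derive(PartialEq, Eq)]
pub struct KeyTs {
    pub user_key: Vec<u8>,
    pub ts: u64,
}

impl Ord for KeyTs {
    fn cmp(&self, other: &Self) -> Ordering {
        // Order by user key ascending, then timestamp descending, so that among
        // equal user keys the newest version is encountered first during iteration.
        self.user_key
            .cmp(&other.user_key)
            .then_with(|| other.ts.cmp(&self.ts))
    }
}

impl PartialOrd for KeyTs {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```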
In previous parts, we assumed that newer keys are in the upper level of the LSM tree, and older keys are in the lower level of the LSM tree. During compaction, we only keep the latest version of a key if multiple versions are found in multiple levels, and the compaction process will ensure that newer keys will be kept on the upper level by only merging adjacent levels/tiers. In the MVCC implementation, the key with a larger timestamp is the newest key. During compaction, we can only remove the key if no user is accessing an older version of the database. Though not keeping the latest version of a key in the upper level may still yield a correct result for the MVCC LSM implementation, in our course, we choose to keep the invariant, and if there are multiple versions of a key, a later version will always appear in an upper level.
Generally, there are two ways of utilizing a storage engine with MVCC support. If the user uses the engine as a standalone component and does not want to manually assign the timestamps of the keys, they will use transaction APIs to store and retrieve data from the storage engine. Timestamps are transparent to the users. The other way is to integrate the storage engine into the system, where the user manages the timestamps by themselves. To compare these two approaches, we can look at the APIs they provide. We use the terminologies of BadgerDB to describe these two usages: the one that hides the timestamp is *un-managed mode*, and the one that gives the user full control is *managed mode*.
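
To make the contrast concrete, here are two hypothetical API shapes for the same engine; the signatures are illustrative only and not BadgerDB's or Mini-LSM's actual APIs:

```rust
// Un-managed mode: timestamps are hidden behind transactions.
pub trait UnmanagedEngine {
    type Txn;
    fn new_txn(&self) -> Self::Txn;
    fn txn_get(&self, txn: &Self::Txn, key: &[u8]) -> Option<Vec<u8>>;
    fn txn_put(&self, txn: &mut Self::Txn, key: &[u8], value: &[u8]);
    fn commit(&self, txn: Self::Txn) -> bool;
}

// Managed mode: the caller supplies and interprets timestamps directly.
pub trait ManagedEngine {
    fn get_with_ts(&self, key: &[u8], read_ts: u64) -> Option<Vec<u8>>;
    fn put_with_ts(&mut self, key: &[u8], value: &[u8], write_ts: u64);
}
```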