diff --git a/README.md b/README.md
index 23d8a35..2ac6023 100644
--- a/README.md
+++ b/README.md
@@ -28,10 +28,10 @@ The tutorial has 8 parts (which can be finished in 7 days):
 * Day 2: SST encoding.
 * Day 3: MemTable and Merge Iterators.
 * Day 4: Block cache and Engine. To reduce disk I/O and maximize performance, we will use moka-rs to build a block cache
-* for the LSM tree. In this day we will get a functional (but not persistent) key-value engine with `get`, `put`, `scan`,
+  for the LSM tree. In this day we will get a functional (but not persistent) key-value engine with `get`, `put`, `scan`,
   `delete` API.
 * Day 5: Compaction. Now it's time to maintain a leveled structure for SSTs.
 * Day 6: Recovery. We will implement WAL and manifest so that the engine can recover after restart.
 * Day 7: Bloom filter and key compression. They are widely-used optimizations in LSM tree structures.
 
-We have reference solution up to day 4 and tutorial up to day 2 for now.
+We have reference solution up to day 4 and tutorial up to day 3 for now.
diff --git a/mini-lsm-book/src/00-overview.md b/mini-lsm-book/src/00-overview.md
index d73112b..f0b1a56 100644
--- a/mini-lsm-book/src/00-overview.md
+++ b/mini-lsm-book/src/00-overview.md
@@ -97,7 +97,7 @@ In this tutorial, we will build the LSM tree structure in 7 days:
 * Day 2: SST encoding.
 * Day 3: MemTable and Merge Iterators.
 * Day 4: Block cache and Engine. To reduce disk I/O and maximize performance, we will use moka-rs to build a block cache
-* for the LSM tree. In this day we will get a functional (but not persistent) key-value engine with `get`, `put`, `scan`,
+  for the LSM tree. In this day we will get a functional (but not persistent) key-value engine with `get`, `put`, `scan`,
   `delete` API.
 * Day 5: Compaction. Now it's time to maintain a leveled structure for SSTs.
 * Day 6: Recovery. We will implement WAL and manifest so that the engine can recover after restart.
diff --git a/mini-lsm-book/src/03-memtable.md b/mini-lsm-book/src/03-memtable.md
index f24382f..4c70953 100644
--- a/mini-lsm-book/src/03-memtable.md
+++ b/mini-lsm-book/src/03-memtable.md
@@ -19,10 +19,121 @@ in part 4, we will compose all these things together to make a real storage engi
 
 ## Task 1 - Mem Table
 
+In this tutorial, we use [crossbeam-skiplist](https://docs.rs/crossbeam-skiplist) as the implementation of the memtable.
+A skiplist is like a linked list: data is stored in list nodes and is never moved in memory. Instead of using
+a single pointer to the next element, the nodes in a skiplist contain multiple pointers that allow a lookup to "skip some
+elements", so that we can achieve expected `O(log n)` search, insertion, and deletion.
+
+In a storage engine, users create iterators over the data structure. Generally, once a user modifies the data structure,
+the iterator becomes invalid (which is the case for C++ STL and Rust containers). However, skiplists allow us to
+access and modify the data structure at the same time, which can improve performance when there is
+concurrent access. Some papers argue against skiplists, but the property that data stays in
+place in memory makes the implementation easier for us.
+
+In `mem_table.rs`, you will need to implement a mem-table based on crossbeam-skiplist. Note that the memtable only
+supports `get`, `scan`, and `put`, without `delete`. A deletion is represented as a tombstone `key -> empty value`,
+and the actual data is removed during the compaction process (day 5). Note that the `get`, `scan`, and `put` functions
+only need `&self`, which means that we can call these operations concurrently.
+
 ## Task 2 - Mem Table Iterator
 
-## Task 3 - Two-Merge Iterator
+You can now implement an iterator `MemTableIterator` for `MemTable`. `memtable.iter(start, end)` will create an iterator
+that returns all elements within the range `start, end`. Here, `start` is a `std::ops::Bound`, which has 3 variants:
+`Unbounded`, `Included(key)`, `Excluded(key)`. The expressiveness of `std::ops::Bound` eliminates the need to memorize
+whether an API takes a closed range or an open range.
 
-## Task 4 - Merge Iterator
+Note that `crossbeam-skiplist`'s iterator has the same lifetime as the skiplist itself, which means that we will always
+need to provide a lifetime when using the iterator. This makes it hard to use. You can use the `ouroboros` crate to
+create a self-referential struct that erases the lifetime. You will find the [ouroboros examples][ouroboros-example]
+helpful.
+
+[ouroboros-example]: https://github.com/joshua-maros/ouroboros/blob/main/examples/src/ok_tests.rs
+
+```rust
+pub struct MemTableIterator {
+    /// Holds a reference to the skiplist so that the iterator stays valid.
+    map: Arc<SkipList>,
+    /// The iterator then lives exactly as long as the `MemTableIterator` struct itself.
+    iter: SkipList::Iter<'this>,
+}
+```
+
+You will also need to convert the Rust-style iterator API to our storage iterator. In Rust, we use `next() -> Data`. But
+in this tutorial, `next` doesn't have a return value, and the data should be fetched by `key()` and `value()`. You will
+need to think of a way to implement this.
+
+<details>
+<summary>Spoiler: the MemTableIterator struct</summary>
+
+```rust
+#[self_referencing]
+pub struct MemTableIterator {
+    map: Arc<SkipMap<Bytes, Bytes>>,
+    #[borrows(map)]
+    #[not_covariant]
+    iter: SkipMapRangeIter<'this>,
+    item: (Bytes, Bytes),
+}
+```
+
+We have `map` serving as a reference to the skipmap, `iter` as a self-referential item of the struct, and `item` as the
+last item from the iterator. You might have thought of using something like `iter::Peekable`, but it requires `&mut self`
+when retrieving the key and value. Therefore, one approach is to (1) fetch the first element from the iterator when initializing
+the `MemTableIterator` and store it in `item`, and (2) on each call to `next`, fetch the element from the inner iter's `next`, which moves
+the inner iter to the next position, and store it in `item`.
+
+</details>
+
+## Task 3 - Merge Iterator
+
+Now that you have a lot of mem-tables and SSTs, you might want to merge them to get the latest occurrence of a key.
+In `merge_iterator.rs`, we have `MergeIterator`, which is an iterator that merges all iterators *of the same type*.
+An iterator at a lower index in the vector passed to `create` has higher priority. That is to say, if we have:
+
+```
+iter1: 1->a, 2->b, 3->c
+iter2: 1->d
+iter: MergeIterator::create(vec![iter1, iter2])
+```
+
+The final iterator will produce `1->a, 2->b, 3->c`. The data in iter1 will overwrite the data in other iterators.
+
+You can use a `BinaryHeap` to implement this merge iterator. Note that you should never put any invalid iterator inside
+the binary heap. One common pitfall is error handling. For example,
+
+```rust
+if let Some(mut inner_iter) = self.iters.peek_mut() {
+    inner_iter.next()?; // <- will cause problem
+}
+```
+
+If `next` returns an error (e.g., due to disk failure, network failure, or a checksum error), the iterator is no longer valid.
+However, when we leave the `if let` block and return the error to the caller, `PeekMut`'s drop will try to move the
+element within the heap, which causes an access to an invalid iterator. Therefore, you will need to do all error
+handling by yourself instead of using `?` within the scope of `PeekMut`.
+
+You will also need to define a wrapper for the storage iterator so that `BinaryHeap` can compare across all iterators.
+
+## Task 4 - Two Merge Iterator
+
+The LSM has two structures for storing data: the mem-tables in memory, and the SSTs on disk. After we have constructed the
+iterators for all SSTs and all mem-tables respectively, we will need a new iterator to merge iterators of two different
+types. That is `TwoMergeIterator`.
+
+You can implement `TwoMergeIterator` in `two_merge_iter.rs`. Similar to `MergeIterator`, if the same key is found in
+both iterators, the first iterator takes precedence.
+
+In this tutorial, we deliberately avoid trait objects such as `Box<dyn Trait>`, so that there is no dynamic dispatch. This is a
+common optimization in LSM storage engines.
 
 ## Extra Tasks
+
+* Implement a different mem-table, e.g., a B-tree mem-table, and see how it differs from the skiplist. You will notice that it is
+  hard to get an iterator over the B+ tree without holding a lock for the same timespan as the iterator. You might need
+  to think of smart ways of solving this.
+* Async iterator. One interesting thing to explore is whether it is possible to make everything in the storage
+  engine asynchronous. You might run into some lifetime-related problems and need to work around them.
+* Foreground iterator. In this tutorial we assumed that all operations are short, so that we can hold a reference to
+  the mem-table in the iterator. If an iterator is held by users for a long time, the whole mem-table (which might be 256MB)
+  will stay in memory and cannot be freed. To solve this, we can provide a `ForegroundIterator` / `LongIterator` to our user. The
+  iterator will periodically create a new underlying storage iterator so as to allow garbage collection of the resources.
diff --git a/mini-lsm-book/src/SUMMARY.md b/mini-lsm-book/src/SUMMARY.md
index 88a8e39..5375285 100644
--- a/mini-lsm-book/src/SUMMARY.md
+++ b/mini-lsm-book/src/SUMMARY.md
@@ -7,7 +7,7 @@
 - [Store key-value pairs in little blocks](./01-block.md)
 - [And make them into an SST](./02-sst.md)
 - [Now it's time to merge everything](./03-memtable.md)
-- [The engine starts](./04-engine.md)
+- [The engine on fire](./04-engine.md)
 - [Let's do something in the background](./05-compaction.md)
 - [Be careful when the system crashes](./06-recovery.md)
 - [A good bloom filter makes life easier](./07-bloom-filter.md)