i love questions

Signed-off-by: Alex Chi <iskyzh@gmail.com>
Alex Chi
2024-01-21 00:45:10 +08:00
parent 7b025687ff
commit df35a954c9
13 changed files with 80 additions and 2 deletions


@@ -30,6 +30,8 @@ error handling, order requirement
* Is it possible to implement a Rust-style iterator (i.e., `next(&self) -> (Key, Value)`) for LSM iterators? What are the pros/cons?
* The scan interface is like `fn scan(&self, lower: Bound<&[u8]>, upper: Bound<&[u8]>)`. How can you make this API compatible with Rust-style ranges (i.e., `key_a..key_b`)? If you implement this, try to pass a full range `..` to the interface and see what will happen.
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
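One way to make a `Bound`-based scan API accept Rust-style ranges is to take `impl RangeBounds` and convert it into the bound pair. A minimal sketch, with hypothetical names (not the tutorial's actual signatures):

```rust
use std::ops::{Bound, RangeBounds};

// Hypothetical adapter: accept any Rust range expression and convert it
// into the (lower, upper) bound pair that a scan interface expects.
fn scan_range<R: RangeBounds<Vec<u8>>>(range: R) -> (Bound<Vec<u8>>, Bound<Vec<u8>>) {
    (range.start_bound().cloned(), range.end_bound().cloned())
}

fn main() {
    // A full range `..` yields (Unbounded, Unbounded).
    let (lo, hi) = scan_range(..);
    assert_eq!(lo, Bound::Unbounded);
    assert_eq!(hi, Bound::Unbounded);

    // `key_a..key_b` yields an included lower bound and excluded upper bound.
    let (lo, hi) = scan_range(b"a".to_vec()..b"b".to_vec());
    assert_eq!(lo, Bound::Included(b"a".to_vec()));
    assert_eq!(hi, Bound::Excluded(b"b".to_vec()));
}
```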
## Bonus Task
* **Foreground Iterator.** In this tutorial we assumed that all operations are short, so that we can hold a reference to the memtable in the iterator. If an iterator is held by users for a long time, the whole memtable (which might be 256MB) will stay in memory even after it has been flushed to disk. To solve this, we can provide a `ForegroundIterator` / `LongIterator` to our users. The iterator will periodically recreate the underlying storage iterator to allow garbage collection of the resources.


@@ -16,6 +16,14 @@ In this chapter, you will:
* So `Block` is simply a vector of raw data and a vector of offsets. Can we change them to `Bytes` and `Arc<[u16]>`, and change all the iterator interfaces to return `Bytes` instead of `&[u8]`? What are the pros/cons?
* What is the endianness of the numbers written into the blocks in your implementation?
* Is your implementation prone to a maliciously-built block? Will there be invalid memory accesses, or OOMs, if a user deliberately constructs an invalid block?
* What happens if the user adds a key larger than the target block size?
* Consider the case that the LSM engine is built on object store services (S3). How would you optimize/change the block format and parameters to make it suitable for such services?
* Do you love bubble tea? Why or why not?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
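On the endianness question: whichever byte order you pick, the encoder and decoder just have to agree. A sketch of appending one entry to a block buffer with explicitly little-endian `u16` length fields (a hypothetical helper, one possible layout choice):

```rust
// Hypothetical block-builder helper: entry layout is
// | key_len (u16 LE) | key | value_len (u16 LE) | value |,
// with the entry's starting offset recorded separately.
fn add_entry(data: &mut Vec<u8>, offsets: &mut Vec<u16>, key: &[u8], value: &[u8]) {
    offsets.push(data.len() as u16); // where this entry starts
    data.extend_from_slice(&(key.len() as u16).to_le_bytes());
    data.extend_from_slice(key);
    data.extend_from_slice(&(value.len() as u16).to_le_bytes());
    data.extend_from_slice(value);
}

fn main() {
    let (mut data, mut offsets) = (Vec::new(), Vec::new());
    add_entry(&mut data, &mut offsets, b"key", b"value");
    assert_eq!(offsets, vec![0]);
    // 2 (key len) + 3 (key) + 2 (value len) + 5 (value) = 12 bytes
    assert_eq!(data.len(), 12);
    assert_eq!(&data[0..2], &3u16.to_le_bytes()); // little-endian key length
}
```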
## Bonus Tasks
* **Backward Iterators.**
{{#include copyright.md}}


@@ -17,5 +17,16 @@ In this chapter, you will:
* An SST is usually large (i.e., 256MB). In this case, the cost of copying/expanding the `Vec` would be significant. Does your implementation allocate enough space for your SST builder in advance? How did you implement it?
* Looking at the `moka` block cache, why does it return `Arc<Error>` instead of the original `Error`?
* Does the usage of a block cache guarantee that there will be at most a fixed number of blocks in memory? For example, if you have a `moka` block cache of 4GB and block size of 4KB, will there be more than 4GB/4KB number of blocks in memory at the same time?
* Is it possible to store columnar data (i.e., a table of 100 integer columns) in an LSM engine? Is the current SST format still a good choice?
* Consider the case that the LSM engine is built on object store services (S3). How would you optimize/change the SST format/parameters and the block cache to make it suitable for such services?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
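On the pre-allocation question: a builder can reserve the full target SST size up front, so the buffer never reallocates or copies while entries are added. A sketch with hypothetical names (the 4MB target size is an arbitrary assumption for illustration):

```rust
// Assumed target SST size for this sketch; a real engine might use 256MB.
const TARGET_SST_SIZE: usize = 4 * 1024 * 1024;

struct SsTableBuilder {
    data: Vec<u8>,
}

impl SsTableBuilder {
    fn new() -> Self {
        // Reserve the whole target size once, instead of letting the Vec
        // grow (and copy its contents) repeatedly as blocks are appended.
        Self { data: Vec::with_capacity(TARGET_SST_SIZE) }
    }
}

fn main() {
    let builder = SsTableBuilder::new();
    assert!(builder.data.capacity() >= TARGET_SST_SIZE);
    assert!(builder.data.is_empty());
}
```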
## Bonus Tasks
* **Explore different SST encoding and layout.** For example, in the [Lethe](https://disc-projects.bu.edu/lethe/) paper, the author adds secondary key support to SST. Or you can use B+ Tree as the SST format instead of sorted blocks.
* **Index Blocks.**
* **Index Cache.**
{{#include copyright.md}}


@@ -14,4 +14,10 @@ In this chapter, you will:
## Task 3: Read Path - Scan
## Test Your Understanding
* Consider the case that a user has an iterator that iterates the whole storage engine, and the storage engine is 1TB large, so that it takes ~1 hour to scan all the data. What would be the problems if the user does so? (This is a good question and we will ask it several times at different points of the tutorial...)
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
{{#include copyright.md}}


@@ -11,6 +11,6 @@ In this chapter, you will:
## Task 2: Update the LSM State
## Task 3: Filter the SSTs
{{#include copyright.md}}


@@ -9,4 +9,12 @@ In this chapter, you will:
* Implement a bloom filter on SSTs and integrate it into the LSM read path `get`.
* Implement key compression in the SST block format.
## Test Your Understanding
* How does the bloom filter help with the SST filtering process? What kind of information can it tell you about a key? (may not exist/may exist/must exist/must not exist)
* Consider the case that we need a backward iterator. How does key compression affect backward iterators? Any way to improve it?
* Can you use bloom filters on scan?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
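As a reminder of the semantics behind the first question: a bloom filter can only answer "must not exist" or "may exist", never "must exist". A toy sketch (deliberately simplistic, not the tutorial's bloom format):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy bloom filter: `k` hash functions simulated by hashing (key, i).
struct Bloom {
    bits: Vec<bool>,
    k: u64,
}

impl Bloom {
    fn new(nbits: usize, k: u64) -> Self {
        Self { bits: vec![false; nbits], k }
    }

    fn idx(&self, key: &[u8], i: u64) -> usize {
        let mut h = DefaultHasher::new();
        (key, i).hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.k {
            let j = self.idx(key, i);
            self.bits[j] = true;
        }
    }

    // `false` means "must not exist"; `true` only means "may exist",
    // because unrelated keys can set the same bits (false positives).
    fn may_contain(&self, key: &[u8]) -> bool {
        (0..self.k).all(|i| self.bits[self.idx(key, i)])
    }
}

fn main() {
    let mut bf = Bloom::new(1024, 3);
    bf.insert(b"foo");
    assert!(bf.may_contain(b"foo")); // inserted keys always report "may exist"
}
```

This is also why the filter helps `get` (point lookup on one key) much more directly than `scan`, where the candidate keys in a range are not enumerable.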
{{#include copyright.md}}


@@ -12,6 +12,6 @@ In the first week of the tutorial, you will build necessary storage formats for
* Day 6: Write path. In day 5, the test harness generates the structures, and in day 6, you will control the SST flushes by yourself. You will implement flush to level-0 SST and the storage engine is complete.
* Day 7: SST optimizations. We will implement several SST format optimizations and improve the performance of the system.
At the end of the week, your storage engine should be able to handle all get/scan/put requests. The only missing parts are persisting the LSM state to disk and a more efficient way of organizing the SSTs on the disk. You will have a working **Mini-LSM** storage engine.
{{#include copyright.md}}


@@ -8,4 +8,11 @@ In this chapter, you will:
* Implement the logic to update the LSM states and manage SST files on the filesystem.
* Update LSM read path to incorporate the LSM levels.
## Test Your Understanding
* Is it correct that a key will take some storage space even if a user requests to delete it?
* Given that compaction takes a lot of read and write bandwidth and may interfere with foreground operations, it is a good idea to postpone compaction when there is a large write flow. It is even beneficial to stop/pause existing compaction tasks in this situation. What do you think of this idea? (Read the SILK paper!)
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
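On the first question: in most LSM designs a delete is recorded as a tombstone (an empty value in this sketch), so the key still occupies space until the tombstone is compacted away. A minimal illustration using a plain `BTreeMap` as a stand-in for a memtable:

```rust
use std::collections::BTreeMap;

fn main() {
    // Stand-in for a memtable: sorted map from key to value.
    let mut memtable: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    memtable.insert(b"key".to_vec(), b"value".to_vec());

    // "Delete" writes an empty-value tombstone instead of removing the entry;
    // lower levels may still hold the old value, so the marker must survive.
    memtable.insert(b"key".to_vec(), Vec::new());

    assert_eq!(memtable.len(), 1); // the deleted key still takes storage space
    assert!(memtable.get(&b"key".to_vec()).unwrap().is_empty());
}
```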
{{#include copyright.md}}


@@ -9,6 +9,13 @@ In this chapter, you will:
## Test Your Understanding
* Is it correct that a key will only be purged from the LSM tree if the user requests to delete it and it has been compacted in the bottom-most level?
* Is it a good strategy to periodically do a full compaction on the LSM tree? Why or why not?
* Would it be a good choice to actively compact some old files/levels even if they do not violate the level amplifier? (Look at the Lethe paper!)
* If the storage device can achieve a sustainable 1GB/s write throughput and the write amplification of the LSM tree is 10x, how much throughput can the user get from the LSM key-value interfaces?
* What is your favorite boba shop in your city? (If you answered yes in week 1 day 3...)
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
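For the throughput question, the back-of-envelope reasoning is that every user byte is eventually rewritten roughly WA times by flush and compaction, so sustainable user throughput is device bandwidth divided by write amplification:

```rust
fn main() {
    let device_throughput_gb_s = 1.0_f64; // sustainable device write bandwidth
    let write_amplification = 10.0_f64;   // each user byte written ~10x in total

    // User-visible write throughput = device bandwidth / write amplification.
    let user_throughput_gb_s = device_throughput_gb_s / write_amplification;

    assert!((user_throughput_gb_s - 0.1).abs() < 1e-12); // ~100 MB/s
}
```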
{{#include copyright.md}}


@@ -7,4 +7,14 @@ In this chapter, you will:
* Implement a tiered compaction strategy and simulate it on the compaction simulator.
* Incorporate tiered compaction strategy into the system.
The tiered compaction we talk about in this chapter is the same as RocksDB's universal compaction. We will use the two terms interchangeably.
## Test Your Understanding
* What are the pros/cons of universal compaction compared with simple leveled/tiered compaction?
* How much storage space (compared with the user data size) is required to run universal compaction without using up the storage device space?
* The log-on-log problem.
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
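A hint for the space question: a full compaction must write the new sorted run before the input tiers can be deleted, so peak usage is roughly twice the user data size. A trivial back-of-envelope sketch (the 100GB figure is an arbitrary assumption):

```rust
fn main() {
    let user_data_gb = 100.0_f64; // assumed total user data size

    // During a full compaction, old tiers and the newly written run
    // coexist on disk until the inputs are safely deleted.
    let peak_space_gb = user_data_gb + user_data_gb;

    assert_eq!(peak_space_gb, 200.0); // ~2x the user data size at peak
}
```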
{{#include copyright.md}}


@@ -7,4 +7,10 @@ In this chapter, you will:
* Implement a leveled compaction strategy and simulate it on the compaction simulator.
* Incorporate leveled compaction strategy into the system.
## Test Your Understanding
* Can finding a good key split point for compaction reduce write amplification, or does it not matter at all?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
{{#include copyright.md}}


@@ -7,4 +7,10 @@ In this chapter, you will:
* Implement encoding and decoding of the write-ahead log file.
* Recover memtables from the WALs when the system restarts.
## Test Your Understanding
* When can you tell the user that their modifications (put/delete) have been persisted?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
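On the durability question: a put/delete can only be acknowledged as persisted once its WAL record has been written *and* fsync'ed. A sketch of a record encoder/decoder for a hypothetical record layout (this is an assumed layout, not necessarily the tutorial's; a real WAL would also carry a checksum):

```rust
// Hypothetical WAL record layout:
// | key_len (u16 LE) | key | value_len (u16 LE) | value |
fn encode_record(buf: &mut Vec<u8>, key: &[u8], value: &[u8]) {
    buf.extend_from_slice(&(key.len() as u16).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(&(value.len() as u16).to_le_bytes());
    buf.extend_from_slice(value);
}

// Returns (key, value, bytes consumed) so the caller can replay a
// buffer of concatenated records on recovery.
fn decode_record(buf: &[u8]) -> (&[u8], &[u8], usize) {
    let klen = u16::from_le_bytes([buf[0], buf[1]]) as usize;
    let key = &buf[2..2 + klen];
    let vlen = u16::from_le_bytes([buf[2 + klen], buf[3 + klen]]) as usize;
    let vstart = 2 + klen + 2;
    let value = &buf[vstart..vstart + vlen];
    (key, value, vstart + vlen)
}

fn main() {
    let mut buf = Vec::new();
    encode_record(&mut buf, b"k", b"v");
    let (key, value, consumed) = decode_record(&buf);
    assert_eq!((key, value), (&b"k"[..], &b"v"[..]));
    assert_eq!(consumed, buf.len()); // round-trip consumes the whole record
}
```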
{{#include copyright.md}}


@@ -9,4 +9,11 @@ In this chapter, you will:
* Implement the batch write interface.
* Add checksums to the blocks, SST metadata, manifest, and WALs.
## Test Your Understanding
* Consider the case that an LSM storage engine only provides `write_batch` as the write interface (instead of single put + delete). Is it possible to implement it as follows: there is a single write thread with an mpsc channel receiver to get the changes, and all threads send write batches to the write thread. The write thread is the single point to write to the database. What are the pros/cons of this implementation? (Congrats: if you do so, you get BadgerDB!)
* Is it okay to put all block checksums together at the end of the SST file instead of storing each one along with its block? Why?
We do not provide reference answers to the questions. Feel free to discuss them in the Discord community.
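The single-writer design from the first question can be sketched with the standard library's mpsc channel; all names here are hypothetical, and "applying" a batch is reduced to counting operations:

```rust
use std::sync::mpsc;
use std::thread;

// A write batch is a list of operations.
enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<Op>>();

    // The single write thread: the only place that mutates the database.
    let writer = thread::spawn(move || {
        let mut applied = 0usize;
        for batch in rx {
            // A real engine would apply the batch to the LSM state here,
            // in the order batches arrive on the channel.
            applied += batch.len();
        }
        applied
    });

    // Any number of foreground threads just send batches.
    tx.send(vec![Op::Put(b"a".to_vec(), b"1".to_vec())]).unwrap();
    tx.send(vec![Op::Delete(b"a".to_vec())]).unwrap();
    drop(tx); // close the channel so the writer thread exits

    assert_eq!(writer.join().unwrap(), 2);
}
```

The channel serializes all writes for free (no lock contention on the write path), at the cost of a single-threaded write bottleneck and extra latency for the channel hop.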
{{#include copyright.md}}