diff --git a/mini-lsm-book/src/01-block.md b/mini-lsm-book/src/01-block.md
index bcb7308..9b58e65 100644
--- a/mini-lsm-book/src/01-block.md
+++ b/mini-lsm-book/src/01-block.md
@@ -15,66 +15,79 @@ test cases, write a new module `#[cfg(test)] mod user_tests { /* your test cases
 
 ## Task 1 - Block Builder
 
-Block is the minimum read unit in LSM. It is of 4KB size in general, similar database pages. In each block, we will
-store a sequence of sorted key value pairs.
+Block is the minimum read unit in LSM. It is of 4KB size in general, similar to database pages. In each block, we will
+store a sequence of sorted key-value pairs.
 
-You will need to modify `BlockBuilder` to build the encoded data and the offset array. The block contains two parts:
-data and offsets.
+You will need to modify `BlockBuilder` in `src/block/builder.rs` to build the encoded data and the offset array.
+The block contains two parts: data and offsets.
 
 ```
-|          data         |           offsets         |
+---------------------------------------------------------------------
+|          data         |          offsets          |      meta     |
+|-----------------------|---------------------------|---------------|
 |entry|entry|entry|entry|offset|offset|offset|offset|num_of_elements|
+---------------------------------------------------------------------
 ```
 
 When user adds a key-value pair to a block (which is an entry), we will need to serialize it into the following format:
 
 ```
-|                            entry1                             |
+-----------------------------------------------------------------------
+|                           Entry #1                            | ... |
+-----------------------------------------------------------------------
 | key_len (2B) | key (keylen) | value_len (2B) | value (varlen) | ... |
+-----------------------------------------------------------------------
 ```
 
-Key length and value length are 2B, which means their maximum length is 65536.
+Key length and value length are both 2 bytes, which means their maximum lengths are 65535. (Internally stored as `u16`)
 
 We assume that keys will never be empty, and values can be empty. An empty value means that the corresponding key has
-been deleted in the view of other parts of the system. For the block builder and iterator, we just treat empty value
-as-is.
+been deleted in the view of other parts of the system. For the `BlockBuilder` and `BlockIterator`,
+we just treat the empty value as-is.
 
-At the end of the block, we will store the offsets of each entry and the total number of entries. For example, if
-the first entry is at 0th position of the block, and the second is at 12th position,
+At the end of each block, we will store the offsets of each entry and the total number of entries. For example, if
+the first entry is at 0th position of the block, and the second entry is at 12th position of the block.
 
 ```
+-------------------------------
 |offset|offset|num_of_elements|
+-------------------------------
 |   0  |  12  |       2       |
+-------------------------------
 ```
 
 The footer of the block will be as above. Each of the number is stored as `u16`.
 
 The block has a size limit, which is `target_size`. Unless the first key-value pair exceeds the target block size, you
 should ensure that the encoded block size is always less than or equal to `target_size`.
+(In the provided code, the `target_size` here is essentially the `block_size`)
 
 The `BlockBuilder` will produce the data part and unencoded entry offsets when `build` is called. The information will
-be stored in the `Block` struct. As key-value entries are stored in the raw format and offsets are stored in a separate
-vector, this reduces unnecessary memory allocations and processing overhead when decoding data -- what you need to do
+be stored in the `Block` struct. As key-value entries are stored in raw format and offsets are stored in a separate
+vector, this reduces unnecessary memory allocations and processing overhead when decoding data —— what you need to do
 is to simply copy the raw block data to the `data` vector and decode the entry offsets every 2 bytes, *instead of*
-creating something like `Vec<(Vec<u8>, Vec<u8>)>` to store all the key value pairs in one block in memory. This compact
-memory layout is very efficient. `Block::encode` and `Block::decode` will encode to / decode from the data layout
-illustrated in the above figures.
+creating something like `Vec<(Vec<u8>, Vec<u8>)>` to store all the key-value pairs in one block in memory. This compact
+memory layout is very efficient.
+
+For the encoding and decoding part, you'll need to modify `Block` in `src/block.rs`.
+Specifically, you are required to implement `Block::encode` and `Block::decode`,
+which will encode to / decode from the data layout illustrated in the above figures.
 
 ## Task 2 - Block Iterator
 
-Given a block object, we will need to extract the key-value pairs. To do this, we create an iterator over a block and
+Given a `Block` object, we will need to extract the key-value pairs. To do this, we create an iterator over a block and
 find the information we want.
 
 `BlockIterator` can be created with an `Arc<Block>`. If `create_and_seek_to_first` is called, it will be positioned at
-the first key in the block. If `create_and_seek_to_key` is called, the iterator will be positioned at the first key which
-is `>=` the provided key. For example, if `1, 3, 5` is in a block,
+the first key in the block. If `create_and_seek_to_key` is called, the iterator will be positioned at the first key
+that is `>=` the provided key. For example, if `1, 3, 5` is in a block.
 
 ```rust
 let mut iter = BlockIterator::create_and_seek_to_key(block, b"2");
 assert_eq!(iter.key(), b"3");
 ```
 
-`seek 2` will make the iterator to be positioned at the next available key of `2`, which is `3`.
+The above `seek 2` will make the iterator to be positioned at the next available key of `2`, which in this case is `3`.
 
 The iterator should copy `key` and `value` from the block and store them inside the iterator, so that users can access
 the key and the value without any extra copy with `fn key(&self) -> &[u8]`, which directly returns the reference of the
@@ -92,4 +105,4 @@ Here is a list of extra tasks you can do to make the block encoding more robust
 *Note: Some test cases might not pass after implementing this part. You might need to write your own test cases.*
 
 * Implement block checksum. Verify checksum when decoding the block.
-* Compress / uncompress block. Compress on `build` and uncompress on decoding.
+* Compress / Decompress block. Compress on `build` and decompress on decoding.
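
Review note: the block layout described in Task 1 of this patch can be sanity-checked with a small round trip. The following is only a sketch under stated assumptions, not the tutorial's reference solution. The `Block` struct here is a stand-in holding just `data` and `offsets`, the `push_entry` helper is hypothetical (the tutorial builds blocks through `BlockBuilder`), and big-endian byte order is an assumption, since the text only says that lengths and offsets are stored as `u16`.

```rust
/// Stand-in for the tutorial's `Block`: raw entry bytes plus one offset per entry.
#[derive(Debug, PartialEq)]
pub struct Block {
    pub data: Vec<u8>,
    pub offsets: Vec<u16>,
}

const SIZEOF_U16: usize = std::mem::size_of::<u16>();

impl Block {
    /// Hypothetical helper: appends one entry in the documented format
    /// `key_len (2B) | key | value_len (2B) | value`.
    /// Assumes key and value each fit in a `u16` length (at most 65535 bytes).
    pub fn push_entry(&mut self, key: &[u8], value: &[u8]) {
        // Record where this entry starts inside the data section.
        self.offsets.push(self.data.len() as u16);
        // Big-endian is an assumption; the text only specifies `u16`.
        self.data.extend_from_slice(&(key.len() as u16).to_be_bytes());
        self.data.extend_from_slice(key);
        self.data.extend_from_slice(&(value.len() as u16).to_be_bytes());
        self.data.extend_from_slice(value);
    }

    /// Encodes to the layout `data | offsets | num_of_elements`.
    pub fn encode(&self) -> Vec<u8> {
        let mut buf = self.data.clone();
        for offset in &self.offsets {
            buf.extend_from_slice(&offset.to_be_bytes());
        }
        buf.extend_from_slice(&(self.offsets.len() as u16).to_be_bytes());
        buf
    }

    /// Decodes the same layout. This sketch panics on truncated input.
    pub fn decode(buf: &[u8]) -> Block {
        // The last two bytes hold the number of entries.
        let num_pos = buf.len() - SIZEOF_U16;
        let num = u16::from_be_bytes([buf[num_pos], buf[num_pos + 1]]) as usize;
        // The offsets sit directly before the count, two bytes each.
        let data_end = num_pos - num * SIZEOF_U16;
        let offsets = buf[data_end..num_pos]
            .chunks_exact(SIZEOF_U16)
            .map(|chunk| u16::from_be_bytes([chunk[0], chunk[1]]))
            .collect();
        // Entries stay in raw form; only the offsets are decoded.
        Block {
            data: buf[..data_end].to_vec(),
            offsets,
        }
    }
}
```

With `key1` / `val1` as the first entry (2 + 4 + 2 + 4 = 12 bytes), a second entry starts at offset 12, so the encoded footer holds `0, 12, 2`, which matches the footer example in the patch. Note that `decode` copies the raw data once and never materializes per-entry vectors, which is the memory-layout point the text makes.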