										
									 
								 
							 
							
								
							 
							
								 
							
							
								# Block Builder and Block Iterator
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
<div class="warning">

This is a legacy version of the Mini-LSM tutorial and we will not maintain it anymore. We are working on a new version of this tutorial and this chapter is now part of [Mini-LSM Week 1 Day 3: Blocks](./week1-03-block.md).

</div>
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
<!-- toc -->
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
In this part, you will need to modify:

* `src/block/builder.rs`
* `src/block/iterator.rs`
* `src/block.rs`
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
You can use `cargo x copy-test day1` to copy our provided test cases to the starter code directory. After you have finished this part, use `cargo x scheck` to check the style and run all test cases. If you want to write your own test cases, write a new module `#[cfg(test)] mod user_tests { /* your test cases */ }` in `block.rs`. Remember to remove `#![allow(...)]` at the top of the modules you modified so that `cargo clippy` can actually check the style.
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Task 1 - Block Builder
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
A block is the minimum read unit in an LSM tree. It is generally 4 KB in size, similar to a database page. Each block stores a sequence of sorted key-value pairs.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
You will need to modify `BlockBuilder` in `src/block/builder.rs` to build the encoded data and the offset array. An encoded block consists of a data section and an offsets section, followed by a small piece of metadata (the number of elements).
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
```
---------------------------------------------------------------------
|          data         |          offsets          |      meta     |
|-----------------------|---------------------------|---------------|
|entry|entry|entry|entry|offset|offset|offset|offset|num_of_elements|
---------------------------------------------------------------------
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
When the user adds a key-value pair (an entry) to a block, we need to serialize it into the following format:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
```
-----------------------------------------------------------------------
|                           Entry #1                            | ... |
-----------------------------------------------------------------------
| key_len (2B) | key (keylen) | value_len (2B) | value (varlen) | ... |
-----------------------------------------------------------------------
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
The key length and the value length are each stored in 2 bytes (internally as `u16`), so a key or a value can be at most 65535 bytes long.
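
The entry format above can be sketched as a small helper. This is only an illustration: `encode_entry` is a hypothetical name that does not come from the starter code, and big-endian byte order is an assumption, since the text does not fix one.

```rust
/// Appends one entry to `buf` as: key_len (2B) | key | value_len (2B) | value.
/// Returns `None` and leaves `buf` untouched if a length does not fit in a `u16`.
/// Assumption: lengths are written big-endian; the tutorial text does not say.
fn encode_entry(buf: &mut Vec<u8>, key: &[u8], value: &[u8]) -> Option<()> {
    if key.len() > u16::MAX as usize || value.len() > u16::MAX as usize {
        return None;
    }
    buf.extend_from_slice(&(key.len() as u16).to_be_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(&(value.len() as u16).to_be_bytes());
    buf.extend_from_slice(value);
    Some(())
}
```

Encoding the key `k1` with the value `val` under these assumptions appends 9 bytes: 2 for the key length, 2 for the key, 2 for the value length, and 3 for the value.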
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
We assume that keys are never empty, while values can be empty. To other parts of the system, an empty value means that the corresponding key has been deleted. `BlockBuilder` and `BlockIterator` simply treat an empty value as-is.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
At the end of each block, we store the offset of each entry and the total number of entries. For example, suppose the first entry is at position 0 of the block and the second entry is at position 12:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
```
-------------------------------
|offset|offset|num_of_elements|
-------------------------------
|   0  |  12  |       2       |
-------------------------------
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
The footer of the block will be as shown above. Each number is stored as a `u16`.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
The block has a size limit, `target_size`. Unless the first key-value pair alone exceeds the target block size, you should ensure that the encoded block size is always less than or equal to `target_size`. (In the provided code, `target_size` is essentially the `block_size`.)
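
One way to enforce the size limit is to estimate the encoded size before accepting an entry. The sketch below is a minimal illustration under stated assumptions: the field names, `estimated_size`, and the `bool` return value of `add` are hypothetical, the starter code may differ, lengths are assumed to fit in a `u16`, and big-endian byte order is assumed.

```rust
/// Hypothetical builder; field and method names are illustrative only.
struct BlockBuilder {
    data: Vec<u8>,      // serialized entries
    offsets: Vec<u16>,  // start offset of each entry within `data`
    target_size: usize, // size limit for the encoded block
}

impl BlockBuilder {
    fn new(target_size: usize) -> Self {
        Self { data: Vec::new(), offsets: Vec::new(), target_size }
    }

    /// Size of the block if encoded now: data + 2B per offset + 2B element count.
    fn estimated_size(&self) -> usize {
        self.data.len() + self.offsets.len() * 2 + 2
    }

    /// Adds an entry; returns `false` if the block is full and the entry was rejected.
    /// The first entry is always accepted, even if it exceeds `target_size`.
    /// Assumes key and value lengths fit in a `u16`.
    fn add(&mut self, key: &[u8], value: &[u8]) -> bool {
        assert!(!key.is_empty(), "keys must not be empty");
        // Entry bytes (2B key_len + key + 2B value_len + value) plus 2B for its offset.
        let extra = 2 + key.len() + 2 + value.len() + 2;
        if !self.offsets.is_empty() && self.estimated_size() + extra > self.target_size {
            return false;
        }
        self.offsets.push(self.data.len() as u16);
        self.data.extend_from_slice(&(key.len() as u16).to_be_bytes());
        self.data.extend_from_slice(key);
        self.data.extend_from_slice(&(value.len() as u16).to_be_bytes());
        self.data.extend_from_slice(value);
        true
    }
}
```

When `add` returns `false`, the caller would typically finish the current block and start a new builder for the rejected entry.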
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
The `BlockBuilder` will produce the data part and the unencoded entry offsets when `build` is called. This information will be stored in the `Block` struct. Because key-value entries are stored in raw format and offsets are stored in a separate vector, decoding avoids unnecessary memory allocations and processing overhead: all you need to do is copy the raw block data to the `data` vector and decode the entry offsets every 2 bytes, *instead of* creating something like `Vec<(Vec<u8>, Vec<u8>)>` to store all the key-value pairs of one block in memory. This compact memory layout is very efficient.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
For the encoding and decoding part, you'll need to modify `Block` in `src/block.rs`. Specifically, you are required to implement `Block::encode` and `Block::decode`, which will encode to / decode from the data layout illustrated in the above figures.
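
A possible shape for these two functions is sketched below. The `Block` struct here is a stand-in with assumed fields `data` and `offsets`; the return types, the big-endian byte order, and the lack of error handling are assumptions made for brevity and are not taken from the starter code.

```rust
/// Hypothetical in-memory block: raw entry bytes plus decoded entry offsets.
#[derive(Debug, PartialEq)]
struct Block {
    data: Vec<u8>,
    offsets: Vec<u16>,
}

impl Block {
    /// Layout: data | offsets (2B each) | num_of_elements (2B).
    /// Big-endian byte order is an assumption.
    fn encode(&self) -> Vec<u8> {
        let mut buf = self.data.clone();
        for offset in &self.offsets {
            buf.extend_from_slice(&offset.to_be_bytes());
        }
        buf.extend_from_slice(&(self.offsets.len() as u16).to_be_bytes());
        buf
    }

    /// Reads the footer backwards: the element count first, then the offset array.
    /// Assumes `raw` is a well-formed encoded block and panics otherwise.
    fn decode(raw: &[u8]) -> Self {
        let count_pos = raw.len() - 2;
        let num = u16::from_be_bytes([raw[count_pos], raw[count_pos + 1]]) as usize;
        let offsets_pos = count_pos - num * 2;
        let offsets = raw[offsets_pos..count_pos]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        // The entry bytes are copied as-is; no per-entry parsing happens here.
        Self { data: raw[..offsets_pos].to_vec(), offsets }
    }
}
```

Note that `decode` never parses individual entries, which is the point of keeping the data section raw: entries are only interpreted when an iterator visits them.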
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Task 2 - Block Iterator
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
Given a `Block` object, we will need to extract the key-value pairs. To do this, we create an iterator over a block and find the information we want.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
`BlockIterator` can be created with an `Arc<Block>`. If `create_and_seek_to_first` is called, it will be positioned at the first key in the block. If `create_and_seek_to_key` is called, the iterator will be positioned at the first key that is `>=` the provided key. For example, suppose a block contains the keys `1`, `3`, and `5`:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
```rust
// `block` holds the keys 1, 3, and 5; seeking to 2 lands on the first key >= 2.
let mut iter = BlockIterator::create_and_seek_to_key(block, b"2");
assert_eq!(iter.key(), b"3");
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
The above `seek 2` positions the iterator at the first key that is greater than or equal to `2`, which in this case is `3`.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
The iterator should copy `key` and `value` from the block and store them inside the iterator, so that users can access the key and the value without any extra copy. For example, `fn key(&self) -> &[u8]` directly returns a reference to the locally stored key.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
When `next` is called, the iterator will move to the next position. If we reach the end of the block, we can set `key` to empty and return `false` from `is_valid`, so that the caller can switch to another block if possible.
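
The iterator behavior described in this task can be sketched as follows. This is a simplified illustration, not the starter code: the `Block` struct, the private `seek_to` helper, the big-endian lengths, and the linear scan in `create_and_seek_to_key` are all assumptions, and a binary search over the offsets would work as well.

```rust
use std::sync::Arc;

/// Hypothetical decoded block: raw entry bytes plus entry offsets.
struct Block {
    data: Vec<u8>,
    offsets: Vec<u16>,
}

/// Illustrative iterator; the starter code's fields and signatures may differ.
struct BlockIterator {
    block: Arc<Block>,
    key: Vec<u8>,   // copy of the current key; empty means the iterator is invalid
    value: Vec<u8>, // copy of the current value
    idx: usize,     // index of the current entry in `block.offsets`
}

impl BlockIterator {
    fn create_and_seek_to_first(block: Arc<Block>) -> Self {
        let mut iter = Self { block, key: Vec::new(), value: Vec::new(), idx: 0 };
        iter.seek_to(0);
        iter
    }

    fn create_and_seek_to_key(block: Arc<Block>, key: &[u8]) -> Self {
        let mut iter = Self::create_and_seek_to_first(block);
        // Linear scan for the first key >= `key` (slices compare lexicographically).
        while iter.is_valid() && iter.key() < key {
            iter.next();
        }
        iter
    }

    fn key(&self) -> &[u8] { &self.key }
    fn value(&self) -> &[u8] { &self.value }
    fn is_valid(&self) -> bool { !self.key.is_empty() }
    fn next(&mut self) { self.seek_to(self.idx + 1); }

    /// Loads entry `idx` into `key` and `value`, or clears both when past the end.
    fn seek_to(&mut self, idx: usize) {
        self.idx = idx;
        self.key.clear();
        self.value.clear();
        let offset = match self.block.offsets.get(idx) {
            Some(&o) => o as usize,
            None => return,
        };
        let data = &self.block.data[offset..];
        let key_len = u16::from_be_bytes([data[0], data[1]]) as usize;
        let rest = &data[2 + key_len..];
        let value_len = u16::from_be_bytes([rest[0], rest[1]]) as usize;
        self.key.extend_from_slice(&data[2..2 + key_len]);
        self.value.extend_from_slice(&rest[2..2 + value_len]);
    }
}
```

Because keys are assumed to be non-empty, an empty `key` buffer can double as the "invalid" marker, which keeps `is_valid` trivial.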
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
After implementing this part, you should be able to pass all tests in `block/tests.rs`.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Extra Tasks
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								Here is a list of extra tasks you can do to make the block encoding more robust and efficient.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								*Note: Some test cases might not pass after implementing this part. You might need to write your own test cases.*
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
* Implement block checksum. Verify checksum when decoding the block.
* Compress / Decompress block. Compress on `build` and decompress on decoding.
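
As a starting point for the checksum task, the sketch below appends a 4-byte checksum to an already encoded block and verifies it before decoding. Everything here is an assumption for illustration: `seal` and `unseal` are hypothetical names, and 32-bit FNV-1a is used only as a dependency-free stand-in, while a real implementation would more likely use CRC32 from a crate.

```rust
/// 32-bit FNV-1a hash, used here only as a stand-in checksum.
fn fnv1a32(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5; // FNV offset basis
    for &b in bytes {
        hash ^= b as u32;
        hash = hash.wrapping_mul(0x0100_0193); // FNV prime
    }
    hash
}

/// Appends a 4-byte big-endian checksum after the encoded block bytes.
fn seal(mut encoded: Vec<u8>) -> Vec<u8> {
    let sum = fnv1a32(&encoded);
    encoded.extend_from_slice(&sum.to_be_bytes());
    encoded
}

/// Returns the block bytes without the checksum, or `None` if verification fails.
fn unseal(raw: &[u8]) -> Option<&[u8]> {
    if raw.len() < 4 {
        return None;
    }
    let (body, tail) = raw.split_at(raw.len() - 4);
    let stored = u32::from_be_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if fnv1a32(body) == stored { Some(body) } else { None }
}
```

With this shape, decoding would first call `unseal` and only parse the footer and offsets if it returns `Some`.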