finish week 1 day 3+4 block/sst

Signed-off-by: Alex Chi Z <iskyzh@gmail.com>
This commit is contained in:
Alex Chi Z
2024-01-21 14:21:09 +08:00
parent f88394a686
commit 9eb197114d
9 changed files with 237 additions and 39 deletions


@@ -43,7 +43,7 @@ We are working on a new version of the mini-lsm tutorial that is split into 3 we
| 1.1 | Memtables | ✅ | ✅ | ✅ |
| 1.2 | Merge Iterators | ✅ | ✅ | ✅ |
| 1.3 | Block Format | ✅ | ✅ | ✅ |
| 1.4 | Table Format | ✅ | | |
| 1.5 | Storage Engine - Read Path | ✅ | 🚧 | 🚧 |
| 1.6 | Storage Engine - Write Path | ✅ | 🚧 | 🚧 |
| 1.7 | Bloom Filter and Key Compression | | | |


@@ -9,10 +9,73 @@ In this chapter, you will:
## Task 1: SST Builder
In this task, you will need to modify:
```
src/table/builder.rs
src/table.rs
```
SSTs are composed of data blocks and index blocks stored on the disk. Usually, data blocks are lazily loaded -- they will not be loaded into memory until a user requests them. Index blocks can also be loaded on demand, but in this tutorial, we make the simplifying assumption that all SST index blocks (meta blocks) fit in memory (in fact, we do not have a designated index block implementation). Generally, an SST file is around 256MB in size.

The SST builder is similar to the block builder -- users call `add` on the builder. You should maintain a `BlockBuilder` inside the SST builder and split blocks when necessary. You will also need to maintain the block metadata `BlockMeta`, which includes the first/last key of each block and the offset of each block. The `build` function will encode the SST, write everything to disk using `FileObject::create`, and return an `SsTable` object.
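The split-on-full logic can be sketched as follows. This is a minimal stand-in, not the actual mini-lsm API: `MiniBlockBuilder` and `MiniSstBuilder` are hypothetical names, and the crude "sum of key/value lengths" size estimate ignores the offsets and lengths a real block also encodes.

```rust
// A simplified sketch of the block-splitting logic in an SST builder's `add`.
struct MiniBlockBuilder {
    data: Vec<u8>,
    block_size: usize,
}

impl MiniBlockBuilder {
    fn new(block_size: usize) -> Self {
        Self { data: Vec::new(), block_size }
    }

    // Returns false when adding the entry would overflow the target block size
    // (a non-empty block only; a single oversized entry still gets its own block).
    fn add(&mut self, key: &[u8], value: &[u8]) -> bool {
        let entry_size = key.len() + value.len();
        if !self.data.is_empty() && self.data.len() + entry_size > self.block_size {
            return false;
        }
        self.data.extend_from_slice(key);
        self.data.extend_from_slice(value);
        true
    }

    fn build(self) -> Vec<u8> {
        self.data
    }
}

struct MiniSstBuilder {
    builder: MiniBlockBuilder,
    data: Vec<u8>,             // encoded data blocks so far
    block_offsets: Vec<usize>, // where each finished block starts in `data`
    block_size: usize,
}

impl MiniSstBuilder {
    fn new(block_size: usize) -> Self {
        Self {
            builder: MiniBlockBuilder::new(block_size),
            data: Vec::new(),
            block_offsets: Vec::new(),
            block_size,
        }
    }

    fn add(&mut self, key: &[u8], value: &[u8]) {
        if self.builder.add(key, value) {
            return;
        }
        // Current block is full: swap in a fresh builder with `std::mem::replace`,
        // encode the old one, and retry the entry on the fresh block.
        let full = std::mem::replace(&mut self.builder, MiniBlockBuilder::new(self.block_size));
        self.block_offsets.push(self.data.len());
        self.data.extend(full.build());
        assert!(self.builder.add(key, value), "entry should fit in an empty block");
    }
}
```

The `std::mem::replace` call is the trick the starter-code comment hints at: it moves the full builder out while leaving a fresh one in its place, without fighting the borrow checker.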
The encoding of SST is like:
```plaintext
-------------------------------------------------------------------------------------------
| Block Section | Meta Section | Extra |
-------------------------------------------------------------------------------------------
| data block | ... | data block | metadata | meta block offset (u32) |
-------------------------------------------------------------------------------------------
```
You also need to implement the `estimated_size` function of `SsTableBuilder`, so that the caller knows when it can start a new SST to write data. The function does not need to be very accurate. Given the assumption that data blocks contain much more data than the meta block, we can simply return the size of the data blocks for `estimated_size`.

Besides the SST builder, you will also need to complete the encoding/decoding of block metadata, so that `SsTableBuilder::build` can produce a valid SST file.
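For reference, one possible meta-section encoding is a flat list of `offset (u32) | first_key_len (u16) | first_key | last_key_len (u16) | last_key` entries. This layout and the function names below are illustrative assumptions; the exact encoding is up to your implementation.

```rust
// A sketch of encoding/decoding block metadata as a flat byte buffer.
#[derive(Debug, Clone, PartialEq)]
struct BlockMeta {
    offset: usize,
    first_key: Vec<u8>,
    last_key: Vec<u8>,
}

fn encode_block_meta(metas: &[BlockMeta], buf: &mut Vec<u8>) {
    for meta in metas {
        buf.extend_from_slice(&(meta.offset as u32).to_be_bytes());
        buf.extend_from_slice(&(meta.first_key.len() as u16).to_be_bytes());
        buf.extend_from_slice(&meta.first_key);
        buf.extend_from_slice(&(meta.last_key.len() as u16).to_be_bytes());
        buf.extend_from_slice(&meta.last_key);
    }
}

fn decode_block_meta(mut buf: &[u8]) -> Vec<BlockMeta> {
    let mut metas = Vec::new();
    while !buf.is_empty() {
        let offset = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
        buf = &buf[4..];
        let first_key_len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
        buf = &buf[2..];
        let first_key = buf[..first_key_len].to_vec();
        buf = &buf[first_key_len..];
        let last_key_len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
        buf = &buf[2..];
        let last_key = buf[..last_key_len].to_vec();
        buf = &buf[last_key_len..];
        metas.push(BlockMeta { offset, first_key, last_key });
    }
    metas
}
```

Whatever layout you pick, a round-trip (encode then decode yields the same metadata) is the key property `SsTableBuilder::build` and `SsTable::open` rely on.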
## Task 2: SST Iterator
In this task, you will need to modify:
```
src/table/iterator.rs
src/table.rs
```
Like `BlockIterator`, you will need to implement an iterator over an SST. Note that you should load data on demand. For example, if your iterator is at block 1, it should not hold any other block content in memory until it reaches the next block.
`SsTableIterator` should implement the `StorageIterator` trait, so that it can be composed with other iterators in the future.
One thing to note is the `seek_to_key` function. Basically, you will need to do a binary search on the block metadata to find which block might possibly contain the key. It is possible that the key does not exist in the LSM tree, in which case the block iterator will be invalid immediately after a seek. For example,
```plaintext
--------------------------------------
| block 1 | block 2 | block meta |
--------------------------------------
| a, b, c | e, f, g | 1: a/c, 2: e/g |
--------------------------------------
```
We recommend using only the first key of each block for the binary search, so as to reduce the complexity of your implementation. If we do `seek(b)` in this SST, it is quite simple: using binary search, we know that block 1 contains keys `a <= key < e`. Therefore, we load block 1 and seek the block iterator to the corresponding position.

But if we do `seek(d)` using only the first key as the search criterion, we will also position to block 1, and seeking `d` in block 1 will run past the end of the block. Therefore, we should check whether the iterator is invalid after the seek, and switch to the next block if necessary. Alternatively, you can leverage the last-key metadata to position directly to the correct block; it is up to you.
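The first-key binary search above can be written with `partition_point`, mirroring the reference solution's `find_block_idx`. `BlockMeta` here is a simplified stand-in holding only the first key.

```rust
// Find the block that may contain `key` using only first keys.
struct BlockMeta {
    first_key: Vec<u8>,
}

/// Index of the last block whose first key is <= `key`
/// (block 0 if `key` sorts before everything).
fn find_block_idx(metas: &[BlockMeta], key: &[u8]) -> usize {
    metas
        .partition_point(|meta| meta.first_key.as_slice() <= key)
        .saturating_sub(1)
}
```

Note that for the layout above, `seek(d)` resolves to block index 0; it is the caller who then notices the block iterator is invalid and advances to block 1.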
## Task 3: Block Cache
In this task, you will need to modify:
```
src/table/iterator.rs
src/table.rs
```
You can implement a new `read_block_cached` function on `SsTable`.

We use `moka-rs` as our block cache implementation. Blocks are cached with `(sst_id, block_id)` as the cache key. You may use `try_get_with` to get the block from the cache if it hits, or populate the cache if it misses. If multiple requests read the same block and the cache misses, `try_get_with` will only issue a single read request to the disk and broadcast the result to all requests.
At this point, you may change your table iterator to use `read_block_cached` instead of `read_block` to leverage the block cache.
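Since moka is an external crate, here is a minimal std-only stand-in (a hypothetical `BlockCache`, not the tutorial's type alias) that shows the hit/miss contract of `try_get_with`: the loader runs at most once per key. moka's real method additionally de-duplicates concurrent loads without holding a global lock and returns a `Result`.

```rust
// A toy block cache illustrating the idea behind moka's `try_get_with`.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Block = Vec<u8>;
type CacheKey = (usize, usize); // (sst_id, block_id)

struct BlockCache {
    inner: Mutex<HashMap<CacheKey, Arc<Block>>>,
    loads: Mutex<usize>, // counts actual "disk reads", for demonstration only
}

impl BlockCache {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()), loads: Mutex::new(0) }
    }

    // On a miss, run `load` once and cache the result; on a hit, return the
    // cached Arc without invoking `load`. (Holding the lock across `load`
    // serializes readers -- a simplification moka avoids.)
    fn try_get_with(&self, key: CacheKey, load: impl FnOnce() -> Block) -> Arc<Block> {
        let mut map = self.inner.lock().unwrap();
        if let Some(block) = map.get(&key) {
            return block.clone();
        }
        *self.loads.lock().unwrap() += 1;
        let block = Arc::new(load());
        map.insert(key, block.clone());
        block
    }
}
```

In `read_block_cached`, the closure you pass to `try_get_with` would call your existing `read_block` to fetch the block from disk.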
## Test Your Understanding
* What is the time complexity of seeking a key in the SST?
@@ -28,8 +91,10 @@ We do not provide reference answers to the questions, and feel free to discuss a
## Bonus Tasks
* **Explore different SST encoding and layout.** For example, in the [Lethe](https://disc-projects.bu.edu/lethe/) paper, the author adds secondary key support to SST.
* Or you can use B+ Tree as the SST format instead of sorted blocks.
* **Index Blocks.** Split block indexes and block metadata into index blocks, and load them on-demand.
* **Index Cache.** Use a separate cache for indexes apart from the data block cache.
* **I/O Optimizations.** Align blocks to 4KB boundary and use direct I/O to bypass the system page cache.
{{#include copyright.md}}


@@ -79,18 +79,14 @@ impl FileObject {
    }
}

/// An SSTable.
pub struct SsTable {
    /// The actual storage unit of SsTable, the format is as above.
    pub(crate) file: FileObject,
    /// The meta blocks that hold info for data blocks.
    pub(crate) block_meta: Vec<BlockMeta>,
    /// The offset that indicates the start point of meta blocks in `file`.
    pub(crate) block_meta_offset: usize,
    id: usize,
    block_cache: Option<Arc<BlockCache>>,
    first_key: Bytes,
@@ -112,7 +108,7 @@ impl SsTable {
    pub fn create_meta_only(id: usize, file_size: u64, first_key: Bytes, last_key: Bytes) -> Self {
        Self {
            file: FileObject(None, file_size),
            block_meta: vec![],
            block_meta_offset: 0,
            id,
            block_cache: None,
@@ -140,7 +136,7 @@ impl SsTable {
    /// Get number of data blocks.
    pub fn num_of_blocks(&self) -> usize {
        self.block_meta.len()
    }

    pub fn first_key(&self) -> &Bytes {


@@ -7,12 +7,16 @@ use std::sync::Arc;
use anyhow::Result;

use super::{BlockMeta, SsTable};
use crate::{block::BlockBuilder, lsm_storage::BlockCache};

/// Builds an SSTable from key-value pairs.
pub struct SsTableBuilder {
    builder: BlockBuilder,
    first_key: Vec<u8>,
    last_key: Vec<u8>,
    data: Vec<u8>,
    pub(crate) meta: Vec<BlockMeta>,
    block_size: usize,
}

impl SsTableBuilder {
@@ -22,21 +26,22 @@ impl SsTableBuilder {
    }

    /// Adds a key-value pair to SSTable.
    ///
    /// Note: You should split a new block when the current block is full. (`std::mem::replace` may
    /// be helpful here)
    pub fn add(&mut self, key: &[u8], value: &[u8]) {
        unimplemented!()
    }

    /// Get the estimated size of the SSTable.
    ///
    /// Since the data blocks contain much more data than meta blocks, just return the size of data
    /// blocks here.
    pub fn estimated_size(&self) -> usize {
        unimplemented!()
    }
    /// Builds the SSTable and writes it to the given path. Use the `FileObject` structure to
    /// manipulate the disk objects.
    pub fn build(
        self,
        id: usize,


@@ -6,10 +6,14 @@ use std::sync::Arc;
use anyhow::Result;

use super::SsTable;
use crate::{block::BlockIterator, iterators::StorageIterator};

/// An iterator over the contents of an SSTable.
pub struct SsTableIterator {
    table: Arc<SsTable>,
    blk_iter: BlockIterator,
    blk_idx: usize,
}

impl SsTableIterator {
    /// Create a new iterator and seek to the first key-value pair in the first data block.


@@ -73,8 +73,6 @@ impl BlockMeta {
}

/// A file object.
pub struct FileObject(Option<File>, u64);

impl FileObject {
@@ -111,7 +109,7 @@ impl FileObject {
pub struct SsTable {
    file: FileObject,
    block_meta: Vec<BlockMeta>,
    block_meta_offset: usize,
    id: usize,
    block_cache: Option<Arc<BlockCache>>,
@@ -131,12 +129,12 @@ impl SsTable {
        let raw_meta_offset = file.read(len - 4, 4)?;
        let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
        let raw_meta = file.read(block_meta_offset, len - 4 - block_meta_offset)?;
        let block_meta = BlockMeta::decode_block_meta(&raw_meta[..]);
        Ok(Self {
            file,
            first_key: block_meta.first().unwrap().first_key.clone(),
            last_key: block_meta.last().unwrap().last_key.clone(),
            block_meta,
            block_meta_offset: block_meta_offset as usize,
            id,
            block_cache,
@@ -147,7 +145,7 @@ impl SsTable {
    pub fn create_meta_only(id: usize, file_size: u64, first_key: Bytes, last_key: Bytes) -> Self {
        Self {
            file: FileObject(None, file_size),
            block_meta: vec![],
            block_meta_offset: 0,
            id,
            block_cache: None,
@@ -158,9 +156,9 @@ impl SsTable {
    /// Read a block from the disk.
    pub fn read_block(&self, block_idx: usize) -> Result<Arc<Block>> {
        let offset = self.block_meta[block_idx].offset;
        let offset_end = self
            .block_meta
            .get(block_idx + 1)
            .map_or(self.block_meta_offset, |x| x.offset);
        let block_data = self
@@ -183,14 +181,14 @@ impl SsTable {
    /// Find the block that may contain `key`.
    pub fn find_block_idx(&self, key: &[u8]) -> usize {
        self.block_meta
            .partition_point(|meta| meta.first_key <= key)
            .saturating_sub(1)
    }

    /// Get number of data blocks.
    pub fn num_of_blocks(&self) -> usize {
        self.block_meta.len()
    }

    pub fn first_key(&self) -> &Bytes {


@@ -71,8 +71,7 @@ impl SsTableBuilder {
        self.data.extend(encoded_block);
    }

    /// Builds the SSTable and writes it to the given path.
    pub fn build(
        mut self,
        id: usize,
@@ -90,7 +89,7 @@ impl SsTableBuilder {
            file,
            first_key: self.meta.first().unwrap().first_key.clone(),
            last_key: self.meta.last().unwrap().last_key.clone(),
            block_meta: self.meta,
            block_meta_offset: meta_offset,
            block_cache,
        })


@@ -61,9 +61,9 @@ fn test_sst_build_all() {
#[test]
fn test_sst_decode() {
    let (_dir, sst) = generate_sst();
    let meta = sst.block_meta.clone();
    let new_sst = SsTable::open_for_test(sst.file).unwrap();
    assert_eq!(new_sst.block_meta, meta);
}

fn as_bytes(x: &[u8]) -> Bytes {

mini-lsm/src/tests/day4.rs Normal file

@@ -0,0 +1,131 @@
use std::sync::Arc;

use bytes::Bytes;
use tempfile::{tempdir, TempDir};

use crate::iterators::StorageIterator;
use crate::table::{SsTable, SsTableBuilder, SsTableIterator};
#[test]
fn test_sst_build_single_key() {
    let mut builder = SsTableBuilder::new(16);
    builder.add(b"233", b"233333");
    let dir = tempdir().unwrap();
    builder.build_for_test(dir.path().join("1.sst")).unwrap();
}
#[test]
fn test_sst_build_two_blocks() {
    let mut builder = SsTableBuilder::new(16);
    builder.add(b"11", b"11");
    builder.add(b"22", b"22");
    builder.add(b"33", b"11");
    builder.add(b"44", b"22");
    builder.add(b"55", b"11");
    builder.add(b"66", b"22");
    assert!(builder.meta.len() >= 2);
    let dir = tempdir().unwrap();
    builder.build_for_test(dir.path().join("1.sst")).unwrap();
}
fn key_of(idx: usize) -> Vec<u8> {
    format!("key_{:03}", idx * 5).into_bytes()
}

fn value_of(idx: usize) -> Vec<u8> {
    format!("value_{:010}", idx).into_bytes()
}

fn num_of_keys() -> usize {
    100
}
fn generate_sst() -> (TempDir, SsTable) {
    let mut builder = SsTableBuilder::new(128);
    for idx in 0..num_of_keys() {
        let key = key_of(idx);
        let value = value_of(idx);
        builder.add(&key[..], &value[..]);
    }
    let dir = tempdir().unwrap();
    let path = dir.path().join("1.sst");
    (dir, builder.build_for_test(path).unwrap())
}
#[test]
fn test_sst_build_all() {
    generate_sst();
}
#[test]
fn test_sst_decode() {
    let (_dir, sst) = generate_sst();
    let meta = sst.block_meta.clone();
    let new_sst = SsTable::open_for_test(sst.file).unwrap();
    assert_eq!(new_sst.block_meta, meta);
    assert_eq!(new_sst.first_key(), &key_of(0));
    assert_eq!(new_sst.last_key(), &key_of(num_of_keys() - 1));
}
fn as_bytes(x: &[u8]) -> Bytes {
    Bytes::copy_from_slice(x)
}
#[test]
fn test_sst_iterator() {
    let (_dir, sst) = generate_sst();
    let sst = Arc::new(sst);
    let mut iter = SsTableIterator::create_and_seek_to_first(sst).unwrap();
    for _ in 0..5 {
        for i in 0..num_of_keys() {
            let key = iter.key();
            let value = iter.value();
            assert_eq!(
                key,
                key_of(i),
                "expected key: {:?}, actual key: {:?}",
                as_bytes(&key_of(i)),
                as_bytes(key)
            );
            assert_eq!(
                value,
                value_of(i),
                "expected value: {:?}, actual value: {:?}",
                as_bytes(&value_of(i)),
                as_bytes(value)
            );
            iter.next().unwrap();
        }
        iter.seek_to_first().unwrap();
    }
}
#[test]
fn test_sst_seek_key() {
    let (_dir, sst) = generate_sst();
    let sst = Arc::new(sst);
    let mut iter = SsTableIterator::create_and_seek_to_key(sst, &key_of(0)).unwrap();
    for offset in 1..=5 {
        for i in 0..num_of_keys() {
            let key = iter.key();
            let value = iter.value();
            assert_eq!(
                key,
                key_of(i),
                "expected key: {:?}, actual key: {:?}",
                as_bytes(&key_of(i)),
                as_bytes(key)
            );
            assert_eq!(
                value,
                value_of(i),
                "expected value: {:?}, actual value: {:?}",
                as_bytes(&value_of(i)),
                as_bytes(value)
            );
            iter.seek_to_key(&format!("key_{:03}", i * 5 + offset).into_bytes())
                .unwrap();
        }
        iter.seek_to_key(b"k").unwrap();
    }
}