



# Asymmetry/Concurrency-Aware Bufferpool Manager for Modern Storage Devices

Tarikul Islam Papon (papon@bu.edu), Manos Athanassoulis (mathan@bu.edu)



#### Data Systems & Hardware




**Memory Hierarchy** 



#### **Evolution of Storage Devices**






#### Hard Disk Drives



- mechanical device
- slow random access
- one block at a time
- write latency ≈ read latency






One I/O at a time






# "Tape is Dead. Disk is Tape. Flash is Disk."

- Jim Gray




| Device   | Size   | Seq B/W  | Time to read |
|----------|--------|----------|--------------|
| HDD 1980 | 100 MB | 1.2 MB/s | ~ 1 min      |
| HDD 2022 | 4 TB   | 125 MB/s | ~ 9 hours    |




HDDs are moving deeper in the memory hierarchy
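The "time to read" column follows directly from capacity divided by sequential bandwidth. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB); the function name `full_scan_seconds` is hypothetical:

```python
def full_scan_seconds(capacity_mb: float, seq_bandwidth_mb_s: float) -> float:
    """Time to read the whole device once at its sequential bandwidth."""
    return capacity_mb / seq_bandwidth_mb_s


# Figures taken from the table above.
hdd_1980 = full_scan_seconds(100, 1.2)                # 100 MB at 1.2 MB/s
hdd_2022 = full_scan_seconds(4 * 1000 * 1000, 125)    # 4 TB at 125 MB/s

print(f"HDD 1980: {hdd_1980 / 60:.1f} minutes")   # prints "HDD 1980: 1.4 minutes"
print(f"HDD 2022: {hdd_2022 / 3600:.1f} hours")   # prints "HDD 2022: 8.9 hours"
```

Capacity grew by a factor of 40,000 while bandwidth grew by roughly 100, which is why a full scan went from about a minute to about nine hours.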



#### Solid State Drives




electronic device

fast random access

concurrent I/Os

write latency > read latency






# Concurrency



#### Internals of an SSD




Parallelism at different levels (channel, chip, die, plane, block, page)




## Read/Write Asymmetry







**Out-of-place** updates cause invalidation

"Erase before write" approach








Writing in a free page isn't costly!












#### Not all updates are costly!





#### What if there is no space?






**Garbage Collection!** 








Higher average update cost (due to GC)  $\rightarrow$  *Read/Write asymmetry* 



## Read/Write Asymmetry

**Out-of-place** updates cause invalidation

"Erase before write" approach

Garbage Collection


Larger erase granularity

All of these result in a higher amortized write cost
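One way to see how garbage collection inflates the amortized write cost is a toy per-block model. This is an illustrative sketch, not a model from the slides: it assumes that reclaiming a block means copying its still-valid pages elsewhere (one read plus one write each), then erasing it, and that this overhead is spread over the pages freed. The function name and all latency values are hypothetical.

```python
def amortized_write_cost(pages_per_block: int, valid_pages: int,
                         t_read: float, t_write: float, t_erase: float) -> float:
    """Cost per user write once garbage collection is charged to the freed pages.

    Assumed toy model: GC relocates `valid_pages` (read + write each),
    erases the block, and frees `pages_per_block - valid_pages` pages.
    """
    if not 0 <= valid_pages < pages_per_block:
        raise ValueError("valid_pages must be in [0, pages_per_block)")
    freed = pages_per_block - valid_pages
    gc_cost = valid_pages * (t_read + t_write) + t_erase
    return t_write + gc_cost / freed


# Hypothetical latencies in arbitrary units: read = 1, write = 2, erase = 20.
for valid in (0, 16, 32, 48):
    cost = amortized_write_cost(64, valid, t_read=1.0, t_write=2.0, t_erase=20.0)
    print(f"{valid:2d} valid pages at GC time -> amortized write cost {cost:.2f}")
```

Under these assumptions the amortized cost of a write rises as more valid pages have to be relocated, while the cost of a read stays fixed, so the gap between the two widens.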








### Read/Write Asymmetry - Example

| Device        | Advertised Rand. Read IOPS | Advertised Rand. Write IOPS | Advertised Asymmetry |
|---------------|----------------------------|-----------------------------|----------------------|
| PCIe D5-P4320 | 427k                         | 36k                           | 11.9                    |
| PCIe DC-P4500 | 626k                         | 51k                           | 12.3                    |
| PCIe P4510    | 465k                         | 145k                          | 3.2                     |
| SATA D3-S4610 | 92k                          | 28k                           | 3.3                     |
| Optane P4800X | 550k                         | 500k                          | 1.1                     |
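The asymmetry column is simply the ratio of advertised random-read IOPS to advertised random-write IOPS. A short sketch that recomputes it from the table's figures; the names `ADVERTISED_IOPS` and `asymmetry` are hypothetical:

```python
# (random read IOPS, random write IOPS) as advertised, from the table above.
ADVERTISED_IOPS = {
    "PCIe D5-P4320": (427_000, 36_000),
    "PCIe DC-P4500": (626_000, 51_000),
    "PCIe P4510": (465_000, 145_000),
    "SATA D3-S4610": (92_000, 28_000),
    "Optane P4800X": (550_000, 500_000),
}


def asymmetry(read_iops: int, write_iops: int) -> float:
    """Read/write asymmetry: how many reads fit in the time of one write."""
    return read_iops / write_iops


for device, (reads, writes) in ADVERTISED_IOPS.items():
    print(f"{device}: {asymmetry(reads, writes):.1f}")
```

Rounded to one decimal place, this reproduces the table's values of 11.9, 12.3, 3.2, 3.3, and 1.1.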































#### Quantifying Asymmetry & Concurrency




#### Empirical Asymmetry and Concurrency


| Device      | α   | k<sub>r</sub> | k<sub>w</sub> |
|-------------|-----|----------------|----------------|
| Optane SSD  | 1.1 | 6              | 5              |
| PCIe SSD    | 2.8 | 80             | 8              |
| SATA SSD    | 1.5 | 25             | 9              |
| Virtual SSD | 2.0 | 11             | 19             |

- "A Parametric I/O Model for Modern Storage Devices", DaMoN 2021 <u>disc.bu.edu/papers/damon21-papon</u>
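To make the three parameters concrete, the sketch below estimates the cost of a batch of I/Os in units of one read. The cost formula is a simplifying assumption for illustration and not necessarily the exact model of the cited paper: I/Os are issued in waves of up to k<sub>r</sub> reads or k<sub>w</sub> writes, a read wave costs 1, and a write wave costs α. `DEVICES` and `batch_cost` are hypothetical names.

```python
import math

# (alpha, k_r, k_w) from the table above.
DEVICES = {
    "Optane SSD": (1.1, 6, 5),
    "PCIe SSD": (2.8, 80, 8),
    "SATA SSD": (1.5, 25, 9),
    "Virtual SSD": (2.0, 11, 19),
}


def batch_cost(device: str, reads: int, writes: int) -> float:
    """Cost in read units, assuming full waves of concurrent I/Os (assumed model)."""
    alpha, k_r, k_w = DEVICES[device]
    read_waves = math.ceil(reads / k_r)
    write_waves = math.ceil(writes / k_w)
    return read_waves + alpha * write_waves


# Eight writes on the PCIe SSD: one concurrent batch vs. one write at a time.
concurrent = batch_cost("PCIe SSD", reads=0, writes=8)
one_at_a_time = sum(batch_cost("PCIe SSD", reads=0, writes=1) for _ in range(8))
print(f"concurrent: {concurrent:.1f}, one at a time: {one_at_a_time:.1f}")
```

Under this assumed model, eight writes issued one at a time cost 8 × 2.8 = 22.4 read units on the PCIe SSD, while the same eight writes issued as a single concurrent batch cost 2.8.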



## Guidelines for Algorithm Design



**Know Thy Device** 


Exploit concurrency (with care)

Treat reads and writes differently

Asymmetry controls performance




# Bufferpool Manager & The Challenge



#### Bufferpool is Tightly Connected to Storage






































## **Traditional Bufferpool Manager**






#### Popular Page Replacement Algorithms

- LRU (most popular)
- LFU, FIFO (simple)
- Clock Sweep (commercial)
- CFLRU
- LRU-WSR



#### CFLRU


















After Eviction:

|           | p7 | p1 | p6 | p5 | p4 | p3 |
|-----------|----|----|----|----|----|----|
|           | C  | D  | D  | C  | D  | D  |
| Cold flag |    | 1  | 1  |    | 0  | 0  |









#### After Eviction:

|           | p7 | p6 | p5 | p4 | p3 | p2 |
|-----------|----|----|----|----|----|----|
|           | C  | D  | C  | D  | D  | C  |
| Cold flag |    | 1  |    | 0  | 0  |    |



### The Challenges

• With write asymmetry, exchanging one write for one read is **NOT ideal**.

• Without exploiting concurrency, the device remains vastly **underutilized**.












# Asymmetry/Concurrency-Aware (ACE) Bufferpool Manager




## ACE Bufferpool Manager



#### Use device's properties








#### IEEE ICDE 2023




## An Example ($k_w = 3$)

Assume $k_w = 3$, LRU is the baseline replacement policy, and red indicates a dirty page.

A write request for page 8 arrives. The **candidate** for eviction is page 9; since it is clean, we simply evict it.



A write request for page 1 arrives.

[Figure: bufferpool state after eviction for the write of page 1, including the LRU baseline.]

Pages 4, 5, and 2 are written concurrently; page 4 is evicted.

## An Example ($k_w = 3$, $n_e = 2$)

[Figure: write of page 1 under LRU, LRU+ACE (w/o PF), and LRU+ACE (w/ PF), showing the eviction window and the bufferpool state after eviction.]

Pages 4, 5, and 2 are written concurrently; pages 4 and 7 are evicted.
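The eviction step in these examples can be sketched in a few lines. This is a reading of the slides, not the authors' implementation: it assumes that a clean candidate is evicted on its own, and that a dirty candidate triggers one concurrent write-back of up to $k_w$ dirty pages in LRU order, followed by eviction of the first $n_e$ pages. The function name `ace_evict`, the requirement $n_e \le k_w$, and the pool contents are hypothetical, chosen so the outcome matches the slides.

```python
from collections import OrderedDict


def ace_evict(pool, k_w, n_e=1):
    """One eviction step. `pool` maps page id -> dirty flag, ordered LRU first.

    Returns (pages written back concurrently, pages evicted).
    """
    if not 1 <= n_e <= k_w:
        raise ValueError("this sketch assumes 1 <= n_e <= k_w")
    lru_order = list(pool)
    candidate = lru_order[0]
    if not pool[candidate]:
        # Clean candidate: no write needed, evict it alone.
        del pool[candidate]
        return [], [candidate]
    # Dirty candidate: write back up to k_w dirty pages in one concurrent batch.
    written = [page for page in lru_order if pool[page]][:k_w]
    for page in written:
        pool[page] = False  # clean after write-back, but still buffered
    # Evict the first n_e pages; all are clean by now because n_e <= k_w.
    evicted = lru_order[:n_e]
    for page in evicted:
        del pool[page]
    return written, evicted


def example_pool():
    # Hypothetical pool state, LRU first: True = dirty, False = clean.
    return OrderedDict([(4, True), (7, False), (5, True), (2, True),
                        (6, False), (8, True), (3, False)])


print(ace_evict(example_pool(), k_w=3))         # ([4, 5, 2], [4])
print(ace_evict(example_pool(), k_w=3, n_e=2))  # ([4, 5, 2], [4, 7])
```

Pages 5 and 2 stay in the bufferpool after being written back, so a later eviction of either page needs no further write. This is what separating write-back from eviction buys.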






#### **Experimental Evaluation**



| Device      | α   | k<sub>r</sub> | k<sub>w</sub> |
|-------------|-----|----------------|----------------|
| Optane SSD  | 1.1 | 6              | 5              |
| PCIe SSD    | 2.8 | 80             | 8              |
| SATA SSD    | 1.5 | 25             | 9              |
| Virtual SSD | 2.0 | 11             | 19             |


#### Workload:

synthesized traces

**TPC-C** benchmark



### **ACE Improves Runtime**

**Device: PCIe SSD**

$\alpha = 2.8$, $k_w = 8$

ACE improves runtime by 22-26%

Negligible increase in buffer misses (<0.009%)

#### Benefit comes at no cost

# Higher Gain for Write-Heavy Workload

**Device: PCIe SSD** 



$\alpha = 2.8$, $k_w = 8$

#### Write-intensive workloads have higher benefit (up to 32%)





## Impact of R/W Ratio & Asymmetry



- more writes, more speedup
- higher asymmetry, higher speedup
- good benefit even for low asymmetry



## Impact of #Concurrent I/Os



**Device: PCIe SSD** 

$\alpha = 2.8$, $k_w = 8$

# Highest speedup when optimal concurrency is used





#### Experimental Evaluation (TPC-C)








#### **ACE Achieves 1.3x for mixed TPC-C**






#### Summary




Decoupled eviction and write-back mechanism

**ACE** works with **any** page replacement policy



Any prefetching technique can be used



With low engineering effort, any DBMS bufferpool can benefit from this approach



## Conclusion & Future Work


Make *asymmetry* and *concurrency* part of *algorithm design* 

... not simply an engineering optimization

Build algorithms/data structures for storage devices with asymmetry *α* and concurrency *k* 





## Thank You!


disc.bu.edu/papers/icde23-papon