CS 561

Data Systems Architectures


Class at a glance

Class: Tu/Th 12:30-1:45pm (CAS 116)
Instructor: Manos Athanassoulis 

Lab: Fri 9:05-9:55am (CAS 213)
Teaching Fellows:
Tarikul Islam Papon  / JuHyoung Mun  / Aneesh Raman 

Office: MCS 106
OH: Tue/Thu 2pm - 3pm ET

Discussion on Piazza
TFs Office Hours: Posted on Piazza
Grades on Gradescope

Announcements



Class Milestones - Important Dates

Keep in mind the Official Semester Dates.

  • February 5th, last day to add (any) class
  • February 9th, select a project
  • February 19th, decide the semester project (which you have discussed in OH)
  • February 28th, submit your project proposal
  • March 1st, last day to drop (without a "W")
  • March 28th, submit your mid-semester progress report
  • April 2nd, last day to drop (with a "W")
  • April 23rd, submit your draft project presentation for review prior to presentation
  • April 25th 11:59pm, submit your final project report/code (draft version)
  • April 27th and April 29th, project presentation (in class)
  • May 3rd 11:59pm, final submission of project code and report


Class Schedule (tentative)

Here you can find the tentative schedule of the class (which might change as the semester progresses).

Class : Introduction to Data Systems and CS591

In this class we will discuss the basics of data systems and the goals and structure of the course.

Readings

Class : Data Systems Architectures Essentials – Part 1

In this class we discuss the fundamental components that comprise a database system. We will see the commonalities and the differences of the main database system architectures and we will discuss why we have several different ones.

Readings

Class : Data Systems Architectures Essentials – Part 2

In this class we continue discussing data systems architectures and the basics for modern systems focusing on relational row-stores and column-stores.

Readings

Class : Class Project Overview

In this class the students will be introduced to the class semester project. In that process we describe in detail LSM-trees and we highlight open research problems in data management.

Readings

A: Storage Layouts

Class : Row-Stores vs. Column-Stores

Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema

Readings

Class : Adaptive & Hybrid Layouts (student presentation)

Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)

Readings

B: Modern Storage Engines

Class : Key-Value Store & Log-Structure Merge-Tree Engines

Readings

Class : Enabling Efficient Deletes in Log-Structured Key-Value Storage (research lecture from Dr. S. Sarkar)

Readings

Class : Key-Value Store for JSON and CSV

Readings

C. Indexing

Class : Introduction to Indexing, Trees & Tries

In this class the instructor will provide the necessary background to indexing. We will describe the most common design principles and decisions of index strutures and provide the background needed for diving into the details of cutting-edge indexing papers.

Readings

Class : Adaptive Radix Trees

Concepts: tree indexing, tries, radix, adaptive radix trees

Readings

  • Slides
  • (P) The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases, ICDE 2013
  • Technical Question: One primary issue with Radix trees is that- with long keys, the space consumption per key can be significantly large even with the implicit prefix compression of keys. How does ART addresses this issue? In other words, what techniques are used by ART to decrease the height by reducing the number of nodes?

Class : Zonemaps & Data Skipping (student presentation)

Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning

Readings

Class : Adaptive Indexing & Cracking (student presentation)

Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution

Readings

Class : Updateable Bitmaps (student presentation)

Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)

Readings

D. Modern Hardware

Class : Modern hardware trends

In this class the instructor will discuss modern hardware trends that drive system and index design with respect to storage, memories, and processing.

Readings

Class : Query Evaluation for Multi-Core (student presentation)

Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join

Readings

Class : Data Processing with GPUs (student presentation)

Concepts: GPUs

Readings

Class : Asymmetry & Concurrency Aware Storage Management (research talk by T. I. Papon)

Concepts: storage, asymmetry, concurrency, bufferpool, eviction policy

Readings

Class : Indexing for Persistent Memories (student presentation)

Concepts: non-volatile memories, indexing

Readings

E. Scientific Data Management

Class : In-Situ Data Processing

Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogenous partitioning, implicit clustering, eviction policy, workload shift, memory consumption

Readings

Class : Multi-dimensional Data Management (student presentation)

Concepts: array management systems, multi-dimensional arrays, storage manager, tiles, thread-safe, process-safe, atomicity, dense vs. sparse arrays, global cell order, fragments, dense vs. sparse fragments, consolidation

Readings

F. Machine Learning for Data Systems

Class : Machine Learning for Data Systems

Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation

Readings

Class : Learned Indexes

Readings

Class : Learned Query Evaluation

Readings

Class : Project Presentations A

Project Presentations

Readings

  • 12:30-12:45 Class Evaluation
  • 12:45-1:05 (A) Deal B+-Trees to Support Sortedness by Sean Brady
  • 1:05-1:25 (B) LSM Implementation by Chenming Shi
  • 1:25-1:45 (C) Learned LSM-Trees by Jason Banks

Class : Project Presentations B

Project Presentations

Readings

  • 12:30-12:50 (D) Query-Driven LSM Compaction by Manish Patel, Chen-Wei Weng, and Al Dahler
  • 12:50-1:10 (E) Bufferpool Implementation by Kaijie Chen
  • 1:10-1:30 (F) Bufferpool Implementation by Haochuan Xiong
  • 1:30-1:45 Closing Remarks