Class
: Introduction to Data Systems and CS561
Readings
Class
: Data Systems Architectures Essentials – Part 1
Readings
- Slides
- Architecture of a Database System, Foundations and Trends in Databases, 2007
- The Design and Implementation of Modern Column-Oriented Database Systems, Foundations and Trends in Databases, 2012
- Data Structures for Data-Intensive Applications: Tradeoffs and Design Guidelines, Foundations and Trends in Databases, 2023
- Massively Parallel Databases and MapReduce Systems, Foundations and Trends in Databases, 2013
- The Seattle Report on Database Research, SIGMOD Record, 2022
- The Cambridge Report on Database Research, 2025
Class
: Data Systems Architectures Essentials – Part 2
Readings
Class
: LSM intro and Class Project Overview
Readings
A. Storage Layouts
Class
: Row-Stores vs. Column-Stores
Readings
Class
: Guest Lecture on SSD Design Elements: Teona Bagashvili
Readings
Class
: Log-Structured Merge (LSM) Trees & Compaction
Readings
Class
: Deletes on LSM Trees
Readings
Class
: Scans in Key-Value Stores
Readings
B. Indexing
Class
: Cancelled due to snow day.
Class
: Various forms of Indexing
Readings
Class
: Sortedness-Aware Indexing
Readings
Class
: Adaptive Radix Trees
Readings
Class
: Bitmap indexing
Readings
C. Modern Hardware
Class
: Modern hardware trends
Readings
Class
: ACE Bufferpool
Readings
Class
: Guest Lecture on "From Filters to Hash Tables: Rethinking Core Data Structures for Scalable Performance" (Prashant Pandey)
Abstract: Our ability to generate, acquire, and store data has grown exponentially over the past decade, making the scalability of data systems a major challenge. In this talk, I will present my work on addressing this challenge through novel data structures and algorithms. First, I will introduce Monotonic Adaptive Filters, which address long-standing limitations in traditional filters by dynamically adapting to false positives while guaranteeing a maximum false positive rate, regardless of the query distribution. Next, I will discuss our advancements in modern hash tables, including IcebergHT and ZombieHT, which break traditional trade-offs by providing high performance with strong worst-case latency guarantees.
Bio: Pandey is an assistant professor in the Khoury College of Computer Sciences at Northeastern University. He focuses on creating scalable data systems with robust theoretical foundations. His work spans the entire spectrum of this challenge, from exploring the theoretical aspects of data structures to addressing the practical issues of scaling data systems. His work extends to tackling scalability challenges across computational biology, cybersecurity, stream processing, and storage systems. Pandey has received the NSF CAREER Award, NSF Elements Award, and the IEEE-CS Early Career Researchers Award for Excellence in High Performance Computing. Prior to joining Khoury College, he spent a year as a research scientist at VMware Research and held postdoctoral research positions at UC Berkeley and Carnegie Mellon University.
Readings
D. Student Talks
Class
: Rethinking The Compaction Policies in LSM-trees
(Student Discussion SD1)
Readings
Class
: How to Grow an LSM-tree? Towards Bridging the Gap Between Theory and Practice
(Student Discussion SD2)
Readings
Class
: Logical and Physical Optimizations for SQL Query Execution over Large Language Models
(Student Discussion SD3)
Readings
Class
: Optimizing LLM Queries in Relational Data Analytics Workloads
(Student Discussion SD4)
Readings
E. ML For Data Systems
Class
: ML for Systems and Learned Query Evaluation
Readings
- Slides
- Technical Question: Given the query "SELECT SUM(revenue) FROM sales WHERE date BETWEEN '2023-01-01' AND '2023-01-31'" where date is an ordinal categorical attribute, explain how DBEst processes this query using its internal models. Specifically, what kind of model(s) must have been prebuilt? Justify your answer.
- DBEst: Revisiting Approximate Query Processing Engines with Machine Learning Models, SIGMOD 2019 (Paper for TQ)
- Automatic Database Management System Tuning Through Large-scale Machine Learning, SIGMOD 2017 [conf. present. video]
- AB-tree: Index for Concurrent Random Sampling and Updates, VLDB 2022
- Approximate Queries over Concurrent Updates, VLDB 2023 (demo of AB-tree)
- Self-Driving Database Management Systems, CIDR 2017
- Learning Multi-Dimensional Indexes, SIGMOD 2020 [conf. present. video]
- The Case for Learned Index Structures, SIGMOD 2018 [conf. present. video]
- ML-In-Databases: Assessment and Prognosis, IEEE Data Engineering Bulletin, 44(1), March 2021
- Neo: A Learned Query Optimizer, VLDB 2019
- SageDB: A Learned Database System, CIDR 2019
- From Auto-tuning One Size Fits All to Self-designed and Learned Data-intensive Systems, SIGMOD 2019
Class
: Learned Indexes
Readings
Class
: Guest Lecture on "Space Efficient Secondary Learned Indexes" (Anwesha Saha)
Readings
Class
: Exam
Click here for the exam guide
Project Presentation
Class
: Project Presentations A
Project Presentation - I
Class
: Project Presentations B
Project Presentation - II
Project Awards (by popular vote)
As is tradition for the class, the projects that generated the most excitement among class members receive awards by popular vote!
Awards