CS 561

Data Systems Architectures


Class at a glance

Class: Tue/Thu 11:00am-12:15pm (CDS 364)
Instructor: Manos Athanassoulis 

Lab: Fri 1:25-2:15pm (MCS B33)
Teaching Fellow: Teona Bagashvili 

Office: CDS 928
Office Hours: Posted on Piazza

Discussion on Piazza / Grades on Gradescope
TF Office Hours: Posted on Piazza

Announcements

  • Semester starts on Jan 21 - stay tuned for updates.
  • See updates and announcements on Piazza.


Class Milestones - Important Dates

Keep in mind the Official Semester Dates.

  • January 31, submit project 0
  • February 14, submit project 1
  • February 23, submit your project proposal
  • February 25, last day to drop (without a "W")
  • March 3 - March 7, meet with your assigned mentor (graded)
  • March 6, Student-led discussion
  • March 22, submit your mid-semester progress report
  • March 27, Student-led discussion
  • March 31 - April 11, meet with your assigned mentor (graded)
  • April 4, last day to drop (with a "W")
  • April 8, Student-led discussion
  • May 2, final submission of project code and report


Class Schedule (tentative)

The tentative class schedule follows; it may change as the semester progresses.

Class : Introduction to Data Systems and CS561

Readings

Class : Data Systems Architectures Essentials – Part 1

Readings

Class : Data Systems Architectures Essentials – Part 2

Readings

Class : LSM intro and Class Project Overview

Readings

A. Storage Layouts

Class : Row-Stores vs. Column-Stores

Readings

Class : Log-Structured Merge (LSM) Trees

Readings

Class : Design Tradeoffs in Key-Value Stores

Readings

B. Indexing

Class : Introduction to Indexing Design

Readings

Class : The design space of data structures

Readings

Class : Guest Lecture on Robust and Learned Tuning: Andy Huynh

Readings

Class : Guest Lecture on Sortedness-Aware Indexing: Aneesh Raman

Readings

Class : Adaptive Radix Trees

Readings

Class : Bitmap indexing (Student Discussion)

Readings

C. Modern Hardware

Class : Modern hardware trends

Readings

Class : Guest Lecture on SSD Design Elements: Teona Bagashvili

Readings

Class : ACE Storage Management

Readings

Class : Serverless Computing (Student Discussion)

Readings

D. Query Processing

Class : Skew-aware Optimal Joins

Readings

Class : Guest Lecture on "Google's Spanner Database", by Ben Vandiver

Abstract: Spanner is Google's global scale database that underlies many of the core systems at Google. Think GMail, Photos, YouTube metadata, Ads. Spanner is built to meet the demanding needs of these applications: over 15 Exabytes of data under storage and over 2B QPS at peak. Additionally, Spanner must be highly reliable and compliant with a variety of regulations. I'll talk about some of the core design decisions and the architecture that resulted. Finally, I'll give some color commentary about the software engineering practices necessary to build databases like Spanner.

Bio: Ben Vandiver (MIT PhD '08) is one of the lead engineers on Google's Spanner development team. For the last 6 years, he's been working to "make Spanner less weird" to its Cloud and internal customers. Prior to joining Google, Ben was the CTO for Vertica, a pioneering columnar database company of Mike Stonebraker lineage. Ben has written code for almost every area of database implementation at some point, some of it even bug-free. When not hacking on databases, Ben enjoys hanging out with his kids and sword fighting.

Readings

  • Slides
  • Spanner: Becoming a SQL System, SIGMOD 2017 (Paper for TQ)
  • Technical Question : How can distributed queries be restarted without buffering intermediate results? Provide 2 concrete examples of SQL queries where restart would lead to repeatable results and 2 queries where the results may differ after a restart. Explain your answer!
  • Spanner: Google’s Globally-Distributed Database, OSDI 2012

Class : BMI-based Query Optimization (Student Discussion)

Readings

Class : Guest lecture on "OSDB: Exposing the Operating System’s Inner Database", by George Neville-Neil

Abstract: Operating systems must provide functionality that closely resembles that of a data management system, but existing query mechanisms are ad-hoc and idiosyncratic. To address this problem, we argue for the adoption of a relational interface to the operating system kernel. While prior work has made similar proposals, our approach is unique in that it allows for incremental adoption over an existing, production-ready operating system. In this paper, we present progress on a prototype system called OSDB that embodies the incremental approach and discuss key aspects of the design, including the data model and concurrency control mechanisms. We present four example use cases: a network usage monitor, a load balancer, file system checker, and network debugging session, as well as experiments that demonstrate the low overhead for our approach.

Bio: George V. Neville-Neil works on networking and operating system code for fun and profit. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly twenty years he has been the columnist better known as Kode Vicious. Since 2014 he has been an Industrial Visitor at the University of Cambridge, where he is involved in several projects relating to computer security. He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth but has been deployed, as part of VxWorks, in NASA’s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City. He is currently a PhD student at Yale University working with Robert Soulé, Avi Silberschatz and Peter Alvaro.

Readings

E. ML For Data Systems

Class : ML for Systems and Learned Query Evaluation

Readings

Class : Guest lecture on "Cloud Data Lakes", by Andrew Lamb

Abstract: In this talk, Andrew will review the architecture of recent cloud analytic database systems, the trends driving these architectures (e.g. cheap durable Object Storage, elastic compute), and the common features in such systems (disaggregated storage, caching layers, metadata stores).

Bio: Andrew currently works in Rust on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He serves on the Apache DataFusion PMC (2024 Chair), and on the Apache Arrow PMC (2023 Chair), and actively contributes to the Apache Arrow DataFusion query engine and the Apache Arrow Rust implementation. He earned a BS and MEng in Course VI from MIT. More details are available at http://andrew.nerdnetworks.org/

Readings

Class : Learned Indexes

Readings

Class : Exam

Click here for the exam guide

Project Presentation

Class : Project Presentations A

Project Presentation - I

Class : Project Presentations B

Project Presentation - II

Project Awards (by popular vote)

Awards

  • Most Engaging Presentation: “Benchmark Compression With Near Sortedness” by Harshitha Tumkur Kailasa Murthy, Vishwas Bhaktavatsala
  • Project with Highest Technical Depth: “Query-Driven Compaction in LSM-Trees” by Karatsenidis Konstantinos, Shubham Kaushik, Nishil Agrawal
  • Best Overall Project: “Range Deletes in LSM-Trees” by Jingyi Li, Ming-Han Hsieh, Yu-Cheng Huang
  • Honorable Mention: “Exploring the Performance of Data Compression Algorithms with Varying Data Sortedness” by Shivangi and Vani Singhal