CAS CS561 -- Data Systems Architectures @ Boston University

Abstract: Spanner is Google's global-scale database that underlies many of the core systems at Google. Think Gmail, Photos, YouTube metadata, and Ads. Spanner is built to meet the demanding needs of these applications: over 15 exabytes of data in storage and over 2B QPS at peak. Additionally, Spanner must be highly reliable and compliant with a variety of regulations. I'll talk about some of the core design decisions and the architecture that resulted. Finally, I'll give some color commentary about the software engineering practices necessary to build databases like Spanner.

Bio: Ben Vandiver (MIT PhD '08) is one of the lead engineers on Google's Spanner development team. For the last 6 years, he's been working to "make Spanner less weird" to its Cloud and internal customers. Prior to joining Google, Ben was the CTO for Vertica, a pioneering columnar database company of Mike Stonebraker lineage. Ben has written code for almost every area of database implementation at some point, some of it even bug-free. When not hacking on databases, Ben enjoys hanging out with his kids and sword fighting.

Readings

Class : BMI-based Query Optimization (Student Discussion)

Readings

Class : Guest lecture on "OSDB: Exposing the Operating System’s Inner Database", by George Neville-Neil

Abstract: Operating systems must provide functionality that closely resembles that of a data management system, but existing query mechanisms are ad-hoc and idiosyncratic. To address this problem, we argue for the adoption of a relational interface to the operating system kernel. While prior work has made similar proposals, our approach is unique in that it allows for incremental adoption over an existing, production-ready operating system. In this paper, we present progress on a prototype system called OSDB that embodies the incremental approach and discuss key aspects of the design, including the data model and concurrency control mechanisms. We present four example use cases: a network usage monitor, a load balancer, a file system checker, and a network debugging session, as well as experiments that demonstrate the low overhead of our approach.

Bio: George V. Neville-Neil works on networking and operating system code for fun and profit. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly twenty years he has been the columnist better known as Kode Vicious. Since 2014 he has been an Industrial Visitor at the University of Cambridge, where he is involved in several projects relating to computer security. He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth but has been deployed, as part of VxWorks, in NASA’s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City. He is currently a PhD student at Yale University working with Robert Soulé, Avi Silberschatz and Peter Alvaro.

Readings

E. ML For Data Systems

Class : ML for Systems and Learned Query Evaluation

Readings

Class : Guest lecture on "Cloud Data Lakes", by Andrew Lamb

Abstract: In this talk, Andrew will review the architecture of recent cloud analytic database systems, the trends driving these architectures (e.g., cheap durable object storage and elastic compute), and the common features of such systems (disaggregated storage, caching layers, and metadata stores).

Bio: Andrew currently works in Rust on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He serves on the Apache DataFusion PMC (2024 Chair), and on the Apache Arrow PMC (2023 Chair), and actively contributes to the Apache Arrow DataFusion query engine and the Apache Arrow Rust implementation. He earned a BS and MEng in Course VI from MIT. More details are available at http://andrew.nerdnetworks.org/

Readings

Class : Learned Indexes

Readings

Class : Exam


Project Presentation

Class : Project Presentations A

Project Presentation - I

Class : Project Presentations B

Project Presentation - II


Class Schedule (tentative)

Class : Introduction to Data Systems and CS561

Readings

Class : Data Systems Architectures Essentials – Part 1

Readings

Class : Data Systems Architectures Essentials – Part 2

Readings

Class : LSM intro and Class Project Overview

Readings

A. Storage Layouts

Class : Row-Stores vs. Column-Stores

Readings

Class : Log-Structured Merge (LSM) Trees

Readings

Class : Design Tradeoffs in Key-Value Stores

Readings

B. Indexing

Class : Introduction to Indexing Design

Readings

Class : The design space of data structures

Readings

Class : Guest Lecture on Robust and Learned Tuning: Andy Huynh

Readings

Class : Guest Lecture on Sortedness-Aware Indexing: Aneesh Raman

Readings

Class : Adaptive Radix Trees

Readings

Class : Bitmap indexing (Student Discussion)

Readings

C. Modern Hardware

Class : Modern hardware trends

Readings

Class : Guest Lecture on SSD Design Elements: Teona Bagashvili

Readings

Class : ACE storage Management

Readings

Class : Serverless Computing (Student Discussion)

Readings

D. Query Processing

Class : Skew-aware Optimal Joins

Readings

Class : Guest Lecture on "Google's Spanner Database", by Ben Vandiver

Readings

Class : BMI-based Query Optimization (Student Discussion)

Readings

Class : Guest lecture on "OSDB: Exposing the Operating System’s Inner Database", by George Neville-Neil

Readings

E. ML For Data Systems

Class : ML for Systems and Learned Query Evaluation

Readings

Class : Guest lecture on "Cloud Data Lakes", by Andrew Lamb

Readings

Class : Learned Indexes

Readings

Class : Exam

Project Presentation

Class : Project Presentations A

Class : Project Presentations B

Project Awards (by popular vote)

Awards
