CAS CS591 A1 -- Data Systems Architectures @ Boston University

Guest Lecture: The TileDB array data storage engine

Part of Scientific Data Management

Abstract: In this talk I will present TileDB, a storage engine for multi-dimensional dense and sparse arrays. TileDB is open-source (MIT License) and built as a C++ embeddable library. It comes with 6 APIs (C, C++, Python, R, Java, Go), and integrates with Spark, Dask, MariaDB, PrestoDB, PDAL and GDAL. I will first explain the TileDB data format, the internal engine mechanics (parallelism, filters, integrations), and its applicability to data science and analytics. I will then describe TileDB’s value in application domains such as Genomics and Geospatial. Finally, I will show a demo of our recently launched product that offers access control, logging and serverless SQL and UDFs on the cloud.

Bio: Dr. Stavros Papadopoulos is the founder and CEO of TileDB, Inc. Prior to founding TileDB, Inc. in February 2017, Dr. Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab, and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD degree in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoc fellow position at the Chinese University of Hong Kong with Prof. Yufei Tao.

Readings

The TileDB Array Data Storage Manager, VLDB 2017

Guest Lecture: Machine Learning for Query Optimization: A Few Interesting Results & Thousands of Practical Barriers

from Ryan Marcus, MIT

Part of Machine Learning for Data Systems

Abstract: Query optimization is one of the most well-studied problems in database management systems. Because of the exponential problem space and the heuristic nature of current solutions, researchers have recently applied machine learning techniques to this challenging problem domain. Approaches range from using machine learning to perform cardinality estimation to outright replacing the query optimizer with deep reinforcement learning. Each of these approaches comes with its own unique set of advantages and constraints. This talk will lay out these recent advancements in a skeptical light (my own work included), highlighting both potential advantages and the myriad challenges that still need to be overcome to achieve a fully autonomous query optimizer.

Bio: Dr. Ryan Marcus is a postdoc researcher at MIT, working under Tim Kraska. Ryan recently graduated from Brandeis University, where he studied applications of machine learning to cloud databases under Olga Papaemmanouil. Before that, Ryan took courses in gender studies and mathematics at the University of Arizona, while banging his head against supercomputers at Los Alamos National Laboratory. He enjoys long walks down steep gradients, and shorter walks down gentler ones.

Readings

Slides from Ryan Marcus

Guest Lecture 3: Learning Multi-dimensional Indexes

from Jialin Ding, MIT

Part of Machine Learning for Data Systems

Date & Time: Outside class schedule: Friday, April 10th at 11am.

Click here for the recording!

Abstract: Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this talk, I will present Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system. Our paper on Flood will appear at SIGMOD 2020. I will also talk about our continuing work on extending the ideas of Flood to address real-world challenges of indexing multi-dimensional data such as data correlation, non-uniform queries, and categorical attributes.

Bio: Jialin Ding is a second-year PhD student in the MIT Data Systems Group, where he is advised by Prof. Tim Kraska. His research focuses broadly on the application of machine learning to data systems. He also collaborates with Umar Farooq Minhas and the Database Group at Microsoft Research on learned data structures. Prior to MIT, Jialin was an undergraduate at Stanford University, where he worked on data-intensive systems with Prof. Peter Bailis as part of Stanford DAWN.

Readings

Slides from Jialin Ding

Class at a glance

Announcements

Class Milestones - Important Dates

Class Schedule (tentative)

Class: Introduction to Data Systems and CS591

Readings

Class: Data Systems Architectures Essentials – Part 1

Readings

Class: Data Systems Architectures Essentials – Part 2

Readings

Class: Class Project Overview

Readings

A. Storage Layouts

Class: Row-Stores vs. Column-Stores

Readings

Class: Adaptive & Hybrid Layouts

Readings

B. Modern Storage Engines

Class: Designing a Key-Value Store

Readings

Class: Key-Value Store for JSON and CSV

Readings

Class: LSM-based Key-Value Stores

Readings

C. Indexing

Class: Introduction to Indexing, Trees & Tries (presentation by the instructor)

Readings

Class: Updateable Bitmaps

Readings

Class: Zonemaps & Data Skipping

Readings

Class: Adaptive Indexing & Cracking

Readings

Class: Searching

Readings

Guest Lecture: The TileDB array data storage engine

Readings

D. Modern Hardware

Class: Modern hardware trends (presentation by the instructor)

Readings

Guest Lecture: Machine Learning for Query Optimization: A Few Interesting Results & Thousands of Practical Barriers

Readings

Class: Query Evaluation for Multi-Core

Readings

Class: Indexing for Persistent Memories

E. Scientific Data Management

Class: In-Situ Data Processing: Efficiently Accessing Raw Data Files

Readings

F. Distributed Data Management

Class: Global Distributed Systems

Readings

Guest Lecture 3: Learning Multi-dimensional Indexes

Readings

Class: Map/Reduce: Data Management at Scale

Readings

G. Machine Learning for Data Systems

Class: Learned Tuning

Readings

Class: Learned Indexes

Readings

Class: Learned Query Evaluation

Readings

Class: Project Presentations A

Presentations

Class: Project Presentations B

Presentations

Project Awards (by popular vote)

Awards
