Class: Tu/Th 12:30-1:45pm, MCS B37
Instructor: Manos Athanassoulis
Office: MCS 106
OH: Tu 3-4pm/F 1-2pm
Discussion on Piazza
TFs Office Hours: available in Piazza
Before each class submit your paper review!
Here you can find the tentative schedule of the class (which might change as the semester progresses).
In this class we will discuss the basics of data systems and the goals and structure of the course.
In this class we discuss the fundamental components that comprise a database system. We will see the commonalities and the differences of the main database system architectures and we will discuss why we have several different ones.
In this class we continue discussing data systems architectures and the basics for modern systems including relational, graph systems, and key-value stores.
In this class the students will be introduced to the class semester project. In that process we describe in detail LSM-trees and we highlight open research problems in data management.
Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema
Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)
In this class the instructor will provide the necessary background to indexing. We will describe the most common design principles and decisions of index strutures and provide the background needed for diving into the details of cutting-edge indexing papers.
Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)
Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning
Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution
Concepts: searching, binary search, interpolation search
from Stavros Papadopoulos, TileDB
Part of Scientific Data Management
Abstract: In this talk I will present TileDB, a storage engine for multi-dimensional dense and sparse arrays. TileDB is open-source (MIT License) and built as a C++ embeddable library. It comes with 6 APIs (C, C++, Python, R, Java, Go), and integrates with Spark, Dask, MariaDB, PrestoDB, PDAL and GDAL. I will first explain the TileDB data format, the internal engine mechanics (parallelism, filters, integrations), and its applicability to data science and analytics. I will then describe TileDB’s value in application domains such as Genomics and Geospatial. Finally, I will show a demo of our recently launched product that offers access control, logging and serverless SQL and UDFs on the cloud.
Bio: Dr. Stavros Papadopoulos is the founder and CEO of TileDB, Inc. Prior to founding TileDB, Inc. in February 2017, Dr. Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab, and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD degree in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoc fellow position at the Chinese University of Hong Kong with Prof. Yufei Tao.
In this class the instructor will discuss modern hardware trends that drive system and index design with respect to storage, memories, and processing.
from Ryan Marcus, MIT
Part of Machine Learning for Data Systems
Abstract: Query optimization is one of the most well-studied problems in database management systems. Because of the exponential problem space and the heuristic nature of current solutions, researchers have recently applied machine learning techniques to this challenging problem domain. Various approaches range from using machine learning to perform cardinality estimation to outright replacing the query optimizer with deep reinforcement learning. Each of these approaches comes with their own unique sets of advantages and constraints. This talk will lay out these recent advancements in a skeptical light (my own work included), highlighting both potential advantages and the myriad of challenges that still need to be overcome to achieve a fully autonomous query optimizer.
Bio: Dr. Ryan Marcus is a postdoc researcher at MIT, working under Tim Kraska. Ryan recently graduated from Brandeis University, where he studied applications of machine learning to cloud databases under Olga Papaemmanouil. Before that, Ryan took courses in gender studies and mathematics at the University of Arizona, while banging his head against supercomputers at Los Alamos National Laboratory. He enjoys long walks down steep gradients, and shorter walks down gentler ones.
Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join
Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogenous partitioning, implicit clustering, eviction policy, workload shift, memory consumption
Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API
from Jialin Ding, MIT
Part of Machine Learning for Data Systems
Date & Time: Outside class schedule: Friday, April 10th at 11am.
Abstract: Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this talk, I will present Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system. Our paper on Flood will appear at SIGMOD 2020. I will also talk about our continuing work on extending the ideas of Flood to address real-world challenges of indexing multi-dimensional data such as data correlation, non-uniform queries, and categorical attributes.
Bio: Jialin Ding is a second-year PhD student in the MIT Data Systems Group, where he is advised by Prof. Tim Kraska. His research focuses broadly on the application of machine learning to data systems. He also collaborates with Umar Farooq Minhas and the Database Group at Microsoft Research on learned data structures. Prior to MIT, Jialin was an undergraduate at Stanford University, where he worked on data-intensive systems with Prof. Peter Bailis at part of Stanford DAWN.
Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partitioning attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID
Systems/Approaches: Hadoop, Spark, YARN, HDFS Hive, Impala, Vectorwise, Actian Vector
Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation
We will spend the first 15 minutes to provide class evaluation!