CS 591 A1

Data Systems Architectures


Class at a glance

Class: Tu/Th 12:30-1:45pm, MCS B37
Instructor: Manos Athanassoulis 

Teaching Fellows:
Andy Huynh   /  JuHyoung Mun 

Office: MCS 106
OH: Tu 3-4pm/F 1-2pm

Discussion on Piazza
TFs Office Hours: available in Piazza

Announcements



Class Milestones - Important Dates

  • February 3rd, last day to add (any) class
  • February 7th, select a project
  • February 14th, decide the semester project (which you have discussed in OH)
  • February 21st, submit your project proposal
  • February 25th, last day to drop (without a "W")
  • March 21st, submit your mid-semester progress report
  • April 26th, submit your final project report/code (optional)
  • April 28th and April 30th, project presentation (in class)
  • May 8th 11:59pm, final submission of project code and report

Before each class submit your paper review!



Class Schedule (tentative)

Here you can find the tentative schedule of the class (which might change as the semester progresses).

Class : Introduction to Data Systems and CS591

In this class we will discuss the basics of data systems and the goals and structure of the course.

Readings

Class : Data Systems Architectures Essentials – Part 1

In this class we discuss the fundamental components of a database system. We will compare the main database system architectures, looking at what they have in common and where they differ, and discuss why several different architectures exist.

Readings

Class : Data Systems Architectures Essentials – Part 2

In this class we continue discussing data systems architectures and cover the basics of modern systems, including relational systems, graph systems, and key-value stores.

Readings

Class : Class Project Overview

In this class the students will be introduced to the semester project. In the process, we describe LSM-trees in detail and highlight open research problems in data management.

Readings

A. Storage Layouts

Class : Row-Stores vs. Column-Stores

Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema
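
As a small illustration of one concept listed above, run length encoding of a column can be sketched as follows. This is a minimal sketch, not taken from the class readings; the function names rle_encode and rle_decode and the sample column are assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    column = []
    for value, length in runs:
        column.extend([value] * length)
    return column


# Hypothetical sample column; sorted or low-cardinality columns compress best.
column = ["US", "US", "US", "GR", "GR", "US"]
runs = rle_encode(column)
print(runs)
print(rle_decode(runs) == column)
```

Run length encoding pays off mainly in column-stores, where values of a single attribute are stored contiguously and long runs of repeated values are more likely.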

Readings

Class : Adaptive & Hybrid Layouts

Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)

Readings

B. Modern Storage Engines

Class : Designing a Key-Value Store

Readings

Class : Key-Value Store for JSON and CSV

Readings

Class : LSM-based Key-Value Stores

Readings

C. Indexing

Class : Introduction to Indexing, Trees & Tries (presentation by the instructor)

In this class the instructor will provide the necessary background on indexing. We will describe the most common design principles and decisions of index structures and provide the background needed for diving into the details of cutting-edge indexing papers.

Readings

Class : Updateable Bitmaps

Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)

Readings

Class : Zonemaps & Data Skipping

Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning
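
To make the zonemap idea concrete, here is a minimal sketch of data skipping over fixed-size zones. It is an illustration under simple assumptions (an in-memory list of integers and fixed zone sizes); the names build_zonemap and range_scan are hypothetical and do not come from the readings.

```python
def build_zonemap(values, zone_size):
    """Record (start_offset, min, max) for each fixed-size zone of the column."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append((start, min(chunk), max(chunk)))
    return zones


def range_scan(values, zones, zone_size, lo, hi):
    """Return the values in [lo, hi] and the number of zones actually scanned."""
    result, scanned = [], 0
    for start, zone_min, zone_max in zones:
        # Skip the zone when its [min, max] range cannot overlap [lo, hi].
        if zone_max < lo or zone_min > hi:
            continue
        scanned += 1
        result.extend(v for v in values[start:start + zone_size] if lo <= v <= hi)
    return result, scanned


# Hypothetical, roughly clustered column: each zone covers a narrow value range.
values = [1, 3, 2, 4, 10, 12, 11, 13, 20, 22, 21, 23]
zones = build_zonemap(values, zone_size=4)
print(range_scan(values, zones, 4, lo=11, hi=12))
```

The sketch also shows why clustering matters: the more tightly each zone's values are clustered, the narrower its min/max range and the more zones a query can skip.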

Readings

Class : Adaptive Indexing & Cracking

Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution
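
The basic step behind cracking, partitioning a column piece around a query bound as a side effect of answering the query, can be sketched as below. This is a simplified illustration that assumes an in-memory list; the function name crack_in_two and the sample data are assumptions, and the sketch omits the cracker index that a real system would maintain.

```python
def crack_in_two(column, lo, hi, pivot):
    """Partition column[lo:hi] in place so values < pivot precede values >= pivot.

    Returns the split position: the index of the first value >= pivot.
    """
    i, j = lo, hi - 1
    while i <= j:
        if column[i] < pivot:
            i += 1
        else:
            # Move the value that is too large to the end of the unprocessed range.
            column[i], column[j] = column[j], column[i]
            j -= 1
    return i


# Hypothetical column; a query such as "value < 10" triggers the crack.
column = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3]
split = crack_in_two(column, 0, len(column), pivot=10)
print(split, column[:split], column[split:])
```

Each query that cracks the column leaves it slightly more ordered, so later queries with nearby bounds touch smaller pieces.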

Readings

Class : Searching

Concepts: searching, binary search, interpolation search
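
For comparison with binary search, interpolation search can be sketched as follows. This is a minimal sketch assuming a sorted list of integer keys; the function name interpolation_search is an assumption made for illustration.

```python
def interpolation_search(sorted_keys, target):
    """Return the index of target in sorted_keys, or -1 if it is absent."""
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi and sorted_keys[lo] <= target <= sorted_keys[hi]:
        if sorted_keys[lo] == sorted_keys[hi]:
            # All remaining keys are equal; avoid dividing by zero.
            return lo if sorted_keys[lo] == target else -1
        # Estimate the position by assuming keys are spread evenly in [lo, hi].
        pos = lo + (target - sorted_keys[lo]) * (hi - lo) // (
            sorted_keys[hi] - sorted_keys[lo]
        )
        if sorted_keys[pos] == target:
            return pos
        if sorted_keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


keys = [10, 20, 30, 40, 50]
print(interpolation_search(keys, 40))
print(interpolation_search(keys, 35))
```

Where binary search always probes the midpoint, interpolation search guesses the position from the key values, which helps on near-uniform data and can degrade on skewed data.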

Readings

D. Modern Hardware

Class : Modern hardware trends (presentation by the instructor)

In this class the instructor will discuss modern hardware trends that drive system and index design with respect to storage, memories, and processing.

Readings

Class : Query Evaluation for Multi-Core

Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join

Readings

Class : Indexing for Persistent Memories

E. Scientific Data Management

Class : In-Situ Data Processing: Efficiently Accessing Raw Data Files

Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogeneous partitioning, implicit clustering, eviction policy, workload shift, memory consumption

Readings

F. Distributed Data Management

Class : Global Distributed Systems

Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API

Readings

Class : Map/Reduce: Data Management at Scale

Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partition attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID

Systems/Approaches: Hadoop, Spark, YARN, HDFS, Hive, Impala, Vectorwise, Actian Vector

Readings

G. Machine Learning for Data Systems

Class : Learned Tuning

Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation

Readings

Class : Learned Indexes

Readings

Class : Learned Query Evaluation

We will spend the first 15 minutes of this class on the course evaluation!

Readings

Class : Project Presentations A

Presentations

Class : Project Presentations B

Presentations

Project Awards (by popular vote)

Awards