CS-422 Advanced Databases — THE BIG DATA COURSE (Spring 2013) ‒ DATA ‐ EPFL

Important information

7 credits: 3 (lectures) + 2 (exercises) + 2 (project). This course is taught in English.

We use MOODLE (go here for the course page). The moodle key is “data”. Please enroll as soon as possible if you are a student taking this course!

Lecture dates: Tuesdays 1:15-4pm in INM200. The first lecture will take place on Tuesday Feb. 19, 2013. Exercise dates: Wednesdays 10:15am-noon in INJ218. Project meetings: project teams decide on their own when to meet. (See also this page.)

Course staff: Christoph Koch (lecturer); Abdallah Elguindy, Mohammed ElSeidy, Yannis Klonatos, Daniel Lupei, Milos Nikolic, Andrej Spielmann, and Aleksandar Vitorovic (teaching assistants).

Office hours are by appointment (our email addresses are [email protected]). Please use classes and the breaks between them to ask questions where possible. Teaching assistants will be present in the exercise sessions.

It is important to attend the lectures since we have in-classroom quizzes and tasks, but physically attending the exercises and project meetings is optional.

Getting a grade

This course uses continuous grading — there is no midterm or final exam but attendance of and active participation in the lectures is mandatory. Attending the exercises and project meetings is not mandatory but please keep in mind that the TAs spend a lot of time there so please be nice trying to ask your questions there rather than asking for separate appointments. If you cannot attend the project meetings you have to arrange this with your team.

The grade is determined based on

homework (20%), classroom participation (20%), and quizzes (30%) that will be held periodically in the lectures. Homework has to be submitted on Tuesdays by the start of the lecture.
the course project (30%). The project will be worked on in student teams. The goal is to build a parallel query engine using technologies such mysql, Java, and map-reduce/Hadoop. The amount of coding required for this course will be moderate. Experimentation with a large dataset on a real cluster will be required.

Every week, we will read and discuss one research paper or survey. The homework will consist of questions based on the content of the lectures as well as these papers. There is a quiz every week at the beginning of the lecture; typically you have 20 minutes to work on the quiz. In addition, we have an in-classroom group task every week.

Academic integrity and group work:

Quizzes, homework, and, unless stated otherwise, project work are to be done individually. Collaboration on these will be considered cheating.

The in-classroom tasks are collaborative group work and you are encouraged to work in a team of several (ideally around 5) people and to submit your result as a team. Submissions either on paper in class or by email before the end of the lecture are admissible. Late email submission (after 4pm on Tuesdays) will receive no credit.

We will make a clear distinction between quizzes and classroom tasks (group work) will be made clear. Quizzes are given to you as a sheet of paper on which the problem set is printed, stating clearly that it is a quiz.

All quizzes have equal weight. All homeworks and classroom tasks have equal weight. So if you do very badly on e.g. a quiz, you do not lose more than 30% / 13 of the overall credit that counts towards your grade.

Objectives

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as read and understand recent research publications.

Content

The course consists of 14 modules, one per week. Their TENTATIVE content is as follows:

Course overview. Using query languages; expressing advanced problems as queries. Universal quantifcation, sophisticated uses of aggregation. SQL and relational algebra
Query operators – join, selection, projection, sorting. Join and sorting algorithms. Sequential versus random access to secondary storage; operator cost functions. Memory hierarchies. New hardware.
Query optimization. The System R algorithm. Index selection, physical database design, and database tuning.
Big Data 1: Map-reduce/Hadoop, GFS/HDFS, Bigtable/HBASE
Parallel & distributed databases: Scaling, partitioning, replication, bloom joins.
Data-parallel programming. Data-flow parallelism vs. message passing. Circuit complexity and its interpretation in data-parallel programming. Relational algebra is in AC0. Monad algebra. NESL, DryadLINQ, PigLatin. The bulk-synchronous parallel programming model: Pregel
Big Data 2: massively parallel joins. theta-joins on map-reduce, handling skew; online map-reduce
Data stream processing: DSMS and CEP systems. CQL. Window semantics and window joins. Load shedding. Sampling and approximating aggregates (no joins). Querying histograms. Maintaining histograms of streams. Synopes. Haar wavelets.
Incremental and online query processing: incremental view maintenance: materialized views, delta processing; online aggregation – sampling, ripple joins, error bounding.
OLAP, data cubes. The data warehousing workflow, ETL. Bitmap indexes, join indexes. Analytics and mining: Frequent itemsets (the a-priori algorithm), association rules. Clustering (BIRCH). Decision tree construction.
Concurrency control (CC): transactions. SQL isolation levels. Anomalies. Serializability. 2-phase locking. Optimistic CC. Multiversion CC. Snapshot isolation. From snapshot isolation to serializability.
Recovery: the basic ARIES algorithm. Concurrency control part 2: distributed transactions. 2-phase commit. Eventual consistency, Dynamo. The CAP theorem.
Foundations of query processing and optimization. Homomorphism theorem/CQ optimization, also under integrity constraints (fds, inds); structural decompositions: acyclic queries, GYO algorithm, Yannakakis’ algorithm; (hyper)tree-width and decompositions. Robbers-and-cops games.
Surprise!

The project is in the space of large-scale data analysis and will draw together many of the ideas covered in the course.

Required prior knowledge

A basic course on database systems (e.g. covering parts III, IV, and V of Ramakrishnan and Gehrke on storage and indexing, query processing, and concurrency control).
You absolutely must master SQL, relational algebra, key and foreign key constraints, B-trees, the transaction concept.
Solid programming skills in Java.
Familiarity with working on a Unix-style operating system.
Basic knowledge of linear algebra (vector spaces, matrix multiplication), probability theory and statistics, and complexity theory (complexity classes, reductions, completeness, LOGSPACE, P, NP) are required.

Literature

“THE RED BOOK”: J. Hellerstein & M. Stonebraker, Readings in Database Systems, 4th Edition, 2005. (A table of contents with download links to most papers contained in the book can be found here.)

*and* (

R. Ramakrishnan & J. Gehrke: “Database Management Systems”, McGraw-Hill, 3rd Edition, 2002.

R. Elmasri & S. Navathe: ” Fundamentals of Database Systems “, Benjamin-Cummings, 3rd edition, 2000.

Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom, “Database Systems: The Complete Book”, Prentice-Hall, 2nd edition, 2008.
)