Big Data (Spring 2014)

Projects

This course is over. Here is a list of the cool projects developed by the students in this course. All project ideas were proposed by students.

Projects with a public repo:

  • DevMine – Evaluating developer skills and potential based on their open-source contributions
    by Robin Hahling (Team Leader), Kevin Gillieron, Laurent Weingart, Hoai Xuan Luong, Frederik Galle, Daniel Espino Timón, and Clément Nicolas Doucet
    (https://github.com/DevMine)

 

Projects without a public repo:

  • PAST: Processing and Storage of Time series
  • Random Trip

Course objectives

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and prepares students to build such systems as well as to use them effectively to address analytics and data science challenges.

Content

  1. Map-reduce/Hadoop, GFS/HDFS, Bigtable/HBASE; Spark.
  2. SQL and relational algebra. Expressing advanced problems as queries. Data-parallel programming. Circuit complexity and its interpretation in data-parallel programming. Monad algebra. NESL, DryadLINQ, PigLatin. Data-flow parallelism vs. message passing. The bulk-synchronous parallel programming model: Pregel.
  3. Data locality. Memory hierarchies. New hardware. Sequential versus random access to secondary storage. Query operators – join, selection, projection, sorting. Join and sorting algorithms.
  4. Query optimization. Index selection. Physical database design. Database tuning.
  5. Parallel & distributed databases: Scaling, partitioning, replication, bloom joins. Massively parallel joins. Theta-joins on map-reduce, handling skew; online map-reduce.
  6. Concurrency control (CC): transactions. SQL isolation levels. Anomalies. Serializability. 2-phase locking. Optimistic CC. Multiversion CC. Snapshot isolation. Distributed transactions. 2-phase commit.
  7. Eventual consistency. The CAP theorem. NoSQL systems. NewSQL systems.
  8. OLAP, data cubes. The data warehousing workflow, ETL. Data mining: Frequent itemsets (the a-priori algorithm), association rules. Clustering. Decision tree construction.
  9. Basics of big data machine learning.
  10. Real-time analytics: Data stream processing: DSMS and CEP systems. CQL. Window semantics and window joins. Load shedding. Sampling and approximating aggregates (no joins). Querying histograms. Maintaining histograms of streams. Synopses. Haar wavelets. Incremental and online query processing: incremental view maintenance: materialized views, delta processing; online aggregation – sampling, ripple joins, error bounding.

The project is in the space of large-scale data analysis and will draw together many of the ideas covered in the course.

Required prior knowledge

  •  A basic course on database systems (e.g. covering parts III, IV, and V of Ramakrishnan and Gehrke on storage and indexing, query processing, and concurrency control).
  • You absolutely must master SQL, relational algebra, key and foreign key constraints, B-trees, and the transaction concept.
  • Solid programming skills in Java.
  • Familiarity with working on a Unix-style operating system.
  • Basic knowledge of linear algebra (vector spaces, matrix multiplication), probability theory and statistics, and complexity theory (complexity classes, reductions, completeness, LOGSPACE, P, NP) is required.

Important Information

7 credits: 3 (lectures) + 2 (exercises) + 2 (project). This course is taught in English.

We use Moodle (go here for the course page). The Moodle key is “data”. Please enroll as soon as possible if you are a student taking this course!

Plenary dates: Tuesdays 1:15-4pm in CE3. The first plenary will take place on Tuesday Feb. 18, 2014. Exercise dates: Wednesdays 10:15am-noon in INJ218. Project meetings: project teams decide on their own when to meet. (See also this page.)

Course staff: Christoph Koch (lecturer); Mohammad Dashti, Mohammed ElSeidy, Milos Nikolic, Amir Shaikhha, and Aleksandar Vitorovic (teaching assistants).

Office hours are by appointment (our email addresses are firstname.lastname@epfl.ch). Please use classes and the breaks between them to ask questions. Teaching assistants will be present in the exercise sessions.

It is important to attend the plenaries since we have in-classroom quizzes and tasks, in addition to lectures. Physically attending the exercises and project meetings is optional. We partially reverse the classroom: some lectures are provided on video and we use the plenaries for group work, discussions, and hands-on work.

This is the successor course of Advanced Databases, which is no longer offered.

Getting a grade

This course uses in-course grading. Attendance at and active participation in the plenaries are mandatory. Attending the exercises is optional, but the TAs spend a lot of time there, so please ask your questions in the exercise sessions rather than requesting separate appointments. If you cannot attend the project meetings, you have to arrange this with your team.

The grade is determined based on

  • 5 homeworks/one-pagers (OPs) (5 * 4% = 20%). Homework has to be done individually (collaboration is considered cheating) and is to be submitted on Tuesdays by the start of the plenary. Each homework will consist either of a one-page essay on some problem in big data or of questions based on the content of the (video and in-plenary) lectures.
  • classroom participation (20%). We have an in-classroom group task in most weeks. Its solution is handed in for grading at the end of the plenary.
  • 5 quizzes + 1 final exam (5 * 2% + 20% = 30%). These are held in class. The final exam takes place in the last plenary, on May 27, 2014 at 1:15pm in SG0211. It will take 90 minutes.
  • the course project (30%). The project will be worked on in student teams.

Homework due dates and quizzes will usually fall in alternate weeks, so, not counting classroom tasks, there is one deliverable per week. (Details on deliverables by week will be posted on Moodle.)

The grade scale is as follows: 6: >= 95%; 5.5: >= 85%; 5: >= 75%; 4.5: >= 65%; 4: >= 50%. Failure below 50%.
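As an illustration only, the scale above can be read as a simple threshold lookup. The sketch below is not an official tool of the course: the names GRADE_SCALE and grade_for are hypothetical, and returning None for a failing score is an assumption, since the scale does not assign a numeric grade below 50%.

```python
from typing import Optional

# Thresholds copied from the grade scale above, highest first:
# (minimum percentage, grade).
GRADE_SCALE = [(95, 6.0), (85, 5.5), (75, 5.0), (65, 4.5), (50, 4.0)]


def grade_for(percent: float) -> Optional[float]:
    """Return the grade for a course percentage, or None for a fail.

    Hypothetical helper: None stands in for "failure below 50%",
    because the scale names no numeric grade for that case.
    """
    for threshold, grade in GRADE_SCALE:
        if percent >= threshold:
            return grade
    return None
```

For example, grade_for(85) falls in the 5.5 band, while a score just under 50% yields None (a fail).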

Missing plenaries

Generally speaking, you must attend the plenaries, since quizzes, the final, and classroom tasks take place there. If you miss a class and bring a certificate from a doctor showing that you were sick, we will compute your grade as if this day did not exist. Overall, you can obtain 100 points in the course. If you, say, miss a class with a quiz (2pt) and a classroom task (~2pt), we would take the score you obtain in the rest of the course and renormalize it by multiplying it by 100/(100-2-2).
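The renormalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the staff's actual grading script: the function name renormalize and its parameters are hypothetical, and the classroom task is treated as worth exactly 2 points, although the text only gives an approximate value.

```python
def renormalize(points_earned: float, points_excused: float,
                total: float = 100.0) -> float:
    """Rescale a score as if the excused points did not exist.

    points_earned:  points obtained in the rest of the course
    points_excused: points attached to the excused plenary
                    (e.g. 2 for a quiz + 2 for a classroom task)
    total:          points obtainable in the whole course (100 here)
    """
    if not 0 <= points_excused < total:
        raise ValueError("excused points must be in [0, total)")
    # Multiply by total / (total - excused), e.g. 100 / (100 - 2 - 2).
    return points_earned * total / (total - points_excused)


# A student who earned 72 of the remaining 96 points after one excused
# plenary (quiz: 2pt, classroom task: assumed 2pt).
score = renormalize(72, 2 + 2)
```

In this example the 72 points are scaled by 100/96, so the student is graded as if they had obtained 75 of 100 points.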

If you miss a class because you present a paper at a research conference, we may treat this like a case of sickness.

However, job interviews and internships are not acceptable reasons for missing classes. You have to schedule these around the course.

The exception is the final exam. If you miss the final because of sickness (the only acceptable reason), you will have to repeat it at another date, possibly orally.

Academic integrity and group work

Quizzes, the final, homework, and, unless stated otherwise, project work are to be done individually. Collaboration on these will be considered cheating.

The in-classroom tasks are collaborative group work; you are encouraged to work in a team of several people and to submit your result as a team. Submissions are accepted either on paper in class or by email before the end of the lecture. Late email submissions (after 4pm on Tuesdays) will receive no credit.

We will make a clear distinction between quizzes and classroom tasks (group work). Quizzes are given to you as a sheet of paper on which the problem set is printed, stating clearly that it is a quiz.

All cases of cheating will be taken very seriously and may lead to your expulsion from the university without graduation. We use plagiarism detection software that is not easy to trick.

In case of doubt in academic integrity matters, ask the instructor.

Acknowledgements

   We thank Microsoft for a Microsoft Azure teaching grant.