Data Science and Machine Learning at Scale

Datasets are growing, new systems for managing, distributing, and streaming data are being developed, and new architectures for AI applications are emerging. This course will focus on techniques for managing and analyzing large datasets, and on new and emerging architectures for applications in machine learning and data science. Topics include machine learning algorithms and how they must be reformulated to run at scale on petabytes of data, as well as data management and cleaning techniques at scale. In addition to large-scale aspects of data science and machine learning, this course will also cover core concepts of parallel and distributed computing and cloud computing, including hands-on experience with frameworks like Spark, streaming architectures like Flink or Spark Streaming, MLlib, TensorFlow, and more. The course will include programming assignments and a substantial final project requiring students to get hands-on experience with large-scale machine learning pipelines or emerging computing architectures.



  • Zoom link: If you are registered for the course or on the waitlist, you should have received an email from Prof. Miller with a link for joining the lecture. The link is also posted as an announcement on the course’s Canvas page.

Course Information

  • This semester (Fall 2020), 17-400/17-700 will meet online on Tuesdays and Thursdays 8:00-9:20am.

Course Staff

Haithem Turki
OH: Wed 2:00 - 3:00 PM


An introductory course in machine learning, such as 10-601 or 10-701, is a prerequisite or co-requisite. If you plan to take this course and 10-601 concurrently, please tell the instructor. The course will include several substantial programming assignments, so an additional prerequisite is 15-211, 15-214, 17-214, or 17-514, or comparable familiarity with Python/Java and good programming skills.

Course Policies

For all of the course policies, including grading policies, please head to the syllabus.