Machine Learning & Data Science at Scale

Datasets are growing, new systems for managing, distributing, and streaming data are being developed, and new architectures for AI applications are emerging. This course will focus on techniques for managing and analyzing large datasets, and on new and emerging architectures for applications in machine learning and data science. Topics include machine learning algorithms and how they must be reformulated to run at scale on petabytes of data, as well as data management and cleaning techniques at scale. In addition to large-scale aspects of data science and machine learning, this course will also cover core concepts of parallel and distributed computing and cloud computing, including hands-on experience with frameworks like Spark, streaming architectures like Flink or Spark Streaming, MLlib, TensorFlow, and more. The course will include programming assignments and a substantial final project requiring students to get hands-on experience with large-scale machine learning pipelines or emerging computing architectures.



  • Lecture link: If you are registered for the course or on the waitlist, you should have received an email from Prof. Miller with a link for joining the lecture. The link is also posted as an announcement on the course's Canvas page.
  • Communication: For questions/discussions, we will be using Slack. All registered students will receive an invite via email. If you have not received an invite, please contact the instructor.

Course Information

  • This semester (Spring 2021), 17-400/17-700 will meet online on Mondays and Wednesdays 10:40am-12:00pm. There will be recitations on Fridays 10:40am-12:00pm when announced (not every Friday).

Course Staff

Heather Miller
Nikhita Nadgauda
Sreedhar Radhakrishnan


An introductory course in machine learning, such as 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently, please tell the instructor. The course will include several substantial programming assignments, so an additional prerequisite is 15-211, 15-214, 17-214, or 17-514, or comparable familiarity with Python/Java and good programming skills.

Course Policies

For all of the course policies, including grading policies, please head to the syllabus.