Data Science and Engineering with Apache Spark

This site contains my work for the Data Science and Engineering with Apache Spark XSeries Program created by UC Berkeley and Databricks.

The program includes three courses:
1. CS105x: Introduction to Apache Spark
2. CS120x: Distributed Machine Learning with Apache Spark
3. CS110x: Big Data Analysis with Apache Spark

My Review of the XSeries Program

These three courses are probably the best online Apache Spark training you can get. Their core value comes from the remarkable labs developed by the course team, who are top Spark experts from UC Berkeley and Databricks. In this series, learning is about 10% lectures and 90% working on the labs, so it is very hands-on and practical. That is how you learn a new programming tool: by doing. You won't learn one by spending most of your time watching lecture videos and reading books without writing real code.

In total, the three courses have 10 labs. Most of them are large and challenging, and any of them can be turned into a small project if you really dive into it. Throughout the series, all labs are completed in notebooks on Databricks Community Edition, which is free. I must say that the Databricks notebook is an awesome tool; data scientists will love it.

Lab Notebooks

All labs are completed using DataFrames.

CS105x: Introduction to Apache Spark

CS120x: Distributed Machine Learning with Apache Spark

CS110x: Big Data Analysis with Apache Spark

RDD Notebooks

The following are the same labs, completed using Resilient Distributed Datasets (RDDs). The purpose is to familiarize myself with the RDD API. Although DataFrames are recommended for most situations because they are more efficient, easier to use, and more expressive, knowing RDDs is still helpful since DataFrames are built on top of RDDs.