Data Science and Engineering with Apache Spark
This site contains my work for the Data Science and Engineering with Apache Spark XSeries Program created by UC Berkeley and Databricks.
The program includes three courses:
1. CS105x: Introduction to Apache Spark
2. CS120x: Distributed Machine Learning with Apache Spark
3. CS110x: Big Data Analysis with Apache Spark
My Review of the XSeries Program
These three courses are probably the best Apache Spark online training courses you can get. The core value of these three courses comes from the remarkable labs developed by the course team, who are top Spark experts from UC Berkeley and Databricks. For this series, learning is 10% lecture and 90% working on the labs. It is very hands-on and practical. That is how you learn a new programming tool: by doing. You won't learn one by spending most of your time watching lecture videos and reading books without writing real code.
In total, the three courses have 10 labs. Most of these labs are big and not easy, and they can be turned into small projects if one really dives into them. Throughout the series, all labs are completed using notebooks on Databricks Community Edition, which is free. I must say that the Databricks notebook is an awesome tool; data scientists will love it.
Lab Notebooks
All labs are completed using DataFrames.
CS105x: Introduction to Apache Spark
- Lab 0: Running Your First Notebook on Databricks
- Lab 1a: Spark Tutorial
- Lab 1b: Word Count
- Lab 2: Web Server Log Analysis
CS120x: Distributed Machine Learning with Apache Spark
- Lab 1a: Math and Python Review
- Lab 1b: Word Count Using RDD
- Lab 2: Linear Regression-Predicting Release Year of a Song
- Lab 3: Click-Through Rate Prediction
- Lab 4: Principal Component Analysis
CS110x: Big Data Analysis with Apache Spark
- Lab 1: Spark ML Machine Learning Pipeline Application - Power Plant
- Lab 2: Alternating Least Squares - Predicting Movie Ratings
- Lab 3: Text Analysis and Entity Resolution
RDD Notebooks
The following are the same labs finished using Resilient Distributed Datasets (RDDs). The purpose is to familiarize myself with RDDs. Although DataFrames are recommended for most situations because they are more efficient, easier to use, and more expressive, knowing RDDs is still helpful since DataFrames are built on top of RDDs.
- Lab 1: Spark Tutorial RDD
- Lab 2: Word Count RDD
- Lab 3: Web Server Log Analysis RDD
- Lab 4: Math Review RDD
- Lab 5: Linear Regression-Predicting Release Year of a Song RDD
- Lab 6: Click-Through Rate Prediction RDD
- Lab 7: Principal Component Analysis RDD
- Lab 8: Alternating Least Squares - Predicting Movie Ratings RDD
- Lab 9: Text Analysis and Entity Resolution RDD