This activity combines the skills and techniques you have learned so far in this chapter and introduces new concepts that have not been covered previously.
As an intern at XYZ BigData Analytics Firm, you are progressing in your Spark skills, and your first project was a big success. Now you are tasked with getting a dataset and an ML Pipeline ready for machine learning algorithms. Your assignment has four parts:
- Cleaning up the dataset
- Splitting the data into training and testing sets
- Making an ML Pipeline that one-hot encodes all the DataFrame features
- Saving the final DataFrame to HDFS with partitions