This activity combines the skills and techniques you have learned so far in this chapter, and it also introduces brand-new concepts not covered previously.
As an intern at XYZ BigData Analytics Firm, you are progressing in your Spark skills, and your first project was a big success. Now you are tasked with getting a dataset and an ML Pipeline ready for machine learning algorithms. Your assignment has four parts (a brief sketch follows the list):
Cleaning up the dataset
Splitting the data into training and testing sets
Making an ML Pipeline that one-hot encodes all the DataFrame features
Saving the final DataFrame to HDFS with partitions
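As a preview, here is a minimal sketch of the four parts, assuming PySpark on Spark 3.x and a hypothetical input DataFrame df whose string feature columns are color and size; the column names and HDFS path are illustrative only, not the activity's actual solution.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Part 1: clean the dataset (remove duplicate rows and rows containing nulls)
clean_df = df.dropDuplicates().na.drop()

# Part 2: split the data into training and testing sets (80/20, seeded for repeatability)
train_df, test_df = clean_df.randomSplit([0.8, 0.2], seed=42)

# Part 3: an ML Pipeline that indexes each string column, then one-hot encodes the indices
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in ["color", "size"]]
encoder = OneHotEncoder(inputCols=["color_idx", "size_idx"],
                        outputCols=["color_vec", "size_vec"])
pipeline = Pipeline(stages=indexers + [encoder])
final_df = pipeline.fit(train_df).transform(train_df)

# Part 4: save the final DataFrame to HDFS, partitioned by a column
final_df.write.partitionBy("color").parquet("hdfs://namenode:8020/activity/encoded")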
So far, this Spark book has focused on Spark DataFrames: how to create, use, and transform them, and you will continue to use DataFrames for the remainder of this book. But we want to introduce Datasets because, in reality, a Spark DataFrame is a type of Spark Dataset. A Spark Dataset is a “strongly typed collection of domain-specific objects,” meaning a Dataset has a defined type (or schema) that is prescribed before the Dataset is created. Technically, a DataFrame is an untyped Dataset[Row], which means its schema is not checked at compile time.
In this section, we are going to cover Spark's implementation of the Pearson's chi-squared (χ²) test statistic. We will also introduce Spark's ML Pipelines and a new transformer: the StringIndexer.
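As a preview, here is a minimal sketch of the StringIndexer transformer feeding Spark's chi-squared test; the toy data and column names are illustrative, and an active SparkSession is obtained with getOrCreate.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("yes", Vectors.dense(1.0)), ("no", Vectors.dense(0.0)), ("yes", Vectors.dense(1.0))],
    ["answer", "features"])

# StringIndexer maps a string column to numeric category indices
indexed = StringIndexer(inputCol="answer", outputCol="label").fit(df).transform(df)

# ChiSquareTest runs Pearson's chi-squared test of independence between
# each feature and the label
ChiSquareTest.test(indexed, "features", "label") \
    .select("pValues", "degreesOfFreedom", "statistics").show(truncate=False)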
The purpose of data, in general, is to help us make effective decisions. But how do we know with any certainty that our analysis will lead to better decisions? We rarely, if ever, have all the desired data about a business problem, an academic problem, or any problem for that matter, so it is impossible to know with absolute certainty that our analysis is correct. This is where the science of statistics provides insight into things that cannot be known directly: statistics deals with samples of data that represent an entire population and tries to make reasonable assertions from those samples.
At the end of section 2.3, DataFrame Cleaning, we stated that the objective of any dataset is to help us make decisions. Furthering that theme is the realm of statistics. At its core, statistics is a science that uses mathematical analysis to draw conclusions about data; examples include the sample mean, sample variance, sample quantiles, and test statistics, to name a few. In this section we will cover the following built-in Spark statistical functions using DataFrames: Summarizer, Correlation, and Hypothesis Testing. This section does not intend to teach statistics or even serve as an introduction to statistics; instead, it focuses on using these built-in Spark statistical operations and introduces the concept of ML Pipelines, which are used to create machine learning pipelines.
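To make these functions concrete, here is a brief sketch of Summarizer and Correlation on a toy vector column (hypothesis testing gets its own treatment in the chi-squared section); the data and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation, Summarizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0),), (Vectors.dense(3.0, 4.0),), (Vectors.dense(5.0, 6.0),)],
    ["features"])

# Summarizer: column-wise mean and variance of the vector column
df.select(Summarizer.metrics("mean", "variance").summary(df.features)).show(truncate=False)

# Correlation: Pearson correlation matrix of the vector column
Correlation.corr(df, "features", "pearson").show(truncate=False)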
Real-world datasets are hardly ever clean and pristine. They commonly include blanks, nulls, duplicates, errors, malformed text, mismatched data types, and a host of other problems that degrade data quality. No matter how much data one might have, a small amount of high-quality data is more beneficial than a large amount of garbage data, and all decisions derived from data will be better with higher-quality data.
In this section we will introduce some of the methods and techniques that Spark offers for dealing with “dirty data”, meaning data that needs to be improved so that the decisions made from it will be more accurate. Dirty data, and how to deal with it, is a very broad topic with many considerations. This chapter intends to introduce the problem, demonstrate Spark's techniques, and educate the reader on the effects of “fixing” dirty data.
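As a first taste, here is a minimal sketch of three common Spark fixes for dirty data, using illustrative toy data and column names.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Alice", 34), ("Bob", None), (None, 45)],
    ["name", "age"])

df.dropDuplicates().show()                         # remove exact duplicate rows
df.na.drop().show()                                # drop rows containing any null
df.na.fill({"name": "unknown", "age": 0}).show()   # replace nulls with defaults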
In the previous section, 2.1 DataFrame Data Analysis, we used US census data and processed the columns to create a DataFrame called census_df. After processing and organizing the data, we would like to save it as files for later use. In Spark, the best and most commonly used location for saving data is HDFS. As we saw in 1.3 Creating DataFrames from Files, we can read files from HDFS to create a DataFrame; likewise, we can write a DataFrame to HDFS as files in different formats. This section covers writing DataFrames to HDFS as Parquet, ORC, JSON, CSV, and Avro files.
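The writer API looks like the following minimal sketch; the HDFS URL is illustrative, and census_df is the DataFrame from the earlier section. Note that Avro support ships as a separate package (spark-avro), so it is written via format() rather than a dedicated method.

base = "hdfs://namenode:8020/user/data/census"

census_df.write.mode("overwrite").parquet(base + "/parquet")
census_df.write.mode("overwrite").orc(base + "/orc")
census_df.write.mode("overwrite").json(base + "/json")
census_df.write.mode("overwrite").option("header", True).csv(base + "/csv")
census_df.write.mode("overwrite").format("avro").save(base + "/avro")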
In the last chapter, the reader was introduced to Spark DataFrames and shown Spark commands for creating, processing, and working with them. In this chapter, we accelerate into more complex and challenging topics like data cleaning, statistics, and ML Pipelines. Chapter 1, DataFrames with Spark, covered the “what” of Spark DataFrames: the commands for doing different things. Here, we introduce the “why” and the “how”: asking why we are doing each step, and thinking longer term about how the steps we take at the beginning of a data problem affect us at the end.
This activity will combine the skills and techniques you have learned so far in this chapter.
Imagine you are working as an intern at XYZ BigData Analytics Firm, where you are tasked with performing data analysis on a data set to solidify your Spark DataFrame skills.
In this activity you will use the dataset “Kew Museum of Economic Botany: Object Dispersals to Schools, 1877-1982”, which “is a record of object dispersals from the Kew Museum of Economic Botany to schools between 1877 and 1982” and “is based on entries in the ‘Specimens Distributed’ books (or exit books) in the Museum archive.” The source of the dataset can be found at https://figshare.com/articles/Schools_dispersals_data_with_BHL_links/9573956. This activity is good practice for a real-world scenario because it simulates being given an unfamiliar dataset that you must process and analyze without any prior connection to it.
Aggregation combines data into one or more groups and performs statistical operations such as average, maximum, and minimum on each group. In Spark, it is easy to group by a categorical column and perform the needed operation, and DataFrames allow multiple groupings and multiple aggregations at once.
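For example, the following minimal sketch groups by one categorical column and computes three aggregates at once; the data and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("east", 10.0), ("east", 20.0), ("west", 5.0)],
    ["region", "sales"])

df.groupBy("region").agg(
    F.avg("sales").alias("avg_sales"),
    F.max("sales").alias("max_sales"),
    F.min("sales").alias("min_sales"),
).show()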
Since DataFrames are composed of named columns, Spark offers many options for performing operations on individual or multiple columns. This section will introduce converting columns to a different data type, adding calculated columns, renaming columns, and dropping columns from a DataFrame.
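The following minimal sketch shows all four operations on a toy DataFrame; the data and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("1", 2.5), ("3", 4.0)], ["qty", "price"])

df = (df
      .withColumn("qty", F.col("qty").cast("int"))         # convert the data type
      .withColumn("total", F.col("qty") * F.col("price"))  # add a calculated column
      .withColumnRenamed("price", "unit_price")            # rename a column
      .drop("qty"))                                        # drop a column
df.show()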