Spark Starter Guide 4.10: How to Filter on Aggregate Columns

Previous post: Spark Starter Guide 4.9: How to Rank Data

Having is similar to filtering (filter() or where() in the DataFrame API, or a WHERE clause in SQL), but the use cases differ slightly. While filtering allows you to apply conditions to your rows to limit the result set, Having allows you to apply conditions to aggregate functions computed over your data to limit the result set.

Both limit your result set, but the key difference is where they are applied. In short: where filters are for row-level filtering, while Having filters are for aggregate-level filtering. As a result, using a Having statement can also simplify, or outright eliminate, the need for some sub-queries.

Let’s look at an example.
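
To make the difference concrete before the full example, here is a minimal sketch in Scala. The sales data and column names are made up for illustration, and the snippet assumes spark-shell, where spark and spark.implicits._ are already in scope:

    import org.apache.spark.sql.functions._

    // Hypothetical sales data: one row per sale
    val sales = Seq(
      ("north", 100.0), ("north", 250.0),
      ("south", 75.0),  ("south", 50.0)
    ).toDF("store", "revenue")

    // where/filter: row-level filtering, applied before any aggregation
    val bigSales = sales.filter(col("revenue") > 90)

    // having: aggregate-level filtering. In the DataFrame API this is simply
    // a filter applied to the aggregated column after groupBy/agg.
    val bigStores = sales
      .groupBy("store")
      .agg(sum("revenue").alias("total_revenue"))
      .filter(col("total_revenue") > 200)

    bigStores.show()
    // Equivalent SQL:
    //   SELECT store, SUM(revenue) AS total_revenue
    //   FROM sales GROUP BY store HAVING SUM(revenue) > 200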


Spark Starter Guide 4.9: How to Rank Data

Previous post: Spark Starter Guide 4.8: How to Order and Sort Data

Ranking is, fundamentally, ordering based on a condition. So, in essence, it’s like a combination of a where clause and an order by clause, except that data is not removed through ranking; it is, well, ranked instead. While ordering allows you to sort data based on a column, ranking allows you to assign a number (e.g. a row number or rank) to each row, based on a column or condition, so that you can use it in logical decision making, like selecting a top result or applying further transformations.
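
As a rough sketch of that idea (the scores data below is made up, and the snippet assumes spark-shell, with spark and spark.implicits._ in scope), a rank can be assigned with a window function and then used to pick a top result:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Hypothetical scores data
    val scores = Seq(
      ("alice", 42), ("bob", 17), ("carol", 42), ("dave", 8)
    ).toDF("player", "score")

    // Rank every row by score, highest first; no rows are removed
    val byScore = Window.orderBy(col("score").desc)
    val ranked = scores
      .withColumn("row_number", row_number().over(byScore))
      .withColumn("rank", rank().over(byScore))

    // The rank can then drive logic, such as keeping only the top result
    val topPlayers = ranked.filter(col("rank") === 1)
    topPlayers.show()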


Spark Starter Guide 4.8: How to Order and Sort Data

Previous post: Spark Starter Guide 4.7: How to Standardize Data

Ordering is useful when you want to convey… well, order. To be more specific, ordering (also known as sorting) is most often used in the final analysis or output of a data pipeline, as a way to display data in an organized fashion based on some criteria. The result is data that is sorted and, ideally, easier to understand.
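
For a quick illustration (made-up data; spark-shell assumed, with spark and spark.implicits._ in scope):

    import org.apache.spark.sql.functions._

    // Hypothetical sales results
    val results = Seq(
      ("widgets", 2021, 310), ("gadgets", 2021, 455), ("widgets", 2020, 120)
    ).toDF("product", "year", "units_sold")

    // Sort by year ascending, then by units_sold descending
    val ordered = results.orderBy(col("year").asc, col("units_sold").desc)

    // sort() is an alias for orderBy()
    val sameOrdering = results.sort(col("year").asc, col("units_sold").desc)

    ordered.show()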


Spark Starter Guide 2.7: Chapter 2 Activity

This activity will combine the skills and techniques you have learned so far in this chapter. It will also introduce brand-new concepts not covered previously.

As an intern at XYZ BigData Analytics Firm, you are progressing in your Spark skills, and your first project was a big success. Now you are tasked with getting a dataset and an ML Pipeline ready for machine learning algorithms. Your assignment will have four parts (a rough sketch of these steps follows the list):

  1. Cleaning up the dataset 
  2. Splitting the data into training and testing sets
  3. Making an ML Pipeline that one-hot encodes all the DataFrame features
  4. Saving the final DataFrame to HDFS with partitions
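
The sketch below shows roughly the shape these four steps can take. It is not the activity’s solution: the input data, column names, and output path are hypothetical stand-ins, and it assumes Spark 3.x in spark-shell (with spark and spark.implicits._ in scope):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Hypothetical stand-in data
    val data = Seq(("red", "small"), ("blue", "large"), ("red", "large"))
      .toDF("color", "size")

    // 1. Clean the dataset (for example, drop duplicate rows and rows with nulls)
    val cleaned = data.dropDuplicates().na.drop()

    // 2. Split into training and testing sets
    val Array(train, test) = cleaned.randomSplit(Array(0.8, 0.2), seed = 42)

    // 3. ML Pipeline: index the string columns, then one-hot encode the indexes
    val colorIndexer = new StringIndexer().setInputCol("color").setOutputCol("color_idx")
    val sizeIndexer  = new StringIndexer().setInputCol("size").setOutputCol("size_idx")
    val encoder = new OneHotEncoder()
      .setInputCols(Array("color_idx", "size_idx"))
      .setOutputCols(Array("color_vec", "size_vec"))
    val pipeline = new Pipeline().setStages(Array(colorIndexer, sizeIndexer, encoder))
    val encoded = pipeline.fit(train).transform(train)

    // 4. Save the final DataFrame to HDFS, partitioned by a column
    encoded.write.mode("overwrite").partitionBy("color").parquet("hdfs:///tmp/activity_output")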

Spark Starter Guide 2.6: Datasets

Introduction

So far, this Spark book has focused on Spark DataFrames: how to create, use, and transform them. You will use Spark DataFrames for the remainder of this book. But we wanted to introduce Datasets because, in reality, a Spark DataFrame is a type of Spark Dataset. A Spark Dataset is a “strongly typed collection of domain-specific objects”. What this means is that a Dataset has a defined type (or schema) that is prescribed before the Dataset is created. Technically, a DataFrame is an untyped Dataset[Row], meaning its rows are generic Row objects whose column types are not checked at compile time.
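
A minimal sketch of the distinction, using a hypothetical Person case class (spark-shell assumed, with spark and spark.implicits._ in scope):

    // A hypothetical domain object
    case class Person(name: String, age: Int)

    // A Dataset[Person] is typed: the compiler knows every row is a Person
    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()
    val adults = people.filter(p => p.age >= 18)   // checked at compile time

    // A DataFrame is just Dataset[Row]; column references are only checked at runtime
    val peopleDF = people.toDF()
    val adultsDF = peopleDF.filter("age >= 18")    // a typo in "age" would fail only when run

    // A DataFrame can be converted back into a typed Dataset
    val backToPeople = peopleDF.as[Person]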


Spark Starter Guide 2.5: Hypothesis Testing

Introduction

In this section, we are going to cover the Pearson’s chi-squared (χ²) statistic in Spark. We will also introduce Spark’s ML Pipelines and a new transformer: the StringIndexer.

The purpose of data, in general, is to help us make effective decisions. But how do we know with any certainty that our analysis will lead to better decisions? We rarely, if ever, have all the desired data about a business problem, academic problem, or any problem for that matter. Therefore, it is impossible to know with absolute certainty whether our analysis is correct. This is where the science of statistics tries to provide insight into things that are unknowable. Statistics deals with samples of data that represent an entire population and tries to make reasonable assertions about that population from the sample data.
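
As a small preview of what that looks like in practice, here is a sketch with made-up data (spark-shell assumed, with spark and spark.implicits._ in scope): the StringIndexer turns a string label into a numeric index, and ChiSquareTest runs the Pearson’s chi-squared test of independence between each feature and that label:

    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.stat.ChiSquareTest

    // Hypothetical data: a string label and a vector of categorical features
    val raw = Seq(
      ("yes", Vectors.dense(0.0, 1.0)),
      ("no",  Vectors.dense(1.0, 0.0)),
      ("yes", Vectors.dense(0.0, 1.0)),
      ("no",  Vectors.dense(1.0, 1.0))
    ).toDF("label", "features")

    // StringIndexer: convert the string label into a numeric index column
    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("label_idx")
      .fit(raw)
      .transform(raw)

    // Pearson's chi-squared test of each feature against the label
    val result = ChiSquareTest.test(indexed, "features", "label_idx")
    result.show(truncate = false)  // columns: pValues, degreesOfFreedom, statistics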


Spark Starter Guide 2.4: DataFrame Statistics

Introduction

At the end of the previous section, 2.3: DataFrame Cleaning, we stated that the purpose of any dataset is to help us make decisions. Furthering that theme is the realm of statistics. At its most basic, statistics is a science that uses mathematical analysis to draw conclusions about data. Examples include the sample mean, sample variance, sample quantiles, and test statistics. In this section we will cover the following built-in Spark statistical functions for DataFrames: Summarizer, Correlation, and Hypothesis Testing. However, this section does not intend to teach statistics or even be an introduction to statistics. Instead, it will focus on using these built-in Spark statistical operations and introduce the concept of ML Pipelines, which are used to build machine learning pipelines.
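
To give a flavour of two of those functions, here is a small sketch with made-up feature vectors (spark-shell assumed, with spark and spark.implicits._ in scope):

    import org.apache.spark.ml.linalg.{Matrix, Vectors}
    import org.apache.spark.ml.stat.{Correlation, Summarizer}
    import org.apache.spark.sql.Row

    // Hypothetical feature vectors, one per row
    val df = Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 28.0)
    ).map(Tuple1.apply).toDF("features")

    // Summarizer: column-wise sample statistics such as mean and variance
    df.select(
      Summarizer.mean(df("features")).alias("mean"),
      Summarizer.variance(df("features")).alias("variance")
    ).show(truncate = false)

    // Correlation: Pearson correlation matrix across the feature dimensions
    val Row(pearson: Matrix) = Correlation.corr(df, "features").head
    println(pearson)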


Spark Starter Guide 2.3: DataFrame Cleaning

Introduction

Real-world datasets are hardly ever clean and pristine. They commonly include blanks, nulls, duplicates, errors, malformed text, mismatched data types, and a host of other problems that degrade data quality. No matter how much data one might have, a small amount of high-quality data is more beneficial than a large amount of garbage data. All decisions derived from data will be better with higher quality data.

In this section, we will introduce some of the methods and techniques that Spark offers for dealing with “dirty data”, meaning data that needs to be improved so that the decisions made from it will be more accurate. Dirty data, and how to deal with it, is a very broad topic with a lot to consider. This chapter intends to introduce the problem, show Spark techniques for addressing it, and educate the user on the effects of “fixing” dirty data.
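
As a taste of the kind of operations covered (the messy data below is made up; spark-shell assumed, with spark and spark.implicits._ in scope):

    import org.apache.spark.sql.functions._

    // Hypothetical dirty data: a duplicate row, stray whitespace and casing, a null age
    val messy = Seq(
      ("alice", Some(34)),
      ("alice", Some(34)),
      ("BOB ",  None),
      ("carol", Some(29))
    ).toDF("name", "age")

    val cleaned = messy
      .dropDuplicates()                                // remove exact duplicate rows
      .withColumn("name", lower(trim(col("name"))))    // normalize malformed text
      .na.fill(Map("age" -> 0))                        // or .na.drop() to drop rows with nulls

    cleaned.show()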


Spark Starter Guide 2.2: DataFrame Writing, Repartitioning, and Partitioning

Introduction

In the previous section, 2.1: DataFrame Data Analysis, we used US census data and processed the columns to create a DataFrame called census_df. After processing and organizing the data, we would like to save it as files for later use. In Spark, the most commonly used location for saving data is HDFS. As we saw in 1.3: Creating DataFrames from Files, we can read files from HDFS to create a DataFrame. Likewise, we can write a DataFrame to HDFS as files in different file formats. This section will cover writing DataFrames to HDFS as Parquet, ORC, JSON, CSV, and Avro files.
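
For a sense of what those writes look like, here is a brief sketch. The HDFS paths are hypothetical, and the Avro line assumes the external spark-avro package is on the classpath:

    // census_df is the DataFrame built in section 2.1; the paths below are made up
    census_df.write.mode("overwrite").parquet("hdfs:///data/census/parquet")
    census_df.write.mode("overwrite").orc("hdfs:///data/census/orc")
    census_df.write.mode("overwrite").json("hdfs:///data/census/json")
    census_df.write.mode("overwrite").option("header", "true").csv("hdfs:///data/census/csv")

    // Avro ships as a separate module (spark-avro), so it uses the generic format() API
    census_df.write.mode("overwrite").format("avro").save("hdfs:///data/census/avro")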


Spark Starter Guide 2.1: DataFrame Data Analysis

Introduction

In the last chapter, the reader was introduced to Spark DataFrames and shown Spark commands for creating, processing, and working with them. In this chapter, we will accelerate into more complex and challenging topics like data cleaning, statistics, and ML Pipelines. In Chapter 1, DataFrames with Spark, we talked about the “what” of Spark DataFrames: the commands to do different things. Here, we will introduce the “why” and the “how” of Spark DataFrames: starting to ask why we are doing each step, and thinking longer term about how the steps we take at the beginning of a data problem affect us at the end.
