Hadoopsters is Moving to Medium.com

For nearly seven years, I’ve been writing Big Data and Data Engineering tutorials for Hadoop, Spark, and beyond – and I’ve been doing it on a website I created with my friends and co-workers: this website, Hadoopsters.com.

A peek at our analytics at the year level.

In 2021, our biggest year yet, we saw over 47,000 unique visitors, and served nearly 62,000 views. Since the site launched 2,430 days ago, we’ve helped over 189,000 visitors, and served just over a quarter-million views (and we never ran ads). 

We created Hadoopsters out of the love of sharing knowledge, and a passion for filling documentation gaps in what we knew would be an explosive new career path for many: Data Engineering! In the early days of Hadoop, my peers and I recognized the significant gap in documentation for solving some of the most frustrating technical problems – and we wanted to help solve it.

Hadoopsters as a website has been a WordPress blog this entire time… until now! We have officially moved to Medium.com as a Medium Publication. 

Introducing, the new Hadoopsters.com!

So, why did we do this?

Firstly, economics. Hosting a website all on our own (via WordPress) is just not worth the expense – especially since we choose not to run advertising. Moving to Medium allows us to have effectively zero hosting expense, and we only pay a small amount annually to keep our domain mapped to Medium.

Secondly, distribution. Medium presents an excellent opportunity to circulate our work: it has built-in distribution through its article recommendation engine, and it has quickly become the go-to place for tech writing. This allows our work to reach a much broader audience, organically, without the need for intentional marketing (again, more money) to drive people to our specific website.

I am extremely excited about reaching more Data Engineers and Data Scientists through knowledge sharing, and making the data world an even more well-documented place.

If you wish to follow my writing, all of it will continue to be through Hadoopsters – just now on Medium. 🙂 

Thank you to everyone who has read and leveraged our work through this website. We hope to make an impact on your work and career for years to come. Be well!


Hadoopsters.com now fully routes to our Medium publication. Our legacy website (this one), which has reverted to hadoopsters.wordpress.com, will live on as long as WordPress continues to serve it. We’ve migrated (cloned) select historical articles from this site to Medium, but every article ever written here will remain here. All new content will be produced exclusively on Medium.


Spark Starter Guide 4.10: How to Filter on Aggregate Columns

Previous post: Spark Starter Guide 4.9: How to Rank Data

Having is similar to filtering (filter() or where() in Spark, or a WHERE clause in SQL), but the use cases differ slightly. While filtering allows you to apply conditions to individual rows to limit the result set, Having allows you to apply conditions to aggregate functions computed over your data to limit the result set.

Both limit your result set – but the difference in how they are applied is the key. In short: where filters operate at the row level, while Having filters operate at the aggregate level. As a result, using a Having statement can also simplify (or outright eliminate) the need for some sub-queries.
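The distinction can be sketched outside Spark entirely; here is a plain-Python illustration of the same semantics (the store/units data and the thresholds are invented for this sketch):

```python
from collections import defaultdict

# Toy rows: (store, units_sold)
rows = [("A", 40), ("A", 70), ("B", 20), ("B", 30), ("C", 90)]

# WHERE-style filter: the condition is applied per row, before any aggregation
big_rows = [r for r in rows if r[1] >= 30]

# GROUP BY store, SUM(units) ... HAVING SUM(units) > 100:
# the condition is applied to the aggregate, not to individual rows
totals = defaultdict(int)
for store, units in rows:
    totals[store] += units
having = {store: total for store, total in totals.items() if total > 100}
```

Note that store "C" survives the row-level filter (90 ≥ 30) but not the aggregate-level one (90 is not > 100) – the two filters answer different questions.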

Let’s look at an example.

Read More »

Spark Starter Guide 4.9: How to Rank Data

Previous post: Spark Starter Guide 4.8: How to Order and Sort Data

Ranking is, fundamentally, ordering based on a condition. In essence, it’s like a combination of a where clause and an order by clause – the exception being that data is not removed through ranking; it is, well, ranked instead. While ordering allows you to sort data based on a column, ranking allows you to assign a number (e.g. row number or rank) to each row (based on a column or condition) so that you can use it in logical decision making, like selecting a top result or applying further transformations.
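The idea can be sketched without Spark; this plain-Python snippet (with made-up scores) mimics what a dense_rank()-style window function produces – every row is kept, and each gets a rank it can be filtered on:

```python
# Toy scores; ranking assigns a position without dropping any rows
scores = [("amy", 88), ("bob", 95), ("cat", 88), ("dan", 72)]

# Dense-rank-like numbering by descending score (ties share a rank),
# analogous to Spark's dense_rank() over an ordered window
ordered = sorted({s for _, s in scores}, reverse=True)
rank_of = {s: i + 1 for i, s in enumerate(ordered)}
ranked = [(name, score, rank_of[score]) for name, score in scores]

# The rank can then drive logic, e.g. keeping only the top result(s)
top = [r for r in ranked if r[2] == 1]
```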

Read More »

Spark Starter Guide 4.8: How to Order and Sort Data

Previous post: Spark Starter Guide 4.7: How to Standardize Data

Ordering is useful when you want to convey… well, order. To be more specific, ordering (also known as sorting) is most often used in the final analysis or output of your data pipeline, as a way to display data in an organized fashion based on criteria. The result is data that is sorted and, ideally, easier to understand.
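As a quick Spark-free illustration (toy rows invented here), sorting by one column or by several works the same way conceptually as Spark’s orderBy – the sort key determines the presentation order:

```python
# Toy output rows: (city, revenue); ordered for final presentation
rows = [("nyc", 250), ("lax", 300), ("chi", 150)]

# Sort descending by revenue, conceptually like orderBy(desc("revenue"))
by_revenue = sorted(rows, key=lambda r: r[1], reverse=True)

# Secondary sorts just extend the key, like orderBy on multiple columns
by_name_then_revenue = sorted(rows, key=lambda r: (r[0], r[1]))
```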

Read More »

Spark Starter Guide 2.7: Chapter 2 Activity

This activity combines the skills and techniques you have learned so far in this chapter, and also introduces brand-new concepts not covered previously.

As an intern at XYZ BigData Analytics Firm, you are progressing in your Spark skills, and your first project was a big success. Now you are tasked with getting a dataset and an ML Pipeline ready for machine learning algorithms. Your assignment will have four parts:

  1. Cleaning up the dataset 
  2. Splitting the data into training and testing sets
  3. Building an ML Pipeline that one-hot encodes all the DataFrame features
  4. Saving the final DataFrame to HDFS with partitions
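The shape of the first three steps can be sketched in plain Python (the color/size columns and the rows here are invented; in the activity itself you would use Spark’s dropna/dropDuplicates, randomSplit, and a OneHotEncoder stage in a Pipeline):

```python
import random

# Toy dataset standing in for the activity's input (columns invented here)
data = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "S"},    # duplicate -> dropped in cleanup
    {"color": "blue", "size": None},  # null -> dropped in cleanup
    {"color": "blue", "size": "M"},
    {"color": "green", "size": "L"},
]

# 1. Clean: drop rows with nulls, then de-duplicate
clean = []
for row in data:
    if None not in row.values() and row not in clean:
        clean.append(row)

# 2. Split into training/testing sets (like randomSplit([0.8, 0.2]))
random.seed(0)
train = [r for r in clean if random.random() < 0.8]
test = [r for r in clean if r not in train]

# 3. One-hot encode every feature column into 0/1 indicator vectors
def one_hot(rows, col):
    categories = sorted({r[col] for r in rows})
    return [[1 if r[col] == c else 0 for c in categories] for r in rows]

encoded = [oh_color + oh_size for oh_color, oh_size in
           zip(one_hot(clean, "color"), one_hot(clean, "size"))]

# 4. In Spark, the result would then be written to HDFS, partitioned by a column.
```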
Read More »

Spark Starter Guide 2.6: Datasets


So far, this Spark book has focused on Spark DataFrames: how to create, use, and transform them. You will use Spark DataFrames for the remainder of this book. But we wanted to introduce Datasets because, in reality, Spark DataFrames are a type of Spark Dataset. A Spark Dataset is a “strongly typed collection of domain-specific objects”. This means a Dataset has a defined type (or schema) that is prescribed before the Dataset is created. Technically, a DataFrame is an untyped Dataset[Row] (which means it doesn’t have a schema at compile time).

Read More »

Spark Starter Guide 2.5: Hypothesis Testing


In this section, we are going to cover Spark’s Pearson’s Chi-squared (χ²) statistic. We will also introduce Spark’s ML Pipelines and a new transformer: the StringIndexer.

The purpose of data, in general, is to help us make effective decisions. But how do we know with any certainty that our analysis will lead to better decisions? We rarely, if ever, have all the desired data about a business problem, academic problem, or any problem for that matter. Therefore, it is impossible to know with absolute certainty whether our analysis is correct. This is where the science of statistics provides insight into things that are otherwise unknowable: statistics deals with samples that represent an entire population of data, and tries to make reasonable assertions from those samples.
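To make the statistic concrete, here is Pearson’s chi-squared computed by hand in plain Python on a made-up 2×2 contingency table (Spark’s ChiSquareTest produces this same statistic from a DataFrame):

```python
# Pearson's chi-squared statistic for a 2x2 contingency table (toy counts)
observed = [[20, 30],   # e.g. group A: outcome yes / no
            [30, 20]]   # e.g. group B: outcome yes / no

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected assumes the row and column variables are independent
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected
```

A large chi-squared value (relative to the distribution’s critical value) suggests the two variables are not independent.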

Read More »

Spark Starter Guide 2.4: DataFrame Statistics


At the end of section 2.3, DataFrame Cleaning, we stated that the purpose of any dataset is to help us make decisions. Furthering that theme is the realm of statistics. At its base form, statistics is a science that uses mathematical analysis to draw conclusions about data. Examples include the sample mean, sample variance, sample quantiles, and test statistics. In this section we will cover the following built-in Spark statistical functions using DataFrames: Summarizer, Correlation, and Hypothesis Testing. However, this section does not intend to teach statistics, or even be an introduction to statistics. Instead, it focuses on using these built-in Spark statistical operations, and introduces the concept of ML Pipelines used in creating machine learning pipelines.
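For orientation, the quantities these Spark functions report can be computed by hand; this plain-Python sketch (on two invented columns) shows the sample mean, sample variance (n − 1 denominator), and Pearson correlation that Summarizer and Correlation would return:

```python
import math

# Two toy numeric columns; y is a perfect linear function of x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample variance uses the n - 1 denominator
var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)

# Pearson correlation: covariance scaled by both standard deviations
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
corr = cov / math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                       sum((b - mean_y) ** 2 for b in y))
```

Here corr comes out to 1.0, as expected for perfectly linearly related columns.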

Read More »

Spark Starter Guide 2.3: DataFrame Cleaning


Real-world datasets are hardly ever clean and pristine. They commonly include blanks, nulls, duplicates, errors, malformed text, mismatched data types, and a host of other problems that degrade data quality. No matter how much data one might have, a small amount of high-quality data is more beneficial than a large amount of garbage data. All decisions derived from data will be better with higher-quality data.

In this section we will introduce some of the methods and techniques that Spark offers for dealing with “dirty data”. The term dirty data means data that needs to be improved so the decisions made from the data will be more accurate. The topic of dirty data and how to deal with it is a very broad topic with a lot of things to consider. This chapter intends to introduce the problem, show Spark techniques, and educate the user on the effects of “fixing” dirty data. 
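The flavor of these cleanup steps can be shown without Spark; this plain-Python sketch (on invented records) mirrors the kind of trim/lowercase normalization, null dropping, and de-duplication that Spark expresses with trim(), lower(), dropna(), and dropDuplicates():

```python
# Toy "dirty" records: whitespace, a null, a blank, duplicates, mixed casing
raw = ["  Alice ", "BOB", None, "", "alice", "bob"]

# Normalize text (trim + lowercase), drop nulls/blanks, then de-duplicate
seen, clean = set(), []
for value in raw:
    if value is None:
        continue
    value = value.strip().lower()
    if value and value not in seen:
        seen.add(value)
        clean.append(value)
```

Six dirty records reduce to two clean ones – and note that "  Alice " and "alice" only collapse into one record because normalization happened *before* de-duplication; the order of fixes matters.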

Read More »

Spark Starter Guide 2.2: DataFrame Writing, Repartitioning, and Partitioning


In the previous section, 2.1 DataFrame Data Analysis, we used US census data and processed the columns to create a DataFrame called census_df. After processing and organizing the data, we would like to save it as files for later use. In Spark, the most common and best-suited place to save data is HDFS. As we saw in 1.3: Creating DataFrames from Files, we can read files from HDFS to create a DataFrame. Likewise, we can write a DataFrame to HDFS as files in different file formats. This section will cover writing DataFrames to HDFS as Parquet, ORC, JSON, CSV, and Avro files.
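The directory layout a partitioned write produces can be sketched in plain Python (toy census-like rows invented here; a real partitioned write would be df.write.partitionBy(...) against HDFS): one subdirectory per distinct value of the partition column, with the column’s value encoded in the path rather than in the files:

```python
import csv
import os
import tempfile
from collections import defaultdict

# Toy census-like rows; 'state' plays the role of the partition column
rows = [{"state": "TX", "pop": 29}, {"state": "CA", "pop": 39},
        {"state": "TX", "pop": 30}]

# Group rows by the partition key, then write one file per key directory,
# mimicking the state=XX/part-... layout of a Spark partitioned write
out = tempfile.mkdtemp()
groups = defaultdict(list)
for r in rows:
    groups[r["state"]].append(r)

for state, group in groups.items():
    part_dir = os.path.join(out, f"state={state}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-00000.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        for r in group:
            writer.writerow([r["pop"]])  # partition column lives in the path
```

Because the partition value is in the directory name, a reader filtering on state can skip entire directories – the same pruning benefit Spark gets from partitioned HDFS data.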

Read More »