Hadoopsters is Moving to Medium.com

For nearly seven years, I’ve been writing Big Data and Data Engineering tutorials for Hadoop, Spark, and beyond, all on a website I created with my friends and co-workers: this website, Hadoopsters.com.

A peek at our analytics, viewed at the yearly level.

In 2021, our biggest year yet, we saw over 47,000 unique visitors, and served nearly 62,000 views. Since the site launched 2,430 days ago, we’ve helped over 189,000 visitors, and served just over a quarter-million views (and we never ran ads). 

We created Hadoopsters out of a love of sharing knowledge, and a passion for filling documentation gaps in what we knew would be an explosive new career path for many: Data Engineering! In the early days of Hadoop, my peers and I recognized a significant gap in the documentation for solving some of the most frustrating technical problems, and we wanted to help fill it.

Hadoopsters has been a WordPress blog this entire time… until now! We have officially moved to Medium.com as a Medium Publication.

Introducing the new Hadoopsters.com!

So, why did we do this?

Firstly, economics. Hosting a website all on our own (via WordPress) is just not worth the expense – especially since we choose not to run advertising. Moving to Medium allows us to have effectively zero hosting expense, and we only pay a small amount annually to keep our domain mapped to Medium.

Secondly, distribution. Medium has built-in distribution through its article recommendation engine, and it has quickly become the go-to place for tech writing. Publishing there lets our work reach a much broader audience organically, without intentional marketing (again, more money) to drive people to our specific website.

I am extremely excited about reaching more Data Engineers and Data Scientists through knowledge sharing, and making the data world an even more well-documented place.

If you wish to follow my writing, all of it will continue to be published through Hadoopsters, just now on Medium. 🙂

Thank you to everyone who has read and leveraged our work through this website. We hope to make an impact on your work and career for years to come. Be well!

TL;DR

Hadoopsters.com now routes fully to our Medium publication. Our legacy website (this one), which has reverted to hadoopsters.wordpress.com, will live on as long as WordPress continues to serve it. We’ve migrated (cloned) select historical articles from this site to Medium, but every article ever written here will remain here. All new content will be produced exclusively on Medium.

How to Join Static Data with Streaming Data (DStream) in Spark

Today we’ll briefly showcase how to join a static dataset in Spark with a streaming “live” dataset, otherwise known as a DStream. This is helpful in a number of scenarios: like when you have a live stream of data from Kafka (or RabbitMQ, Flink, etc.) that you want to join with tabular data you queried from a database (or a Hive table, or a file, etc.), or anything you can normally consume into Spark.

Read More »
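To make the idea concrete before the full walkthrough, here’s a minimal sketch. It assumes an existing StreamingContext named ssc and a keyed DStream named stream of (userId, event) pairs; the file path and field layout are illustrative.

val staticRdd = ssc.sparkContext
  .textFile("hdfs:///reference/users.csv") // the static, tabular side
  .map(_.split(","))
  .map(fields => (fields(0), fields(1)))   // key it: (userId, userName)

// transform() exposes each micro-batch as an RDD, so an ordinary pair-RDD join applies
val joined = stream.transform(batchRdd => batchRdd.join(staticRdd))

Each batch of joined then holds (userId, (event, userName)) records, ready for further processing.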

Managing LDAP Users in Ambari

Adding users to a Hadoop cluster can be a little time-intensive.

I’ve managed Hadoop clusters for a little while now, and I’ve discovered that user management in Ambari is a little rough around the edges. Specifically, there’s no easy way to manage Ambari LDAP users from within Ambari, despite LDAP being a very popular way to provision and manage user access.

There is the command ambari-server sync-ldap [--users user.csv | --groups groups.csv] for adding users or groups, but that can be an issue if access to the ambari user or server is limited. Additionally, the command-line utility has no innate control over HDFS directories (creating or deleting them) upon a user or group sync, which adds extra steps to the user creation process.
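For example, creating a new user’s HDFS home directory is exactly the kind of follow-up step you’re left to script yourself. A minimal sketch using the Hadoop FileSystem API (the username and group are hypothetical, and setOwner requires HDFS superuser privileges):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val home = new Path("/user/jdoe")  // hypothetical new LDAP user
fs.mkdirs(home)                    // create the home directory
fs.setOwner(home, "jdoe", "hdfs")  // hand ownership to the user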

To address this, I present:
ambari-ldap-manager

Read More »

How to Write ORC Files and Hive Partitions in Spark


ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to its high performance, strong compression, and steadily growing support in top-level Apache products like Hive, Crunch, Cascading, and Spark.

I recently wanted/needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it.

Read More »
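As a preview of one way to do it (a minimal sketch assuming Spark 2.x’s DataFrame writer; the table name, partition column, and output path are illustrative, and the post itself may target an older API):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("write-orc")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.table("source_table") // any DataFrame with a date column

df.write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .partitionBy("date")                // one Hive-style partition per date value
  .save("hdfs:///warehouse/my_table") // a directory of partitioned ORC files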

How to Set or Change Log Level in Spark Streaming

Logs can really add up. Let’s learn to make like a tree and reduce them via convenient built-in methods.

Apache Spark alone, by default, generates a lot of information in its logs. Spark Streaming creates a metric ton more (in fairness, there’s a lot going on). So, how do we lower that gargantuan wall of text to something more manageable?

One way is to lower the log level for the Spark Context, which is retrieved from the Streaming Context. Simply:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName(appName)   // run on cluster
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
val sc = ssc.sparkContext                        // grab the underlying SparkContext
sc.setLogLevel("ERROR")                          // suppress everything below ERROR

Pretty easy, right?

What I Learned Building My First Spark Streaming App

IMG_20170811_183243
Get it? Spark Streaming? Stream… it’s a stream…. sigh.

I’ve been working with Hadoop, MapReduce, and other “scalable” frameworks for a little over 3 years now. One of the latest and greatest innovations in our open source space has been Apache Spark, a parallel processing framework built on the paradigm MapReduce introduced, but packed with enhancements, improvements, optimizations, and features. You probably know about Spark, so I don’t need to give you the whole pitch.

You’re likely also aware of its main components:

  • Spark Core: the parallel processing engine written in the Scala programming language
  • Spark SQL: allows you to programmatically use SQL in a Spark pipeline for data manipulation
  • Spark MLlib: machine learning algorithms ported to Spark for easy use by devs
  • Spark GraphX: a graph processing library built on the Spark Core engine
  • Spark Streaming: a framework for handling live, high-velocity streaming data

Spark Streaming is what I’ve been working on lately. Specifically, building apps in Scala that utilize Spark Streaming to stream data from Kafka topics, do some on-the-fly manipulations and joins in memory, and write newly augmented data in “near real time” to HDFS.
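The skeleton of such an app looks roughly like this (a minimal sketch using the spark-streaming-kafka-0-10 integration; the broker, topic, group id, transformation, and output path are all placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.map(record => record.value.toUpperCase)      // an on-the-fly manipulation
      .saveAsTextFiles("hdfs:///data/events/batch") // one output directory per micro-batch

ssc.start()
ssc.awaitTermination()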

I’ve learned a few things along the way. Here are my tips:

Read More »

How to Build Data History in Hadoop with Hive: Part 1


The Wind Up

One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.

A popular use of that immense storage capability is storing and building history for your datasets. Not only can you store years of data you might currently be deleting, you can also build on that history! You can structure the data with a Hadoop-native tool like Hive and give analysts the ability to query that mountain of data with SQL! And it’s pretty cheap!
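To preview the shape of it (the tutorial works in native Hadoop and Hive; this Spark SQL sketch is just an illustration, with hypothetical table and column names), a date-partitioned Hive table lets you append each day’s data as a new partition while keeping the full history queryable:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("build-history")
  .enableHiveSupport()
  .getOrCreate()

// one partition per day: cheap to store, easy to prune in queries
spark.sql("""
  CREATE TABLE IF NOT EXISTS events_history (user_id STRING, action STRING)
  PARTITIONED BY (ds STRING)
  STORED AS ORC
""")

// append today's snapshot as a new partition; the partition column (ds)
// must be the last column of the incoming DataFrame for insertInto
spark.read.table("events_staging")
  .write.mode("append")
  .insertInto("events_history")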

…And the Pitch!

In this tutorial, we’ll walk through why this is beneficial, and how to implement it on a technical level in Hadoop. Something for the business side, something for the developer tasked with making the dream come true.

The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple: low-hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all. It’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.

Let’s get going.

Read More »

Move a Running Command to a Background Process


You just kicked off a command on the command line, and one of three things happens:

  1. You have to leave your computer and run off to a meeting or go talk to your boss,
  2. You realize you’ve made a terrible mistake: you didn’t account for how much data or work that command has to deal with, and it’s probably going to take a few hours,
  3. You were testing a command, liked that it was working, and now want to let it run to completion.

Now what?

Read More »