How Random Sampling in Hive Works, And How to Use It

Image Courtesy: https://faculty.elgin.edu/dkernler/statistics/ch01/images/srs.gif

Random sampling is a technique in which each member of the population has an equal probability of being chosen. A sample chosen randomly is meant to be an unbiased representation of the total population.

In the big data world, we have an enormous total population: a population that can prove tricky to truly sample randomly. Thankfully, Hive has a few tools for realizing the dream of random sampling in the data lake.
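
As a taste of what the post covers, here is a minimal HiveQL sketch of two common approaches; the table name and sample sizes are hypothetical, not from the original post.

```sql
-- Bucket sampling: hash each row into 100 buckets on rand() and keep
-- bucket 1, yielding roughly a 1% random sample of the table.
SELECT *
FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s;

-- Shuffle sampling: randomly distribute and sort rows across reducers,
-- then keep the first 1,000. This scans the whole table, but avoids the
-- single-reducer bottleneck of ORDER BY rand().
SELECT *
FROM my_table
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 1000;
```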

How to Build Optimal Hive Tables Using ORC, Partitions and Metastore Statistics

Creating Hive tables is a common experience for all of us who use Hadoop. It enables us to mix and merge datasets into unique, customized tables. And there are many ways to do it.

We have some recommended tips for Hive table creation that can increase your query speeds and reduce the storage footprint of your tables. And it’s simpler than you might think.
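
As a preview, here is a minimal sketch of the kind of table the post builds: ORC-backed, partitioned, with metastore statistics computed on top. The database, table, and column names are illustrative.

```sql
-- A partitioned, ORC-backed table with compression enabled
CREATE TABLE my_db.events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (ds STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Gather metastore statistics so the optimizer can plan smarter queries
ANALYZE TABLE my_db.events PARTITION (ds = '2016-01-01') COMPUTE STATISTICS;
ANALYZE TABLE my_db.events PARTITION (ds = '2016-01-01') COMPUTE STATISTICS FOR COLUMNS;
```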

How to Write ORC Files and Hive Partitions in Spark

ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to its high performance, strong compression, and ever-growing support across big data frameworks like Hive, Crunch, Cascading, and Spark.

I recently needed to write ORC files from my Spark pipelines, and found specific documentation lacking. So, here’s a way to do it.
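
Ahead of the full walkthrough, here is a minimal Spark (Scala) sketch of the idea: write a DataFrame out as ORC with Hive-style partitions. The application name, database, table, and column names are hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// A sketch only: names and data are hypothetical.
val spark = SparkSession.builder()
  .appName("OrcPartitionWriter")
  .enableHiveSupport() // lets saveAsTable register the table in the Hive metastore
  .getOrCreate()

import spark.implicits._

// Hypothetical data: "ds" will become the Hive partition column
val df = Seq(
  (1L, "click", "2016-01-01"),
  (2L, "view",  "2016-01-02")
).toDF("id", "action", "ds")

// Write ORC files, one subdirectory per distinct value of ds
df.write
  .mode(SaveMode.Append)
  .partitionBy("ds")
  .format("orc")
  .saveAsTable("my_db.my_orc_table") // or .save("/path/to/dir") for bare files
```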

How to Build Data History in Hadoop with Hive: Part 1

The Wind Up

One of the key benefits of Hadoop is its capacity for storing large quantities of data. With HDFS (the Hadoop Distributed File System), Hadoop clusters are capable of reliably storing petabytes of your data.

A popular use of that immense storage capability is storing and building history for your datasets. Not only can you store years of data you might currently be deleting, you can also build on that history! And you can structure the data within a Hadoop-native tool like Hive, giving analysts SQL query access to that mountain of data! And it’s pretty cheap!
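
To make that concrete, here is a minimal sketch (names and paths are hypothetical) of the date-partitioned Hive table pattern often used for history: each day’s load lands in its own partition, so history accumulates one partition at a time.

```sql
-- An external, date-partitioned table for accumulating history
CREATE EXTERNAL TABLE my_db.sales_history (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
LOCATION '/data/sales_history';

-- Each daily batch becomes a new partition, preserving all prior days
ALTER TABLE my_db.sales_history
  ADD PARTITION (load_date = '2016-01-01')
  LOCATION '/data/sales_history/load_date=2016-01-01';
```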

…And the Pitch!

In this tutorial, we’ll walk through why this is beneficial, and how we can implement it on a technical level in Hadoop. Something for the business guy, something for the developer tasked with making the dream come true.

The point of Hadoopsters is to teach concepts related to big data, Hadoop, and analytics. To some, this article will be too simple: low-hanging fruit for the accomplished dev. This article is not necessarily for you, captain know-it-all; it’s for someone looking for a reasonably worded, thoughtfully explained how-to on building data history in native Hadoop. We hope to accomplish that here.

Let’s get going.

Finding Physical Records in Hive with Virtual Columns

How many times have you been querying your data in Hive, only to come across some gnarly-looking (or mostly null) records? Maybe it’s only one or two entries, perhaps it’s thousands; either way, you have some (seemingly) busted data, and you’re not happy about it.

Let’s talk about some ways to pinpoint problem records in Hive, leveraging the tools available on our stack.
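
As a preview of the technique, Hive exposes virtual columns that tie every row back to its physical location on HDFS. A minimal sketch, with a hypothetical table and a hypothetical null check:

```sql
-- INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE are Hive virtual columns:
-- the HDFS file a row was read from, and the byte offset of its block within
-- that file. Together they point you at the physical record behind bad data.
SELECT INPUT__FILE__NAME,
       BLOCK__OFFSET__INSIDE__FILE,
       user_id
FROM my_db.events
WHERE user_id IS NULL;
```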