How to Override a Spark Dependency in Client or Cluster Mode

In this post, we’ll cover a simple way to override a jar, library, or dependency in your Spark application when an older copy already exists on the Spark classpath and causes runtime issues.

Recently, I needed to use a specific library as a dependency: Google’s GSON.

I needed a certain version of GSON or newer; otherwise there would be a runtime conflict. Why? A GSON method I was using was private in earlier versions of GSON, but public in later versions.

This doesn’t pinpoint the exact release where the method changed from private to public, but generally speaking:

gson-2.2.4.jar: the method is private, so this version is too old for use here.
gson-2.6.1.jar: the method is public, and works fine.

Somewhere between the two, the method’s visibility changed.
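
If you want to check which side of that line a particular GSON jar falls on, reflection can tell you. Below is a minimal sketch; `VisibilityCheck` and `isMethodPublic` are hypothetical names of my own, not part of GSON or Spark. It takes the class name as a string so it compiles even without GSON on the classpath.

```scala
import java.lang.reflect.Modifier
import scala.util.Try

object VisibilityCheck {
  /** Some(true) if the method is declared public, Some(false) if it is not,
    * None if the class or the method cannot be found. */
  def isMethodPublic(className: String, methodName: String, paramTypes: Class[_]*): Option[Boolean] =
    Try {
      // getDeclaredMethod also finds private methods, which is the point here
      val method = Class.forName(className).getDeclaredMethod(methodName, paramTypes: _*)
      Modifier.isPublic(method.getModifiers)
    }.toOption
}
```

For the method in question, the call would be `VisibilityCheck.isMethodPublic("com.google.gson.Gson", "newJsonWriter", classOf[java.io.Writer])`. Going by the versions listed above, I would expect a non-public result with gson-2.2.4 on the classpath and a public one with gson-2.6.1.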

So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

So I added the GSON version I needed, and made sure to exclude GSON from any other dependencies that might be peskily including their own GSON.

"com.google.code.gson" % "gson" % "2.6.1" % "compile"

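For context, here is roughly what the dependency plus an exclusion can look like together in a `build.sbt`. This is a sketch rather than my actual build file: the Retrofit converter coordinates and version are assumptions, picked only because that library shows up in the stack trace further down. Substitute whichever of your dependencies pull in GSON.

```scala
// build.sbt (sketch; the converter-gson coordinates and version are assumed placeholders)
libraryDependencies ++= Seq(
  // The GSON version the application actually needs
  "com.google.code.gson" % "gson" % "2.6.1" % "compile",
  // Exclude whatever GSON this dependency would otherwise bring in transitively
  ("com.squareup.retrofit2" % "converter-gson" % "2.5.0")
    .exclude("com.google.code.gson", "gson")
)
```
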
But there was a snag.

Everything worked fine during compilation and execution on my local machine, but my app would fail at runtime in production, where it was launched with a spark-submit command on a YARN/Hadoop cluster. Specifically, I ran into this dreaded error:

java.lang.IllegalAccessError: tried to access method com.google.gson.Gson.newJsonWriter(Ljava/io/Writer;)Lcom/google/gson/stream/JsonWriter; from class retrofit2.converter.gson.GsonRequestBodyConverter

I did what everyone does, and I started Googling.

I didn’t know it at the time, but this IllegalAccessError makes sense: somewhere at runtime (either in my jar or on my classpath) was a version of GSON older than the one I was expecting and including (in my uber/fat jar, no less!), and it was causing me to hit this error.

But where was it coming from?

I checked the dependency tree of my Scala project using the wonderful SBT Dependency Graph plugin by jrudolph, but to my disappointment GSON only came up in the one place I expected it to. I thought for sure I’d find another library that was including GSON that I missed.

But then I thought to check the Spark classpath on my cluster, which for me lived at:

../spark-2.3.2-bin-hadoop2.6/jars/

A quick ls and I found my jackpot:

[lrobinson@myhadoopcluster]$ ls spark-2.3.2-bin-hadoop2.6/jars/*gson*
spark-2.3.2-bin-hadoop2.6/jars/gson-2.2.4.jar

So, what are we looking at? Apache Spark ships GSON for its own use in its jars directory (which is included in your classpath by default). So Spark was behind it the whole time!

I now knew the problem: Spark’s baked-in GSON jar was causing a dependency collision with my application at runtime, which explains why I did not see the problem when executing locally on my laptop.
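
In hindsight, I could have asked the JVM directly where a class was being loaded from instead of hunting through directories. Here is a minimal sketch of that idea; `WhichJar` and `locationOf` are hypothetical names of my own, and the helper only uses standard JVM reflection.

```scala
object WhichJar {
  /** Location (jar or directory) a class was loaded from.
    * None if the class is missing, or if the JVM reports no code source for it
    * (as is typical for core JDK classes). */
  def locationOf(className: String): Option[String] =
    try {
      for {
        domain <- Option(Class.forName(className).getProtectionDomain)
        source <- Option(domain.getCodeSource)
        url    <- Option(source.getLocation)
      } yield url.toString
    } catch {
      case _: ClassNotFoundException => None
    }
}
```

Logging `WhichJar.locationOf("com.google.gson.Gson")` at application startup should show whether the class comes from your own jar or from Spark's jars directory. Note that this only reports on the JVM it runs in, so the driver and the executors would each need their own check.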

So what were my options? I again turned to Google.

Option 1: spark.driver.userClassPathFirst

This is an experimental setting that did not solve my issue, and I would not recommend it to you either. It attempts to give the user’s classpath precedence over Spark’s. Unfortunately, it caused a myriad of other confusing errors that I did not even know how to begin debugging, so I gave up on it. If it works for you, please give me a holler.

Option 2: spark.driver.extraClassPath & spark.executor.extraClassPath

This is exactly what I wanted, and exactly what worked. According to the docs, these settings prepend extra entries (such as jars) to the classpath. The keyword is prepend, as in, put in front of Spark’s built-in classpath and libs.

Below is how we leverage these settings for our apps. We use the --packages option to pull the GSON jar from Maven Central, the --jars option to point to where --packages saved that jar, and the two --conf settings to prepend the jar to the driver and executor classpaths.

--packages com.google.code.gson:gson:2.6.1
--jars /home/lrobinson/.ivy2/jars/com.google.code.gson_gson-2.6.1.jar
--conf spark.driver.extraClassPath=com.google.code.gson_gson-2.6.1.jar 
--conf spark.executor.extraClassPath=com.google.code.gson_gson-2.6.1.jar
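
If you need to do this for more than one library, the four flags can be generated from a Maven coordinate. The sketch below assumes the jar file name follows the `group_artifact-version.jar` pattern visible in the path above; `OverrideFlags`, `jarName`, and `flags` are hypothetical names of my own, and `ivyJarsDir` is wherever --packages saved the jar on your machine.

```scala
object OverrideFlags {
  /** File name of a jar saved by --packages, following the pattern seen above. */
  def jarName(group: String, artifact: String, version: String): String =
    s"${group}_${artifact}-${version}.jar"

  /** The four spark-submit arguments from this post, for an arbitrary coordinate. */
  def flags(group: String, artifact: String, version: String, ivyJarsDir: String): Seq[String] = {
    val jar = jarName(group, artifact, version)
    Seq(
      "--packages", s"$group:$artifact:$version",
      "--jars", s"$ivyJarsDir/$jar",
      // Bare file name, exactly as in the flags above
      "--conf", s"spark.driver.extraClassPath=$jar",
      "--conf", s"spark.executor.extraClassPath=$jar"
    )
  }
}
```

Calling `OverrideFlags.flags("com.google.code.gson", "gson", "2.6.1", "/home/lrobinson/.ivy2/jars")` and joining the result with spaces reproduces the four flags shown above.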

And like that, voila, the application ran and the IllegalAccessError was gone.

There are more scenarios beyond this one that you can use this solution for, so let us know how it helps you! Cheers!

5 thoughts on “How to Override a Spark Dependency in Client or Cluster Mode”



a.abc says:

April 18, 2021 at 4:56 am

Your post saved my day.. I met similar issue and cannot figure out why it compiles and works fine on my local but will break on spark-submit.
Really appreciate!


- Landon Robinson says:
  
  April 18, 2021 at 4:03 pm
  
  Fantastic! Glad to hear it. It was seriously about 2-3 workdays of sorting it out.
  
  
Sushama says:

December 29, 2021 at 7:05 am

I am facing a similar kind of issue. In my case, I have an older version of the jar on the system classpath, and I want to give priority to the jar version I am adding in spark-submit. I cannot use --conf spark.driver.userClassPathFirst=true
