Support for spark-root (like "LHCb open data" example)

Greetings,

I am setting up a workflow to use Spark to analyse large ROOT TTrees.
What I'd like to achieve is exactly what is presented in one of the very useful SWAN examples/tutorials:

https://swan002.cern.ch/user/franzoni/gallery/apache_spark
Swan > Examples Gallery (top right corner) > Apache Spark > An example using LHCb open data

When I execute that example, I cannot get past this error:

: java.lang.ClassNotFoundException: Failed to find data source: org.dianahep.sparkroot. Please find packages at http://spark.apache.org/third-party-projects.html

which arises the first time a ROOT -> DataFrame conversion is invoked:

spark.read.format("org.dianahep.sparkroot").load(data_directory + "PhaseSpaceSimulation.root")

I've also tried moving to a later version:

spark = SparkSession.builder \
    .appName("LHCb Open Data with Spark") \
    .config("spark.jars.packages", "org.diana-hep:histogrammar-sparksql_2.11:1.0.15") \
    .getOrCreate()

(1.0.15 in place of the original 1.0.4)

I am new to Spark, and wonder if anyone who has used https://diana-hep.org/pages/project_spark_root.html or set up the tutorial
is aware of possible issues with spark-root and of workarounds?

Thanks in advance,
Giovanni

I believe @pkothuri will be able to help you.

Hi Giovanni,

I have just run the gallery notebook “Processing LHCb Opendata with Spark and ROOT” and it works OK for me.
One important note: the notebook was originally developed to run with Spark in the simplest configuration, that is in local mode (not attached to a cluster), as it reads data from EOS fuse-mounted on the driver.
I also see another possible source of confusion in your message: the spark-root package, version 0.1.15, can be fetched from Maven as "org.diana-hep:spark-root_2.11:0.1.15".
Spark-root is OK for this and other use cases of reading ROOT files but, as you know, it has limitations with more recent ROOT files and is no longer actively developed.
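Putting the pieces from this thread together, a minimal sketch of a working configuration could look like the following. The Maven coordinate and data source name are the ones given in this thread; the local-mode master setting is an assumption based on the note above about reading EOS via the fuse mount:

```python
# Maven coordinate for spark-root, as corrected above in this thread.
SPARK_ROOT_PACKAGE = "org.diana-hep:spark-root_2.11:0.1.15"

# Data source name used by the gallery notebook (from the opening post).
SPARK_ROOT_FORMAT = "org.dianahep.sparkroot"


def read_root_file(path):
    """Hypothetical helper for illustration: build a local-mode session
    with spark-root on the classpath and read a ROOT TTree into a DataFrame."""
    from pyspark.sql import SparkSession  # deferred import: pyspark is optional here

    spark = (SparkSession.builder
             .appName("LHCb Open Data with Spark")
             .master("local[*]")  # local mode, since the notebook reads EOS via fuse
             .config("spark.jars.packages", SPARK_ROOT_PACKAGE)
             .getOrCreate())
    return spark.read.format(SPARK_ROOT_FORMAT).load(path)
```

Note that spark.jars.packages is a session configuration key (it puts the jar on the classpath), while the org.dianahep.sparkroot string is the data source name passed to format().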
For your information, a new library for the CMS analysis use cases is currently in active development: https://github.com/spark-root/laurelin
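For the curious, usage would look roughly like the sketch below. The coordinate, short format name and "tree" option are assumptions based on the laurelin project README, not on this thread; check the repository for the current version and API:

```python
# Assumed Maven coordinate; the version is hypothetical, check the
# laurelin repository for the latest release.
LAURELIN_PACKAGE = "edu.vanderbilt.accre:laurelin:1.0.0"


def read_with_laurelin(path, tree="tree"):
    """Sketch of reading a ROOT file with laurelin instead of spark-root
    (assumed API: short data source name "root", TTree name as an option)."""
    from pyspark.sql import SparkSession  # deferred import: pyspark is optional here

    spark = (SparkSession.builder
             .config("spark.jars.packages", LAURELIN_PACKAGE)
             .getOrCreate())
    return spark.read.format("root").option("tree", tree).load(path)
```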

Cheers,
Luca

Thanks a lot Luca!

I have just run the gallery notebook “Processing LHCb Opendata with Spark and ROOT” and it works
OK for me.

Can you help me reproduce what you do?
I report my steps below. There is probably something incomplete in my setup,
as I can consistently reproduce the error.

that is in local mode (not attached to a cluster), as it reads data from EOS fuse mounted on
the driver.

The notebook uses the "spark" object without defining it explicitly or importing it in any included module.
I had guessed that connecting to a cluster was expected, since upon establishing a connection "spark" is instantiated for the notebook. Perhaps I am not fully getting what you mean?

spark-root/laurelin

Thanks, noted. I'll move to it once I have sorted out my issues with the vanilla example.

Thanks a lot, and regards,
Giovanni

PS: if you deem it appropriate to move this thread away from the SWAN community (since it's only partly related), I'll be happy to do so, of course.

This is my sequence:
1. Apache Spark > An example using LHCb open data
2. I clone the notebook, making a copy for myself (https://swan002.cern.ch/user/franzoni/notebooks/SWAN_projects/LHCb_OpenData_Spark1/LHCb_OpenData_Spark_GF_20191021.ipynb)
3. I click on the "star" to create a connection to the CERN analytics cluster (adding no custom options)
4. I run the cells one by one, up to the one starting with
# Let us now load the simulated data
which fails with the message I reported in my opening message

Py4JJavaError: An error occurred while calling o69.load.

: java.lang.ClassNotFoundException: Failed to find data source: spark.jars.packages. Please find packages at http://spark.apache.org/third-party-projects.html
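Reading the error text again, it looks as if the configuration key spark.jars.packages may have ended up being passed to format() as the data source name. A minimal sketch of the distinction, using the coordinates from earlier in the thread:

```python
# Goes into the session config key "spark.jars.packages" (classpath setup):
PACKAGE = "org.diana-hep:spark-root_2.11:0.1.15"

# Goes into spark.read.format() (selects the reader):
DATA_SOURCE = "org.dianahep.sparkroot"


def load_tree(spark, path):
    """Hypothetical helper: the package coordinate configures the classpath,
    the data source name selects the reader. Passing the config key to
    format() would yield exactly
    'Failed to find data source: spark.jars.packages'."""
    return spark.read.format(DATA_SOURCE).load(path)
```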