Distributed RDataFrames with Spark

sagiorge · September 12, 2023, 2:37pm

Dear All,

I wish to read ROOT ntuples from EOS as distributed RDataFrames using the CERN Spark cluster on SWAN. I found an example from the last ROOT workshop: https://github.com/etejedor/distrdf-dimuon-analysis/blob/main/Dimuon_Python_Combined.ipynb. However, when running it on SWAN, the h.Draw() job fails. Attached is the error I get.

I have been working for months with the SWAN + Spark cluster framework with parquet files, filtering data and plotting histograms.

Now that I have ROOT files, I’d like to set analogous notebooks for the analysis. Any expert who could help me set the the environment correctly? I have not found detailed documentation; it would be great if you could point me to more examples and explanations of the integration between RDataFrame and Spark within SWAN.

Thank you for your time.

Best Regards,
Sabrina

nakolkar · September 14, 2023, 12:37pm

Dear @sagiorge,

I was also encountering this error and had discussed with the community. I suggest you to go through the discussion here. A probable reason for the problem is also in the discussion.

An expert tried using software stack ‘101’ when you login to SWAN instead of ‘103’ which is recommended. It worked for me. I hope it helps!

Regards,
Nilima