Access to EOS files from spark cluster

Dear experts,

I have been trying to run PyRDF with the simple code below, which accesses my files under /eos/atlas.

import ROOT
import PyRDF
# Select Spark backend, partition the dataset in 16 fragments when running distributedly
PyRDF.use("spark", {'npartitions': '16'})
from PyRDF import RDataFrame
rdf = RDataFrame( "MGPy8EG_A14N23LO_HH_151p0_150p5_150p0_MET75_NoSys", "root://eosatlas.cern.ch//eos/atlas/unpledged/group-tokyo/users/ymino/SusySkimHiggsino/DT_v2.0/MC16a/MGPy8EG_A14N23LO_C1C1_151p0_150p5_150p0_MET75_merged_processed.root")
hist = rdf.Histo1D(("h","",30,0,30), "trkD0Sig")
c = ROOT.TCanvas("c","c",800,500)
c.Draw()
hist.Draw()

However, access to the files via XRootD fails with the error below.

Error in TNetXNGFile::Open: [ERROR] Server responded with an error: [3010] Unable to give access - user access restricted - unauthorized identity used ; Permission denied

Running the klist command on the Spark cluster shows that there is no credentials cache, which I believe is what prevents me from accessing the files.

klist: No credentials cache found (filename: /tmp/krb5cc_114225_18552)

From reading other topics, I understood that Kerberos tickets are handled correctly when using the k8s clusters. Are there other steps I need to follow?

I am not sure whether this is still the case, but you need to acquire a valid token. To do this, you can open a terminal window and run kinit.
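To check from the notebook whether a ticket is present before and after kinit, something like the sketch below might help. It assumes the klist binary is on the PATH and supports the -s (silent) flag, and the helper name has_kerberos_ticket is only illustrative. It checks the machine where the notebook runs, not the Spark executors.

```python
import subprocess

def has_kerberos_ticket():
    """Return True if `klist -s` reports a readable, non-expired credentials cache."""
    try:
        # `klist -s` prints nothing and reports the result through its exit status.
        result = subprocess.run(["klist", "-s"],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        # klist is not installed here, so no ticket can be confirmed.
        return False
    return result.returncode == 0

if has_kerberos_ticket():
    print("Valid Kerberos ticket found.")
else:
    print("No valid Kerberos ticket; run kinit in a terminal first.")
```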

Dear Yuya,

I believe the reason your credentials don’t reach the other side is that you started your session on one of our SWAN physical machines (swan004, 005 or 006), which require an extra kinit for this to work.

The easiest way to do what you want is to connect to https://swan-k8s.cern.ch, and then select the Cloud containers (k8s) Spark cluster when you start your session. With this, SWAN should automatically propagate your credentials to the cluster side. Please try it out and let me know how it goes.

On the other hand, I see that you are using an old version of the distributed RDataFrame library. It’s no longer imported as PyRDF; please check out this page for examples:

https://root.cern/doc/master/classROOT_1_1RDataFrame.html

From SWAN, you would do something like:

import ROOT

# Point RDataFrame calls to the Spark specific RDataFrame
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

# The Spark RDataFrame constructor accepts an optional "sparkcontext" parameter
# and it will distribute the application to the connected cluster
df = RDataFrame("mytree", "myfile.root", sparkcontext = sc)  # sc is provided by SWAN as a result of connecting to a cluster from your notebook

...
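As a rough sketch, the histogram from the original PyRDF snippet could be booked with the newer interface along these lines. The helpers eos_url and book_histogram are illustrative names, not part of ROOT, and the file path in the last line is a placeholder. The sketch assumes the npartitions keyword of the Spark RDataFrame constructor takes the place of the old PyRDF.use setting, and that sc is the SparkContext SWAN provides.

```python
def eos_url(path, host="eosatlas.cern.ch"):
    """Build an XRootD URL for an absolute EOS path (note the double slash after the host)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path such as /eos/atlas/...")
    return "root://" + host + "/" + path

def book_histogram(sc, tree, path, npartitions=16):
    """Book the trkD0Sig histogram on a Spark-distributed RDataFrame."""
    import ROOT  # imported lazily so eos_url can be used without ROOT installed
    RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
    # Assumption: npartitions is accepted as a keyword argument here.
    df = RDataFrame(tree, eos_url(path), npartitions=npartitions, sparkcontext=sc)
    return df.Histo1D(("h", "", 30, 0, 30), "trkD0Sig")

# Placeholder path, only to show the URL format:
print(eos_url("/eos/atlas/path/to/file.root"))
```

The returned histogram is lazy, so the distributed job only starts when it is first used, for example when calling Draw() on it.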

Thank you for the detailed instructions!
I followed the procedure last weekend, and the Spark clusters can now access the eos/user directories via XRootD.
I couldn’t find any instructions on using swan-k8s with Spark clusters. Are there any dedicated pages on this topic? (It would be nice to improve my understanding of the structure of the SWAN machines.)

I had been looking at an old slide about distributed RDataFrame.
Thank you for pointing me to the latest page; I have modified my script to use the latest version.

Hello,

During this year, swan-k8s will become the only SWAN infrastructure, so some things are already smoother there. Note that you can still use the other (puppet) machines if you do a kinit first.

Glad to hear it’s working!

Enric