PyRDF issues

Dear Enric, Dear SWAN/Spark team,

I would like to report a small issue with the PyRDF module I am using; it was not present last week. For now this module needs the Bleeding edge software stack to get access to its 0.2.0 version, so the problem might be related to some “instabilities” (?). Here is the bug:

import ROOT, sys
sys.path.insert(0, '/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF')
sc.addPyFile("Tools/Modules/PyRDF.zip") # my own setup
import PyRDF
PyRDF.use('spark')
r = PyRDF.RDataFrame(10)
r = r.Define("weight", "2.")
#r.AsNumpy()
print(r.Count().GetValue())
print(r.Sum("weight").GetValue())

This works with the “local” backend. I have included three RDataFrame methods that worked last week and no longer do. This could be linked to using an old PyRDF version, but I use the one from GitHub and followed the instructions in https://github.com/JavierCVilla/PyRDF/tree/master/demos (including setting up the Spark connector).

I may have forgotten something, but since it worked last Friday, I have to admit that I do not understand what is going on on my side… First, how can I check that I am really using the bleeding edge software stack and that I am not being redirected to LCG 96?

Best regards,
Brian

Hi Brian,

Bleeding edge can be unstable because packages are updated constantly, but PyRDF should not have changed.

What exactly is the error you see? Which Spark cluster are you using? @pmrowczy @pkothuri

You can check what LCG release you are using if you echo the $LCG_VIEW env variable.
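
From a notebook cell, the same check can be done in Python. This is only a minimal sketch: `LCG_VIEW` is the variable mentioned above, the helper name `lcg_release_from_view` is made up for illustration, and it assumes the view path has the form `/cvmfs/.../lcg/views/<release>/...`, as in the paths quoted elsewhere in this thread.

```python
import os


def lcg_release_from_view(view_path):
    """Return the path component right after 'views', or None if absent.

    Hypothetical helper: assumes a layout like
    /cvmfs/sft.cern.ch/lcg/views/<release>/<platform>.
    """
    parts = [p for p in view_path.split("/") if p]
    if "views" in parts:
        idx = parts.index("views")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


# Read the variable from the notebook's environment (empty if not set).
view = os.environ.get("LCG_VIEW", "")
print("LCG_VIEW =", view or "<not set>")
print("release  =", lcg_release_from_view(view))
```

If the printed release is not the one selected in the SWAN configuration form, the session is probably not running on the expected stack.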

Dear Enric,

Oh I understand, so that was the wrong track (and this variable does indeed point to the dev3python3 stack).
Also, I note that even drawing the simplest histogram does not work, so I guess this is indeed a Spark cluster related issue. This is the error I get:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 15, 10.100.2.97, executor 1): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

I use k8s with the spark.files option (from swan006 for info …)

Best regards, and thanks for your responsiveness :slight_smile:
Brian

Dear @vbrian

Did you select the LongRunningAnalysis configuration options (spark.python.worker.reuse to true)? We have seen executors (Python processes) crashing due to OOM in the past.

regards,
Prasanth

Dear Prasanth,

I did not select any option other than the spark.files config for PyRDF. (I did just try with it, however, and it still does not change anything…)

btw: if you would like a test example, even the following Spark tutorial does not work for me: “Processing ROOT (NanoAOD) files with Distributed ROOT”

Best regards,
Brian

Dear @vbrian

The example notebook works for me with LCG_96python3 and k8s cluster, are you using dev3python3 ?

regards,
Prasanth

Dear Prasanth,

Very interesting… I indeed use that config with k8s cluster

echo $LCG_VERSION
dev3python3

This narrows down the investigation drastically, but I have no idea why this example notebook does not work for me…

PS: Piotr and Prasanth are helping me look at the logs…
Best regards,
Brian

Dear @vbrian

dev3python3 can be unstable. Could you please try to run your notebook with LCG_96python3?

regards,
Prasanth

Dear Prasanth,

Have you seen anything in the logs?
Anyhow, it works for me with LCG96; however, this cannot be a solution for me… Indeed I have to use bleeding edge because Javier & Vincenzo made some changes in ROOT serialisation itself (or something like that), see https://github.com/JavierCVilla/PyRDF/issues/76#issuecomment-515997610, to make the AsNumpy() method possible for PyRDF…

Then, may I ask whether an LCG 97 release (with all those changes regarding PyRDF) is on the agenda? I am not counting on anything, and I am sure I will have to keep working and struggling with those bleeding edge instabilities; their randomness and duration are quite a stressful factor. At least I hope it will be solved tomorrow.

Thanks a lot then :slight_smile: and best regards,
Brian

Dear @vbrian,

Do you know what method and code snippet works in LCG96 and breaks in LCG_dev3 ?

I see that you get EOF Exception at the shuffle stage. I think you had it some time in the past and it was a bug in PyRDF.

Dear Piotr,

Nice to hear that, I guess. Actually any kind of PyRDF RDataFrame method seems to crash. You can find one in the example notebook for Spark use, NanoAODDimuonAnalysis-PyRDF-Spark.ipynb, but here is a smaller one:

import ROOT
import PyRDF
sc.addPyFile("Tools/Modules/PyRDF.zip") # TO BE CHANGED
PyRDF.use('spark')
r = PyRDF.RDataFrame(10)
r = r.Define("weight", "2.")

# EITHER
a = ROOT.TCanvas("", "", 600, 600)
a.cd()
h = r.Histo1D("weight")
h.Draw()
a.Draw()

# OR
print(r.Count().GetValue())
# OR
print(r.Sum("weight").GetValue())

Indeed I may have had similar issues previously, which is why I did the very first test without Count or Sum, which were developed afterwards. Histo1D has clearly been supported since an old version of PyRDF, so that simplest snippet for drawing a histogram should work even if Count() or Sum() do not.

Best regards,
Brian

Hi Brian,

Does it also crash if you run some Spark code (not PyRDF)?
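
A minimal sketch of such a check, with no ROOT or PyRDF involved. It assumes `sc` is the SparkContext provided by the SWAN Spark connector; the function name `spark_smoke_test` is only illustrative.

```python
def spark_smoke_test(sc, n=100, partitions=4):
    """Sum 2*x for x in range(n) on the executors.

    `sc` is assumed to be a live SparkContext (e.g. the one created by
    the SWAN Spark connector). Only plain PySpark RDD calls are used.
    """
    rdd = sc.parallelize(range(n), partitions)
    return rdd.map(lambda x: 2 * x).sum()


# In a notebook with the connector started, one would run:
#   print(spark_smoke_test(sc))
# The expected value is n * (n - 1), since sum(2*x for x in range(n)) = n*(n-1).
```

If this already fails with "Python worker exited unexpectedly", the problem is more likely in the Python environment on the executors than in PyRDF itself.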

@etejedor we are running PyRDF daily (also today) in our testing pipelines - https://gitlab.cern.ch/db/swan-spark-notebooks/-/jobs/6395376 . So LCG dev3 and PyRDF work there; the problem must be related to some specific function he uses. This is the notebook we test: https://gitlab.cern.ch/db/swan-spark-notebooks/blob/master/k8s-example/NanoAODDimuonAnalysis-PyRDF-Spark.ipynb

This is the error we see on swan006 for vbrian in the logs:

Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/backend/Spark.py", line 104, in spark_mapper
Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/backend/Dist.py", line 460, in mapper
Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/CallableGenerator.py", line 133, in mapper
Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/CallableGenerator.py", line 133, in mapper
Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/CallableGenerator.py", line 109, in mapper
Dec  2 15:17:32 swan006 ea129f8aa72a:  File "/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib/ROOT.py", line 418, in _RDataFrameAsNumpy
Dec  2 15:17:32 swan006 ea129f8aa72a:    column_type = df_rnode.GetColumnType(column)
Dec  2 15:17:32 swan006 ea129f8aa72a: Exception: string ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>::GetColumnType(basic_string_view<char,char_traits<char> > column) =>
Dec  2 15:17:32 swan006 ea129f8aa72a:    Column "weight_mc" is not in a dataset and is not a custom column been defined. (C++ exception of type runtime_error)
Dec  2 15:17:32 swan006 ea129f8aa72a: ) [duplicate 12]
Dec  2 15:17:32 swan006 ea129f8aa72a: 19/12/02 15:17:32 ERROR TaskSetManager: Task 2 in stage 12.0 failed 4 times; aborting job

Dear Piotr,

Okay, so the error you spotted actually came from a run with LCG96, and I had misspelled the column for AsNumpy (true :)). But this is an unrelated issue, because this method should not even work with LCG96: https://github.com/JavierCVilla/PyRDF/issues/76#issuecomment-515997610

This is why I have to be on bleeding edge, where nothing works. I have corrected this line, but you should still see an error (right now). This error has been corrected in the previous GitHub link I sent.

Best,
Brian

@vbrian you can see driver logs in SWAN: - Spark Connector “star button” -> “Show driver logs”

Currently:

PicklingError: Can't pickle <class 'ROOT.ndarray'>: attribute lookup ROOT.ndarray failed
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunn.....
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
...
Driver stacktrace:
19/12/02 15:30:57 INFO DAGScheduler: Job 3 failed: treeReduce at /eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/backend/Spark.py:113, took 9.474031 s

Hi Brian,
The error:

PicklingError: Can't pickle <class 'ROOT.ndarray'>

makes me think you are using an older version of PyRDF. Are you using the one provided by bleeding edge or a custom one that you ship to the Spark workers?
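
One way to see which copy of PyRDF actually gets imported is to ask Python where it would load the module from. This is a rough sketch, not an official SWAN or PyRDF procedure: `module_origin` is a made-up helper, and the commented-out executor check assumes `sc` is the SparkContext from the Spark connector.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# On the driver (the notebook process):
print("driver:", module_origin("PyRDF"))

# On the executors, assuming a live SparkContext `sc`:
#   origins = sc.parallelize(range(4), 4).map(lambda _: module_origin("PyRDF")).collect()
#   print("executors:", set(origins))
```

A path under /eos or inside a shipped zip would indicate a custom copy, while a path under /cvmfs would indicate the one from the software stack.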

Dear Enric,

Sorry for having made things so messy…
You are totally right. As I tried to mention, this issue appeared because of LCG96. It was solved by Javier and Vincenzo, who integrated something for serialisation into (Py)ROOT itself, so the fix is present in Bleeding edge and did work there. If you wonder what exactly, it is in the previous GitHub link. This is solved for bleeding edge and concerns only AsNumpy. I then had time to check that everything ran like clockwork with bleeding edge and the most up-to-date PyRDF. So I use this custom PyRDF (not the one shipped in the software stack); I just git cloned it again today to be sure, but everything worked normally last week…

The issue I am getting today is that, with Bleeding edge, none of the examples shown before work:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 14, 10.100.47.154, executor 4): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

This does not seem to be due to PyRDF (?), maybe to bleeding edge? I will try to launch simple Spark code as you suggested.

NB: do you have a very simple Spark snippet I could run? :blush:

@vbrian yes now I see the correct error .

19/12/02 15:47:10 INFO DAGScheduler: ShuffleMapStage 0 (treeReduce at /eos/home-v/vbrian/SWAN_projects/dvcs_project/Tools/Modules/PyRDF/PyRDF/backend/Spark.py:113) failed in 15.424 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 14, 10.100.47.154, executor 4): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at 
....
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)

This usually means that your shuffle stage returned nothing (not even an empty array). I think you had this error in the past with PyRDF - it was some empty ROOT file or something.

@vbrian can you please run this as a test: https://gitlab.cern.ch/db/swan-spark-notebooks/blob/master/k8s-example/NanoAODDimuonAnalysis-PyRDF-Spark.ipynb ? Just copy-paste the relevant code lines into your notebook as a test. This SHOULD work; we run it on a schedule every 12h in the same K8s cluster you use.

Thanks a lot !
Sorry, I have trouble opening it (I used wget and I get an “unreadable file, not JSON” kind of error, creepy stuff).
Is it the same as in the example gallery?

EDIT: anyway, I copy-pasted the relevant code…

So I just did it, same error…

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 44, 10.100.35.87, executor 1): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

@vbrian I just saw that in our pipelines the dev notebook indeed does not work! The app status was being wrongly parsed!

We even have a worse error:

 *** Break *** segmentation violation
 Generating stack trace...
 0x00007f45df11bcfc in <unknown> from /cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-centos7-gcc8-opt/lib/libCling.so
 0x00007f45df11c3f6 in <unknown> from /cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-
....
/cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-centos7-gcc8-opt/lib/libpython2.7.so.1.0
 0x00007f45fc3a1445 in __libc_start_main + 0xf5 from /lib64/libc.so.6
 0x000000000040066e in <unknown> from python 

I see it started failing exactly 2 days ago

prod-lcgdev-pyrdf-notebook-23176-1575080860873-driver     0/1     Completed   0          2d12h
prod-lcgdev-pyrdf-notebook-3602-1575167322642-driver      0/1     Error       0          36h

@etejedor can you help here somehow ?