Debugging "java.net.SocketException: Connection reset"

Dear experts,

I’m trying to understand why some of my Spark jobs fail at the treeReduce stage when using PyRDF and the k8s cluster.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 5.0 failed 4 times, most recent failure: Lost task 3.3 in stage 5.0 (TID 127, 10.100.234.199, executor 10): java.net.SocketException: Connection reset...

I’ve uploaded my notebook here, but it analyzes files on CMS EOS that probably aren’t accessible to everyone.
https://nihaubri.web.cern.ch/error_message/VJets_MC_Comparison_2017_2018.html
The script makes plots over two large (~250 GB) datasets, and I only seem to get this error from the larger one’s Spark jobs. I’ve uploaded the Spark driver log here https://nihaubri.web.cern.ch/error_message/spark_connection_reset.txt
I’m suspicious of the line “Job 1 failed: treeReduce at /cvmfs/sft.cern.ch/lcg/views/LCG_99/x86_64-centos7-gcc8-opt/lib/python3.8/site-packages/PyRDF/backend/Spark.py:113, took 799.507747 s”. Maybe something is timing out at 800 s?

The individual logs are here but I’m unsure if they’re viewable by others: http://swan006.cern.ch:5190/stages/stage/?id=3&attempt=0 . Please let me know if there’s a better way to share them.

I don’t have this problem if I run on a subset of the samples and produce fewer histograms. Is this error consistent with the jobs running out of some resource? Any suggestions of things to try, or other logs to look at, would be appreciated.

Best,
Nick

Dear Nick,

I am not an expert on this, just a big user and fan of PyRDF. I have already run into this issue, with similar “python worker exited” or “connection reset” errors like the ones you cite, while running pretty big jobs… I may be wrong here, but to my mind it looks like a memory/resource issue in the workers, or something related to the k8s infrastructure. Anyway, I recently switched to the “bleeding edge” software stack instead of the LCG releases, and magically I had no more issues.
I don’t know whether you already use bleeding edge or not, but it could be worth a try, at least as a possible workaround.
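
If you do stay on an LCG stack, one thing worth checking is how much memory the executors get when the Spark connection is created. Just a minimal sketch of what I mean, using standard PySpark calls; the concrete values are placeholders, and whether your SWAN Spark connector lets you set these in code rather than through its configuration dialog is an assumption on my side:

from pyspark import SparkConf, SparkContext

# Placeholder values: tune them to what your k8s cluster actually allows.
conf = (
    SparkConf()
    .setAppName("vjets-comparison")              # hypothetical app name, just for illustration
    .set("spark.executor.memory", "8g")          # heap available to each executor JVM
    .set("spark.executor.memoryOverhead", "2g")  # extra off-heap room, used e.g. by the Python workers
    .set("spark.python.worker.memory", "2g")     # memory a Python worker may use before spilling to disk
)

sc = SparkContext.getOrCreate(conf=conf)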

Also, as far as I know PyRDF is undergoing big changes and will be integrated into the next (or next-to-next) ROOT version, with the following usage, as described on the ROOT RDataFrame page (tab “Distributed execution in Python”):

import ROOT

# Point RDataFrame calls to the Spark specific RDataFrame
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
 
# It still accepts the same constructor arguments as traditional RDataFrame
df = RDataFrame("mytree", "myfile.root")
 
# Continue the application with the traditional RDataFrame API
sum = df.Filter("x > 10").Sum("y")
h = df.Histo1D("x")
 
print(sum.GetValue())
h.Draw()

The idea is to have a more modern version of PyRDF within ROOT, maintained and better integrated :), and soon with more features than PyRDF (handling of friend trees, etc.).
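
If I read the documentation correctly, this new Spark RDataFrame should also accept the SparkContext that the SWAN connector already creates for you, plus the number of partitions to split the dataset into. A rough sketch (the sparkcontext and npartitions keyword names are my reading of the docs, so please double-check them for your ROOT version):

import ROOT
from pyspark import SparkContext

# On SWAN the connector normally provides 'sc' already; getOrCreate() reuses it if so.
sc = SparkContext.getOrCreate()

RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

# Pass the existing SparkContext and choose how many partitions to split the work into
df = RDataFrame("mytree", "myfile.root", sparkcontext=sc, npartitions=64)

h = df.Filter("x > 10").Histo1D("x")
h.Draw()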

I hope this helps, at least as a workaround, or maybe @vpadulan has better knowledge of what is happening to your jobs there.

Have a nice day ! and cheers,
Brian

Dear @nihaubri ,
Indeed, it seems to me that your problem might have to do with excessive memory usage on the workers, as suggested by @vbrian. Unfortunately the Java stack trace is not too helpful, since java.net.SocketException: Connection reset can happen for many reasons, although it is definitely something related to the health of the executors. Having the Python/C++ stack trace might help a little bit. Maybe also try other connection configurations and let us know.
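
To give an example of what I mean by connection configurations, these are the kind of Spark settings I would look at first. This is only a sketch: the values are placeholders, and on SWAN you might have to pass them through the Spark connector configuration rather than in the notebook:

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    # Give the executors more time before the driver gives up on them
    # (the Spark default for spark.network.timeout is 120s)
    .set("spark.network.timeout", "600s")
    # Heartbeats must stay well below spark.network.timeout
    .set("spark.executor.heartbeatInterval", "60s")
)

sc = SparkContext.getOrCreate(conf=conf)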

Meanwhile, Brian beat me to announcing the new integration of PyRDF in ROOT: it has indeed just landed with the latest 6.24 release! I will make a proper announcement on our channels shortly and will let you know :grinning_face_with_smiling_eyes: ! Expect to see it available in the next LCG release on SWAN too :+1:

Cheers,
Vincenzo

Switching to the “bleeding edge” stack made everything work for me too (I was on LCG 99 before), so it was probably a resource issue similar to what Brian saw. I wish the Spark error messages were a bit more helpful, but maybe I’m just spoiled by HTCondor’s direct “Memory Limit exceeded” errors.

For future reference, how can I view the Python/C++ stack trace? I didn’t see it on the Spark History Server page.

PyRDF and SWAN continue to be very useful in my work and it’s great to see constant improvement (and the helpful community!).

Thank you for all the help!
Cheers,
Nick

@nihaubri I don’t think you’re spoiled :slight_smile: Debugging distributed workflows is not as easy as when everything is local, but there should be mechanisms to hint at what’s going on.

@krraghav is it possible in the current version of the Spark monitor to show the output/error of failed tasks?