Spark jobs never handled on k8s

Hello,
I have recently been having some issues using k8s. Here is a screenshot where I am submitting Spark jobs through the PyRDF module: you can see that I have 0 executors/0 cores assigned to them. These numbers change from time to time (e.g. 1 executor/4 cores), I guess due to the way Kubernetes allocates resources, but the jobs never seem to be picked up by the workers (the task count remains 0 even 10 minutes after submission).
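For reference, the submission looks roughly like this (just a minimal sketch with the standard PyRDF Spark backend; the tree, file and branch names below are placeholders):

import PyRDF

# Sketch only: select the PyRDF Spark backend (placeholder partition count)
PyRDF.use("spark", {"npartitions": 16})

# Placeholder tree/file/branch names; drawing the histogram triggers the distributed event loop
df = PyRDF.RDataFrame("myTree", "myFile.root")
h = df.Histo1D("myBranch")
h.Draw()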


This is something I observe both for the bleeding edge software stack and for LCG99 (Python 3).

Another remark: I used to use LCG97a (Python 3) with k8s, but it seems that I can no longer choose k8s in the configuration of my SWAN session. Is this expected?

Thanks for helping, and have a nice day!
Brian

Hi Brian,
I recently ran across a similar issue. Apparently you need to enable the
“Spark3Shuffle” option before connecting to k8s in LCG99.
Cheers,
Pedro

Dear Brian, Pedro is exactly right. We are trying to make this the default (it is more intuitive); until then, please choose that bundle. Thank you.

regards,
Prasanth

Dear Pedro, Dear Prasanth,

Thanks a lot for the hint, it works as usual now :slight_smile:
Cheers !
Brian

Dear all,
I am reopening this thread because I am experiencing the same issue when trying to use the NXCALS extraction API:
from nxcals.api.extraction.data.builders import DataQuery

df1 = DataQuery.builder(spark).byVariables() \
    .system('CMW') \
    .startTime('2021-01-15 00:00:00.000').endTime('2021-12-30 00:00:00.000') \
    .variable('F16.BLMIB.105.UCAP:Acquisition:lossBeamPresence') \
    .build()
Over the last few days it has worked occasionally, but at other times I have been assigned 0 executors and 0 cores, as shown above (e.g. right now it has been like this for over an hour).
I am configuring the environment with NXCALS Spark 3.
Also, apologies for my ignorance, but I don’t really know how to find and/or enable the “Spark3Shuffle” option suggested above (and whether that solution applies to my case at all).
Many thanks!
Giuseppe

Dear Giuseppe,

Just a quick reply on the Spark3Shuffle option: I found that it does not appear in the bundled configurations for some SWAN session configurations, so I guess it might not appear in your case.
Anyway, in the Spark cluster connection tab in your notebook, you can still enable it manually by typing the following two option names:

spark.shuffle.service.enabled, with its value set to false
spark.dynamicAllocation.shuffleTracking.enabled, with its value set to true

At least, this is what gets set by enabling the Spark3Shuffle bundle option on LCG100 on k8s. I hope you have the possibility to set these and can at least give them a try.
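For illustration only, those two options correspond to roughly the following plain PySpark configuration (just a sketch; in SWAN the connection dialog is the normal place to set them):

from pyspark.sql import SparkSession

# Sketch of the two options the Spark3Shuffle bundle sets (values as observed on LCG100/k8s)
spark = (SparkSession.builder
         .config("spark.shuffle.service.enabled", "false")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())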
This was just a quick, best-guess answer, but someone from the SWAN team will surely answer your question much more precisely :slight_smile:

Brian

Hi,

I believe the issue is unrelated: you are exhausting the resources of your queue in the YARN cluster manager.

In case you cannot reduce the resources you are currently using, please open a SNOW ticket here explaining your use case and requesting additional resources.

Thanks @vbrian for helping!

Cheers,
Riccardo

Thanks to you both for your prompt replies! As a follow-up, how can I monitor the resources that I am using in the YARN cluster?
Again, apologies if my questions seem trivial; I am a new user of all these tools.
Best
Giuseppe

You can access the YARN Web UI and see which applications are running here: http://ithdp1005.cern.ch:8088/cluster/apps/RUNNING
Then you can filter the list of running applications by e.g. your username, and you will get a list very similar to the one I pasted above (there, 74.4 and 23.8 are the percentages of your queue’s resources that you are already using).
You can also go here http://ithdp1005.cern.ch:8088/cluster/apps/ACCEPTED and see which applications are “accepted”, meaning they are ready to start but have not been allocated resources yet.
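If you prefer to check this from a notebook rather than the browser, the same information is exposed by the ResourceManager REST API; something along these lines should work (a sketch only, assuming the endpoint is reachable without extra authentication, and with a placeholder username):

import requests

# Sketch: list RUNNING applications for a given user via the YARN ResourceManager REST API
rm = "http://ithdp1005.cern.ch:8088"
params = {"states": "RUNNING", "user": "your_username"}  # placeholder username

resp = requests.get(f"{rm}/ws/v1/cluster/apps", params=params)
apps = (resp.json().get("apps") or {}).get("app", [])
for app in apps:
    print(app["id"], app["name"], app["queue"], app["state"])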

Thanks again Riccardo for your precious help! One last question then: how do I delete/kill applications that I no longer want to use? I would like to remove them (with one exception) so that I can start new ones, but I don’t know how to do it.
I closed the SWAN notebooks, but that does not seem to work.
Best
Giuseppe

Closing the notebook just means that you no longer have access to the web UI; the “kernel” (the Python interpreter behind the notebook) keeps running. The proper way to stop it is either from the notebook UI via “Kernel” -> “Shutdown”, or with the orange button you get when you select a running notebook in the main SWAN interface.
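If the notebook is still open, another option is to stop the Spark session from inside it, which ends the corresponding YARN application and frees the resources in your queue:

# Stopping the SparkSession releases the executors and ends the YARN application
spark.stop()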

In addition to this, you can run more applications if you disable the Spark dynamic allocation feature in the connection extension. You will notice that when you connect to the NXCALS cluster you have the possibility to include the “NXCALS bundle” which, among other things, enables dynamic allocation with up to 40 executors. This means that a single application, if it sees it can benefit from more resources, can fill the whole queue. In this case, setting spark.dynamicAllocation.enabled to false prevents the extension from taking more than 4 executors.
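As an illustration of what that option amounts to in plain PySpark (a sketch only; in SWAN you would type the option name and value in the connection dialog rather than build the session yourself):

from pyspark.sql import SparkSession

# Sketch: start the session with dynamic allocation disabled, so the application
# keeps its fixed executor allocation instead of growing to fill the queue
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "false")
         .getOrCreate())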

It’s a balance between faster response times and resource availability, and we are still trying to find what works best. You could also keep using dynamic allocation but reduce the maximum number of executors to, say, 10, by setting spark.dynamicAllocation.maxExecutors to 10.
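The capped variant would look roughly like this (again only a sketch of the equivalent plain configuration):

from pyspark.sql import SparkSession

# Sketch: keep dynamic allocation but cap it at 10 executors
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "10")
         .getOrCreate())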

Thanks Riccardo, this is again very useful! So if I understand correctly, once I have closed the web UI there is no longer any way for me to interact with these applications and kill the jobs (or visualize the results)? Of course, I will make sure to avoid this in the future.
Best
Giuseppe