Connect to spark cluster on startup

grigolet · February 4, 2021, 9:56am

Hello,
I would like to know if it is possible to connect to a spark cluster when I open swan without having to connect manually from the UI.
In particular I would like to be able to connect to the NXCALS cluster when starting my SWAN session. I configured my SWAN startup script to get a valid kerberos ticket.
Is there a way to skip this part?

pkothuri · February 4, 2021, 11:18am

currently it is not possible to generate spark session object in the startup script. As this is attached to the notebook, this has to be done from with in the notebook either with the icon or we also have a python snippet that create spark session from the cell

grigolet · February 4, 2021, 11:31am

Thanks for the info.
What is the snippet you’re referring to start the spark session?

pkothuri · February 4, 2021, 12:13pm

Dear @grigolet

Following snipped should work, make sure you have krb ticket (which you said you were getting in the startup script)

regards,
Prasanth

# Stop spark session

try:
    spark.stop()
except NameError:
    pass

# Manual spark configuration to execute notebook outside of SWAN service

import os
import random
import subprocess

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if "SPARK_PORTS" in os.environ:
    ports = os.getenv("SPARK_PORTS").split(",")
else:
    ports = [random.randrange(5001,5300),random.randrange(5001,5300),random.randrange(5001,5300)]

# Spark Config
nxcals_jars=subprocess.run(['ls $LCG_VIEW/nxcals/nxcals_java/* | xargs | sed -e "s/ /:/g"'], shell=True, stdout=subprocess.PIPE, env=os.environ).stdout.decode('utf-8')
conf = SparkConf()
conf.set('spark.master', 'yarn')
conf.set("spark.logConf", True)
conf.set("spark.driver.host", os.environ.get('SERVER_HOSTNAME'))
conf.set("spark.driver.port", ports[0])
conf.set("spark.blockManager.port", ports[1])
conf.set("spark.ui.port", ports[2])
conf.set('spark.executorEnv.PYTHONPATH', os.environ.get('PYTHONPATH'))
conf.set('spark.executorEnv.LD_LIBRARY_PATH', os.environ.get('LD_LIBRARY_PATH'))
conf.set('spark.executorEnv.JAVA_HOME', os.environ.get('JAVA_HOME'))
conf.set('spark.executorEnv.SPARK_HOME', os.environ.get('SPARK_HOME'))
conf.set('spark.executorEnv.SPARK_EXTRA_CLASSPATH', os.environ.get('SPARK_DIST_CLASSPATH'))
conf.set('spark.driver.extraClassPath', nxcals_jars)
conf.set('spark.executor.extraClassPath', nxcals_jars)
conf.set('spark.driver.extraJavaOptions',"-Dlog4j.configuration=file:/eos/project/s/swan/public/NXCals/log4j_conf -Dservice.url=https://cs-ccr-nxcals6.cern.ch:19093,https://cs-ccr-nxcals7.cern.ch:19093,https://cs-ccr-nxcals8.cern.ch:19093 -Djavax.net.ssl.trustStore=/etc/pki/tls/certs/truststore.jks -Djavax.net.ssl.trustStorePassword=password")

sc = SparkContext(conf=conf)
spark = SparkSession(sc)

mmacieje · February 4, 2021, 12:29pm

Hi Prasanth and Gianluca,

this topic is of interest for me too. I’m not sure how to get a krb ticket with SWAN. Could you please share some insights on that?

Thanks in advance, Michał.

grigolet · February 4, 2021, 12:45pm

Thanks @pkothuri for the script.
@mmacieje I don’t know if it’s the best way to do it but what I did in my case was to generate a keytab file (from example from lxplus with cern-get-keytab --user --keytab grigolet.keytab and I saved it on my eos). Then I made a simple startup script that I’m calling when SWAN is starting that contains this line:

kinit -kt /path/to/grigolet.keytab grigolet@CERN.CH