Passing the SparkContext variable to execution scripts

Dear Experts,

I have been using RDataFrame with Spark clusters and currently have no issues when everything runs in a single Jupyter notebook. Now I would like to run my local RDataFrame scripts without converting them all to Jupyter notebooks (i.e. keep most of the scripts as Python files and execute them from the notebook).
For example, there is a script called “CutFlow.py” which uses RDataFrame with Spark clusters. I can execute this Python file from a notebook as

import subprocess
subprocess.call(["python","CutFlow.py"])

The current problem is that the code below does not run correctly, because the script cannot find the SparkContext (sc) provided by SWAN.

import ROOT
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
rdf = RDataFrame(tchain, sparkcontext = sc)

Is there a way to pass the SparkContext to the executed scripts (or imported modules) so that I can keep the structure of my local scripts as much as possible?


Dear Experts,

I haven’t received an answer to this topic for a while, so I would like to bring the question up again.
I am trying to use the Spark cluster for my physics analysis.
However, I am having trouble passing the SparkContext to the executed scripts.
Currently I am calling the function below to pass the SparkContext from the Jupyter notebook.

configMgr.setSparkContext(sc)

The definition of the function is

    def setSparkContext(self, sc):
        """
        Set the SparkContext used to distribute jobs to the Spark cluster

        @param sc SparkContext
        """
        if sc:
            self.RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame(self.tchain, sparkcontext=sc)
        return
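
For context, this is roughly how I intend to drive it from the notebook; ConfigManager and runCutFlow below are simplified placeholder names, not the actual code:

# Notebook cell -- simplified sketch; ConfigManager and runCutFlow are placeholders
configMgr = ConfigManager()        # builds self.tchain internally
configMgr.setSparkContext(sc)      # sc is the SparkContext injected by SWAN
configMgr.runCutFlow()             # uses the distributed RDataFrame set above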

However, to implement this I need to convert my Python script into a Jupyter notebook.
To keep the file structure as much as possible, is there a way to pass the SparkContext to the executed scripts (or imported modules)?
Ideally, I would like to pass the SparkContext as an argument, or import it into a module directly.
For example, I would like to execute a Python file from a notebook as

import subprocess
subprocess.call(["python","CutFlow.py","--sparkcontext",sc])

and pass the SparkContext to an RDataFrame defined inside CutFlow.py.

Dear Yuya,

This won’t work:

subprocess.call(["python","CutFlow.py","--sparkcontext",sc])

because you are creating a different process which does not have access to the sc object in memory of its parent process.

I believe you just need to import your module and use the functions you need from that module in the notebook:

from yourmodule import setSparkContext, runMe
setSparkContext(sc)
runMe()

In other words, you need to offer some interface for the SparkContext to be set and for the main method to be executed, that’s it. Or even fuse these two into one (runMe(sc)).
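
For example, CutFlow.py could look roughly like the sketch below; the tree and file names are placeholders, and the cut flow itself stays exactly as in your local script:

# CutFlow.py -- minimal sketch of the fused interface; names are placeholders
import ROOT

RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

def runMe(sc):
    # Build the TChain as in the local script, then hand it to the
    # distributed RDataFrame together with the SparkContext from the notebook
    tchain = ROOT.TChain("myTree")
    tchain.Add("myFile.root")
    rdf = RDataFrame(tchain, sparkcontext=sc)
    # ... the rest of the cut flow is unchanged ...
    print(rdf.Count().GetValue())

and then in the notebook simply:

from CutFlow import runMe
runMe(sc)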

Cheers,
Enric

Dear Enric,

because you are creating a different process which does not have access to the sc object in memory of its parent process.

Now I understand why I could not pass the SparkContext variable as an argument with the subprocess call.
Following your suggestion, I modified my script so that its main logic is wrapped in a function that can be imported into the notebook. By adding a SparkContext argument to that function, I was able to pass the SparkContext to the RDataFrame defined in my script.
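
For anyone finding this thread later, a minimal sketch of such a structure (tree and file names are placeholders; the local fallback is optional but keeps the script runnable on its own):

# CutFlow.py -- simplified sketch; names are illustrative
import ROOT

def main(sc=None):
    tchain = ROOT.TChain("myTree")
    tchain.Add("myFile.root")
    if sc is not None:
        # Distributed RDataFrame when a SparkContext is passed in from the notebook
        rdf = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame(tchain, sparkcontext=sc)
    else:
        # Plain local RDataFrame so the script still works standalone
        rdf = ROOT.RDataFrame(tchain)
    # ... cut flow unchanged from the local version ...
    print(rdf.Count().GetValue())

if __name__ == "__main__":
    main()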
Thank you for your support.

Best regards,
Yuya