Using Spark with ROOT DataFrames

Dear SWAN developers,

First of all, thanks for having conceived and developed SWAN.

I use this tool daily; however, I still feel that I do not know how to use most of its functionality, especially the Spark interface. Most of my code is written with PyROOT (soon using RDataFrames…), and I would like to ask how Spark can be interfaced with ROOT's RDataFrame framework.

I saw quite a number of slides related to the following work :

https://indico.cern.ch/event/754811/contributions/3223563/attachments/1766431/2868164/Distributed_data_analysis_with_ROOT_RDataFrame_and_Spark.pdf

I then tried to install it in my workspace, which ‘seems’ to work, but only with the PyRDF.use('local') configuration. And if I connect my notebook to analytix, the PyRDF 'spark' option still does not work (which might actually be expected). Is PyRDF officially in use, or is it only exploratory work?

So here I come with a design question… What I would like is a framework that combines the possibilities of Spark, ROOT, and DataFrames both for a classical analysis (mostly cuts and histogramming -> PyROOT?) and for machine learning analysis (using PyTorch, MLflow, etc., preferably not TMVA), and does all this quite fast (Spark).

My question is the following: should I move my ROOT data to Spark SQL DataFrames and then use Spark? RDataFrame in ROOT seems really promising to me (and it especially eases the histogramming part). I am a bit lost among all the possibilities: pandas, Spark DataFrames, ROOT RDataFrames (which I would prefer, but I am not sure it is the most valuable choice for data analysts), etc.

Sorry for this rather long message; I really hope my explanations are not too fuzzy and that you can help me with this design issue.

NB (independent question): I succeeded in using the papermill module to run multiple notebooks from a single one. Is there a way to centralise and parametrize different notebooks as papermill does, but using Spark to run the notebooks? (Not running many of them at the same time, of course, but one at a time, and faster thanks to Spark.)
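For reference, the papermill usage mentioned above looks roughly like the sketch below (notebook names and parameters are hypothetical):

    # Run one parametrized notebook at a time from a "driver" notebook.
    # Input/output notebook names and the 'dataset' parameter are made up.
    import papermill as pm

    for dataset in ["sampleA", "sampleB"]:
        pm.execute_notebook(
            "analysis_template.ipynb",            # template notebook (hypothetical)
            "analysis_{}.ipynb".format(dataset),  # executed copy written here
            parameters={"dataset": dataset},      # injected into the 'parameters' cell
        )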

Thank you very much, and have a nice day! (It is snowing at CERN!)

Brian Ventura

Dear Brian,

Thank you very much for your kind words!

PyRDF is not yet integrated in mainstream ROOT, but that does not mean it cannot be used for actual work. It has already been used by TOTEM colleagues to run a real analysis on Spark.

PyRDF is just a layer on top of RDataFrame that lets you choose among multiple backends to run your RDataFrame computation graph. Therefore, you can use it to do cuts and histogramming on bigger datasets than you could process on a single machine. We can guide you through the process of configuring your SWAN session to spawn Spark jobs with PyRDF on analytix.
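As a rough illustration of what switching backends looks like (tree and file names are hypothetical, and the exact API may differ from this sketch):

    # Sketch of selecting a PyRDF backend and running cuts + histogramming;
    # the tree/file names are hypothetical and the API may have evolved.
    import PyRDF

    PyRDF.use("spark")                 # or "local" to run inside the notebook

    df = PyRDF.RDataFrame("Events", "mydata.root")
    h  = (df.Filter("nMuon >= 2")      # cuts expressed as in plain RDataFrame
            .Histo1D("Dimuon_mass"))   # histogramming also works unchanged

    h.Draw()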

Concerning the ML use case, what exactly do you want to do? There is a feature that was recently added to RDataFrame, called AsNumpy, which you can use to read RDataFrame columns into NumPy arrays. Here’s a link to open a tutorial in SWAN:

https://cern.ch/swanserver/cgi-bin/go?projurl=https://root.cern.ch/doc/master/notebooks/df026_AsNumpyArrays.py.nbconvert.ipynb

Such functionality is not yet available in PyRDF, only in local RDataFrame, but it is on the to-do list. You could feed that processed data in NumPy format into any ML tool you want.
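A minimal sketch of what that could look like with a local RDataFrame (file and column names are hypothetical):

    # Read selected RDataFrame columns into NumPy arrays, then into pandas.
    # "mydata.root" and the columns "pt"/"eta" are made up for illustration.
    import ROOT
    import pandas as pd

    df   = ROOT.RDataFrame("Events", "mydata.root")
    data = df.Filter("pt > 20").AsNumpy(columns=["pt", "eta"])  # dict of NumPy arrays

    pdf = pd.DataFrame(data)   # hand the columns to pandas, PyTorch, etc.
    print(pdf.head())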

On the other hand, we are planning to add GPU resources to SWAN, so that ML-enabled libraries (e.g. tensorflow) can run faster.

Cheers,

Enric

Dear Enric,

Thank you very much for this really complete message. I would indeed very much appreciate configuring my SWAN session to use PyRDF to interface RDataFrames with Spark; I am looking forward to your further instructions!

Also thanks for the AsNumpy tutorial, I did not know about it. Being able to use pandas DataFrames afterwards is the answer I was looking for in order to then use ML libraries like PyTorch (see the small sketch after the list below). I cannot be really specific about my ML applications for now, but for example:

  • basic/deep NNs, maybe for complex fitting procedures
  • manifold learning
  • source separation under constraints (like non-linear ICA)
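As an illustration of the pandas-to-PyTorch step mentioned above, a small hypothetical sketch (the columns and the tiny network are made up):

    # Feed pandas/NumPy data into PyTorch; columns and model are illustrative.
    import numpy as np
    import pandas as pd
    import torch

    pdf = pd.DataFrame({"pt": np.random.rand(1000), "eta": np.random.randn(1000)})
    x   = torch.tensor(pdf[["pt", "eta"]].values, dtype=torch.float32)

    net = torch.nn.Sequential(torch.nn.Linear(2, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
    out = net(x)          # forward pass on the converted data
    print(out.shape)      # torch.Size([1000, 1])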

GPU resources would be interesting, especially for PyTorch and indeed TensorFlow. I am also looking forward to this functionality at a later stage!
Thanks for your help, and have a nice day!

Best regards,
Brian Ventura

Dear Brian,

@jcervant will be able to give you more instructions on how to set up your SWAN session with PyRDF.

I also take this opportunity to mention that we are preparing a one-day SWAN users workshop at CERN, most probably in October (the date is not official yet). There will be presentations by the SWAN team and also by users, and we will discuss issues and ideas for improving the service, including new features (by then we might have a prototype of the GPU support). It would be great if you could attend and tell us how you are using SWAN and what is missing.

Dear Brian,

It is super nice to see more people interested in PyRDF! Thanks!

Apologies for my late answer. I have been preparing a document with some instructions to run PyRDF + Spark in SWAN; you can find it at the link below while it gets merged into master:

For the time being, the configuration is not as user-friendly as we would like, as there are some manual steps involved to make PyRDF work in SWAN (such as creating the zip file for the Python module, sending it to the workers, adding some instructions to your notebook…). However, this is only a temporary solution until we distribute PyRDF as part of the LCG releases, so that it is easily accessible from every Spark worker through CVMFS.
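Roughly, the manual steps could look like the following sketch (paths are illustrative and the linked document is the authoritative reference; 'sc' is assumed to be the SparkContext created by the SWAN Spark connector):

    # 1) In a SWAN terminal, zip the PyRDF Python module, e.g.:
    #        zip -r PyRDF.zip PyRDF
    # 2) In the notebook, after connecting to the cluster, ship the module
    #    to the workers through the assumed SparkContext 'sc':
    sc.addPyFile("./PyRDF.zip")

    # 3) Select the Spark backend before building the computation graph:
    import PyRDF
    PyRDF.use("spark")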

These instructions suggest using the new Cloud Containers cluster available in SWAN, which is based on Kubernetes. Indeed, this cluster is the recommended option to use together with PyRDF. Nonetheless, PyRDF can also work with Analytix, in case you really need to use that cluster (please let us know if this is a strong requirement for your use case).

At the same link above, there are some demos that you can take as usage examples. Our main design goal behind PyRDF is to require as few changes as possible to RDataFrame-based user code; however, we are still at an early development stage, so any feedback, suggestions or issues about the API, usability or missing features are more than welcome.

I would suggest that you start running your examples with Python 2, if possible, since there are still some Python 3 incompatibilities to be fixed.

Please let us know if you find any issues setting up the environment in SWAN.

Cheers,
Javier

Dear Javier,

Thanks for this walkthrough (for other users: be careful to check out the new-demo branch).

First, I don’t have any particular need to use one cluster or another, so I guess it will be Kubernetes (I don’t actually know what the differences are :grin:). Your tutorial seems to work for me (at least RDF_demo.ipynb, not forgetting the line sc.addPyFile("./PyRDF.zip"))!

For now I will convert my code to use RDataFrames (and pandas at some point, even if PyRDF is not suited for pandas, I guess?). I think I need some DataFrame methods that are not available in ROOT.
I will also be interested in configuring and connecting to the cluster not through the SWAN interface but from an external Python script/notebook, feeding the SparkContext as written at the very beginning of step 8. (And afterwards maybe one can use papermill to run the notebooks?) But this is another step.

Thank you very much for helping,
Best regards,
Brian

@vbrian if you need more initial computing power than 3 executors (3x3 = 9 cores), you can also play with parameters, e.g.

  • spark.dynamicAllocation.maxExecutors=10 (specify some cap)
  • spark.dynamicAllocation.minExecutors=6 (which would be good for e.g. 18 partitions)
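For illustration, a hypothetical way to set these options when creating the SparkContext yourself (in a SWAN session they would normally go into the Spark connector's configuration instead):

    # Sketch: setting dynamic-allocation limits via SparkConf outside SWAN.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.maxExecutors", "10")   # cap
            .set("spark.dynamicAllocation.minExecutors", "6"))   # good for ~18 partitions

    sc = SparkContext(conf=conf)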