Dear SWAN developers,
First of all, thank you for designing and developing SWAN.
I use this tool daily, but I still feel that I do not know how to use most of its functionality, especially the Spark interface. Most of my code is written in PyROOT (soon using RDataFrame), and I would like to ask how Spark can be interfaced with ROOT's RDataFrame.
I saw quite a number of slides related to the following work:
I then tried to install it in my workspace, which seems to work, but only with the PyRDF.use('local') configuration. When I connect my notebook to analytix, the PyRDF 'spark' option still does not work (which might actually be expected). Is PyRDF officially in use, or is it only exploratory work?
So here I come with a design question. What I would like is a framework that combines Spark, ROOT, and data frames, both for a classical analysis (mostly cuts and histogramming, presumably with PyROOT) and for a machine-learning analysis (using PyTorch, MLflow, etc., preferably not TMVA), with all of this running fast thanks to Spark.
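To make the classical part concrete, here is a minimal sketch of the kind of code I have in mind. It assumes that PyRDF exposes PyRDF.use(...) and PyRDF.RDataFrame(...) with the same Filter/Histo1D interface as ROOT's RDataFrame; the tree name, file name, and column names in the trailing comment are hypothetical placeholders, not my real data.

```python
def combine_cuts(cuts):
    """Join individual selection strings into one C++ expression for Filter."""
    return " && ".join(f"({c})" for c in cuts)


def book_histogram(df, cuts, column, nbins, xmin, xmax):
    """Apply the cuts and book a 1D histogram on any RDataFrame-like object."""
    selected = df.Filter(combine_cuts(cuts)) if cuts else df
    # Histogram model as a tuple: (name, title, number of bins, low edge, high edge)
    model = (f"h_{column}", column, nbins, xmin, xmax)
    return selected.Histo1D(model, column)


def make_dataframe(treename, filename, backend="local"):
    """Build the data frame with plain ROOT ('root') or with PyRDF (assumed installed)."""
    if backend == "root":
        import ROOT  # imported lazily so the helpers above work without ROOT
        return ROOT.RDataFrame(treename, filename)
    import PyRDF  # assumption: PyRDF is installed in the workspace
    PyRDF.use(backend)  # 'local' works for me; 'spark' is what I would like to use
    return PyRDF.RDataFrame(treename, filename)


# Hypothetical usage (placeholder tree, file, and column names):
# df = make_dataframe("Events", "data.root", backend="spark")
# h = book_histogram(df, ["pt > 20", "abs(eta) < 2.4"], "pt", 50, 0.0, 100.0)
```

The point of the sketch is that the analysis code itself would not change between the local and the Spark backend; only the backend argument would.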
My question is the following: should I move my ROOT data into Spark SQL DataFrames and then work in Spark? On the other hand, ROOT's RDataFrame seems really promising to me, and in particular it makes the histogramming part easier. I am a bit lost among all the possibilities (pandas, Spark DataFrames, ROOT RDataFrame): I would prefer RDataFrame, but I am not sure it is the most valuable choice for a data analyst.
Sorry for this rather long message. I hope my explanations are clear enough and that you can help me with this design issue.
NB, on an unrelated point: I managed to use the papermill module to run multiple notebooks from a single one. Is it possible to centralise and parametrize different notebooks as papermill does, but with Spark available in the notebooks being run? (Not running many of them at the same time, of course, but one at a time, and faster thanks to Spark.)
Thank you very much, and have a nice day! (It is snowing at CERN!)
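For reference, the papermill pattern I mean looks roughly like the sketch below. The notebook names, the output directory, and the parameters are hypothetical; the only papermill call used is execute_notebook(input_path, output_path, parameters=...), and the Spark part is exactly what is missing.

```python
from pathlib import Path


def plan_runs(notebooks, parameters, out_dir="executed"):
    """Return one (input, output, parameters) triple per notebook."""
    out = Path(out_dir)
    return [(str(nb), str(out / Path(nb).name), dict(parameters)) for nb in notebooks]


def run_sequentially(runs, execute=None):
    """Execute the planned notebooks one at a time and return the output paths."""
    if execute is None:
        import papermill as pm  # imported lazily; assumed to be installed
        execute = pm.execute_notebook
    done = []
    for input_path, output_path, params in runs:
        # Make sure the output directory exists before the notebook is written
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        execute(input_path, output_path, parameters=params)
        done.append(output_path)
    return done


# Hypothetical usage (placeholder notebook names and parameters):
# runs = plan_runs(["selection.ipynb", "histograms.ipynb"], {"year": 2018})
# run_sequentially(runs)
```

What I do not know is whether each notebook executed this way can get a Spark connection in the same way as a notebook opened interactively in SWAN.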
Brian Ventura