SWAN for physics analysis

etejedor · June 21, 2022, 9:52am

Dear SWAN users,

The SWAN team is looking for users who are currently using SWAN to do physics analysis and can provide feedback about their experience – what works well, what doesn’t, what is missing, etc. If you are one of those, please reply to this message or send me a PM!

Also on this topic, we are currently working closely with the Batch service at CERN to integrate SWAN with HTCondor resources for both batch and interactive analysis (the latter via Dask or any framework that can use Dask underneath, such as RDataFrame or coffea). More news on this will be announced soon!

Best,

Enric

alobanov · June 21, 2022, 12:20pm

Hi. Not sure if this qualifies as physics analysis, but since I just saw this message:

I’m looking into some first Run 3 collisions for prompt feedback using CMS MiniAOD through XRootD and uproot (RDF fails to load the CMSSW files even though the data types are not custom!).
And I must say that even though I’m in exploratory mode, the loading of the individual branches over XRootD is very, very slow. Kudos that XRootD works at all though!

Not sure if that is solvable at all, but I like using notebooks, particularly in the exploratory phase of any kind of analysis, and this slow file access is quite annoying.

When working with EOS there is of course no problem at all! (I just hit the memory limit quite often when working on many files.)
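For readers hitting the same memory limit, one common mitigation is to read only the branches that are needed and to process them in bounded batches. The following is a minimal sketch using `uproot.iterate`, not something taken from this thread: the function names, the tree name, the branch name and the threshold are all placeholders, and it assumes the branch is a flat numeric one.

```python
def files_to_tree(paths, tree_name):
    """Map every file path to the TTree to read from it (the dict form uproot accepts)."""
    return {path: tree_name for path in paths}


def count_above(paths, tree_name, branch, threshold, step_size="100 MB"):
    """Count entries with branch > threshold, holding one batch in memory at a time.

    Assumption: `branch` is a flat numeric branch; jagged branches need awkward arrays.
    """
    import uproot  # lazy import: only needed when files are actually read

    n_pass = 0
    for batch in uproot.iterate(
        files_to_tree(paths, tree_name),
        [branch],             # read just this branch, not the whole tree
        step_size=step_size,  # upper bound on the size of each batch
        library="np",         # each batch is a dict of NumPy arrays
    ):
        n_pass += int((batch[branch] > threshold).sum())
    return n_pass
```

Because each batch is discarded before the next one is read, peak memory should depend on `step_size` rather than on the number of files. This does not make remote reads faster, it only keeps the session within its memory limit.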

etejedor · July 5, 2022, 11:35am

Hello Artur,

Thank you for your reply, and sorry for my late answer!

RDF fails to load the CMSSW files even though the data types are not custom!

Could we follow this up in the ROOT forum? We’d be interested in knowing what the problem is (I am speaking with my ROOT hat on now).

And I must say that even though I’m in exploratory mode the loading of the individual branches over XrootD is very, very slow.

I see. Would you say that XRootD access to ROOT files is slower in SWAN than on, say, lxplus? If this is the case, we need to find out why.

When working with EOS there is of course no problem at all! (just hitting the memory limit quite often when working on many files).

Good. You might know this already, but you can increase the memory limit of your session in the web form when starting the session. Also, with RDataFrame you shouldn’t have memory problems (unless you run something like AsNumpy and the result does not fit into memory).
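To illustrate the point about memory, here is a minimal sketch of booking a histogram with RDataFrame instead of materialising columns with AsNumpy. The tree name `Events`, the column `pt`, the cut and the binning are hypothetical placeholders, and the column is assumed to be a scalar one.

```python
def book_histogram(df, column, cut, nbins=50, low=0.0, high=200.0):
    """Book a filtered 1D histogram lazily; no event loop runs until the result is used."""
    model = (f"h_{column}", f"{column};{column};Events", nbins, low, high)
    return df.Filter(cut).Histo1D(model, column)


def main(files, tree_name="Events", column="pt", cut="pt > 20"):
    import ROOT  # lazy import so book_histogram can be exercised without ROOT

    df = ROOT.RDataFrame(tree_name, files)
    hist = book_histogram(df, column, cut)
    # Accessing the result triggers one pass over the data; only the
    # histogram is kept in memory, not the column values themselves.
    print("selected entries:", hist.GetEntries())
```

Only the filled histogram is returned to Python, which is why this pattern should stay within the session memory even for many files, whereas `AsNumpy` has to hold every selected value at once.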

Would you by any chance be interested in trying distributed analysis? We are integrating SWAN with Dask and HTCondor, and both RDataFrame and coffea can work on top of Dask.

alobanov · July 8, 2022, 12:52pm

Hi Enric,

I’ve actually never really used RDF, so I can’t guarantee my experience with it is useful. I’ll report any issues to ROOT if I encounter them again.

I haven’t tried accessing the XrootD files from lxplus since I don’t like using lxplus for such exploratory things.

As for distributed analysis: is this really a good fit for exploratory studies? I thought of it more for production-level analysis and as an alternative to the batch system (HTCondor).

etejedor · July 11, 2022, 8:45am

Hi Artur,

I haven’t tried accessing the XrootD files from lxplus since I don’t like using lxplus for such exploratory things.

OK, so you are comparing SWAN with your own machine? Are you based at CERN or elsewhere? I understand you find XRootD access consistently slow(er) in SWAN and not just sporadically? If this is the case, I’d be very much interested in investigating this.

As for distributed analysis: is this really a good fit for exploratory studies? I thought of it more for production-level analysis and as an alternative to the batch system (HTCondor).

It fits anything you can’t run interactively on your own machine because it would take too long. Perhaps now your exploratory work can be done in a local SWAN session and you don’t need any offloading, but we are also thinking of what will be necessary in a few years with more data. Anyway, if you are interested in being an early user of this, just let me know.
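As a rough illustration of what offloading an RDataFrame computation to Dask can look like, here is a minimal sketch. It assumes a ROOT release that ships the experimental distributed RDataFrame module under `ROOT.RDF.Experimental.Distributed.Dask` (the entry point may differ between ROOT releases), and it uses a local Dask cluster as a stand-in for HTCondor-backed workers. The file names, tree name and column name are placeholders.

```python
def make_dask_rdf(tree_name, files, client, npartitions=None):
    """Create a distributed RDataFrame whose event loop runs on the given Dask client."""
    import ROOT  # lazy import; needs a ROOT build with distributed RDataFrame

    DaskRDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
    options = {"daskclient": client}
    if npartitions is not None:
        options["npartitions"] = npartitions  # number of chunks the dataset is split into
    return DaskRDataFrame(tree_name, files, **options)


def main():
    from dask.distributed import Client, LocalCluster

    # Assumption: a local cluster stands in for remote (e.g. HTCondor) workers
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
    client = Client(cluster)

    # Placeholder inputs: replace with real files, tree and column
    df = make_dask_rdf("Events", ["file1.root", "file2.root"], client, npartitions=2)
    hist = df.Histo1D(("h", "h;x;Events", 50, 0.0, 100.0), "x")
    print("entries:", hist.GetEntries())  # triggers the distributed event loop
```

The analysis code after `make_dask_rdf` is the same as for a local RDataFrame, so an exploratory notebook can in principle switch between local and distributed execution by changing only how the dataframe is constructed.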

alobanov · July 11, 2022, 9:04am

No, only in SWAN, and I have no way to compare to anything else. In general, I have not used non-EOS-based files for a very long time. Will get back to you if that reappears.

That’s definitely interesting! If you don’t need fast feedback, I’d be interested in beta testing.

alobanov · July 25, 2022, 8:15pm

Seems that now I should get interested in learning Dask:
uproot.lazy will be deprecated in favour of uproot.dask
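For context, a minimal sketch of how `uproot.dask` can be used is given below. It assumes uproot 5 with Dask installed; the helper name, the tree name and the branch name are placeholders, and the branch is assumed to be flat and numeric so that `library="np"` yields Dask arrays.

```python
def lazy_mean(paths, tree_name, branch):
    """Compute the mean of one branch lazily with Dask; data is read only on compute()."""
    import uproot  # lazy import; uproot.dask is available in uproot 5

    files = {path: tree_name for path in paths}  # file path -> TTree name
    # With library="np", uproot.dask returns a dict of Dask arrays keyed by branch name
    columns = uproot.dask(files, filter_name=[branch], library="np")
    return columns[branch].mean().compute()
```

Nothing is read from the files until `compute()` is called, and when a Dask distributed client is active the same call is scheduled on its workers.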

etejedor · July 25, 2022, 8:35pm

And SWAN has everything you need to do that.

alobanov · August 10, 2022, 8:05am

Hi @etejedor, please point me to the Dask-enabled SWAN installation whenever you have it running! Thanks

etejedor · August 24, 2022, 9:03am

Sure, will do!

algomez · March 8, 2023, 9:41pm

Hi @etejedor

Not sure if this topic is too old, but I am currently trying to use SWAN with CMS OpenData to replicate a full CMS analysis. Let me know if you still need feedback.

In the meantime, I ended up in this topic because I am using coffea and I want to use Dask, or whatever other tool is available, to try to run over multiple datasets in SWAN. Do you have any examples of this?

cheers,

etejedor · March 9, 2023, 5:40am

Hello Alejandro,

Thank you for the offer!

We are about to integrate into SWAN the last pieces of the support for Dask and HTCondor. More news coming soon.

Cheers,
Enric

nakolkar · August 9, 2023, 8:17am

Hello @etejedor,

I have been trying to integrate SWAN into my ATLAS analysis. The plan is to use distributed RDataFrame to read multiple large ROOT files. As far as SWAN is concerned, everything works well, especially reading from and writing to CERNBox. The notebook interface also makes debugging much easier.

The problem arises when using the Spark cluster. Running Spark locally works, but as soon as I connect to the analytix cluster, I get various errors of type Py4JJavaError. This happens especially when reading ROOT files. I tested with Parquet files and things work there.

Kudos to the team for developing SWAN and integrating clusters, it’s a huge help! I just wish guidelines on working with ROOT files in the Spark cluster were provided.

Cheers,
Nilima.

etejedor · August 9, 2023, 11:17am

Hello Nilima,

I’m happy to hear SWAN is useful for you!

Regarding the error you see when running Spark on analytix, could you please open a ticket here:

https://cern.service-now.com/service-portal/?id=functional_element&name=swan

and ideally share a reproducer of the error you see (i.e. the RDataFrame code you are running).

I’m also pinging @vpadulan from the ROOT team about this.

Best,

Enric

nakolkar · August 9, 2023, 12:28pm

Hello Enric,

I have created the ticket as you suggested. I am looking forward to the solution/discussion!

Cheers,
Nilima