HTCondor and coffea

Hello!

I’m a newbie to SWAN, and I’m also new to coffea and HTCondor. I’m trying to analyze some Run 3 files on the cluster using a CMS analysis (event selection). Are there any good resources out there for learning how to run Dask jobs on HTCondor through the SWAN interface? I have gone through the process of initialising the cluster and scaling it, but I’ve had a few hiccups while running the analysis, so I thought it might help to take a step back and look at any tutorials or guides that might be available. I’d really appreciate any help you can provide. Thanks!

Cheers,

AG

Dear Amish,

There are some docs here:

Cheers,

Enric

Hey @etejedor ,

Thanks for the reply. Yes, I did go through these to understand the steps for initialising and scaling the cluster. What I am not sure about is how exactly to run code on the cluster and ensure that it is not running locally.

Let me describe my problem in a bit more detail. I have two files that are supposed to run on the cluster. The first creates a fileset JSON and accumulates all the files from DAS. The code I used for this is below:

import os
import lzma

# client and Data_ddc are defined earlier (the Dask client and the dataset helper)
outputname = os.path.join("filesets", "EGamma0_Data_nanov15_fileset_Era_[B-J]")

with client:
    Data_ddc.do_preprocess(output_file=outputname,
                           file_exceptions=(OSError, lzma.LZMAError, DeserializationError))

This code ran successfully on the cluster.

The second file is the selection code, and this is where I run into the problem: I cannot tell whether the analysis is running on the cluster or not. The code I have been guided to use is below:

with ProgressBar():
    with gzip.open(FILESET_DATA_LOC, "rt") as file:
        fileset_full = json.load(file)

    # fileset_cleaned = filter_files(fileset_full, lambda x: not x[0].startswith("root://maite"))
    # fileset_cleaned = filter_files(fileset_full)
    fileset_ready = max_files(fileset_full, 10)  # test with 10 files; use None for all
    # fileset_ready = change_sources(fileset_ready)

    analyze_func = partial(createSingleElectronCR, delayed=True, isMC=False)
    outputs_elecCR, reports_elecCR = apply_to_fileset(analyze_func, fileset_ready,
                                                      uproot_options=UPROOT_OPTIONS)
    # outputs_elecCR, reports_elecCR = apply_to_fileset(analyze_func, fileset_full, uproot_options=UPROOT_OPTIONS)
    # print(events.fields)
    print("Applying Selections...")
    # print("Selections: OBJECT PAIR REQUIREMENT --> delta_R + Opp sign + Vis Z mass #All files ")
    t0 = time.time()
    coutputs_elecCR, creports_elecCR = dask.compute(outputs_elecCR, reports_elecCR)
    print("Done. Time taken:", time.time() - t0)

The code here takes a very long time to run (I suspect it is running locally), and even if it is running on the cluster, I don’t see the job with condor_q. Hence I would like to better understand the process of running code on the cluster through SWAN.
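One way to check where the work actually runs is to ask the workers themselves. This is a hedged sketch, using a `LocalCluster` as a stand-in for the HTCondor cluster (with SWAN you would keep your `tls://` scheduler address); on a real batch cluster the returned hostnames would be batch nodes, not your SWAN session’s host:

```python
import socket

import dask
from dask.distributed import Client, LocalCluster

# Stand-in for the HTCondor cluster; on SWAN you would instead do
# client = Client("tls://<scheduler-address>:<port>")
cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

# Ask every connected worker for its hostname. On HTCondor these would
# be batch-node names; if you only see your own host, you are local.
hosts = client.run(socket.gethostname)
print(hosts)

# Submitting through client.compute guarantees the distributed scheduler
# is used (rather than silently falling back to the local threaded one).
total = dask.delayed(sum)([1, 2, 3])
result = client.compute(total).result()

client.close()
cluster.close()
```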

Also, to be clear, I did run the following code before executing the main code in each of the files:

from dask.distributed import Client

client = Client("tls://10.100.106.244:32694")
client
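As a quick sanity check after connecting, `client.scheduler_info()` reports the workers the scheduler knows about; if it is empty, the HTCondor workers have not joined yet and any compute will stall or run locally. A minimal sketch, using an in-process client as a stand-in for the `tls://` address:

```python
from dask.distributed import Client

# Stand-in; on SWAN this would be Client("tls://10.100.106.244:32694")
client = Client(processes=False, dashboard_address=None)

info = client.scheduler_info()
n_workers = len(info["workers"])
print(f"{n_workers} worker(s) connected")
for addr, worker in info["workers"].items():
    print(addr, worker.get("host"))

client.close()
```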

I understand that my problem is a bit specific so let me know if you need any more details.

Thanks,

AG

Hello,

Did you run kinit before submitting any jobs? If you follow the instructions above, your jobs will be submitted to HTCondor and you should see them with condor_q.

Yes, I did run kinit first. Another interesting thing to note: when I ran the fileset code, the following was printed first:

/cvmfs/sft.cern.ch/lcg/views/LCG_107_swan/x86_64-el9-gcc13-opt/lib/python3.11/site-packages/distributed/node.py:187: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42345 instead
  warnings.warn(

which I assumed indicated that the job was running on the cluster, but when I ran the second file no such message was printed.

Also, is there a need to write the processing code within `with client:`, or once the client is initialised does all subsequent code automatically run on the cluster?

Thanks,

AG

I would not use the client via a context manager, to prevent it from being closed (deleted) when the block exits.
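To expand on that: creating a distributed `Client` registers it as Dask’s default scheduler, so subsequent `dask.compute(...)` calls go to the cluster without a `with client:` block; the `with` form closes the client when the block exits, after which computations fall back to the local scheduler. A small sketch with an in-process client:

```python
import dask
from dask.distributed import Client

# Once created, the client becomes the default scheduler for dask.compute.
client = Client(processes=False, dashboard_address=None)

total = dask.delayed(sum)([1, 2, 3, 4])
(result,) = dask.compute(total)  # runs on the client's workers, no 'with' needed
print(result)

# Closing the client (which a 'with' block does implicitly on exit)
# means later computations fall back to the local scheduler.
client.close()
```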

But most importantly, we should figure out why your jobs do not show up. Can you open a ticket with SWAN support?

Hey @etejedor

I have raised a ticket with SWAN support.

About my issue: after a few tries and waiting for a very long time, my jobs did get submitted to the cluster.

I got my code verified by my supervisor; they said it was fine and were also unsure why the jobs were not being submitted to the cluster.

The steps I followed were:

  1. Log in to SWAN and initialise the cluster.
  2. Run kinit and voms-proxy-init (for HTCondor and XRootD).
  3. Scale the HTCondor cluster.
  4. Reconnect to SWAN.

These steps seem to work, but it took a very long time for the jobs to actually show up in condor_q and in the various graphical interfaces that come with SWAN.

After running some jobs, the cluster completely stopped processing them (I had run the code for about 8 hours on 200 files). I guess this is due to being disconnected from SWAN after a timeout.

I have now switched to submitting my condor jobs directly from LXPLUS.

Thanks for all the help.

Cheers,

AG