XRootD Authentication Issues When Using HTCondor Cluster

Dear Experts,

I am currently using pocket-coffea with coffea 0.7.25 on software stack 107 for my analysis. When running locally, without the HTCondor cluster, everything works correctly: I can access remote files over XRootD from all sites without issues.

However, when I attempt to run the analysis on all datasets using an HTCondor cluster with the command:

pocket-coffea run --cfg config.py -e dask@swan --sched-url tls://10.100.184.85:30169 -o output

I consistently encounter XRootD errors such as:

  1. Authentication failure:

OSError: XRootD error: [FATAL] Auth failed: No protocols left to try in file root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/VLTquarkToPhotonTop_M-1000_PtG-10_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v4/2810000/E68E9489-38BC-2C49-B0CA-9E1B481A40FD.root

  2. Operation expiration:

OSError: XRootD error: [ERROR] Operation expired in file root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/VLTquarkToPhotonTop_M-600_PtG-10_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v4/2560000/1CB86E3A-E658-2843-BFA6-FCF32810E820.root

  3. Redirect limit reached (when using redirectors):

OSError: XRootD error: [FATAL] Redirect limit has been reached in file root://xrootd-cms.infn.it///store/mc/RunIISummer20UL18NanoAODv9/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/280000/63F9BC2F-DBED-4347-B031-19775711FB8D.root

I have tried different sites and redirectors, but there is always at least one file that fails with similar errors. Notably:

  • This is not a new issue; I have experienced it since I first began using SWAN
  • With small datasets the cluster used to work fine, but today no files can be accessed at all
  • The problem occurs consistently when processing all data, with different files failing in different runs
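For reference, the failure can be reproduced with a minimal snippet directly from the notebook. This is only a sketch: it assumes a dask.distributed Client can connect to the same scheduler as in the run command above with the TLS settings SWAN already configures, and it uses the first failing file from the errors:

import uproot
from dask.distributed import Client

# One of the files that fails on the workers (copied from the first error above)
URL = ("root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/"
       "VLTquarkToPhotonTop_M-1000_PtG-10_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/"
       "106X_upgrade2018_realistic_v16_L1v1-v4/2810000/E68E9489-38BC-2C49-B0CA-9E1B481A40FD.root")

def count_events(url):
    # Open the remote file over XRootD and return the event count, so the read
    # happens with the credentials available on whichever machine runs this.
    with uproot.open(url) as f:
        return f["Events"].num_entries

print(count_events(URL))  # works when executed locally in the notebook

client = Client("tls://10.100.184.85:30169")  # scheduler address from the run command above
print(client.submit(count_events, URL).result())  # raises the XRootD OSError when run on a worker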

Previously, I worked around this with LXPLUS, NanoAODTools, and CRAB: I first skimmed the data and saved it to my /eos/ directory, then ran the analysis locally without HTCondor. However, this approach no longer works, and I now need to use HTCondor and access the datasets remotely.

I would really appreciate any assistance in resolving this issue.

Thank you in advance for your help.

Dear Mohammad,

You can skim your data and store the skimmed output on your EOS using Pocket-Coffea. You can find the documentation at [1].

Also, there is a Mattermost channel for Pocket-Coffea users that you can join.
[1] Configuration — PocketCoffea


Hello Vahid,

Thank you for your response. I am already using pocket-coffea, but my issue is not related to skimming. Instead, I am encountering problems when reading data through HTCondor. When processing a large number of ROOT files (for example, one year of CMS data), some files fail to be read properly, causing the execution to crash. Enabling skip_bad_files does not resolve the issue, as recently all jobs have been failing.

As far as I know, job failures are common when running on a cluster. I faced a similar issue when using CRAB to analyze large datasets, but CRAB includes a utility for resubmitting failed jobs. Implementing such a feature in the Dask-based coffea execution might be challenging, since the outputs of all jobs must be merged into a single result, so a single job failure can crash the entire process. However, this functionality would be extremely useful for the SWAN HTCondor clusters, especially for skimming tasks, where jobs are handled independently.

Thank you again for your assistance.

Hi Mohammad,

Indeed, the automatic handling of dataset locations is still not perfect in coffea-based tools. You can use the CMS redirector path for files, but in our experience that is more unstable than getting the file locations directly with Rucio. I recommend refreshing the dataset location file just before running your jobs, to make sure that the files are actually available. Have a look at Datasets handling — PocketCoffea for further guidance on how to filter the dataset sources.
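As a sketch (the exact flags may differ between PocketCoffea versions, so please check the page above), the dataset file can be regenerated right before submission, restricting the replicas to a whitelist of sites via a regex, with something like:

pocket-coffea build-datasets --cfg datasets_definitions.json -o -rs 'T[123]_(CH|IT|DE|FR|US)_\w+'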

In general, to improve the user experience with skimming, we recommend avoiding Dask for the skimming process and using direct HTCondor submission instead; you can find a tutorial here:

Best,
Davide


Hi Davide,

Thank you for your guidance. I’ll follow your recommendations. Much appreciated!

Best,
Mohammad