Speed of accessing files on EOS

jpucek · July 26, 2022, 2:44pm

Hello everyone,

I am using SWAN to analyze event data stored in .h5 files. After the analysis, my code saves some variables to my CERNBox. Everything works well; however, since I am accessing thousands of files, access speed is crucial. I noticed that the first 50 files open very quickly and then the whole process slows down. Is there a way to keep my code fast the whole way through?

import h5py
from os import listdir
import numpy as np
import math
import datetime as dt
import csv

#Define which days I want to look at
startDate = dt.date(2022,6,10)
endDate = dt.date(2022,6,20)

nextDay = dt.timedelta(days=1)

#Prepare the csv to save analysis data
csvfile = open('Test_Report.csv', 'w', newline='')
writer = csv.writer(csvfile) #This is the writer into csv


data = ['Date', 'LaserEnergy'] # This is the header; two columns so data[1] can be filled in below
writer.writerow(data)

for nDays in range((endDate-startDate).days+1):#cycle through days
    folderDate = startDate+nDays*nextDay
    folder = '/eos/experiment/awake/event_data/' + folderDate.strftime('%Y/%m/%d') #Setting the folder I want to go through
    
    try: #just make sure that we can open the folder and find files
        files = listdir(folder)
        print(folderDate.strftime("%d/%m/%Y"), len(files))
    except OSError: # folder missing or not readable
        print('Folder:'+folder+' could not be opened')
        files = []
        
    count = 0 #Stupid way to count how many events there are each day
    for nFile in files: # cycle through all files in the folder
        f = h5py.File(folder+'/'+nFile,'r') #opening h5 file
        print(f'\r {count}', end='\r')
                    
        #Some more fields are here, not important for this minimal example
        try:
            LaserEnergy = f['AwakeEventData/EMETER04/Acq/value'][0]
        except Exception: # dataset missing or unreadable; still lets Ctrl-C interrupt the loop
            LaserEnergy = -1
        
        
        data[0] = folderDate.strftime("%d/%m/%Y")
        data[1] = LaserEnergy
        
        writer.writerow(data)
        f.close()
        count = count+1
        
csvfile.close()
print('DONE!')

Thank you very much,
Jan

dalvesde · July 28, 2022, 7:18am

Hi Jan,

What part of the code slows down?

f = h5py.File(folder+'/'+nFile,'r') #opening h5 file

The opening of the files in here ^ ?
Do you know how h5py works internally?

I ask this because we’ve checked and there were no stalls in EOS for your account. And if there were, these would’ve been in namespace/metadata operations (like the part where you list files), not in opening the files.
But we also saw that just in the month of June we’re talking about 1.8TB of data, so, if your code is actually pulling the full files, I would say it’s perfectly normal that it takes some time.

Did you compare the size of the files which are quicker to load?
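
One way to check both questions at once is to time only the open call for every file and record the file size next to it. The following is a rough sketch, not a tested recipe for EOS: the function name `profile_file_access` and the `open_file` parameter are made up for illustration, and the default opener is Python's built-in `open` so that the snippet runs without h5py.

```python
import os
import time


def profile_file_access(paths, open_file=open):
    """Return (path, size_in_bytes, seconds_to_open) for each path.

    `open_file` is a hypothetical hook: any callable that takes a path
    and returns an object with a .close() method.
    """
    records = []
    for path in paths:
        size = os.path.getsize(path)       # metadata lookup only
        start = time.perf_counter()
        handle = open_file(path)           # the step suspected of slowing down
        elapsed = time.perf_counter() - start
        handle.close()
        records.append((path, size, elapsed))
    return records
```

For the HDF5 case, something like `profile_file_access(full_paths, lambda p: h5py.File(p, 'r'))` should work, where `full_paths` is the list of folder + '/' + file name strings from the loop. If the open times stay flat while the loop as a whole slows down, the time is probably going somewhere else, for example into reading the datasets.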

Cheers,
Diogo

jpucek · July 28, 2022, 10:27am

Hello Diogo,

I had the impression that the line that opens the file was slowing down, but today I ran on a different cluster and it seems to be stable and a lot faster.
I have no idea how h5py works internally. I asked because I remember seeing an email at some point saying that SWAN might have a time limit on opening files, but I did not really understand what should be done, so I just wanted to make sure that everything is fine.

Thank you for your reply,
Jan