
Processing of a 36 GB file -- Kernel Restarting issue

Hello,

After trimming and pre-processing my original dataset, I ended up with a CSV file of around 36 GB that I need to use in my analysis. The problem is that just trying to load this file in a notebook to process it causes a kernel restart.

Any suggestions on how I should proceed? Currently I’m thinking of putting the file on HDFS and doing the processing with Spark, but are there any alternatives?

Thanks,
George

What is your use case for the data in the CSV? Then we can see whether Spark makes sense for you or not.

Spark splits your CSV into many partitions (the file itself is stored in blocks on HDFS) and processes them in parallel on executors, so no single process has to hold the whole file in memory. You can do something similar in the notebook by reading the file in chunks and processing them sequentially.

Hi @gekaklam

You can certainly put it in HDFS and process it with Spark. I would also like to understand what you are using to ‘load the file in the notebook’. Is it pandas? If so, it has a way to specify a chunk size:

pd.read_csv(filename, chunksize=chunksize)
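With `chunksize` set, `read_csv` returns an iterator of DataFrames instead of one large DataFrame, so each chunk has to be reduced to something small before moving on. A minimal sketch of that pattern follows; the helper name `mean_diff_in_chunks` and the tiny in-memory demo data are illustrative assumptions, not part of the original dataset.

```python
import io

import pandas as pd


def mean_diff_in_chunks(source, col_a, col_b, chunksize=1_000_000, **read_csv_kwargs):
    """Return (row_count, mean of col_b - col_a), streaming the CSV chunk by chunk."""
    total_rows = 0
    total_diff = 0.0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows,
    # so only one chunk has to fit in memory at a time.
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs):
        diff = (chunk[col_b] - chunk[col_a]).dropna()
        total_rows += len(diff)
        total_diff += float(diff.sum())
    mean = total_diff / total_rows if total_rows else float("nan")
    return total_rows, mean


# Tiny in-memory stand-in for the real file (hypothetical values).
demo = io.StringIO("1,10,12\n2,20,25\n3,30,31\n4,40,48\n5,50,53\n")
rows, mean_diff = mean_diff_in_chunks(
    demo, "ts1", "ts2", chunksize=2, header=None, names=["id", "ts1", "ts2"]
)
```

Note that `chunksize` is a number of rows, not bytes, so a suitable value depends on how wide the rows are and how much memory the notebook has.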

regards,
Prasanth

My file format is the following:
nsfileid,ts1,ts2,ts3,ts4,ts5,ts6,ts7,ts8s,ts8e,ts9s,ts9e,reads_started,reads_completed,tpvid,h1,h2,h3,h4,h5

All ts* values are Unix timestamps. What I want are boxplots (or, worst case, histograms) of the differences between these timestamp values, all on the same plot. Specifically, I want 6 boxplots for ts2-ts1, ts3-ts2, etc.

@pkothuri Yes, the command I’m using, which “freezes”, is:
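For the histogram fallback, one possible approach is to accumulate counts per chunk against fixed bin edges, so the full set of differences never has to be held in memory. The sketch below assumes the column layout listed above and that only ts1 to ts7 are needed for the six differences; the function name `accumulate_diff_histograms`, the bin edges, and the demo rows are hypothetical.

```python
import io

import numpy as np
import pandas as pd

COLUMN_NAMES = (
    "nsfileid,ts1,ts2,ts3,ts4,ts5,ts6,ts7,ts8s,ts8e,ts9s,ts9e,"
    "reads_started,reads_completed,tpvid,h1,h2,h3,h4,h5"
).split(",")
TS_COLS = ["ts1", "ts2", "ts3", "ts4", "ts5", "ts6", "ts7"]


def accumulate_diff_histograms(source, bin_edges, chunksize=1_000_000):
    """Histogram ts2-ts1, ts3-ts2, ... over the whole file, one chunk at a time."""
    pairs = list(zip(TS_COLS[:-1], TS_COLS[1:]))
    counts = {f"{b}-{a}": np.zeros(len(bin_edges) - 1, dtype=np.int64) for a, b in pairs}
    reader = pd.read_csv(
        source,
        header=None,
        names=COLUMN_NAMES,
        usecols=TS_COLS,  # skip the columns that are not needed for the differences
        chunksize=chunksize,
    )
    for chunk in reader:
        for a, b in pairs:
            diff = (chunk[b] - chunk[a]).dropna().to_numpy()
            # Values outside bin_edges are silently left out by np.histogram.
            hist, _ = np.histogram(diff, bins=bin_edges)
            counts[f"{b}-{a}"] += hist
    return counts


# Hypothetical demo: three rows whose consecutive timestamps differ by 1, 2 and 3.
demo = pd.DataFrame(0, index=range(3), columns=COLUMN_NAMES)
for k, col in enumerate(TS_COLS):
    demo[col] = [1000 + k * step for step in (1, 2, 3)]
buf = io.StringIO(demo.to_csv(header=False, index=False))

counts = accumulate_diff_histograms(buf, bin_edges=np.array([0, 2, 4]), chunksize=2)
```

The bin edges have to be chosen up front, so this only works well if a plausible range for the differences is known or estimated from a first pass over a sample.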

dset = pd.read_csv(filename, header=None, names=column_names, keep_default_na=False, na_values="0") 

I will try the chunksize option. Regarding the boxplots, my understanding is that to do that with modern libraries I’d need to convert my dataset to long format and also add another grouping feature, but I haven’t reached that point yet.
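The long-format conversion could look roughly like the sketch below, where the grouping feature is the name of the interval. It assumes it is applied to a chunk or sample that fits in memory; the function name `diffs_to_long` and the column names `interval` and `delta` are made up for illustration.

```python
import pandas as pd


def diffs_to_long(df, ts_cols):
    """Compute consecutive differences and reshape them to long format."""
    pairs = list(zip(ts_cols[:-1], ts_cols[1:]))
    # Wide format: one column per interval, e.g. "ts2-ts1".
    wide = pd.DataFrame({f"{b}-{a}": df[b] - df[a] for a, b in pairs})
    # Long format: one row per (interval, delta) pair; "interval" is the grouping column.
    long_df = wide.melt(var_name="interval", value_name="delta")
    return long_df.dropna(subset=["delta"])


# Hypothetical two-row sample with three timestamp columns.
sample = pd.DataFrame({"ts1": [100, 200], "ts2": [101, 205], "ts3": [104, 206]})
long_df = diffs_to_long(sample, ["ts1", "ts2", "ts3"])
```

A frame in this shape can then be grouped by `interval` for plotting, for example with `seaborn.boxplot(data=long_df, x="interval", y="delta")`, if seaborn is available in the notebook.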