
Handling a heavy ROOT tree file as a dataframe in Jupyter Notebook

Dear All,

Is there a common or general way to handle heavy data in the notebook?

I’m trying to load a ~500 MB (or larger) ROOT file containing a tree that I’ve produced with MC.
Using uproot, the tree is loaded as a pandas DataFrame (n rows × 25 columns from a TNtupleD).
When I check memory usage with the `top` command in the web terminal, the notebook used up to 10 GB of memory when it crashed. (I had configured my session to use 10 GB of memory.)

I can handle a smaller data file by limiting the size of the MC output dataset, but I suspect there is a better way, so that I can work with larger and larger datasets.

Regards,
Bong-Hwi

Hi Bong-Hwi,

I see two options:

  1. You use RDataFrame to filter your dataset before converting it to pandas, so that it fits in memory. See this tutorial:
    https://root.cern/doc/master/df026__AsNumpyArrays_8py.html

  2. You connect to https://swan006.cern.ch where you can create a session with 16 GB of memory.

Enric

Hi @blim,

If you don’t need access to all rows at the same time, you can use uproot’s `iterate` function to read your tree in chunks and get a pandas DataFrame for each chunk (with additional options, such as branch filtering).

Cheers, Baptiste