How to save data to file with BE NXCals cluster?

dguerrei · November 29, 2020, 10:21am

Hi all,

For quicker access to data I have previously queried from the NXCALS DB, I would like to save my pandas dataframe as a parquet file.

I looked around but could not find any documentation on how to do this, especially on how to define the file path. I guess I should save the file in my personal space on CERNBox or HDFS, but I am not sure which is the better choice, so some advice here is also appreciated. For your information, I tried to browse HDFS via the small elephant button, but I get the following error:

“HDFS Browser not available, no active hdfs namenode”

Currently I’m using the SW stack NXCals Python 3 and the cluster BE NXCALS.

Thanks
Diogo

pkothuri · November 30, 2020, 9:43am

Dear Diogo

You can save a Spark dataframe as a parquet file as follows:

df.write.parquet("hdfs://nxcals/user/{username}/{folder_name}")

If what you have is a pandas dataframe, you first need to convert it to a Spark dataframe and then save it:

df = spark.createDataFrame(pdf)
df.write.parquet("hdfs://nxcals/user/{username}/{folder_name}")

Please note that Spark creates the folder_name directory itself. Also, thanks for reporting the issue with the HDFS browser; we will investigate and fix it soon.
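For reference, the two steps above, plus reading the data back later, could be wrapped up roughly as follows. This is only a sketch: the helper names nxcals_parquet_path and save_and_reload are made up for illustration, it assumes that `spark` is the session already available in the notebook, and it assumes that your HDFS user name matches your local login name (pass username explicitly if it does not).

```python
import getpass


def nxcals_parquet_path(folder_name, username=None):
    """Build the target path in the form hdfs://nxcals/user/{username}/{folder_name}.

    Hypothetical helper. Falling back to the local login name is an
    assumption; pass username explicitly if your HDFS user differs.
    """
    user = username or getpass.getuser()
    return f"hdfs://nxcals/user/{user}/{folder_name}"


def save_and_reload(spark, pdf, folder_name, username=None):
    """Save a pandas dataframe as parquet on HDFS and read it back.

    Hypothetical helper; `spark` is an existing SparkSession and
    `pdf` is a pandas dataframe.
    """
    path = nxcals_parquet_path(folder_name, username)

    # Convert the pandas dataframe to a Spark dataframe before writing.
    sdf = spark.createDataFrame(pdf)

    # Spark creates the folder itself. mode("overwrite") replaces any
    # existing content at that path; without it the write fails if the
    # folder already exists.
    sdf.write.mode("overwrite").parquet(path)

    # In a later session, this read is all that is needed to get the data back.
    return spark.read.parquet(path).toPandas()
```

In a notebook this would be called as, for example, save_and_reload(spark, pdf, "my_cache"), where "my_cache" is a placeholder folder name. Drop the mode("overwrite") call if you would rather get an error than silently replace data saved earlier.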

Best Regards,
Prasanth