Hello,
I’m currently using Jupyter notebooks on SWAN with the k8s Spark cluster to process and analyze data (with Python). I’m trying to run inference on a collection of Parquet files, processed as a Spark DataFrame. My goal is to feed the data to a pre-trained TensorFlow model without going through pandas.
In the SparkDLTrigger repository they show some examples using the TFRecord format and I would like to see how it performs.
Is there a way to implement this on the SWAN Spark cluster? How can one add the spark-tensorflow-connector package in the configurations?
If you are aware of other alternatives, feel free to share them.
In the work you mentioned (SparkDLTrigger), we used Spark to convert data into the TFRecord format, which can then be processed natively by TensorFlow.
I understand this is what you want to do too, and I believe it should still work and make sense with the current releases of Spark and TensorFlow.
To enable Spark to write data in the TFRecord format you need to add a package/jar: spark-tensorflow-connector.
In the original work we used the TFRecord connector package from Maven Central; to do that, we set the configuration spark.jars.packages to org.tensorflow:spark-tensorflow-connector_2.11:1.14.0.
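For example, assuming you set Spark options programmatically rather than through the SWAN Spark configuration panel, a minimal sketch would look like this (the input/output paths are illustrative):

```python
from pyspark.sql import SparkSession

# Pull the connector from Maven Central (Scala 2.11 build, for Spark 2.4.x)
spark = (SparkSession.builder
         .appName("tfrecord-conversion")
         .config("spark.jars.packages",
                 "org.tensorflow:spark-tensorflow-connector_2.11:1.14.0")
         .getOrCreate())

# With the connector on the classpath, the DataFrame can be written as TFRecord;
# the connector registers "tfrecord" as a Spark data source short name
df = spark.read.parquet("/path/to/input.parquet")  # illustrative path
(df.write
   .format("tfrecord")
   .option("recordType", "Example")
   .save("/path/to/output_tfrecord"))  # illustrative path
```

Note that on SWAN the same spark.jars.packages setting can also be entered in the Spark connector configuration dialog instead of in code.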
That version only works with Spark 2.4.x (available in LCG97a when using SWAN). More recent LCG releases, including the current LCG102b, ship Spark 3.x, which requires spark-tensorflow-connector_2.12; unfortunately, from what I can see, that build has not yet been pushed to Maven Central by the TensorFlow team. However, the connector is available and can be compiled (using mvn package) from GitHub: see ecosystem/spark/spark-tensorflow-connector at master · tensorflow/ecosystem · GitHub.
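A rough sketch of the build steps (version numbers and paths are illustrative; check the pom.xml in the repository, and note that the connector may depend on the tensorflow-hadoop module from the same repo, which would then need to be built/installed first):

```shell
# Build the Scala 2.12 connector from the tensorflow/ecosystem repo
git clone https://github.com/tensorflow/ecosystem.git

# If the connector's pom.xml depends on tensorflow-hadoop, install it first:
# cd ecosystem/hadoop && mvn clean install -DskipTests

cd ecosystem/spark/spark-tensorflow-connector
mvn clean package -DskipTests

# Then point Spark at the resulting jar, e.g. via the spark.jars configuration:
# spark.jars=/path/to/spark-tensorflow-connector_2.12-<version>.jar
```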
Another option to feed Parquet data to TensorFlow is to use the Petastorm package (available as open source from Uber).
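A minimal sketch of the Petastorm route, using its SparkDatasetConverter API (the cache path is illustrative, and `df` / `model` are assumed to be your Spark DataFrame and pre-trained Keras model):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame as Parquet in a cache directory,
# then streams it into a tf.data.Dataset
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")  # illustrative path

converter = make_spark_converter(df)  # df: Spark DataFrame with the features

with converter.make_tf_dataset(batch_size=128) as dataset:
    # dataset yields batches as named tuples, consumable by Keras
    predictions = model.predict(dataset)  # model: pre-trained Keras model
```

This avoids the TFRecord conversion step entirely, at the cost of an extra materialization to the cache directory.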
I hope this helps. Feel free to contact me directly if you have additional questions.
Hi,
I am coming back on this as I finally found some time to write a short note with some example notebooks that could be useful for this and for the general task of feeding data to TensorFlow and PyTorch: “Three methods to feed Parquet data to TensorFlow or Pytorch”