Hello,
I’m currently using Jupyter notebooks on SWAN with the k8s Spark cluster to process and analyze data (with Python). I’m trying to run inference on a collection of Parquet files, processed as a Spark DataFrame. My goal is to feed the data to a pre-trained TensorFlow model without going through pandas.
In the SparkDLTrigger repository they show some examples using the TFRecord format and I would like to see how it performs.
Is there a way to implement this on the SWAN Spark cluster? How can one add the spark-tensorflow-connector package in the configurations?
If you are aware of other alternatives, feel free to share them.
In the work you mentioned (SparkDLTrigger), we used Spark to convert data into the TFRecord format, which can then be processed natively by TensorFlow.
I understand this is what you want to do too, and I believe it should still work and make sense with the current releases of Spark and TensorFlow.
To enable Spark to write data in the TFRecord format you need to add a package/jar: spark-tensorflow-connector.
In the original work we used the TFRecord connector package from Maven Central; to do that, we set the configuration spark.jars.packages to org.tensorflow:spark-tensorflow-connector_2.11:1.14.0.
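For example, assuming you set Spark options programmatically rather than through the SWAN Spark configuration panel, a minimal sketch would look like this (the input/output paths are illustrative):

```python
from pyspark.sql import SparkSession

# Pull the connector from Maven Central (Scala 2.11 build, for Spark 2.4.x)
spark = (SparkSession.builder
         .appName("tfrecord-conversion")
         .config("spark.jars.packages",
                 "org.tensorflow:spark-tensorflow-connector_2.11:1.14.0")
         .getOrCreate())

# With the connector on the classpath, the DataFrame can be written as TFRecord;
# the connector registers "tfrecord" as a Spark data source short name
df = spark.read.parquet("/path/to/input.parquet")  # illustrative path
(df.write
   .format("tfrecord")
   .option("recordType", "Example")
   .save("/path/to/output_tfrecord"))  # illustrative path
```

Note that on SWAN the same spark.jars.packages setting can also be entered in the Spark connector configuration dialog instead of in code.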
That version only works with Spark 2.4.x (available in LCG97a when using SWAN). More recent LCG releases, including the current LCG102b, ship Spark 3.x, which requires spark-tensorflow-connector_2.12; unfortunately, from what I can see, that build has not yet been pushed to Maven Central by the TensorFlow team. However, the connector is available and can be compiled (using mvn package) from GitHub: see ecosystem/spark/spark-tensorflow-connector at master · tensorflow/ecosystem · GitHub.
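A rough sketch of the build steps (version numbers and paths are illustrative; check the pom.xml in the repository, and note that the connector may depend on the tensorflow-hadoop module from the same repo, which would then need to be built/installed first):

```shell
# Build the Scala 2.12 connector from the tensorflow/ecosystem repo
git clone https://github.com/tensorflow/ecosystem.git

# If the connector's pom.xml depends on tensorflow-hadoop, install it first:
# cd ecosystem/hadoop && mvn clean install -DskipTests

cd ecosystem/spark/spark-tensorflow-connector
mvn clean package -DskipTests

# Then point Spark at the resulting jar, e.g. via the spark.jars configuration:
# spark.jars=/path/to/spark-tensorflow-connector_2.12-<version>.jar
```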
Another option to feed Parquet data to TensorFlow is to use the Petastorm package (available as open source from Uber).
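A minimal sketch of the Petastorm route, using its SparkDatasetConverter API (the cache path is illustrative, and `df` / `model` are assumed to be your Spark DataFrame and pre-trained Keras model):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame as Parquet in a cache directory,
# then streams it into a tf.data.Dataset
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")  # illustrative path

converter = make_spark_converter(df)  # df: Spark DataFrame with the features

with converter.make_tf_dataset(batch_size=128) as dataset:
    # dataset yields batches as named tuples, consumable by Keras
    predictions = model.predict(dataset)  # model: pre-trained Keras model
```

This avoids the TFRecord conversion step entirely, at the cost of an extra materialization to the cache directory.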
I hope this helps. Feel free to contact me directly if you have additional questions.
Hi,
I am coming back on this as I finally found some time to write a short note with some example notebooks that could be useful for this and for the general task of feeding data to TensorFlow and PyTorch: “Three methods to feed Parquet data to TensorFlow or Pytorch”