Which Spark Cluster configuration to choose

jfernan · July 1, 2020, 5:40pm

Dear all,

I am fairly new to SWAN. I have managed to read and process CSV data (<1 GB) from my EOS space in order to build an ML model. Now that I want to increase the data sample, I would like to know which Spark cluster configuration would be optimal.

From https://github.com/swan-cern/help/blob/master/spark/clusters.md I understand that Cloud Containers should be my choice for better performance. I am not sure I got this right, though, because I seem to see the opposite when I choose that option, at least compared to Analytix.

I am not sure whether this is related to the fact that Cloud Containers does not offer version 97 of the software stack, so I had to choose 96 (or bleeding edge) instead. Besides, I could not find any information about the QA Spark cluster, so any tips or documentation on the recommended option would be very welcome.

Thank you very much in advance.