Spark is built with R support and published to the LCG releases. Although there is no Spark connector to simplify the connection to the Spark clusters, users can still establish a Spark connection as shown below and perform data analysis with SparkR:
- Start a SWAN session with the LCG96 software stack and the analytix Spark cluster
- Open a notebook with the R kernel and start the connection to the cluster as below:
```r
# load the SparkR library shipped with the LCG release
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Spark configuration list; driver host and ports come from the SWAN environment
conf <- list(spark.driver.memory = "2g",
             spark.executor.memory = "2g",
             spark.driver.host = Sys.getenv("SERVER_HOSTNAME"),
             spark.driver.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][1],
             spark.blockManager.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][2],
             spark.ui.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][3],
             spark.executorEnv.LD_LIBRARY_PATH = Sys.getenv("LD_LIBRARY_PATH"),
             spark.executorEnv.JAVA_HOME = Sys.getenv("JAVA_HOME"),
             spark.executorEnv.SPARK_HOME = Sys.getenv("SPARK_HOME"),
             spark.executorEnv.SPARK_EXTRA_CLASSPATH = Sys.getenv("SPARK_DIST_CLASSPATH"))

# create the Spark session on YARN
sparkR.session(master = "yarn",
               appName = "SparkR_swan",
               sparkConfig = conf)

# load data from HDFS into a Spark dataframe
customer <- read.df(path = "/user/pkothuri/TPCDS/tpcds/customer", source = "parquet")
head(customer)
```
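Once the session is up, a quick sanity check could look like the following minimal sketch. It uses standard SparkR calls on the `customer` dataframe from above; the temporary view name is just an illustration:

```r
# inspect the schema and row count of the dataframe
printSchema(customer)
count(customer)

# register a temporary view and query it with Spark SQL
createOrReplaceTempView(customer, "customer")
head(sql("SELECT COUNT(*) AS n_customers FROM customer"))

# stop the session when done
sparkR.session.stop()
```

Stopping the session releases the YARN containers held by the notebook, which matters on a shared cluster such as analytix.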
References:
- Spark Configuration - https://spark.apache.org/docs/2.4.5/configuration.html
- Apache Spark documentation on R - https://spark.apache.org/docs/2.4.5/sparkr.html
- Databricks documentation on R - https://docs.databricks.com/spark/latest/sparkr/overview.html
Please contact swan-admins@cern.ch for consulting regarding SparkR on SWAN with the CERN IT Spark clusters.