Using SparkR with IT Spark Clusters

Spark is built with R support and published to the LCG releases. Although there is no Spark connector to simplify the connection to the Spark clusters, users can still establish a Spark connection as shown below and perform data analysis with SparkR.

  1. Start a SWAN session with the LCG96 software stack and the analytix Spark cluster

  2. Open a notebook with the R kernel and connect to the cluster as shown below

    # load the SparkR library
    library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

    # build the Spark configuration list; the driver host and ports are taken
    # from the environment variables that SWAN sets for the session
    conf <- list(spark.driver.memory = "2g",
                 spark.executor.memory = "2g",
                 spark.driver.host = Sys.getenv("SERVER_HOSTNAME"),
                 spark.driver.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][1],
                 spark.blockManager.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][2],
                 spark.ui.port = strsplit(Sys.getenv("SPARK_PORTS"), ",")[[1]][3],
                 spark.executorEnv.LD_LIBRARY_PATH = Sys.getenv("LD_LIBRARY_PATH"),
                 spark.executorEnv.JAVA_HOME = Sys.getenv("JAVA_HOME"),
                 spark.executorEnv.SPARK_HOME = Sys.getenv("SPARK_HOME"),
                 spark.executorEnv.SPARK_EXTRA_CLASSPATH = Sys.getenv("SPARK_DIST_CLASSPATH"))

    # create the Spark session on the YARN cluster
    sparkR.session(master = "yarn",
                   appName = "SparkR_swan",
                   sparkConfig = conf)
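
    To verify that the connection is up, you can ask the cluster for its Spark version (sparkR.version() is part of the standard SparkR API):

    # confirm the session works by querying the Spark version
    sparkR.version()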

    # load data from HDFS into a Spark DataFrame
    customer <- read.df(path = "/user/pkothuri/TPCDS/tpcds/customer", source = "parquet")
    head(customer)
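
    From here the full SparkR DataFrame and SQL APIs are available. As a minimal sketch of a first analysis, assuming the TPC-DS customer table exposes a c_birth_country column (substitute a column from your own data):

    # count customers per birth country with the DataFrame API
    # (c_birth_country is assumed to exist in the TPC-DS customer table)
    byCountry <- count(groupBy(customer, "c_birth_country"))
    head(arrange(byCountry, desc(byCountry$count)))

    # the same aggregation through Spark SQL, via a temporary view
    createOrReplaceTempView(customer, "customer")
    head(sql("SELECT c_birth_country, COUNT(*) AS n FROM customer GROUP BY c_birth_country ORDER BY n DESC"))

    # stop the session when done to release the YARN resources
    sparkR.session.stop()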

  3. Support:

Please contact swan-admins@cern.ch for consultancy regarding SparkR on SWAN with the CERN IT Spark clusters.