SparkR vs sparklyr for interacting with Spark from R

This post grew out of some notes I was making on the differences between SparkR and sparklyr, two packages that provide an R interface to Spark. I’m currently working on a project where I’ll be interacting with data in Spark, so I wanted to get a sense of my options for doing that from R. Those unfamiliar with sparklyr might benefit from reading the first half of this previous post, where I cover the idea of having R objects that act as connections to Spark DataFrames. The table below provides a rough summary of my conclusions comparing the two packages.

Feature                   SparkR   sparklyr
Data input & output       + +      + +
Data manipulation         -        + + +
Documentation             + +      + +
Ease of setup             + +      + +
Function naming           - -      + + +
Installation              +        + +
Machine learning          +        + +
Range of functions        + + +    + +
Running arbitrary code    +        + +
Tidyverse compatibility   - - -    + + +

While there aren’t any direct conflicts between SparkR and sparklyr, I have used SparkR:: and sparklyr:: in places to make it clear which package a function is from. It’s also not the case that you can use functions from SparkR and sparklyr interchangeably on the same object. If you call a function from sparklyr on a Spark DataFrame created with SparkR then you’ll just get an error.

In the post I comment on functionality that is, to my knowledge, missing with SparkR and sparklyr. Please let me know in the comments if any of these claims are incorrect and I will update the post.

Below are the packages used in this post. Those wanting to follow along will need to update to tidyverse 1.2.0 so that stringr is loaded by a call to library(tidyverse). Readers are recommended to follow along with either the sparklyr or SparkR code in a single session, not both.

library(tidyverse)
library(SparkR)
library(sparklyr)

Installation

SparkR

At the time of writing SparkR is not on CRAN due to a “policy violation” (a quick Google suggests that this has happened before). This means that installing the SparkR package is not as straightforward as usual. When calling install.packages() we need to provide the URL of the archived version of the package, and set repos = NULL and type = "source".

install.packages("https://cran.r-project.org/src/contrib/Archive/SparkR/SparkR_2.1.2.tar.gz", 
                 repos = NULL, 
                 type = "source")

Note that this will install SparkR 2.1.2; if we want version 2.2.0 we need to head over to GitHub. Once we’ve got SparkR installed, we’re also going to need Spark itself. SparkR includes an install.spark() function, which works fine for me.
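If you do want 2.2.0, one route is to install straight from the Spark repository on GitHub, pointing at the R package subdirectory. This assumes you have devtools installed, and the tag name below is an assumption — check it against the Spark releases page:

```r
# install SparkR from the apache/spark GitHub repository;
# the R package lives in the R/pkg subdirectory
# (tag name is an assumption -- check the releases page first)
devtools::install_github("apache/spark@v2.2.0", subdir = "R/pkg")
```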

sparklyr

We can install sparklyr like we would any other package, and it also provides us with a spark_install() function.

install.packages("sparklyr")

sparklyr::spark_install()

One thing to be aware of is that SparkR and sparklyr will, by default, install Spark in different places. SparkR installs Spark to the cache directory: ~/.cache/spark on Linux, ~/Library/Caches/spark on Mac OS X, or %LOCALAPPDATA%\Apache\Spark\Cache on Windows. sparklyr installs Spark to your home directory, e.g. ~/spark on Linux.
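If you lose track of which versions of Spark you have installed, and where, sparklyr can tell you:

```r
library(sparklyr)

# list the locally installed Spark versions sparklyr knows about,
# including the directory each one lives in
spark_installed_versions()
```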

Setting up a connection

Setup is pretty simple for both packages. With SparkR we set up a connection by providing our master URL. We also have other arguments for things like Spark package dependencies.

SparkR::sparkR.session(master = "local")
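As a sketch of what those extra arguments look like — the spark-avro package coordinates below are purely illustrative, an assumption rather than something you need:

```r
# a connection with some optional extras: driver memory set through
# sparkConfig, and a Spark package dependency through sparkPackages
SparkR::sparkR.session(
  master        = "local",
  sparkConfig   = list(spark.driver.memory = "2g"),
  sparkPackages = "com.databricks:spark-avro_2.11:3.2.0"
)
```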

For sparklyr we also have a single function call with various options. A lot of the functions in sparklyr require us to provide our Spark connection, so it’s a good idea to create an object for it. Note that I’m having to provide spark_home here because I’ve been messing around with SparkR and sparklyr on my PC. If you’re doing a fresh install of sparklyr then spark_connect(master = "local") will work fine.

sc <- sparklyr::spark_connect(master = "local", 
                              spark_home = "/home/ed/.cache/spark/spark-2.1.2-bin-hadoop2.7")

Function naming

SparkR

As I’ve mentioned elsewhere, SparkR has some serious conflicts with dplyr. It has over 20 functions that share names with dplyr functions. This is clearly a little frustrating given the popularity of dplyr. The names used for functions in SparkR are also a bit inconsistent. For example, there is:

  • as.DataFrame()
  • add_months()
  • isNull()

Ultimately this just makes it harder to remember function names or to group similar functions together in your mind. That said, when it comes to fitting machine learning models all the models are prefixed with spark., which is nice. Equally, functions to configure Spark in various ways are prefixed with sparkR.
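We can list that spark.-prefixed grouping for ourselves, in the same way as the sparklyr listings below (this assumes SparkR is attached, with %>% coming from the tidyverse):

```r
# functions sharing the spark. prefix, including the model-fitting ones
ls("package:SparkR", pattern = "^spark\\.") %>%
  head()
```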

sparklyr

The situation is rather different with sparklyr. Here functions are named with consistent prefixes to create clear groupings. Functions to do feature transformations are all prefixed ft_.

ls("package:sparklyr", pattern = "^ft_") %>%
  head()
## [1] "ft_binarizer"                 "ft_bucketizer"               
## [3] "ft_count_vectorizer"          "ft_discrete_cosine_transform"
## [5] "ft_elementwise_product"       "ft_index_to_string"

Machine learning functions are all prefixed ml_.

ls("package:sparklyr", pattern = "^ml_") %>%
  head()
## [1] "ml_als_factorization"             "ml_binary_classification_eval"   
## [3] "ml_classification_eval"           "ml_create_dummy_variables"       
## [5] "ml_decision_tree"                 "ml_generalized_linear_regression"

Functions that work on a Spark DataFrame are prefixed sdf_.

ls("package:sparklyr", pattern = "^sdf_") %>%
  head()
## [1] "sdf_along"      "sdf_bind_cols"  "sdf_bind_rows"  "sdf_broadcast" 
## [5] "sdf_checkpoint" "sdf_coalesce"

Finally, functions that configure Spark, read in data, etc. are prefixed spark_.

ls("package:sparklyr", pattern = "^spark_") %>%
  head()
## [1] "spark_apply"              "spark_apply_bundle"      
## [3] "spark_apply_log"          "spark_available_versions"
## [5] "spark_compilation_spec"   "spark_compile"

How much this sort of consistency matters comes down to personal preference, but I find it makes a big difference to the fluency with which I can write code. If you’re new to Spark, as I am, it also makes learning a lot easier when the functions come in easily identifiable groups.

Data input and output

SparkR provides a range of functions to read in data and copy it to Spark, as well as similar functions for writing data. In both cases we’re reading or writing data directly into or out of Spark.

ls("package:SparkR", pattern = "read")
## [1] "read.df"      "read.jdbc"    "read.json"    "read.ml"     
## [5] "read.orc"     "read.parquet" "read.text"

sparklyr also has a set of similar functions for data input and output.

ls("package:sparklyr", pattern = "spark_read")
## [1] "spark_read_csv"     "spark_read_jdbc"    "spark_read_json"   
## [4] "spark_read_parquet" "spark_read_source"  "spark_read_table"  
## [7] "spark_read_text"

In addition we can transfer an existing R data.frame to Spark using sparklyr::copy_to(). This does the same thing as SparkR::as.DataFrame().
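For example, transferring mtcars into Spark with each package might look like this (sc being the sparklyr connection object created earlier):

```r
# sparklyr: copy an R data.frame into Spark, returning a tbl_spark
mtcars_tbl <- sparklyr::copy_to(sc, mtcars, overwrite = TRUE)

# SparkR: the equivalent, returning a Spark DataFrame
mtcars_sdf <- SparkR::as.DataFrame(mtcars)
```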

Data manipulation

SparkR

Data manipulation is probably where the biggest differences between SparkR and sparklyr emerge. SparkR has its own versions of the core dplyr verbs mutate(), select(), filter(), summarize(), and arrange(). However, they work slightly differently to dplyr in that they don’t support non-standard evaluation. For example, the code below would generate an error.

mtcars_sparkr <- SparkR::as.DataFrame(mtcars)

mtcars_sparkr %>%
  SparkR::filter(cyl == 4)
Error in SparkR::filter(., cyl == 4) : object 'cyl' not found 

For this code to work we either have to write mtcars_sparkr$cyl == 4 or pass the whole condition as a string, "cyl == 4". This issue applies to all the SparkR equivalents of dplyr verbs. SparkR also has a group_by() function, which does the same thing as dplyr’s group_by(). By having functions that match dplyr names but work differently, SparkR places a higher cognitive load on the user. The more familiar you are with dplyr the more often you’re going to trip up.
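Concretely, either of these versions of the filter will run:

```r
# column reference via $ on the Spark DataFrame itself
mtcars_sparkr %>%
  SparkR::filter(mtcars_sparkr$cyl == 4)

# or pass the whole condition as a string
mtcars_sparkr %>%
  SparkR::filter("cyl == 4")
```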

We can perform various joins with SparkR using the join() function, which takes a joinType argument. SparkR also has a range of functions for operating on data in various ways. For example, we could countDistinct() values, dropDuplicates(), or calculate approxQuantile().
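A left join with join() might look something like this, where df_a and df_b are hypothetical Spark DataFrames sharing an id column:

```r
# join() takes an explicit join expression plus a joinType string
joined <- SparkR::join(
  df_a, df_b,
  df_a$id == df_b$id,
  joinType = "left_outer"
)
```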

sparklyr

In contrast, sparklyr works with dplyr out of the box. As a result, there is no additional cognitive load involved in manipulating a Spark DataFrame with sparklyr. sparklyr also allows us to transform individual columns with the ft_* functions. For example, we could turn a numeric column into discrete buckets with ft_bucketizer().
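For instance, with mtcars copied to Spark as mtcars_sparklyr, an ordinary dplyr pipeline runs unchanged, and ft_bucketizer() can cut a continuous column into bins. The bucketizer argument names below follow sparklyr 0.6.x and may differ in later versions:

```r
# ordinary dplyr verbs work directly on the tbl_spark
mtcars_sparklyr %>%
  dplyr::filter(cyl == 4) %>%
  dplyr::summarise(mean_mpg = mean(mpg))

# bucket horsepower into three bins
# (argument names follow sparklyr 0.6.x and may have changed since)
mtcars_sparklyr %>%
  ft_bucketizer(input.col = "hp", output.col = "hp_bucket",
                splits = c(0, 100, 200, Inf))
```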

sparklyr also has a far nicer print method than SparkR. When you print a tbl_spark you get the head of your data in a readable form.

mtcars_sparklyr <- copy_to(sc, mtcars)

mtcars_sparklyr
# Source:   table<mtcars> [?? x 11]
# Database: spark_connection
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
# 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
# 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
# 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
# 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
# 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
# 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
#10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with more rows

Running arbitrary code

SparkR

SparkR lets us apply arbitrary R code to a Spark DataFrame with a set of apply functions.

ls("package:SparkR", pattern = "apply")
## [1] "dapply"        "dapplyCollect" "gapply"        "gapplyCollect"
## [5] "spark.lapply"

With dapply() and gapply() we can apply a function to the partitions or groups of a Spark DataFrame, respectively. The variants ending in Collect will collect the result of applying the function into R – these functions return an R data.frame rather than a Spark DataFrame. These apply functions are a bit clunky to use in that we have to provide a schema for the data we expect back. This involves specifying types and names for all the columns, which would be pretty unwieldy for a big data set. For an idea of what a schema looks like, here’s the one for mtcars if we were expecting back all the columns after using some apply function.

schema <- structType(structField("mpg", "double"),
                     structField("cyl", "double"),
                     structField("disp", "double"),
                     structField("hp", "double"),
                     structField("drat", "double"),
                     structField("wt", "double"),
                     structField("qsec", "double"),
                     structField("vs", "double"),
                     structField("am", "double"),
                     structField("gear", "double"),
                     structField("carb", "double"))

That said, in this case we could actually use the schema() function as a shortcut. Calling schema() on a Spark DataFrame will return the schema for it. This shortcut can’t be used in cases where we want new columns or column names.
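Putting that together, a dapply() call whose function keeps the same columns can reuse the existing schema directly:

```r
# double wt in each partition; the columns don't change,
# so schema(mtcars_sparkr) describes the result too
doubled <- SparkR::dapply(
  mtcars_sparkr,
  function(d) { d$wt <- d$wt * 2; d },
  schema(mtcars_sparkr)
)
```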

spark.lapply() offers a way to distribute computation with Spark and works similarly to base::lapply(). We can also SparkR::collect() a Spark DataFrame into R at any point to get a standard data.frame to work with. Alternatively we can call SparkR::as.data.frame() to get an R data.frame from a Spark DataFrame. To transfer the data back to Spark we just use as.DataFrame() again.
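A minimal sketch of both ideas, distributing a toy computation and then round-tripping a Spark DataFrame through R:

```r
# distribute a simple computation over a list, lapply-style;
# returns an ordinary R list
squares <- SparkR::spark.lapply(1:10, function(x) x^2)

# bring a Spark DataFrame back into R, then send it to Spark again
mtcars_local <- SparkR::collect(mtcars_sparkr)
mtcars_back  <- SparkR::as.DataFrame(mtcars_local)
```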

sparklyr

sparklyr supports running arbitrary R code on a Spark DataFrame with sparklyr::spark_apply(). To adapt an example from the sparklyr documentation we could do:

mtcars_sparklyr %>%
  spark_apply(
    function(d) broom::tidy(lm(mpg ~ wt, d)),
    names = c("term", "estimate", "std.error", "statistic", "p.value"),
    group_by = "cyl"
    )

Given that sparklyr::spark_apply() works with partitions and has an optional group_by argument, it does what SparkR::dapply() and SparkR::gapply() do, but in a single function. We can also use dplyr::collect() with the tbl_spark used by sparklyr to get back a standard tibble.
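For example, filtering in Spark and collecting the result locally is just:

```r
# run the filter in Spark, then pull the result into a local tibble
mtcars_local <- mtcars_sparklyr %>%
  dplyr::filter(cyl == 4) %>%
  dplyr::collect()
```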

Machine learning

Both packages provide functions to access the machine learning models offered in Spark by MLlib. As noted above, these functions are prefixed with spark. for SparkR and ml_ for sparklyr. SparkR offers a few models that sparklyr doesn’t, such as Gaussian mixture models (spark.gaussianMixture()) and isotonic regression (spark.isoreg()). sparklyr also offers some features that SparkR doesn’t, such as one-vs-rest classification (ml_one_vs_rest()) and principal component analysis (ml_pca()).

SparkR

One of the first things we might want to do for machine learning is partition our data into training and test sets (a simple train/test split is used here to make the code easier to follow). With SparkR this can be done with randomSplit().

mtcars_sparkr_part <- mtcars_sparkr %>%
  randomSplit(weights = c(0.8, 0.2))

Now we have 2 partitions with 80% and 20% of our data to act as our training and test sets. To fit a random forest we’d just do:

# all column names but our outcome
features <- colnames(mtcars_sparkr)[colnames(mtcars_sparkr) != "am"]

# formula for the model
model_formula <- as.formula(str_c("am ~ ", str_c(features, collapse = " + ")))

# fit the random forest
fit_random_forest <- spark.randomForest(
  mtcars_sparkr_part[[1]], # the training partition
  formula = model_formula,
  type = "classification",
  maxDepth = 5L, 
  maxBins = 20L,
  numTrees = 100L, 
  impurity = "entropy",
  seed = 2017,
  subsamplingRate = 0.25 # fraction of the training data sampled per tree
)

With SparkR we can call summary() on our model just like we’re used to in R. Depending on the model we’re fitting this will give us different information. For a logistic regression all we get back is the coefficients. For tree-based models we get back more information, but the formatting is pretty unwieldy. For example, the variable importance values for a random forest are formatted as a single character string (!!).

[1] "(10,[0,1,2,3,4,5,6,7,8,9],[0.2684531564709065,0.016270869932325224,0.0754007657839239,0.0227552217038972,0.1534651932489192,0.1313723587921444,0.1450224704157068,0.008607718035945643,0.13392615435412844,0.04472609126210256])"

We can also get predictions from our model in the standard way.

predict_random_forest <- predict(fit_random_forest, mtcars_sparkr_part[[2]])

For a classification problem, some models will return probabilities for our different classes. These probabilities come in a format which SparkR is unable to automatically transform into an R data type.

predict_random_forest_df <- predict_random_forest %>%
  SparkR::as.data.frame()

head(predict_random_forest_df$probability)
[[1]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 216 

[[2]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 217 

[[3]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 218 

[[4]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 219 

[[5]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 220 

[[6]]
Java ref type org.apache.spark.ml.linalg.DenseVector id 221

Considerable work needs to be done to get these probabilities into R (see here).

sparklyr

With sparklyr we can partition our data in a way that is, perhaps, more intuitive and readable.

mtcars_sparklyr_part <- mtcars_sparklyr %>%
  sdf_partition(train = 0.8, test = 0.2)

We now have two named partitions that we can access either with mtcars_sparklyr_part$train or mtcars_sparklyr_part[[1]]. Fitting a model looks very similar to SparkR.

features <- colnames(mtcars_sparklyr)[colnames(mtcars_sparklyr) != "am"]

fit_random_forest <- ml_random_forest(
  mtcars_sparklyr_part$train, # the training partition
  response = "am",
  features = features, # the names minus the outcome
  col.sample.rate = 0.25,
  impurity = "entropy",
  max.bins = 32L, 
  max.depth = 5L,
  num.trees = 100L,  
  type = "classification"
)

Calling summary() on this model doesn’t give us anything particularly informative. However, sparklyr does offer some other functions for evaluating models. We can use ml_tree_feature_importance() to get feature importance for a tree-based model. The output of this function is far more nicely formatted than what we get with SparkR.

ml_tree_feature_importance(sc, fit_random_forest)
#             importance feature
# 1    0.230991451464364    gear
# 2    0.178459391689087    qsec
# 3     0.16359926346592      wt
# 4     0.14757564455443    drat
# 5   0.0976641436781515    disp
# 6   0.0799998860243653     mpg
# 7   0.0634276730442116      hp
# 8   0.0190372662120814    carb
# 9   0.0100737937050789     cyl
#10 0.00917148616230991      vs

We can also use ml_classification_eval() and ml_binary_classification_eval() to calculate things like recall, precision, and AUC (see here). As far as I can tell there is no SparkR equivalent of ml_classification_eval().
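A sketch of that evaluation step on the test partition — the argument names here are an assumption based on the 0.6.x interface, so check the package documentation before relying on them:

```r
# score the held-out partition with the fitted random forest
# (sdf_predict and the evaluator argument names below follow
#  sparklyr 0.6.x and are assumptions -- check the docs)
predictions <- sdf_predict(fit_random_forest, mtcars_sparklyr_part$test)

ml_binary_classification_eval(
  predictions,
  label  = "am",
  score  = "probability",
  metric = "areaUnderROC"
)
```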

Range of functions

Even accounting for all the conflicts with dplyr, SparkR has a lot more functions than sparklyr.

length(ls("package:SparkR"))
## [1] 313
length(ls("package:sparklyr"))
## [1] 169

For example, SparkR offers functionality to do some reasonably niche things, like add_months() to a date column or convert degrees to radians with toRadians(). Personally, I find the range of functions offered by SparkR a bit overwhelming, particularly with the inconsistencies in naming. However, in time they may become useful. The full documentation for SparkR can be used to get information on all the functions it offers.

sparklyr has fewer functions, but it’s a bit easier to navigate what’s available. As far as I can tell sparklyr has all its core functionality sorted, bar a few missing models. Bearing in mind we’re only on version 0.6.4 of sparklyr (versus version 2.2.0 of SparkR), I’m sure there’s plenty more to come.

Conclusions

This post grew out of a set of notes I was creating for myself comparing SparkR and sparklyr. Hopefully it will be useful for people deciding which package to use. The most obvious differences between the packages come in terms of function naming conventions and tidyverse compatibility. For some people this will be a big deal; for others it won’t really matter. As a tidyverse evangelist, the compatibility offered by sparklyr is a big bonus for me.

I’ve also found learning sparklyr a lot easier, in part because of the naming conventions. I suspect that sparklyr would also be easier to use for big projects where you want to try lots of models. By offering functions to easily evaluate models, sparklyr would be great for a workflow where you want to write your own functions to quickly evaluate models or tuning parameters.

Finally, let’s reproduce the summary table at the top of this post, along with the code that created it.

library(knitr)
library(kableExtra)

comparison_table <- data_frame(
                    Package = c("SparkR", "sparklyr"),
               Installation = c("+", "+ +"),
            `Ease of setup` = c("+ +", "+ +"),
          `Function naming` = c("- -", "+ + +"),
      `Data input & output` = c("+ +", "+ +"),
        `Data manipulation` = c("-", "+ + +"),
   `Running arbitrary code` = c(" +", "+ +"),
         `Machine learning` = c("+", "+ +"),
       `Range of functions` = c("+ + +", "+ +"),
  `Tidyverse compatibility` = c("- - -", "+ + +"),
              Documentation = c("+ +", "+ +")) %>%
  gather(Feature, Score, -Package) %>%
  spread(Package, Score) %>%
  select(Feature, SparkR, sparklyr)

comparison_table %>%
  mutate_at(
    .vars = vars(2:3), 
    .funs = function(x) { 
      cell_spec(
             x = x, 
        format = "html",
          bold = TRUE,
         color = if_else(str_detect(x, "\\+"), "green", "red")) }) %>%
  kable(
    format = "html", 
    align = "c",
    escape = FALSE) %>%
  kable_styling(bootstrap_options = "hover")
Feature                   SparkR   sparklyr
Data input & output       + +      + +
Data manipulation         -        + + +
Documentation             + +      + +
Ease of setup             + +      + +
Function naming           - -      + + +
Installation              +        + +
Machine learning          +        + +
Range of functions        + + +    + +
Running arbitrary code    +        + +
Tidyverse compatibility   - - -    + + +