Sparklyr 1.9.2
Avoids the cross-wire when pulling an object from a lazy table instead of pulling a field (#3494)
Converts spark_write_delta() to method
In simulate_vars_spark(), it avoids calling a function named ‘unknown’ in case a package has added such a function name in its environment (#3497)
Sparklyr 1.9.1
Removes use of %||% in worker’s R scripts to avoid reference error (#3487)
Restores support for Spark 2.4 with Scala 2.11 (#3485)
Addresses changes in Spark 4.0 from when it was in preview to now.
ml_load() now uses Spark’s file read to obtain the metadata for the models instead of R’s file read. This approach accounts for when the Spark Context is reading different mounted file protocols and mounted paths (#3478).
Sparklyr 1.9.0
Improvements
Adds support for Spark 4 and Scala 2.13 (#3479):
Adds new Java folder for Spark 4.0.0 with updated code
Adds new JAR file to handle Spark 4+
Updates to different spots in the R code to start handling version 4, as well as releases marked as “preview” by the Spark project
Removes JARs using Scala 2.11
Updates the Spark versions to use for CI
Fixes
sdf_sql() now returns nothing, without raising an error, when the query outputs an empty dataset (#3439)
Sparklyr 1.8.6
- Addresses issues with R 4.4.0. The root cause was that version checking functions changed how they work.
  - package_version() no longer accepts numeric_version() output. Wrapped the package_version() function to coerce the argument if it’s a numeric_version class
  - Comparison operators (<, >=, etc.) for packageVersion() no longer accept numeric values. The changes were to pass the version as a character
- Adds support for Databricks “autoloader” (format: cloudFiles) for streaming ingestion of files (stream_read_cloudfiles) (@zacdav-db #3432):
  - stream_write_table()
  - stream_read_table()
- Made changes to stream_write_generic (@zacdav-db #3432):
  - toTable method doesn’t allow calling start, added to_table param that adjusts logic
  - path option not propagated when to_table is TRUE
- Upgrades to Roxygen version 7.3.1
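The version-handling change described above can be illustrated with a minimal base R sketch. The helper name as_package_version() is hypothetical and is not sparklyr's internal wrapper; it only shows the idea of coercing a numeric_version to character before calling package_version(), and of comparing against a character version rather than a number.

```r
# Hypothetical helper (not part of sparklyr): coerce numeric_version input
# to character before handing it to package_version().
as_package_version <- function(x) {
  if (inherits(x, "numeric_version")) {
    x <- as.character(x)
  }
  package_version(x)
}

v <- as_package_version(numeric_version("3.5.1"))

# Compare against a character version instead of a numeric value such as 3.5
is_recent <- v >= "3.5.0"
print(is_recent)
```

Passing the comparison value as a character string keeps the check valid regardless of how a given R release treats numeric operands.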
Sparklyr 1.8.5
Fixes
Fixes quoting issue with dbplyr 2.5.0 (#3429)
Fixes Windows OS identification (#3426)
Package improvements
Removes dependency on tibble, all calls are now redirected to dplyr (#3399)
Removes dependency on rappdirs (#3401):
- Backwards compatibility with sparklyr 0.5 is no longer needed
- Replicates selection of cache directory
Converts spark_apply() to a method (#3418)
Spark improvements
- Spark 2.3 is no longer considered maintained as of September 2019
- Removes Java folder for versions 2.3 and below
- Merges Scala file sets into Spark version 2.4
- Re-compiles JARs for version 2.4 and above
- Updates Delta-to-Spark version matching when using delta as one of the packages when connecting (#3414)
Sparklyr 1.8.4
Compatibility with new dbplyr version
Fixes db_connection_describe() S3 consistency error (@t-kalinowski)
Addresses new error from dbplyr that fails when you try to access components from a remote tbl using $
Bumps the version of dbplyr to switch between the two methods to create temporary tables
Addresses new translate_sql() hard requirement to pass a con object. Done by passing the current connection or simulate_hive()
Fixes
Small fix to spark_connect_method() arguments. Removes ‘hadoop_version’
Improvements to handling pysparklyr load (@t-kalinowski)
Fixes ‘subscript out of bounds’ issue found by pysparklyr (@t-kalinowski)
Updates available Spark download links
Improvements
- Removes dependency on the following packages:
  - digest
  - base64enc
  - ellipsis
- Converts ml_fit() into a S3 method for pysparklyr compatibility
Test improvements
Improvements and fixes to tests (@t-kalinowski)
Fixes test jobs that should have included Arrow but did not
Updates to the Spark versions to be tested
Re-adds tests for development dbplyr
Sparklyr 1.8.3
Improvements
Spark error message relays are now cached instead of the entire content displayed as an R error. This used to overwhelm the interactive session’s console or Notebook, because of the number of lines returned by the Spark message. Now, by default, it will return the top of the Spark error message, which is typically the most relevant part. The full error can still be accessed using a new function called spark_last_error()
Reduces redundancy on several tests
Handles SQL quoting when the table reference contains multiple levels. The most common case where someone would encounter an issue is when a table name is passed using in_catalog() or in_schema().
Java
- Adds Scala scripts to handle changes in the upcoming version of Spark (3.5)
- Adds new JAR file to handle Spark 3.0 to 3.4
- Adds new JAR file to handle Spark 3.5 and above
Fixes
It prevents an error when na.rm = TRUE is explicitly set within pmax() and pmin(). It will now also purposely fail if na.rm is set to FALSE. The default of these functions in base R is for na.rm to be FALSE, but ever since these functions were released, there has been no warning or error. For now, we will keep that behavior until a better approach can be figured out. (#3353)
spark_install() will now properly match when a partial version is passed to the function. The issue was that passing ‘2.3’ would match to ‘3.2.3’, instead of ‘2.3.x’ (#3370)
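The partial-version problem mentioned for spark_install() can be reproduced with plain string matching. The sketch below is only an illustration in base R: the version list and the match_partial() helper are hypothetical, and this is not the code sparklyr uses internally.

```r
# Hypothetical list of installable Spark versions
available <- c("2.3.0", "2.3.4", "3.2.3", "3.3.0")

# Unanchored substring search: "2.3" is also found inside "3.2.3"
loose <- available[grepl("2.3", available, fixed = TRUE)]

# Anchored match: keep exact matches, or versions that start with "2.3."
match_partial <- function(partial, versions) {
  versions[versions == partial | startsWith(versions, paste0(partial, "."))]
}
strict <- match_partial("2.3", available)

print(loose)
print(strict)
```

Anchoring the comparison at the start of the version string is what keeps ‘3.2.3’ out of the result for a ‘2.3’ request.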
Package integration
Adds functionality to allow other packages to provide sparklyr additional back-ends. This effort is mainly focused on adding the ability to integrate with Spark Connect and Databricks Connect through a new package.
New exported functions to integrate with the RStudio IDE. They all have the same spark_ide_ prefix
Modifies several read functions to become exported methods, such as sdf_read_column().
Adds spark_integ_test_skip() function. This is to allow other packages to use sparklyr’s test suite. It enables a way for the external package to indicate if a given test should run or be skipped.
If installed, sparklyr will load the pysparklyr package
Sparklyr 1.8.2
New Features
Adds Azure Synapse Analytics connectivity (@Bob-Chou , #3336)
Adds support for “parameterized” queries now available in Spark 3.4 (@gregleleu #3335)
Adds new DBI methods: dbValid and dbDisconnect (@alibell, #3296)
Adds overwrite parameter to dbWriteTable() (@alibell, #3296)
Adds database parameter to dbListTables() (@alibell, #3296)
Adds ability to turn off predicate support (where(), across()) using options(“sparklyr.support.predicates” = FALSE). Defaults to TRUE. This should accelerate dplyr commands because it won’t need to process column types for every single piped command
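As a minimal sketch of the predicate-support switch described above, the option can be set with base R's options() before running dplyr pipelines. Only the option name comes from the notes; the variable name below is just for illustration.

```r
# Turn off predicate support (where(), across()) for the current R session
options("sparklyr.support.predicates" = FALSE)

# Read the value back to confirm what is currently in effect
predicates_enabled <- getOption("sparklyr.support.predicates")
print(predicates_enabled)

# Restore the documented default
options("sparklyr.support.predicates" = TRUE)
```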
Fixes
Fixes Spark download locations (#3331)
Fix various rlang deprecation warnings (@mgirlich, #3333).
Misc
- Switches upper version of Spark to 3.4, and updates JARS (#3334)
Sparklyr 1.8.1
Bug Fixes
- Fixes consistency issues with dplyr’s sample_n(), slice(), op_vars(), and sample_frac()
Internal functionality
- Adds R-devel to GHA testing
Sparklyr 1.8.0
Bug Fixes
Addresses Warning from CRAN checks
Addresses option(stringsAsFactors) usage
Fixes root cause of issue processing pivot wider and distinct (#3317 & #3320)
Updates local Spark download sources
Sparklyr 1.7.9
Bug Fixes
Better resolves intermediate column names when using dplyr verbs for data transformation (#3286)
Fixes pivot_wider() issues with simpler cases (#3289)
Updates Spark download locations (#3298)
Better resolution of intermediate column names (#3286)
Sparklyr 1.7.8
New features
Adds new metric extraction functions: ml_metrics_binary(), ml_metrics_regression() and ml_metrics_multiclass(). They work closer to how yardstick metric extraction functions work. They expect a table with the predictions and actual values, and return a concise tibble with the metrics. (#3281)
Adds new spark_insert_table() function. This allows one to insert data into an existing table definition without redefining the table, even when overwriting the existing data. (#3272 @jimhester)
Bug Fixes
- Restores “validator” functions to regression models. Removing them in a previous version broke ml_cross_validator() for regression models. (#3273)
Spark
Adds support to Spark 3.3 local installation. This includes the ability to enable and setup log4j version 2. (#3269)
Updates the JSON file that sparklyr uses to find and download Spark for local use. It is worth mentioning that starting with Spark 3.3, the Hadoop version number is no longer using a minor version for its download link. So, instead of requesting 3.2, the version to request is 3.
Internal functionality
Removes workaround for older versions of arrow. Bumps arrow version dependency, from 0.14.0 to 0.17.0 (#3283 @nealrichardson)
Removes code related to backwards compatibility with dbplyr. sparklyr requires dbplyr version 2.2.1 or above, so the code is no longer needed. (#3277)
Begins centralizing ML parameter validation into a single function that will run the proper cast function for each Spark parameter. It also starts using S3 methods, instead of searching for a concatenated function name, to find the proper parameter validator. Regression models are the first ones to use this new method. (#3279)
sparklyr compilation routines have been improved and simplified. spark_compile() now provides more informative output when used. It also adds tests to compilation to make sure. It also adds a step to install Scala in the corresponding GHAs. This is so that the new JAR build tests are able to run. (#3275)
Stops using package environment variables directly. Any package level variable will be handled by a genv prefixed function to set and retrieve values. This avoids the risk of having the exact same variable initialized on more than one R script. (#3274)
Adds more tests to improve coverage.
Misc
- Addresses new CRAN HTML check NOTEs. It also adds a new GHA action to run the same checks to make sure we avoid new issues with this in the future.
Sparklyr 1.7.7
dplyr
- Makes sure to run previous dplyr actions before sampling (#3276)
Misc
- Ensures compatibility with the upcoming, and current, versions of dbplyr
Sparklyr 1.7.6
Misc
Ensures compatibility with Spark version 3.2 (#3261)
Compatibility with new dbplyr version (@mgirlich)
Removes stringr dependency
Fixes augment() when the model was fitted via parsnip (#3233)
Sparklyr 1.7.5
Misc
Addresses deprecation of rlang::is_env() function. (@lionel- #3217)
Updates pivot_wider() to support new version of tidyr (@DavisVaughan #3215)
Sparklyr 1.7.4
Misc
- Edgar Ruiz (https://github.com/edgararuiz) will be the new maintainer of {sparklyr} moving forward.
Sparklyr 1.7.3
Data
Implemented support for the .groups parameter for dplyr::summarize() operations on Spark dataframes
Fixed the incorrect handling of the remove = TRUE option for separate.tbl_spark()
Optimized away an extra count query when collecting Spark dataframes from Spark to R.
Misc
By default, use links from the https://dlcdn.apache.org site for downloading Apache Spark when possible.
Attempt to continue spark_install() process even if the Spark version specified is not present in inst/extdata/versions*.json files (in which case sparklyr will guess the URL of the tar ball based on the existing and well-known naming convention used by https://archive.apache.org, i.e., https://archive.apache.org/dist/spark/spark-${spark version}/spark-${spark version}-bin-hadoop${hadoop version}.tgz)
Revised inst/extdata/versions*.json files to reflect recent releases of Apache Spark.
Implemented sparklyr_get_backend_port() for querying the port number used by the sparklyr backend.
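The archive naming convention quoted above can be written out as a small base R sketch. The helper guess_spark_url() is hypothetical and only illustrates the URL pattern; it is not the function sparklyr uses.

```r
# Hypothetical helper: build a download URL following the
# archive.apache.org naming convention quoted in the notes
guess_spark_url <- function(spark_version, hadoop_version) {
  sprintf(
    "https://archive.apache.org/dist/spark/spark-%s/spark-%s-bin-hadoop%s.tgz",
    spark_version, spark_version, hadoop_version
  )
}

url <- guess_spark_url("3.1.2", "3.2")
print(url)
```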
Sparklyr 1.7.2
Connections
Added support for notebook-scoped libraries on Databricks connections. R library tree paths (i.e., those returned from .libPaths()) are now shared between driver and worker in sparklyr for Databricks connection use cases.
Java version validation function of sparklyr was revised to be able to parse java -version outputs containing only major version or outputs containing data values.
Spark configuration logic was revised to ensure “sparklyr.cores.local” takes precedence over “sparklyr.connect.cores.local”, as the latter is deprecated.
Renamed “sparklyr.backend.threads” (an undocumented, non-user-facing, sparklyr internal-only configuration) to “spark.sparklyr-backend.threads” so that it has the required “spark.” prefix and is configurable through sparklyr::spark_config().
For Spark 2.0 or above, if org.apache.spark.SparkEnv.get() returns a non-null env object, then sparklyr will use that env object to configure “spark.sparklyr-backend.threads”.
Support for running custom callbacks before the sparklyr backend starts processing JVM method calls was added for Databricks-related use cases, which will be useful for implementing ADL credential pass-through.
Data
Revised spark_write_delta() to use delta.io library version 1.0 when working with Apache Spark 3.1 or above.
Fixed a problem with dbplyr::remote_name() returning NULL on Spark dataframes returned from a dplyr::arrange() operation followed by dplyr::compute() (e.g., <a spark_dataframe> %>% arrange(<some column>) %>% compute()).
Implemented tidyr::replace_na() interface for Spark dataframes.
The n_distinct() summarizer for Spark dataframes was revised substantially to properly support na.rm = TRUE or na.rm = FALSE use cases when performing dplyr::summarize(<colname> = n_distinct(...)) types of operations on Spark dataframes.
Spark data interface functions that create Spark dataframes will no longer check whether any Spark dataframe with identical name exists when the dataframe being created has a randomly generated name (as a randomly generated table name will contain a UUID and any chance of name collision is vanishingly small).
Documentation
- Create usage example for ml_prefixspan().
Sparklyr 1.7.1
Connections
- Fixed an issue with connecting to Apache Spark 3.1 or above.
Sparklyr 1.7.0
Data
Revised tidyr::fill() implementation to respect any ‘ORDER BY’ clause from the input while ensuring the same ‘ORDER BY’ operation is never duplicated in the generated Spark SQL query
Helper functions such as sdf_rbeta(), sdf_rbinom(), etc were implemented for generating Spark dataframes containing i.i.d. samples from commonly used probability distributions.
Fixed a bug with compute.tbl_spark()’s handling of positional args.
Fixed a bug that previously affected dplyr::tbl() when the source table is specified using dbplyr::in_schema().
Internal calls to sdf_schema.tbl_spark() and spark_dataframe.tbl_spark() are memoized to reduce performance overhead from repeated spark_invoke()s.
spark_read_image() was implemented to support image files as data sources.
spark_read_binary() was implemented to support binary data sources.
A specialized version of tbl_ptype() was implemented so that no data will be collected from Spark to R when dplyr calls tbl_ptype() on a Spark dataframe.
Added support for database parameter to src_tbls.spark_connection() (e.g., src_tbls(sc, database = "default") where sc is a Spark connection).
Fixed a null pointer issue with spark_read_jdbc() and spark_write_jdbc().
Distributed R
spark_apply() was improved to support tibble inputs containing list columns.
Spark dataframes created by spark_apply() will be cached by default to avoid re-computations.
spark_apply() and do_spark() now support qs and custom serializations.
The experimental auto_deps = TRUE mode was implemented for spark_apply() to infer required R packages for the closure, and to only copy required R packages to Spark worker nodes when executing the closure.
Extensions
Sparklyr extensions can now customize dbplyr SQL translator env used by sparklyr by supplying their own dbplyr SQL variant when calling spark_dependency() (see https://github.com/r-spark/sparklyr.sedona/blob/1455d3dea51ad16114a8112f2990ec542458aee2/R/dependencies.R#L38 for an example).
jarray() was implemented to convert a R vector into an Array[T] reference. A reference returned by jarray() can be passed to invoke* family of functions requiring an Array[T] as a parameter where T is some type that is more specific than java.lang.Object.
jfloat() function was implemented to cast any numeric type in R to java.lang.Float.
jfloat_array() was implemented to instantiate Array[java.lang.Float] from numeric values in R.
Serialization
Added null checks that were previously missing when collecting array columns from Spark dataframe to R.
array<byte> and array<boolean> columns in a Spark dataframe will be collected as raw() and logical() vectors, respectively, in R rather than integer arrays.
Fixed a bug that previously caused invoke params containing NaNs to be serialized incorrectly.
Spark ML
ml_compute_silhouette_measure() was implemented to evaluate the Silhouette measure of k-means clustering results.
spark_read_libsvm() now supports specifications of additional options via the options parameter. Additional libsvm data source options currently supported by Spark include numFeatures and vectorType (see https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html).
ml_linear_svc() will emit a warning if weight_col is specified while working with Spark 3.0 or above, as it is no longer supported in recent versions of Spark.
Fixed an issue with ft_one_hot_encoder.ml_pipeline() not working as expected.
Sparklyr 1.6.3
Data
Reduced the number of invoke() calls needed for sdf_schema() to avoid performance issues when processing Spark dataframes with non-trivial number of columns
Implement memoization for spark_dataframe.tbl_spark() and sdf_schema.tbl_spark() to reduce performance overhead for some dplyr use cases involving Spark dataframes with non-trivial number of columns
Sparklyr 1.6.2
Data
- A previous bug fix related to dplyr::compute() caching a Spark view needed to be further revised to take effect with dbplyr backend API edition 2
Sparklyr 1.6.1
Data
sdf_distinct() is implemented to be an R interface for distinct() operation on Spark dataframes (NOTE: this is different from the dplyr::distinct() operation, as dplyr::distinct() operation on a Spark dataframe now supports .keep_all = TRUE and has more complex ordering requirements)
Fixed a problem of some expressions being evaluated twice in transmute.tbl_spark() (see tidyverse/dbplyr#605)
dbExistsTable() now performs case insensitive comparison with table names to be consistent with how table names are handled by Spark catalog API
Fixed a bug with sql_query_save() not overwriting a temp table with identical name
Revised sparklyr:::process_tbl_name() to correctly handle inputs that are not table names
Bug fix: db_save_query.spark_connection() should also cache the view it created in Spark
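The case-insensitive table lookup noted for dbExistsTable() amounts to normalizing case on both sides before comparing. The base R sketch below uses a hypothetical table_exists() helper and a made-up table list; it is not the sparklyr method itself.

```r
# Hypothetical helper: compare table names without regard to case
table_exists <- function(name, existing_tables) {
  tolower(name) %in% tolower(existing_tables)
}

# Made-up catalog listing for illustration
catalog_tables <- c("flights", "airports")

found_upper <- table_exists("FLIGHTS", catalog_tables)
found_missing <- table_exists("planes", catalog_tables)
print(c(found_upper, found_missing))
```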
Sparklyr 1.6.0
Data
Made sparklyr compatible with both dbplyr edition 1 and edition 2 APIs
Revised sparklyr’s integration with dbplyr API so that dplyr::select(), dplyr::mutate(), and dplyr::summarize() verbs on Spark dataframes involving where() predicates can be correctly translated to Spark SQL (e.g., one can have sdf %>% select(where(is.numeric)) and sdf %>% summarize(across(starts_with("Petal"), mean)), etc)
Implemented dplyr::if_all() and dplyr::if_any() support for Spark dataframes
Added support for partition_by option in stream_write_* methods
Fixed a bug with URI handling affecting all spark_read_* methods
Avoided repeated creations of SimpleDateFormat objects and setTimeZone calls while collecting Date columns from a Spark dataframe
Schema specification for struct columns in spark_read_*() methods is now supported (e.g., spark_read_json(sc, path, columns = list(s = list(a = "integer", b = "double"))) says expect a struct column named s with each element containing a field named a and a field named b)
sdf_quantile() and ft_quantile_discretizer() now support approximation of weighted quantiles using a modified version of the Greenwald-Khanna algorithm that takes relative weight of each data point into consideration.
Fixed a problem of some expressions being evaluated twice in transmute.tbl_spark() (see tidyverse/dbplyr#605)
Made dplyr::distinct() behavior for Spark dataframes configurable: setting options(sparklyr.dplyr_distinct.impl = "tbl_lazy") will switch dplyr::distinct() implementation to a basic one that (1) only adds ‘DISTINCT’ clause to the current Spark SQL query, (2) does not support the .keep_all = TRUE option, and (3) does not have any ordering guarantee for the output.
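As a brief sketch of the configurable distinct() behavior described above, the option can be set for part of a session and then restored. Only the option name and the "tbl_lazy" value come from the notes; the variable names are illustrative.

```r
# Switch to the basic implementation and keep the previous setting
previous <- options(sparklyr.dplyr_distinct.impl = "tbl_lazy")

# Confirm which implementation is currently selected
current_impl <- getOption("sparklyr.dplyr_distinct.impl")
print(current_impl)

# ... dplyr::distinct() calls on Spark dataframes would go here ...

# Restore whatever was configured before
options(previous)
```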
Serialization
spark_write_rds() was implemented to support exporting all partitions of a Spark dataframe in parallel into RDS (version 2) files. Such RDS files will be written to the default file system of the Spark instance (i.e., local file if the Spark instance is running locally, or a distributed file system such as HDFS if the Spark instance is deployed over a cluster). The resulting RDS files, once downloaded onto the local file system, should be deserialized into R dataframes using collect_from_rds() (which calls readRDS() internally and also performs some important post-processing steps to support timestamp columns, date columns, and struct columns properly in R).
copy_to() can now import list columns of temporal values within a R dataframe as arrays of Spark SQL date/timestamp types when working with Spark 3.0 or above
Fixed a bug with copy_to()’s handling of NA values in list columns of a R dataframe
Spark map type will be collected as list instead of environment in R in order to support empty string as key
Fixed a configuration-related bug in sparklyr:::arrow_enabled()
Implemented spark-apply-specific configuration option for Arrow max records per batch, which can be different from the spark.sql.execution.arrow.maxRecordsPerBatch value from Spark session config
Connections
Created convenience functions for working with Spark runtime configurations
Fixed buggy exit code from the spark-submit process launched by sparklyr
Spark ML
Implemented R interface for Power Iteration Clustering
The handle_invalid option is added to ft_vector_indexer() (supported by Spark 2.3 or above)
Misc
- Fixed a bug with ~ within some path components not being normalized in sparklyr::livy_install()
Sparklyr 1.5.2
Connections
Fixed op_vars() specification in dplyr::distinct() verb for Spark dataframes
spark_disconnect() now closes the Spark monitoring connection correctly
Data
Implement support for stratified sampling in ft_dplyr_transformer()
Added support for na.rm in dplyr rowSums() function for Spark dataframes
Sparklyr 1.5.1
Connections
A bug in how multiple --conf values were handled in some scenarios within the spark-submit shell args, which was introduced in sparklyr 1.4, has been fixed.
A bug with livy.jars configuration was fixed (#2843)
Data
tbl() methods were revised to be compatible with dbplyr 2.0 when handling inputs of the form "<schema name>.<table name>"
Sparklyr 1.5.0
Connections
spark_web() has been revised to work correctly in environments such as RStudio Server or RStudio Cloud where the Spark web UI URLs such as “http://localhost:4040/jobs/” need to be translated with rstudioapi::translateLocalUrl() to be accessible.
The problem with bundle file name collisions when session_id is not provided has been fixed in spark_apply_bundle().
Support for sparklyr.livy.sources is removed completely as it is no longer needed as a workaround when Spark version is specified.
Data
stream_lag() is implemented to provide the equivalent functionality of dplyr::lag() for streaming Spark dataframes while also supporting additional filtering of “outdated” records based on timestamp threshold.
A specialized version of dplyr::distinct() is implemented for Spark dataframes that supports .keep_all = TRUE and correctly satisfies the “rows are a subset of the input but appear in the same order” requirement stated in the dplyr documentation.
The default value for the repartition parameter of sdf_seq() has been corrected.
Some implementation detail was revised to make sparklyr 1.5 fully compatible with dbplyr 2.0.
sdf_expand_grid() was implemented to support roughly the equivalent of expand.grid() for Spark dataframes while also offering additional Spark-specific options such as broadcast hash joins, repartitioning, and caching of the resulting Spark dataframe in memory.
sdf_quantile() now supports calculation for multiple columns.
Both lead() and lag() methods for dplyr interface of sparklyr are fixed to correctly accept the order_by parameter.
The cumprod() window aggregation function for dplyr was reimplemented to correctly handle null values in Spark dataframes.
Support for missing parameter is implemented for the ifelse()/if_else() function for dplyr.
A weighted.mean() summarizer was implemented for dplyr interface of sparklyr.
A workaround was created to ensure NA_real_ is handled correctly within the contexts of dplyr::mutate() and dplyr::transmute() methods (e.g., sdf %>% dplyr::mutate(z = NA_real_) should result in a column named “z” with double-precision SQL type)
Support for R-like subsetting operator ([) was implemented for selecting a subset of columns from a Spark dataframe.
The rowSums() function was implemented for dplyr interface of sparklyr.
The sdf_partition_sizes() function was created to enable efficient query of partition sizes within a Spark dataframe.
Stratified sampling for Spark dataframes has been implemented and can be expressed using dplyr grammar as <spark dataframe> %>% dplyr::group_by(<columns>) %>% dplyr::sample_n(...) or <spark dataframe> %>% dplyr::group_by(<columns>) %>% dplyr::sample_frac(...) where <columns> is a list of grouping column(s) defining the strata (i.e., the sampling specified by dplyr::sample_n() or dplyr::sample_frac() will be applied to each group defined by dplyr::group_by(<columns>))
The implementations of dplyr::sample_n() and dplyr::sample_frac() have been revised to first perform aggregations on individual partitions before merging aggregated results from all partitions, which is more efficient than mapPartitions() followed by reduce().
sdf_unnest_longer() and sdf_unnest_wider() were implemented and offer the equivalents of tidyr::unnest_longer() and tidyr::unnest_wider() for Spark dataframes.
Serialization
copy_to() now serializes R dataframes into RDS format instead of CSV format if arrow is unavailable. RDS serialization is approximately 48% faster than CSV and allows multiple correctness issues related to CSV serialization to be fixed easily in sparklyr.
copy_to() and collect() now correctly preserve NA_real_ (NA_real_ from a R dataframe, once translated as null in a Spark dataframe, used to be incorrectly collected as NaN in previous versions of sparklyr).
copy_to() can now distinguish "NA" from NA as expected.
copy_to() now supports importing binary columns from R dataframes to Spark.
Reduced serialization overhead in Spark-based foreach parallel backend created with registerDoSpark().
Sparklyr 1.4.0
Connections
RAPIDS GPU acceleration plugin can now be enabled with spark_connect(..., package = "rapids") and configured with spark_config options prefixed with “spark.rapids.”
Enabled support for http{,s} proxy plus additional CURL options for Livy connections
In sparklyr error message, suggest options(sparklyr.log.console = TRUE) as a trouble-shooting step whenever the “sparklyr gateway not responding” error occurs
Addressed an inter-op issue with Livy + Spark 2.4 (https://github.com/sparklyr/sparklyr/issues/2641)
Added configurable retries for Gateway ports query (https://github.com/sparklyr/sparklyr/pull/2654)
App name setting now takes effect as expected in YARN cluster mode (https://github.com/sparklyr/sparklyr/pull/2675)
Data
Support for newly introduced higher-order functions in Spark 3.0 (e.g., array_sort, map_filter, map_zip_with, and many others)
Implemented parallelizable weighted sampling methods for sampling from a Spark data frame with and without replacement using exponential variates
Replaced dplyr::sample_* implementations based on TABLESAMPLE with alternative implementation that can return exactly the number of rows or fraction specified and also properly support sampling with-replacement, without-replacement, and repeatable sampling use cases
All higher-order functions and sampling methods are made directly accessible through dplyr verbs
Made grepl part of the dplyr interface for Spark data frames
Tidyr verbs such as pivot_wider, pivot_longer, nest, unnest, separate, unite, and fill now have specialized implementations in sparklyr for working with Spark data frames
Made dplyr::inner_join, dplyr::left_join, dplyr::right_join, and dplyr::full_join replace '.' with '_' in suffix parameter when working with Spark data frames (https://github.com/sparklyr/sparklyr/issues/2648)
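The idea behind weighted sampling with exponential variates can be sketched in base R. This is a generic illustration of the without-replacement case under the assumption that all weights are positive; the sample_weighted() helper is hypothetical and is not sparklyr's Spark-side implementation. Each row draws an independent exponential key with rate equal to its weight, and the k smallest keys form the sample. Because every key is computed independently, each partition can keep its own k smallest keys before a final merge.

```r
# Hypothetical helper: weighted sampling without replacement.
# Each element draws key ~ Exp(rate = weight); the k smallest keys win.
sample_weighted <- function(weights, k) {
  stopifnot(all(weights > 0), k <= length(weights))
  keys <- rexp(length(weights), rate = weights)
  order(keys)[seq_len(k)]
}

set.seed(42)
weights <- c(a = 1, b = 5, c = 10, d = 0.5)

# Indices of the sampled elements, then their names
picked <- sample_weighted(weights, k = 2)
print(names(weights)[picked])
```

Elements with larger weights tend to draw smaller keys, so they are selected more often across repeated runs.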
Distributed R
Fixed an issue with global variables in registerDoSpark (https://github.com/sparklyr/sparklyr/pull/2608)
Revised spark_read_compat_param to avoid collision on names assigned to different Spark data frames
Misc
Fixed a rendering issue with HTML reference pages
Made test reporting in Github CI workflows more informative (https://github.com/sparklyr/sparklyr/pull/2672)
Spark ML
ft_robust_scaler was created as the R interface for the RobustScaler functionality in Spark 3 or above
Sparklyr 1.3.1
Distributed R
- Fixed a bug in ordering of parameters for a lambda expression when the lambda expression passed to a hof_* method is specified with a R formula and the lambda takes 2 parameters
Sparklyr 1.3.0
Spark ML
ml_evaluate() methods are implemented for ML clustering and classification models
Distributed R
Created helper methods to integrate Spark SQL higher-order functions with dplyr::mutate
Implemented option to pass partition index as a named parameter to spark_apply() transform function
Enabled transform function of spark_apply() to return nested lists
Added option to return R objects instead of Spark data frame rows from transform function of spark_apply
sdf_collect() now supports fetching Spark data frame row-by-row rather than column-by-column, and fetching rows using iterator instead of collecting all rows into memory
Support for partition when using barrier execution in spark_apply (#2454)
Connections
Sparklyr can now connect with Spark 2.4 built with Scala 2.12 using spark_connect(..., scala_version = "2.12")
Hive integration can now be disabled by configuration in spark_connect() (#2465)
A JVM object reference counting bug affecting secondary Spark connections was fixed (#2515)
Revised JObj envs initialization for Databricks connections (#2533)
Serialization
Timezones, if present in data, are correctly represented now in Arrow serialization
Embedded nul bytes are removed from strings when reading strings from Spark to R (#2250)
Support to collect objects of type SeqWrapper (#2441)
Data
Created helper methods to integrate Spark SQL higher-order functions with dplyr::mutate
New spark_read() method to allow user-defined R functions to be run on Spark workers to import data into a Spark data frame
spark_write() method is implemented to allow user-defined functions to be run on Spark workers to export data from a Spark data frame
Avro functionalities such as spark_read_avro(), spark_write_avro(), sdf_from_avro(), and sdf_to_avro() are implemented and can be optionally enabled with spark_connect(..., package = "avro")
Extensions
- Fixed a bug where Spark package repositories specification was not honored by spark_dependency(). The repositories parameter of spark_dependency() now works as expected.
Misc
Fixed warnings for deprecated functions (#2431)
More test coverage for Databricks Connect and Databricks Notebook modes
Embedded R sources are now included as resources rather than as a Scala string literal in sparklyr-*.jar files, so that they can be updated without re-compilation of Scala source files
A mechanism is created to verify embedded sources in sparklyr-*.jar files are in-sync with current R source files and this verification is now part of the Github CI workflow for sparklyr
Sparklyr 1.2.0
Distributed R
Add support for using Spark as a foreach parallel backend
Fixed a bug with how columns parameter was interpreted in spark_apply
Data
Allow sdf_query_plan to also get analyzed plan
Add support for serialization of R date values into corresponding Hive date values
Fixed the issue of date or timestamp values representing the UNIX epoch (1970-01-01) being deserialized incorrectly into NAs
Better support for querying and deserializing Spark SQL struct columns when working with Spark 2.4 or above
Add support in copy_to() for columns with nested lists (#2247).
Significantly improve collect() performance for columns with nested lists (#2252).
Connection
Add support for Databricks Connect
Add support for copy_to in Databricks connection
Ensure spark apply bundle files created by multiple Spark sessions don’t overwrite each other
Fixed an interop issue with spark-submit when running with Spark 3 preview
Fixed an interop issue with Sparklyr gateway connection when running with Spark 3 preview
Fixed a race condition of JVM object with refcount 1 being removed from JVM object tracker before pending method invocation(s) on them could be initiated (NOTE: previously this would only happen when the R process was running under high memory pressure)
Allow a chain of JVM method invocations to be batched into 1 invoke call
Removal of unneeded objects from JVM object tracker no longer blocks subsequent JVM method invocations
Add support for JDK11 for Spark 3 preview.
Misc
Support for installing Spark 3.0 Preview 2.
Emit more informative error message if network interface required for spark_connect is not up
Fixed a bug preventing more than 10 rows of a Spark table from being printed from R
Fixed a spelling error in print method for ml_model_naive_bayes objects
Made sdf_drop_duplicates an exported function (previously it was not exported by mistake)
Fixed a bug in summary() of ml_linear_regression
Sparklyr 1.1.0
Distributed R
- Add support for barrier execution mode with barrier = TRUE in spark_apply() (@samuelmacedo83, #2216).
Streaming
Add support for stream_read_delta() and stream_write_delta().
Fixed typo in stream_read_socket().
Data
Allow using Scala types in schema specifications. For example, StringType in the columns parameter for spark_read_csv() (@jozefhajnala, #2226)
Add support for DBI 1.1 to implement missing dbQuoteLiteral signature (#2227).
Livy
Add support for Livy 0.6.0.
Deprecate uploading sources to Livy; a JAR is now always used and the version parameter in spark_connect() is always required.
Add config sparklyr.livy.branch to specify the branch used for the sparklyr JAR.
Add config sparklyr.livy.jar to configure path or URL to sparklyr JAR.
Data
- Add support for partition_by when using spark_write_delta() (#2228).
Sparklyr 1.0.5
Serialization
- R environments are now sent to Scala Maps rather than java.util.Map[Object, Object] (#1058).
Data
Allow sdf_sql() to accept glue strings (@yutannihilation, #2171).
Support to read and write from Delta Lake using spark_read_delta() and spark_write_delta() (#2148).
Connections
spark_connect() supports new packages parameter to easily enable kafka and delta (#2148).
spark_disconnect() returns invisibly (#2028).
Configuration
- Support to specify config file location using the SPARKLYR_CONFIG_FILE environment variable (@AgrawalAmey, #2153).
Compilation
- Support for Scala 2.12 (@lu-wang-dl, #2154).
YARN
- Fix curl_fetch_memory error when using YARN Cluster mode (#2157).
Sparklyr 1.0.4
Arrow
- Support for Apache Arrow 0.15 (@nealrichardson, #2132).
Sparklyr 1.0.3
Kubernetes
- Support for port forwarding in Windows using RStudio terminal.
dplyr
- Fix support for compute() in Spark 1.6 (#2099)
Data
- The spark_read_*() functions now support multiple parameters (@jozefhajnala, #2118).
Connections
- Fix for Qubole connections for single user and multiple sessions (@vipul1409, #2128).
Sparklyr 1.0.2
Connections
- Support for Qubole connections using mode = "qubole" (@vipul1409, #2039).
Extensions
- When invoke() fails due to mismatched parameters, a warning with info is logged.
RStudio
- Spark UI path can now be accessed even when the R session and Spark are busy.
Distributed
Configuration setting sparklyr.apply.serializer can be used to select serializer version in spark_apply().
Fix for spark_apply_log() and use RClosure as logging component.
ML
ml_corr() retrieves a tibble for better formatting.
Misc
- Support for Spark 2.3.3 and 2.4.3.
Data
The infer_schema parameter now defaults to is.null(column).
The spark_read_*() functions support loading data with named path but no explicit name.
Sparklyr 1.0.1
ML
ml_lda(): Allow passing of optional arguments via ... to regex tokenizer, stop words remover, and count vectorizer components in the formula API.
Implemented ml_evaluate() for logistic regression, linear regression, and GLM models.
Implemented print() method for ml_summary objects.
Deprecated compute_cost() for KMeans in Spark 2.4 (#1772).
Added missing internal constructor for clustering evaluator (#1936).
sdf_partition() has been renamed to sdf_random_split().
Added ft_one_hot_encoder_estimator() (#1337).
Misc
Added sdf_crosstab() to create contingency tables.
Fix tibble::as.tibble() deprecation warning.
Connections
- Reduced default memory for local connections when Java x64 is not installed (#1931).
Batches
- Add support in spark-submit with R file to pass additional arguments to R file (#1942).
Distributed R
- Fix support for multiple library paths when using spark.r.libpaths (@mattpollock, #1956).
Extensions
Support for creating a Spark extension package using spark_extension().
Add support for repositories in spark_dependency().
DataFrames
- Fix sdf_bind_cols() when using dbplyr 1.4.0.
Kubernetes
- Fix regression in spark_config_kubernetes() configuration helper.
Sparklyr 1.0.0
Arrow
- Support for Apache Arrow using the arrow package.
ML
The dataset parameter for estimator feature transformers has been deprecated (#1891).
ml_multilayer_perceptron_classifier() gains probabilistic classifier parameters (#1798).
Removed support for all undocumented/deprecated parameters. These are mostly dot case parameters from pre-0.7.
Remove support for deprecated function(pipeline_stage, data) signature in sdf_predict/transform/fit functions.
Soft deprecate sdf_predict/transform/fit functions. Users are advised to use ml_predict/transform/fit functions instead.
Utilize the ellipsis package to provide warnings when unsupported arguments are specified in ML functions.
Livy
Support for sparklyr extensions when using Livy.
Significant performance improvements by using version in spark_connect() which enables using the sparklyr JAR rather than sources.
Improved memory use in Livy by using string builders and avoiding print backs.
Data
Fix for DBI::sqlInterpolate() and related methods to properly quote parameterized queries.
copy_to() names tables sparklyr_tmp_ instead of sparklyr_ for consistency with other temp tables and to avoid rendering them under the connections pane.
copy_to() and collect() are not re-exported since they are commonly used even when using DBI or outside data analysis use cases.
Support for reading path as the second parameter in spark_read_*() when no name is specified (e.g. spark_read_csv(sc, "data.csv")).
Support for batches in sdf_collect() and dplyr::collect() to retrieve data incrementally using a callback function provided through a callback parameter. Useful when retrieving larger datasets.
Support for batches in sdf_copy_to() and dplyr::copy_to() by passing a list of callbacks that retrieve data frames. Useful when uploading larger datasets.
spark_read_source() now has a path parameter for specifying file path.
Support for whole parameter for spark_read_text() to read an entire text file without splitting contents by line.
Broom
- Implemented tidy(), augment(), and glance() for ml_lda() and ml_als() models (@samuelmacedo83)
Connections
Local connection defaults now to 2GB.
Support to install and connect based on major Spark versions, for instance: spark_connect(master = "local", version = "2.4").
Support for installing and connecting to Spark 2.4.
Serialization
- Faster retrieval of string arrays.
YARN
New YARN action under RStudio connection pane extension to launch YARN UI. Configurable through the sparklyr.web.yarn configuration setting.
Support for property expansion in yarn-site.xml (@lgongmsft, #1876).
Distributed R
- The memory parameter in spark_apply() now defaults to FALSE when the name parameter is not specified.
Other
Removed deprecated sdf_mutate().
Remove exported ensure_* functions which were deprecated.
Fixed missing Hive tables not rendering under some Spark distributions (#1823).
Remove dependency on broom.
Fixed re-entrancy job progress issues when running RStudio 1.2.
Tables with periods supported by setting sparklyr.dplyr.period.splits to FALSE.
sdf_len(), sdf_along() and sdf_seq() default to 32 bit integers but allow support for 64 bits through bits parameter.
Support for detecting Spark version using spark-submit.
Sparklyr 0.9.4
Improved multiple streaming documentation examples (#1801, #1805, #1806).
Fix issue while printing Spark data frames under tibble 2.0.0 (#1829).
Support for stream_write_console() to write to console log.
Support for stream_read_socket() to read socket streams.
Fix to spark_read_kafka() to remove unused path.
Sparklyr 0.9.3
Fix to make spark_config_kubernetes() work with variable jar parameters.
Support to install and use Spark 2.4.0.
Improvements and fixes to spark_config_kubernetes() parameters.
Support for sparklyr.connect.ondisconnect config setting to allow cleanup of resources when using Kubernetes.
spark_apply() and spark_apply_bundle() properly dereference symlinks when creating package bundle (@awblocker, #1785)
Fix tableName warning triggered while connecting.
Deprecate sdf_mutate() (#1754).
Fix requirement to specify SPARK_HOME_VERSION when version parameter is set in spark_connect().
Cloudera autodetect Spark version improvements.
Fixed default for session in reactiveSpark().
Removed stream_read_jdbc() and stream_write_jdbc() since they are not yet implemented in Spark.
Support for collecting NA values from logical columns (#1729).
Proactively clean JVM objects when R object is deallocated.
Sparklyr 0.9.2
Support for Spark 2.3.2.
Fix installation error with older versions of rstudioapi (#1716).
Fix missing callstack and error case while logging in spark_apply().
Proactively clean JVM objects when R object is deallocated.
Broom
- Implemented tidy(), augment(), and glance() for ml_linear_svc() and ml_pca() models (@samuelmacedo83)
Sparklyr 0.9.2
Support for Spark 2.3.2.
Fix installation error with older versions of rstudioapi (#1716).
Fix missing callstack and error case while logging in spark_apply().
Fix regression in sdf_collect() failing to collect tables.
Fix new connection RStudio selectors colors when running under OS X Mojave.
Support for launching Livy logs from connection pane.
Sparklyr 0.9.2
Removed overwrite parameter in spark_read_table() (#1698).
Fix regression preventing using R 3.2 (#1695).
Additional jar search paths under Spark 2.3.1 (#1694)
Sparklyr 0.9.1
Terminate streams when Shiny app terminates.
Fix dplyr::collect() with Spark streams and improve printing.
Fix regression in sparklyr.sanitize.column.names.verbose setting which would cause verbose column renames.
Fix to stream_write_kafka() and stream_write_jdbc().
Sparklyr 0.9.0
Streaming
Support for stream_read_*() and stream_write_*() to read from and to Spark structured streams.
Support for dplyr, sdf_sql(), spark_apply() and scoring pipeline in Spark streams.
Support for reactiveSpark() to create a shiny reactive over a Spark stream.
Support for convenience functions stream_*() to stop, change triggers, print, generate test streams, etc.
Monitoring
Support for interrupting long running operations and recover gracefully using the same connection.
Support cancelling Spark jobs by interrupting R session.
Support for monitoring job progress within RStudio, requires RStudio 1.2.
Progress reports can be turned off by setting sparklyr.progress to FALSE in spark_config().
Kubernetes
Added config sparklyr.gateway.routing to avoid routing to ports since Kubernetes clusters have unique spark masters.
Change backend ports to be chosen deterministically by searching for free ports starting on sparklyr.gateway.port which defaults to 8880. This allows users to enable port forwarding with kubectl port-forward.
Added support to set config sparklyr.events.aftersubmit to a function that is called after spark-submit which can be used to automatically configure port forwarding.
Batches
- Added support for spark_submit() to assist submitting non-interactive Spark jobs.
Spark ML
- (Breaking change) The formula API for ML classification algorithms no longer indexes numeric labels, to avoid the confusion of 0 being mapped to "1" and vice versa. This means that if the largest numeric label is N, Spark will fit an N+1-class classification model, regardless of how many distinct labels there are in the provided training set (#1591).
- Fix retrieval of coefficients in ml_logistic_regression() (@shabbybanks, #1596).
- (Breaking change) For model objects, lazy val and def attributes have been converted to closures, so they are not evaluated at object instantiation (#1453).
- Input and output column names are no longer required to construct pipeline objects to be consistent with Spark (#1513).
- Vector attributes of pipeline stages are now printed correctly (#1618).
- Deprecate various aliases favoring method names in Spark.
  - ml_binary_classification_eval()
  - ml_classification_eval()
  - ml_multilayer_perceptron()
  - ml_survival_regression()
  - ml_als_factorization()
- Deprecate incompatible signatures for sdf_transform() and ml_transform() families of methods; the former should take a tbl_spark as the first argument while the latter should take a model object as the first argument.
Data
Implemented support for DBI::db_explain() (#1623).
Fix for timestamp fields when using copy_to() (#1312, @yutannihilation).
Added support to read and write ORC files using spark_read_orc() and spark_write_orc() (#1548).
Livy
Fixed must share the same src error for sdf_broadcast() and other functions when using Livy connections.
Added support for logging sparklyr server events and logging sparklyr invokes as comments in the Livy UI.
Added support to open the Livy UI from the connections viewer while using RStudio.
Improve performance in Livy for long execution queries, fixed livy.session.command.timeout and support for livy.session.command.interval to control max polling while waiting for command response (#1538).
Fixed Livy version with MapR distributions.
Removed install column from livy_available_versions().
Distributed R
Added name parameter to spark_apply() to optionally name resulting table.
Fix to spark_apply() to retain column types when NAs are present (#1665).
spark_apply() now supports rlang anonymous functions. For example, sdf_len(sc, 3) %>% spark_apply(~.x+1).
Breaking Change: spark_apply() no longer defaults to the input column names when the columns parameter is not specified.
Support for reading column names from the R data frame returned by spark_apply().
Fix to support retrieving empty data frames in grouped spark_apply() operations (#1505).
Added support for sparklyr.apply.packages to configure default behavior for spark_apply() parameters (#1530).
Added support for spark.r.libpaths to configure package library in spark_apply() (#1530).
Connections
Default to Spark 2.3.1 for installation and local connections (#1680).
ml_load() no longer keeps extraneous table views which were cluttering up the RStudio Connections pane (@randomgambit, #1549).
Avoid preparing Windows environment in non-local connections.
Extensions
The ensure_* family of functions is deprecated in favor of forge which doesn’t use NSE and provides more informative error messages for debugging (#1514).
Support for sparklyr.invoke.trace and sparklyr.invoke.trace.callstack configuration options to trace all invoke() calls.
Support to invoke methods with char types using single character strings (@lawremi, #1395).
Serialization
- Fixed collection of Date types to support correct local JVM timezone to UTC ().
Documentation
- Many new examples for ft_binarizer(), ft_bucketizer(), ft_min_max_scaler(), ft_max_abs_scaler(), ft_standard_scaler(), ml_kmeans(), ml_pca(), ml_bisecting_kmeans(), ml_gaussian_mixture(), ml_naive_bayes(), ml_decision_tree(), ml_random_forest(), ml_multilayer_perceptron_classifier(), ml_linear_regression(), ml_logistic_regression(), ml_gradient_boosted_trees(), ml_generalized_linear_regression(), ml_cross_validator(), ml_evaluator(), ml_clustering_evaluator(), ml_corr(), ml_chisquare_test() and sdf_pivot() (@samuelmacedo83).
Broom
- Implemented tidy(), augment(), and glance() for ml_aft_survival_regression(), ml_isotonic_regression(), ml_naive_bayes(), ml_logistic_regression(), ml_decision_tree(), ml_random_forest(), ml_gradient_boosted_trees(), ml_bisecting_kmeans(), ml_kmeans() and ml_gaussian_mixture() models (@samuelmacedo83)
Configuration
Deprecated configuration option sparklyr.dplyr.compute.nocache.
Added spark_config_settings() to list all sparklyr configuration settings and describe them, cleaned all settings and grouped by area while maintaining support for previous settings.
Static SQL configuration properties are now respected for Spark 2.3, and spark.sql.catalogImplementation defaults to hive to maintain Hive support (#1496, #415).
spark_config() values can now also be specified as options().
Support for functions as values in entries to spark_config() to enable advanced configuration workflows.
Sparklyr 0.8.4
Added support for spark_session_config() to modify spark session settings.
Added support for sdf_debug_string() to print execution plan for a Spark DataFrame.
Fixed DESCRIPTION file to include test packages as requested by CRAN.
Support for sparklyr.spark-submit as config entry to allow customizing the spark-submit command.
Changed spark_connect() to give precedence to the version parameter over SPARK_HOME_VERSION and other automatic version detection mechanisms, improved automatic version detection in Spark 2.X.
Fixed sdf_bind_rows() with dplyr 0.7.5 and prepend id column instead of appending it to match behavior.
broom::tidy() for linear regression and generalized linear regression models now give correct results (#1501).
Sparklyr 0.8.3
- Support for Spark 2.3 in local windows clusters (#1473).
Sparklyr 0.8.2
Support for resource managers using https in yarn-cluster mode (#1459).
Fixed regression for connections using Livy and Spark 1.6.X.
Sparklyr 0.8.1
- Fixed regression for connections using mode with databricks.
Sparklyr 0.8.0
Spark ML
Added ml_validation_metrics() to extract validation metrics from cross validator and train split validator models.
ml_transform() now also takes a list of transformers, e.g. the result of ml_stages() on a PipelineModel (#1444).
Added collect_sub_models parameter to ml_cross_validator() and ml_train_validation_split() and helper function ml_sub_models() to allow inspecting models trained for each fold/parameter set (#1362).
Added parallelism parameter to ml_cross_validator() and ml_train_validation_split() to allow tuning in parallel (#1446).
Added support for feature_subset_strategy parameter in GBT algorithms (#1445).
Added string_order_type to ft_string_indexer() to allow control over how strings are indexed (#1443).
Added ft_string_indexer_model() constructor for the string indexer transformer (#1442).
Added ml_feature_importances() for extracting feature importances from tree-based models (#1436). ml_tree_feature_importance() is maintained as an alias.
Added ml_vocabulary() to extract vocabulary from count vectorizer model and ml_topics_matrix() to extract matrix from LDA model.
ml_tree_feature_importance() now works properly with decision tree classification models (#1401).
Added ml_corr() for calculating correlation matrices and ml_chisquare_test() for performing chi-square hypothesis testing (#1247).
ml_save() outputs message when model is successfully saved (#1348).
ml_ routines no longer capture the calling expression (#1393).
Added support for offset argument in ml_generalized_linear_regression() (#1396).
Fixed regression blocking use of response-features syntax in some ml_ functions (#1302).
Added support for Huber loss for linear regression (#1335).
ft_bucketizer() and ft_quantile_discretizer() now support multiple input columns (#1338, #1339).
Added ft_feature_hasher() (#1336).
Added ml_clustering_evaluator() (#1333).
ml_default_stop_words() now returns English stop words by default (#1280).
Support the sdf_predict(ml_transformer, dataset) signature with a deprecation warning. Also added a deprecation warning to the usage of sdf_predict(ml_model, dataset). (#1287)
Fixed regression blocking use of ml_kmeans() in Spark 1.6.x.
Extensions
invoke*() method dispatch now supports Char and Short parameters. Also, Long parameters now allow numeric arguments, but integers are supported for backwards compatibility (#1395).
invoke_static() now supports calling Scala’s package objects (#1384).
spark_connection and spark_jobj classes are now exported (#1374).
Distributed R
Added support for profile parameter in spark_apply() that collects a profile to measure performance that can be rendered using the profvis package.
Added support for spark_apply() under Livy connections.
Fixed file not found error in spark_apply() while working under low disk space.
Added support for sparklyr.apply.options.rscript.before to run a custom command before launching the R worker role.
Added support for sparklyr.apply.options.vanilla to be set to FALSE to avoid using --vanilla while launching R worker role.
Fixed serialization issues most commonly hit while using spark_apply() with NAs (#1365, #1366).
Fixed issue with dates or date-times not roundtripping with spark_apply() (#1376).
Fixed data frame provided by spark_apply() to provide characters, not factors (#1313).
Miscellaneous
Fixed typo in sparklyr.yarn.cluster.hostaddress.timeot (#1318).
Fixed regression blocking use of livy.session.start.timeout parameter in Livy connections.
Added support for Livy 0.4 and Livy 0.5.
Livy now supports Kerberos authentication.
Default to Spark 2.3.0 for installation and local connections (#1449).
yarn-cluster now supported by connecting with master="yarn" and config entry sparklyr.shell.deploy-mode set to cluster (#1404).
sample_frac() and sample_n() now work properly in nontrivial queries (#1299)
sdf_copy_to() no longer gives a spurious warning when user enters a multiline expression for x (#1386).
spark_available_versions() was changed to only return available Spark versions, Hadoop versions can be still retrieved using hadoop = TRUE.
spark_installed_versions() was changed to retrieve the full path to the installation folder.
cbind() and sdf_bind_cols() don’t use NSE internally anymore and no longer output names of mismatched data frames on error (#1363).
Sparklyr 0.7.0
Added support for Spark 2.2.1.
Switched copy_to serializer to use Scala implementation, this change can be reverted by setting the sparklyr.copy.serializer option to csv_file.
Added support for spark_web() for Livy and Databricks connections when using Spark 2.X.
Fixed SIGPIPE error under spark_connect() immediately after a spark_disconnect() operation.
spark_web() is more reliable under Spark 2.X by making use of a new API to programmatically find the right address.
Added support in dbWriteTable() for temporary = FALSE to allow persisting table across connections. Changed default value for temporary to TRUE to match DBI specification, for compatibility, default value can be reverted back to FALSE using the sparklyr.dbwritetable.temp option.
ncol() now returns the number of columns instead of NA, and nrow() now returns NA_real_.
Added support to collect VectorUDT column types with nested arrays.
Fixed issue in which connecting to Livy would fail due to long user names or long passwords.
Fixed error in the Spark connection dialog for clusters using a proxy.
Improved support for Spark 2.X under Cloudera clusters by prioritizing use of spark2-submit over spark-submit.
Livy new connection dialog now prompts for password using rstudioapi::askForPassword().
Added schema parameter to spark_read_parquet() that enables reading a subset of the schema to increase performance.
Implemented sdf_describe() to easily compute summary statistics for data frames.
Fixed data frames with dates in spark_apply() retrieved as Date instead of doubles.
Added support to use invoke() with arrays of POSIXlt and POSIXct.
Added support for context parameter in spark_apply() to allow callers to pass additional contextual information to the f() closure.
Implemented workaround to support mode = 'append' in spark_write_table().
Various ML improvements, including support for pipelines, additional algorithms, hyper-parameter tuning, and better model persistence.
Added spark_read_libsvm() for reading libsvm files.
Added support for separating struct columns in sdf_separate_column().
Fixed collection of short, float and byte to properly return NAs.
Added sparklyr.collect.datechars option to enable collecting DateType and TimestampTime as characters to support compatibility with previous versions.
Fixed collection of DateType and TimestampTime from character to proper Date and POSIXct types.
Sparklyr 0.6.4
Added support for HTTPS for yarn-cluster which is activated by setting yarn.http.policy to HTTPS_ONLY in yarn-site.xml.
Added support for sparklyr.yarn.cluster.accepted.timeout under yarn-cluster to allow users to wait for resources under cluster with high waiting times.
Fix to spark_apply() when package distribution deadlock triggers in environments where multiple executors run under the same node.
Added support in spark_apply() for specifying a list of packages to distribute to each worker node.
Added support in yarn-cluster for sparklyr.yarn.cluster.lookup.prefix, sparklyr.yarn.cluster.lookup.username and sparklyr.yarn.cluster.lookup.byname to control the new application lookup behavior.
Sparklyr 0.6.3
Enabled support for Java 9 for clusters configured with Hadoop 2.8. Java 9 blocked on ‘master=local’ unless ‘options(sparklyr.java9 = TRUE)’ is set.
Fixed issue in spark_connect() where using set.seed() before connection would cause session ids to be duplicates and connections to be reused.
Fixed issue in spark_connect() blocking gateway port when connection was never started to the backend, for instance, while interrupting the R session while connecting.
Performance improvement for querying field names from tables impacting tables and dplyr queries, most noticeable in na.omit with several columns.
Fix to spark_apply() when closure returns a data.frame that contains no rows and has one or more columns.
Fix to spark_apply() while using tryCatch() within closure and increased callstack printed to logs when error triggers within closure.
Added support for the SPARKLYR_LOG_FILE environment variable to specify the file used for log output.
Fixed regression for union_all() affecting Spark 1.6.X.
Added support for na.omit.cache option that when set to FALSE will prevent na.omit from caching results when rows are dropped.
Added support in spark_connect() for yarn-cluster with high-availability enabled.
Added support for spark_connect() with master="yarn-cluster" to query YARN resource manager API and retrieve the correct container host name.
Fixed issue in invoke() calls while using integer arrays that contain NA which can be commonly experienced while using spark_apply().
Added topics.description under ml_lda() result.
Added support for ft_stop_words_remover() to strip out stop words from tokens.
Feature transformers (ft_* functions) now explicitly require input.col and output.col to be specified.
Added support for spark_apply_log() to enable logging in worker nodes while using spark_apply().
Fix to spark_apply() for SparkUncaughtExceptionHandler exception while running over large jobs that may overlap during an, now unnecessary, unregister operation.
Fix race-condition first time spark_apply() is run when more than one partition runs in a worker and both processes try to unpack the packages bundle at the same time.
spark_apply() now adds generic column names when needed and validates f is a function.
Improved documentation and error cases for metric argument in ml_classification_eval() and ml_binary_classification_eval().
Fix to spark_install() to use the /logs subfolder to store local log4j logs.
Fix to spark_apply() when R is used from a worker node since worker node already contains packages but still might be triggering different R session.
Fix connection from closing when invoke() attempts to use a class with a method that contains a reference to an undefined class.
Implemented all tuning options from Spark ML for ml_random_forest(), ml_gradient_boosted_trees(), and ml_decision_tree().
Avoid tasks failing under spark_apply() and multiple concurrent partitions running while selecting backend port.
Added support for numeric arguments for n in lead() for dplyr.
Added unsupported error message to sample_n() and sample_frac() when Spark is not 2.0 or higher.
Fixed SIGPIPE error under spark_connect() immediately after a spark_disconnect() operation.
Added support for sparklyr.apply.env. under spark_config() to allow spark_apply() to initialize environment variables.
Added support for spark_read_text() and spark_write_text() to read from and to plain text files.
Added support for RStudio project templates to create an “R Package using sparklyr”.
Fix compute() to trigger refresh of the connections view.
Added a k argument to ml_pca() to enable specification of number of principal components to extract. Also implemented sdf_project() to project datasets using the results of ml_pca() models.
Added support for additional livy session creation parameters using the livy_config() function.
Sparklyr 0.6.2
- Fix connection_spark_shinyapp() under RStudio 1.1 to avoid error while listing Spark installation options for the first time.
Sparklyr 0.6.1
Fixed error in spark_apply() that may be triggered when multiple CPUs are used in a single node due to race conditions while accessing the gateway service and another in the JVMObjectTracker.
spark_apply() now supports explicit column types using the columns argument to avoid sampling types.
spark_apply() with group_by no longer requires persisting to disk nor memory.
Added support for Spark 1.6.3 under spark_install().
spark_apply() now logs the current callstack when it fails.
Fixed error triggered while processing empty partitions in spark_apply().
Fixed slow printing issue caused by print calculating the total row count, which is expensive for some tables.
Fixed sparklyr 0.6 issue blocking concurrent sparklyr connections, which required to set config$sparklyr.gateway.remote = FALSE as workaround.
Sparklyr 0.6.0
Distributed R
Added packages parameter to spark_apply() to distribute packages across worker nodes automatically.
Added sparklyr.closures.rlang as a spark_config() value to support generic closures provided by the rlang package.
Added config options sparklyr.worker.gateway.address and sparklyr.worker.gateway.port to configure gateway used under worker nodes.
Added group_by parameter to spark_apply(), to support operations over groups of dataframes.
Added spark_apply(), allowing users to use R code to directly manipulate and transform Spark DataFrames.
External Data
Added spark_write_source(). This function writes data into a Spark data source which can be loaded through a Spark package.
Added spark_write_jdbc(). This function writes from a Spark DataFrame into a JDBC connection.
Added columns parameter to spark_read_*() functions to load data with named columns or explicit column types.
Added partition_by parameter to spark_write_csv(), spark_write_json(), spark_write_table() and spark_write_parquet().
Added spark_read_source(). This function reads data from a Spark data source which can be loaded through a Spark package.
Added support for mode = "overwrite" and mode = "append" to spark_write_csv().
spark_write_table() now supports saving to default Hive path.
Improved performance of spark_read_csv() reading remote data when infer_schema = FALSE.
Added spark_read_jdbc(). This function reads from a JDBC connection into a Spark DataFrame.
Renamed spark_load_table() and spark_save_table() into spark_read_table() and spark_write_table() for consistency with existing spark_read_*() and spark_write_*() functions.
Added support to specify a vector of column names in spark_read_csv() to specify column names without having to set the type of each column.
Improved copy_to(), sdf_copy_to() and dbWriteTable() performance under yarn-client mode.
dplyr
Support for cumprod() to calculate cumulative products.
Support for cor(), cov(), sd() and var() as window functions.
Support for Hive built-in operators %like%, %rlike%, and %regexp% for matching regular expressions in filter() and mutate().
Support for dplyr (>= 0.6) which, among many improvements, increases performance in some queries by making use of a new query optimizer.
sample_frac() takes a fraction instead of a percent to match dplyr.
Improved performance of sample_n() and sample_frac() through the use of TABLESAMPLE in the generated query.
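The Hive operators listed in this section can be used directly inside filter(). The sketch below is illustrative only: it assumes a local Spark installation, and the data frame and the table name cars_like are invented for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical data; the table name "cars_like" is arbitrary.
cars <- data.frame(
  model = c("Merc 240D", "Fiat 128", "Merc 230"),
  mpg = c(24.4, 32.4, 22.8),
  stringsAsFactors = FALSE
)
cars_tbl <- copy_to(sc, cars, name = "cars_like", overwrite = TRUE)

# %like% is passed through to Spark SQL as the LIKE operator.
mercs <- cars_tbl %>% filter(model %like% "Merc%") %>% collect()

# %rlike% matches the column against a regular expression.
fiats <- cars_tbl %>% filter(model %rlike% "^Fiat") %>% collect()

spark_disconnect(sc)
```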
Databases
Added src_databases(). This function lists all the available databases.
Added tbl_change_db(). This function changes the current database.
DataFrames
Added sdf_len(), sdf_seq() and sdf_along() to help generate numeric sequences as Spark DataFrames.
Added spark_set_checkpoint_dir(), spark_get_checkpoint_dir(), and sdf_checkpoint() to enable checkpointing.
Added sdf_broadcast() which can be used to hint the query optimizer to perform a broadcast join in cases where a shuffle hash join is planned but not optimal.
Added sdf_repartition(), sdf_coalesce(), and sdf_num_partitions() to support repartitioning and getting the number of partitions of Spark DataFrames.
Added sdf_bind_rows() and sdf_bind_cols() – these functions are the sparklyr equivalent of dplyr::bind_rows() and dplyr::bind_cols().
Added sdf_separate_column() – this function allows one to separate components of an array / vector column into separate scalar-valued columns.
sdf_with_sequential_id() now supports from parameter to choose the starting value of the id column.
Added sdf_pivot(). This function provides a mechanism for constructing pivot tables, using Spark’s ‘groupBy’ + ‘pivot’ functionality, with a formula interface similar to that of reshape2::dcast().
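A short sketch of how several of the DataFrame helpers in this section fit together is given below. It assumes a local Spark installation; the table name mtcars_pivot and the choice of four partitions are arbitrary.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# sdf_len() builds a one-column DataFrame (`id`) of the requested length.
ids <- sdf_len(sc, 10)
n_ids <- sdf_nrow(ids)

# sdf_seq() takes explicit bounds and a step.
evens <- sdf_seq(sc, from = 2, to = 10, by = 2)

# Repartition, then confirm the resulting partition count.
ids_4 <- sdf_repartition(ids, partitions = 4)
n_parts <- sdf_num_partitions(ids_4)

# Pivot with the formula interface: one row per `cyl`, one column per
# `gear` value, each cell holding a row count.
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_pivot", overwrite = TRUE)
pivoted <- sdf_pivot(mtcars_tbl, cyl ~ gear, fun.aggregate = "count")
pivot_rows <- sdf_nrow(pivoted)

spark_disconnect(sc)
```

The left-hand side of the formula names the grouping column and the right-hand side names the column whose distinct values become new columns, mirroring the reshape2::dcast() convention.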
MLlib
Added vocabulary.only to ft_count_vectorizer() to retrieve the vocabulary with ease.
GLM type models now support weights.column to specify weights in model fitting. (#217)
ml_logistic_regression() now supports multinomial regression, in addition to binomial regression [requires Spark 2.1.0 or greater]. (#748)
Implemented residuals() and sdf_residuals() for Spark linear regression and GLM models. The former returns an R vector while the latter returns a tbl_spark of training data with a residuals column added.
Added ml_model_data(), used for extracting data associated with Spark ML models.
The ml_save() and ml_load() functions gain a meta argument, allowing users to specify where R-level model metadata should be saved independently of the Spark model itself. This should help facilitate the saving and loading of Spark models used in non-local connection scenarios.
ml_als_factorization() now supports the implicit matrix factorization and nonnegative least square options.
Added ft_count_vectorizer(). This function can be used to transform columns of a Spark DataFrame so that they might be used as input to ml_lda(). This should make it easier to invoke ml_lda() on Spark data sets.
Broom
- Implemented tidy(), augment(), and glance() from tidyverse/broom for ml_model_generalized_linear_regression and ml_model_linear_regression models.
R Compatibility
- Implemented cbind.tbl_spark(). This method works by first generating index columns using sdf_with_sequential_id() then performing inner_join(). Note that dplyr _join() functions should still be used for DataFrames with common keys since they are less expensive.
Connections
Increased default number of concurrent connections by setting default for spark.port.maxRetries from 16 to 128.
Support for gateway connections sparklyr://hostname:port/session and using spark-submit --class sparklyr.Shell sparklyr-2.1-2.11.jar <port> <id> --remote.
Added support for sparklyr.gateway.service and sparklyr.gateway.remote to enable/disable the gateway in service and to accept remote connections required for Yarn Cluster mode.
Added support for Yarn Cluster mode using master = "yarn-cluster". Either explicitly set config = list(sparklyr.gateway.address = "<driver-name>"), or implicitly sparklyr will read the site-config.xml for the YARN_CONF_DIR environment variable.
Added spark_context_config() and hive_context_config() to retrieve runtime configurations for the Spark and Hive contexts.
Added sparklyr.log.console to redirect logs to console, useful for troubleshooting spark_connect.
Added sparklyr.backend.args as config option to enable passing parameters to the sparklyr backend.
Improved logging while establishing connections to sparklyr.
Improved spark_connect() performance.
Implemented new configuration checks to proactively report connection errors in Windows.
While connecting to Spark from Windows, setting the sparklyr.verbose option to TRUE prints detailed configuration steps.
Added custom_headers to livy_config() to add custom headers to the REST call to the Livy server.
Compilation
Added support for jar_dep in the compilation specification to support additional jars through spark_compile().
spark_compile() now prints deprecation warnings.
Added download_scalac() to assist downloading all the Scala compilers required to build using compile_package_jars and provided support for using any scalac minor versions while looking for the right compiler.
Backend
- Improved backend logging by adding type and session id prefix.
Miscellaneous
copy_to() and sdf_copy_to() auto generate a name when an expression can’t be transformed into a table name.
Implemented type_sum.jobj() (from tibble) to enable better printing of jobj objects embedded in data frames.
Added the spark_home_set() function, to help facilitate the setting of the SPARK_HOME environment variable. This should prove useful in teaching environments, when teaching the basics of Spark and sparklyr.
Added support for the sparklyr.ui.connections option, which adds additional connection options into the new connections dialog. The rstudio.spark.connections option is now deprecated.
Implemented the “New Connection Dialog” as a Shiny application to be able to support newer versions of RStudio that deprecate current connections UI.
Bug Fixes
When using spark_connect() in local clusters, it validates that java exists under JAVA_HOME to help troubleshoot systems that have an incorrect JAVA_HOME.
Improved “argument is of length zero” error triggered while retrieving data with no columns to display.
Fixed “Path does not exist” referencing hdfs exception during copy_to under systems configured with HADOOP_HOME.
Fixed session crash after “No status is returned” error by terminating invalid connection and added support to print log trace during this error.
compute() now caches data in memory by default. To revert this behavior use sparklyr.dplyr.compute.nocache set to TRUE.
spark_connect() with master = "local" and a given version overrides SPARK_HOME to avoid existing installation mismatches.
Fixed spark_connect() under Windows issue when newInstance0 is present in the logs.
Fixed collecting long type columns when NAs are present (#463).
Fixed backend issue that affects systems where localhost does not resolve properly to the loopback address.
Fixed issue collecting data frames containing newlines \n.
Spark Null objects (objects of class NullType) discovered within numeric vectors are now collected as NAs, rather than lists of NAs.
Fixed warning while connecting with livy and improved 401 message.
Fixed issue in spark_read_parquet() and other read methods in which spark_normalize_path() would not work in some platforms while loading data using custom protocols like s3n:// for Amazon S3.
Resolved issue in spark_save()/load_table() to support saving / loading data and added path parameter in spark_load_table() for consistency with other functions.
Sparklyr 0.5.5
- Implemented support for connectionViewer interface required in RStudio 1.1 and spark_connect with mode="databricks".
Sparklyr 0.5.4
- Implemented support for dplyr 0.6 and Spark 2.1.x.
Sparklyr 0.5.3
- Implemented support for DBI 0.6.
Sparklyr 0.5.2
Fix to spark_connect affecting Windows users and Spark 1.6.x.
Fix to Livy connections which would cause connections to fail while connection is on ‘waiting’ state.
Sparklyr 0.5.0
Implemented basic authorization for Livy connections using livy_config_auth().
Added support to specify additional spark-submit parameters using the sparklyr.shell.args environment variable.
Renamed sdf_load() and sdf_save() to spark_read() and spark_write() for consistency.
The functions tbl_cache() and tbl_uncache() can now be used without requiring the dplyr namespace to be loaded.
spark_read_csv(..., columns = <...>, header = FALSE) should now work as expected – previously, sparklyr would still attempt to normalize the column names provided.
Support to configure Livy using the livy. prefix in the config.yml file.
Implemented experimental support for Livy through: livy_install(), livy_service_start(), livy_service_stop() and spark_connect(method = "livy").
The ml routines now accept data as an optional argument, to support calls of the form e.g. ml_linear_regression(y ~ x, data = data). This should be especially helpful in conjunction with dplyr::do().
Spark DenseVector and SparseVector objects are now deserialized as R numeric vectors, rather than Spark objects. This should make it easier to work with the output produced by sdf_predict() with Random Forest models, for example.
Implemented dim.tbl_spark(). This should ensure that dim(), nrow() and ncol() all produce the expected result with tbl_sparks.
Improved Spark 2.0 installation in Windows by creating spark-defaults.conf and configuring spark.sql.warehouse.dir.
Embedded Apache Spark package dependencies to avoid requiring internet connectivity while connecting for the first time through spark_connect. The sparklyr.csv.embedded config setting was added to configure a regular expression to match Spark versions where the embedded package is deployed.
Increased exception callstack and message length to include full error details when an exception is thrown in Spark.
Improved validation of supported Java versions.
The spark_read_csv() function now accepts the infer_schema parameter, controlling whether the columns schema should be inferred from the underlying file itself. Disabling this should improve performance when the schema is known beforehand.
Added a do_.tbl_spark implementation, allowing for the execution of dplyr::do statements on Spark DataFrames. Currently, the computation is performed in serial across the different groups specified on the Spark DataFrame; in the future we hope to explore a parallel implementation. Note that do_ always returns a tbl_df rather than a tbl_spark, as the objects produced within a do_ query may not necessarily be Spark objects.
Improved errors, warnings and fallbacks for unsupported Spark versions.
sparklyr now defaults to tar = "internal" in its calls to untar(). This should help resolve issues some Windows users have seen related to an inability to connect to Spark, which ultimately were caused by a lack of permissions on the Spark installation.
Resolved an issue where copy_to() and other R => Spark data transfer functions could fail when the last column contained missing / empty values. (#265)
Added sdf_persist() as a wrapper to the Spark DataFrame persist() API.
Resolved an issue where predict() could produce results in the wrong order for large Spark DataFrames.
Implemented support for na.action with the various Spark ML routines. The value of getOption("na.action") is used by default. Users can customize the na.action argument through the ml.options object accepted by all ML routines.
On Windows, long paths, and paths containing spaces, are now supported within calls to spark_connect().
The lag() window function now accepts numeric values for n. Previously, only integer values were accepted. (#249)
Added support to configure Spark environment variables using spark.env.* config.
Added support for the Tokenizer and RegexTokenizer feature transformers. These are exported as the ft_tokenizer() and ft_regex_tokenizer() functions.
Resolved an issue where attempting to call copy_to() with an R data.frame containing many columns could fail with a Java StackOverflow. (#244)
Resolved an issue where attempting to call collect() on a Spark DataFrame containing many columns could produce the wrong result. (#242)
Added support to parameterize network timeouts using the sparklyr.backend.timeout, sparklyr.gateway.start.timeout and sparklyr.gateway.connect.timeout config settings.
Improved logging while establishing connections to sparklyr.
Added sparklyr.gateway.port and sparklyr.gateway.address as config settings.
The spark_log() function now accepts the filter parameter. This can be used to filter entries within the Spark log.
Increased network timeout for sparklyr.backend.timeout.
Moved spark.jars.default setting from options to Spark config.
sparklyr now properly respects the Hive metastore directory with the sdf_save_table() and sdf_load_table() APIs for Spark < 2.0.0.
Added sdf_quantile() as a means of computing (approximate) quantiles for a column of a Spark DataFrame.
Added support for n_distinct(...) within the dplyr interface, based on call to Hive function count(DISTINCT ...). (#220)
Sparklyr 0.4.0
- First release to CRAN.