Creates the ‘label’ and ‘features’ columns

ml_prepare_dataset

Description

Creates the ‘label’ and ‘features’ columns.

Usage

ml_prepare_dataset(
  x,
  formula = NULL,
  label = NULL,
  features = NULL,
  label_col = "label",
  features_col = "features",
  keep_original = TRUE,
  ...
)

Arguments

Arguments	Description
x	A `tbl_pyspark` object
formula	An R formula. Used when `x` is a `tbl_spark`.
label	The name of the label column.
features	The name(s) of the feature columns as a character vector.
label_col	Label column name, as a length-one character vector.
features_col	Features column name, as a length-one character vector.
keep_original	Boolean flag indicating whether the output keeps the original columns from `x`. Defaults to `TRUE`.
…	Added for backwards compatibility. Not in use today.

Details

At this time, ‘Spark ML Connect’ does not include a Vector Assembler transformer. The main job of this function is to create a ‘Pyspark’ array column. Pipelines require a ‘label’ column and a ‘features’ column. Even though it is a single column in the dataset, the ‘features’ column contains all of the predictors inside an array. This function also creates a new ‘label’ column that copies the outcome variable. This makes it a lot easier to remove the ‘label’ and ‘outcome’ columns.
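
The shape described above can be pictured with a small local analogy. The sketch below uses only base R and a list column. It does not call Spark or `ml_prepare_dataset()`, and the variable names (`outcome`, `predictors`, `prepared`) are illustrative assumptions, not part of the package.

```r
# Local base-R analogy only: nothing here touches Spark or pysparklyr.
df <- head(mtcars[, c("am", "mpg", "wt")], 3)

outcome <- "am"                # outcome variable (hypothetical choice)
predictors <- c("mpg", "wt")   # predictor variables (hypothetical choice)

prepared <- df

# 'label' is a copy of the outcome variable
prepared$label <- df[[outcome]]

# 'features' is a single column; each cell holds all predictors for that row
prepared$features <- lapply(
  seq_len(nrow(df)),
  function(i) unlist(df[i, predictors], use.names = FALSE)
)

# Inspect the first row's feature vector
str(prepared$features[[1]])
```

In Spark the ‘features’ column is an array column rather than an R list column, so this only mirrors the layout, not the storage.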

Value

A `tbl_pyspark` with either the original columns from `x` plus the ‘label’ and ‘features’ columns, or the ‘label’ and ‘features’ columns only, depending on `keep_original`.
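
A minimal usage sketch follows, based only on the signature shown under Usage. It assumes a Spark Connect server is already running. The address `sc://localhost`, the `version` value, and the choice of `mtcars` columns are assumptions made for illustration. The code has not been verified against a live cluster.

```r
library(sparklyr)
library(pysparklyr)

# Assumption: a Spark Connect server is reachable at this address
sc <- spark_connect(
  master = "sc://localhost",
  method = "spark_connect",
  version = "3.5"
)

tbl_mtcars <- copy_to(sc, mtcars)

# Formula interface: 'am' is the outcome, 'mpg' and 'wt' are the predictors.
# keep_original defaults to TRUE, so the original columns are retained.
prepared <- ml_prepare_dataset(tbl_mtcars, am ~ mpg + wt)

# The same columns named explicitly, returning only 'label' and 'features'
only_ml <- ml_prepare_dataset(
  tbl_mtcars,
  label = "am",
  features = c("mpg", "wt"),
  keep_original = FALSE
)

spark_disconnect(sc)
```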