Creates the ‘label’ and ‘features’ columns
ml_prepare_dataset
Description
Creates the ‘label’ and ‘features’ columns
Usage
ml_prepare_dataset(
x,
formula = NULL,
label = NULL,
features = NULL,
label_col = "label",
features_col = "features",
keep_original = TRUE,
...
)Arguments
| Arguments | Description |
|---|---|
| x | A tbl_pyspark object |
| formula | Used when x is a tbl_spark. R formula. |
| label | The name of the label column. |
| features | The name(s) of the feature columns as a character vector. |
| label_col | Label column name, as a length-one character vector. |
| features_col | Features column name, as a length-one character vector. |
| keep_original | Boolean flag that indicates if the output will contain, or not, the original columns from x. Defaults to TRUE. |
| … | Added for backwards compatibility. Not in use today. |
Details
At this time, ‘Spark ML Connect’, does not include a Vector Assembler transformer. The main thing that this function does, is create a ‘Pyspark’ array column. Pipelines require a ‘label’ and ‘features’ columns. Even though it is is single column in the dataset, the ‘features’ column will contain all of the predictors insde an array. This function also creates a new ‘label’ column that copies the outcome variable. This makes it a lot easier to remove the ‘label’, and ‘outcome’ columns.
Value
A tbl_pyspark, with either the original columns from x, plus the ‘label’ and ‘features’ column, or, the ‘label’ and ‘features’ columns only.