Feature Transformation - Bucketizer (Transformer)

ft_bucketizer

Description

Similar to R’s cut function, this transforms a numeric column into a discretized column, with breaks specified through the splits parameter.

Usage

ft_bucketizer(
  x,
  input_col = NULL,
  output_col = NULL,
  splits = NULL,
  input_cols = NULL,
  output_cols = NULL,
  splits_array = NULL,
  handle_invalid = "error",
  uid = random_string("bucketizer_"),
  ...
)

Arguments

Arguments	Description
x	A `spark_connection`, `ml_pipeline`, or a `tbl_spark`.
input_col	The name of the input column.
output_col	The name of the output column.
splits	A numeric vector of cutpoints, indicating the bucket boundaries.
input_cols	Names of input columns.
output_cols	Names of output columns.
splits_array	Parameter for specifying multiple splits parameters. Each element in this array can be used to map continuous features into buckets.
handle_invalid	(Spark 2.1.0+) Param for how to handle invalid entries. Options are ‘skip’ (filter out rows with invalid values), ‘error’ (throw an error), or ‘keep’ (keep invalid values in a special additional bucket). Default: “error”
uid	A character string used to uniquely identify the feature transformer.
…	Optional arguments; currently unused.

Value

The object returned depends on the class of x. If it is a spark_connection, the function returns a ml_estimator or a ml_estimator object. If it is a ml_pipeline, it will return a pipeline with the transformer or estimator appended to it. If a tbl_spark, it will return a tbl_spark with the transformation applied to it.

Examples



library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl %>%
  ft_bucketizer(
    input_col = "Sepal_Length",
    output_col = "Sepal_Length_bucket",
    splits = c(0, 4.5, 5, 8)
  ) %>%
  select(Sepal_Length, Sepal_Length_bucket, Species)

--- title: "Feature Transformation - Bucketizer (Transformer)" execute: eval: true freeze: true --- ## ft_bucketizer ## Description Similar to R's `cut` function, this transforms a numeric column into a discretized column, with breaks specified through the `splits` parameter. ## Usage ```r ft_bucketizer( x, input_col = NULL, output_col = NULL, splits = NULL, input_cols = NULL, output_cols = NULL, splits_array = NULL, handle_invalid = "error", uid = random_string("bucketizer_"), ... ) ``` ## Arguments |Arguments|Description| |---|---| | x | A `spark_connection`, `ml_pipeline`, or a `tbl_spark`. | | input_col | The name of the input column. | | output_col | The name of the output column. | | splits | A numeric vector of cutpoints, indicating the bucket boundaries. | | input_cols | Names of input columns. | | output_cols | Names of output columns. | | splits_array | Parameter for specifying multiple splits parameters. Each element in this array can be used to map continuous features into buckets. | | handle_invalid | (Spark 2.1.0+) Param for how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special additional bucket). Default: "error" | | uid | A character string used to uniquely identify the feature transformer. | | ... | Optional arguments; currently unused. | ## Value The object returned depends on the class of `x`. If it is a `spark_connection`, the function returns a `ml_estimator` or a `ml_estimator` object. If it is a `ml_pipeline`, it will return a pipeline with the transformer or estimator appended to it. If a `tbl_spark`, it will return a `tbl_spark` with the transformation applied to it. ## See Also Other feature transformers: `ft_binarizer()`, `ft_chisq_selector()`, `ft_count_vectorizer()`, `ft_dct()`, `ft_elementwise_product()`, `ft_feature_hasher()`, `ft_hashing_tf()`, `ft_idf()`, `ft_imputer()`, `ft_index_to_string()`, `ft_interaction()`, `ft_lsh`, `ft_max_abs_scaler()`, `ft_min_max_scaler()`, `ft_ngram()`, `ft_normalizer()`, `ft_one_hot_encoder()`, `ft_one_hot_encoder_estimator()`, `ft_pca()`, `ft_polynomial_expansion()`, `ft_quantile_discretizer()`, `ft_r_formula()`, `ft_regex_tokenizer()`, `ft_robust_scaler()`, `ft_sql_transformer()`, `ft_standard_scaler()`, `ft_stop_words_remover()`, `ft_string_indexer()`, `ft_tokenizer()`, `ft_vector_assembler()`, `ft_vector_indexer()`, `ft_vector_slicer()`, `ft_word2vec()` ## Examples ```r library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE) iris_tbl %>% ft_bucketizer( input_col = "Sepal_Length", output_col = "Sepal_Length_bucket", splits = c(0, 4.5, 5, 8) ) %>% select(Sepal_Length, Sepal_Length_bucket, Species) ```