6 Preprocess
Data preprocessing is an important step in data pipelines.
Let’s start with the Sonar dataset and add some missing values for this example.
data(Sonar, package = "mlbench")
Sonar[c(10, 20, 30, 40, 50), 1] <- NA
Sonar[c(15, 25, 35, 45, 55), 2] <- NA
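A quick count in base R can confirm that missing values end up where intended. The sketch below uses a small toy data frame named dat as a stand-in for Sonar (an assumption, made so the block runs without mlbench installed); the same colSums(is.na(...)) call should apply to Sonar itself.

```r
# Toy stand-in for Sonar (hypothetical data): two numeric features and a factor outcome
set.seed(2024)
dat <- data.frame(
  V1 = runif(60),
  V2 = runif(60),
  Class = factor(sample(c("M", "R"), 60, replace = TRUE))
)

# Inject missing values using the same row/column indexing pattern as above
dat[c(10, 20, 30, 40, 50), 1] <- NA
dat[c(15, 25, 35, 45, 55), 2] <- NA

# Count missing values per column and overall
na_per_column <- colSums(is.na(dat))
total_na <- sum(na_per_column)
print(na_per_column)
print(total_na)
```

With five rows set to NA in each of two columns, the total should come to 10, matching what check_data() reports for Sonar in the next section.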
6.1 Check data
To check your data, use the check_data() function:
check_data(Sonar)
Sonar: A data.table with 208 rows and 61 columns
Data types
* 60 numeric features
* 0 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 2 features include 'NA' values; 10 'NA' values total
* 2 numeric
Recommendations
* Consider imputing missing values or use complete cases only
The output summarizes the data types in your dataset and any issues found (constant features, duplicate cases, missing values), followed by recommendations.
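To make the "Issues" section of that report more concrete, here is a minimal base R sketch of how such counts could be computed. The helper summarize_issues() and the toy data frame are hypothetical and written for illustration only; this is not how check_data() is implemented.

```r
# Hypothetical helper: counts the same kinds of issues that check_data() reports
summarize_issues <- function(x) {
  # A feature is treated as constant if it has at most one unique non-missing value
  n_constant <- sum(vapply(
    x,
    function(col) length(unique(col[!is.na(col)])) <= 1,
    logical(1)
  ))
  # duplicated() flags rows that repeat an earlier row
  n_duplicates <- sum(duplicated(x))
  na_counts <- colSums(is.na(x))
  list(
    n_constant = n_constant,
    n_duplicates = n_duplicates,
    n_features_with_na = sum(na_counts > 0),
    n_na_total = sum(na_counts)
  )
}

# Toy data: V2 is constant, row 5 repeats row 1, and V1 has one missing value
dat <- data.frame(
  V1 = c(0.1, 0.2, NA, 0.4, 0.1),
  V2 = rep(1, 5),
  Class = factor(c("M", "R", "M", "R", "M"))
)

issues <- summarize_issues(dat)
str(issues)
```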
6.2 Preprocess
To clean or preprocess the data, use the preprocess() function. In this case, we want to impute missing data. By default, preprocess() uses the missRanger package, which iteratively predicts missing values from the available data using random forests.
Sonar.pre <- preprocess(Sonar, impute = TRUE)
06-30-24 10:57:03 Hello, egenn [preprocess]
06-30-24 10:57:03 Imputing missing values using predictive mean matching with missRanger... [preprocess]
Missing value imputation by random forests
Variables to impute: V1, V2
Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class
iter 1
iter 2
iter 3
06-30-24 10:57:03 Completed in 0.01 minutes (Real: 0.62; User: 1.16; System: 0.07) [preprocess]
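For readers curious about what this imputation step looks like outside of preprocess(), the following is a minimal sketch that calls missRanger directly on a toy data frame. It assumes the missRanger package is installed. The values pmm.k = 3 and num.trees = 50 are illustrative choices and are not necessarily what preprocess() passes internally.

```r
# Toy data (hypothetical): x2 is strongly related to x1, x3 is noise
set.seed(1)
n <- 100
x1 <- rnorm(n)
dat <- data.frame(
  x1 = x1,
  x2 = 2 * x1 + rnorm(n, sd = 0.1),
  x3 = runif(n)
)
dat$x1[c(10, 20, 30)] <- NA
dat$x2[c(15, 25)] <- NA

# Impute with chained random forests; pmm.k > 0 enables predictive mean matching,
# so imputed values are drawn from observed values of the same variable
imputed <- missRanger::missRanger(
  dat,
  pmm.k = 3,      # illustrative value
  num.trees = 50, # passed on to ranger; illustrative value
  seed = 1,
  verbose = 0
)

# No missing values should remain
print(colSums(is.na(imputed)))
```

Because predictive mean matching replaces each prediction with a nearby observed value, imputed entries stay within the range of the observed data for that variable.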
Let’s now check our preprocessed data:
check_data(Sonar.pre)
Sonar.pre: A data.table with 208 rows and 61 columns
Data types
* 60 numeric features
* 0 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 0 missing values
Recommendations
* Everything looks good
6.2.1 Preprocessing options
The preprocess() function accepts the following arguments. See its documentation for details.
completeCases
removeCases.thres
removeFeatures.thres
missingness
impute
integer2factor
integer2numeric
logical2factor
logical2numeric
numeric2factor
numeric2factor.levels
len2fac
character2factor
factorNA2missing
factorNA2missing.level
scale
center
removeConstants
removeDuplicates
oneHot
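Several of these arguments correspond to familiar operations. As a rough guide, the base R sketch below illustrates the ideas behind completeCases, removeConstants, center and scale, and oneHot on a hypothetical toy data frame. It illustrates the concepts only; it does not reproduce preprocess() internals, defaults, or order of operations.

```r
# Toy data (hypothetical): one missing value, one constant feature, one factor
dat <- data.frame(
  x1 = c(1, 2, 3, 4, NA),
  x2 = c(10, 20, 30, 40, 50),
  k  = rep(7, 5),
  g  = factor(c("a", "b", "a", "b", "a"))
)

# Idea behind completeCases: keep only rows without missing values
dat_cc <- dat[complete.cases(dat), ]

# Idea behind removeConstants: drop features with a single unique value
is_constant <- vapply(dat_cc, function(col) length(unique(col)) == 1, logical(1))
dat_nc <- dat_cc[, !is_constant, drop = FALSE]

# Idea behind center and scale: standardize numeric features to mean 0, SD 1
is_num <- vapply(dat_nc, is.numeric, logical(1))
dat_std <- dat_nc
dat_std[is_num] <- lapply(dat_nc[is_num], function(col) as.numeric(scale(col)))

# Idea behind oneHot: expand a factor into one indicator column per level
one_hot <- model.matrix(~ g - 1, data = dat_std)

print(dat_std)
print(one_hot)
```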