
Table 7 Data Cleaning

From: Predicting academic success in higher education: literature review and best practices

 

| Issue | Strategies | Methods | Cases | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Missing data | Listwise deletion | Instance/row deletion | Records contain few missing values | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the data set is small |
| Missing data | Listwise deletion | Feature/column deletion | Column contains too many missing values | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the number of attributes is small |
| Missing data | Imputation (replacement) | Numeric values: median or mean of the student; nominal values: mode of the student | Missing data such as grades or marks | Preserves the data | Can introduce bias in the analysis |
| Missing data | Imputation (replacement) | Numeric values: median or mean of the feature; nominal values: mode of the feature | Other missing data | Preserves the data | Can introduce bias in the analysis |
| Outlier data | Remove the outliers | | Incorrectly entered values, or outliers outside the population of interest | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the data set is small |
| Outlier data | Bin the data | | Extreme outliers that remain outliers after transformation | Easier to understand and handle | Improve the ability of the prediction model |
| Outlier data | Leave the outliers | | When outliers are from the population of interest | Preserves the data | Affects the ability of the prediction model |
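
The strategies in the table can be illustrated with a minimal Python sketch that uses only the standard library. Everything specific here is an assumption made for illustration and does not come from the source: the sample records, the column names (`math`, `physics`, `credits`, `major`), the 50% missing-value threshold for column deletion, the 1.5 × IQR rule used to flag outliers, and the bin edges. The table itself does not prescribe how outliers are detected or how bins are chosen.

```python
from statistics import mean, median, mode, quantiles

# Hypothetical student records; None marks a missing value.
records = [
    {"math": 78, "physics": 82, "credits": 30, "major": "CS"},
    {"math": 65, "physics": None, "credits": 28, "major": "CS"},
    {"math": 90, "physics": 88, "credits": 31, "major": None},
    {"math": 71, "physics": 75, "credits": 29, "major": "Math"},
    {"math": 84, "physics": 80, "credits": 300, "major": "CS"},  # likely entry error
]

def drop_incomplete_rows(rows):
    """Instance/row deletion: keep only records with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Feature/column deletion: drop columns whose missing share exceeds the threshold."""
    keep = [
        c for c in rows[0]
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    return [{c: r[c] for c in keep} for r in rows]

def impute_by_student(rows, grade_columns):
    """Fill a student's missing mark with the mean of that student's other marks."""
    filled = [dict(r) for r in rows]
    for r in filled:
        known = [r[c] for c in grade_columns if r[c] is not None]
        if known:
            for c in grade_columns:
                if r[c] is None:
                    r[c] = mean(known)
    return filled

def impute_by_feature(rows):
    """Fill missing values with the feature's median (numeric) or mode (nominal)."""
    filled = [dict(r) for r in rows]
    for c in rows[0]:
        present = [r[c] for r in rows if r[c] is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        fill = median(present) if numeric else mode(present)
        for r in filled:
            if r[c] is None:
                r[c] = fill
    return filled

def iqr_bounds(values, k=1.5):
    """Assumed outlier rule: values beyond k * IQR from the quartiles are outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def remove_outliers(rows, column):
    """Outlier removal: keep only rows whose value lies inside the IQR bounds."""
    low, high = iqr_bounds([r[column] for r in rows])
    return [r for r in rows if low <= r[column] <= high]

def bin_value(value, edges, labels):
    """Binning: return the label of the first upper edge the value does not exceed.

    `labels` has one more entry than `edges`; values above every edge get the last label.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]
```

With these helpers, `impute_by_feature(records)` leaves `records` untouched and returns a filled copy, and `bin_value(300, [27, 32], ["low", "typical", "high"])` places the extreme credit count in the top bin instead of discarding the record. Which strategy is appropriate depends on the cases listed in the table, such as the size of the data set and whether the outlier belongs to the population of interest.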