Table 7 Data Cleaning

From: Predicting academic success in higher education: literature review and best practices

|  | Strategies | Methods | Cases | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Missing data | Listwise deletion | Instance/row deletion | Records contain few missing values | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the data set is small |
| Missing data | Listwise deletion | Feature/column deletion | Columns contain too many missing values | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the number of attributes is small |
| Missing data | Imputation (replacement) | Numeric values: median or mean of the student. Nominal values: mode of the student. | Missing data such as grades or marks | Preserves the data | Can introduce bias in the analysis |
| Missing data | Imputation (replacement) | Numeric values: median or mean of the feature. Nominal values: mode of the feature. | Other missing data | Preserves the data | Can introduce bias in the analysis |
| Outlier data | Remove the outliers |  | Incorrectly entered values, or outliers outside the population of interest | Does not affect the ability of the prediction model if the data set is large | Affects the ability of the prediction model if the data set is small |
| Outlier data | Bin the data |  | Extreme outliers that remain outliers after transformation | Easier to understand and handle. Improves the ability of the prediction model. |  |
| Outlier data | Leave the outliers |  | Outliers are from the population of interest | Preserves the data | Affects the ability of the prediction model |
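
The strategies in Table 7 can be illustrated with a minimal sketch in plain Python. This is not code from the reviewed article. The function names, the sample records, the use of `None` to mark a missing value, the 0.5 missing-ratio threshold, the valid grade range, and the bin edges are all assumptions chosen only for illustration.

```python
from bisect import bisect_right
from statistics import median, mode

# Hypothetical student records; None marks a missing value (an assumption).
students = [
    {"id": 1, "math": 70, "physics": 80, "gender": "F"},
    {"id": 2, "math": None, "physics": 60, "gender": "M"},
    {"id": 3, "math": 90, "physics": None, "gender": None},
    {"id": 4, "math": 50, "physics": 55, "gender": "F"},
]


def drop_rows_with_missing(rows):
    """Instance/row deletion: keep only records with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Feature/column deletion: drop columns with too many missing values."""
    n = len(rows)
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_missing_ratio]
    return [{c: r[c] for c in keep} for r in rows]


def impute_by_student(rows, grade_columns):
    """Fill a student's missing grade with the median of that student's other grades."""
    result = []
    for r in rows:
        observed = [r[c] for c in grade_columns if r[c] is not None]
        fill = median(observed) if observed else None  # nothing to impute from
        filled = {c: fill if r[c] is None else r[c] for c in grade_columns}
        result.append({**r, **filled})
    return result


def impute_by_feature(rows, column, numeric=True):
    """Fill missing values with the column median (numeric) or mode (nominal)."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if numeric else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def remove_out_of_range(rows, column, low, high):
    """Remove records whose value lies outside a valid range (e.g. entry errors).

    Records with a missing value are kept; they fall under the missing-data strategies.
    """
    return [r for r in rows if r[column] is None or low <= r[column] <= high]


def bin_value(value, edges, labels):
    """Bin a numeric value: labels[i] covers the interval between edges[i-1] and edges[i]."""
    return labels[bisect_right(edges, value)]


complete = drop_rows_with_missing(students)               # students 1 and 4 remain
per_student = impute_by_student(students, ["math", "physics"])
per_feature = impute_by_feature(students, "gender", numeric=False)
valid = remove_out_of_range(students, "math", 0, 100)     # assumed 0-100 grade scale
level = bin_value(95, [60, 80], ["low", "medium", "high"])
```

Imputing per student suits grades because a student's other marks are usually a better guide than the class-wide value, whereas per-feature imputation is the fallback for attributes with no comparable values in the same record. Both can bias the analysis, as the table notes, so it can help to record which values were imputed.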