Predicting academic success in higher education: literature review and best practices

Table 7 Data Cleaning

	Strategies	Methods	Cases	Advantages	Disadvantages
Missing data	Listwise deletion	Instance/row deletion	Few records contain missing values	Does not affect the ability of the prediction model if the data set is large	Affects the ability of the prediction model if the data set is small
	Listwise deletion	Feature/column deletion	Column contains too many missing values	Does not affect the ability of the prediction model if the data set is large	Affects the ability of the prediction model if the number of attributes is small
	Imputation (Replacement)	Numeric values: median or mean of the student; nominal values: mode of the student	Missing data such as grades or marks	Preserves the data	Can introduce bias in the analysis
	Imputation (Replacement)	Numeric values: median or mean of the feature; nominal values: mode of the feature	Other missing data	Preserves the data	Can introduce bias in the analysis
Outlier data	Remove the outliers		Incorrectly entered values, or outliers from outside the population of interest	Does not affect the ability of the prediction model if the data set is large	Affects the ability of the prediction model if the data set is small
	Bin the data		Extreme outliers that remain outliers after transformation	Easier to understand and handle; improves the ability of the prediction model	–
	Leave the outliers		When outliers are from the population of interest	Preserves the data	Affects the ability of the prediction model
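
The strategies in Table 7 can be illustrated with a minimal Python sketch that uses only the standard library. Everything named here is an assumption made for illustration and does not come from the reviewed studies: the toy `records`, the field names (`grade`, `attendance`, `program`), the 0.5 missing-ratio threshold, the valid grade range, and the bin edges. The table does not prescribe how outliers are detected, so the sketch simply treats values outside a known valid range as incorrectly entered.

```python
from bisect import bisect_left
from statistics import mean, median, mode

# Hypothetical toy records; None marks a missing value.
records = [
    {"grade": 70.0, "attendance": 0.90, "program": "CS"},
    {"grade": None, "attendance": 0.80, "program": "CS"},
    {"grade": 85.0, "attendance": None, "program": None},
    {"grade": 60.0, "attendance": 0.70, "program": "Math"},
    {"grade": 990.0, "attendance": 0.95, "program": "CS"},  # likely an entry error
]


def drop_incomplete_rows(rows):
    """Instance/row deletion: keep only records without missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Feature/column deletion: drop features that are missing too often."""
    keep = [
        key for key in rows[0]
        if sum(r[key] is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    return [{key: r[key] for key in keep} for r in rows]


def impute_student_grades(grades):
    """Fill one student's missing grades with the mean of that student's own grades."""
    observed = [g for g in grades if g is not None]
    if not observed:
        return list(grades)  # nothing to impute from
    fill = mean(observed)
    return [fill if g is None else g for g in grades]


def impute_by_feature(rows):
    """Fill missing values with the feature median (numeric) or mode (nominal)."""
    filled = [dict(r) for r in rows]  # copy so the input is left untouched
    for key in rows[0]:
        observed = [r[key] for r in rows if r[key] is not None]
        if not observed:
            continue
        numeric = all(isinstance(v, (int, float)) for v in observed)
        fill = median(observed) if numeric else mode(observed)
        for r in filled:
            if r[key] is None:
                r[key] = fill
    return filled


def remove_out_of_range(rows, key, low, high):
    """Outlier removal: drop records whose value lies outside a valid range."""
    return [r for r in rows if r[key] is None or low <= r[key] <= high]


def bin_value(value, edges, labels):
    """Binning: map a value to an ordinal label; extremes fall into the end bins."""
    return labels[bisect_left(edges, value)]
```

For example, `remove_out_of_range(records, "grade", 0, 100)` drops the record with grade 990, while `bin_value(990.0, [50, 70, 85], ["low", "medium", "high", "very high"])` keeps it and assigns it to the top bin. Note that `impute_by_feature(records)` fills the missing grade with a median computed over all observed grades, including the suspicious 990, which is one way the bias listed in the table can arise. Handling outliers before imputing is one possible ordering, not a rule stated in the table.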