Prediction of Student’s performance by modelling small dataset size

Predicting student performance has become a pressing need for most educational entities and institutes. It is essential for helping at-risk students and assuring their retention, for providing excellent learning resources and experiences, and for improving the university’s ranking and reputation. However, this can be difficult for startup to mid-sized universities, especially those that specialize in graduate and postgraduate programs and have few student records to analyze. The main aim of this project is therefore to show that a small dataset can be used for training and modeling, and that a prediction model with a credible accuracy rate can be built from it. This research also explores whether the key indicators in a small dataset, which are then used to create the prediction model, can be identified using visualization and clustering algorithms. The best indicators were fed into multiple machine learning algorithms to find the most accurate model. The results showed that the clustering algorithm was able to identify key indicators in small datasets. The main outcomes of this study demonstrate the efficiency of the support vector machine and linear discriminant analysis algorithms in training on a small dataset and in producing acceptable classification accuracy and reliability test rates.


Introduction
Extensive efforts have been made to predict student performance for different aims, such as detecting at-risk students, assuring student retention, allocating courses and resources, and many others. This research aims to predict student performance in order to engage distinguished students in research and innovative projects that could improve a university's reputation and ranking nationally and internationally. However, the analysis of student records for startup to medium-sized institutes or schools that hold only a small number of student records, like the British University in Dubai, has never been explored in the educational or learning analytics domain. Small datasets have, however, been investigated in other fields, such as health sciences and chemistry (Ingrassia & Morlini, 2005; Pasini, 2015). So, this project aims to explore whether a small student dataset can be utilized in the educational domain.
Additionally, in most research that aims to classify or predict, researchers spend considerable effort just to extract the important indicators that are most useful for constructing reasonably accurate predictive models. They either use feature ranking algorithms or inspect the features selected while training the dataset on different machine learning algorithms, as in (Comendador, Rabago, & Tanguilig, 2016; Mueen, Zafar, & Manzoor, 2016). By contrast, until recently there have been no research efforts to investigate the ability of visualization or clustering techniques to identify such indicators in small datasets, especially in the learning analytics domain (Asif, Merceron, Ali, & Haider, 2017). If such studies are conducted, their outcomes might show that the effort normally spent on feature extraction or selection can be reduced.
So, this research aims to narrow the aforementioned gaps by answering the following research questions:
1. What is the best machine learning classification model for classifying students' dissertation project grades, using a small dataset, with a reasonable and significant accuracy rate? What are the key indicators that could help in creating the classification model for predicting students' dissertation project grades?
2. Could students' performance in any course (excluding the dissertation) be predicted with a reasonable and significant accuracy rate using only students' pre-admission records, course names, and instructors' names?
The study is presented in four sections, including this introduction. The following section describes the methodology used. The third section presents the analysis results. In the last section, the results are interpreted and discussed, and the research is concluded.

Research methodology
To achieve the project's aims, quantitative simulation research methods were applied, following the framework phases shown in Fig. 1. In these phases, the dataset is prepared and passed through visualization and clustering techniques, i.e. a heat map and hierarchical clustering, to extract the top correlated indicators. The indicators are then used in different classification algorithms, and the most accurate model is chosen for predicting student performance in dissertation projects and in all course grades. Before the classification model evaluation phase, the datasets pass through a pre-processing stage (cleansing, missing data imputation, …) to make them ready for the analysis phase. This is detailed in the following sections.

Participants and datasets
In this study, the records of fifty graduates of one master's program were collected from the administration department. These records include the student's ID, age, bachelor degree name, bachelor degree accumulated grade, and the courses taken during the master's study, with the grade and instructor name for each course. Table 1 lists the main attributes used, their data types, and other related details. From these records, two datasets were created to answer the research questions, and Table 2 gives their descriptive statistics. The records were provided after the university's data privacy requirements had been complied with and students' IDs and instructors' names had been replaced with other unique identifiers.

Tools
To make use of the provided dataset, multiple modifications were made to prepare it for analysis. Microsoft Excel and the Python integrated development environment (version 3.6.2) were used for that. Additionally, RStudio (version 1.1.456) was used to visualize the dataset and select the key attributes. It was also used to train different classification algorithms on the dataset and to evaluate them in order to select the most accurate machine learning classification algorithm.

Data Analysis & Procedures
As illustrated in Fig. 1, three main phases were followed to answer the research questions. The following sections explain these phases in more detail.
Dataset pre-processing phase
Initially, the datasets contained valueless attributes, missing instances, unsuitable attribute data types, and other problems that made it necessary to prepare them before feeding them to the analysis phase. Therefore, the datasets were passed through the following preparation stages.
Dataset cleaning
First, attributes irrelevant to this study (like: Model code, assessment status, Status, Course description, Academic Year, and Bachelor institution) were removed. After that, students with incomplete records, such as those who had no grade details for most of their courses or those who had no course records at all, were excluded from the list. At that stage, thirty-eight students and seven attributes remained, as illustrated in Table 1. Last, because the number of courses has been reduced from nine to seven since 2010, and in order to treat all students equally in the analysis phases, the number of courses for all students was reduced from nine to seven by removing the retired courses.
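As a rough illustration of the cleaning stage, the sketch below drops the irrelevant attributes and excludes students with too many missing grades. It is not the authors' code: the record layout, the function name `clean_records`, and the 50% threshold used to interpret "most of their courses" are all assumptions made for this example.

```python
# Attribute names as listed in the text; the dictionary-based record layout is hypothetical.
IRRELEVANT_ATTRIBUTES = {
    "Model code", "assessment status", "Status",
    "Course description", "Academic Year", "Bachelor institution",
}


def clean_records(records, grade_fields, max_missing_share=0.5):
    """Drop irrelevant attributes and exclude students with too many missing grades.

    max_missing_share is an assumed threshold: a student is excluded when more
    than this share of the grade fields is missing (None).
    """
    cleaned = []
    for record in records:
        kept = {k: v for k, v in record.items() if k not in IRRELEVANT_ATTRIBUTES}
        missing = sum(1 for field in grade_fields if kept.get(field) is None)
        if missing / len(grade_fields) > max_missing_share:
            continue  # incomplete record: exclude this student
        cleaned.append(kept)
    return cleaned


# Toy records with made-up values.
records = [
    {"ID": "S1", "Status": "x", "Grade1": 4, "Grade2": 3},
    {"ID": "S2", "Status": "x", "Grade1": None, "Grade2": None},
    {"ID": "S3", "Academic Year": 2012, "Grade1": 3, "Grade2": None},
]
cleaned = clean_records(records, ["Grade1", "Grade2"])
print([r["ID"] for r in cleaned])
```

Here S2 is excluded because all of its grades are missing, while S3 is kept because exactly half (not more than half) of its grades are missing.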
Features encoding
In this stage, the data types of all attributes were changed to numeric for several reasons. First, some machine learning algorithms that have proved efficient in dealing with small datasets, such as Linear Discriminant Analysis (LDA) (Sharma & Paliwal, 2015) and the Multilayer Perceptron Artificial Neural Network (MLP-NN) (Ingrassia & Morlini, 2005; Pasini, 2015), require numeric attributes. The Support Vector Machine algorithm, which was used as well, was also designed to work efficiently with numerical attributes. Moreover, as a general best practice with MLP-NN, attributes have to be in numeric form and be normalized to achieve the best classification results. Normalization rescales the attributes' values into a fixed range (either [0, 1] or [−1, 1]) before they are fed into the classification models. Lastly, since RStudio was used for training the classification algorithms, and it executes its operations in RAM, dealing with categorical variables or strings requires more space, runtime, and processing overhead (since characters are converted to combinations of bytes, especially when dealing with long course names) than dealing with numerical attributes. This effect on processing performance might not be observable with the small sample; however, it is always important to comply with best practices to achieve successful analysis results. So, the attributes were converted to numeric type using the "ifelse" function in Excel, and the following attributes were encoded: B.Sc. Degree, Course Grades and Names, and Instructors' names. The numbers corresponding to each encoded attribute are shown in Tables 3, 4, 5 and 6.
In order to answer research question 1, the dataset was rearranged and new attributes were added. Figures 2 and 3 show the newly populated datasets with their final arrangements. These arrangements were programmed to be done automatically using Python, and Fig. 4 shows a screenshot of the executed code. These new datasets are used to answer the research questions.
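The encoding step can be sketched in a few lines. The authors report doing it in Excel; the Python below is only an illustrative equivalent. The function names and the instructor identifiers are hypothetical, and the grade codes are assumptions except for A = 4, which matches the "class A (4)" label used later in the paper. The actual codes are those in Tables 3 to 6.

```python
def encode_nominal(values):
    """Map each distinct category to an integer code (1, 2, ...) in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return [codes[value] for value in values], codes


def encode_ordinal(values, order):
    """Map ordered categories to integers using an explicit code table."""
    return [order[value] for value in values]


# Hypothetical code table for grades: only A = 4 is taken from the paper.
GRADE_CODES = {"Fail": 1, "C": 2, "B": 3, "A": 4}

instructors = ["Inst_A", "Inst_B", "Inst_A", "Inst_C"]  # made-up identifiers
encoded_instructors, instructor_codes = encode_nominal(instructors)
encoded_grades = encode_ordinal(["A", "B", "Fail", "A"], GRADE_CODES)

print(encoded_instructors)  # [1, 2, 1, 3]
print(encoded_grades)       # [4, 3, 1, 4]
```

An explicit code table is used for grades so that the numeric codes keep the order of the grades, whereas instructor and course names have no natural order and can take arbitrary codes.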
Missing value imputation
Missing values were identified visually using the missmap function of the Amelia library in R. This function outputs a heat map that marks missing values with different colors. Both datasets were fed to that function to visually identify the missing values. In the dissertation dataset, the 'dissertation instructor' attribute was missing most of its values; thus, the variable was deleted. The remaining missing values in the dissertation grade attribute (i.e. Grade) and the course 1 grade attribute (i.e. Grade1) were replaced with the mode value of the respective attribute. The corresponding code in R is attached in the Additional file 1. Likewise, in the all-courses dataset, missing values of the grades attribute (i.e. Grades) were replaced with its mode. Compared to mean, median, regression, and other imputation methods, imputing with the mode value will:
- Preserve the newly encoded numeric (ordinal) attribute data type from being changed to a continuous one.
- Avoid producing values that do not belong to any of the Grades attribute's classes.
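The mode imputation described above can be sketched as follows. The authors' own implementation is in R (Additional file 1); this Python version is only an illustration. The function name, the use of `None` to mark missing values, and the sample grade codes are assumptions.

```python
from collections import Counter


def impute_with_mode(values):
    """Replace missing entries (None) with the most frequent observed value.

    If several values tie for most frequent, the one seen first is used.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


grades = [4, None, 3, 4, None, 2]  # made-up encoded grades
imputed = impute_with_mode(grades)
print(imputed)  # [4, 4, 3, 4, 4, 2]

# For comparison: the mean of the observed codes is not a valid grade class.
observed = [g for g in grades if g is not None]
print(sum(observed) / len(observed))  # 3.25
```

The comparison at the end illustrates the two points made in the text: the mode is always one of the existing grade codes, whereas the mean (3.25 here) is a continuous value that belongs to no class.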
Normalization
Normalization is one of the recommended pre-processing practices that should precede training some kinds of classification or prediction algorithms on a dataset, such as the neural network machine learning algorithm. For that algorithm, it is recommended to bring the instance values within a specific range, either [0, 1] or [−1, 1], since scaling to these ranges tends to give better results (Rotich, Backman, Linnanen, & Daniil, 2014). In this project, MinMaxScaler (which scales instances to the range [0, 1]) was used as the normalization method and calculated in R using the following equation (assuming the range is [a, b]):
$$x' = a + \frac{(x - \min(x))\,(b - a)}{\max(x) - \min(x)}$$
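As a minimal sketch, min-max scaling of a value x to a range [a, b] is usually written x' = a + (x − min)(b − a) / (max − min), which reduces to (x − min) / (max − min) for [0, 1]. The Python function below illustrates that standard form; it is not the authors' R code, and the function name, the handling of constant columns, and the sample ages are assumptions.

```python
def min_max_scale(values, a=0.0, b=1.0):
    """Rescale values linearly so that the minimum maps to a and the maximum to b."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumed convention: a constant column carries no information,
        # so every value is mapped to the lower bound.
        return [a for _ in values]
    return [a + (v - lowest) * (b - a) / (highest - lowest) for v in values]


ages = [20, 30, 40]  # made-up student ages
print(min_max_scale(ages))              # [0.0, 0.5, 1.0]
print(min_max_scale(ages, -1.0, 1.0))   # [-1.0, 0.0, 1.0]
```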

Attributes selection phase
After the pre-processing phase, the feature selection process was started. The heat map visualization and hierarchical clustering methods were used to help visualize the relations between variables and identify the main indicators that could help in predicting dissertation and course grades. In a nutshell, a heat map is a simple and organized way to display a data matrix in color, where the columns represent the dataset attributes and the rows hold their corresponding values. The R code for the visualization and clustering methods used is attached in the Additional file 1. The key indicators that were identified visually were also compared with those selected by the classification algorithms while training on the datasets. This comparison is needed to confirm that the key attributes were identified correctly from the visualizations, especially where the relationships between attributes cannot be clearly seen.

Classification model evaluation
Multiple machine learning classification algorithms were trained on the datasets, including MLP-ANN, Naïve Bayes (NB), Support Vector Machines (SVM), K Nearest Neighbor (KNN), and LDA. The idea was to evaluate which one is better at producing a reasonably accurate prediction rate of students' performance for small datasets. MLP-ANN and LDA were chosen because some researchers found them efficient in dealing with small datasets and in producing more accurate results, especially in the fields of face and speech recognition and financial market forecasting (Mustafa, Allen, & Appiah, 2017; Pasini, 2015; Sharma & Paliwal, 2015). MLP is a type of artificial neural network that processes multiple inputs to produce multi-label output. It accepts nominal or numerical attributes and can be used as a classification or regression algorithm. LDA, by contrast, is a dimensionality reduction algorithm that tries to create a linear relationship between different classes, while minimizing the scatter of each class and maximizing the distance between the class centroids and the central point of all of them (Qiao, Zhou, & Huang, 2009). It predicts the class of a variable using two or more numeric attributes. The NB algorithm, on the other hand, computes the probability that a certain class label will appear given that a certain condition has already occurred. This classifier was fundamentally designed to accept categorical attributes, but it can also support normally distributed numerical inputs. It is an advantageous method since it can make use of a small training set to create the classification model (Dey, Chakraborty, Biswas, Bose, & Tiwari, 2016). It is also equipped with a kernel density estimator that can handle nonparametric variables.
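For reference, the two verbal descriptions above correspond to the following textbook formulations. They are given here as the standard forms of naive Bayes and of Fisher's discriminant criterion, not as equations taken from the study; the symbols (class c, attributes x_i, projection w, class means mu_k, overall mean mu, class sizes N_k) are notation introduced for this illustration.

```latex
% Naive Bayes: choose the class c with the largest posterior, assuming the
% attributes x_1, ..., x_n are conditionally independent given the class.
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% LDA (Fisher criterion): find the projection w that maximizes the
% between-class scatter S_B relative to the within-class scatter S_W.
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_B = \sum_{k} N_k (\mu_k - \mu)(\mu_k - \mu)^{\top},
\qquad
S_W = \sum_{k} \sum_{x \in C_k} (x - \mu_k)(x - \mu_k)^{\top}
```

In the second expression, S_W captures the scatter of each class around its own centroid (to be minimized), while S_B captures the distance of the class centroids from the overall central point (to be maximized).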
As for KNN, it can be used for classification or prediction problems: given a value of K (the number of neighbouring instances) and numerical variables, the algorithm predicts a class label as the most frequently occurring label among the K nearest instances. Likewise, SVM works for classification and prediction problems, and the idea behind it is to find a boundary (a line or hyperplane) that best separates the groups of labels. It was developed to deal with numeric attributes, and it handles nominal ones after converting them to numeric data types. In short, the above description of the selected machine learning algorithms explains why they were selected and, most importantly, identifies the attribute types that should be used with each algorithm to allow it to perform efficiently. However, since no machine learning algorithm is considered good in all use case scenarios (such as training on a small sample or accurately predicting students' performance, as the literature in (Asif et al., 2017) suggests), this research examines all the aforementioned algorithms and evaluates them in terms of their classification accuracy rates in order to select the most accurate algorithm for creating the students' grade classification model. The evaluation metrics used for the best performing classifiers are the accuracy (the number of correct predictions divided by the total number of predictions) and Cohen's kappa (which is a more reliable accuracy metric). Notably, since the datasets are small, Leave-One-Out Cross Validation (LOOCV) is used as the validation method, since it is considered the most preferable and advisable validation method for small sets (Rao, Fung, & Rosales, 2008). Instead of segmenting the dataset into training and testing sets, LOOCV uses all the dataset instances except one to train the machine learning models.
This process is repeated so that a different data point is tested in each iteration, and the average accuracy over all tested points is the output accuracy rate of each classification model. As the baseline against which the reasonableness of the evaluation accuracy results is compared (i.e. the point that should be improved upon), the probability of occurrence of the mode grade value (i.e. the most frequent grade) is used, measured with this equation:
$$P(\text{Grade } x) = \frac{\text{Number of Grade } x \text{ instances}}{\text{Total grade instances}} \times 100\%$$
That method is called zeroR classification, and it is a function in the Weka tool that calculates the probability of occurrence of an attribute's values (Litman & Forbes-Riley, 2004). Also, Cohen's kappa (K) measures the models' accuracy in comparison with the accuracy expected from the random occurrence of attribute values. The kappa baseline starts from zero, which means that the algorithm produces an accuracy rate similar to that of stochastic prediction. This metric is considered efficient and reliable for nominal attributes and for imbalanced (nonparametric) dataset attributes (or when there is skewness in the class frequency distribution) (Kuhn, 2008; Mchugh, 2012). Last, the overall accuracy p-value is used to examine how reasonable or significant the classifiers' accuracy is in predicting the class of interest relative to the baseline, i.e. the no information rate (NIR). The applied alpha is 0.05, and the null hypothesis is rejected if p < 0.05. So, the proposed hypotheses are:
- Null hypothesis (H0): there is no difference between the accuracy predicted by the classification algorithms and the NIR (the accuracy of random prediction).
- Alternative hypothesis (H1): there is a difference between the accuracy predicted by the classification algorithms and the NIR (the accuracy of random prediction).
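The evaluation protocol described above (LOOCV, accuracy, the zeroR baseline or NIR, Cohen's kappa, and a p-value against the NIR) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' R code: a 1-nearest-neighbour rule stands in for the real classifiers, the p-value is assumed to be a one-sided exact binomial test of accuracy against the NIR (a common way of computing a "P-Value [Acc > NIR]"), and all function names and data are made up.

```python
from collections import Counter
from math import comb


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in classifier (assumption): label of the closest training point."""
    closest = min(
        range(len(train_X)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(train_X[i], x)),
    )
    return train_y[closest]


def loocv_predictions(X, y, predict):
    """Hold out each instance once, train on the rest, and collect the predictions."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predictions.append(predict(train_X, train_y, X[i]))
    return predictions


def accuracy(y_true, y_pred):
    """Correct predictions divided by total predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def no_information_rate(y_true):
    """zeroR baseline: share of the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        return 0.0  # assumed convention when only one class occurs
    return (observed - expected) / (1 - expected)


def p_value_above_nir(y_true, y_pred):
    """One-sided exact binomial test: P(X >= correct) for X ~ Binomial(n, NIR)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    nir = no_information_rate(y_true)
    return sum(comb(n, k) * nir**k * (1 - nir) ** (n - k) for k in range(correct, n + 1))


# Toy data: two well-separated groups of encoded grades (made-up values).
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
y = [3, 3, 3, 4, 4, 4]
y_pred = loocv_predictions(X, y, nearest_neighbour_predict)

print("accuracy:", accuracy(y, y_pred))
print("NIR:", no_information_rate(y))
print("kappa:", cohen_kappa(y, y_pred))
print("p-value [Acc > NIR]:", p_value_above_nir(y, y_pred))
```

With only six instances the numbers are purely illustrative, but the structure mirrors the protocol: every instance is predicted by a model trained on the other instances, and the resulting accuracy is judged against the majority-class baseline rather than in isolation.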

Datasets summary statistics
Since accuracy is the key metric on which the evaluation of the machine learning models relies, the baseline accuracy (also called the 'no information' rate) was calculated first for both datasets. The calculated baseline and the results of the preliminary descriptive analysis of the datasets of interest are shown in Table 2, and the related code is shown in the Additional file 1. After that, missing values were identified (mainly in the dissertation dataset), as shown in Fig. 5. Some missing values were treated by eliminating the attribute (like: dissertation instructor name), and the other missing values were replaced with the mode of the corresponding attribute.

Key attributes
To achieve the first aim of this research, the dissertation dataset was assessed for its key indicators using the heatmap.2 function from the gplots library in R. In that function, the attributes of the dataset were grouped according to their similarity by the traditional agglomerative hierarchical clustering algorithm embedded in the heatmap function. In other words, since clustering is performed on both rows and columns, the attributes and values that are similar to each other are grouped close together in one cluster. After observing the dataset's heat map and column dendrogram (Fig. 6 and Fig. 7, respectively), the top five features found close to the dissertation grade attribute (i.e. Grade) were: Grade2 (grade for Course 2), Grade1 (grade for Course 1), Grade5 (grade for Course 5), Grade6 (grade for Course 6), and Grade3 (grade for Course 3). These attributes are considered the key indicators for predicting a student's grade in the dissertation course, as their correlations placed them in one cluster at dendrogram height 1.5. That answers the sub-question of the first research question and demonstrates the efficiency of visualization and clustering in identifying the key attributes. Note that the success of the visualization analysis depends on scaling attributes with large values, i.e. Student Age, into the range used by the other attributes. In addition, before passing the dataset to the dendrogram visualization function, attributes with zero or very low standard deviation or variance were excluded to avoid invalid correlations and output errors.
The same visualization techniques were repeated to identify the best indicators in the all-courses dataset, to help in answering research question 2. As a result, Fig. 8 and Fig. 9 illustrate the features that were correlated with students' grades in all courses (i.e. Grades): students' age (Stu.Age), bachelor GPA (BSc.GPA), and specialization (BSc.Deg). Their relations barely appeared in the heat map, but they clearly formed one dendrogram cluster at height 1.4. Therefore, as a partial answer to the second question, the clustering analysis indicated that pre-admission attributes (i.e. students' age, bachelor degree, and GPA) had a stronger relation to student grades than the other attributes. In addition, the heat map visualization was effective in showing the dominant grade label in both datasets, as the color that represents grade 'A' in the grade 4 attribute was the most widespread one, but for grade 1, grade 'Fail' was the dominant grade.
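The idea of finding the attributes that sit closest to the target can be illustrated with a much simpler stand-in than heatmap.2. The sketch below does not perform hierarchical clustering: it ranks attributes by the correlation distance 1 − |r| to the target and skips zero-variance attributes, as the text describes. The distance measure, the function names, and the data are assumptions for illustration; the distance and linkage actually used by the authors are in their R code (Additional file 1).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation, or None if either column has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return None
    return covariance / (spread_x * spread_y)


def rank_by_correlation_distance(columns, target):
    """Return (attribute, distance) pairs sorted from closest to farthest from the target."""
    ranked = []
    for name, values in columns.items():
        if name == target:
            continue
        r = pearson(values, columns[target])
        if r is None:
            continue  # zero-variance attribute: excluded to avoid invalid correlations
        ranked.append((name, 1 - abs(r)))
    return sorted(ranked, key=lambda item: item[1])


# Made-up, already encoded and scaled values for five students.
columns = {
    "Grade": [4, 3, 4, 2, 3],
    "Grade1": [4, 3, 4, 2, 3],
    "Grade2": [4, 3, 3, 2, 3],
    "Constant": [1, 1, 1, 1, 1],
    "Stu.Age": [0.2, 0.9, 0.1, 0.5, 0.4],
}
ranking = rank_by_correlation_distance(columns, "Grade")
print([name for name, _ in ranking])
```

A ranking of this kind only looks at each attribute's relation to the target, whereas a dendrogram also shows how the attributes group among themselves, which is why the clustering view can reveal relations that barely show in the heat map.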

Evaluation of classification models
The key indicators extracted from the visualization analysis were fed into the five chosen classification algorithms. Since the chosen classification algorithms can be trained on two different attribute types, i.e. nominal and numeric, both were tried. The accuracy results were then evaluated to see which variable type works efficiently with each classification algorithm when training on the datasets of interest. As a result, Fig. 10 and Fig. 11 clearly show that the SVM model (with radial kernel) reported the highest accuracy rate in predicting students' grades in all courses (i.e. the Grades attribute) and in the dissertation project (i.e. the Grade attribute). Note that the algorithm names underlined (U) on the x-axis are those that worked efficiently with nominal attributes. The prediction rates, in comparison with the baseline, are 76.3% and 69.7% for the dissertation grade and the all-courses grade class, respectively. SVM with the radial kernel function was chosen since it is able to train on and deal with imbalanced datasets. Additionally, the kappa results showed that LDA's kappa in predicting students' dissertation grades is 44.7%, which is considered better than predicting the same class labels randomly. However, for the all-courses classification (i.e. the Grades attribute), SVM's kappa was the highest and exceeded the baseline, recording a rate of 41.7%. All the aforementioned results and the comparison between the different attribute types are given in Tables 7, 8, and 9. To evaluate the significance or credibility of the achieved accuracy rates in contrast with those of random prediction, the p-values were extracted from the confusion matrix function for all trained machine learning algorithms and for both datasets. The outcomes, as presented in Fig. 12, indicate the significance of SVM's accuracy results in classifying students' grades in all courses and in the dissertation, because the recorded p-values were 0.0003 and 0.03, respectively. Since the p-values were less than 0.05, the proposed null hypothesis was rejected. To rephrase, the accuracy results achieved by the SVM radial classifiers on both datasets exceeded the baseline and were accepted as significant accuracy rates, making the SVM radial-kernel model the best-performing classification model among the tested algorithms. That answers the remaining parts (about the significance of the accuracy rate) of research questions 1 and 2.

Discussion & Conclusion
Predicting students' performance in postgraduate study is important for any educational institution. It is especially important for those aiming to give students opportunities to do something useful in their field of study, and for those aiming to manage well the teaching resources needed for an excellent learning experience, like the British University. The British University in Dubai is a start-up research-based institute that aims to improve its reputation and ranking by selecting high performing students and engaging them in solving real world issues. So, identifying distinguished students in advance is a pressing need. Additionally, knowing students' performance in each course beforehand is a main requirement for helping at-risk students, by mitigating the challenges they face in their learning journeys and helping them excel in the learning process. However, such predictions are a challenge, especially for a new university, since there are not enough records to analyze. Nonetheless, our results show that it is possible to do so with reasonably significant accuracy rates. The support vector machine classifier with radial kernel was the one that proved its efficiency (among the classifiers tested) in predicting students' performance in all course grades, including their dissertation project grades. The main reason that may explain that classifier's success is the model training method it uses, which relies on only a few data points or samples (those that are very close to the hyperplane) to build its classification model. That result did not match the research findings in (Mustafa et al., 2017; Pasini, 2015; Sharma & Paliwal, 2015), which demonstrate the efficiency of LDA and MLP-NN in treating small datasets. But it agrees with (Asif et al., 2017) that there is no perfect classifier that works efficiently for similar dataset characteristics in different use case scenarios.
Moreover, since the attribute values in each class (for both datasets) were imbalanced, and for the purpose of generalizability (i.e. to measure the accuracy while avoiding the bias that the imbalanced data may create during training), another performance measure, called balanced accuracy, was used. Balanced accuracy is an evaluation metric that takes the average of sensitivity (or recall) and specificity to calculate the accuracy rate for a certain class of an attribute (Brodersen, Ong, Stephan, & Buhmann, 2010). The balanced accuracy rate was extracted from the confusion matrix of the classification results of all tested classifiers, for class A (4) of the grades attribute only. These rates were then compared with the calculated accuracy baseline, and the result is shown in Fig. 13. LDA recorded the highest accuracy rates, with values of 79% and 77% for the classification of class A in the two datasets, respectively. Note that, although SVM produced acceptable accuracy results, it is still more susceptible than LDA to being biased by imbalanced dataset observations while training the model. Despite that, both have proved to be reliable since their kappa results not only exceeded their baselines but also