Framework for automatically suggesting remedial actions to help students at risk based on explainable ML and rule-based models

Higher education institutions often struggle with increased dropout rates, academic underachievement, and delayed graduations. One way in which these challenges can potentially be addressed is by better leveraging the student data stored in institutional databases and online learning platforms to predict students’ academic performance early using advanced computational techniques. Several research efforts have focused on developing systems that can predict student performance. However, there is a need for a solution that can predict student performance and identify the factors that directly influence it. This paper aims to develop a model that accurately identifies students who are at risk of low performance, while also delineating the factors that contribute to this phenomenon. The model employs explainable machine learning (ML) techniques to delineate the factors that are associated with low performance and integrates rule-based model risk flags with the developed prediction system to improve the accuracy of performance predictions. This helps low-performing students to improve their academic metrics by implementing remedial actions that address the factors of concern. The model suggests proper remedial actions by mapping the students’ performance in each identified checkpoint with the course learning outcomes (CLOs) and topics taught in the course. The list of possible actions is mapped to this checkpoint. The developed model can accurately distinguish students at risk (total grade < 70%) from students with good performance. The area under the ROC curve (AUC ROC) of the binary classification model fed with four checkpoints reached 1.0.
The proposed framework may help students perform better, increase institutions’ effectiveness, and improve their reputations and rankings.

Albreiki et al. Int J Educ Technol High Educ 2022, 19(1):49
There have been many attempts to forecast student performance, including the automatic identification of at-risk students, with the aim of ensuring student retention (Bengio et al., 2021) and allocating appropriate courses and resources. Conventionally, educational institutions use older teaching methods to provide technical and nontechnical education (Kuzilek et al., 2015). However, a new form of education based on e-learning must be adopted if an educational institution is to overcome the current challenges (Ha et al., 2020; Hussain et al., 2018). The internet has made it easier for modern educational institutions to compete well in the modern environment. Moreover, students can study at home or learn new skills using various e-learning platforms, such as intelligent tutoring systems (ITSs) (Mousavinasab et al., 2021), learning management systems (LMSs) (Costa et al., 2017; Zhao et al., 2020), and massive open online courses (MOOCs) (Al-Rahmi et al., 2019).
The competitive environment also provides higher education institutions with many ways to sustain long-lasting innovation. Data mining (DM) (Hernández-Blanco et al., 2019) is particularly effective when combining ideas from different fields, and has been used to extract important information from raw data. Recent studies (Liao et al., 2019; Iatrellis et al., 2021) have identified new possibilities for technology-enhanced learning systems that can be tailored to each student's needs. The application of educational data mining (EDM) can ensure a learning environment that is appropriate for specific students (Prenkaj et al., 2020).

Predicting students' performance
The performance of individual students can be predicted with great accuracy using educational data (Koprinska et al., 2015). Prediction assists students in making informed decisions about which courses to choose based on their skills (Kuzilek et al., 2015), can be used to develop study plans (Ha et al., 2020), and aids instructors and administrators in ensuring that students obtain the best possible outcomes. This minimizes the number of official warning signals, and consequently the expulsion rate, which may otherwise affect an educational institution's reputation (Ha et al., 2020). Early predictions of student performance may allow decision-makers to take appropriate action at the right time. Furthermore, it may allow them to plan proper training schedules in order to increase student success. For instance, dropouts may experience increased risk of poverty or antisocial behavior, as well as difficulties adjusting to society. Thus, failing to increase the retention rate may negatively affect students, parents, academic institutions, and society (Ha et al., 2020). The detection of at-risk students can be used to improve student retention rate and institutional effectiveness.
Monitoring students' performance is a challenging task for several reasons, including the identification of at-risk students (e.g., special needs, low performance) (Bengio et al., 2021), restricted access to certain aspects of the curriculum, education, or assessment (basic skills versus the whole spectrum of courses) (Koprinska et al., 2015), and the difficulty of using limited data to supply and predict instructional techniques, interventions, and supports.
Each course has specific requirements for enrollment based on educational background, skill-set, and hands-on experience. The overall objective of automatically identifying students' performance at the course level can help to modify existing programs.
Lowering the dropout rate by assisting students in predicting their chances of success in a course before they enroll is therefore crucial (Goga et al., 2015). Student performance may be improved if the course instructors have a better understanding of their students' capacities, allowing teaching tactics to be modified accordingly (Koprinska et al., 2015).

Machine learning and algorithms
Special forms of education, primarily virtual education, have received considerable attention (Lykourentzou et al., 2009). As a result, many businesses and educational institutions focus on automated performance analysis to measure academic success and determine student requisites (Iatrellis et al., 2019; Liao et al., 2021). Various machine learning (ML) algorithms (Ha et al., 2020; Iatrellis et al., 2021; Liao et al., 2019; Tomasevic et al., 2020) are currently being used to train, analyze, and evaluate the performance of students, aided by data collection techniques that improve the learning platform's usability and interactivity. All of this can be equated with artificial intelligence (AI) (Evangelista, 2021).
ML is very accurate in the early prediction of a student's performance (Buenaño-Fernández et al., 2019; Fahd et al., 2022), and can thus be used to improve education programs, reduce dropout rates (Goga et al., 2020), and enhance retention rates (Bengio et al., 2021). Numerous studies (Buenaño-Fernández et al., 2019; Fahd et al., 2022; Ha et al., 2020; Iatrellis et al., 2021; Tomasevic et al., 2020) have proposed ML- and statistical-based techniques for the early prediction of students' performance, but only a few have proposed remedial solutions (Goga et al., 2015; Tomasevic et al., 2020; Zhao et al., 2020). The primary purpose of these research papers was to establish a scale that could be used to assess undergraduate students' impressions of course content and determine which students were in danger of failing (Ha et al., 2020). They also examined whether novel teaching approaches lowered dropout rates. A third goal was to understand aspects that may have an impact on perceptions of anxiety and performance in a course setting (Goga et al., 2015). Finally, they determined whether or not the instructors would use the suggested approach to enhance student learning (Koprinska et al., 2015; Tomasevic et al., 2020).

Prospective goals
For many years, educators and legislators have been working to create a reliable system that would aid instructors in identifying students who are in danger of poor performance (Evangelista, 2021). However, the most intricate systems are expensive, heavily reliant on data, and only provide forecasts (Goga et al., 2015). Thus, it is important to develop a reliable warning system that does not require the installation of a complex database or high expenditure, so that all students have equitable access to an education and a brighter long-term future. Towards this goal, the present study aims to use ML- and rule-based models to automatically identify and help students who are at risk of failing a course and suggest remedial actions. The ML- and rule-based models operate by finding important patterns in the students' data through EDM. The overall aim is to help students to achieve their educational goals and for academic institutions to control their dropout rates.

Literature review
Educational institutions are finding it increasingly challenging to evaluate and forecast the performance of at-risk students due to a scarcity of labeled data and appropriate statistical techniques (Alboaneen et al., 2022). This has led to an increasing number of students with poor grades and a rise in student dropout rates (Koprinska et al., 2015). Therefore, techniques based on support vector machines (SVMs), random forests (RF), linear regression (LR), and additive regression (AR) have been proposed (Goga et al., 2015; Koprinska et al., 2015). Big data plays a crucial role in addressing real-life challenges, because different data mining techniques can be used to create value from the enormous volumes of data that are continually being created. Some studies have developed their own datasets, while others have used existing datasets (Prenkaj et al., 2020). Only the development of ML methods has made it possible to provide more reliable predictions about students' performance (Li et al., 2012).
Several studies have presented methodologies for using students' grades and course evaluations to forecast the performance of at-risk students (Albreiki et al., 2021a;Altujjar et al., 2016). Techniques such as Naive Bayes classifiers, K-nearest neighbors (KNN), SVMs, and neural networks have revealed the variety of variables that affect students (Koprinska et al., 2015;Kruck & Lending, 2003). For example, Kruck & Lending (2003) discussed the aspects connected with school, community, and family, all equally contributing to putting students at risk of dropping out.
Predicting students' performance in higher education helps to identify students that may underperform in various subjects (Moonsamy et al., 2021). Recognizing the necessary support required by at-risk students can be extremely helpful because instructors can then take timely and appropriate actions to improve the skills of these students (Purwaningsih & Arief, 2018). Moreover, the capacity to anticipate student achievement in a course or program opens doors to new possibilities, such as improving educational outcomes for all students (Alturki et al., 2016). Compared with past practices, the advent of accurate prediction systems that can successfully determine students' performance allows teachers to better distribute resources and teach according to the students' needs.

Learning management systems
One of the more novel ways of assessing student performance is to employ an LMS. The development of e-learning technology has made it simpler for educational institutions to deliver quality learning materials to their students (Hu et al., 2014). These LMSs also give valuable insights into how students interact with the system, their engagement time, and behavior analytics. Parameters such as the number of times a student has interacted with the course content (Zhao et al., 2020), how many times a student has taken quizzes and tests, and how active a student is while viewing an educational video or textual content can easily be recorded and analyzed. However, setting up an LMS requires an enormous amount of time. This is because ensuring that all teachers are comfortable with e-learning demands proper training, which is costly and time-consuming. Moreover, ongoing administrative expenditures are incurred in ensuring that the interface remains tailored to the requirements. There is also the disadvantage of requiring coding and IT expertise to modify and update the LMS according to the organization's requirements, which places a financial burden on higher education institutions. Finally, several LMSs (Zhao et al., 2020) have adopted a "freemium" model with restricted functionality, with only paid features offering extra support and reporting. This is another challenging issue.

Machine learning algorithms
An effective LMS relies on efficient data processing. This is where ML algorithms (MLAs) come in. The use of MLAs, statistical methodologies, learning analytics, and data mining technologies has enabled researchers to examine and anticipate student performance in higher educational institutions. Different studies have utilized different MLAs, such as regression models (Hasan et al., 2020), to uncover findings related to student performance (Shahiri et al., 2015). For example, one investigation examined how students' programming activity impacts the course results (Watson et al., 2013). In another study, a model was developed to estimate how well students would do in their first college-level course (Kruck & Lending, 2003). A dashboard that allows instructors to monitor students' progress in different courses has also been proposed (Yadav et al., 2012), enabling early intervention when a student is thought to be underperforming in certain courses (Gong et al., 2019). These models have revealed that ML techniques are valuable for the early prediction of students' performance.
A recent study (Alboaneen et al., 2022) used ML and deep learning classifiers to predict student performance. The authors used LR, SVM, KNN, RF, and one neural network-based technique. The mean absolute percentage error was used to evaluate the classifiers' predictions. The results showed that the midterm exam score greatly affected students' performance. Finally, the authors concluded that academic factors such as the students' background have a greater impact on performance than demographic factors. Another study (Urkude & Gupta, 2019) proposed a predictive model that outperformed naive Bayes, bagging, boosting, and RF methods in terms of categorizing and predicting students' performance, while a further study (Hu et al., 2014) used a decision tree classifier for early predictions of students' performance. Some recent work (Qazdar et al., 2019) used an ensemble of bagging, boosting, and voting to automatically predict students' performance. Prediction models for academic success have been established using the ID3 decision tree induction technique (Altujjar et al., 2016). Data relating to students from King Saud University in Riyadh, Saudi Arabia, who were enrolled in the Bachelor of Science degree in information technology were used to train and validate the models. In contrast, a different study (Li et al., 2012) used data from the University of West Florida (UWF) for the fall 2008, fall 2009, and fall 2010 semesters to evaluate students' performance in "Elements of Statistics," one of the most popular courses in general education. They summed up the different applicable solutions for different subjects, such as programming courses (Alturki et al., 2016), English language (Purwaningsih & Arief, 2018), and radiology (Cornell-Farrow & Garrard, 2020), as a means of ensuring effective learning and lessening student dropout rates.
Likewise, various MLAs have been compared in terms of examining student academic performance and enhancing the educational framework (Liao et al., 2019). The accuracy and recall were used to evaluate the robustness of the proposed model. A recent study (Prenkaj et al., 2020) predicted the final exam scores of students in the third week of the term using data collected by instructors using the Peer Instruction methodology.

Data mining
MLAs will only work with the data fed into them. This is why they need to be coupled with DM techniques. EDM and ML aid the analysis of classroom settings for students. For example, a case study at Greece's University of Thessaly (Ha et al., 2020) proposed a method for testing student performance. The authors used equivalent educational criteria and measurements to categorize the case study participants. In another study, the authors showed that the demographic data had no impact on classification and regression accuracy (Fahd et al., 2022), and artificial neural networks outperformed traditional MLAs when given student participation and past performance information. Moreover, the authors of a separate study reported that students' final marks might be estimated using ML classifiers based on their prior performance (Buenaño-Fernández et al., 2019). In contrast, other researchers (Zhao et al., 2020) used prediction algorithms and trained them with semester-level performance data provided by course teachers. A forecasting model that predicts the first third of a semester's student learning success has been presented (Dekker et al., 2020), and video learning analytics and DM have been employed to forecast students' overall performance at the start of the semester (Namoun & Alshanqiti, 2021).
Different publications assert the existence of distinctions between data qualities, data complexity, the degree of contribution significance, and the limitations of algorithms used in diverse applications (Zhao et al., 2020). For such purposes, large and complex datasets may be automatically analyzed by ML models, providing accurate results concerning students' performance and minimizing unexpected risks.

At-risk students and dropouts
One of the primary goals of utilizing LMSs, MLAs, and DM is to help at-risk students and prevent dropouts. One study reported that the dropout rate of students in computer programming courses was more than 50%, which was unexpectedly high compared with other courses (Kruck & Lending, 2003). The author reported that students experienced considerable variations in programming courses because of different coding abilities, different teaching methods and materials, and the students' interests, learning styles, and self-discipline. Another study used a supervised naive Bayes classifier to determine student performance in an English language course (Purwaningsih & Arief, 2018).
The study revealed that student backgrounds and prior skills at the start of the course could be used as predictors for measuring performance. It is important to note that these previous studies did not determine the possible reasons for students dropping out. Dropouts must be differentiated/segmented depending on student behavior, institutional level, and time. The limited effect of university officials regarding certain reasons for dropping out is another restriction. Finally, the findings of previous studies have revealed that the educational staff of higher education institutions are largely unaware of the dropout problem.
To reduce the dropout rate, we must consider several different perspectives. For example, Xing et al. (2015) emphasized that student mental health is a crucial factor in determining the likelihood of dropping out. The authors recommended chatbot treatments and a curriculum-wide life-crafting intervention. Recent research (Gupta et al., 2020) employed 12 semi-structured interviews with university staff and LSS (Lean Six Sigma) professionals to better understand student dropout rates and the impact of LSS tools in reducing these rates. The authors suggested that higher education institutions should retain extensive data and educate the appropriate authorities on the effect of student dropout rates so as to establish a student dropout typology. Moreover, the authors emphasized that educational settings should be less punishing. Dropouts can be minimized via consultation and tutoring, because consultation significantly improves the number of students focusing on given activities and reduces the number of inefficient instructors.

Student performance model
One model that takes advantage of all aforementioned strategies is the student performance model (SPM). A recent survey (Albreiki et al., 2021b) highlighted the most promising strategies for predicting students' performance, along with the current limitations and challenges. Different ML and statistical methods have been used to determine the academic and demographic characteristics of those students who are most at risk of failure. Many existing SPMs are based on statistical approaches, using probability and estimation to predict students' performance, and thereby offer a strong basis for decision-making as a means of improving teaching/learning outcomes. Moreover, several studies (Alhassan et al., 2020;Prenkaj et al., 2020) have proposed predictive models and discussed the influence of hidden factors that are peculiar to students, lecturers, the learning environment, and the family, together with their overall effect on student performance, using balanced and unbalanced datasets (Inyang et al., 2019).

Interventions
So how do ML techniques provide remedial interventions for at-risk students? The authors of a recent study (Borrella et al., 2022) used two primary techniques to provide interventions. First, the proposed prediction algorithm identified students at risk of dropping out, and a portion of these students were assigned to an A/B testing experimental environment. Second, the authors employed data analysis to identify target populations of at-risk students. The study recommended that educators assess whether the instruction time is sufficient and students are getting adequate attention, because students need a certain amount of time with appropriate instruction, practice, and feedback. In addition, the study also recommended that educators assess whether the class learning environment promotes opportunities for students to respond and whether the teaching is aligned with students' learning requirements. Instructors should promote one-on-one instruction, which often suits the learning requirements of students who demand more explicit and methodical teaching. As a result, the classroom atmosphere can be improved, and dropouts and suspensions can be reduced. However, the outcomes of this research (Borrella et al., 2022) are subjective due to the diverse range of student backgrounds.

Learning outcomes
The results from the aforementioned interventions can be gauged by the use of learning outcomes. A learning outcome is a statement describing what students should know or do after a class, course, or program, and explains why students should achieve the desired goals. These outcomes assist students in making connections between what they have learned and how they may use it in other situations, such as in their professional lives (Koprinska et al., 2015;Tomasevic et al., 2020;Zhao et al., 2020). The emphasis of learning outcomes is not the quantity of material covered, but how well students can apply what they have learned, both inside the classroom and in the real world (Tomasevic et al., 2020). Moreover, student learning objectives should be obvious, visible, and quantifiable at both the course and program levels, and they should mirror the course and program requirements.
Identifying underachieving students and those who are excelling in school may be simplified by ensuring that program learning outcomes (PLOs) and course learning outcomes (CLOs) are fulfilled. Educators and managers may use PLOs and CLOs to design a wide range of educational initiatives. These may help students improve their grades, and may enhance student counseling and tutoring systems (Tomasevic et al., 2020). Moreover, the student solutions for assessment tasks can be submitted online, and the answers are checked against public and concealed tests established by the instructor. This will quickly enable the instructor to identify students' weaknesses and take adequate measures to ensure that the students obtain the necessary expertise and achieve the desired learning outcomes.
The impact of internet usage data on students' academic performance was the subject of recent research (Waheed et al., 2020). The goal of this study was to analyze and report on students' learning processes and contributions to individual achievement, and the proposed model achieved accuracy of 84-93% (Yukselturk et al., 2014). In addition, hierarchical cluster analysis and association rule mining have been used to determine the ideal number of failed course clusters and course grouping (Marbouti et al., 2016). Furthermore, an ML-based framework has been developed for predicting student performance at a high school in Morocco using school data from 2016-2018 (Alboaneen et al., 2022). Finally, an online undergraduate course's learning activities have been used to construct an early warning system using an LMS (Costa et al., 2017), while student learning outcomes have been predicted based on participation in online educational platforms (Wolff et al., 2013).

Research objectives
The goal of this study is to examine the potential of advanced ML strategies to improve the prediction of students' performance at the course level. Fig. 1 summarizes the methodology of this study. Specifically, we investigate the effectiveness of an "Explainable ML" model in conjunction with educational data for predicting students' final performance in programming courses. This research study develops solutions for identifying and predicting students at risk of failure, and suggests appropriate remedial actions to address the significant factors as early as possible. To address the main objective, we formulate the following tasks:
1. ML techniques are used to predict at-risk students as early as possible using course checkpoints.
2. An Explainable ML model is developed to identify contributing factors that can easily be interpreted by laymen.
3. A novel ML-based framework and rule-based models are proposed to improve the identification of students at risk of poor performance during the early stages of the learning process, enabling appropriate interventions to be implemented.

Fig. 1 Overview of methodology of this study

Data collection and dataset description
The educational data used in this study were collected from different sources, such as the Banner system, which contains students' information, instructors that taught programming courses, and documents manually extracted from the Ministry of Education portal.
The main data used specifically pertain to programming courses taught to undergraduate students at the College of Information Technology (CIT), United Arab Emirates University (UAEU). The students must take this course to accomplish the university's graduation requirements. Students from other colleges may take the course as an elective as part of their academic study plan. The data represent the performance of students in programming courses over different academic periods from 2016/2017 (fall and spring) until 2020/2021 (fall and spring). General demographics, course registration, and campus details were added to the data. The original dataset contained 730 records with 44 features before data analysis and classification. After removing inconsistent rows and features using univariate feature processing, the final dataset contained 649 samples and 38 features (see Table 1). The courses were not directed or specially designed for the experiments described in this paper. Based on the features of the data, we constructed three nonoverlapping datasets:
• Dataset D1 consists of 218 students enrolled in "Algorithms & Problem Solving," a description of which can be found in our previous paper (Albreiki et al., 2021a).
• Dataset D2 includes records of 230 students enrolled in "Object-oriented Programming." In addition to the students' performance in this course, we collected some data about their prior performance, demographics, enrolment, etc. (see Historical Features in Table 1).
• Dataset D3 consists of 201 students enrolled in "Algorithms & Problem Solving." Along with the students' performance in this course, we added information about the topics and CLOs/PLOs covered in each checkpoint. This allowed us to build a framework for automatically suggesting remedial actions.

Data preprocessing
The data preprocessing was divided into six phases. First, the course assessment files, student data (Banner system), and manually extracted documents were synthesized. Second, the compiled data were cleaned to remove any superfluous entries. Third, because of inconsistencies such as differences in file structures due to courses taught by different instructors, the data were unified to ensure homogeneity (structure unification). Next, missing data values were treated using an imputation technique in which missing entries were assigned the average value of the same coursework components. After data aggregation, standardization was carried out to convert the data from categorical to numerical values, integrate all files into one CSV file, and normalize the marks by employing min-max normalization (rescaling the features to the range [0, 1]). Finally, before obtaining the final output, we added an additional column based on rules and significant milestones in student performance. We divided the students into three main categories based on their total grade (TG), i.e., Good (TG ≥ 70%), AtRisk (60% ≤ TG < 70%), and Failed (TG < 60%) in datasets D1 and D2; Good (TG ≥ 70%) and AtRisk (TG < 70%) in dataset D3.
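The imputation, min-max normalization, and rule-based labeling steps described above can be illustrated with a minimal sketch. It assumes that "the average value of the same coursework components" means the mean of the observed marks for one component across students; the function names and the sample marks are hypothetical and are not taken from the study's code or data.

```python
from statistics import mean

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def min_max(column):
    """Rescale values to the range [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def label(total_grade, three_class=True):
    """Rule-based category from the total grade TG (in percent).

    three_class=True follows the D1/D2 scheme (Good, AtRisk, Failed);
    three_class=False follows the D3 scheme (Good, AtRisk).
    """
    if total_grade >= 70:
        return "Good"
    if three_class and total_grade < 60:
        return "Failed"
    return "AtRisk"

# Hypothetical marks for one quiz, with one missing entry.
quiz = [8.0, None, 6.0, 10.0]
filled = impute_mean(quiz)
scaled = min_max(filled)

# Hypothetical total grades labeled under both schemes.
labels_d1 = [label(tg) for tg in (85, 65, 40)]
labels_d3 = [label(tg, three_class=False) for tg in (85, 65, 40)]
```

Keeping the threshold rule in a single function makes it straightforward to switch between the three-class scheme used for D1 and D2 and the binary scheme used for D3.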
A typical data file structure (see Table 2) was employed, following that of Albreiki et al. (2021a). The notation used in this structure is as follows:
$C_i$: name of the predefined checkpoint
$g_{i,j}$: grade of the $j$th student at checkpoint $C_i$
$\max(g_{C_i})$: maximum possible grade for checkpoint $C_i$
$m$: number of students
$n$: number of checkpoints in the course
$i, j$: indices, $i = 1, \ldots, n$, $j = 1, \ldots, m$
... $Qz_i$), midterm grades MT, final exam grades FE, and the total grade TG, where $\cdot_D$ denotes the dataset used, $h_{D1} = 4$, $h_{D2} = 1$, $h_{D3} = 2$, $q_{D1} = 6$, $q_{D2} = 4$, $q_{D3} = 5$. All checkpoints were applied cumulatively up to the final exam as input variables to the model.

Explainable ML model
Recent MLAs are very accurate, but are often considered as black box models. When the model is used for decision-making, it is important to explain the reasons for a specific decision. Therefore, insights into the influence/importance of different features are crucial in increasing the confidence in model predictions. For this purpose, an interpretable model must be designed. This model may provide a quantitative relationship between the input variable and the model output. Local fidelity should also be ensured, meaning that the features that are locally important for a prediction can be identified. Finally, the proposed model should be modelagnostic or explain any MLA (Ribeiro et al., 2016). Let us consider a model m that belongs to the class of interpretable models M. We denote an input of model m as x = {x 1 , x 2 , ..., x n } ∈ R n . The corresponding interpretable representation of the vector x is x = {b 1 , b 2 , ..., b k } ∈ R n , b i = {0|1}, i = 1, ..., k . Vector x consists of k components that can explain the model output. The complexity of the model plays a crucial role in its "explainability". Let Ŵ(m) be a measure of the model's complexity. For instance, this may be the depth of the tree in a decision tree model.
In classification algorithms, the output of the classification model is the probability that the input vector corresponds to a certain class. In other words, $f: \mathbb{R}^n \to [0, 1]$, $f(x) = p$. Let $\Pi_x(s)$ be a proximity measure that defines a local region around the input vector $x$, where $s$ is a vector located close to $x$, i.e., the distance from $x$ to $s$ is small. As a distance measure, we could use the Euclidean, Manhattan, or cosine distances, among others. For instance, we can use the Gaussian kernel to represent $\Pi_x(s)$ as:

$$\Pi_x(s) = e^{-\frac{\|x - s\|^2}{2\sigma^2}} \tag{1}$$

where $\|x - s\| = \sqrt{\sum_{i=1}^{n} (x_i - s_i)^2}$ is a distance norm (i.e., the Euclidean norm) and $\sigma$ is a width parameter.

We can now formulate an optimization problem. To ensure that model $m$ approximates function $f$ in the proximity of input vector $x$, we minimize the loss function $L(f, m, \Pi_x)$ while ensuring that $\Gamma(m)$ remains at an appropriate level. Following Ribeiro et al. (2016), the explanation of the model at $x$ can be written as:

$$\xi(x) = \underset{m \in M}{\arg\min} \; L(f, m, \Pi_x) + \Gamma(m) \tag{2}$$

The features that contribute to the final model output can be identified by performing a search using perturbations. In other words, we learn the behavior of function $f$ using perturbed input vectors $s$ in the proximity of $x$, weighted by $\Pi_x$. For instance, if $m$ is linear, the fidelity function $L$ is as follows:

$$L(f, m, \Pi_x) = \sum_{s, s' \in S} \Pi_x(s) \left( f(s) - m(s') \right)^2 \tag{3}$$

where $s'$ is the interpretable representation of $s$ and $S$ is the set of all perturbed samples used to solve the optimization problem in Eq. (2).
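The two ingredients of this formulation, a Gaussian proximity weight and a locally weighted squared loss in the style of Ribeiro et al. (2016), can be illustrated with a minimal sketch. The function names, the toy black-box model, and the surrogate coefficients below are hypothetical and are not taken from the paper. For simplicity, the surrogate is evaluated on the perturbed vector itself rather than on a binary interpretable representation.

```python
import math

def proximity(x, s, sigma=1.0):
    """Gaussian kernel weight: equals 1 when s == x and decays with distance."""
    sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def fidelity_loss(f, m, x, samples, sigma=1.0):
    """Locally weighted squared loss between black box f and surrogate m."""
    return sum(proximity(x, s, sigma) * (f(s) - m(s)) ** 2 for s in samples)

# Hypothetical black box: a logistic score over two normalized grades.
def black_box(v):
    return 1.0 / (1.0 + math.exp(-(4.0 * v[0] + 2.0 * v[1] - 3.0)))

# Hypothetical linear surrogate centered on the instance being explained.
def surrogate(v):
    return 0.5 + 0.9 * (v[0] - 0.5) + 0.45 * (v[1] - 0.5)

x = [0.5, 0.5]
# Perturbed samples on a small grid around x.
perturbed = [[0.5 + dx, 0.5 + dy]
             for dx in (-0.1, 0.0, 0.1)
             for dy in (-0.1, 0.0, 0.1)]
loss = fidelity_loss(black_box, surrogate, x, perturbed, sigma=0.25)
print(f"weighted fidelity loss: {loss:.6f}")
```

A smaller loss indicates that the surrogate tracks the black box more closely near the explained instance. In a full procedure, the surrogate coefficients would be fitted by minimizing this loss rather than fixed by hand.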

Research design
The proposed method for the early prediction of students at risk of low performance and suggesting appropriate remedial actions is illustrated in Fig. 2. There is an initial preprocessing phase in which the data are collected, integrated, and processed to form a proper dataset (see Sect. "Data preprocessing"). The preprocessed data are then passed through each of the objectives mentioned previously. The basic principle is to add checkpoint features to the ML model cumulatively. The explicit details are given below. Table 3 summarizes the objectives of this research study.
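The cumulative principle can be sketched as follows: at each stage, the feature set contains every checkpoint observed so far. This is only an illustration. The helper name is hypothetical, and the checkpoint names are those reported for D1 before the midterm exam.

```python
def cumulative_feature_sets(checkpoints):
    """Return the growing feature sets: [first], [first, second], and so on."""
    return [checkpoints[: k + 1] for k in range(len(checkpoints))]

# Checkpoints in chronological order (D1, before the midterm exam).
d1_checkpoints = ["Quiz1Norm", "HW1Norm", "Quiz2Norm", "HW2Norm"]
stages = cumulative_feature_sets(d1_checkpoints)
for stage in stages:
    print(len(stage), stage)
```

A model would be retrained and re-evaluated on each of these feature sets, which makes it possible to see how early a reliable prediction becomes available.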

Fig. 2 Pipeline of proposed framework
Albreiki et al. Int J Educ Technol High Educ 2022, 19(1):49

For objective 1, we employed advanced ML techniques to identify at-risk students as early as possible using only course checkpoints. Datasets D1 and D2 were used to classify students into Good, AtRisk, and Failed groups. As the model input, we used all checkpoints obtained prior to the midterm exam (MT). We built multiclass prediction models using eight ML techniques, namely the XGB classifier, LightGBM, SVM linear, naive Bayes, ExtraTrees, bagging, RF, and multilayer perceptron. At each step, we evaluated whether adding the next checkpoint to the model improved its performance significantly. We also assessed the potential value of historical features in improving the model's accuracy. This allowed us to assess the reliability of the proposed cumulative approach. Five-fold cross-validation was used to estimate how well the error rate generalizes to the population level.
To address objective 2, we followed the same steps as for objective 1. However, the purpose of this objective was not only to enhance the prediction results, but also to make the model more explainable for non-experts, such as educators and instructors. First, we applied feature selection methods such as information gain, Chi-square test, correlation coefficient, and the mean absolute difference (MAD). This allowed us to find the most informative features with respect to the model output. We then used the local interpretable model-agnostic ML model (see Sect. "Explainable ML model") to provide a qualitative understanding of the relationship between the input variables and the model's response. By explaining a representative set of cases, the user obtains a global understanding of our model. The model provides a generic framework for unraveling black boxes and addressing the "why" behind students' predictions or recommendations for those who are at risk. Finally, we compared the performance of the proposed model using different sets of input features (historical data, checkpoints, historical data and performance in course).
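As an illustration of one of the filter methods listed above, the sketch below ranks features by MAD. It assumes that MAD denotes the mean absolute deviation of each feature from its own mean, as is common in filter-based feature selection, so that a higher value marks a feature that varies more across students. The function names and the toy grades are hypothetical.

```python
def mean_absolute_difference(values):
    """Mean absolute deviation of a feature from its mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def rank_by_mad(table):
    """table maps feature name -> list of values; rank names by MAD, highest first."""
    scores = {name: mean_absolute_difference(col) for name, col in table.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Made-up normalized grades for five students (illustrative only).
toy = {
    "Quiz1Norm": [0.2, 0.9, 0.5, 1.0, 0.4],
    "HW1Norm": [0.9, 1.0, 0.9, 1.0, 0.9],
}
ranking, scores = rank_by_mad(toy)
print(ranking)
```

In this toy table the quiz grades spread much more widely than the homework grades, so the quiz is ranked first.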
For objective 3, we employed a novel framework using ML- and rule-based models for identifying students at risk of low performance during the early stages of the learning process, enabling appropriate interventions or remedial actions to be taken. We started our analysis by mapping the CLOs to topics and corresponding checkpoints. For this purpose, we worked with three instructors teaching the Algorithms & Problem Solving course. They composed a mapping table and suggested lists of remedial actions for each checkpoint. We applied our rule-based model (Albreiki et al., 2021a) to the checkpoints cumulatively to generate values for the risk flags. Subsequently, the ML model was employed to classify students into Good or AtRisk groups.

Evaluation measures
To assess the quality of the outcomes given by the classification methods, we calculated the sensitivity, specificity, area under the receiver operating characteristic (ROC) curve (AUC), accuracy, and balanced accuracy metrics. A confusion (error) matrix was constructed for each predictive model to show how well it distinguished between classes. The ROC curve and its AUC were used to evaluate the performance of the classifiers and summarize the trade-off between the true positive rate (TPR) and false positive rate (FPR) at different probability thresholds. The TPR (sensitivity) and the true negative rate (TNR, specificity) are defined in Eqs. (4) and (5), and the overall accuracy of the model is defined in Eq. (6), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix of the classification model, respectively. All models were trained using k-fold cross-validation. The metrics were calculated for each fold separately, and the averaged values were then used as the final measure.
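The metric computation and the per-fold averaging can be sketched as follows. The confusion-matrix counts are made up for illustration, the function names are hypothetical, and balanced accuracy is assumed to be the mean of sensitivity and specificity, which is its usual definition for binary classification.

```python
def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and balanced accuracy from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

def average_over_folds(fold_counts):
    """Compute the metrics per fold, then average each metric across folds."""
    per_fold = [binary_metrics(*counts) for counts in fold_counts]
    return {key: sum(m[key] for m in per_fold) / len(per_fold)
            for key in per_fold[0]}

# Made-up (TP, TN, FP, FN) counts for two folds.
folds = [(18, 70, 5, 7), (20, 68, 7, 5)]
averaged = average_over_folds(folds)
print(averaged)
```

Averaging per-fold metrics, rather than pooling all predictions into one confusion matrix, matches the procedure described above and gives each fold equal weight.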

Experimental results
In this section, we present the main results from the experiments outlined in Sect. 3. We show that the advanced and explainable ML- and rule-based models can improve the identification of students at risk of low performance during the early stages of the learning process, so that appropriate interventions can be implemented.

Exploratory data analysis
First, we inspected the attributes in datasets D1, D2, and D3 for normality. A Shapiro-Wilk test revealed that all attributes were non-normally distributed. Therefore, we used nonparametric statistical tests for further analysis. To check whether the data from the studied categories came from a common distribution, we applied the Kruskal-Wallis test to continuous features and the Chi-square test to categorical features.

D1:
Of the 218 students in the dataset, 60.09/16.97/22.94% were identified as being in the Good/AtRisk/Failed groups, respectively. All groups were significantly different in terms of students' performance for all checkpoints ($p < 0.05$). A statistical test revealed no significant differences between genders ($p = 0.458727$) for the observed groups.
$$\text{TPR (sensitivity)} = \frac{TP}{TP + FN} \tag{4}$$

$$\text{TNR (specificity)} = \frac{TN}{TN + FP} \tag{5}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

D2:
Of the 230 students, 57.39/17.39/25.22% were identified as being in the Good/AtRisk/Failed groups, respectively. The observed groups were significantly different in terms of all course checkpoints ($p < 0.05$). This trend was also evident when we compared grades in previously taken courses. For instance, the grades in high school math ($p = 3.82081 \times 10^{-5}$), high school physics ($p = 4.43989 \times 10^{-7}$), Calculus I ($p = 9.47833 \times 10^{-5}$), and Algorithms & Problem Solving ($p = 5.76147 \times 10^{-18}$) differ significantly between the Failed, AtRisk, and Good groups. The number of times the course was repeated also contributes to the segregation ($p = 1.55162 \times 10^{-8}$). Historical features revealed that the admitted age, college, and gender had no effect on the total course scores.

D3:
Of the 1 students from eight different sections, 81.59/18.41% were identified as being in the Good/AtRisk groups, respectively. A statistical test revealed significant differences between the performance for all checkpoints, except homework assignments. No influence of term or year on performance was evident ($p > 0.05$). There were significant differences between groups in terms of sections ($p = 0.00278$). This may be related to teaching style as well as gender differences. Due to the gender segregation policy in UAEU, each section is offered for either male or female students. We applied our previously proposed model (Albreiki et al., 2021a) to D3 to identify at-risk students at early stages. All risk flag values differed significantly between the Good and AtRisk groups. The number of remedial actions invoked was also significantly different ($p = 5.93659 \times 10^{-15}$).
Correlation analysis shows that students' performance for all checkpoints is positively correlated with MT and TG.

ML techniques for predicting at-risk students using course checkpoints
We divided the students into three main classes based on their total grades for the course (Good, AtRisk, and Failed). We applied eight MLAs (XGB classifier, LightGBM, SVM linear, naive Bayes, ExtraTrees, bagging, RF, and multilayer perceptron) to D1 and D2 to predict the groups of students based on their TG performance. We considered only the checkpoints before the midterm exam, namely Quiz1Norm, HW1Norm, Quiz2Norm, and HW2Norm for D1, and Quiz1Norm, HW1Norm, and Quiz2Norm for D2. Finally, we calculated the precision, recall, F1-score, and AUC for all of the algorithms. Table 4 summarizes the results for objective 1. For D1, the ExtraTrees classifier achieved the best performance for this objective. It outperformed the other seven state-of-the-art algorithms with an AUC score of 0.96 and an accuracy score of 0.86. For D2, the ExtraTrees classifier also outperformed the other algorithms, achieving an accuracy score of 0.87 and an AUC score of about 0.95, as shown in Table 5.

Advanced and explainable ML Model for enhancing prediction results by adding prior knowledge
Even though the traditional ML model successfully predicts at-risk students, it cannot identify the factors that contribute to students falling into this category. Thus, we conducted a series of experiments to identify at-risk students at a sufficiently early stage and predict their MT and TG performance during the course period. The predictions were obtained in three experiments using different features, as described below:
• Experiment 1: Using only historical features. We used 18 features from the dataset in this experiment. These features cover historical student data, such as the student's age, registered hours, high school GPA, math grade, physics grade, number of repeated programming courses, citizenship, gender, sponsor, residency, and so on (see Table 1).
• Experiment 2: Using only course checkpoints. We used three features from the dataset, namely Quiz1Norm, Quiz2Norm, and HW1Norm.
• Experiment 3: In this mixed experiment, we combined all of the 21 features used in experiments 1 and 2.
We used the same eight ML classifiers. The feature_importances_ attribute in Python was used as a feature selection method to improve the efficiency and effectiveness of the predictive model. Figure 3 shows the most important features in the dataset. Based on the ten most important features identified by each classifier, we then attempted to predict which of the students would fall into the three groups of Good, AtRisk, and Failed. Table 6 presents the prediction results using the combined features. Based on the ten most important features, we were able to predict the groups of students based on their MT and TG performance with AUC scores of 0.95 and 0.97, respectively. By incorporating prior knowledge and selecting the most important features, we were able to improve the prediction results. Table 6 also shows that there are overlapping features/predictors (such as HS_GPA, Qz1Norm, CENG205, HW1Norm, MATH, and PHYS) that affect the performance of the students in this course. After predicting the students' performance successfully, our objective was to generate trust in our model. For this, it is important to explain the model to ML experts and domain experts such as instructors and educators. To this end, Fig. 4 presents the results of running the explainable ML model for experiment 3. There are three sections in Fig. 4: Failed students are displayed in blue, AtRisk students are indicated in orange, and Good students are shown in green. All three sections consist of three columns.
The left-hand side of the visualization (blue section) presents the predictive probability distribution per class. The student shown is predicted to fail with 90% confidence. Based on the LGB model results, the features with the most influence on the "Failed" class are presented on the right-hand side. In the center of the plot, we see a condition per influential feature and its strength (i.e., its contribution to the model output). We find that 45% of this score can be attributed to the "Repeated Grade (ITBP219/CSBP219)" value, 20% of this score comes from Quiz1Norm being less than or equal to 0.42 (normalized value), and the remainder is attributable to the values of HW1, CENG205, CENG202, CSBP121, PHYS, MT, MATH, and so on.
The AtRisk student falls into the orange section with 97% confidence. Based on the LGB model results, the center of the plot gives a condition per influential feature and its strength. In this case, 25% of the score can be attributed to the "Repeated Grade (ITBP219/CSBP219)" value, 21% of the score comes from MT being greater than 0.50 and less than or equal to 0.60, and the remainder is attributable to the values of CENG202, Qz1, PHYS, PHYS105, and so on. Finally, the green section gives the predictive probability distribution per class for a student classified as "Good" with 89% confidence. Some 22% of this score comes from the MT value being between 0.6 and 0.75, 21% can be attributed to the HW1 value being greater than 0.95, and the remainder comes from the other values.

Novel framework and immediate remedial actions for improving students' performance
We first created a mapping table to link the CLOs with the topics and assessment checkpoints. Three instructors teaching the D3 course composed Table 7. From this table, we can see that some assessments address two or more topics. For each checkpoint, a list of remedial actions ($RA_i$, $i = 1, \dots, 10$) was proposed. Before the beginning of the course, the instructor should compose such a table. Once this is done, the proposed framework can be used by feeding the model with formative or summative assessments.

Table 7 (excerpt): Topic 10 (T10): Passing arrays to methods (Chap. 9) • Two-dimensional arrays (Chap. 9) • Implement searching (sequential/binary) • Sort an array using bubble/insertion sort

We now propose a novel framework that uses ML- and rule-based models to identify students at risk of low performance during the early stages of the learning process, enabling appropriate interventions to be implemented. Our model combines a rule-based model (Albreiki et al., 2021a) with binary ML classification to predict each student's class based on the students' cumulative grades, i.e., Good ($TG \geq 70\%$) and AtRisk ($TG < 70\%$) in D3. Using the rule-based model, we can generate risk flag (RF) features every time new checkpoint values are inserted into the model. When a student's performance drops below the threshold (70%), the cumulative RF value is updated (Albreiki et al., 2021a). We then add the checkpoints and RF features to the model cumulatively to predict the performance of students based on their groups. In addition, we compare the output of the proposed model with two sets of input features (course checkpoints only and checkpoints with RF features). Based on the mean AUC value, the ExtraTrees classifier performed best with both sets of input features (see Table 8), outperforming the other seven classifiers. Table 9 presents the mean AUC values of the best classifier for both sets of input features. The prediction results clearly improved as the features were cumulatively added. Table 9 also shows that, by adding risk flags from the rule-based model, the performance improved by 2.31%. As a result, we can predict the students' performance at the first checkpoint of the course with a reasonable level of accuracy, which will benefit both students and instructors. Finally, proper remedial actions can be taken during the course by mapping the predicted risk probability to a list of actions associated with each checkpoint, as shown in Fig. 5.
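The interplay of risk flags and remedial actions can be sketched as follows. The exact update rule of the rule-based model (Albreiki et al., 2021a) is not reproduced here: the sketch assumes that the cumulative flag increases by one whenever a normalized checkpoint grade falls below 0.70, and it triggers actions directly from grades rather than from a predicted risk probability. The function names, the action map, and the grades are all hypothetical.

```python
RISK_THRESHOLD = 0.70  # normalized grade below which a checkpoint raises a flag

def cumulative_risk_flags(grades, threshold=RISK_THRESHOLD):
    """Return the cumulative flag count after each checkpoint, in order."""
    flags, count = [], 0
    for grade in grades:
        if grade < threshold:
            count += 1  # assumed update rule: one increment per weak checkpoint
        flags.append(count)
    return flags

def suggest_actions(checkpoints, grades, action_map, threshold=RISK_THRESHOLD):
    """Collect the actions mapped to every weak checkpoint, without duplicates."""
    suggestions = []
    for name, grade in zip(checkpoints, grades):
        if grade < threshold:
            for action in action_map.get(name, []):
                if action not in suggestions:
                    suggestions.append(action)
    return suggestions

# Hypothetical checkpoint-to-action mapping composed by the instructor.
action_map = {"Quiz1": ["RA1", "RA2"], "HW1": ["RA2", "RA3"], "Quiz2": ["RA4"]}
checkpoints = ["Quiz1", "HW1", "Quiz2"]
grades = [0.55, 0.90, 0.60]  # made-up normalized grades for one student

print(cumulative_risk_flags(grades))
print(suggest_actions(checkpoints, grades, action_map))
```

The flag sequence would be appended to the checkpoint grades as additional input features, while the suggested actions would be shown to the student and the instructor.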
To validate the usage of the proposed framework, we assessed the distribution of the total grade values with respect to the number of remedial actions. Figure 6 shows that a greater number of remedial actions corresponds to a lower total grade. The linear relationship between the number of remedial actions and the total grade was also assessed using Pearson's correlation coefficient. The calculated value of −0.735 is statistically significant ($p = 9.25 \times 10^{-36}$). Therefore, the proposed customized model can be considered and used as an effective warning system to identify at-risk students in the early stages of a course.
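For reference, Pearson's correlation coefficient can be computed as in the sketch below. The data are made up for illustration and do not reproduce the study's values. The function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up data: number of remedial actions vs. total grade (%).
actions = [0, 1, 2, 3, 4, 5]
total_grade = [92, 85, 80, 71, 66, 58]
r = pearson_r(actions, total_grade)
print(f"r = {r:.3f}")
```

A value close to −1 indicates that more invoked remedial actions go together with lower total grades, which is the direction of the relationship reported above.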

Discussion and future work
Several studies using ML classifiers to predict student performance have obtained varying degrees of accuracy: 56.25% (Yadav et al., 2012), 65%, 80% (Muñoz-Carpio et al., 2021), 85% (Iatrellis et al., 2021), 93% (Evangelista, 2021), and an AUC score of 0.79 (Liao et al., 2019). In comparison, the method proposed in the present study obtained an AUC score of 0.96 and an accuracy of 0.86 (ExtraTrees on D1) when predicting the total grade using only the checkpoints available before the midterm exam. Furthermore, ML- and statistical-based techniques for early prediction of students' performance have been utilized in a variety of studies (Buenaño-Fernández et al., 2019; Fahd et al., 2022; Ha et al., 2020; Iatrellis et al., 2021; Tomasevic et al., 2020). Nonetheless, previous works (Ha et al., 2020; Iatrellis et al., 2021; Tomasevic et al., 2020) primarily focused on detecting at-risk students, and only a few explainable ML and rule-based models have been discussed. These studies did not examine the features that are most influential in predicting students' performance or identify what factors put a student at risk. In contrast, the proposed method not only predicts the total grade with high accuracy, but also produces explainable ML outputs, providing insightful and useful information to non-experts about the features that affect the students' total grade, whether course checkpoints (e.g., Qz1, HW2) or student-based factors (e.g., high school GPA, high school grade, pre-requisite courses, age). Early predictions of at-risk students' performance are crucial. Providing relevant and appropriate remedial solutions to these students is another important problem.
Several studies (Gupta et al., 2020; Koprinska et al., 2015; Tomasevic et al., 2020) have provided a list of remedial interventions for students considered to be at risk of poor performance, such as an intense academic program (Tomasevic et al., 2020), mental health support (Xing et al., 2015), less punishing educational settings (Gupta et al., 2020), and timely feedback (Borrella et al., 2022). However, the available research does not provide any clear suggestions on how these remedial solutions assist at-risk students. The present research has suggested proper remedial actions by mapping the students' performance in each checkpoint with the CLOs and topics taught in the course. For example, if the proposed model predicts that a student will not perform well in quiz 2, the student will be directly notified that he/she needs to take specific remedial actions. The list of possible actions is mapped to this checkpoint. This will help the student to perform better and increase the institution's effectiveness.
In future research, the authors aim to implement this framework as an automated solution for academic institutions and test it in real settings. For example, if a student is at risk, an automatic notification will be sent to the student, and the instructor will be notified with a list of suggested remedial actions. Moreover, the authors hope to improve the prediction results by tuning the hyperparameters and designing more sophisticated features using deep learning models. Finally, the authors will seek to apply this model to other datasets (courses) to validate the model output, and will liaise with instructors to obtain further feedback and inputs.

Conclusions
Early predictions of students' academic performance can play a significant role in planning suitable interventions, such as student counselling, intelligent tutoring systems, continuous progress monitoring, and policymaking. In particular, such interventions can improve academic performance during the learning process and reduce the number of students who drop out or graduate late. As such, effective prediction models directly help educational institutions improve their reputations and rankings. Despite recent technological advances, educational institutions continue to face issues obtaining early and accurate predictions of students' performance due to the non-incorporation of performance modules in most online and offline learning platforms. Therefore, an accurate prediction model of student performance is an urgent requirement for educational institutions. Furthermore, assessing students' performance in the early stages of the learning process helps facilitate the implementation of suitable strategies to mitigate the factors leading to dropouts or low performance at both the student and instructor levels.