Students’ complex trajectories: exploring degree change and time to degree

Abstract

The complex trajectories of higher education students are deviations from the regular path due to delays in completing a degree, dropping out, taking breaks, or changing programmes. In this study, we investigated degree changing as a cause of complex student trajectories. We characterised cohorts of students who graduated with a complex trajectory and identified the characteristics that influenced the time to graduation. To support this predictive task, we employed machine learning techniques such as neural networks, support vector machines, and random forests. In addition, we used interpretable techniques such as decision trees to derive managerial insights that could prove useful to decision-makers. We validated the proposed methodology taking the University of Porto (Portugal) as case study. The results show that the time to degree (TTD) of students with and without complex trajectories was different. Moreover, the proposed models effectively predicted TTD, outperforming two benchmark models. The random forest model proved to be the best predictor. Finally, this study shows that the factors that best predict TTD are the median TTD and the admission regime of the programme of destination of transfer students, followed by the admission average of the previous programme. By identifying students who take longer to complete their studies, targeted interventions such as counselling and tutoring can be promoted, potentially improving completion rates and educational outcomes without having to use as many resources.

Introduction

Higher education enrolment has increased in developing countries in recent decades (Barakat & Shields, 2019). In these countries, the democratisation of access to higher education has been promoted, eliminating the elite exclusivity of the university. This has happened because of the need to increase the number of highly qualified human resources to guarantee economic competitiveness and other factors related to the alteration of individual life. Indeed, investment in education has become increasingly important for young people as they realise that educational qualifications can improve their chances of finding better job opportunities (Wang, 2021). In parallel, a political movement to increase the number of young people entering higher education has been observed (Dias, 2015).

In several countries, namely Portugal, this pressure for expansion has led to increased system diversification, resulting in a binary system comprising university and polytechnic studies (Sousa, 2021), as well as in the emergence of the private sector (Teixeira et al., 2022). This process of massification has also led to the constitution of a large number of different programmes (Dias, 2015). In addition, the attempt to get more students into higher education has led to the provision of financial support for less advantaged students to cover tuition fees and other education expenses (Biffl & Isaac, 2002).

The increasing diversity of the student population and of higher education institutions and programmes has encouraged the diversification of trajectories (Haas & Hadjar, 2020). Indeed, the process of higher education is not always linear (Rosenberg et al., 2018), and students have been experiencing increasingly complex trajectories, including dropping out, stopping out for a time, transferring between programmes or institutions, enrolling part-time, and taking longer to conclude degree programmes (Goma, 2023; Rosenberg et al., 2018).

These complex processes result from various factors within and across multiple dimensions of the educational system. Each student’s psychological, social, and cognitive features may impact their progress. In addition, the background and the level of preparation of students also constitute important determinants of their trajectories. It is also widely recognised that students’ experiences, university environment, and professors may play a critical role in their trajectories (Bowman & Holmes, 2018; Tinto, 1994; Xerri et al., 2018; Xie et al., 2015).

A significant part of students’ complex trajectories is transferring to other programmes (Hovdhaugen, 2009). This is a particular issue in countries such as Portugal, where access to higher education is guided by a numerus clausus system established through a national public tender to prioritise the admission of students with higher access classification (Ferrão & Almeida, 2018). This leads to more than half of Portuguese students not accessing the course/establishment pair assumed as their first choice when applying for higher education (Casanova & Almeida, 2016). Thus, many students who end up being placed in the other options find ways to change course. Some resort to the course transfer system, which has vacancies for course changes every curricular year. Other students choose to re-apply the next academic year, hoping to gain access to their course of choice (Almeida et al., 2016). Similar scenarios happen in different countries, with students using the first year as a bridge to their preferred course (Okun et al., 2009).

It is important to note that changing programmes involves two decisions. The first decision is to leave the original programme. The second decision is to either leave higher education or change to an alternative degree programme within higher education (Tieben, 2020). The determinants of course transfer have been shown to differ from those of dropping out (Ferrão & Almeida, 2018). The conditions of the institution play a dominant role in the decision to quit a course (Berger & Braxton, 1998), while the opportunities and conditions outside the university are relevant considerations when deciding to drop out (Tieben, 2020).

In this paper, we focus our analysis on students with a complex trajectory, namely those who have transferred from the programme in which they enrolled at university to a different one. We aim to propose a method that enables programme managers and counselling and tutoring service providers to determine the time to degree (TTD) for each student. Studies show that transfer students are more likely to either not complete their degrees or take longer to complete their degrees when compared to students who have not experienced a complex trajectory (Townsend & Wilson, 2009). Transfer students are forced to navigate through complicated systems in order to take advantage of course credits and to enrol in courses, which can lead some of them to feel unwelcome or even marginalised by their new institution or degree programme (Chin-Newman & Shaw, 2013). Transfer students can also struggle with social integration as they try to find their place in a new context (Utter & DeAngelo, 2015). These barriers to social and academic integration can lead to transfer students taking longer to conclude their studies.

Having identified the students most likely to have a longer TTD, higher education institutions can design policies to prevent longer trajectories. For example, institutions may ask students who are predicted to take shorter routes to interact and share experiences with the other students, as a way to shorten the expected routes of the latter.

This paper seeks to make several contributions to the literature. First, we explore students’ trajectories, a topic that remains largely unexplored, as studies focusing on individual students and their specific trajectories are rare. Second, we study complex trajectories, particularly those involving a transfer, a topic that has been mostly neglected. Third, we explore the TTD of students after they transfer to another programme, which, to the best of our knowledge, has not been addressed before. We seek to estimate the time that newly enrolled transfer students need to conclude the new programme by applying several machine learning techniques, namely random forests, bagged trees, and boosted trees. The application of such ensemble methods to the educational data mining field is still in its early stages, although their predictive performance is generally high. Lastly, we aim to provide decision-makers with information on the factors that impact the TTD of students by evaluating the importance of student-related variables to the prediction model.

This paper is structured as follows. The following section presents related studies in order to emphasise the contributions of the current study. The "Methodology" section introduces the methodology and data used in the current study, the variables included in the proposed model, and the criteria used to evaluate the performance of the model. The "Results" section presents the results, which are discussed in the "Discussion" section. The "Study limitations" section highlights the limitations of the study, and the "Conclusions and future work" section presents the study’s conclusions and ideas for future research.

Literature review

Student trajectory in higher education refers to the “progression through higher education including all transitions (e.g. from undergraduate to graduate studies) and states (e.g. enrolment patterns such as part-time vs full-time enrolment) within a certain period (e.g. academic year or 3-year life period)” (Haas & Hadjar, 2020). Giani (2015) divided the trajectory of higher education students into seven stages: application, acceptance, enrolment, persistence/transfer, attainment, graduate school entry, and graduate school attainment.

The literature has only begun to address student trajectories in higher education (Haas & Hadjar, 2020). A literature review on student trajectories between 1999 and 2018 identified only 27 articles (Haas & Hadjar, 2020). Most of these studies address the reality of USA institutions and use nationally representative large-scale data. The limited availability of data, particularly longitudinal student data, may be one reason for this lack of studies on student trajectories (Haas & Hadjar, 2020). This type of study deserves more attention, particularly because of the developments in higher education in the last decades. Facilitated access to higher education for underrepresented social groups (Hadjar & Becker, 2009) has encouraged the expansion, diversification (Schofer & Meyer, 2005), and heterogeneity of student populations, which should motivate studies of student trajectories.

The literature on student trajectories has adopted three distinct research designs (Haas & Hadjar, 2020). The first set of studies has focused on describing trajectory types and patterns (Robinson, 2004). The second and third sets of studies have focused on answering specific questions concerning students’ attributes and determining who follows which trajectories and why (Giani, 2015; Goldrick-Rab, 2006).

The study of complex trajectories is a niche within the topic of student trajectories. Complex trajectories include dropping out, stopping out for some time, transferring between programmes or institutions, enrolling part-time, and taking longer to conclude a degree programme (Goma, 2023; Rosenberg et al., 2018). Despite the relevance of longitudinal studies on complex trajectories, the literature has only focused on very specific issues such as dropping out, neglecting the fact that students may transfer between study programmes or institutions, interrupt their studies, or slow down the pace of study (Haas & Hadjar, 2020). Indeed, dropping out has been the most explored dimension (e.g. Berzenski, 2021; Tieben, 2020). Most of the studies on dropping out have focused on predicting whether a student is prone to quit their studies. In contrast, few studies have sought to explore the transfer of students between programmes (e.g. Rodríguez-Gómez et al., 2016). However, empirical studies have shown that the determinants of programme transfer differ from those of dropping out (Ferrão & Almeida, 2019). For example, Terenzini et al. (1981) and Yi (2008) showed the distinct impact of student characteristics on events such as transferring and dropping out. In this context, we may conclude that there is a need for research into the complex trajectories of higher education students, such as those who transfer between programmes.

In parallel, the literature on TTD, i.e. the number of years it takes for a student to complete a higher education degree, is also scarce, particularly in terms of its estimation (Bhaskaran et al., 2017). TTD is a relevant metric because it can show the efficiency of an educational system (Rayner & Papakonstantinou, 2022). The longer it takes for a student to graduate, the more resources are used by higher education institutions and by the student to achieve their final goal (Iatrellis et al., 2020). In the USA, about 41% of higher education students fail to graduate within six years (Basavaraj & Garibay, 2019). In Europe, only 23% to 30% of higher education students graduate within the expected time (Boegeholz et al., 2022). In this context, early estimation of each student’s TTD may be paramount. With such an estimate, institutions may design customised actions that prevent long trajectories. This is particularly relevant for students who have already experienced a programme transfer, as they already took longer to reach their preferred course.

For example, Hailikari et al. (2019) used interviews to categorise first-year students into six profiles and concluded that there were significant differences in graduation times among these profiles. They also found large differences in the completion rates of master’s degrees between professional and non-professional fields, with students from the humanities tending to prolong their studies due to a fear of unemployment. Rayner and Papakonstantinou (2022) also explored the TTD of students. In particular, this study sought to identify the predictors of TTD for undergraduates and concluded that the most relevant are the gender, the admission rank, the number of discipline majors, and the level of academic achievement.

Concerning the methodology adopted by the studies exploring educational data, recently machine learning methods have been gaining momentum (Karalar et al., 2021). Aldowah et al. (2019) reviewed 402 articles and identified the machine learning techniques used in educational data mining and learning analytics. According to this literature review, most of the studies adopt classification techniques (26.25%) and clustering (21.25%) (Aldowah et al., 2019). Romero and Ventura (2020) listed the following machine learning approaches, among others: causal mining to relate student behaviour to learning, academic failure, or dropping out; clustering to group materials or students; prediction of student performance and student behaviour; and social network analysis to interpret the structure and relationships in collaborative activities. Sghir et al. (2022) highlighted that predicting student performance dominates the field, followed by identifying at-risk students.

Regarding the research on predictive analytics in higher education, Sghir et al. (2022) reviewed several papers published in the last decade (2012–2022) in order to identify the algorithms and goodness-of-fit measures most commonly used. They concluded that artificial neural networks attained the best performance in classification problems, followed by random forests (Boehmke & Greenwell, 2019) and gradient boosting (Friedman, 2001, 2002; Mason et al., 1999). Decision trees, naive Bayes, ensemble methods, and k-nearest neighbours were also identified as popular algorithms. Concerning regression problems, single and multiple linear regression algorithms have been used in prediction tasks. For clustering problems, Sghir et al. (2022) identified seven articles using the k-means algorithm (MacQueen, 1967). In terms of performance measures, the ones most commonly used for classification tasks were, in descending order of frequency, accuracy, F-measure, recall, precision, area under the ROC curve (AUC), kappa, sensitivity, specificity, and the Matthews correlation coefficient (MCC). For regression problems, the measures used include Pearson’s R, the root mean square error (RMSE), the predictive mean square error (pMSE), and the predictive mean absolute percentage correction (pMAPC).

Concerning the variables used in predictive learning analytics, several are commonly associated with predicting graduation, dropping out, and academic performance. Sghir et al. (2022) classified predictor variables into five classes, as follows. Prior academic data includes the student’s records in secondary education (Berzenski, 2019; Tieben, 2019) or admission information such as admission exam grades (Carreira & Lopes, 2019) and scientific area (Carreira & Lopes, 2019; Rodríguez-Gómez et al., 2016). Demographic characteristics include personal data such as gender (Hashim et al., 2020; Martins et al., 2019; Sánchez-Gelabert et al., 2020), ethnicity (Berzenski, 2019; Monaghan, 2019), and age (Hashim et al., 2020; Monaghan, 2019; Tumen et al., 2008). They also include the socio-economic context of the student (e.g. the country of origin/residence) (Carreira & Lopes, 2019; Rodríguez-Gómez et al., 2016) and the educational and occupational level of the student’s family (Sánchez-Gelabert et al., 2020; Hashim et al., 2020). Academic data pertains to the student’s performance at the higher education institution. Other commonly employed predictors (Brezavšcek et al., 2017) include the area of study (Hashim et al., 2020), the number of completed credit points (Berzenski, 2019; Martins et al., 2018), the grade point average (GPA) (Berzenski, 2019; Hashim et al., 2020), and the time spent at each programme stage. Behavioural features are mostly employed with data retrieved from learning management systems (Aldowah et al., 2019; Prenkaj et al., 2021; Sghir et al., 2022), as they are easily accessible (Romero & Ventura, 2020). The motivation of the student is a relevant factor for success (Pardo et al., 2017; Wong & Chiu, 2019) and is a good example of the last category, which is psychological data.

In the context of higher education trajectories, Haas and Hadjar (2020) defined three levels of trajectory predictors. The macro level includes factors deriving from the regional and national structures of the higher education system (e.g. fees and financial aid). The meso level considers factors related to the organisational structures of higher education institutions (e.g. offering mentorship and the size of the programme). The micro level includes factors that depend on the student’s context, such as socio-demographics, expectations, and academic preparation.

Methodology

Research questions

In this study, we aim to examine the complex trajectories of higher education students, namely those that involve a change of programme. For these trajectories, we intend to estimate the TTD after the change to a different programme. We propose to use machine learning models to obtain accurate estimates of the TTD of students and help identify the factors that are most relevant to distinguish students presenting different progressions.

The proposed solution should benefit higher education institutions and students by allowing an early and accurate prediction of TTD and the subsequent implementation of remedial plans to reverse scenarios where a long TTD is expected. The identification of students with a lower TTD should also contribute to this inversion, for example, by encouraging more interaction between these students and those with a potentially higher TTD. Sharing experiences and practices may help to mitigate the difficulties that some students may encounter. Overall, this may enable institutions to maintain their reputation for academic excellence.

More specifically, we formulated the following research questions:

  • RQ1: Is the TTD of students who transfer to another programme different from that of students who have a non-complex trajectory?

  • RQ2: Can a machine-learning model that integrates variables characterising students’ previous trajectories infer the TTD after a transfer?

  • RQ3: What are the most important factors when predicting the TTD of students with complex trajectories?

Research design

The methodology proposed in this study is illustrated in Fig. 1.

Fig. 1

Overview of the methodology

This study focuses on the students who completed at least a degree programme during the analysis period. Moreover, it only considers the academic trajectory until the first degree completion. Having set the sample of analysis, the first step of the proposed methodology is to distinguish the students who transferred from one programme to another from those who did not transfer.

Secondly, the TTD is computed for each student, i.e. the difference between the time of conclusion and the time of enrolment in the final programme. It should be noted that in the present study, the TTD is expressed in years, since only yearly data was available. It is also worth pointing out that, as noted in the literature (see "Literature review" section), barriers to social and academic integration may mean that transfer students take longer to complete their studies. On the other hand, transfer students may be able to use credits from previous programmes, which may also affect the length of their studies. This computation is done for the groups of students who transferred from one programme to another and for those who did not transfer. Having collected this data, we propose using a Z-test to answer the first research question and thus establish the statistical relevance of the difference in the TTD of the two groups.
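The computation of the TTD and the comparison of the two groups can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually run in the study: the function names `time_to_degree` and `two_sample_z_test` and the enrolment and graduation years are hypothetical, the test is assumed to be two-sided with unpooled sample variances, and the tiny samples serve only to make the arithmetic visible (a Z-test presupposes large groups).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def time_to_degree(enrol_year: int, graduation_year: int) -> int:
    """TTD in years: year of conclusion minus year of enrolment in the final programme."""
    return graduation_year - enrol_year


def two_sample_z_test(a, b):
    """Two-sided Z-test for a difference in means between two independent samples."""
    # Standard error of the difference in means, using unpooled sample variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical (enrolment year, graduation year) pairs, for illustration only.
transfer = [time_to_degree(e, g) for e, g in [(2008, 2012), (2009, 2012), (2010, 2015), (2007, 2011)]]
regular = [time_to_degree(e, g) for e, g in [(2008, 2011), (2009, 2012), (2010, 2013), (2007, 2011)]]

z, p = two_sample_z_test(transfer, regular)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A small p-value would indicate that the mean TTD of the two groups differs by more than sampling variation alone would explain.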

Next, we propose characterising students after they have transferred to a new programme. Following the categorisation proposed by Sghir et al. (2022), we suggest including variables related to the prior academic background and demographics. In addition, we propose to describe students based on their academic-related variables (Sghir et al., 2022). We also propose to characterise both the original and the new programme, i.e. institution-related variables (Sghir et al., 2022). Table 4 lists all the variables used to define a student after a change. Figure 2 and the third column of Table 4 help to better understand the time frame corresponding to each variable. Some variables concern the period before enrolling in the university (\(h_s\)), namely the type of high school and the grades on the national admission exams. Other variables refer to the period between the admission to the university and the moment of the programme change (\(Y_p\)), e.g. the number of enrolments, the number of programmes enrolled in that period, or the percentage of time the student was working or displaced. Some other variables refer to the most recent academic year before the change (\(Y_r\)), e.g. the programme of origin or the cumulative number of credits completed before the change. Finally, we also accommodate variables collected at the moment of the transfer (\(i_c\)), e.g. final programme faculty, final programme duration, and whether the student has a scholarship or was working at the moment of the change.

Fig. 2

Overview of the methodology

It is worth noting that, because machine learning algorithms do not generally accept data with missing values, we propose to exclude from the study students for which the data is not complete. Thus, students presenting one or more missing values were not considered for the training of the predictive models. Moreover, categorical features were one-hot encoded, while ordinal features were ordinally encoded.
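The preprocessing described above can be sketched in a few lines. The example below is illustrative only: the column names (`school_type`, `parent_education`, `admission_grade`), the category levels, and the helper functions are assumptions made for the sketch and are not taken from Table 4.

```python
def drop_incomplete(records):
    """Keep only records in which no value is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


def one_hot(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            row[f"{column}={c}"] = int(r[column] == c)
        encoded.append(row)
    return encoded


def ordinal(records, column, order):
    """Replace an ordinal column with the rank of each value in `order`."""
    rank = {level: i for i, level in enumerate(order)}
    return [{**r, column: rank[r[column]]} for r in records]


# Hypothetical student records; the third one has a missing value.
students = [
    {"school_type": "public", "parent_education": "secondary", "admission_grade": 15.2},
    {"school_type": "private", "parent_education": "higher", "admission_grade": 17.8},
    {"school_type": "public", "parent_education": None, "admission_grade": 13.1},
]

complete = drop_incomplete(students)
encoded = ordinal(
    one_hot(complete, "school_type"),
    "parent_education",
    ["basic", "secondary", "higher"],
)
```

One-hot encoding avoids imposing an artificial order on nominal categories, whereas ordinal encoding preserves an order that genuinely exists in the data.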

In the last stage, we propose using a data mining prediction model that uses the variables introduced in Table 4 as independent variables and the TTD as the dependent variable. Thus, the second research question is answered. Although the TTD is discrete, we propose to treat this problem as a regression problem. Considering a predictive model based on classification algorithms would limit the scope of applicability of the model, as the model would only be able to predict the target values observed in the training dataset. Using a regression algorithm to train the model overcomes this limitation. In addition, students’ average grades exhibit a degree of continuity, although they were rounded to give a final whole number grade.

Following several studies on education (Casuat & Festijo, 2019; Cortez & Silva, 2008), we propose using the decision tree (DT), random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) regression algorithms. These are among the most popular regression algorithms and can be applied to relatively small datasets.

DTs are easy to apply, effective, and fast to train. The hierarchical tree structure resembles a human decision-making process, making DTs easy to understand (Czajkowski & Kretowski, 2016). They are also a white box algorithm, meaning they are explainable. The RF algorithm is an ensemble algorithm where several DTs are generated from a random vector sampled independently (Breiman, 2001) and combined by averaging the results to produce one single predicted value. RFs have good predictive performance and some level of interpretability. They are a good choice when the number of predictive attributes is large, as is the case in the present study, though at a high computational cost (Moreira et al., 2018). SVMs aim to find a hyperplane that best fits the data while minimising the margin of error, making them a powerful tool for non-linear regression. One of the strengths of SVMs is the ability to solve small sample, non-linear, and high dimensional pattern recognition problems while being memory efficient (Wang et al., 2016). An MLP, or artificial neural network, attempts to reproduce the functioning of the human brain. Each node in an MLP is equivalent to a neuron in the human brain. MLPs perform well in many real-life problems, even non-linear problems, and are robust to noise. However, MLPs are hard to interpret, owing to their hidden layers and the lack of mathematical foundation, and their training usually comes at a high computational cost (Moreira et al., 2018).

We propose tuning the hyperparameters using an exhaustive grid search with stratified k-fold cross-validation, where k = 5. To this end, the gathered dataset is first split into training and test datasets, the latter representing 20% of the original one. The training dataset is used to optimise the hyperparameters of each model, while the test dataset is used to assess the performance of the models.
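The tuning scheme can be illustrated with a short standard-library sketch. Several elements are assumptions made for the illustration: the data are synthetic, a one-feature nearest-neighbour regressor stands in for the DT, RF, SVM, and MLP models, the folds are stratified on the discrete TTD values, and the 20% hold-out is drawn at random.

```python
import random
from collections import defaultdict
from itertools import product
from math import sqrt


def stratified_folds(y, k, rng):
    """Assign each index to one of k folds, spreading every TTD value evenly."""
    by_value = defaultdict(list)
    for i, target in enumerate(y):
        by_value[target].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for target in sorted(by_value):
        indices = by_value[target]
        rng.shuffle(indices)
        for i in indices:
            folds[position % k].append(i)
            position += 1
    return folds


def knn_predict(X_train, y_train, x, n_neighbors):
    """Stand-in regressor: mean target of the nearest training points."""
    nearest = sorted(range(len(X_train)), key=lambda j: abs(X_train[j] - x))[:n_neighbors]
    return sum(y_train[j] for j in nearest) / len(nearest)


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def grid_search(X, y, grid, k, rng):
    """Exhaustively score every parameter combination by mean RMSE over k folds."""
    folds = stratified_folds(y, k, rng)
    best_params, best_score = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for fold in folds:
            held_out = set(fold)
            X_tr = [X[i] for i in range(len(X)) if i not in held_out]
            y_tr = [y[i] for i in range(len(y)) if i not in held_out]
            preds = [knn_predict(X_tr, y_tr, X[i], **params) for i in fold]
            scores.append(rmse([y[i] for i in fold], preds))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


rng = random.Random(0)
# Synthetic data: one numeric feature and a discrete TTD in years.
X = [rng.uniform(0, 180) for _ in range(100)]
y = [3 + round(x / 60) for x in X]

# Hold out 20% as the test set; tune only on the remaining 80%.
indices = list(range(len(X)))
rng.shuffle(indices)
test_idx, train_idx = indices[:20], indices[20:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]

best_params, best_rmse = grid_search(X_train, y_train, {"n_neighbors": [1, 3, 5]}, k=5, rng=rng)
```

Keeping the test indices out of `grid_search` entirely is the point of the design: the hyperparameters are chosen without ever seeing the data later used to report performance.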

Choosing adequate goodness-of-fit metrics is crucial for the performance evaluation of the models. We recommend the use of the RMSE, the mean absolute error (MAE), the coefficient of determination (\(R^2\)), and the mean absolute percentage error (MAPE). We suggest computing the confidence intervals at a confidence level of 95% by bootstrapping the test set predictions for the regression metrics. Regression models should be tuned to guarantee the lowest RMSE.
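As an illustration of these metrics and of the bootstrap procedure, the sketch below uses only the Python standard library. The percentile bootstrap, the number of resamples, and the toy `y_true` and `y_pred` values are assumptions; the text above does not state which bootstrap variant was used.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # Safe here because a TTD is never zero.
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def bootstrap_ci(metric, y_true, y_pred, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap: resample (truth, prediction) pairs with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in sample], [y_pred[i] for i in sample]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Toy test-set targets (TTD in years) and model predictions, for illustration only.
y_true = [3, 3, 4, 5, 3, 6, 4, 5]
y_pred = [3.2, 3.5, 3.8, 4.4, 3.1, 5.2, 4.5, 5.3]

point_rmse = rmse(y_true, y_pred)
rmse_low, rmse_high = bootstrap_ci(rmse, y_true, y_pred)
```

Only the fixed test-set predictions are resampled, so the interval reflects sampling variability in the test set rather than variability from refitting the model. On very small samples a resample can contain a single TTD value, in which case \(R^2\) is undefined; the sketch therefore bootstraps the RMSE only.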

Finally, we propose following the permutation feature importance approach to identify the importance of the features, answering the third research question. This model inspection technique is recommended for opaque models such as RFs (Breiman, 2001). It measures the impact of each feature on the model’s performance by randomly permuting the values of a single feature and observing the resulting change in the model’s predictive goodness-of-fit metric. The process involves the following steps. First, the model is trained on a dataset with all features intact, and its performance metric (in this case, the RMSE) is recorded as the baseline. Then, the values of one feature are shuffled randomly and the dataset is passed through the trained model again to obtain the new value of the performance metric. The difference between the baseline metric and the permuted metric quantifies the importance of that feature. The more significant the drop in performance (in this case, the greater the increase in the RMSE) after permuting a feature, the more influential that feature is considered to be. By evaluating the permutation feature importance for all features, it is possible to identify which variables have the highest impact on the model’s performance and gain insights into their relative importance.
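A minimal sketch of permutation feature importance follows, using only the Python standard library. The `predict` function is a hypothetical stand-in for an already trained model, the data are synthetic, and averaging over `n_repeats` shuffles is an assumption added to stabilise the estimate. Importance is measured as the increase in RMSE relative to the baseline.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            increases.append(rmse(y, [predict(row) for row in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def predict(row):
    """Stand-in for a trained model; it ignores the second feature."""
    return 3 + row[0] / 60


rng = random.Random(1)
# Synthetic data: the target depends on the first feature only.
X = [[rng.uniform(0, 180), rng.uniform(10, 20)] for _ in range(200)]
y = [3 + row[0] / 60 for row in X]

importances = permutation_importance(predict, X, y)
```

In this toy setting, shuffling the first feature raises the RMSE, while shuffling the second leaves it unchanged, so the second feature receives an importance of zero.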

It should be noted that in order to prevent overfitting, in addition to cross-validation, we chose to select simpler models with fewer parameters, like linear regression or simple DTs. We also used an ensemble algorithm (RF), which is less prone to overfitting. We minimised the risk of overfitting in the case of the neural networks through L1 regularisation and in the case of SVMs through C regularisation.

Case study and data description

This study uses data from the University of Porto (U.Porto), a Portuguese public research university. This university has approximately 34,000 students, 3400 academic staff and researchers, and offers undergraduate (bachelor) and graduate (master and doctorate) programmes in several fields, such as engineering, humanities, law, and medicine.

Students’ data was collected from the information system of the University of Porto after passing through a rigorous procedure related to data protection and ethical issues. This data covers the demographics and the prior academic, institutional, and current academic data of each student. Students’ data was anonymised to ensure individual students could not be identified.

The data gathered corresponds to yearly student information for those enrolled for the first time at the University of Porto between 2005 and 2015 in an entry-level degree, i.e. a bachelor’s (B) or an integrated master’s (IM) degree. The data available encompasses all the academic information of 52,822 students, obtained until 2020. The 54 courses offered by the 14 faculties of the University of Porto are represented in the dataset, of which 36 are bachelor’s degrees and 18 are integrated master’s degrees.

Figure 3 presents the ten most frequent trajectories in the period under review, which cover about 93% of all trajectories. “Not Enrolled” refers to the period in which a student who had previously enrolled at the university was not enrolled in any programme. “Final Programme” refers to the period in which the student was enrolled in the programme where they graduated. The period in which the student was enrolled in a programme other than the final programme is labelled “Other Programme”. Finally, the period labelled as “Graduated” refers to the year of graduation and subsequent years. It is worth noting that the figure’s time axis goes up to 16 years, as this corresponds to the longest academic trajectory available in the dataset. Among the most frequent trajectories, about two thirds lasted five, three, or six years, corresponding to undergraduate (three years) and integrated master’s (five and six years) students. In addition, there are also frequent trajectories that lasted four years, which corresponds to a bachelor’s degree taking one additional year to complete. The remaining popular paths show that about 20% of the students needed extra time to graduate. The complex trajectories, i.e. those with a different programme in the first year, were the least frequent in this top ten.

Fig. 3

The 20 most frequent trajectories of students who finished a degree

Looking at the annual enrolments of new students over the analysed period, shown in Fig. 4, it is possible to see that this number increased significantly between 2005 and 2007 and has gradually decreased since then. This can be partly explained by the economic crisis that Portugal experienced between 2010 and 2014. Nevertheless, the number of students enrolled at the University of Porto has always been higher than 3000.

Fig. 4 Number of students enrolled at the University of Porto for the first time and number of students who changed programme

If we focus on students who changed programmes and successfully graduated, Fig. 4 shows that they are a minority. In fact, each year, the maximum percentage of enrolments reflecting a change of programme in relation to the total number of re-enrolments is around 2.43%. Nevertheless, if we focus on all the students who transferred at least once and graduated, this corresponds to 2743 students, or 7.2% of the total number of students who were admitted to the University of Porto between 2005 and 2015 and graduated by 2020. It should be noted that the significant decline in the percentage of student transfers after 2015 is due to the fact that the data collected does not include new students from 2015 onwards, thus reducing the number of potential transfers. In addition, our study focuses only on students who already graduated, so it is possible that more students transferred in the last years of the period analysed but are not reflected in this graph.

With regard to transfers, it is important to look briefly at the programmes of origin and destination. Table 5 in the appendix lists the programmes from which students have transferred to other programmes, listed in descending order of frequency. The top programmes in terms of transfers are engineering and those related to health. The table also shows the three most frequent destination programmes and the respective frequency of students who made this transition. There seems to be a repeating pattern of transfers between programmes, since the most frequent destination courses represent the vast majority of transfers. An extreme example of this is the case of the integrated master’s degree in dental medicine, where 95% of the students who transfer are destined for medicine.

Table 1 illustrates an example of a complex trajectory of a student. More specifically, it shows the academic path of a student who enrolled in two programmes at the University of Porto, successfully completing the second. Over the course of six years, the student enrolled in two programmes, completing credits in both. For this student, the initial programme was a bachelor’s degree in Communication Sciences: Journalism, Public Relations, Multimedia and the final programme was a bachelor’s degree in Applied Languages, which they completed in three years. The moment \(i_0\) corresponds to the beginning of year one, while the moment \(i_c\) corresponds to the beginning of year six. Thus, \(Y_p\) corresponds to five years. Variables preceding admission to the university are identified in Table 4 as \(h_s\). Variables in the period \(Y_p\) include data collected during the time at the university before the transfer. Variables marked with \(Y_r\) were collected at the beginning of the year preceding the change, i.e. year five in this particular case.

Table 1 Complex trajectory example

Results

After identifying the students who transferred between programmes at least once in the analysed period, i.e. the complex trajectories, we computed the TTD of the students who transferred and of those who did not.

Fig. 5 TTD of students who transferred between programmes per degree type

Fig. 6 TTD of students who did not transfer between programmes per degree type

Figures 5 and 6 show the TTD of students who transferred between programmes and those who did not, distributed by degree type. It should be noted that, by definition, an integrated master’s degree takes longer than a bachelor’s degree, since an integrated master’s degree combines a bachelor’s and a master’s degree. Concerning the bachelor’s degrees, the TTDs have a similar distribution across both figures. However, there is a more significant asymmetry to the right in the case of the TTD distribution of the students who changed programmes. In the case of the integrated master’s degrees, the distribution for the two populations is more distinct. In this case, it is more evident that the TTD is generally higher in the trajectories with a transfer. The results of the Z-test with a one-sided alternative (see Table 2) show that the null hypothesis, i.e. there is no significant difference between the means of two populations, is rejected for both types of degree, i.e. integrated master’s and bachelor’s. We can therefore answer the first research question and state that the TTD is different for students who experience a complex trajectory and those who do not. This fact corroborates the literature and emphasises the need to develop specific models to predict TTD for students who have undergone a course change. While in the case of integrated master’s degrees the time taken by a transfer student to complete the programme is longer than that of a non-transfer student, the opposite seems to be true for undergraduate degrees.

Table 2 Z-test results
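The comparison of mean TTDs can be illustrated with a minimal sketch of a one-sided two-sample Z-test. The function name `two_sample_z_test` and the TTD values below are hypothetical and are not taken from the study; the sketch also assumes that the sample variances stand in for the population variances, which is only reasonable for large samples such as the cohorts analysed here.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z_test(sample_a, sample_b):
    """One-sided two-sample Z-test of H1: mean(sample_a) > mean(sample_b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Standard error of the difference in means, using sample variances.
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    z = (mean(sample_a) - mean(sample_b)) / se
    # Upper-tail p-value under the standard normal distribution.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical TTD values in years (illustrative only, not study data).
ttd_transfer = [5, 6, 6, 7, 6, 5, 8, 6, 7, 6]
ttd_no_transfer = [5, 5, 6, 5, 5, 6, 5, 5, 6, 5]

z, p = two_sample_z_test(ttd_transfer, ttd_no_transfer)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

Swapping the two arguments tests the opposite direction, which is the relevant alternative when transfer students are suspected of graduating faster.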

To characterise the students who transferred between programmes, we based our analysis on the variables introduced in the "Methodology" section. Some descriptive statistics are presented in Table 4. Most of these students are female, do not work, live at their usual residence (not displaced), and do not have a scholarship. It is interesting to note that the average age of the students at the time of transfer is 19.47 years, which means that most of them transfer between programmes in their second year of studies, as they are usually 18 years old when admitted to the university. After cleaning the data and excluding observations with missing values, 2047 students remained in the dataset.

Figure 7 shows the TTD values for the test dataset using the programme duration as the benchmark, i.e. the number of years foreseen in the study plan. A second benchmark prediction was made using the programmes’ median TTD, computed with the data from the training dataset. The plot for the latter is similar to Fig. 7, as the difference between the two values is small for most programmes. The graph shows that even though the actual values are distributed along a range of TTDs from one to ten years, the values predicted by the benchmark cover a smaller range, between two and a half and six years, with the smaller values corresponding to the bachelor’s degrees and the larger values corresponding to the integrated master’s degrees, namely the one in medicine. Table 3 shows the performance metrics of both benchmarks, which do not differ much, with a slightly better performance in the programme duration benchmark. The results show that the coefficient of determination is close to zero, demonstrating a non-existent correlation between the predicted values and the actual ones. Nevertheless, the MAE is relatively small, with a maximum value of 0.666 years, which means that the mean deviation between the predicted values and the actual ones is less than an academic year. These findings underline the need for advanced models to predict the TTD of transfer students.
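The two benchmarks and their metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the records, programme labels, and the helper `regression_metrics` are hypothetical, and the numbers are not those of the study.

```python
import math
from collections import defaultdict
from statistics import median


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and the coefficient of determination R^2."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return mae, rmse, 1.0 - ss_res / ss_tot


# Hypothetical records: (programme, nominal duration in years, actual TTD).
train = [("B-Lang", 3, 3), ("B-Lang", 3, 4), ("B-Lang", 3, 3),
         ("IM-Eng", 5, 5), ("IM-Eng", 5, 6), ("IM-Eng", 5, 7)]
test = [("B-Lang", 3, 3), ("B-Lang", 3, 5),
        ("IM-Eng", 5, 5), ("IM-Eng", 5, 8)]

# Benchmark 1: predict the nominal programme duration.
pred_duration = [duration for _, duration, _ in test]

# Benchmark 2: predict the programme's median TTD, computed on training data.
ttd_by_programme = defaultdict(list)
for programme, _, ttd in train:
    ttd_by_programme[programme].append(ttd)
median_ttd = {prog: median(values) for prog, values in ttd_by_programme.items()}
pred_median = [median_ttd[programme] for programme, _, _ in test]

actual = [ttd for _, _, ttd in test]
metrics_duration = regression_metrics(actual, pred_duration)
metrics_median = regression_metrics(actual, pred_median)
print("duration benchmark (MAE, RMSE, R2):", metrics_duration)
print("median benchmark   (MAE, RMSE, R2):", metrics_median)
```

Because each benchmark assigns a single value per programme, it cannot follow the spread of actual TTDs within a programme, which is why the coefficient of determination can sit near zero even when the MAE looks moderate.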

Fig. 7 Benchmark predictions (programme duration)

Figures 8 and 9 show the predicted values of TTD for the DT and RF algorithms considered in this study. The predictions for the other models are presented in Appendix D. The models were obtained with the algorithm implementations provided by the scikit-learn library for Python. The performance metrics of the four models are presented in Table 3, which includes the mean and the 95% confidence interval. Regarding the performance metrics, the first conclusion is that all models perform better than the best benchmark prediction. This is evident in all metrics but is most apparent in the coefficient of determination, with results above or close to 0.6. This means that the proposed models have a high potential to support higher education decision-makers, such as programme directors. Regarding the second research question, the trained machine learning models were able to use the variables characterising the students’ previous trajectories to predict the TTDs with a smaller error than the benchmarks used. This could be anticipated, as machine learning models are able to recognise complex patterns and identify interactions between features. Moreover, the use of a set of explanatory variables enables machine learning models to generalise better to unseen data.
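A comparison of this kind can be sketched with scikit-learn as follows. This is a minimal sketch, not the authors' pipeline: the data is synthetic, the hyperparameters are arbitrary, and the 95% interval uses a normal approximation over five cross-validation folds, which is an assumption since the text does not state how the interval was computed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the student features and the TTD target.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hyperparameters are illustrative assumptions, not the study's settings.
models = {
    "DT": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVR(C=10.0)),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated MAE (scikit-learn reports it negated).
    mae = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    # Normal-approximation 95% interval around the mean fold score.
    half_width = 1.96 * mae.std(ddof=1) / np.sqrt(len(mae))
    results[name] = (mae.mean(), half_width)
    print(f"{name}: MAE = {mae.mean():.2f} +/- {half_width:.2f}")
```

The SVM and MLP are wrapped in a pipeline with feature scaling because both are sensitive to feature magnitudes, whereas the tree-based models are not.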

The RF model performed the best according to all metrics. It had the lowest deviation between the predicted values and the actual ones while having the best correlation. The DT model had the worst performance metrics, even though the difference when compared to the other models was small. By analysing the predictions of each model, it is possible to identify some trends. The estimates from the DT model are translated into a cloud of points with a reduced range of values. The model could not predict TTD values higher than seven, even though they represent over three per cent of the test dataset. It also struggled to estimate values in the lower range, predicting values below the actual ones. In general, the DT model underpredicted the TTD of students with complex trajectories, since most of the points in the cloud are located above the diagonal line. The histogram of predicted values shows two large columns representing TTDs of five and six years, indicating a better performance for integrated master’s programmes.

Figure 9 shows the TTD values predicted by the RF model against the actual values, which are translated into a set of points that are close to the diagonal line. The predicted values cover a more extensive range, though the model did not catch the TTD of ten years. In the lower range, the cloud of points in Fig. 9 is also in line with the expected values. The points are scattered around the diagonal line in a smaller range, with a balanced number of points above and below. The histogram of predicted values shows three peaks, corresponding approximately to the values of three, five, and six years. These coincide with the three highest peaks of actual values.

The scatter plots of the SVM and the MLP are very similar. The scattering of points around the diagonal lines is wider in both these models when compared against the RF model. They share the same difficulties as the other models in trying to predict the TTD for large values. In terms of the histogram of the predicted values, they show a more continuous distribution of values. Although the most relevant peaks are still present, there is a quasi-continuous distribution of values in the histogram. The SVM model tends to underpredict the TTDs, especially for the lower values, while the MLP model tends to overpredict the TTDs.

Table 3 Performance metrics for benchmarks (BM) and models (mean and 95% confidence interval)

Fig. 8 Decision tree regression predictions

Fig. 9 Random forest regression predictions

Discussion

In line with several studies on educational data mining that focus on RF models (Martins et al., 2019), the results obtained in this study highlight the potential of this machine learning technique. However, although RF models tend to outperform other machine learning algorithms, their results are difficult to interpret due to the stochastic nature of the decision path. The RF model combines the predictions of randomly generated tree predictors and uses the ensemble’s variability to produce a more robust prediction (Breiman, 2001). Since RF models combine multiple DTs, there is no single decision path, making the model less interpretable. On the other hand, DTs, such as the one shown in Fig. 11, are explainable models where the decision processes are easy to track (Quinlan, 1993). In the very first level, the median TTD of the programme is used for splitting the observations, revealing that this is the most promising feature to discriminate TTDs. This may mean that students tend to follow a similar pattern to other transfer students, although this past trend was not enough to estimate TTDs. Indeed, students usually transferred from the same programmes. In the second level, the decision was based on the final programme admission regime and the number of years since the first enrolment, revealing that the way the students were admitted and the time taken to transfer between programmes (perhaps due to potential credit transfers) impacted their TTDs. The further down we go in the branches, the more features are used for the decision process.
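The traceability of a DT can be illustrated by printing its decision rules. The sketch below assumes synthetic data in which the target depends mostly on a hypothetical `median_ttd` feature; the feature names and coefficients are illustrative assumptions and do not reproduce the tree in Fig. 11.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: programme median TTD and years since first enrolment.
median_ttd = rng.choice([3.0, 5.0, 6.0], size=n)
years_since_first_enrolment = rng.integers(1, 5, size=n)
X = np.column_stack([median_ttd, years_since_first_enrolment])

# Synthetic target dominated by the programme's median TTD.
y = median_ttd + 0.3 * years_since_first_enrolment + rng.normal(0, 0.2, size=n)

# A shallow tree keeps every decision path short enough to read.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
rules = export_text(
    tree, feature_names=["median_ttd", "years_since_first_enrolment"]
)
print(rules)
```

Each printed branch is a complete decision path, so a prediction can be justified by listing the thresholds that a given student satisfied.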

Figure 10 shows the increase in the RMSE for the top ten features, i.e. the importance of the variables, computed for the RF model. In line with the results provided by the DT, the graph shows that the median TTD for the final programme is the feature that affects the RMSE the most. This result helps explain why the benchmarks used provided reasonable predictions even though they did not take into account any other information about the student. As mentioned before, this may occur because most students complete programmes following the pattern of their colleagues from other years. This is clearly shown in the distribution of TTDs in Figs. 5 and 6, where the peak durations for bachelor’s (three years) and integrated master’s (five or six years) degrees are visible. The second most relevant feature refers to the period in years from the first enrolment in the university to the year of the programme change. This variable can be used by programme directors to identify students who will have more difficulties in graduating within the expected time frame. The final programme admission regime, i.e. re-enrolment (R), is also relevant in the model. This may show that students who re-enrol after a break in their studies may be following a different academic path than those who transfer to another programme without a break in their studies. The variable representing the number of credits completed before the programme change together with the credits enrolled at the beginning of the year of change also plays an important role in the model. This may be connected with the possibility of having ECTS credits that were already completed recognised in the new programme. The admission average of the previous programme closes the top five most influential features. This feature belongs to the class of academic data and is often referred to in the literature as a strong predictor of academic success (Miguéis et al., 2018).

This model inspection technique shows that from the original 30 variables, a few dominate the model’s performance, answering the last research question proposed. Programme directors may also look at the patterns of previous transfer students regarding subject choices and the sequence of these choices in order to identify opportunities to reduce TTDs or to guide new transfer students.
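Permutation importance measured as an increase in RMSE can be sketched with scikit-learn's `permutation_importance`. The data, feature names, and the choice of a held-out split are assumptions made for illustration; the study's own features and evaluation data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Hypothetical features; the third one is pure noise by construction.
feature_names = ["median_ttd", "years_since_first_enrolment", "noise"]
X = np.column_stack([
    rng.choice([3.0, 5.0, 6.0], size=n),
    rng.integers(1, 5, size=n),
    rng.normal(size=n),
])
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.2, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# With a negated-RMSE scorer, each importance equals the mean increase in
# RMSE observed when that feature's column is shuffled.
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_root_mean_squared_error",
    n_repeats=10, random_state=0,
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda item: item[1], reverse=True,
)
for name, increase in ranking:
    print(f"{name}: RMSE increase = {increase:.3f}")
```

Shuffling a column breaks its link with the target while keeping its marginal distribution, so a feature the model relies on produces a large RMSE increase and an irrelevant one produces an increase near zero.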

Fig. 10 Feature importance using permutation on the full model

Overall, higher education institutions can benefit from accurately estimating TTDs, particularly those of students who have changed programmes. Identifying students at risk of taking longer to complete their degrees enables the promotion of early intervention and support measures to help them stay on track. This may include support services such as tutoring, mentoring, and career counselling to help students achieve their academic goals more effectively (Brock, 2010). In this way, the estimation of TTDs can enable resources to be allocated more efficiently by better understanding when students are likely to need additional academic support. In addition, TTD predictions can enable institutions to offer students alternative graduation trajectories that may be more appropriate for each student, thus encouraging shorter academic trajectories (Sidebotham et al., 2015).

Study limitations

The quality of the present research was limited by several factors, namely the quantity and quality of the available data. The number of transfer students was not large and their characterisation was rather limited, with missing values for some variables. Another limitation of the present study relates to the granularity of the data and the moment in which the data was collected. The available data refers to the beginning of the academic years, although data collected at the beginning of each semester would have been more appropriate. For example, the number of credits enrolled and approved differs for each semester of academic study and it may have been beneficial to consider this difference in the model.

Moreover, for each academic year, it was only possible to obtain the number of credits enrolled, the cumulative number of credits approved in the programme, and the students’ GPA for a given programme. The number of credits completed in a given year was not directly available. This number had to be computed based on the difference in the cumulative credits that students had in two consecutive years. However, when a student changes programme, it is impossible to compute the number of credits completed in the year before the change. Thus, we estimated the credits completed as the total number of credits completed before \(Y_r\) plus the number of credits enrolled in \(Y_r\). In addition, a student’s GPA is only known at the beginning of an academic year. Since most students change programme after the first year, the GPA of the previous programme at the moment of transfer is not known in most cases. For this reason, this variable, which the literature highlights as a significant predictor of student success (Berzenski, 2019; Iatrellis et al., 2020), was not included in the proposed models.
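The credit computation described above can be sketched as follows. The function names and the credit values are hypothetical; the second function implements the stated estimate, which in effect treats all credits enrolled in \(Y_r\) as completed.

```python
def credits_per_year(cumulative_credits):
    """Credits completed in each year, as differences between the cumulative
    totals recorded at the start of consecutive academic years."""
    return [
        later - earlier
        for earlier, later in zip(cumulative_credits, cumulative_credits[1:])
    ]


def estimated_credits_before_change(cumulative_at_start_of_yr, enrolled_in_yr):
    """Estimate used when the last year cannot be differenced: credits
    completed before Y_r plus credits enrolled in Y_r."""
    return cumulative_at_start_of_yr + enrolled_in_yr


# Hypothetical cumulative ECTS totals at the start of four consecutive years.
cumulative = [0, 48, 102, 150]
yearly = credits_per_year(cumulative)
print(yearly)  # [48, 54, 48]

# Hypothetical student with 48 credits at the start of Y_r and 60 enrolled.
estimate = estimated_credits_before_change(48, 60)
print(estimate)  # 108
```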

Most machine learning algorithms cannot handle missing values. For this reason, only 2047 of the original 2743 samples of students with complex trajectories were used to train the models. Not only does this result in a smaller dataset, potentially reducing the performance of the models, but it also reduces the diversity captured by the variables considered in the models. For example, in the time horizon considered in this study, there were transfer students who were not Portuguese. However, by not considering students with missing values, these students were excluded from the analysis and, consequently, the variable capturing the nationality was also excluded, due to the lack of different values. This can affect the quality of the models and prevent them from giving good results for situations not covered in the dataset used in the study. It should be noted that we chose not to impute missing values because of the possibility of introducing bias. In addition, imputing missing data using the wrong approach may lead to incorrect model predictions.
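The side effect described above, where dropping incomplete observations leaves a variable with a single value, can be reproduced on a toy example. The records and helper names below are hypothetical and serve only to illustrate the mechanism.

```python
def drop_incomplete(records):
    """Keep only the records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]


def constant_features(records):
    """Names of features that take a single value across all records."""
    return [key for key in records[0] if len({r[key] for r in records}) == 1]


# Hypothetical student records; one has a missing admission average.
students = [
    {"nationality": "PT", "admission_average": 15.2, "ttd": 4},
    {"nationality": "PT", "admission_average": 17.1, "ttd": 3},
    {"nationality": "BR", "admission_average": None, "ttd": 5},
    {"nationality": "PT", "admission_average": 14.0, "ttd": 6},
]

complete = drop_incomplete(students)
uninformative = constant_features(complete)
print(len(complete), "complete records; constant features:", uninformative)
```

In this toy case the only non-Portuguese student is also the only one with a missing value, so removing that record turns nationality into a constant that carries no information for a model.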

Some algorithms can handle nominal variables directly, as is the case of DTs. However, the scikit-learn (Buitinck et al., 2013; Pedregosa et al., 2011) algorithmic implementation adopted in this study only accepts numerical features. For this reason, all nominal variables had to be one-hot encoded. Some variables had a large cardinality, such as the programme name (53 different values). This means that the transformed dataset used to train the models was sparse. Not only does this increase the computational time of the training phase, but it may also reduce the performance of the models.
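The sparsity introduced by one-hot encoding a high-cardinality variable can be quantified with a short sketch. The programme labels are synthetic placeholders, and the cardinality of 53 is taken from the text; everything else is an illustrative assumption.

```python
from sklearn.preprocessing import OneHotEncoder

# Synthetic programme labels with 53 distinct values, as for the programme name.
programmes = [[f"programme_{i % 53}"] for i in range(200)]

# By default the encoder returns a SciPy sparse matrix.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(programmes)

# Each row has exactly one non-zero entry, so density is 1 / n_categories.
density = encoded.nnz / (encoded.shape[0] * encoded.shape[1])
print(encoded.shape, f"density = {density:.4f}")
```

With a single nominal variable of 53 levels, fewer than 2% of the encoded entries are non-zero, which illustrates why the transformed dataset is described as sparse.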

In addition, some variables usually explored in studies related to dropping out and student academic performance were either missing or unknown in this particular case study. This was the case of parents’ education and place of residence, which, according to Aparicio-Chueca et al. (2019) and other researchers, play an essential role in predictive models.

Finally, the present study was limited in geographical scope, as it only considered student transfers within the University of Porto. Although the use of unseen data and the cross-validation approach used to test the performance of the models guaranteed that the models were able to generalise in the present context, we cannot assume that the models have transferability. Indeed, in order to assess the ability of the models to generalise to other contexts, we would need to validate them on data from other contexts.

Conclusions and future work

In this study, the TTD of higher education students with complex trajectories, i.e. including a programme transfer, was characterised and predicted using machine learning models. For this purpose, we used a dataset composed of students from the University of Porto whose first enrolment occurred between 2005 and 2015 and who were tracked until 2020. The dataset included 2047 students with complex trajectories, i.e. those who changed programme during the analysis period, for whom it was possible to characterise the TTD.

Our analysis demonstrated that the TTD of students with complex trajectories is statistically different from that of students without a complex trajectory. While students with complex trajectories graduate faster at the bachelor’s degree level, the opposite is true at the integrated master’s degree level. This reinforces the need to provide decision-makers, namely programme directors, with a tool that allows them to anticipate how long a student who has just enrolled in a programme will take to complete it after a transfer.

Four machine learning algorithms were used to predict the TTD of students with complex trajectories. The results revealed that predictive modelling is effective in the academic domain, particularly in predicting TTD in complex scenarios, and that decision-makers can use such models to plan institutional actions and optimise their limited resource allocation. By accurately predicting when students are likely to complete their degree programmes, institutions can take proactive measures to enhance students’ academic experience and improve overall educational outcomes. Once students at risk of taking longer to complete their studies have been identified, advisers can use the model’s insights to provide personalised advice and support. This could include creating tailored academic plans, suggesting appropriate courses, or referring students to support services. Institutions can also provide additional tutoring and mentoring for students who are likely to take longer to complete their studies.

The RF model had the best performance out of the four models, while the DT model performed the worst. Yet, all four models performed better regarding the goodness-of-fit metrics than the two benchmark models. While the RF model showed a better prediction capacity than the other models, it is, like neural networks, an opaque or black-box model. These models are difficult to understand because the predictions are based on a decision process that is not understandable by humans. Thus, we conducted a feature importance determination using permutation on the model to help identify the variables that affect the model’s performance the most. We concluded that the most relevant factors to predict the TTD of students with complex trajectories were the median TTD of the final programme, the number of years since the first enrolment, and the admission regime of the final programme.

Although the DT model performed the worst, it is an explainable model where it is possible to interpret the decision process. The DT model showed in the first branches the same variables highlighted in the variables’ importance analysis: the median TTD, the final programme admission regime, and the number of years since the first enrolment. Thus, it is possible to state that these are the most relevant factors for predicting TTDs.

The choice between model transparency and predictive accuracy is a critical consideration when developing machine learning models such as RFs and DTs. These two aspects often represent a trade-off, and the decision can have a significant impact on a model’s acceptance and usefulness in real-world scenarios. If a model’s predictions are to be used in contexts where explicability is crucial (e.g. healthcare, finance, or law), an interpretable model may be preferable, even if it sacrifices some prediction accuracy. Conversely, if prediction accuracy is paramount, models such as RFs may be the better choice. The decision should be based on the specific requirements and constraints of the scenario in which the model will be used.

From an educational point of view, we believe that the proposed model is relevant, as it helps to predict the TTD of transfer students and can open possibilities such as more timely interventions from decision-makers that may lead to TTD reductions and improvements in the quality of students’ academic experiences.

Regarding future work, the models could be further improved by enriching the dataset with more data. In particular, some observations were not included in the dataset due to missing values in some features. This primarily affected variables related to the period preceding the programme change, like the admission average or the application preference for the previous programme. This exclusion of observations affected other features, like nationality, which initially included several countries and, in the final dataset, resulted in a cohort of students of Portuguese nationality. This may limit the potential of the predictive models proposed.

We also believe that it would be relevant for future work to develop a qualitative study, targeting transfer students, to assess what motivates the delays in their trajectory after changing programme. Identifying these factors may lead to the development of richer models that accommodate other relevant predictive variables, in addition to those covered in this study.

Availability of data and materials

The data that supports the findings of this study is available from the University of Porto’s Rectory but restrictions apply to the availability of this data, which was used under license for the current study, and so is not publicly available. The data can however be made available by the authors upon reasonable request and with the permission of the University of Porto’s Rectory.

Abbreviations

ECTS: European Credit Transfer and Accumulation System
DT: Decision tree
RF: Random forest
MAE: Mean absolute error
MAPE: Mean absolute percentage error
MLP: Multilayer perceptron
RMSE: Root mean square error
SVM: Support vector machine
TTD: Time to degree

Acknowledgements

Not applicable.

Funding

This work was funded by the European Union through the ERASMUS+ project with reference 2020-1-ES01-KA203-082842 and co-supported through strategic funding from FCT UIDB/04423/2020 and UIDP/04423/2020.

Author information

Contributions

JPP contributed to the pre-processing of the data and the training and optimisation of the machine learning models, produced the graphics, and analysed the results of the model. VM contributed to the state-of-the-art, introduction, and methodology and analysed the students’ complex trajectories. AS contributed to the conceptualisation of the study, state-of-the-art, introduction, methodology and final review of the article.

Corresponding author

Correspondence to Vera Lucia Miguéis.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Decision tree plot (Fig. 11).

Fig. 11 Decision tree model

Appendix B

See Table 4

Table 4 Features used in the study

Appendix C

See Table 5

Table 5 Number of transfers and top-3 programmes of destination

Appendix D

SVM and MLP regression plots (Figs. 12, 13).

Fig. 12 Support vector machine regression predictions

Fig. 13 Multilayer perceptron regression predictions

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Pêgo, J.P., Miguéis, V.L. & Soeiro, A. Students’ complex trajectories: exploring degree change and time to degree. Int J Educ Technol High Educ 21, 8 (2024). https://doi.org/10.1186/s41239-024-00438-5

  • DOI: https://doi.org/10.1186/s41239-024-00438-5

Keywords