The role and measurement of self-regulated learning
Self-regulation is an overarching construct that captures how students direct and monitor their own learning processes and progress (Pintrich and De Groot, 1990; Pintrich, Smith, Garcia, and McKeachie, 1993). Specifically, SRL is defined as a process where students actively set goals and make plans for their learning, monitor their learning process, and adjust their study plans (Pintrich, 2004). Students with high self-regulatory skills can appropriately apply effective learning strategies to increase effectiveness based on their personal needs and the characteristics of the tasks and the environment (Pintrich, 2004). Due to the flexibility of the course schedule and the limited social interaction, online courses require students to take more responsibility to regulate their own learning. In contrast, in face-to-face classrooms, instructors and peers can monitor and guide student behavior.
Multiple approaches, such as self-report questionnaires, observation, and think-aloud protocols, have been used to measure SRL, with self-report questionnaires being the most widely used (Schellings and Van Hout-Wolters, 2011; Winne, 2010). The Motivated Strategies for Learning Questionnaire (MSLQ) developed by Pintrich et al. (1993) is the most commonly adopted instrument for measuring SRL in both face-to-face and online courses (Broadbent and Poon, 2015; Duncan and McKeachie, 2005). MSLQ captures three sets of SRL skills: (1) the use of cognitive strategies, (2) the use of metacognitive strategies, and (3) the management of personal and environmental academic resources including time management, choice and control of the study environment, effort regulation, and help-seeking (Pintrich and De Groot, 1990; Pintrich et al. 1993). MSLQ instructs students either to predict how likely or how often they will perform certain SRL behaviors in the future or to recall how often they performed them in the past. For instance, before a course starts, student time management skills would be measured by several Likert scale statements capturing the extent to which students predict that they can make good use of their study time, spend enough time studying, keep up with the coursework, attend class regularly, and find time to review before an exam in the upcoming course. Extensive research has been conducted to explore the relationship between self-reported SRL skills and online performance. There is consistent evidence that student online performance is associated with self-reported SRL skills overall. The sub-skills of time management, effort regulation, and metacognition have also shown consistent relationships with performance in online classes, but findings are mixed regarding the relationships between other SRL sub-skills, such as the use of cognitive strategies, and performance (Broadbent and Poon, 2015).
While these findings provide suggestive evidence that SRL skills play an important role in the learning process, most previous studies have relied on student self-reported instruments to measure SRL skills and investigate the role of SRL in online learning. As we discuss in the “Using clickstream data to understand SRL” section, self-reported data may not be effective measures of SRL, as many individuals suffer from self-report bias and memories are often insufficient for students to accurately recall past behavior or predict future events. Therefore, more timely and objective measures are needed to capture student SRL skills more accurately.
In contrast to the consistent positive correlations between self-reported SRL skills and academic performance, there is less consistent evidence that SRL skills can be meaningfully altered to affect academic performance. Findings from previous interventions that have attempted to improve SRL skills, mainly concentrating on time management, have varied considerably. For instance, previous work that has attempted to support time management in online courses by providing more deadlines, by allowing students to set their own deadlines, or by suggesting that students schedule study time has found mixed effects of these interventions on student performance. Studies examining the effects of externally and self-imposed interim deadlines on course grades have found positive (e.g., Ariely and Wertenbroch, 2002), negative (e.g., Burger, Charness, and Lynham, 2011), and null effects (e.g., Levy and Ramim, 2013). Studies examining the effects of encouraging students to plan when they will do work have also found a mix of positive (Baker, Evans, Li, and Cung, 2019), negative (Baker, Evans, and Dee, 2016), and null (Sitzmann and Johnson, 2012) effects on course and assignment grades.
These varied findings underscore the importance of understanding whether SRL time management behaviors (e.g., procrastination, cramming, and time-on-task) are actually affected by these interventions and then whether an improvement in time management behaviors is effective at improving performance. Previous studies have taken on these questions by attempting to examine whether the underlying mechanisms are affected by various time management interventions. However, these studies have used crude measures, such as self-reported time spent per week (Häfner, Oberst, and Stock, 2014), days between completing assignments (Sitzmann and Johnson, 2012), numbers of web-page visits (Bannert, Sonnenberg, Mengelkamp, and Pieger, 2015), self-reports of time management behaviors (Azevedo and Cromley, 2004; van Eerde, 2003), or time of exam submission (Levy and Ramim, 2013). As discussed in the “Using clickstream data to understand SRL” section, nuanced analyses of rich clickstream data can provide more objective and detailed insights into how various interventions are, or are not, affecting student SRL behaviors and can thus allow for better targeted and more efficient interventions.
Clickstream data and its use in higher education research
In the practice and research of higher education, there is an emerging interest in the use of the timely and nuanced clickstream LMS data to better understand and support students’ learning. Clickstream data are contained in the detailed logs of time-stamped actions from individuals interacting with LMSs (e.g., Canvas and Blackboard). These actions typically consist of events that a user initiates, such as navigating between web pages, downloading a file, or clicking play on a video. While such data only provide a partial and noisy record of a student’s actions, they enable practitioners and researchers to collect information at scale about how students interact with online education resources and thus promise more objective and richer insight into the learning experience than many other methods. In this section, we explain the format of typical clickstream data, introduce major approaches that have been used by researchers in analyzing clickstream data, and provide a brief overview of the current uses of clickstream data in higher education.
Figure 1 shows an example of the type of data that the LMS Canvas provides, based on students accessing a website associated with a course offering at the University of California, Irvine in 2016. Each row in Fig. 1 corresponds to an event generated by a particular student, identified via his or her (anonymized) Student ID. The URL is the web address of the resource being requested by the student, such as a request to navigate to a particular web page on the site or a request to download a file. One challenge in analyzing this type of data is that the URLs are not semantically meaningful by themselves, although the string names corresponding to the directory paths (e.g., “https://canvas.eee.uci.edu/courses/course_id/grades”) often provide useful clues about the content that the student is requesting (in this case, grade information). In practice, most URLs can be readily assigned to categories such as “grades,” “file downloads,” “assignments,” or “quizzes.” This type of clickstream data can also be combined with LMS-provided information about additional student activities, such as the text content of search queries, text content in forum discussions, or interactions between students.
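As a rough illustration of how URLs might be assigned to categories, the following Python sketch maps the path segment that follows the course identifier to a coarse label. The mapping table, the function name, and the assumption that every path has the form /courses/<course_id>/<resource> are ours and purely illustrative; real LMS logs would likely need a longer and more carefully validated rule set.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a path segment to a category label.
# Paths are assumed to look like /courses/<course_id>/<resource>/...
SEGMENT_TO_CATEGORY = {
    "grades": "grades",
    "files": "file downloads",
    "assignments": "assignments",
    "quizzes": "quizzes",
}


def categorize_url(url: str) -> str:
    """Return a coarse category for a clickstream URL, or "other"."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # The resource type is taken to be the segment after "courses/<course_id>".
    if len(segments) >= 3 and segments[0] == "courses":
        return SEGMENT_TO_CATEGORY.get(segments[2], "other")
    return "other"


example = "https://canvas.eee.uci.edu/courses/course_id/grades"
print(categorize_url(example))
```

URLs that do not match any rule fall into an "other" category rather than being dropped, so that the share of uncategorized events can be inspected before analysis.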
There are two somewhat different data analysis strategies that can be used to analyze clickstream data, each with its strengths and weaknesses. The first approach is based on aggregate non-temporal representations of the clickstream information per student, in which information is combined over time. An example would be to generate one histogram per student of the counts of actions of different types of activities over the duration of a course (e.g., number of clicks on lecture videos, number of clicks on the gradebook page). This allows for a flattened multivariate representation, with each student represented as a multidimensional vector. The advantage of this representation is that it is amenable to a multitude of statistical analyses, such as multivariate regression for predicting outcomes or clustering of students into groups. The disadvantage, however, is that this type of static aggregate representation does not retain any information about the sequential or temporal aspects of a student’s behavior over the duration of a class (Mobasher, 2007; Spiliopoulou, 2000). Time-dependent or sequence-dependent representations, on the other hand, can retain more detailed information about a student’s behavior over time. A simple example of a time-dependent representation is to count the number of total click events per student recorded per day over the duration of the class, resulting in a count-valued time-series per student of the number of events per day. Figure 2 presents examples of such representations. A time-dependent representation can reveal more subtle sequential patterns in student behavior than static multivariate representations, such as a change in student activity levels midway through a course (Mobasher, 2007; Spiliopoulou, 2000). But working with time-dependent data is more complicated than working with multivariate representations, and there are typically fewer data analysis tools available for working with such data, particularly with the type of event data that underlies clickstreams.
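The contrast between the two representations can be made concrete with a small Python sketch. The event log, the student identifiers, and the function names below are invented for illustration and are not drawn from the data described in this paper; the sketch assumes each event has already been reduced to a student ID, a timestamp, and an activity category.

```python
from collections import Counter, defaultdict
from datetime import date, datetime

# Toy event log: (student_id, ISO timestamp, activity category).
# All values are invented for illustration.
EVENTS = [
    ("s1", "2016-01-04T09:15:00", "video"),
    ("s1", "2016-01-04T09:40:00", "grades"),
    ("s1", "2016-01-06T20:05:00", "video"),
    ("s2", "2016-01-05T13:00:00", "quizzes"),
    ("s2", "2016-01-06T13:30:00", "video"),
]


def aggregate_counts(events):
    """Non-temporal view: one Counter of activity types per student."""
    counts = defaultdict(Counter)
    for student, _timestamp, category in events:
        counts[student][category] += 1
    return dict(counts)


def daily_counts(events, start: date, n_days: int):
    """Time-dependent view: a list of n_days event counts per student."""
    series = defaultdict(lambda: [0] * n_days)
    for student, timestamp, _category in events:
        offset = (datetime.fromisoformat(timestamp).date() - start).days
        if 0 <= offset < n_days:  # ignore events outside the course window
            series[student][offset] += 1
    return dict(series)


print(aggregate_counts(EVENTS))
print(daily_counts(EVENTS, start=date(2016, 1, 4), n_days=3))
```

The first function discards all ordering information and yields a fixed-length vector per student once the categories are fixed, whereas the second keeps one count per day and therefore preserves when activity occurred at the cost of a longer, sparser representation.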
At the most basic level, using clickstream data in educational contexts allows us to analyze mechanical aspects of student behavior, such as the overall level and frequency of activity on a course website, the temporal patterns of students’ online activity (both individually and relative to other students), and choices of which online resources students access. Such descriptions of student behavior, using various visualization and exploratory data mining techniques, were the focus of the earliest research in educational data mining (e.g., Baker and Yacef, 2009; Romero and Ventura, 2007). In recent years, the uses of clickstream data in educational research have expanded far beyond simple descriptions and have introduced both the possibility of empirical examination of educational theories using fine-grained process data and a new wave of data-driven pedagogical interventions (Fischer et al. 2020). The direction of these advances can be categorized into three main groups.
First, clickstream data can help instructors and practitioners understand how students are using the available resources in an effort to improve instructional designs. For instance, instructors can monitor which resources students use most and test different designs that might allow them to better calibrate the course, either to emphasize important resources that are valuable but under-utilized by students or to provide more resources that students favor, affording more targeted guidance and feedback (Bodily and Verbert, 2017; Diana et al. 2017; Shi, Fu, Chen, and Qu, 2015). Second, the real-time accessibility of behavioral clickstream data can be used to develop automatic feedback and intervention modules within the LMS. For example, researchers have built early detection systems for dropout or poor course performance, which can help instructors allocate their attention to the most at-risk students (Baker, Lindrum, Lindrum, and Perkowski, 2015; Bosch et al. 2018; Lykourentzou, Giannoukos, Nikolopoulos, Mpardis, and Loumos, 2009; Whitehill, Williams, Lopez, Coleman, and Reich, 2015). Students can also be provided with adaptive guidance in real-time by, for instance, suggesting collaboration partners (Brusilovsky, 2003; Caprotti, 2017). Third, clickstream data allow for novel analyses that aim to advance understanding of how to identify and cluster student subgroups, as well as to personalize interventions to support learning processes. This includes the identification of student subpopulations with respect to their use of online resources (Gasevic, Jovanovic, Pardo, and Dawson, 2017) or students’ engagement patterns in MOOC environments (Guo and Reinecke, 2014; Kizilcec, Piech, and Schneider, 2013). These student clusterings may be used in sequential modeling techniques such as recurrent neural network methods that populate a recommendation system of optimal course progression for different types of learners (e.g., Pardos, Tang, Davis, and Le, 2017).
Using clickstream data to understand SRL
One major line of research on using clickstream data is to measure student SRL behaviors with the goals of better understanding and supporting SRL (Roll and Winne, 2015). Previous studies have explored the use of clickstream data to measure SRL primarily in two types of technology-enhanced learning environments: interactive learning environments and LMSs. The first group of studies has focused on interactive learning environments in which students are offered various tools that are designed to support SRL, including cognitive tools for information processing (e.g., note-taking window), goal-setting tools, reflection tools, and help-seeking tools (Nussbaumer, Steiner, and Albert, 2008; Perry and Winne, 2006; Winne and Jamieson-Noel, 2002). The second large group of studies has focused on student SRL behaviors using clickstream data from LMSs (e.g., Blackboard and Canvas), which are usually used to deliver learning materials (e.g., text, video, and audio), conduct learning activities (e.g., assignments and discussion), and support different forms of evaluation (e.g., exams and grade book systems; Lewis et al. 2005). The aspects of SRL behaviors that can be inferred using clickstream data collected from the two types of learning environments differ and are largely dependent on the types of interactions students can have within each learning environment.
The interactive learning environments embedded with SRL tools allow students to use one or more SRL tools to explicitly set goals for their learning tasks, monitor their learning process, use different cognition tools to process the information, and reflect and adjust their learning. SRL behaviors, such as cognitive strategy use, planning, and help-seeking, are measured with data on the frequency of, timing of, characteristic conditions of, and behavioral reactions to the use of these SRL tools (e.g., Nussbaumer et al. 2008; Winne and Jamieson-Noel, 2002). While detailed and diverse SRL behaviors can be inferred from data collected from these interactive learning environments, most of these learning environments are used in laboratory studies (e.g., Perry and Winne, 2006; Winne and Jamieson-Noel, 2002) or for specific domains or topics (e.g., learning the human life cycle; Perry and Winne, 2006) and thus have not been commonly adopted in higher education.
Unlike these interactive learning environments designed for specific domains or topics, LMSs are widely adopted in higher education contexts to support the basic processes that are necessary for learning any subject online (Lewis et al. 2005). Specifically, students usually interact with LMSs by downloading course materials, watching video lectures online, submitting assignments, completing quizzes, posting on the discussion forums, and so on (Lewis et al. 2005). Largely due to the fact that the features of learning management platforms are not set up to explicitly encourage and measure SRL, only a few studies have examined how to use clickstream data from LMSs to measure SRL, and these studies have mainly focused on the sub-concept of time management skill because it is most amenable to measurement (e.g., Baker et al. 2019; Cicchinelli et al. 2018; Crossley, Paquette, Dascalu, McNamara, and Baker, 2016; Lim, 2016; Park et al. 2018; You, 2016). Researchers have used measures such as the frequency with which students view resources pertaining to course dates and deadlines (Cicchinelli et al. 2018; Park et al. 2018), how far in advance students start work on/turn in various assignments (Crossley et al. 2016; Kazerouni, Edwards, Hall, and Shaffer, 2017; Levy and Ramim, 2013), and how close together work sessions are (e.g., Baker et al. 2019; Park et al. 2018) to examine students’ time management skills.
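Two of the time management measures mentioned above, how far in advance of a deadline a student starts work and how closely spaced work sessions are, can be sketched in Python as follows. This is a minimal sketch under our own assumptions and not the operationalization used in the cited studies: the function names are hypothetical, and the 30-minute inactivity threshold used to separate sessions is an arbitrary illustrative choice.

```python
from datetime import datetime
from statistics import mean


def hours_before_deadline(first_access: datetime, deadline: datetime) -> float:
    """Lead time in hours between a student's first click on an assignment and its deadline."""
    return (deadline - first_access).total_seconds() / 3600


def mean_gap_between_sessions(timestamps, session_timeout_minutes: float = 30):
    """Average gap in hours between consecutive work sessions.

    A new session is assumed to start whenever the time since the previous
    click exceeds session_timeout_minutes (an illustrative threshold).
    Returns None when fewer than two sessions are found.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        gap_minutes = (curr - prev).total_seconds() / 60
        if gap_minutes > session_timeout_minutes:
            gaps.append(gap_minutes / 60)  # store the between-session gap in hours
    return mean(gaps) if gaps else None


# Invented click times for one student.
clicks = [
    datetime(2016, 1, 4, 9, 0),
    datetime(2016, 1, 4, 9, 10),   # same session as the previous click
    datetime(2016, 1, 4, 15, 10),  # new session, 6 hours later
    datetime(2016, 1, 5, 15, 10),  # new session, 24 hours later
]
print(hours_before_deadline(clicks[0], datetime(2016, 1, 6, 9, 0)))
print(mean_gap_between_sessions(clicks))
```

Both measures are sensitive to choices such as the session threshold and which click counts as starting work, so any such definition would need to be checked against the course design before being interpreted as time management.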
In addition, recent work has interrogated the extent to which clickstream measures provide valid inference about various SRL constructs in two ways: (1) by examining whether students’ perceptions about their self-regulated learning correspond to their click patterns, and (2) by examining the extent to which clickstream measures complement self-reported measures in predicting student course performance. One recent study found that clickstream data are helpful measures of true time management skills (Li et al. 2020). First, the clickstream measures were strongly correlated with students’ self-reported time management skills from a post-course survey (and somewhat correlated with measures from a pre-course survey). Second, the clickstream measures of time management were better predictors of students’ performance in the class than were the self-reported measures (Li et al. 2020). These results suggest that clickstream measures of SRL offer insightful and valid information about students’ actual learning processes.
The studies above suggest clickstream data can provide objective and timely measures of SRL, such as time management skills, that can be easily scaled up for large student populations. These clickstream measures can be used to examine the relationships between SRL behaviors and performance, which may provide additional and novel information on the role of SRL in online learning beyond existing findings based on self-reported data. Moreover, unlike self-reported measures that are usually collected at only one or limited time points, these measures can be used to investigate how student SRL behaviors unfold over time and to explore how personal and environmental factors influence SRL behaviors. Finally, the nuanced information on individual learning processes that clickstream data can uncover is useful in understanding how and why an SRL intervention influences student learning outcomes. Indeed, scholars (e.g., Damgaard and Nielsen, 2018) have recently argued that examining the mechanisms that various behavioral interventions affect is “crucial,” as interventions can have unintended negative consequences if the likely affected behavioral pathways are not well understood (Damgaard and Nielsen, 2018, p. 313).
In the following sections of this paper, we provide researchers, instructors, and administrators with examples of these promising avenues of research—defining and identifying behavioral patterns that are related to student learning outcomes, suggesting behavioral changes to students for greater success, and providing insights regarding the mechanisms by which education interventions affect student outcomes. In discussing this growing field of research, we specifically highlight the ways in which decontextualized, noisy, and sparse clickstream data can provide only partial answers to many questions by focusing on the specific strategies and cautions necessary for working with these data.