An integrated approach for knowledge extraction and analysis in collaborative knowledge construction

Abstract

Collaborative knowledge construction (CKC) involves students’ sharing of information, improvement of ideas, and construction of collective knowledge. In this process, knowledge extraction and analysis can provide valuable insights into students’ knowledge capacities, depths, and levels, which can in turn be used to improve CKC quality. However, existing studies have tended to extract and analyze knowledge from a single perspective (e.g., the number of certain knowledge types or knowledge structures), which fails to capture the complexity and dynamics of knowledge construction and advancement. To fill this gap, this research designed a series of computer-supported collaborative concept mapping (CSCCM) activities to facilitate students’ CKC process and then used an integrated approach (i.e., semantic knowledge analysis combined with learning analytics) to extract, analyze, and understand students’ knowledge characteristics and evolutionary trends. Results demonstrated that, compared to the low-performing pairs, the high-performing pairs mainly discussed knowledge related to the course content, and their knowledge evolution trend was relatively stable. Based on the results, this research provided analytical implications for extracting, analyzing, and understanding students’ knowledge and pedagogical implications for promoting students’ knowledge construction and advancement.

Introduction

The significance of knowledge is emphasized in the current information society (Anderson, 2008; Välimaa & Hoffman, 2008), and relevant practices, e.g., knowledge discovery (Pazzani, 2000), knowledge management (Durst & Zieba, 2018), and knowledge construction (Charlton & Avramides, 2016), have long been studied. Knowledge externalization, as a critical component of these practices, is regarded as a learner’s conscious process of presenting inner knowledge to the public through varied media, e.g., audio, text, images, and concept maps (Ifenthaler et al., 2011; Lehmann et al., 2014). Particularly, in collaborative knowledge construction (CKC), students must externalize their knowledge through sharing information and resources, comparing and negotiating disagreements, and synthesizing and co-constructing knowledge (Fischer et al., 2002; Mayordomo & Onrubia, 2015; Zabolotna et al., 2023). Although studies have emphasized the value of understanding students’ knowledge (Ashwin, 2014; Felder & Brent, 2005; Simonsmeier et al., 2022), few studies have addressed the extraction and analysis of student knowledge in the CKC context. Currently, three methods for knowledge extraction are used, namely manual coding, semi-automatic analysis, and automatic analysis. First, existing research has extensively employed manual coding based on established knowledge classification frameworks to extract knowledge (Liu et al., 2021; Phillips et al., 2019). However, the traditional manual coding approach often involves labor-intensive and time-consuming work that may produce subjective results. Second, semi-automatic analysis uses artificial intelligence (AI) algorithms to train knowledge classification models on a training corpus, which relies heavily on human labelling or intervention.
A major disadvantage of the semi-automatic analysis approach is that the trained AI models may not be suitable for application in other research or educational contexts, which reduces the generalizability and accuracy of knowledge classification models (Patikorn et al., 2019). Automatic analysis uses advanced AI technologies to automatically learn and infer knowledge types from unlabeled data without human coding and labelling. One of the automatic analysis methods is semantic knowledge analysis, which uses approaches such as semantic network analysis and topic modeling to extract concepts, ideas, or knowledge based on linguistic units (e.g., words, sentences, chapters) (Liu & Chen, 2023; Pfiffner, 2021; Wu et al., 2021). Nevertheless, existing analyses tend to extract relevant keywords, define knowledge types in terms of these keywords, and subjectively judge student knowledge capabilities, which demands considerable responsibility, skill, and expertise from the researchers or coders (Jelodar et al., 2019). It is necessary to automatically determine the type of knowledge associated with keywords without manual intervention in order to improve the efficiency and accuracy of knowledge extraction and analysis. To fill this gap, this research designed a series of computer-supported collaborative concept mapping (CSCCM) activities to assist students’ CKC, and then integrated learning analytics methods with semantic knowledge analysis based on a knowledge base to extract, analyze, and understand students’ knowledge construction process. Specifically, this research compared the characteristics and evolutions of semantic knowledge between high-performing pairs and low-performing pairs. Based on the results, this research provided pedagogical implications to promote future instructional practices and analytical implications for the CKC process.

Literature review

Computer-supported collaborative concept mapping as a knowledge externalization means

Grounded in the socio-cultural perspective of learning (Vygotsky, 1978), collaborative knowledge construction (CKC) emphasizes students’ sharing and organization of information, construction and advancement of knowledge, and establishment of consensus and reflection through peer interactions in groups (Fischer et al., 2002; Mayordomo & Onrubia, 2015; Zabolotna et al., 2023). As a means of CKC, computer-supported collaborative concept mapping (CSCCM) provides students with opportunities to organize and externalize their knowledge, clarify and distinguish concepts, and integrate new knowledge into their prior knowledge (Chiou, 2008; Greene & Azeved, 2010). CSCCM is an effective strategy for externalizing and representing knowledge in graphical form; a concept map consists of nodes denoting knowledge concepts and labeled lines representing the relationships between concepts (Novak et al., 1983). Studies have shown that CSCCM has the potential to enhance students’ cognitive abilities, develop their higher-order thinking skills, and foster their deep learning process (Chang et al., 2017; Chu et al., 2019; Sundararajan et al., 2018). The analysis of the knowledge reflected in the concept map enables instructors to gain insight into students’ knowledge capabilities and provides guidance for instructional interventions. By analyzing knowledge across students with different performances, instructors can identify the knowledge deficiencies of low-performing students and provide targeted strategies to help these students succeed academically. However, extracting knowledge from concept maps is a challenging and tedious task. There is a lack of fixed standards for the analysis of knowledge from concept maps due to the ill-structured characteristics of CSCCM (Jonassen, 1997).
As a supplementary component of the CSCCM, discussions provide students with a direct means of sharing, negotiating, and integrating their ideas to externalize their internal knowledge through textual or oral communications (Ifenthaler et al., 2011). The data generated from students’ textual or oral communications provide analytical possibilities for understanding and analysis of students’ knowledge. In summary, extracting knowledge from student discussions can be considered as a crucial means to gain insights into the students’ knowledge construction process.

Existing knowledge extraction, classification, and analysis methods

Existing research has utilized manual coding and semi-automatic analysis to extract and analyze knowledge. On the one hand, knowledge classification frameworks were used to manually identify students’ domain-specific knowledge types and knowledge depths, such as Structure of Observed Learning Outcome (Liu et al., 2021), Revised Bloom’s Taxonomy (Blooma et al., 2013), and Technological Pedagogical Content Knowledge (Phillips et al., 2019). On the other hand, semi-automatic analysis relies on AI algorithms to train knowledge classification models based on a labeled training corpus that contains different types of knowledge. Typical AI algorithms include Support Vector Machine (SVM) (Karlovčec et al., 2012), Artificial Neural Network (ANN) (Patikorn et al., 2019), and Bidirectional Encoder Representations from Transformers (BERT) (Shen et al., 2021). Of these two methods, manually coding students’ knowledge is labor-intensive, time-consuming, and error-prone work (Han et al., 2021), while semi-automatic analysis approaches can increase the efficiency of the work, reduce the potential for bias or errors, and improve the reproducibility of data analysis. However, validation of existing models has revealed overfitting in semi-automatic classification, resulting in low accuracy and generalizability when the models are applied to new datasets (Patikorn et al., 2019; Shen et al., 2021). To address these challenges, recent studies have explored automatic analysis methods that can overcome these limitations and provide more accurate and reliable results. Compared to manual coding and semi-automatic analysis, automatic analysis does not rely on human intervention. Instead, it utilizes advanced technologies to automatically learn and infer knowledge types from unlabeled data.
One promising approach to automatic analysis is the use of semantic knowledge analysis, which involves the application of natural language processing and machine learning techniques to extract meanings and relationships from textual data (Liu & Chen, 2023; Pfiffner, 2021; Yeari & van den Broek, 2016).

Semantic knowledge analysis

Semantic knowledge is a set of concepts extracted from linguistic units (i.e., words, sentences, chapters) generated from natural languages or texts (Lupyan et al., 2019). A major approach to semantic knowledge analysis groups or clusters keywords with methods such as semantic network analysis and topic modeling, and then defines knowledge types based on the meaning of the keywords (Drieger, 2013; Gurcan & Cagiltay, 2019; Peng & Xu, 2020). For instance, Drieger (2013) measured node-based clustering coefficients of semantic networks and obtained various local clusters that encoded different semantic knowledge. Gurcan and Cagiltay (2019) used Latent Dirichlet Allocation (LDA) to discover the knowledge domains and skill sets from a textual corpus related to the big data software engineering discipline, and extracted ten core competency areas from 48 trending knowledge items. However, these methods involve the researchers’ manual definition of the knowledge types according to the meaning of the keywords, which may increase workload, lessen efficiency, and reduce the interpretability of results (Jelodar et al., 2019). To solve this issue, semantic dictionaries or knowledge bases have been designed, which apply natural language processing to classify knowledge based on keywords. A sememe knowledge base is a type of semantic knowledge base that utilizes sememes for describing and organizing the meaning of words and phrases (Zhao et al., 2022). Sememes are defined as the minimum semantic units of human languages in linguistics (Bloomfield, 1926), and a limited set of sememes composes the meanings of all words. For example, the sememes of “apple” include Computer and Fruit, which means the word “apple” has two main meanings: one is a famous computer brand (Apple brand) and the other is a sort of juicy fruit (apple).
Most research has designed and applied sememe knowledge bases in the natural language processing, information retrieval, and machine translation fields in order to enhance the computer’s ability to understand human language (Niu et al., 2017; Wen et al., 2022; Ye et al., 2022). Recently, some studies have applied sememes to support instruction and learning in the education field. For instance, Liu et al. (2018) developed a mixed similarity strategy to integrate sememe knowledge, orthographic, and phonological features for the automated generation of questions, thereby helping instructors save time in constructing examination papers. Chen and Dong (2022) designed an automatic grading system with sememe-based text similarity to score subjective items in examinations. Given the promise of applying sememes in the education field, it is necessary to investigate how sememes can be used to address educational challenges, such as understanding students’ knowledge construction process.

In addition, recent studies have started to further analyze knowledge after extracting and classifying it in order to obtain a comprehensive understanding of students’ knowledge construction and advancement (Blooma et al., 2013; Lin et al., 2013; Zhang et al., 2019). These studies focused on a single analytical perspective to analyze knowledge features, including the frequency of certain knowledge types and knowledge structures. For example, Blooma et al. (2013) counted the types of knowledge to capture knowledge characteristics in the CKC process. Results found that “procedural knowledge” was the most prominent knowledge while “meta-cognitive knowledge” was lacking. Zhang et al. (2019) used epistemic network analysis, a learning analytics method, to compare the epistemic network characteristics of teachers’ knowledge in different groups. Results found that the teachers with higher scores had a richer, more organized, and more flexible knowledge structure than teachers with lower scores. Although these studies made valuable attempts to analyze students’ knowledge, they often fell short of providing deep insight into the complexity and dynamics of knowledge construction and development. On the one hand, knowledge has a hierarchical structure and organizational form, which requires a systematic method for classification, integration, and analysis (Daft & Lewin, 1993). On the other hand, knowledge is a dynamic concept that constantly develops and evolves over time during students’ learning process (Nonaka et al., 2000). Due to the complexity and dynamics of knowledge, merely focusing on one analytical perspective may lead to inconclusive and incomprehensible results. An integrated approach enables researchers to gain a comprehensive understanding of complex phenomena, avoid inconclusive or incomplete results, and leverage multiple perspectives to develop effective solutions (Kelley & Knowles, 2016; Sun et al., 2021).
Considering the complex and dynamic characteristics of collaborative knowledge construction, it is necessary to use an integrated approach to analyze and understand students’ knowledge characteristics and evolutions.

Methodology

Research purposes and questions

The purpose of this research was to gain a deep understanding of students’ knowledge characteristics and evolutions during the CKC process by using automatic knowledge analysis methods. This research conducted a series of CSCCM activities supported with online discussions on an online collaborative concept mapping platform designed by the research team to improve the quality of higher education students’ CKC. Then, this research aimed to extract students’ knowledge generated in online discussions and compare semantic knowledge characteristics and evolutionary trends between pairs with high and low performances. There were two research questions:

RQ 1

What were the differences in semantic knowledge characteristics between pairs with high and low performances during the CSCCM process?

RQ 2

What were the differences in evolutionary trends of semantic knowledge between pairs with high and low performances during the CSCCM process?

Research context and participants

The research context was a four-day graduate-level online course titled “Educational Technology Development and Application”, offered during summer 2022 at a top research-intensive university in China. This course focused on learning theories, instructional practices, and technology applications, as well as development trends in educational technology.

Participants were 16 part-time Master of Education students (15 females and 1 male; aged between 24 and 32) from the College of Education at the university. They came from the majors of educational management (10 students), subject education (5 students), and educational technology (1 student). Participants were randomly divided into 8 pairs. The reasons for choosing participants in this course were twofold: first, the course was designed around collaborative knowledge construction pedagogy, which suited the research purposes; second, concept mapping as an instruction and learning strategy can be easily learned by the participants, as it has been successfully implemented within higher education. All participants signed informed consent forms before the course and agreed to the data collection of this research.

The instructional process

The online course lasted four days and followed the instructional process shown in Fig. 1. The first day was designed as an initiation and warm-up phase for students to get familiar with the experiment, concept mapping, and the platform environment. The experiments were held from Day 2 to Day 4. Each day consisted of three sessions: the online lecture, the individual-level concept mapping activity, and the CSCCM activity. The online lectures included four themes, namely the development of educational technology, online and blended learning, learning analytics and educational data mining, and artificial intelligence in education. The individual-level concept mapping activity was designed as a preparation for CSCCM (de Weerd et al., 2017). Each student was required to search for learning materials and resources and build an individual-level concept map independently. Then, in the CSCCM activity, student pairs collaboratively completed a concept map at the pair level. For example, one CSCCM activity asked pairs to record the basic concepts, definitions, theories, instructional processes, and technical support related to blended learning.

Fig. 1 The instructional procedure

The platforms used in this research included DingTalk (see Fig. 2a) and an online collaborative concept mapping platform (see Fig. 2b). DingTalk was used for lecturing and for student pairs’ communication during CSCCM activities. The online collaborative concept mapping platform was designed by our research team to support individual-level and pair-level concept mapping. The administrator created an individual space for each student and a pair space for each pair in advance, so that students could complete their individual-level and pair-level concept maps. An online chat box was embedded to help students share and exchange their ideas and knowledge while constructing pair-level concept maps (see Fig. 2c). In addition, the platform supported a peer evaluation function, which enabled students to examine and evaluate peers’ concept maps (see Fig. 2d).

Fig. 2 Screenshots of (a) DingTalk, (b) the online collaborative concept mapping platform, (c) the online chatting function in the platform, and (d) the online commenting function in the platform

On Days 2–4, students were required to construct individual-level concept maps and then pair-level concept maps after the lecture. On Day 2, students were asked to engage in the CSCCM activity immediately after completing the individual-level concept maps. On Day 3, students were asked to view peers’ concept maps before constructing pair-level concept maps, as a form of cognitive group awareness support (Farrokhnia et al., 2019). Furthermore, on Day 4, students had to evaluate peers’ concept maps, respond to peers’ comments, and modify individual-level concept maps before constructing pair-level concept maps to improve the completeness of their concept mapping (Hwang & Chang, 2021). The CSCCM activities included dialogic prompts to foster knowledge inquiry and construction (e.g., “What do you think about this idea?”, “Do you agree with my ideas?”, “My opinion is …”, “I disagree with this idea because…”, “The idea about … is appropriate”, “A summary of our pair’s idea is…”).

Data collection and analysis process

This research collected data in two ways. First, discussion data from the eight pairs on the online collaborative concept mapping platform and DingTalk during the CSCCM activities were collected, mainly including discussion content, discussion participants, and discussion time. There was a total of 681 discussion records. Second, the final versions of the pair-level concept maps were collected; there was a total of 24 concept maps (8 pairs × 3 activities). An overall analytical framework was proposed, which used an integrated approach to analyze semantic knowledge characteristics and evolutionary trends between the pairs with high and low performances (see Fig. 3).

Fig. 3 The analytical framework

An assessment standard was adapted to evaluate the pair-level concept maps (see Table 1). The maps were evaluated in three dimensions: structure (distribution of nodes), idea (average depth of ideas), and connection (average depth of connections). The overall score was obtained by adding the scores for distribution of nodes (DIS), average depth of ideas (DoI), and average depth of connections (DoC). The performance of each pair was defined as the average score of the pair-level concept maps completed in the three CSCCM activities. Two raters with educational technology backgrounds calculated the three values for the pair-level concept maps independently, and they resolved scoring conflicts through discussion. According to the scoring results, the eight pairs were divided into high-performing pairs and low-performing pairs. The four high-performing pairs (i.e., pairs 2, 3, 4, and 7) achieved higher scores for their pair-level concept maps (M = 22.72, SD = 1.10), while the four low-performing pairs (i.e., pairs 1, 5, 6, and 8) achieved lower scores (M = 17.48, SD = 1.10).
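The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the pair names and scores are hypothetical, and the median split is an assumption, since the text does not state which rule was used to divide the pairs into two groups.

```python
from statistics import mean, median

def pair_performance(activity_scores):
    """Average the overall score (DIS + DoI + DoC) over a pair's CSCCM activities.

    `activity_scores` is a list of (DIS, DoI, DoC) tuples, one per activity.
    """
    return mean([dis + doi + doc for dis, doi, doc in activity_scores])

def split_by_median(performance):
    """Assumed grouping rule: pairs above the median are 'high', the rest 'low'."""
    cut = median(performance.values())
    high = sorted(p for p, v in performance.items() if v > cut)
    low = sorted(p for p, v in performance.items() if v <= cut)
    return high, low

# Hypothetical scores for four pairs across three activities.
scores = {
    "pair A": [(8, 7, 7), (8, 7, 7), (8, 7, 7)],
    "pair B": [(6, 6, 6), (6, 6, 6), (6, 6, 6)],
    "pair C": [(8, 8, 7), (8, 8, 7), (8, 8, 7)],
    "pair D": [(6, 6, 5), (6, 6, 5), (6, 6, 5)],
}
performance = {pair: pair_performance(s) for pair, s in scores.items()}
high, low = split_by_median(performance)
```

With these made-up numbers the pair-level performances are 22, 18, 23, and 17, so the median cut at 20 places pairs A and C in the high group.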

Table 1 Assessment standard for pair-level concept maps

The top 100 keywords were extracted from the discussion data, and the sememes corresponding to these keywords were identified. One keyword may belong to several different sememes; in such cases, we chose a frequently occurring sememe as the sememe for this keyword. For example, the sememes of the keyword “theory” include Debate and Knowledge, and the sememes of the keyword “misconception” include Wrong and Knowledge. The sememe Knowledge appeared with a high frequency; therefore, the sememe of the keywords “theory” and “misconception” was defined as Knowledge in this research. Finally, the top 100 keywords were mapped to 65 sememes, of which 60 and 52 were identified in the high-performing pairs and low-performing pairs, respectively.
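The selection rule above can be illustrated with a minimal Python sketch. It assumes that "frequency" means the number of keywords listing a sememe as a candidate, which the text does not specify; the function name `assign_sememes` and the tie-breaking rule are likewise hypothetical.

```python
from collections import Counter

def assign_sememes(candidates):
    """Pick one sememe per keyword: the candidate that is most frequent overall.

    `candidates` maps each keyword to the list of sememes it may carry.
    """
    # Count how many keywords list each sememe as a candidate (assumed notion of frequency).
    counts = Counter(s for sememes in candidates.values() for s in sememes)
    # Keep the candidate with the highest count; on ties, the first listed candidate wins.
    return {
        keyword: max(sememes, key=lambda s: counts[s])
        for keyword, sememes in candidates.items()
    }

# The two keywords used as examples in the text.
candidates = {
    "theory": ["Debate", "Knowledge"],
    "misconception": ["Wrong", "Knowledge"],
}
assigned = assign_sememes(candidates)
```

Here Knowledge is listed by both keywords while Debate and Wrong appear once each, so both keywords are assigned Knowledge, matching the example in the text.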

To answer the first research question, semantic network analysis (SNA) and epistemic network analysis (ENA) were used to compare sememe characteristics between the high-performing pairs and low-performing pairs. First, the sememe networks of high-performing and low-performing pairs were created to identify co-occurrence structures using the network visualization software Gephi. Pointwise Mutual Information (PMI) was used to measure how strongly two sememes co-occurred across all discussions. PMI was represented as

$$PMI({S}_{1},{S}_{2})=\mathrm{log}(\frac{P\left({S}_{1},{S}_{2}\right)}{P({S}_{1})P({S}_{2})})$$

where \(P\left({S}_{1},{S}_{2}\right)\) represented the probability of the co-occurrence of \({S}_{1}\) and \({S}_{2}\), \(P({S}_{1})\) represented the probability of \({S}_{1}\) occurrence, and \(P({S}_{2})\) represented the probability of \({S}_{2}\) occurrence. Modularity analysis, a community detection method based on the Louvain algorithm, was conducted to reveal different clusters within a sememe network. The pair-level SNA metrics were calculated to uncover semantic network characteristics, including density, average path length (APL), transitivity, reciprocity, centralization, distance, and average weighting degree (AWD) (see Table 2) (Ouyang et al., 2021). The R package igraph was used to compute these SNA metrics.
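As an illustration of the PMI formula, the following Python sketch estimates the probabilities from message-level co-occurrence. Several choices here are assumptions rather than details from the study: a single discussion message is treated as the co-occurrence window, the natural logarithm is used, and the sample messages are invented.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(messages):
    """Return {(s1, s2): PMI} for every sememe pair that co-occurs at least once.

    `messages` is a list of sets; each set holds the sememes found in one message.
    """
    n = len(messages)
    single = Counter()   # number of messages containing each sememe
    joint = Counter()    # number of messages containing each unordered pair
    for sememes in messages:
        single.update(sememes)
        joint.update(combinations(sorted(sememes), 2))
    table = {}
    for (s1, s2), count in joint.items():
        p_joint = count / n
        p1, p2 = single[s1] / n, single[s2] / n
        table[(s1, s2)] = math.log(p_joint / (p1 * p2))
    return table

# Invented messages, each reduced to its set of sememes.
messages = [
    {"Education", "Knowledge"},
    {"Education", "Knowledge"},
    {"Education", "FuncWord"},
    {"FuncWord"},
]
pmi = pmi_table(messages)
```

A positive value, as for Education and Knowledge here, means the two sememes appear together more often than independence would predict; a negative value, as for Education and FuncWord, means they appear together less often.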

Table 2 The descriptions of SNA metrics

Moreover, ENA was used to analyze the co-occurrence structure of major sememes (i.e., sememes with high frequencies). A scatter plot of sememe frequencies was drawn to determine the number of sememes for ENA (see Fig. 4). Sememes were arranged on the scatter plot in descending order of frequency. The trend line of the scatter plot showed a sudden change when the number of sememes was 5 or 8. In order to display as many co-occurrence relationships between sememes as possible, we chose 8 sememes as the ENA nodes, which were Education, Image, Knowledge, Study, Plans, NounUnit, Thinking, and FuncWord (see Table 3).

Fig. 4 A frequency scatter plot for sememes

Note. X-axis represented sememes sorted from low to high frequency. Y-axis represented the frequencies of sememes.

Table 3 Sememes selected for ENA

To answer the second research question, we used 10 min as a time slice to analyze the evolutionary trends of sememes, and each CSCCM activity was divided into 5 stages with a total of 50 min. First, we constructed a time series for each sememe, defined as

$${k}_{i,j}=\frac{{P}_{i,j}}{\sum_{j=1}^{n}{P}_{i,j}}\quad \left(i=1,2,\dots ,m;\; j=1,2,\dots ,5\right)$$

where \({k}_{i,j}\) represented the relative frequency of the i-th sememe in the j-th 10-min slice, \({P}_{i,j}\) represented the frequency of the i-th sememe in the j-th 10-min slice, m was the number of sememes, and n = 5 was the number of time slices. Therefore, the time series matrix of sememes can be represented as

$$K={[{K}_{1},{K}_{2},\cdots ,{K}_{m}]}^{T}=\left[\begin{array}{cccc}{k}_{1,1}& {k}_{1,2}& \cdots & {k}_{1,5}\\ {k}_{2,1}& {k}_{2,2}& \cdots & {k}_{2,5}\\ \vdots & \vdots & \ddots & \vdots \\ {k}_{m,1}& {k}_{m,2}& \cdots & {k}_{m,5}\end{array}\right]$$
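The construction of the matrix K can be sketched as a row-wise normalization of raw counts. The example below is a minimal illustration in Python; the count values and the function name `relative_frequency` are hypothetical, and the handling of a sememe that never occurs (an all-zero row) is an assumption.

```python
def relative_frequency(counts):
    """Row-normalise a sememe-by-slice count matrix so that each row sums to 1.

    `counts[i][j]` is the frequency of the i-th sememe in the j-th 10-min slice.
    """
    matrix = []
    for row in counts:
        total = sum(row)
        # A sememe that never occurs keeps an all-zero row instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical counts for two sememes over the five 10-min slices.
counts = [
    [4, 2, 2, 1, 1],
    [0, 5, 5, 0, 0],
]
K = relative_frequency(counts)
```

Because each row is divided by its own total, the series describe how a sememe's use is distributed over the activity rather than how frequent it is in absolute terms, which makes frequent and rare sememes comparable.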

According to the results of modularity analysis, we calculated the average evolutionary trend of each cluster in the high-performing and low-performing pairs based on Euclidean distance. Euclidean distance is a commonly used distance measure for time series (Aghabozorgi et al., 2015); under this measure, the average trend of a cluster is the arithmetic mean of its members’ values in each time slice. The average evolutionary trend was considered as the overall characteristic of each cluster. Moreover, ENA was performed to characterize the evolution of major sememes across the five stages of CSCCM activities in the high-performing and low-performing pairs.
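The averaging step can be sketched as follows, assuming that the average evolutionary trend of a cluster is the slice-wise arithmetic mean of the relative-frequency series of its sememes. The two series below are hypothetical, and the helper names are not taken from the study.

```python
import math

def mean_trend(series):
    """Slice-wise arithmetic mean of several equal-length time series."""
    return [sum(values) / len(values) for values in zip(*series)]

def euclidean(a, b):
    """Euclidean distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical relative-frequency series of two sememes in one cluster.
cluster = [
    [0.4, 0.2, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
trend = mean_trend(cluster)
distances = [euclidean(member, trend) for member in cluster]
```

The slice-wise mean is the series that minimizes the sum of squared Euclidean distances to the cluster members, which is why it is a natural summary when Euclidean distance is the chosen measure.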

Results

RQ 1: What were the differences in semantic knowledge characteristics between pairs with high and low performances during CSCCM?

Regarding the semantic network analysis results, the high-performing pairs demonstrated higher connectedness and stronger cohesion than the low-performing pairs. Specifically, high-performing pairs had higher values of density, transitivity, centralization, average degree, and average weighting degree than low-performing pairs. In addition, high-performing pairs had lower values of average path length and distance than low-performing pairs (see Table 4). In summary, the high-performing pairs formed a semantic network with high connectedness and strong cohesion while the low-performing pairs formed a semantic network with low connectedness and weak cohesion.

Table 4 The comparison of the SNA metrics of high-performing and low-performing pairs

Modularity analysis generated two clusters for the high-performing pairs and four clusters for the low-performing pairs (see Fig. 5). For the high-performing pairs, cluster 1 centered on education-related theory and practice as well as technology applications and developments in education, whereas cluster 2 centered on grammatical meanings, forms, and functions. Specifically, approximately 38.33% of the sememes were grouped in cluster 1. The core sememes in cluster 1 were Knowledge, Education, Implement, Perception, and Study. Therefore, cluster 1 was about the course lectures and CSCCM-specific content. Cluster 2 contained the remaining 61.67% of the sememes, mainly sememes without full lexical meanings but with grammatical meanings and functions, such as FuncWord and NounUnit. In addition, the sememes Enrich and Merge indicated that students paid attention to the adjustment and modification of the concept map. For the low-performing pairs, cluster 1 and cluster 2 centered on grammatical meanings, forms, and functions, while cluster 3 and cluster 4 centered on education-related content and topics. Specifically, the core sememes of the four clusters were FuncWord, NounUnit, Knowledge, and Education, respectively. These clusters covered 18.87%, 13.21%, 37.74%, and 30.19% of the sememe frequency, respectively. In summary, sememes were more tightly connected in the high-performing pairs than in the low-performing pairs. Moreover, sememes related to the learning content were grouped into a single cluster in the high-performing pairs, which was reflected by the strong sememe connections in cluster 1. In contrast, sememes related to the learning content were split across two clusters (cluster 3 and cluster 4) in the low-performing pairs, which meant that students did not integrate the variety of knowledge related to the learning content in their discussions.

Fig. 5 Sememe network diagrams

Note. Nodes represented sememes and nodes in different colors represented different clusters. Node size represented relative influence, i.e., eigenvector centrality. Tie weights represented the strength of relations, i.e., co-occurrence frequency of two sememes.

ENA results showed the co-occurrence structure of major sememes in the high-performing pairs and low-performing pairs. The two groups were characterized by the connection values and the locations of the centroids in the ENA plots (see Fig. 6). For all pairs, most of the codes shared strong connections with Education, the core theme of the course content. However, the sememes connected to Education differed markedly between the groups, as reflected by the locations of the centroids in the epistemic networks. Specifically, for the high-performing pairs, the centroid of the epistemic network was located on the negative side of the X-axis, mainly focusing on Knowledge, Plans, Study, and Education. The connection between Education and Knowledge was 0.43; the connection between Education and Plans was 0.33; and the connection between Education and Study was 0.17. For the low-performing pairs, the centroid of the epistemic network was located on the positive side of the X-axis, focusing on NounUnit, FuncWord, Image, and Education. The connection between Education and NounUnit was 0.38; the connection between Education and Image was 0.28; and the connection between Education and NounUnit was 0.19. Moreover, a Mann–Whitney U test further revealed differences in the distribution of connections between the high-performing pairs and low-performing pairs. A significant difference was found on the X-axis (U = 60, p = 0.00, r = −0.87), which meant that the connections of the high-performing pairs differed significantly from those of the low-performing pairs. In summary, the high-performing pairs concentrated on discussing content related to the course and the CSCCM activities, whereas the low-performing pairs concentrated on discussing grammatical meanings and functions.

Fig. 6
The subtracted ENA plots of the high- and low-performing pairs

Note. In the subtracted network, the blue square represented the centroid of the high-performing pairs, the red square represented the centroid of the low-performing pairs, and the boxes represented 95% confidence intervals. The connection weights were compared between the high-performing and low-performing pairs, and each line took the color of the group that had the stronger connection between the two sememes. The color depth represented the strength of the connection.
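The group comparison on the X-axis reported above relies on the Mann–Whitney U test with an effect size r. A minimal sketch of how such values can be obtained is given below. It rests on several assumptions that the text does not confirm: r is taken as Z divided by the square root of the total sample size, the p-value uses the normal approximation without tie or continuity correction, U is reported for the first sample, and the coordinates are hypothetical. In practice, an established routine such as `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney(x, y):
    """Return (U, Z, two-sided p, r) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2          # U statistic for sample x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    r = z / math.sqrt(n1 + n2)                   # assumed effect-size convention
    return u, z, p, r


# Hypothetical X-axis coordinates of ENA points (not data from this research).
high_pairs = [-0.42, -0.35, -0.31, -0.20, -0.12]
low_pairs = [0.10, 0.18, 0.25, 0.33, 0.41]
u, z, p, r = mann_whitney(high_pairs, low_pairs)
```

With fully separated groups, as in this toy input, U for the first sample is 0 and r is strongly negative, which mirrors the direction of the effect reported for the X-axis.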

RQ 2: What were the differences in evolutionary trends of semantic knowledge between pairs with high and low performances during CSCCM?

The evolutionary trends of sememes in the high-performing pairs were relatively stable, whereas those in the low-performing pairs showed variability and fluctuation. For the high-performing pairs, the average evolutionary trends of sememes in the two clusters were roughly similar, as shown by their similar shapes throughout the CSCCM activities (see Fig. 7a). The range of sememe frequency (i.e., the maximum frequency minus the minimum frequency) was 0.29 for cluster 1 and 0.24 for cluster 2. In addition, the fluctuations of cluster 1 and cluster 2 occurred during the first half of the activity. For the low-performing pairs, cluster 1 and cluster 2, which centered on grammatical meaning, form, and function, varied more than cluster 3 and cluster 4, which centered on course-related knowledge (see Fig. 7b). Specifically, the ranges of sememe frequency in the four clusters were 0.49, 0.50, 0.27, and 0.21, respectively. In addition, the peaks of three clusters (all except cluster 4) occurred in the first half of the activity, while the valleys of all four clusters occurred in the second half. In summary, the evolution of sememes in the high-performing pairs was relatively stable compared to that in the low-performing pairs.

Fig. 7
Average evolutionary trend for clusters in the high- and low-performing pairs. The black lines represented the evolution of each sememe and the colored line represented the average evolution of the sememes in one cluster.
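The range statistic used for Fig. 7 (maximum frequency minus minimum frequency) can be sketched in a few lines. The example assumes that the range is taken over the cluster's average trend, which the text does not state explicitly, and the frequencies are hypothetical.

```python
def cluster_average(trends):
    """Average several sememe trends stage by stage (the colored line)."""
    return [sum(stage) / len(stage) for stage in zip(*trends)]


def frequency_range(series):
    """Range of a frequency series: maximum value minus minimum value."""
    return max(series) - min(series)


# Hypothetical relative frequencies of two sememes over five stages.
cluster_trends = [
    [0.10, 0.30, 0.20, 0.15, 0.10],
    [0.20, 0.40, 0.30, 0.25, 0.10],
]
average_trend = cluster_average(cluster_trends)
cluster_range = frequency_range(average_trend)
```

A smaller range indicates a flatter average trend, which is the sense in which the high-performing pairs are described as more stable.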

ENA results showed how the co-occurrence structure of the major sememes evolved in the high-performing and low-performing pairs. The two groups were characterized by the locations of the centroids in the ENA plots (see Fig. 8). For the high-performing pairs, the centroids for most stages fell in the upper half of the network, which indicated that these pairs were able to focus continuously on the course content for knowledge construction during the CSCCM activity. Their sememe evolution trend changed in the middle stage of the activity (i.e., 20–30 min), when the centroid was biased towards FuncWord. For the low-performing pairs, the centroids for most stages fell in the lower half of the network, which indicated that these pairs were not able to focus continuously on meaningful learning content during the CSCCM activity. Their centroid was biased towards course-related sememes, such as Education, in the middle stage of the activity (i.e., 20–40 min). Moreover, the Mann–Whitney U test was used to determine whether the positions of the stage centroids differed significantly between adjacent stages. For the high-performing pairs, there was no significant difference on the X-axis or Y-axis between any two adjacent stages. For the low-performing pairs, significant differences were found on the Y-axis between the centroid of 10–20 min and the centroid of 20–30 min (U = 10.5, p = 0.02, r = 0.67), and between the centroid of 30–40 min and the centroid of 40–50 min (U = 43, p = 0.01, r = −0.79). In summary, the co-occurrence structure of the major sememes in the high-performing pairs fluctuated slightly and focused mainly on course-related knowledge, while that in the low-performing pairs fluctuated greatly, with less attention to course-related knowledge.

Fig. 8
Evolution of the centroid in ENA plots
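The idea of tracking a centroid across stages can be sketched as follows, assuming that a stage centroid is the arithmetic mean of the projected points in that stage and that the shift between adjacent stages is measured as a Euclidean distance. Both are illustrative assumptions, and the coordinates are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def adjacent_shifts(stage_points):
    """Distance moved by the centroid between each pair of adjacent stages."""
    centroids = [centroid(points) for points in stage_points]
    return [math.dist(a, b) for a, b in zip(centroids, centroids[1:])]


# Hypothetical projected points for three consecutive 10-minute stages.
stages = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(1.0, 3.0), (1.0, 5.0)],
    [(4.0, 8.0), (4.0, 8.0)],
]
shifts = adjacent_shifts(stages)
```

Small shifts between adjacent stages correspond to the slight fluctuation described for the high-performing pairs, while large shifts correspond to the pattern described for the low-performing pairs.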

Discussions and implications

Addressing the research questions

To gain a deep understanding of students’ knowledge characteristics and evolution during the CKC process, this research integrated learning analytics methods with semantic knowledge analysis based on a knowledge base to extract, analyze, and understand students’ knowledge construction process. Regarding the first research question, the results showed that the high-performing pairs focused on course-related and activity-related knowledge, while the low-performing pairs concentrated on discussing grammatical meanings and functions. Specifically, for the high-performing pairs, meaningful sememes related to the course content and CSCCM activity (i.e., Education, Learning, Knowledge) formed a strong co-occurrence structure with a high frequency. In addition, meaningful sememes were clustered into one group in the sememe network, indicating that the high-performing pairs made full use of learning content and materials and thought comprehensively in the discussion. For the low-performing pairs, linguistic knowledge that represents units of language (i.e., NounUnit, FuncWord) formed a strong co-occurrence structure with a high frequency. In addition, for the low-performing pairs, sememes were not clustered into one group in the sememe network, indicating that these students tended to have scattered thoughts and could not think comprehensively in the discussion. Overall, the semantic knowledge of the high-performing pairs was characterized by a strong focus on course-related and activity-related knowledge, while that of the low-performing pairs was characterized by a strong emphasis on linguistic knowledge. Consistent with previous research (e.g., Peng & Xu, 2020; Yoon et al., 2021), the results indicated that when students concentrate on content that is relevant to the course, task, and activity during collaborative discussions, they tend to attain good academic results.

Regarding the second research question, the results indicated that the evolutionary trend of sememes in the high-performing pairs tended to be relatively stable, while that in the low-performing pairs showed variability and fluctuation over time. Specifically, compared to the low-performing pairs, the high-performing pairs exhibited smaller changes in sememe frequencies throughout the CSCCM activities (reflected by the smaller range of sememe frequency) and smaller differences in the co-occurrence structure of major sememes (reflected by closer centroid positions). In addition, the two clusters in the high-performing pairs had similar evolutionary trends, while the four clusters in the low-performing pairs had diverse evolutionary trends. This result further supported the finding that the high-performing pairs had a stable knowledge evolutionary trend, whereas the low-performing pairs showed variability and fluctuation in knowledge evolution. Overall, the results showed that the high-performing pairs demonstrated more sustained cognitive engagement than the low-performing pairs (Liu et al., 2022).

Analytical implications

Since CKC is a complex, adaptive, and dynamic process, this research extended knowledge extraction and analysis with an integrated approach that combined semantic knowledge analysis with learning analytics to gain a comprehensive understanding of knowledge characteristics and evolution during students’ CKC processes. Two analytical implications were generated from this research, namely the application of domain-specific knowledge bases and the application of AI-driven learning analytics and data mining. First, it is essential to develop semantic dictionaries or knowledge bases that are customized for specific subjects in order to conduct productive knowledge analysis. Semantic dictionaries or knowledge bases represent all words as a finite set of semantics by defining upper-level semantic properties. This representation is highly interpretable, and its results can be easily understood by students, instructors, and researchers. In general, this research represented the first attempt to apply knowledge bases from the natural language processing domain to knowledge analysis in educational contexts. However, general dictionaries cannot cover all the required terms and concepts in a certain field. Future work can construct, update, and maintain semantic dictionaries or knowledge bases in different specialized domains, thus improving the accuracy of the extracted semantic knowledge. Second, integrated approaches, particularly AI-driven learning analytics and data mining, are worth applying to capture the nature of CKC. Compared to traditional analytical methods, the integrated approach used in this research, namely learning analytics methods integrated with semantic knowledge analysis, can better extract and represent the complex and dynamic structure of CKC (de Carvalho & Zárate, 2020).
Future work can combine advanced AI algorithms (e.g., natural language processing and genetic programming) with learning analytics and data mining to offer timely, dynamic knowledge characteristics of CKC (de Carvalho & Zárate, 2020; Hoppe et al., 2021). For example, Ouyang et al. (2023) proposed an integrated approach that combined a probabilistic model with two sequence analysis and mining techniques to investigate macro-level collaborative patterns and micro-level sequences of group communicative discourses. Overall, to obtain a comprehensive understanding of students’ knowledge construction, the application of domain-specific knowledge bases and of AI-driven learning analytics and data mining has the potential to optimize the process of knowledge extraction and analysis, improve the generalizability and accuracy of results, and increase the efficiency of knowledge analysis work.

Pedagogical implications

Three pedagogical implications are proposed based on the semantic knowledge insights generated from our investigation, namely reasonable monitoring, appropriate incentives, and adaptive support. First, instructors should monitor discussions reasonably so that students who are trapped in off-topic discussions can be guided back in the right direction. Our results showed that, compared to the high-performing pairs, the low-performing pairs focused less on course-related discussions and showed more fluctuation in knowledge evolution. When these students face bottlenecks in the discussion, the instructor should identify areas where they need to improve, encourage them to ask questions and share thoughts, and provide information or hints to promote their innovative thinking (Al-Zahrani, 2015; Golding, 2011). Second, instructors should promote active involvement and introduce appropriate incentives to encourage students to stay on-topic. Results showed that pairs with low performance wasted time on content without practical meaning, while pairs with high performance focused on content related to the course and CSCCM activities. Therefore, instructors should consider ways (such as extra credit or praise) to encourage students to stay focused on the task at hand (Hou & Wu, 2011). Third, instructors should provide adaptive support in line with the dynamic evolution of student pairs and groups in the collaborative learning process. Our results showed considerable fluctuations in knowledge evolution during the initial stages and a decreased focus of knowledge in the later stages. As mentioned by Ouyang et al. (2023), the collaborative learning process needs a dynamic intervention approach to suit students’ complicated tendencies.
To be specific, instructors can observe, monitor, and regulate students so that they organize and focus their discussions on the topic in the first half of the activity, and guide students to inquire and construct knowledge from multiple perspectives in the second half of the activity. Overall, students’ activities should be reasonably monitored, and instructors should introduce appropriate incentives and support students’ work with suitable instructional interventions.

Conclusions, limitations, and future directions

In the era of the knowledge-based economy, knowledge has become increasingly extensive, complex, and diverse, which places high demands on understanding students’ knowledge. Using learning analytics methods integrated with semantic knowledge analysis, this study offered valuable insights into the complex and dynamic nature of students’ knowledge during the CKC process. The results revealed differences between pairs with high and low performances in terms of semantic knowledge characteristics and evolutionary trends. Moreover, this research provided analytical contributions for extracting, analyzing, and understanding students’ knowledge, and proposed pedagogical implications to promote students’ knowledge advancement. This research had two limitations, which point to future research directions. First, the sample size was small and the educational background of the participants was homogeneous, which weakened the generalizability of the results and implications. Future research should expand the sample to different instructional contexts in order to verify the results and implications. Second, this research mainly employed SNA and ENA to uncover the knowledge characteristics and evolution trends of students, which may not be sufficient for a comprehensive analysis of knowledge. Further research can integrate additional analytical methods (such as lag sequential analysis) and AI algorithms (such as probabilistic models) to provide a more complete description of the knowledge construction process. Overall, this research provided researchers and educators with new insights into the complex and dynamic nature of CKC and offered analytical implications for understanding students’ knowledge in a comprehensive way, which is essential for educational practice and research in the knowledge era.

Availability of data and materials

The data will be available on request from the corresponding author.

Acknowledgements

We thank the students who participated in this research.

Funding

This research was supported by the National Natural Science Foundation of China (62177041) and the Fundamental Research Funds for the Central Universities, Zhejiang University, China.

Author information

Contributions

Ning Zhang: Collection and analysis of research data, and writing of the original draft. Fan Ouyang: Conceptualization and manuscript writing (review, editing), supervision of the research process. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fan Ouyang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, N., Ouyang, F. An integrated approach for knowledge extraction and analysis in collaborative knowledge construction. Int J Educ Technol High Educ 20, 45 (2023). https://doi.org/10.1186/s41239-023-00414-5

  • DOI: https://doi.org/10.1186/s41239-023-00414-5

Keywords