
Table 10 Summary of the 225 classification features used in our random forest classifiers

From: A revised application of cognitive presence automatic classifiers for MOOCs: a new set of indicators revealed?

| Category | Feature name | Feature description | Quantity |
| --- | --- | --- | --- |
| Discussion contextual features | mes.depth | The numeric position (chronological order) of the message within a thread | 1 |
| Discussion contextual features | mes.replies | The total number of replies beneath the message in a thread | 1 |
| Discussion contextual features | mes.start | A binary flag indicating whether the message is the start of a thread | 1 |
| Discussion contextual features | mes.end | A binary flag indicating whether the message is the end of a thread | 1 |
| Linguistic features | cm* | Cohesion measure features from the Coh-Metrix tool | 108 |
| Linguistic features | liwc* | Word-collection based features from the LIWC tool | 90 |
| Semantic similarity | sim.cos.pre | Cosine similarity of the current and the previous message, each represented by a TF-IDF weighted vector | 1 |
| Semantic similarity | sim.cos.next | Cosine similarity of the current and the next message, each represented by a TF-IDF weighted vector | 1 |
| Semantic similarity | sim.bert.pre | Similarity of the current and the previous message, each represented by a pre-trained BERT embedding vector | 1 |
| Semantic similarity | sim.bert.next | Similarity of the current and the next message, each represented by a pre-trained BERT embedding vector | 1 |
| Named entities | ner* | The number of occurrences in each message of each of 18 types of named entities, including Person, ORG, Date, GPE, Location, Time, etc. | 18 |
| Named entities | ner.total | The total number of named entities of all the above types in a message | 1 |

  1. Coh-Metrix (version 3.0) and LIWC (2015 version) share three duplicate features: (1) the number of words in the message, (2) the average number of words per sentence in the message, and (3) the number of first-person singular pronouns in the message. After removing these three duplicates from the LIWC set, we used 198 computational linguistic features (108 from Coh-Metrix and 90 from LIWC) to build our automatic classifier.
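
As an illustration of how the sim.cos.pre and sim.cos.next features could be computed, the sketch below builds TF-IDF weighted vectors for the messages of one thread and takes the cosine similarity between each message and its chronological neighbours. It is a minimal sketch and not the authors' implementation: whitespace tokenisation, the smoothed IDF variant, the value 0.0 for a message with no previous or next neighbour, and the function names (`tfidf_vectors`, `cosine`, `adjacent_similarities`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(messages):
    """Return one sparse TF-IDF vector (dict: term -> weight) per message."""
    # Assumption: lower-cased whitespace tokenisation; the source does not
    # describe the preprocessing that was actually used.
    tokenized = [m.lower().split() for m in messages]
    n_docs = len(tokenized)
    # Document frequency: number of messages that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
             for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjacent_similarities(messages):
    """Compute sim.cos.pre and sim.cos.next for messages in thread order."""
    vecs = tfidf_vectors(messages)
    feats = []
    for i, vec in enumerate(vecs):
        # Assumption: the first message has no predecessor and the last has
        # no successor, so the corresponding feature is set to 0.0.
        pre = cosine(vec, vecs[i - 1]) if i > 0 else 0.0
        nxt = cosine(vec, vecs[i + 1]) if i + 1 < len(vecs) else 0.0
        feats.append({"sim.cos.pre": pre, "sim.cos.next": nxt})
    return feats


# Toy thread in chronological order (invented for illustration).
thread = [
    "how does gradient descent work",
    "gradient descent updates weights step by step",
    "thanks that helps a lot",
]
features = adjacent_similarities(thread)
for message, feat in zip(thread, features):
    print(message, feat)
```

By construction, the sim.cos.next value of one message equals the sim.cos.pre value of the message that follows it, so the two features differ only in which message they are attached to. The BERT-based features could follow the same pattern with embedding vectors in place of the TF-IDF vectors.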