A revised application of cognitive presence automatic classifiers for MOOCs: a new set of indicators revealed?

Table 10 Summary of the 225 classification features used in our random forest classifiers

Category	Feature name	Feature description	Quantity
Discussion contextual features	mes.depth	The numeric position (chronological order) within a thread	1
	mes.replies	The total number of replies beneath each message in a thread	1
	mes.start	A binary indicator of whether the message is the start of a thread	1
	mes.end	A binary indicator of whether the message is the end of a thread	1
Linguistic features	cm*	Cohesion measure features from the Coh-Metrix tool	108
Linguistic features	liwc*	Word-collection based features from the LIWC tool	90
Semantic similarity	sim.cos.pre	Cosine similarity between the current and the previous message, each represented as a TF-IDF-weighted vector	1
	sim.cos.next	Cosine similarity between the current and the next message, each represented as a TF-IDF-weighted vector	1
	sim.bert.pre	Similarity between the current and the previous message, each represented as a pre-trained BERT embedding vector	1
	sim.bert.next	Similarity between the current and the next message, each represented as a pre-trained BERT embedding vector	1
Named entities	ner*	The number of occurrences in each message of each of 18 named-entity types, including Person, ORG, Date, GPE, Location, and Time	18
Named entities	ner.total	The total number of named entities of all the above types in a message	1
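
As an illustration of how the positional and TF-IDF similarity features in the table might be derived for a single thread, the following is a minimal sketch and not the implementation used in this study. It makes several assumptions that the table does not specify: whitespace tokenisation, a smoothed IDF fitted on the messages of one thread only, 1-based mes.depth, and a similarity of 0.0 where no previous or next message exists. The function names (tfidf_vectors, cosine, thread_features) are hypothetical. mes.replies is omitted because it requires the reply tree, and the BERT-based similarities are omitted because they require a pre-trained model.

```python
import math
from collections import Counter


def tfidf_vectors(messages):
    """Return one sparse TF-IDF vector (dict: term -> weight) per message."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenized = [m.lower().split() for m in messages]
    n_docs = len(tokenized)
    # Document frequency: in how many messages each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumption: smoothed IDF, so terms found in every message keep weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def thread_features(messages):
    """messages: the texts of one thread, in chronological order."""
    vectors = tfidf_vectors(messages)
    last = len(messages) - 1
    rows = []
    for i in range(len(messages)):
        rows.append({
            "mes.depth": i + 1,            # assumption: 1-based position
            "mes.start": int(i == 0),
            "mes.end": int(i == last),
            # Assumption: 0.0 when there is no neighbouring message.
            "sim.cos.pre": cosine(vectors[i], vectors[i - 1]) if i > 0 else 0.0,
            "sim.cos.next": cosine(vectors[i], vectors[i + 1]) if i < last else 0.0,
        })
    return rows


thread = [
    "what is cognitive presence",
    "cognitive presence is about inquiry",
    "thanks for sharing",
]
for row in thread_features(thread):
    print(row)
```

In this toy thread the second message shares terms with the first but none with the third, so its sim.cos.pre is positive while its sim.cos.next is zero.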

Coh-Metrix (version 3.0) and LIWC (2015 version) both provide three features that duplicate each other: (1) the number of words in the message, (2) the average number of words in the message, and (3) the number of first-person singular pronouns in the message. We therefore removed the three duplicates from the LIWC feature set and used the remaining 198 computational linguistic features to build our automatic classifier.
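
The feature counts reported in Table 10 and in the note above can be checked with a short tally. This is only a bookkeeping sketch: the group labels are ours, and the pre-deduplication LIWC count of 93 is inferred from the 90 retained features plus the three removed duplicates; it is not stated directly.

```python
# Feature counts per group, copied from Table 10.
feature_counts = {
    "discussion context (mes.*)": 4,
    "Coh-Metrix (cm*)": 108,
    "LIWC (liwc*), duplicates removed": 90,
    "semantic similarity (sim.*)": 4,
    "named-entity types (ner*)": 18,
    "named-entity total (ner.total)": 1,
}

# Inferred, not reported: LIWC features before the three duplicates were dropped.
liwc_before_dedup = feature_counts["LIWC (liwc*), duplicates removed"] + 3

linguistic_total = (
    feature_counts["Coh-Metrix (cm*)"]
    + feature_counts["LIWC (liwc*), duplicates removed"]
)
grand_total = sum(feature_counts.values())

print(liwc_before_dedup)  # 93
print(linguistic_total)   # 198
print(grand_total)        # 225
```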