

Coherence measurement has been studied across various tasks, such as the document discrimination task (Barzilay and Lapata, 2005; Elsner et al., 2007; Barzilay and Lapata, 2008; Elsner and Charniak, 2011; Li and Jurafsky, 2017; Putra and Tokunaga, 2017), sentence insertion (Elsner and Charniak, 2011; Putra and Tokunaga, 2017; Xu et al., 2019), paragraph reconstruction (Lapata, 2003; Elsner et al., 2007; Li and Jurafsky, 2017; Xu et al., 2019; Prabhumoye et al., 2020), summary coherence rating (Barzilay and Lapata, 2005; Pitler et al., 2010; Guinaudeau and Strube, 2013; Tien Nguyen and Joty, 2017), readability assessment (Guinaudeau and Strube, 2013; Mesgar and Strube, 2016, 2018), and essay scoring (Mesgar and Strube, 2018; Somasundaran et al., 2014; Tay et al., 2018). These tasks differ from our task of intruder sentence detection as follows. First, the document discrimination task assigns coherence scores to a document and its sentence-permuted versions, where the original document is considered to be well-written and coherent and the permuted versions incoherent. Incoherence is introduced by shuffling sentences, while our intruder sentences are selected from a second document, and there is only ever a single intruder sentence per document. Second, sentence insertion aims to find the correct position to insert a removed sentence back into a document. Paragraph reconstruction aims to recover the original sentence order of a shuffled paragraph given its first sentence. These two tasks do not consider sentences from outside of the document of interest. Third, the aforementioned three tasks are artificial and have very limited utility in terms of real-world tasks, while our task can provide direct benefit in applications such as essay scoring, in identifying incoherent (intruder) sentences as a means of providing user feedback and explainability of essay scores. Lastly, in summary coherence rating, readability assessment, and essay scoring, coherence is just one dimension of the overall document quality measurement.
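The synthetic nature of these tasks can be made concrete with a small sketch. The snippet below (illustrative names only, not code from any of the cited works) shows how a sentence-permuted document for the discrimination task and an intruder document for our task could be constructed; the intruder replaces one sentence with a sentence drawn from a second document.

    import random

    def permuted_document(sentences, seed=0):
        # Document discrimination: shuffle the sentence order of the original document.
        rng = random.Random(seed)
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        return shuffled

    def intruder_document(sentences, other_sentences, seed=0):
        # Intruder detection: replace one sentence with a sentence from a second document.
        rng = random.Random(seed)
        position = rng.randrange(len(sentences))
        corrupted = list(sentences)
        corrupted[position] = rng.choice(other_sentences)
        return corrupted, position  # `position` is the gold intruder index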

Various methods have been proposed to capture local and global coherence, while our work aims to examine the performance of existing pretrained LMs in document coherence understanding. To assess local coherence, traditional studies have used entity matrices, for example, to represent entity transitions across sentences (Barzilay and Lapata, 2005, 2008). Guinaudeau and Strube (2013) and Mesgar and Strube (2016) use a graph to model entity transition sequences. Sentences in a document are represented by nodes in the graph, and two nodes are connected if they share the same or similar entities.
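As an illustration of the graph view, the sketch below connects two sentences whenever they share an "entity"; it captures only the general idea, not the cited authors' implementations, and entity extraction is crudely approximated by overlapping capitalised tokens.

    def entity_graph(sentences):
        # Nodes are sentence indices; an edge links sentences that mention a shared entity.
        def entities(sentence):
            # Naive stand-in for entity extraction: capitalised tokens, punctuation stripped.
            return {tok.strip(".,;:()") for tok in sentence.split() if tok[:1].isupper()}
        sentence_entities = [entities(s) for s in sentences]
        edges = set()
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                if sentence_entities[i] & sentence_entities[j]:
                    edges.add((i, j))
        return edges  # graph statistics (e.g., average degree) can serve as coherence signals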

Neural models have also been proposed (Ji and Smith, 2017; Li and Jurafsky, 2017; Li et al., 2018; Mesgar and Strube, 2018; Mim et al., 2019; Tien Nguyen and Joty, 2017). (2018) capture local coherence by computing the similarity of the output of two LSTMs (Hochreiter and Schmidhuber, 1997), which they concatenate with essay representations to score essays. (2018) use multi-headed self-attention to capture long-distance relationships between words, which are passed to an LSTM layer to estimate essay coherence scores. (2019) use the average of local coherence scores between consecutive pairs of sentences as the document coherence score.
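The "average of local coherence scores" formulation can be sketched as follows; the local score below is only a placeholder (cosine similarity over bag-of-words counts), whereas the cited models learn this score with neural encoders.

    import math
    from collections import Counter

    def local_score(sent_a, sent_b):
        # Placeholder local coherence score: cosine similarity of bag-of-words vectors.
        a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def document_coherence(sentences):
        # Document score = mean local score over consecutive sentence pairs.
        pairs = list(zip(sentences, sentences[1:]))
        return sum(local_score(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0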

There is also a rich literature on exploring what kinds of linguistic phenomena a model has learned (Hu et al., 2020a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Chen et al., 2019; McCoy et al., 2019; Conneau et al., 2018; Gulordava et al., 2018; Peters et al., 2018; Tang et al., 2018; Blevins et al., 2018; Wilcox et al., 2018; Kuncoro et al., 2018; Tran et al., 2018; Belinkov et al., 2017). The basic idea is to use learned representations to predict linguistic properties of interest. Example linguistic properties are subject–verb agreement or syntactic structure, while representations can be word or sentence embeddings. For example, Marvin and Linzen (2018) construct minimal sentence pairs, consisting of a grammatical and an ungrammatical sentence, to explore the capacity of LMs in capturing phenomena such as subject–verb agreement, reflexive anaphora, and negative polarity items.
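A typical probing setup can be sketched as below, assuming precomputed sentence (or word) embeddings and labels for the property of interest; the inputs and the scikit-learn linear probe are illustrative choices, not the setup of any particular cited study.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
        # Train a linear probe on frozen representations and report held-out accuracy.
        X_train, X_test, y_train, y_test = train_test_split(
            embeddings, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        return clf.score(X_test, y_test)  # higher accuracy suggests the property is encoded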
