Text Segmentation based on Semantic Word Embeddings
Abstract
We explore the use of semantic word embeddings [?, ?, ?] in text segmentation algorithms, including the C99 segmentation algorithm [?, ?] and new algorithms inspired by the distributed word vector representation. By developing a general framework for discussing a class of segmentation objectives, we study the effectiveness of greedy versus exact optimization approaches and suggest a new iterative refinement technique for improving the performance of greedy strategies. We compare our results to known benchmarks [?, ?, ?, ?], using known metrics [?, ?]. We demonstrate state-of-the-art performance for an untrained method with our Content Vector Segmentation (CVS) on the Choi test set. Finally, we apply the segmentation procedure to an in-the-wild dataset consisting of text extracted from scholarly articles in the arXiv.org database.
Alexander A Alemi 
Dept of Physics 
Cornell University 
aaa244@cornell.edu 
Paul Ginsparg 
Depts of Physics and Information Science 
Cornell University 
ginsparg@cornell.edu 
Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Text Analysis

Information Retrieval, Clustering, Text

Text Segmentation, Text Mining, Word Vectors
Segmenting text into naturally coherent sections has many useful applications in information retrieval and automated text summarization, and has received much past attention. An early text segmentation algorithm was the TextTiling method introduced by Hearst [?] in 1997. Text was scanned linearly, with a coherence calculated for each adjacent block, and a heuristic was used to determine the locations of cuts. In addition to linear approaches, there are text segmentation algorithms that optimize some scoring objective. An early algorithm in this class was Choi’s C99 algorithm [?] in 2000, which also introduced a benchmark segmentation dataset used by subsequent work. Instead of looking only at nearest neighbor coherence, the C99 algorithm computes a coherence score between all pairs of elements of text, and searches for a text segmentation that optimizes an objective based on that scoring by greedily making a succession of best cuts. (By ‘elements’, we mean the pieces of text combined in order to comprise the segments. In the applications to be considered, the basic elements will be either sentences or words.) Later work by Choi and collaborators [?] used distributed representations of words rather than a bag of words approach, with the representations generated by LSA [?]. In 2001, Utiyama and Isahara introduced a statistical model for segmentation and optimized a posterior for the segment boundaries. Moving beyond the greedy approaches, in 2004 Fragkou et al. [?] attempted to find the optimal splitting for their own objective using dynamic programming. More recent attempts at segmentation, including Misra et al. [?] and Riedl and Biemann [?], used LDA-based topic models to inform the segmentation task. Du et al. consider structured topic models for segmentation [?]. Eisenstein and Barzilay [?] and Dadachev et al. [?] both consider a Bayesian approach to text segmentation. Most similar to our own work, Sakahara et al. [?] consider a segmentation algorithm which does affinity propagation clustering on text representations built from word vectors learned from word2vec [?].
For the most part, aside from [?], the non-topic-model-based segmentation approaches have been based on relatively simple representations of the underlying text. Recent approaches to learning word vectors, including Mikolov et al.’s word2vec [?], Pennington et al.’s GloVe [?] and Levy and Goldberg’s pointwise mutual information [?], have seen remarkable success in solving analogy tasks, machine translation [?], and sentiment analysis [?]. These word vector approaches attempt to learn a log-linear model for word-word co-occurrence statistics, such that the probability of two words appearing near one another is proportional to the exponential of their dot product,

$$P(w \mid c) \propto \exp(w \cdot c), \tag{1}$$

with $w$ and $c$ the vectors of the two words. The method relies on these word-word co-occurrence statistics encoding meaningful semantic and syntactic relationships. Arora et al. [?] have shown how the remarkable performance of these techniques can be understood in terms of relatively mild assumptions about corpora statistics, which in turn can be recreated with a simple generative model.
Here we explore the utility of word vectors for text segmentation, both in the context of existing algorithms such as C99, and when used to construct new segmentation objectives based on a generative model for segment formation. We will first construct a framework for describing a family of segmentation algorithms, then discuss the specific algorithms to be investigated in detail. We then apply our modified algorithms both to the standard Choi test set and to a test set generated from arXiv.org research articles.
The segmentation task is to split a text into contiguous coherent sections. We first build a representation of the text, by splitting it into $N$ basic elements, $V_i$ ($i = 0, \ldots, N-1$), each a $D$-dimensional feature vector $V_{ik}$ ($k = 1, \ldots, D$) representing the element. Then we assign a score $\sigma(i, j)$ to each candidate segment, comprised of the $i$th through $(j-1)$th elements, and finally determine how to split the text into the appropriate number of segments.
Denote a segmentation of text into $C$ segments as a list of indices $s = (s_1, \ldots, s_C)$, where the $a$th segment includes the elements $V_i$ with $s_{a-1} \le i < s_a$, with $s_0 \equiv 0$. For example, the string “aaabbcccdd” considered at the character level would be properly split with $s = (3, 5, 8, 10)$ into (“aaa”, “bb”, “ccc”, “dd”).
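The index convention can be checked on the character-level example with a minimal sketch. The helper name `split_by_boundaries` is hypothetical and is introduced here only for illustration.

```python
def split_by_boundaries(elements, boundaries):
    """Split a sequence into segments given exclusive end indices.

    `boundaries` plays the role of the list of segment end indices;
    each segment runs from the previous end index (0 for the first
    segment) up to, but not including, its own end index.
    """
    segments = []
    start = 0
    for end in boundaries:
        segments.append(elements[start:end])
        start = end
    return segments


print(split_by_boundaries("aaabbcccdd", [3, 5, 8, 10]))
# ['aaa', 'bb', 'ccc', 'dd']
```

The last boundary always equals the number of elements, so a segmentation into a given number of segments has one fewer free boundary than it has segments.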
The text representation thus amounts to turning a plain text document $T$ into an $N \times D$ matrix $V$, with $N$ the number of initial elements to be grouped into coherent segments and $D$ the dimensionality of the element representation. For example, if segmenting at the word level then $N$ would be the number of words in the text, and each word might be represented by a $D = 300$ dimensional vector, such as those obtained from GloVe [?]. If segmenting instead at the sentence level, then $N$ is the number of sentences in the text and we must decide how to represent each sentence.
There are additional preprocessing decisions, for example using a stemming algorithm or removing stop words before forming the representation. Particular preprocessing decisions can have a large effect on the performance of segmentation algorithms, but for discussing scoring functions and splitting methods those decisions can be abstracted into the specification of the $N \times D$ matrix $V$.
Having built an initial representation of the text, we next specify the coherence of a segment of text with a scoring function $\sigma(i, j)$, which acts on the representation $V$ and returns a score for the segment running from $i$ (inclusive) to $j$ (non-inclusive). The score can be a simple scalar or a more general object. In addition to the scoring function, we need to specify how to return an aggregate score for the entire segmentation. This score aggregation function $\oplus$ can be as simple as adding the scores for the individual segments, or again some more general function. The score $S(s)$ for an overall segmentation is given by aggregating the scores of all of the segments in the segmentation:

$$S(s) = \sigma(0, s_1) \oplus \sigma(s_1, s_2) \oplus \cdots \oplus \sigma(s_{C-1}, s_C). \tag{2}$$

Finally, to frame the segmentation problem as a form of optimization, we need to map the aggregated score to a single scalar. The key function ($[\![\cdot]\!]$) returns this single number, so that the cost for the above segmentation is

$$\mathcal{C}(s) = [\![ S(s) ]\!]. \tag{3}$$

For most of the segmentation schemes to be considered, the score function itself returns a scalar, so the score aggregation function $\oplus$ will be taken as simple addition with the key function the identity, but the generality here allows us to incorporate the C99 segmentation algorithm [?] into the same framework.
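As an illustration of how the three ingredients (segment score, aggregation, key) combine into a single cost, here is a minimal sketch. The function names and the toy one-dimensional data are assumptions made for this example and are not taken from the released code.

```python
from functools import reduce


def segmentation_cost(boundaries, score, aggregate, key):
    """Cost of a segmentation given as a list of exclusive end indices."""
    starts = [0] + list(boundaries[:-1])
    segment_scores = [score(i, j) for i, j in zip(starts, boundaries)]
    # Fold the per-segment scores together, then map to a single scalar.
    return key(reduce(aggregate, segment_scores))


# Toy one-dimensional "text" (hypothetical data).
data = [1.0, 1.0, 1.0, 5.0, 5.0]


def squared_deviation(i, j):
    """Sum of squared deviations of data[i:j] from its mean."""
    segment = data[i:j]
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def add(x, y):
    return x + y


def identity(x):
    return x


good = segmentation_cost([3, 5], squared_deviation, add, identity)
bad = segmentation_cost([2, 5], squared_deviation, add, identity)
print(good, bad)  # the boundary at 3 separates the two plateaus, so good < bad
```

Passing the aggregation and key functions as arguments keeps the splitting code independent of whether a segment score is a scalar or a more general object.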
Having specified the representation of the text and scoring of the candidate segments, we need to prescribe how to choose the final segmentation. In this work, we consider three methods: (1) greedy splitting, which at each step inserts the best available segmentation boundary; (2) dynamic programming based segmentation, which uses dynamic programming to find the optimal segmentation; and (3) an iterative refinement scheme, which starts with the greedy segmentation and then adjusts the boundaries to improve performance.
The greedy segmentation approach builds up a segmentation into $C$ segments by greedily inserting new boundaries at each step to minimize the aggregate score:

$$s^{0} = \{N\}, \tag{4}$$

$$s^{t+1} = s^{t} \cup \Big\{ \arg\min_{i \in [1, N)} \mathcal{C}\big(s^{t} \cup \{i\}\big) \Big\}, \tag{5}$$

until the desired number of splits is reached. Many published text segmentation algorithms are greedy in nature, including the original C99 algorithm [?].
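The greedy procedure can be sketched as follows, assuming scalar segment scores that are aggregated by addition with an identity key. The squared-deviation score and the toy data are placeholders chosen for this example, not the scoring functions used in the experiments.

```python
def make_squared_deviation_score(data):
    """Return score(i, j): squared deviations of data[i:j] from its mean."""
    def score(i, j):
        segment = data[i:j]
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)
    return score


def additive_cost(boundaries, score):
    """Sum of segment scores for sorted exclusive end indices."""
    starts = [0] + boundaries[:-1]
    return sum(score(i, j) for i, j in zip(starts, boundaries))


def greedy_split(n, num_segments, score):
    """Insert one boundary at a time, always taking the best available cut."""
    boundaries = [n]  # start from a single segment covering everything
    while len(boundaries) < num_segments:
        candidates = [i for i in range(1, n) if i not in boundaries]
        best = min(
            candidates,
            key=lambda i: additive_cost(sorted(boundaries + [i]), score),
        )
        boundaries = sorted(boundaries + [best])
    return boundaries


data = [1.0, 1.1, 0.9, 5.0, 5.2, 9.0, 9.1, 8.9]
print(greedy_split(len(data), 3, make_squared_deviation_score(data)))
# [3, 5, 8]
```

This version recomputes the full cost for every candidate cut, which is simple but wasteful; caching segment scores would be the natural optimization.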
The greedy segmentation algorithm is not guaranteed to find the optimal splitting, but dynamic programming methods can be used for the text segmentation problem formulated in terms of optimizing a scoring objective. For a detailed account of dynamic programming and segmentation in general, see the thesis by Terzi [?]. Dynamic programming has been applied to text segmentation in Fragkou et al. [?], with much success, but we will also consider here an optimization of the C99 segmentation algorithm using a dynamic programming approach.
The goal of the dynamic programming approach is to split the segmentation problem into a series of smaller segmentation problems, by expressing the optimal segmentation of the first $K$ elements of the sequence into $C$ segments in terms of the best choice for the last segmentation boundary. The aggregated score $S(K, C)$ for this optimal segmentation should be minimized with respect to the key function $[\![\cdot]\!]$:

$$S(K, 1) = \sigma(0, K), \tag{6}$$

$$S(K, C) = S(l^{*}, C-1) \oplus \sigma(l^{*}, K), \qquad l^{*} = \arg\min_{l < K} \, [\![ S(l, C-1) \oplus \sigma(l, K) ]\!]. \tag{7}$$

While the dynamic programming approach yields the optimal segmentation for our decomposable score function, it can be costly to compute, especially for long texts. In practice, both the optimal segmentation score and the resulting segmentation can be found in one pass by building up a table of segmentation scores and optimal cut indices one row at a time.
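The table-building pass can be sketched as below. The sketch assumes scalar scores, additive aggregation and an identity key; the function name `dp_split`, the squared-deviation score and the toy data are illustrative assumptions.

```python
def make_squared_deviation_score(data):
    """Return score(i, j): squared deviations of data[i:j] from its mean."""
    def score(i, j):
        segment = data[i:j]
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)
    return score


def dp_split(n, num_segments, score):
    """Optimal split of n elements into num_segments contiguous segments."""
    inf = float("inf")
    # best[c][k]: minimal cost of splitting the first k elements into c segments
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    # cut[c][k]: position of the last boundary in that optimal split
    cut = [[0] * (n + 1) for _ in range(num_segments + 1)]
    for k in range(1, n + 1):
        best[1][k] = score(0, k)
    for c in range(2, num_segments + 1):
        for k in range(c, n + 1):
            for last in range(c - 1, k):
                candidate = best[c - 1][last] + score(last, k)
                if candidate < best[c][k]:
                    best[c][k] = candidate
                    cut[c][k] = last
    # Walk the cut table backwards to recover the boundaries.
    boundaries = [n]
    k = n
    for c in range(num_segments, 1, -1):
        k = cut[c][k]
        boundaries.append(k)
    return boundaries[::-1], best[num_segments][n]


data = [1.0, 1.1, 0.9, 5.0, 5.2, 9.0, 9.1, 8.9]
boundaries, cost = dp_split(len(data), 3, make_squared_deviation_score(data))
print(boundaries, round(cost, 6))
# [3, 5, 8] 0.06
```

The triple loop evaluates a number of candidate cuts that grows quadratically with the number of elements for each row of the table, which is the cost that makes word-level segmentation of long texts expensive.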
Inspired by the popular Lloyd algorithm for $k$-means, we attempt to retain the computational benefit of the greedy segmentation approach, but realize additional performance gains by iteratively refining the segmentation. Since text segmentation problems require contiguous blocks of text, a natural scheme for relaxation is to try to move each segment boundary optimally while keeping the edges to either side of it fixed:

$$s^{t}_{a \to l} \equiv \big(s^{t} \setminus \{s^{t}_{a}\}\big) \cup \{l\}, \tag{8}$$

$$s^{t+1}_{a} = \arg\min_{l \in (s^{t}_{a-1},\, s^{t}_{a+1})} \mathcal{C}\big(s^{t}_{a \to l}\big). \tag{9}$$

We will see in practice that by 20 iterations the refinement has typically converged to a fixed point very close to the optimal dynamic programming segmentation.
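A minimal sketch of the refinement loop follows. It assumes additive scalar scores, so that moving one boundary only changes the two segments adjacent to it, and it updates boundaries one after another within a sweep; both choices, along with the function names and toy data, are assumptions of this example.

```python
def make_squared_deviation_score(data):
    """Return score(i, j): squared deviations of data[i:j] from its mean."""
    def score(i, j):
        segment = data[i:j]
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)
    return score


def refine_split(boundaries, score, max_iters=20):
    """Move each interior boundary to its best position between its neighbors."""
    boundaries = list(boundaries)
    for _ in range(max_iters):
        moved = False
        # The final boundary marks the end of the text and stays fixed.
        for a in range(len(boundaries) - 1):
            left = boundaries[a - 1] if a > 0 else 0
            right = boundaries[a + 1]
            best = min(
                range(left + 1, right),
                key=lambda l: score(left, l) + score(l, right),
            )
            if best != boundaries[a]:
                boundaries[a] = best
                moved = True
        if not moved:  # fixed point reached
            break
    return boundaries


data = [1.0, 1.1, 0.9, 5.0, 5.2, 9.0, 9.1, 8.9]
print(refine_split([2, 6, 8], make_squared_deviation_score(data)))
# [3, 5, 8]
```

In use, the starting boundaries would come from the greedy splitter rather than from a hand-picked list as here.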
In the experiments to follow, we will test various choices for the representation, scoring function, and splitting method in the above general framework. The segmentation algorithms to be considered fall into three groups:
Choi’s C99 algorithm [?] was an early text segmentation algorithm with promising results. The feature vector for an element of text is chosen as the pairwise cosine distances with other elements of text, where those elements in turn are represented by a bag of stemmed words vector (after preprocessing to remove stop words):

$$A_{ij} = \frac{\sum_{w} f_{iw} f_{jw}}{\sqrt{\sum_{w} f_{iw}^{2} \, \sum_{w} f_{jw}^{2}}}, \tag{10}$$

with $f_{iw}$ the frequency of word $w$ in element $i$. The pairwise cosine distance matrix is noisy for these features, and since only the relative values are meaningful, C99 employs a ranking transformation, replacing each value of the matrix by the fraction of its neighbors with smaller value:

$$V_{ij} = \frac{1}{|\mathcal{N}_r(i,j)|} \sum_{(l,m) \in \mathcal{N}_r(i,j)} \big[ A_{ij} > A_{lm} \big], \tag{11}$$

where the neighborhood $\mathcal{N}_r(i,j)$ is an $r \times r$ block around the entry $(i,j)$, the square brackets mean 1 if the inequality is satisfied otherwise 0 (and values off the end of the matrix are not counted in the sum, or towards the normalization). Each element of the text in the C99 algorithm is represented by a rank transformed vector of its cosine distances to each other element.
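The two steps (bag-of-words cosine matrix, then local rank transformation) can be sketched as below. The block size `r=3`, the exclusion of the central entry from its own neighborhood, and the three toy sentences are assumptions of this example.

```python
import math
from collections import Counter


def cosine(f, g):
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(count * g[word] for word, count in f.items() if word in g)
    norm = math.sqrt(
        sum(v * v for v in f.values()) * sum(v * v for v in g.values())
    )
    return dot / norm if norm else 0.0


def rank_transform(matrix, r=3):
    """Replace each entry by the fraction of its in-bounds neighbors
    (within an r x r block) that have a strictly smaller value."""
    n = len(matrix)
    half = r // 2
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            smaller = 0
            count = 0
            for l in range(max(0, i - half), min(n, i + half + 1)):
                for m in range(max(0, j - half), min(n, j + half + 1)):
                    if (l, m) == (i, j):
                        continue  # the entry is not its own neighbor
                    count += 1
                    if matrix[l][m] < matrix[i][j]:
                        smaller += 1
            ranked[i][j] = smaller / count if count else 0.0
    return ranked


sentences = [["cat", "sat"], ["cat", "ran"], ["dog", "barked"]]
bags = [Counter(words) for words in sentences]
similarity = [[cosine(a, b) for b in bags] for a in bags]
ranked = rank_transform(similarity, r=3)
print(similarity[0][1], ranked[1][1])
# 0.5 0.75
```

Because neighbors outside the matrix are skipped in both the count and the sum, corner and edge entries are normalized by fewer neighbors than interior ones.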
The score function describes the average inter-sentence similarity by taking the overall score to be

$$\mathcal{C}(s) = \frac{\sum_{a} \beta_a}{\sum_{a} \alpha_a}, \tag{12}$$

where $\beta_a = \sum_{s_{a-1} \le i, j < s_a} V_{ij}$ is the sum of all ranked cosine similarities in a segment and $\alpha_a = (s_a - s_{a-1})^2$ is the squared length of the segment. This score function is still decomposable, but requires that we define the local score function to return a pair,

$$\sigma(i, j) = \Big( \sum_{i \le l, m < j} V_{lm}, \; (j - i)^2 \Big), \tag{13}$$

with score aggregation function defined as component addition,

$$(\beta_1, \alpha_1) \oplus (\beta_2, \alpha_2) = (\beta_1 + \beta_2, \; \alpha_1 + \alpha_2), \tag{14}$$

and key function defined as division of the two components,

$$[\![ (\beta, \alpha) ]\!] = \frac{\beta}{\alpha}. \tag{15}$$

While earlier work with the C99 algorithm considered only a greedy splitting approach, in the experiments that follow we will use our more general framework to explore both optimal dynamic programming and refined iterative versions of C99. Follow-up work by Choi et al. [?] explored the effect of using combinations of LSA word vectors in eq. (10) in place of the $f_{iw}$. Below we will explore the effect of using combinations of word vectors to represent the elements.
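The pair-valued score, componentwise aggregation and ratio key can be sketched as follows. The function names and the block-structured toy matrix are assumptions of this example. In the toy matrix larger entries stand for more similar elements, so the block-aligned segmentation gives the larger ratio; whether an optimizer maximizes this ratio or minimizes its negative is a convention the sketch leaves open.

```python
from functools import reduce


def c99_score(ranked, i, j):
    """Pair (sum of ranked values inside the segment, squared segment length)."""
    beta = sum(ranked[l][m] for l in range(i, j) for m in range(i, j))
    alpha = (j - i) ** 2
    return (beta, alpha)


def c99_aggregate(x, y):
    """Componentwise addition of two (beta, alpha) pairs."""
    return (x[0] + y[0], x[1] + y[1])


def c99_key(pair):
    """Ratio of the two aggregated components."""
    beta, alpha = pair
    return beta / alpha


def c99_value(ranked, boundaries):
    """Key value of a segmentation given as exclusive end indices."""
    starts = [0] + list(boundaries[:-1])
    scores = [c99_score(ranked, i, j) for i, j in zip(starts, boundaries)]
    return c99_key(reduce(c99_aggregate, scores))


# Hypothetical ranked matrix with two clear blocks of two elements each.
ranked = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(c99_value(ranked, [2, 4]))  # 1.0
print(c99_value(ranked, [1, 4]))  # 0.6
```

Keeping the numerator and denominator separate until the key is applied is what makes the ratio objective decomposable over segments.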
To assess the utility of word vectors in segmentation, we first investigate how they can be used to improve the C99 algorithm, and then consider more general scoring functions based on our word vector representation. As the representation of an element, we take

$$V_{ik} = \sum_{w} f_{iw} \, v_{wk}, \tag{16}$$

with $f_{iw}$ representing the frequency of word $w$ in element $i$, and $v_{wk}$ representing the $k$th component of the word vector for word $w$ as learned by a word vector training algorithm, such as word2vec [?] or GloVe [?].

The length of word vectors varies strongly across the vocabulary and in general correlates with word frequency. In order to mitigate the effect of common words, we will sometimes weight the sum by the inverse document frequency (idf) of the word in the corpus:

$$V_{ik} = \sum_{w} f_{iw} \, \log\!\left(\frac{N_{\mathrm{doc}}}{\mathrm{df}_w}\right) v_{wk}, \tag{17}$$

where $\mathrm{df}_w$ is the number of documents in which word $w$ appears and $N_{\mathrm{doc}}$ is the total number of documents. We can instead normalize the word vectors before adding them together

$$V_{ik} = \sum_{w} f_{iw} \, \hat{v}_{wk}, \qquad \hat{v}_{wk} \equiv \frac{v_{wk}}{\sqrt{\sum_{k'} v_{wk'}^{2}}}, \tag{18}$$

or both weight by idf and normalize.
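The plain, idf-weighted and normalized element representations can be sketched with one helper. The function names, the two-dimensional toy vectors and the two-document toy corpus are assumptions of this example; out-of-vocabulary words are simply skipped, which is also an assumption.

```python
import math
from collections import Counter


def idf_weights(documents):
    """log(number of documents / document frequency) for each word."""
    n_docs = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}


def element_vector(words, vectors, idf=None, normalize=False):
    """Frequency-weighted sum of word vectors for one element of text."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, freq in Counter(words).items():
        if word not in vectors:
            continue  # skip words without a vector
        vec = vectors[word]
        if normalize:
            norm = math.sqrt(sum(x * x for x in vec))
            if norm:
                vec = [x / norm for x in vec]
        weight = freq * (idf.get(word, 0.0) if idf is not None else 1.0)
        for k in range(dim):
            total[k] += weight * vec[k]
    return total


# Hypothetical two-dimensional word vectors.
vectors = {"the": [3.0, 4.0], "cat": [0.0, 1.0]}
words = ["the", "cat", "the"]

print(element_vector(words, vectors))                  # [6.0, 9.0]
print(element_vector(words, vectors, normalize=True))  # about [1.2, 2.6]

idf = idf_weights([["the", "cat"], ["the", "dog"]])
print(element_vector(words, vectors, idf=idf))         # "the" gets weight 0
```

In the idf-weighted case the word that appears in every document contributes nothing, which is the intended damping of very common words.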
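Segmentation is a form of clustering, so a natural choice for scoring function is the sum of square deviations from the mean of the segment, as used in $k$-means:

$$\sigma(i, j) = \sum_{i \le l < j} \lVert V_l - \mu(i, j) \rVert^{2} \tag{19}$$

$$\phantom{\sigma(i, j)} = \sum_{i \le l < j} \sum_{k} \big( V_{lk} - \mu_k(i, j) \big)^{2}, \tag{20}$$

where $\mu_k(i, j) = \frac{1}{j - i} \sum_{i \le l < j} V_{lk}$, and which we call the Euclidean score function. Generally, however, cosine similarity is used for word vectors, making angles between words more important than distances. In some experiments, we therefore normalize the word vectors first, so that a Euclidean distance score better approximates the cosine distance (recall $\lVert v - w \rVert^{2} = 2(1 - v \cdot w)$ for normalized vectors).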
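A minimal sketch of the Euclidean score on a list of element vectors, together with a numerical check of the identity relating squared distance and cosine for unit vectors. The function name and the toy vectors are assumptions of this example.

```python
def euclidean_score(V, i, j):
    """Sum of squared deviations of the vectors V[i:j] from their mean."""
    segment = V[i:j]
    dim = len(segment[0])
    mean = [sum(v[k] for v in segment) / len(segment) for k in range(dim)]
    return sum((v[k] - mean[k]) ** 2 for v in segment for k in range(dim))


V = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
print(euclidean_score(V, 0, 2))  # 2.0

# For unit vectors, squared distance equals 2 * (1 - cosine similarity).
v = [1.0, 0.0]
w = [0.6, 0.8]
squared_distance = sum((a - b) ** 2 for a, b in zip(v, w))
cosine_similarity = sum(a * b for a, b in zip(v, w))
print(squared_distance, 2 * (1 - cosine_similarity))  # both about 0.8
```

A segment containing a single element always scores zero, so this objective is only meaningful when the number of segments is fixed in advance.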
Trained word vectors have a remarkable amount of structure. Analogy tasks such as man:woman::king:? can be solved by finding the vector closest to the linear query:

$$v_{\mathrm{woman}} - v_{\mathrm{man}} + v_{\mathrm{king}}. \tag{21}$$

Arora et al. [?] constructed a generative model of text that explains how this linear structure arises and can be maintained even in relatively low dimensional vector models. The generative model consists of a content vector $c$ which undergoes a random walk from a stationary distribution defined to be the product distribution on each of its components $c_k$, uniform on the interval $[-1/\sqrt{D}, 1/\sqrt{D}]$ (with $D$ the dimensionality of the word vectors). At each point in time, a word vector $w$ is generated by the content vector according to a log-linear model:

$$P(w \mid c) = \frac{1}{Z_c} \exp(w \cdot c), \qquad Z_c = \sum_{v} \exp(v \cdot c). \tag{22}$$

The slow drift of the content vectors helps to ensure that nearby words obey with high probability a log-linear model for their co-occurrence probability:

$$\log P(w, w') = \frac{\lVert w + w' \rVert^{2}}{2D} - 2 \log Z_0 \pm o(1), \tag{23}$$

for some fixed $Z_0$.
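The log-linear emission step of the generative model can be illustrated with a small sketch that turns dot products with a content vector into word probabilities. The three-word vocabulary, its vectors and the content vector are hypothetical values picked for readability and do not respect the component bounds of the model.

```python
import math


def emission_probabilities(vectors, content):
    """Probability of each word, proportional to exp(word vector . content)."""
    logits = {
        word: sum(x * y for x, y in zip(vec, content))
        for word, vec in vectors.items()
    }
    shift = max(logits.values())  # subtract the max for numerical stability
    weights = {word: math.exp(z - shift) for word, z in logits.items()}
    partition = sum(weights.values())
    return {word: weight / partition for word, weight in weights.items()}


vectors = {"star": [1.0, 0.0], "cell": [0.0, 1.0], "the": [0.1, 0.1]}
content = [2.0, 0.0]  # a content vector pointing along the first axis
probs = emission_probabilities(vectors, content)
print(max(probs, key=probs.get))  # star
```

Words whose vectors align with the current content vector are emitted most often, which is the property the segmentation score exploits.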
To segment text into coherent sections, we will boldly assume that the content vector in each putative segment is constant, and measure the log likelihood that all words in the segment are drawn from the same content vector $c$. (This is similar in spirit to the probabilistic segmentation technique proposed by Utiyama and Isahara [?].) Assuming the word draws $\{w_i\}$ are independent, we have that the log likelihood

$$\log P(\{w_i\} \mid c) = \sum_{i} \log P(w_i \mid c) \propto \sum_{i} w_i \cdot c \tag{24}$$

is proportional to the sum of the dot products of the word vectors $w_i$ with the content vector $c$. We use a maximum likelihood estimate for the content vector:

$$c = \arg\max_{c} \log P(\{w_i\} \mid c) \tag{25}$$

$$\phantom{c} = \arg\max_{c} \sum_{i} w_i \cdot c, \tag{26}$$

$$c_k = \frac{1}{\sqrt{D}} \operatorname{sign}\Big( \sum_{i} w_{ik} \Big). \tag{27}$$

This determines what we will call the Content Vector Segmentation (CVS) algorithm, based on the score function

$$\sigma(i, j) = \sum_{i \le l < j} \sum_{k} w_{lk} \, c_k(i, j). \tag{28}$$

The score for a segment is the sum of the dot products of the word vectors $w_l$ with the maximum likelihood content vector $c(i, j)$ for the segment, with components given by

$$c_k(i, j) = \frac{1}{\sqrt{D}} \operatorname{sign}\Big( \sum_{i \le l < j} w_{lk} \Big). \tag{29}$$

The maximum likelihood content vector thus has components $\pm 1/\sqrt{D}$, depending on whether the sum of the word vector components in the segment is positive or negative.
This score function will turn out to generate some of the most accurate segmentation results. Note that CVS is completely untrained with respect to the specific text to be segmented, relying only on a suitable set of word vectors, derived from some corpus in the language of choice. While CVS is most justifiable when working on the word vectors directly, we will also explore the effect of normalizing the word vectors before applying the objective.
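The content vector score can be sketched directly from its description: take the sign of each summed component to form the content vector, then sum the dot products. The function names and the toy two-dimensional word vectors are assumptions of this example. The sketch returns the log-likelihood-style score, for which larger means more coherent; feeding it to a splitter that minimizes a cost would require negating it, which is a sign convention assumed here rather than stated in the text.

```python
import math


def content_vector(W, i, j):
    """Maximum likelihood content vector for the word vectors W[i:j]."""
    dim = len(W[0])
    scale = 1.0 / math.sqrt(dim)
    sums = [sum(w[k] for w in W[i:j]) for k in range(dim)]
    # A zero sum leaves the sign undetermined; this sketch picks +.
    return [scale if s >= 0 else -scale for s in sums]


def cvs_score(W, i, j):
    """Sum of dot products of W[i:j] with the segment's content vector."""
    c = content_vector(W, i, j)
    return sum(w[k] * c[k] for w in W[i:j] for k in range(len(c)))


# Hypothetical word vectors: the first two and last two point in
# roughly opposite directions.
W = [[1.0, -1.0], [2.0, -1.0], [-1.0, 3.0], [-2.0, 1.0]]

split = cvs_score(W, 0, 2) + cvs_score(W, 2, 4)
whole = cvs_score(W, 0, 4)
print(round(split, 4), round(whole, 4))
# 8.4853 1.4142
```

Equivalently, the score is the sum of the absolute values of the summed components divided by the square root of the dimension, so it can be maintained with running sums.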
To explore the efficacy of different segmentation strategies and algorithms, we performed segmentation experiments on two datasets. The first is the Choi dataset [?], a common benchmark used in earlier segmentation work, and the second is a similarly constructed dataset based on articles uploaded to the arXiv, as will be described below. All code and data used for these experiments are available online at github.com/alexalemi/segmentation.
To evaluate the performance of our algorithms, we use two standard metrics: the $P_k$ metric and the WindowDiff (WD) metric. For text segmentation, near misses should get more credit than far misses. The $P_k$ metric [?] captures the probability for a probe composed of a pair of nearby elements (at constant distance positions $(i, i+k)$) to be placed in the same segment by both reference and hypothesized segmentations. In particular, the $P_k$ metric counts the number of disagreements on the probe elements:

$$P_k = \frac{1}{N-k} \sum_{i=0}^{N-k-1} \big[ \delta_{\mathrm{hyp}}(i, i+k) \ne \delta_{\mathrm{ref}}(i, i+k) \big], \tag{30}$$

where $\delta_{\mathrm{hyp}}(i, j)$ and $\delta_{\mathrm{ref}}(i, j)$ are equal to 1 or 0 according to whether or not elements $i$ and $j$ are in the same segment in the hypothesized and reference segmentations, resp., and the argument of the sum tests whether the hypothesis and reference segmentations disagree on the probe. ($k$ is taken to be one less than the integer closest to half of the number of elements divided by the number of segments in the reference segmentation.) The prefactor divides the total by the number of probes, $N - k$. This metric counts the number of disagreements, so lower scores indicate better agreement between the two segmentations. Trivial strategies such as choosing only a single segmentation, or giving each element its own segment, or giving constant boundaries or random boundaries, tend to produce values of around 50% [?].
The $P_k$ metric has the disadvantage that it penalizes false negatives more severely than false positives, and can suffer when the distribution of segment sizes varies. Pevzner and Hearst [?] introduced the WindowDiff (WD) metric:

$$\mathrm{WD} = \frac{1}{N-k} \sum_{i=0}^{N-k-1} \big[ b_{\mathrm{ref}}(i, i+k) \ne b_{\mathrm{hyp}}(i, i+k) \big], \tag{31}$$

where $b(i, j)$ counts the number of boundaries between location $i$ and $j$ in the text, and an error is registered if the hypothesis and reference segmentations disagree on the number of boundaries. In practice, the $P_k$ and WD scores are highly correlated, with $P_k$ more prevalent in the literature; we will provide both for most of the experiments here.
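Both metrics can be sketched on segmentations given as lists of exclusive end indices. The function names are hypothetical, and the lower bound of 1 on the default probe distance is a guard added for this example.

```python
def segment_labels(boundaries):
    """Segment index of every element, from exclusive end indices."""
    labels = []
    start = 0
    for index, end in enumerate(boundaries):
        labels.extend([index] * (end - start))
        start = end
    return labels


def default_k(reference):
    """One less than the integer closest to half the mean reference segment length."""
    n = reference[-1]
    nearest = int(n / (2 * len(reference)) + 0.5)
    return max(1, nearest - 1)


def pk(reference, hypothesis, k):
    """Fraction of probes (i, i+k) on which 'same segment?' answers differ."""
    ref = segment_labels(reference)
    hyp = segment_labels(hypothesis)
    probes = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(probes)
    )
    return errors / probes


def window_diff(reference, hypothesis, k):
    """Fraction of windows (i, i+k) on which the boundary counts differ."""
    ref = segment_labels(reference)
    hyp = segment_labels(hypothesis)
    probes = len(ref) - k
    # The number of boundaries inside a window is the change in segment index.
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i]) for i in range(probes)
    )
    return errors / probes


reference = [4, 8]
hypothesis = [5, 8]  # one boundary misplaced by a single element
print(pk(reference, hypothesis, k=2), window_diff(reference, hypothesis, k=2))
# both print 0.3333333333333333
```

With a single misplaced boundary the two metrics coincide, as here; they differ when a window contains more than one boundary in either segmentation.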
The Choi dataset is used to test whether a segmentation algorithm can distinguish natural topic boundaries. It concatenates the first $n$ sentences from ten different documents chosen at random from a 124 document subset of the Brown corpus (the ca**.pos and cj**.pos sets) [?]. The number of sentences $n$ taken from each document is chosen uniformly at random within a range specified by the subset id (i.e., as min–max #sentences). There are four ranges considered: (3–5, 6–8, 9–11, 3–11), the first three of which have 100 example documents, and the last 400 documents. The dataset can be obtained from an archived version of the C99 segmentation code release at http://web.archive.org/web/20010422042459/http://www.cs.man.ac.uk/~choif/software/C99-1.2-release.tgz (we thank Martin Riedl for pointing us to the dataset). An extract from one of the documents in the test set is shown in Fig. 1.
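The construction just described can be sketched as follows. The function name, the synthetic stand-in corpus and the fixed random seed are assumptions of this example; it does not read the Brown corpus files.

```python
import random


def make_choi_style_document(corpus, rng, min_sents, max_sents, num_segments=10):
    """Concatenate the first few sentences of randomly chosen documents.

    Returns the list of sentences and the exclusive end index of each segment.
    """
    chosen = rng.sample(corpus, num_segments)  # distinct source documents
    sentences = []
    boundaries = []
    for doc in chosen:
        count = rng.randint(min_sents, max_sents)  # inclusive range
        sentences.extend(doc[:count])
        boundaries.append(len(sentences))
    return sentences, boundaries


# Synthetic stand-in corpus: 124 documents of 20 numbered sentences each.
corpus = [[f"doc{d} sentence{s}" for s in range(20)] for d in range(124)]
rng = random.Random(0)
sentences, boundaries = make_choi_style_document(corpus, rng, 3, 11)
print(len(boundaries), boundaries[-1] == len(sentences))
# 10 True
```

Because every test document draws from the same small pool of source documents, the same sentences recur across test documents, a point that matters when comparing against methods trained on part of the dataset.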
We will explore the effect of changing the representation and splitting strategy of the C99 algorithm. In order to give fair comparisons we implemented our own version of the C99 algorithm (oC99). The C99 performance depended sensitively on the details of the text preprocessing. Details can be found in the appendix.
The first experiment explores the ability of word vectors to improve the performance of the C99 algorithm. The word vectors were learned by GloVe [?] on a 42 billion word set of the Common Crawl corpus in 300 dimensions (obtainable from http://www-nlp.stanford.edu/data/glove.42B.300d.txt.gz). We emphasize that these word vectors were not trained on the Brown or Choi datasets directly, and instead come from a general corpus of English. These vectors were chosen in order to isolate any improvement due to the word vectors from any confounding effects due to details of the training procedure. The results are summarized in Table 1 below. The upper section cites results from [?], which explored the utility of using LSA word vectors and showed an improvement of a few percent over their baseline C99 implementation. The middle section shows results from [?], which augmented the C99 method by representing each element with a histogram of topics learned from LDA. Our results are in the lower section, showing how word vectors improve the performance of the algorithm.
| Algorithm | $P_k$ 3–5 | $P_k$ 6–8 | $P_k$ 9–11 | $P_k$ 3–11 | WD 3–5 | WD 6–8 | WD 9–11 | WD 3–11 |
|---|---|---|---|---|---|---|---|---|
| C99 [?] | 12 | 11 | 9 | 9 | | | | |
| C99LSA | 9 | 10 | 7 | 5 | | | | |
| C99 [?] | | | | 11.20 | | | | 12.07 |
| C99LDA | | | | 4.16 | | | | 4.89 |
| oC99 | 14.22 | 12.20 | 11.59 | 15.56 | 14.22 | 12.22 | 11.60 | 15.64 |
| oC99tf | 12.14 | 13.17 | 14.60 | 14.91 | 12.14 | 13.34 | 15.22 | 15.22 |
| oC99tfidf | 10.27 | 12.23 | 15.87 | 14.78 | 10.27 | 12.30 | 16.29 | 14.96 |
| oC99k50 | 20.39 | 21.13 | 23.76 | 24.33 | 20.39 | 21.34 | 23.26 | 24.63 |
| oC99k200 | 18.60 | 17.37 | 19.42 | 20.85 | 18.60 | 17.42 | 19.60 | 20.97 |

Table 1: Effect of using word vectors in the C99 text segmentation algorithm. $P_k$ and WD results are shown (smaller values indicate better performance). The top section (C99 vs. C99LSA) shows the few percent improvement over the C99 baseline reported in [?] of using LSA to encode the words. The middle section (C99 vs. C99LDA) shows the effect of modifying the C99 algorithm to work on histograms of LDA topics in each sentence, from [?]. The bottom section shows the effect of using word vectors trained from GloVe [?] in our oC99 implementation of the C99 segmentation algorithm. The oC99tf implementation sums the word vectors in each sentence, with no rank transformation, after removing stop words and punctuation. oC99tfidf weights the sum by the log of the inverse document frequency of each word. The oC99k models use the word vectors to form a topic model by doing spherical $k$-means on the word vectors. oC99k50 uses 50 clusters and oC99k200 uses 200. In each of these last experiments, we turned off the rank transformation, pruned the stop words and punctuation, but did not stem the vocabulary.

Word vectors can be incorporated in a few natural ways. Vectors for each word in a sentence can simply be summed, giving results shown in the oC99tf row. But all words are not created equal, so the sentence representation might be dominated by the vectors for common words. In the oC99tfidf row, the word vectors are weighted by $\mathrm{idf}_w = \log(500/\mathrm{df}_w)$ (i.e., the log of the inverse document frequency of each word in the Brown corpus, which has 500 documents in total) before summation.
We see some improvement from using word vectors, for example the $P_k$ of 14.78% for the oC99tfidf method on the 3–11 set, compared to $P_k$ of 15.56% for our baseline C99 implementation. On the shorter 3–5 test set, our oC99tfidf method achieves $P_k$ of 10.27% versus the baseline oC99 $P_k$ of 14.22%. To compare to the various topic model based approaches, e.g. [?], we perform spherical $k$-means clustering on the word vectors [?] and represent each sentence as a histogram of its word clusters (i.e., as a vector in the space of clusters, with components equal to the number of its words in that cluster). In this case, the word topic representations (oC99k50 and oC99k200 in Table 1) do not perform as well as the C99 variants of [?]. But as was noted in [?], those topic models were trained on cross-validated subsets of the Choi dataset, and benefited from seeing virtually all of the sentences in the test sets already in each training set, so have an unfair advantage that would not necessarily convey to real-world applications. Overall, the results in Table 1 illustrate that the word vectors obtained from GloVe can markedly improve existing segmentation algorithms.
The use of word vectors permits consideration of natural scoring functions other than C99-style segmentation scoring. The second experiment examines alternative scoring frameworks using the same GloVe word vectors as in the previous experiment. To test the utility of the scoring functions more directly, for these experiments we used the optimal dynamic programming segmentation. Results are summarized in Table 2, which shows the average $P_k$ and WD scores on the 3–11 subset of the Choi dataset. In all cases, we removed stop words and punctuation, did not stem, but after preprocessing removed sentences with fewer than 5 words.
| Algorithm | rep | n | $P_k$ | WD |
|---|---|---|---|---|
| oC99 | tf | | 11.78 | 11.94 |
| oC99 | tfidf | | 12.19 | 12.27 |
| Euclidean | tf | F | 7.68 | 8.28 |
| Euclidean | tf | T | 9.18 | 10.83 |
| Euclidean | tfidf | F | 12.89 | 14.27 |
| Euclidean | tfidf | T | 8.32 | 8.95 |
| Content (CVS) | tf | F | 5.29 | 5.39 |
| Content (CVS) | tf | T | 5.42 | 5.55 |
| Content (CVS) | tfidf | F | 5.75 | 5.87 |
| Content (CVS) | tfidf | T | 5.03 | 5.12 |

Table 2: Results obtained by varying the scoring function. These runs were on the 3–11 set from the Choi database, with a word cut of 5 applied, after preprocessing to remove stop words and punctuation, but without stemming. The CVS method does remarkably better than either the C99 method or a Euclidean distance-based scoring function.

Note first that the dynamic programming result for our implementation of C99 with tf weights gives $P_k = 11.78\%$, 3% better than the greedy version result of 14.91% reported in Table 1. This demonstrates that the original C99 algorithm and its applications can benefit from a more exact minimization than given by the greedy approach. We considered two natural score functions: the Euclidean scoring function (eqn. (20)), which minimizes the sum of the square deviations of each vector in a segment from the average vector of the segment, and the Content Vector scoring (CVS) (eqn. (28)), which uses an approximate log posterior for the words in the segment, as determined from its maximum likelihood content vector. In each case, we consider vectors for each sentence generated both as a strict sum of the words comprising it (tf approach), and as a sum weighted by the log idf (tfidf approach, described above). Additionally, we consider the effect of normalizing the element vectors before starting the score minimization, as indicated by the n column.
The CVS score function eqn. (28) performs the best overall, with $P_k$ scores below 6%, indicating an improved segmentation performance using a score function adapted to the choice of representation. While the most principled score function would be the Content score function using tf weighted element vectors without normalization, the normalized tfidf scheme actually performs the best. This is probably due to the uncharacteristically large effect common words have on the element representation, which the log idf weights and the normalization help to mitigate.
Strictly speaking, the idf weighted schemes cannot claim to be completely untrained, as they benefit from word usage statistics in the Choi test set, but the raw CVS method still demonstrates a marked improvement on the 3–11 subset, with $P_k$ of 5.29% versus the optimal C99 baseline of 11.78%.
To explore the effect of the splitting strategy and to compare our overall results on the Choi test set against other published benchmarks, in our third experiment we ran the raw CVS method against all of the Choi test subsets, using all three splitting strategies discussed: greedy, refined, and dynamic programming. These results are summarized in Table 3.
| Alg | 3–5 | 6–8 | 9–11 | 3–11 |
|---|---|---|---|---|
| TT [?] | | | | |
| C99 [?] | | | | |
| C01 [?] | | | | |
| U00 [?] | | | | |
| F04 [?] | | | | |
| GCVS | | | | |
| RCVS | | | | |
| DPCVS | | | | |
| M09 [?] | | | | |
| R12 [?] | | | | |
| D13 [?] | | | | |

Table 3: Some published results on the Choi dataset against our raw CVS method. GCVS uses a greedy splitting strategy, RCVS uses up to 20 iterations to refine the results of the greedy strategy, and DPCVS shows the optimal results obtained by dynamic programming. We include the topic modeling results M09, R12, and D13 for reference, but for reasons detailed in the text do not regard them as comparable, due to their mingling of test and training samples. Overall, our method outperforms all previous untrained methods.

As commented regarding Table 1, we have included the results of the topic modeling based approaches M09 [?], R12 [?], and D13 [?] for reference. But due to the repeated appearance of the same sentences throughout each section of the Choi dataset, methods that split that dataset into test and training sets have unavoidable access to the entirety of the test set during training, albeit in different order. (In [?], it is observed that “This makes the Choi data set artificially easy for supervised approaches.” See the appendix.) These results can therefore only be compared to other algorithms permitted to make extensive use of the test data during cross-validation training. Only the TT, C99, U00 and raw CVS methods can be considered as completely untrained. The C01 method derives its LSA vectors from the Brown corpus, from which the Choi test set is constructed, but that provides only a weak benefit, and the F04 method is additionally trained on a subset of the test set to achieve its best performance, but its use only of idf values provides a similarly weak benefit.
We emphasize that the raw CVS method is completely independent of the Choi test set, using word vectors derived from a completely different corpus. In Fig. 2, we reproduce the relevant results from the last column of Table 3 to highlight the performance benefits provided by the semantic word embedding.

Figure 2: Results from last column of Table 3 reproduced to highlight the performance of the CVS segmentation algorithm compared to similar untrained algorithms. Its superior performance in an unsupervised setting suggests applications on documents “in the wild”.

Note also the surprising performance of the refined splitting strategy, with the RCVS results in Table 3 much lower than the greedy GCVS results, and moving close to the optimal DPCVS results, at far lower computational cost. In particular, taking the dynamic programming segmentation as the true segmentation, we can assess the performance of the refined strategy. As seen in Table 4, the refined segmentation very closely approximates the optimal segmentation.
This is important in practice since the dynamic programming segmentation is much slower, taking five times longer to compute on the 3–11 subset of the Choi test set. The dynamic programming segmentation becomes computationally infeasible to do at the scale of word level segmentation on the arXiv dataset considered in the next section, whereas the refined segmentation method remains eminently feasible.
| | 3–5 | 6–8 | 9–11 | 3–11 |
|---|---|---|---|---|
| RCVS vs DPCVS [?] | | | | |

Table 4: Treating the dynamic programming splits as the true answer, the error of the refined splits as measured in $P_k$ across the subsets of the Choi test set.

Performance evaluation on the Choi test set implements segmentation at the sentence level, i.e., with segments composed of sentences as the basic elements. But text sources do not necessarily have well-marked sentence boundaries. The arXiv is a repository of scientific articles which for practical reasons extracts text from PDF documents (typically using pdfminer/pdf2txt.py). That PostScript-based format was originally intended only as a means of formatting text on a page, rather than as a network transmission format encoding syntactic or semantic information. The result is often somewhat corrupted, either due to the handling of mathematical notation, the presence of footers and headers, or even just font encoding issues.
To test the segmentation algorithms in a realistic setting, we created a test set similar to the Choi test set, but based on text extracted from PDFs retrieved from the arXiv database. Each test document is composed of a random number of contiguous words, uniformly chosen between 100 and 300, sampled at random from the text obtained from arXiv articles. The text was preprocessed by lowercasing and inserting spaces around every non-alphanumeric character, then splitting on whitespace to tokenize. An example of two of the segments of the first test document is shown in Figure Text Segmentation based on Semantic Word Embeddings below.
1. nature_414 : 441  443 . 12 seinen , i . and schram a . 2006 . social_status and group norms : indirect_reciprocity in a helping experiment . european_economic_review 50 : 581  602 . silva , e . r . , jaffe , k . 2002 . expanded food choice as a possible factor in the evolution of eusociality in vespidae sociobiology 39 : 25  36 . smith , j . , van dyken , j . d . , zeejune , p . c . 2010 . a generalization of hamilton ’ s rule for the evolution of microbial cooperation science_328 , 1700  1703 . zhang , j . , wang , j . , sun , s . , wang , l . , wang , z . , xia , c . 2012 . effect of growing size of interaction neighbors on the evolution of cooperation in spatial snowdrift_game . chinese_science bulletin 57 : 724  728 . zimmerman , m . , egu ‘i luz , v . , san_miguel ,
2. of ) e , equipped_with the topology of weak_convergence . we will state some results about random measures . 10 definition a . 1 ( first two moment measures ) . for a random_variable z , taking values in p ( e ) , and k = 1 , 2 , . . . , there is a uniquely_determined measure ( k ) on b ( ek ) such that e [ z ( a1 ) _$\cdot$_$\cdot$ z ( ak ) ] = ( k ) ( a1 _$\cdot$_$\cdot$ ak ) for a1 , . . . , ak b ( e ) . this is called the kth_moment measure . equivalently , ( k ) is the unique measure such that e [ hz , 1i _$\cdot$_$\cdot$ hz , ki ] = h ( k ) , 1 _$\cdot$_$\cdot$ ki , where h . , . i denotes integration . lemma a . 2 ( characterisation of deterministic random measures ) . let z be a random_variable_taking values in p ( e ) with the first two moment measures : = ( 1 ) and ( 2 ) . then the following_assertions_are_equivalent : 1 . there is p ( e ) with z = , almost_surely . 2 . the second_moment measure has product  form , i . e . ( 2 ) = ( which is equivalent to e [ hz , 1i hz , 2i ] = h , 1i h , 2i ( this is in fact equivalent to e [ hz , i2 ]
Figure \thefigure: Example of two of the segments from a document in the arXiv test set.
This is a much more difficult segmentation task: due to the presence of numbers and many periods in references, there are no clear sentence boundaries on which to initially group the text, and no natural boundaries are suggested in the test set examples. Here segmentation algorithms must work directly at the “word” level, where word can mean a punctuation mark. The presence of garbled mathematical formulae adds to the difficulty of making sense of certain streams of text.
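The preprocessing described above for the arXiv test set (lowercasing, padding every non-alphanumeric character with spaces, then splitting on whitespace) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `tokenize` and `sample_segment` are hypothetical, and the sampling helper assumes one plausible reading of the text, namely a contiguous run of 100 to 300 tokens drawn uniformly from a longer token stream.

```python
import random


def tokenize(text):
    """Lowercase, pad each non-alphanumeric character with spaces, split on whitespace."""
    padded = []
    for ch in text.lower():
        if ch.isalnum() or ch.isspace():
            padded.append(ch)
        else:
            # Punctuation and symbols become standalone "words".
            padded.append(" " + ch + " ")
    return "".join(padded).split()


def sample_segment(tokens, rng, lo=100, hi=300):
    """Assumed sampling scheme: a contiguous run of lo..hi tokens from a random offset."""
    length = rng.randint(lo, hi)  # inclusive on both ends
    start = rng.randrange(max(1, len(tokens) - length + 1))
    return tokens[start:start + length]


print(tokenize("Hamilton's rule (2010): 1700-1703."))
```

Under this scheme every punctuation mark is its own token, which is why the segmentation in this section operates on "words" that may be periods, commas, or brackets.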
In Table Text Segmentation based on Semantic Word Embeddings, we summarize the results of three word vector powered approaches, comparing a C99 style algorithm to our content vector based methods, both for unnormalized and normalized word vectors. Since much of the language of the scientific articles is specialized, the word vectors used in this case were obtained from GloVe trained on a corpus of similarly preprocessed texts from 98,392 arXiv articles. (Since the elements are now words rather than sentences, the only issue involves whether or not those word vectors are normalized.) As mentioned, the dynamic programming approach is prohibitively expensive for this dataset.
Alg   S  $P_k$  WD
oC99  G
oC99  R
CVS   G
CVS   R
CVSn  G
CVSn  R
Table \thetable: Results on the arXiv test set for the C99 method using word vectors (oC99), our CVS method, and CVS method with normalized word vectors (CVSn). The $P_k$ and WD metrics are given for both the greedy (G) and refined splitting strategies (R), with respect to the reference segmentation in the test set. The refined strategy was allowed up to 20 iterations to converge. The refinement converged for all of the CVS runs, but failed to converge for some documents in the test set under the C99 method. Refinement improved performance in all cases, and our CVS methods improve significantly over the C99 method for this task.
We see that the CVS method performs far better on the test set than the C99-style segmentation using word vectors. The $P_k$ and WD values obtained are not as impressive as those obtained on the Choi test set, but this test set offers a much more challenging segmentation task: it requires the methods to work at the level of words, and also includes the possibility that natural topic boundaries occur in the test set segments themselves. The segmentations obtained with the CVS method typically appear sensibly split on section boundaries, references and similar formatting boundaries, not known in advance to the algorithm.
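For readers unfamiliar with the two error measures reported here, the following is a minimal sketch of the $P_k$ and WindowDiff (WD) computations in their commonly used form. It is an illustration under stated assumptions rather than the evaluation code used for the tables: segmentations are represented as per-element segment labels, the window size defaults to half the mean reference segment length using integer division, and the helper names are hypothetical. Published implementations differ in rounding and edge conventions.

```python
def labels_from_lengths(lengths):
    """Turn segment lengths, e.g. [5, 5], into per-element labels [0,0,0,0,0,1,1,1,1,1]."""
    labels = []
    for seg_id, length in enumerate(lengths):
        labels.extend([seg_id] * length)
    return labels


def default_window(ref):
    """Assumed convention: half the mean reference segment length (integer division)."""
    n_segments = ref[-1] + 1
    return max(1, len(ref) // (2 * n_segments))


def pk(ref, hyp, k=None):
    """Fraction of windows where ref and hyp disagree on 'same segment or not'."""
    k = default_window(ref) if k is None else k
    n = len(ref)
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(n - k))
    return errors / (n - k)


def window_diff(ref, hyp, k=None):
    """Fraction of windows where ref and hyp contain different numbers of boundaries."""
    k = default_window(ref) if k is None else k
    n = len(ref)
    # Labels increase by one per segment, so a label difference counts boundaries.
    errors = sum((ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i]) for i in range(n - k))
    return errors / (n - k)


ref = labels_from_lengths([5, 5])
hyp = labels_from_lengths([10])      # hypothesis that misses the only boundary
print(pk(ref, hyp), window_diff(ref, hyp))
```

With these definitions and a shared window size, any window counted as an error by `pk` is also counted by `window_diff`, so WD is never smaller than $P_k$.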
Figure \thefigure: Effect of applying our segmentation algorithm to this paper with 40 segments. The segments are denoted with alternating color overlays.
As a final illustration of the effectiveness of our algorithm at segmenting scientific articles, we’ve applied the best performing algorithm to this article. Fig. Text Segmentation based on Semantic Word Embeddings shows how the algorithm segments the text roughly along section borders.
We have presented a general framework for describing and developing segmentation algorithms, and compared some existing and new strategies for representation, scoring and splitting. We have demonstrated the utility of semantic word embeddings for segmentation, both in existing algorithms and in new segmentation algorithms. On a real world segmentation task at word level, we’ve demonstrated the ability to generate useful segmentations of scientific articles. In future work, we plan to use this segmentation technique to facilitate retrieval of documents with segments of concentrated content, and to identify documents with localized sections of similar content.
This work was supported by NSF IIS-1247696. We thank James P. Sethna for useful discussions and for feedback on the manuscript.
 [1] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. 2015, arXiv:1502.03520.
 [2] D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text segmentation. Machine Learning, 34(1–3):177–210, 1999.
 [3] F. Y. Choi. Advances in domain independent linear text segmentation. In Proc. of the 1st North American chapter of the Association for Computational Linguistics conference, pages 26–33. Association for Computational Linguistics, 2000, arXiv:cs/0003083.
 [4] F. Y. Choi, P. Wiemer-Hastings, and J. Moore. Latent semantic analysis for text segmentation. In Proceedings of EMNLP. Citeseer, 2001.
 [5] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561–580. Springer, 2012.
 [6] B. Dadachev, A. Balinsky, and H. Balinsky. On automatic text segmentation. In Proceedings of the 2014 ACM symposium on Document engineering, pages 73–80. ACM, 2014.
 [7] L. Du, W. L. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In HLT-NAACL, pages 190–200. Citeseer, 2013.
 [8] S. T. Dumais. Latent semantic analysis. Ann. Rev. of Information Sci. and Tech., 38(1):188–230, 2004.
 [9] J. Eisenstein and R. Barzilay. Bayesian unsupervised topic segmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 334–343. Association for Computational Linguistics, 2008.
 [10] P. Fragkou, V. Petridis, and A. Kehagias. A dynamic programming algorithm for linear text segmentation. Journal of Intelligent Information Systems, 23(2):179–197, 2004.
 [11] M. A. Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64, 1997.
 [12] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
 [13] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
 [14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [15] H. Misra, F. Yvon, J. M. Jose, and O. Cappe. Text segmentation via topic modeling: an analytical study. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1553–1556. ACM, 2009.
 [16] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.
 [17] L. Pevzner and M. A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Comp. Ling., 28(1):19–36, 2002.
 [18] M. Riedl and C. Biemann. Text segmentation with topic models. Journal for Language Technology and Computational Linguistics, 27(1):47–69, 2012.
 [19] M. Sakahara, S. Okada, and K. Nitta. Domain-independent unsupervised text segmentation for data management. In Data Mining Workshop (ICDMW), 2014 IEEE International Conference on, pages 481–487. IEEE, 2014.
 [20] E. Terzi et al. Problems and algorithms for sequence segmentations. 2006.
 [21] M. Utiyama and H. Isahara. A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 499–506. Association for Computational Linguistics, 2001.
APPENDIX
This set of experiments compares against the results reported in [?]. We implemented our own version of the C99 algorithm (oC99) and tested it on the Choi dataset. We explored the effect of various changes to the representation part of the algorithm, namely the effects of removing stop words, cutting small sentence sizes, stemming the words, and performing the rank transformation on the cosine similarity matrix. For stemming, the implementation of the Porter stemming algorithm from nltk was used. For stop words, we used the list distributed with the C99 code augmented by a list of punctuation marks. The results are summarized in Table Text Segmentation based on Semantic Word Embeddings.
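As a reminder of what the rank transformation does, here is a minimal pure-Python sketch in the spirit of the C99 ranking step: each entry of the similarity matrix is replaced by the fraction of its neighbours, within a square mask, that have a strictly lower value. The details are assumptions made for illustration and may differ from both the original C99 code and oC99: the mask is clipped at the matrix edges, the centre cell is excluded from the count, and the function name `rank_transform` is hypothetical.

```python
def rank_transform(sim, size=11):
    """Replace each similarity by the fraction of mask neighbours with a lower value.

    sim  : square matrix (list of lists) of pairwise similarities
    size : side length of the square mask (the 'rank' column of the table)
    """
    n = len(sim)
    half = size // 2
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = 0
            examined = 0
            # Mask is clipped at the matrix boundary (assumed convention).
            for a in range(max(0, i - half), min(n, i + half + 1)):
                for b in range(max(0, j - half), min(n, j + half + 1)):
                    if a == i and b == j:
                        continue  # the centre cell is not its own neighbour
                    examined += 1
                    if sim[a][b] < sim[i][j]:
                        lower += 1
            ranked[i][j] = lower / examined if examined else 0.0
    return ranked


toy = [[1.0, 0.2],
       [0.2, 1.0]]
print(rank_transform(toy, size=3))
```

Because the transform depends only on the ordering of values inside each mask, it discards the absolute scale of the cosine similarities, which is one reason small preprocessing differences can shift results.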
While we reproduce the results reported in [?] without the rank transformation (C99 in Table Text Segmentation based on Semantic Word Embeddings), our results with the rank transformation (last two lines for oC99) show better performance without stemming. This is likely due to details of the text transformations, such as the precise stemming algorithm and the stopword list. We attempted to match the choices made in [?] as much as possible, but still observed some deviations.
Perhaps the most telling deviation is the 1.5% swing in results for the last two rows, whose only difference was a change in the tie-breaking behavior of the algorithm. In our best result, we minimized the objective at each stage, so ties were broken at the earliest position in the text, whereas for the TBR row, we maximized the negative of the objective, so ties were broken at the rightmost equal value.
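The two tie-breaking conventions can be illustrated with a small sketch. It is not taken from the oC99 implementation: the function names are hypothetical, and the rightmost rule is written out explicitly as a key function rather than relying on how any particular library resolves ties.

```python
def leftmost_argmin(values):
    """Index of the minimum; Python's min() keeps the first of several equal minima."""
    return min(range(len(values)), key=lambda i: values[i])


def rightmost_argmin(values):
    """Maximize the negated objective, resolving ties toward the larger index."""
    return max(range(len(values)), key=lambda i: (-values[i], i))


# Two candidate cut positions (indices 1 and 3) share the best objective value.
costs = [3.0, 1.0, 2.0, 1.0, 5.0]
print(leftmost_argmin(costs), rightmost_argmin(costs))
```

Both functions return an optimal cut, yet they choose different positions whenever the objective has exact ties, and in a greedy procedure each choice changes every subsequent cut.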
These relatively large swings in the performance on the Choi dataset suggest that it is most appropriate to compare differences in parameter settings for a particular implementation of an algorithm. Comparing results between different articles to assess performance improvements due to algorithmic changes hence requires careful attention to the implementation details.
note     cut  stop  stem  rank  $P_k$ (%)  WD (%)
C99 [?]  0    T     F     0     23         -
         0    T     F     11    13         -
         0    T     T     11    12         -
oC99     0    T     F     0     22.52      22.52
         0    T     F     11    16.69      16.72
         0    T     T     11    17.90      19.96
Reps     0    F     F     0     32.26      32.28
         5    F     F     0     32.73      32.76
         0    T     F     0     22.52      22.52
         0    F     T     0     32.26      32.28
         0    T     T     0     23.33      23.33
         5    T     T     0     23.56      23.59
         5    T     T     3     18.17      18.30
         5    T     T     5     17.44      17.56
         5    T     T     7     16.95      17.05
         5    T     T     9     17.12      17.20
         5    T     T     11    17.07      17.14
         5    T     T     13    17.11      17.19
TBR      5    T     F     11    17.04      17.12
Best     5    T     F     11    15.56      15.64
Table \thetable: Effects of text representation on the performance of the C99 algorithm. The cut column denotes the cutoff for the length of a sentence after preprocessing. The stop column denotes whether stop words and punctuation are removed. The stem column denotes whether the words are passed through the Porter stemming algorithm. The rank column denotes the size of the kernel for the ranking transformation. Evaluations are given both as the $P_k$ metric and the Window Diff (WD) score. All experiments are done on the 400 test documents in the 3–11 set of the Choi dataset. The upper section cites results contained in the CWM 2000 paper [?]. The second section is an attempt to match these results with our implementation (oC99). The third section attempts to give an overview of the effect of different parameter choices for the representation step of the algorithm. The last section reports our best observed result as well as a run (TBR) with the same parameter settings, but with a tie-breaking strategy that takes the rightmost rather than the leftmost equal value.
Recall from sec. Text Segmentation based on Semantic Word Embeddings that each sample document in the Choi dataset is composed of 10 segments, and each such segment is the first $n$ sentences from one of a 124-document subset of the Brown corpus (the ca**.pos and cj**.pos sets). This means that each of the four Choi test sets (3–5, 6–8, 9–11, 3–11) necessarily contains multiple repetitions of each sentence.
In the 3–5 Choi set, for example, there are 3986 sentences, but only 608 unique sentences, so that each sentence appears on average 6.6 times. In the 3–11 set, with 400 sample documents, there are 28,145 sentences, but only 1353 unique sentences, for an average of 20.8 appearances for each sentence. Furthermore, in all cases there are only 124 unique sentences that can begin a new segment. This redundancy means that a trained method such as LDA will see most or all of the test data during training, and can easily overfit to the observed segmentation boundaries, especially when the number of topics is not much smaller than the number of documents. For example, using standard 10-fold cross validation on an algorithm that simply identifies a segment boundary for any sentence in the test set that began a document in the training set gives better than 99.9% accuracy in segmenting all four parts of the Choi dataset. For this reason, we have not compared to the topic-modeling based segmentation results in Tables Text Segmentation based on Semantic Word Embeddings and Text Segmentation based on Semantic Word Embeddings.
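The memorization baseline described above is simple enough to sketch in a few lines. The following is an illustrative reconstruction, not the code used to obtain the 99.9% figure: the data layout (documents as lists of segments, segments as lists of sentence strings) and the function names are assumptions, and the toy sentences are invented for the example. The last two lines only recompute the repetition averages quoted in the text.

```python
def memorize_starts(training_docs):
    """Collect every sentence that opens a segment in the training documents."""
    return {segment[0] for doc in training_docs for segment in doc if segment}


def predict_starts(sentences, known_starts):
    """Predict a new segment wherever a memorized opening sentence reappears."""
    return [i for i, s in enumerate(sentences) if i == 0 or s in known_starts]


# Toy data: one training document with two segments drawn from sources "a" and "b".
train = [[["a1", "a2", "a3"], ["b1", "b2"]]]
# A test document reuses the same openings, with different segment lengths.
test_sentences = ["b1", "b2", "b3", "a1", "a2"]
print(predict_starts(test_sentences, memorize_starts(train)))

# Average number of appearances per unique sentence, from the counts in the text.
print(round(3986 / 608, 1), round(28145 / 1353, 1))
```

Since only 124 distinct sentences can open a segment, a lookup table of this kind covers nearly every boundary after seeing a modest number of training documents, without modeling topic at all.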
