To extract numerical feature vectors from a sequence of symbols, scikit-learn provides utilities for the most common approaches.
Also have a look at the text feature extraction section in the user guide and the working with text data section.
Individual samples are assumed to be files stored in a two-level folder structure such as the following:
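(The folder and file names below are only placeholders; each subfolder of the container folder is treated as one category.)

container_folder/
    category_1_folder/
        file_1.txt
        file_2.txt
        ...
    category_2_folder/
        file_3.txt
        ...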
The load_files function needs the following parameters: the container path (the root folder) and, optionally, a list of categories (the names of the subfolders to load) and an encoding for reading the text files.
from sklearn.datasets import load_files
# corpus-4docs has no category subfolders, so use the parent directory as root and restrict loading to the 'corpus-4docs' folder
corpus_4_docs = load_files('DataSetEx6', categories=['corpus-4docs'], encoding='utf-8')
#corpus_30_docs = load_files('DataSetEx6/corpus-30docs',encoding='utf-8')
for text in corpus_4_docs.data:
print(text[:30])
print(corpus_4_docs.target)
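Because all four documents live directly in the 'corpus-4docs' folder, they all get the same target value 0. A quick check (these are the standard fields of the Bunch object returned by load_files):
print(corpus_4_docs.target_names)  # the category names, here just ['corpus-4docs']
print(len(corpus_4_docs.data))     # number of loaded documents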
from sklearn.feature_extraction.text import CountVectorizer
d1 = "Saturn is the gas planet with rings."
d2 = "Jupiter is the largest gas planet."
d3 = "Saturn is the Roman god of sowing."
docs = [d1, d2, d3]
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(docs) #corpus_4_docs.data)
print(count_matrix.toarray())
for feature_name in count_vectorizer.get_feature_names():
print(feature_name)
print(len(count_vectorizer.get_feature_names()))
def get_word_freq(matrix, vectorizer):
    '''Return (total frequency, word) pairs, sorted in descending order of frequency.'''
    return sorted([(matrix.getcol(idx).sum(), word) for word, idx in vectorizer.vocabulary_.items()], reverse=True)
for freq, word in get_word_freq(count_matrix, count_vectorizer)[:40]:
    print("{:.3f} {}".format(freq, word))
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_matrix = tf_idf_vectorizer.fit_transform(docs) #corpus_4_docs.data)
print(tf_idf_matrix.toarray())
for feature_name in tf_idf_vectorizer.get_feature_names()[:20]:
print(feature_name)
vectorizer_with_stopwords = TfidfVectorizer(stop_words='english')
tf_idf_stop_matrix = vectorizer_with_stopwords.fit_transform(corpus_4_docs.data)
for tfidf, word in get_word_freq(tf_idf_stop_matrix, vectorizer_with_stopwords)[:40]:
print("{:.3f} {}".format(tfidf, word))
from nltk.stem.porter import PorterStemmer # use the stemmer from nltk
import re
class TokenizerWithStemming(object):
def __init__(self):
self.stemmer = PorterStemmer()
self.token_pattern = re.compile(r"(?u)\b\w\w+\b")
def __call__(self, doc):
# tokenize the input with a regex and stem each token
return [self.stemmer.stem(t) for t in self.token_pattern.findall(doc)]
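A quick sanity check of the tokenizer on a made-up sentence (the exact stems depend on the NLTK version, but plurals and -ing forms should be reduced to their stems):
stem_tokenizer = TokenizerWithStemming()
print(stem_tokenizer("The planets and moons are orbiting"))  # e.g. ['the', 'planet', 'and', 'moon', 'are', 'orbit']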
stem_vectorizer = TfidfVectorizer(tokenizer=TokenizerWithStemming())
stem_matrix = stem_vectorizer.fit_transform(corpus_4_docs.data)
for tfidf, word in get_word_freq(stem_matrix, stem_vectorizer)[:40]:
print("{:.3f} {}".format(tfidf, word))
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(stem_matrix)
# One can also use a star to import all pairwise metrics
from sklearn.metrics.pairwise import *
linear_kernel(stem_matrix)
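Because TfidfVectorizer normalizes each document vector to unit length by default (norm='l2'), the linear kernel of this matrix coincides with its cosine similarity; a small check (numpy is imported here only for the comparison):
import numpy as np
print(np.allclose(cosine_similarity(stem_matrix), linear_kernel(stem_matrix)))  # expected: True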
prune_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.3)  # relative thresholds: fractions of the number of documents
prune_vectorizer = TfidfVectorizer(min_df=5, max_df=20)  # absolute document counts
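Note that neither prune_vectorizer is fitted above. A minimal sketch on the small three-sentence corpus from earlier (min_df=2 is chosen here just for illustration and is not part of the exercise):
pruned_vectorizer = TfidfVectorizer(min_df=2)
pruned_matrix = pruned_vectorizer.fit_transform(docs)
print(sorted(pruned_vectorizer.vocabulary_))  # only terms occurring in at least 2 of the 3 documents, e.g. ['gas', 'is', 'planet', 'saturn', 'the']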
corpus_30_docs = load_files('DataSetEx6/corpus-30docs', encoding='utf-8')
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_matrix = tf_idf_vectorizer.fit_transform(corpus_30_docs.data)
from sklearn.cluster import KMeans
k_means_estimator = KMeans(n_clusters=2)
labels = k_means_estimator.fit_predict(tf_idf_matrix)
print(labels)
print(corpus_30_docs.target)
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
labels_pred_2 = [1, 1, 0, 0, 3, 3]  # permute 0 and 1 and rename 2 to 3
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.adjusted_rand_score(labels_true, labels_pred_2))
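The same metric can be used to compare the k-means clustering from above with the known categories of corpus-30docs (the exact score varies between runs because of the random initialization of k-means, and a perfect score is not expected with only 2 clusters for 3 categories):
print(metrics.adjusted_rand_score(corpus_30_docs.target, labels))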
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option('display.max_rows', 500)
text_zero_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 0]
text_one_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 1]
text_two_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 2]
count_vectorizer = CountVectorizer(tokenizer=TokenizerWithStemming(), stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus_30_docs.data)
rows = []
for word, idx in count_vectorizer.vocabulary_.items():
rows.append((word,
(count_matrix.getcol(idx) > 0).sum(),
(count_matrix.getcol(idx)[text_zero_indices] > 0).sum(),
(count_matrix.getcol(idx)[text_one_indices] > 0).sum(),
(count_matrix.getcol(idx)[text_two_indices] > 0).sum()))
document_freqs = pd.DataFrame(rows, columns = ['word', 'All', corpus_30_docs.target_names[0], corpus_30_docs.target_names[1], corpus_30_docs.target_names[2]])
document_freqs.sort_values('All', ascending=False)