Text Mining

  • Tokenisation
    • Break text into single words or n-grams
    • “example text”
      • (“example”, “text”)
      • (“exam”, “xamp”, “ampl”, “mple”, “ple ”, “le t”, “e te”, “ tex”, “text”) (character 4-grams)
  • Stopword Removal
    • Remove frequent words that may confuse your algorithm
    • “this is an example” -> “example”
  • Stemming
    • Finding the root/stem of a word helps matching similar words
    • “user”, “users”, “used”, “using” -> “use”
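
A minimal sketch of these three steps on a made-up sentence, using a simple regular expression for tokenisation, scikit-learn's built-in English stop word list and NLTK's PorterStemmer (the same stemmer that is used further below):

In [ ]:
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem.porter import PorterStemmer

text = "this is an example of users using the system"

# tokenisation: break the text into single words with a simple regex
tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
print(tokens)

# stopword removal: drop very frequent function words
content_tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(content_tokens)

# stemming: map each remaining token to its stem, e.g. 'users' -> 'user'
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content_tokens])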

Text Mining (scikit-learn)

In order to extract numerical feature vectors from a sequence of symbols, scikit-learn provides utilities for the most common approaches:

  • tokenizing strings and giving an integer id for each possible token.
  • counting the occurrences of tokens.
  • normalizing and weighting tokens.

Also have a look at the text feature extraction section in the user guide and the working with text data section.
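
As a quick illustration of these three steps (the two-document corpus below is made up): CountVectorizer performs the tokenizing and counting and stores the token-to-integer mapping in its vocabulary_ attribute, while TfidfTransformer takes care of the normalizing and weighting.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

tiny_corpus = ["the cat sat", "the cat sat on the mat"]  # made-up example corpus

count_vect = CountVectorizer()
counts = count_vect.fit_transform(tiny_corpus)    # tokenizing + counting
print(count_vect.vocabulary_)                     # token -> integer id
print(counts.toarray())                           # occurrence counts per document

tfidf = TfidfTransformer().fit_transform(counts)  # normalizing and weighting
print(tfidf.toarray())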

Loading files

Individual samples are assumed to be files stored in a two-level folder structure such as the following:

  • container_folder/
    • category_1_folder/
      • file_1.txt file_2.txt … file_42.txt
    • category_2_folder/
      • file_43.txt file_44.txt …

The load_files function needs the following parameters:

  • container_path
  • categories=['category_1_folder', ....]
  • encoding='utf-8'
In [1]:
from sklearn.datasets import load_files

# corpus-4docs has no categories -> thus use the parent directory as root and restrict it to the 'corpus-4docs' folder
corpus_4_docs = load_files('DataSetEx6', categories=['corpus-4docs'], encoding='utf-8') 

#corpus_30_docs = load_files('DataSetEx6/corpus-30docs',encoding='utf-8')

for text in corpus_4_docs.data:
    print(text[:30])
print(corpus_4_docs.target)
An Occupation for the 99 Per C
Málaga vs. Real Madrid Barcelo
Real Madrid Slips Into First W
David Cameron Joins Talks On E
[0 0 0 0]

Feature Generation from Text

  • Documents are treated as bags of words (tokens)
    • Each token becomes a feature
    • The order of tokens is ignored
  • Different techniques to determine feature values (feature vector creation)
    • Binary Term Occurrence: 1 if the token is present, 0 otherwise
      • CountVectorizer(binary=True)
    • Term Occurrence: Absolute frequency of the token, e.g., 5
      • CountVectorizer()
    • Term Frequency: Relative frequency of the token, e.g., 5%
      • term frequency adjusted for document length (not directly implemented; see the sketch after this list)
    • Term Frequency – Inverse Document Frequency (TF-IDF):
      • TfidfVectorizer()
      • More weight if the token is rare
      • Less weight if the token is frequent
        $$d_{i,j} = \frac{TF(d_i, t_j)}{DF(t_j)} = TF(d_i, t_j) \cdot IDF(t_j)$$
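
The different weighting schemes can be compared on a made-up mini corpus, as in the sketch below. Term Frequency is not available as a separate vectorizer, but it can be emulated by switching off the idf part and using L1 normalisation, so that every row sums to one.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

mini_docs = ["saturn saturn gas", "jupiter gas"]  # made-up mini corpus

# Binary Term Occurrence: 1 if the token is present, 0 otherwise
print(CountVectorizer(binary=True).fit_transform(mini_docs).toarray())

# Term Occurrence: absolute counts
print(CountVectorizer().fit_transform(mini_docs).toarray())

# Term Frequency: counts divided by the document length (L1 norm, no idf)
print(TfidfVectorizer(use_idf=False, norm='l1').fit_transform(mini_docs).toarray())

# TF-IDF: additionally down-weights tokens that occur in many documents
print(TfidfVectorizer().fit_transform(mini_docs).toarray())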

Vectorizer

Feature Generation Examples – (Binary) Term Occurrences

  • Sample document set:
    • d1 = “Saturn is the gas planet with rings.”
    • d2 = “Jupiter is the largest gas planet.”
    • d3 = “Saturn is the Roman god of sowing.”
  • Documents as vectors: [figure: document-term matrix with binary term occurrences]

Feature Generation Examples – Term Frequency

[figure: document-term matrix with relative term frequencies]

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

d1 = "Saturn is the gas planet with rings."
d2 = "Jupiter is the largest gas planet."
d3 = "Saturn is the Roman god of sowing."
docs = [d1, d2, d3]

count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(docs) #corpus_4_docs.data)

print(count_matrix.toarray())
for feature_name in count_vectorizer.get_feature_names():
    print(feature_name)
[[1 0 1 0 0 0 1 1 0 1 0 1 1]
 [1 0 1 1 1 0 1 0 0 0 0 1 0]
 [0 1 1 0 0 1 0 0 1 1 1 1 0]]
gas
god
is
jupiter
largest
of
planet
rings
roman
saturn
sowing
the
with

Number of features/attributes

In [15]:
print(len(count_vectorizer.get_feature_names()))
13

List the words together with the frequencies

In [3]:
def get_word_freq(matrix, vectorizer):
    '''Function for generating a list of (freq, word)'''
    return sorted([(matrix.getcol(idx).sum(), word) for word, idx in vectorizer.vocabulary_.items()], reverse=True)
In [4]:
for freq, word in get_word_freq(count_matrix, count_vectorizer)[:40]:
    print("{:.3f} {}".format(freq, word))
3.000 the
3.000 is
2.000 saturn
2.000 planet
2.000 gas
1.000 with
1.000 sowing
1.000 roman
1.000 rings
1.000 of
1.000 largest
1.000 jupiter
1.000 god

Using a different vectorizer

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer()
tf_idf_matrix = tf_idf_vectorizer.fit_transform(docs) #corpus_4_docs.data)

print(tf_idf_matrix.toarray())

for feature_name in tf_idf_vectorizer.get_feature_names()[:20]:
    print(feature_name)
[[ 0.3612204   0.          0.28051986  0.          0.          0.
   0.3612204   0.47496141  0.          0.3612204   0.          0.28051986
   0.47496141]
 [ 0.38737583  0.          0.30083189  0.50935267  0.50935267  0.
   0.38737583  0.          0.          0.          0.          0.30083189
   0.        ]
 [ 0.          0.43535684  0.25712876  0.          0.          0.43535684
   0.          0.          0.43535684  0.3311001   0.43535684  0.25712876
   0.        ]]
gas
god
is
jupiter
largest
of
planet
rings
roman
saturn
sowing
the
with

Removing stopwords

In [6]:
vectorizer_with_stopwords = TfidfVectorizer(stop_words='english')

tf_idf_stop_matrix = vectorizer_with_stopwords.fit_transform(corpus_4_docs.data)

for tfidf, word in get_word_freq(tf_idf_stop_matrix, vectorizer_with_stopwords)[:40]:
    print("{:.3f} {}".format(tfidf, word))
0.552 people
0.332 local
0.316 united
0.301 summit
0.301 eu
0.294 league
0.281 madrid
0.266 crisis
0.258 cent
0.258 99
0.246 vs
0.241 issue
0.239 real
0.216 ronaldo
0.216 newcastle
0.211 game
0.211 barcelona
0.196 sunday
0.195 city
0.180 leaders
0.180 eurozone
0.180 european
0.180 effect
0.180 brussels
0.171 economic
0.171 council
0.168 said
0.162 unbeaten
0.162 tie
0.162 rule
0.162 goodell
0.162 black
0.156 victory
0.156 goals
0.148 street
0.148 protest
0.144 community
0.141 teams
0.141 soccer
0.141 sevilla

Apply stemming

  • Idea: create a custom tokenizer that also stems the tokens
In [7]:
from nltk.stem.porter import PorterStemmer # use the stemmer from nltk
import re

class TokenizerWithStemming(object):
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.token_pattern = re.compile(r"(?u)\b\w\w+\b")
    def __call__(self, doc):
        # tokenize the input with a regex and stem each token
        return [self.stemmer.stem(t) for t in self.token_pattern.findall(doc)] 

stem_vectorizer = TfidfVectorizer(tokenizer=TokenizerWithStemming())
stem_matrix = stem_vectorizer.fit_transform(corpus_4_docs.data)

for tfidf, word in get_word_freq(stem_matrix, stem_vectorizer)[:40]:
    print("{:.3f} {}".format(tfidf, word))
1.993 the
0.869 of
0.832 to
0.794 and
0.791 in
0.393 on
0.378 for
0.352 is
0.312 it
0.297 peopl
0.253 at
0.248 with
0.243 have
0.226 are
0.220 by
0.216 summit
0.216 eu
0.212 an
0.206 game
0.204 that
0.204 their
0.194 two
0.186 crisi
0.182 be
0.181 but
0.179 local
0.177 leagu
0.176 madrid
0.170 unit
0.168 bank
0.167 thi
0.163 which
0.162 wa
0.155 ha
0.152 issu
0.148 ronaldo
0.148 newcastl
0.147 real
0.147 goal
0.146 as

Computing the similarity scores

  • All pairwise metrics can be found in the scikit-learn documentation
  • Descriptions of the pairwise metrics can be found in the metrics section of the user guide
  • Cosine Similarity
    • The dot product only considers dimensions in which both vectors are non-zero
    • Normalised by the length of both vectors
      $$ k(x,y) = \frac{xy^T}{\lVert x \rVert \lVert y \rVert }$$
  • With stopwords
    • Cosine(D1, D2) = 0.12
    • Cosine(D1, D3) = 0.04
    • Cosine(D2, D3) = 0.00
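
As a small sketch, the pairwise cosine similarities of the three sample documents d1, d2 and d3 can be computed with binary term occurrences; the exact values depend on the chosen vectorizer and on whether stop words are removed, so they need not match the numbers above.

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# binary term occurrence vectors for the three sample documents
sample_matrix = CountVectorizer(binary=True).fit_transform([d1, d2, d3])
print(cosine_similarity(sample_matrix))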
In [8]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(stem_matrix)
Out[8]:
array([[ 1.        ,  0.59939317,  0.4711856 ,  0.49572664],
       [ 0.59939317,  1.        ,  0.60022404,  0.51036918],
       [ 0.4711856 ,  0.60022404,  1.        ,  0.41403784],
       [ 0.49572664,  0.51036918,  0.41403784,  1.        ]])
In [9]:
# One can also use a star to import all pairwise metrics
from sklearn.metrics.pairwise import *
linear_kernel(stem_matrix)
Out[9]:
array([[ 1.        ,  0.59939317,  0.4711856 ,  0.49572664],
       [ 0.59939317,  1.        ,  0.60022404,  0.51036918],
       [ 0.4711856 ,  0.60022404,  1.        ,  0.41403784],
       [ 0.49572664,  0.51036918,  0.41403784,  1.        ]])
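
Both calls return the same matrix because TfidfVectorizer L2-normalises the document vectors by default (norm='l2'), so the plain dot product of two rows (the linear kernel) is already their cosine similarity. A quick check:

In [ ]:
import numpy as np

# rows of stem_matrix have unit length, hence linear kernel == cosine similarity
print(np.allclose(linear_kernel(stem_matrix), cosine_similarity(stem_matrix)))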

Feature Selection

  • High dimensional data!
  • Not all features help!
  • Pruning: Remove too frequent or too infrequent tokens
    • Percentage-based: ignore words that appear in fewer / more than a given percentage of all documents
    • Absolute: ignore words that appear in fewer / more than a given number of documents
    • By rank: ignore a given percentage of the most frequent / infrequent words (see the sketch after the code cell below)
In [10]:
prune_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.3) # Percentual
prune_vectorizer = TfidfVectorizer(min_df=5, max_df=20) # Absolute
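
For pruning by rank there is no dedicated parameter, but max_features comes close: it keeps only the terms with the highest term frequency across the corpus. The cutoff below is just an illustrative value.

In [ ]:
# "by rank" pruning: keep only the 1000 most frequent terms of the corpus
rank_vectorizer = TfidfVectorizer(max_features=1000)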

Clustering

In [11]:
corpus_30_docs = load_files('DataSetEx6/corpus-30docs',encoding='utf-8')
tf_idf_vectorizer = TfidfVectorizer()
tf_idf_matrix = tf_idf_vectorizer.fit_transform(corpus_30_docs.data)

from sklearn.cluster import KMeans
k_means_estimator = KMeans(n_clusters = 2)
labels = k_means_estimator.fit_predict(tf_idf_matrix)
print(labels)
print(corpus_30_docs.target)
[0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0]
[0 2 1 1 2 2 2 1 1 2 0 1 0 1 2 2 0 2 0 0 1 1 0 0 2 0 0 2 1 1]
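
The target vector above contains three categories (0, 1 and 2), so it is natural to also try three clusters; note that the cluster ids assigned by k-means are arbitrary and will not necessarily line up with the target ids. A minimal sketch:

In [ ]:
# rerun k-means with one cluster per category
k_means_3 = KMeans(n_clusters=3)
labels_3 = k_means_3.fit_predict(tf_idf_matrix)
print(labels_3)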

Adjusted Rand index

In [12]:
from sklearn import metrics
labels_true =   [0, 0, 0, 1, 1, 1]
labels_pred =   [0, 0, 1, 1, 2, 2]
labels_pred_2 = [1, 1, 0, 0, 3, 3]  # permute 0 and 1 and rename 2 to 3
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.adjusted_rand_score(labels_true, labels_pred_2))
0.242424242424
0.242424242424
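
Because the adjusted Rand index is invariant under renaming of the cluster labels, it can be used directly to compare the k-means clustering from above with the true categories:

In [ ]:
# evaluate the k-means result against the true newsgroup labels
print(metrics.adjusted_rand_score(corpus_30_docs.target, labels))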
In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
pd.set_option('display.max_rows', 500)


text_zero_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 0]
text_one_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 1]
text_two_indices = [i for i, text in enumerate(corpus_30_docs.data) if corpus_30_docs.target[i] == 2]

count_vectorizer = CountVectorizer(tokenizer= TokenizerWithStemming(), stop_words='english')
count_matrix = count_vectorizer.fit_transform(corpus_30_docs.data)

rows = []
for word, idx in count_vectorizer.vocabulary_.items():
    rows.append((word, 
                (count_matrix.getcol(idx) > 0).sum(), 
                (count_matrix.getcol(idx)[text_zero_indices] > 0).sum(), 
                (count_matrix.getcol(idx)[text_one_indices] > 0).sum(), 
                (count_matrix.getcol(idx)[text_two_indices] > 0).sum()))

document_freqs = pd.DataFrame(rows, columns = ['word', 'All', corpus_30_docs.target_names[0], corpus_30_docs.target_names[1], corpus_30_docs.target_names[2]])

document_freqs.sort_values('All', ascending=False)
Out[13]:
word All sci.space soc.religion.christian talk.politics.guns
0 path 30 10 10 10
40 id 30 10 10 10
51 organ 30 10 10 10
44 apr 30 10 10 10
43 date 30 10 10 10
34 subject 30 10 10 10
1 cantaloup 30 10 10 10
39 messag 30 10 10 10
5 edu 30 10 10 10
4 cmu 30 10 10 10
3 cs 30 10 10 10
2 srv 30 10 10 10
57 line 29 9 10 10
49 gmt 26 8 10 8
157 1993 24 7 10 7
144 net 23 6 8 9
12 com 23 7 7 9
6 crabappl 21 6 7 8
45 93 20 5 10 5
178 articl 20 7 3 10
205 thi 20 5 9 6
11 gtefsd 20 6 7 7
10 eng 20 6 7 7
9 europa 20 6 7 7
53 univers 19 7 6 6
142 howland 19 4 7 8
143 reston 19 4 7 8
7 fs7 19 5 7 7
8 ece 19 5 7 7
184 write 18 5 5 8
50 sender 18 4 10 4
167 refer 17 8 1 8
175 post 17 7 2 8
23 news 17 8 1 8
212 wa 17 4 8 5
176 host 15 7 1 7
64 use 15 6 5 4
174 nntp 14 7 0 7
306 ha 13 3 6 4
381 say 13 2 6 5
19 state 13 4 5 4
477 make 13 5 4 4
543 don 12 3 3 6
469 ani 12 2 6 4
263 approv 12 2 10 0
297 know 12 1 8 3
329 peopl 12 1 8 3
475 onli 12 1 7 4
561 good 11 2 4 5
526 thing 11 4 4 3
275 question 11 2 5 4
70 time 11 4 4 3
241 rutger 10 0 10 0
389 just 10 4 5 1
198 like 10 2 3 5
679 differ 10 2 4 4
285 becaus 10 1 6 3
47 05 10 2 8 0
264 arami 10 0 10 0
242 igor 10 0 10 0
243 atho 10 0 10 0
244 christian 10 0 10 0
652 howev 10 2 5 3
260 hedrick 10 0 10 0
467 doe 10 4 3 3
232 believ 10 1 6 3
654 control 9 3 0 6
254 10 9 3 5 1
481 second 9 5 1 3
286 think 9 1 6 2
736 15 9 4 0 5
307 hi 9 1 5 3
327 reason 9 1 3 5
514 gun 9 0 0 9
425 new 9 3 3 3
838 possibl 9 2 4 3
38 need 9 6 2 1
13 gatech 9 3 4 2
295 year 8 3 2 3
432 scienc 8 3 1 4
152 anoth 8 1 4 3
163 distribut 8 3 0 5
1217 01 8 3 4 1
1062 total 8 4 2 2
69 space 8 8 0 0
540 look 8 3 2 3
63 someth 8 4 3 1
972 talk 8 0 5 3
220 mean 8 2 3 3
77 point 8 3 3 2
482 uunet 8 2 3 3
258 09 8 1 7 0
843 whi 8 3 3 2
1447 inform 7 3 2 2
161 comput 7 1 3 3
339 follow 7 3 4 0
187 effect 7 2 1 4
186 research 7 5 1 1
506 14 7 3 3 1
410 ve 7 1 3 3
493 usenet 7 4 0 3
114 00 7 3 3 1
217 claim 7 1 3 3
18 ohio 7 3 2 2
225 case 7 2 1 4
1182 way 7 2 3 2
104 true 7 2 3 2
461 come 7 1 5 1
841 cours 7 1 3 3
101 argument 7 2 2 3
945 god 7 1 6 0
459 problem 7 2 3 2
300 mani 7 1 3 3
782 ll 7 4 1 2
294 13 6 2 3 1
570 result 6 2 2 2
16 zaphod 6 2 2 2
645 provid 6 2 2 2
323 real 6 0 3 3
158 17 6 1 2 3
315 exampl 6 2 2 2
287 person 6 0 4 2
1197 nasa 6 6 0 0
1192 note 6 2 2 2
17 mp 6 2 2 2
383 truth 6 0 3 3
276 ask 6 3 2 1
368 noth 6 2 4 0
1549 utexa 6 2 2 2
379 small 6 3 1 2
1530 ground 6 3 1 2
463 didn 6 2 3 1
468 anyon 6 0 4 2
1484 associ 6 2 1 3
476 better 6 2 3 1
662 caus 6 2 2 2
271 opinion 6 1 3 2
351 group 6 2 4 0
490 1993apr6 6 3 0 3
214 base 6 3 1 2
1448 discuss 6 3 2 1
206 man 6 3 1 2
518 weapon 6 0 0 6
200 center 6 3 0 3
661 support 6 3 1 2
457 work 6 1 3 2
962 evid 6 1 2 3
1068 nation 6 2 0 4
116 02 6 2 3 1
709 31 6 1 3 2
779 realli 6 2 2 2
78 book 6 2 3 1
815 specif 6 1 4 1
776 hand 6 3 1 2
834 12 6 3 1 2
826 doesn 6 2 2 2
134 thank 6 1 2 3
699 right 6 0 3 3
210 consid 5 1 1 3
505 tue 5 3 0 2
208 hell 5 0 3 2
54 california 5 4 0 1
730 stephen 5 2 1 2
388 jesu 5 0 5 0
519 danger 5 2 1 2
532 offic 5 2 2 1
911 harvard 5 3 1 1
374 high 5 3 1 1
1015 sourc 5 1 2 2
438 16 5 1 2 2
401 sort 5 1 4 0
975 rochest 5 1 2 2
404 thought 5 1 4 0
429 20 5 2 2 1
444 rememb 5 2 2 1
1510 pleas 5 3 1 1
431 21 5 3 1 1
229 befor 5 1 2 2
951 tri 5 2 1 2
953 fact 5 1 1 3
428 30 5 0 2 3
1586 mail 5 3 1 1
959 let 5 1 4 0
24 servic 5 2 1 2
676 gener 5 2 1 2
797 end 5 2 2 1
48 46 5 2 2 1
754 agre 5 0 3 2
1029 els 5 0 3 2
560 hit 5 1 1 3
357 said 5 1 3 1
1218 11 5 0 4 1
15 wupost 5 2 1 2
298 bibl 5 0 5 0
649 certainli 5 2 2 1
304 human 5 1 3 1
646 isn 5 2 0 3
643 veri 5 1 1 3
886 john 5 1 2 2
164 world 5 2 1 2
322 day 5 2 2 1
135 advanc 5 2 3 0
1796 texa 5 2 1 2
889 alway 5 0 3 2
278 live 5 2 1 2
615 target 5 1 1 3
442 sure 5 0 2 3
582 requir 5 2 0 3
1686 forc 5 4 0 1
274 exist 5 2 2 1
1636 gov 4 4 0 0
310 develop 4 2 1 1
346 natur 4 1 1 2
1609 shuttl 4 4 0 0
910 da 4 3 1 0
1536 usa 4 2 0 2
1732 basic 4 2 1 1
811 add 4 2 0 2
785 load 4 2 0 2
37 help 4 2 1 1
950 poster 4 2 1 1
296 institut 4 1 2 1
1681 wing 4 4 0 0
399 popul 4 0 2 2
427 citi 4 1 1 2
400 commun 4 2 1 1
933 view 4 1 2 1
318 apollo 4 2 1 1
799 long 4 2 0 2
918 04 4 2 2 0
817 mayb 4 0 3 1
334 safeti 4 1 1 2
800 west 4 3 0 1
333 life 4 0 3 1
32 david 4 2 1 1
818 place 4 1 2 1
330 hope 4 2 2 0
1731 access 4 4 0 0
1103 avail 4 3 0 1
1520 probabl 4 1 0 3
701 1993apr5 4 2 0 2
577 public 4 1 1 2
1270 far 4 0 3 1
1262 oper 4 2 1 1
1252 idea 4 1 3 0
1248 belief 4 1 3 0
593 abl 4 1 1 2
1232 anyth 4 0 2 2
603 self 4 0 1 3
1059 effort 4 2 0 2
... ... ... ... ... ...
1833 390 1 1 0 0
1832 590 1 1 0 0
1831 282 1 1 0 0
1830 270 1 1 0 0
1829 340 1 1 0 0
1828 224 1 1 0 0
1827 850 1 1 0 0
1826 227 1 1 0 0
1660 gain 1 1 0 0
1659 advantag 1 1 0 0
1658 structur 1 1 0 0
1657 allevi 1 1 0 0
466 dixon 1 0 1 0
1488 apq 1 0 0 1
1487 c519mt 1 0 0 1
472 pretti 1 0 1 0
1485 nyc 1 0 0 1
474 peac 1 0 1 0
1483 mad 1 0 0 1
1482 steve 1 0 0 1
1481 linknet 1 0 0 1
1480 mane 1 0 0 1
1479 magpi 1 0 0 1
1478 jpradley 1 0 0 1
1477 murphi 1 0 0 1
1476 ge 1 0 0 1
1475 crd 1 0 0 1
1474 53328 1 0 0 1
1473 41582 1 0 0 1
1471 misc 1 0 0 1
1470 joe 1 0 1 0
1469 rigid 1 0 1 0
1468 clue 1 0 1 0
479 walker 1 0 1 0
480 perci 1 0 1 0
1465 wipe 1 0 1 0
1463 pastor 1 0 1 0
1461 vote 1 0 1 0
1460 seldom 1 0 1 0
1490 newsread 1 0 0 1
1491 tin 1 0 0 1
465 jean 1 0 1 0
1508 britain 1 0 0 1
1521 bother 1 0 0 1
446 vision 1 0 1 0
1519 attent 1 0 0 1
1518 superior 1 0 0 1
1517 chose 1 0 0 1
1516 debat 1 0 0 1
1515 comparison 1 0 0 1
1514 emphasi 1 0 0 1
1512 okay 1 0 0 1
1511 vs 1 0 0 1
448 bought 1 0 1 0
1509 council 1 0 0 1
1507 children 1 0 0 1
1493 pl9 1 0 0 1
1506 licens 1 0 0 1
449 gold 1 0 1 0
450 hurt 1 0 1 0
1503 250 1 0 0 1
1502 fw7 1 0 0 1
1501 c4u3x5 1 0 0 1
451 financi 1 0 1 0
452 800 1 0 1 0
454 stuck 1 0 1 0
462 forget 1 0 1 0
464 la 1 0 1 0
1494 1ppk8hinn7k7 1 0 0 1
1459 rare 1 0 1 0
491 140756 1 0 0 1
492 29159 1 0 0 1
1406 devot 1 0 1 0
1420 closest 1 0 1 0
1419 worldwid 1 0 1 0
515 disagre 1 0 0 1
1416 later 1 0 1 0
1415 spirit 1 0 1 0
1413 42 1 0 1 0
1412 favor 1 0 1 0
1411 sincer 1 0 1 0
1410 glad 1 0 1 0
1409 awe 1 0 1 0
1408 communion 1 0 1 0
1407 apostl 1 0 1 0
1404 fruit 1 0 1 0
1422 constantli 1 0 1 0
1403 corinthian 1 0 1 0
1401 extent 1 0 1 0
1400 divis 1 0 1 0
1399 harm 1 0 1 0
1398 meet 1 0 1 0
1397 prais 1 0 1 0
1395 split 1 0 1 0
1394 scriptur 1 0 1 0
1393 untoler 1 0 1 0
523 uranium 1 0 0 1
1391 firmli 1 0 1 0
1390 feel 1 0 1 0
1421 boston 1 0 1 0
1423 strive 1 0 1 0
1454 cliqu 1 0 1 0
1439 reward 1 0 1 0
1453 cliquey 1 0 1 0
1452 door 1 0 1 0
1451 open 1 0 1 0
1449 middl 1 0 1 0
495 145815 1 0 0 1
496 13383 1 0 0 1
1446 strang 1 0 1 0
497 ld 1 0 0 1
1444 outsid 1 0 1 0
1443 admit 1 0 1 0
498 loral 1 0 0 1
499 1pj1s8inn48k 1 0 0 1
501 734084516 1 0 0 1
1424 error 1 0 1 0
502 ponder 1 0 0 1
504 iastat 1 0 0 1
509 77 1 0 0 1
1433 mark 1 0 1 0
510 dan 1 0 0 1
1431 occur 1 0 1 0
1430 import 1 0 1 0
1429 cooper 1 0 1 0
1428 relat 1 0 1 0
1427 close 1 0 1 0
1426 sold 1 0 1 0
511 sorenson 1 0 0 1
1523 pull 1 0 0 1
443 credibl 1 0 1 0
441 propheci 1 0 1 0
1611 coverag 1 1 0 0
1625 407 1 1 0 0
1624 upcom 1 1 0 0
1623 remov 1 1 0 0
1622 formerli 1 1 0 0
1621 portion 1 1 0 0
1619 manifest 1 1 0 0
1618 compress 1 1 0 0
1617 ct 1 1 0 0
363 sleep 1 0 1 0
1615 gandalf 1 1 0 0
1614 holli 1 1 0 0
365 reproduc 1 0 1 0
370 escap 1 0 1 0
1627 info 1 1 0 0
375 piti 1 0 1 0
382 blind 1 0 1 0
384 realiz 1 0 1 0
1603 schedule_733694347 1 1 0 0
385 _they_ 1 0 1 0
386 mask 1 0 1 0
387 fake 1 0 1 0
1599 177 1 1 0 0
391 37696 1 0 1 0
394 roman 1 0 1 0
396 69ad 1 0 1 0
1593 schedule_730956538 1 1 0 0
1626 867 1 1 0 0
361 _just_ 1 0 1 0
402 cohes 1 0 1 0
355 guidelin 1 0 1 0
1656 neg 1 1 0 0
1655 vehicl 1 1 0 0
350 enter 1 0 1 0
1653 maneuv 1 1 0 0
1652 tilt 1 1 0 0
1650 clearanc 1 1 0 0
1649 tower 1 1 0 0
1648 assur 1 1 0 0
1647 command 1 1 0 0
354 ministri 1 0 1 0
1645 phase 1 1 0 0
1644 vertic 1 1 0 0
1642 2102 1 1 0 0
359 spread 1 0 1 0
1641 asc 1 1 0 0
1640 manual 1 1 0 0
1639 train 1 1 0 0
356 basicali 1 0 1 0
1637 ascent 1 1 0 0
358 prioriti 1 0 1 0
1635 jsc 1 1 0 0
1634 gothamc 1 1 0 0
1633 kjenk 1 1 0 0
1632 jenk 1 1 0 0
1631 liftoff 1 1 0 0
1630 roll 1 1 0 0
397 dispers 1 0 1 0
403 dire 1 0 1 0
1526 twist 1 0 0 1
1542 scold 1 0 0 1
1555 server 1 0 0 1
1554 panason 1 0 0 1
1553 adagio 1 0 0 1
1552 sgiblab 1 0 0 1
1551 sgigat 1 0 0 1
1550 olivea 1 0 0 1
422 sri 1 0 1 0
1547 brudda 1 0 0 1
1546 seddit 1 0 0 1
1545 blow 1 0 0 1
1544 style 1 0 0 1
1543 whiney 1 0 0 1
1541 nitpick 1 0 0 1
1557 waco 1 0 0 1
1539 afraid 1 0 0 1
1538 whine 1 0 0 1
1537 gb 1 0 0 1
430 14316 1 0 1 0
1534 compar 1 0 0 1
1533 gratuit 1 0 0 1
1532 citat 1 0 0 1
435 menlo 1 0 1 0
436 park 1 0 1 0
437 ca 1 0 1 0
439 regard 1 0 1 0
440 wilkerson 1 0 1 0
1556 aclu 1 0 0 1
1558 shootout 1 0 0 1
405 damien 1 0 1 0
415 nocturn 1 0 1 0
406 endemyr 1 0 1 0
407 unpur 1 0 1 0
408 knight 1 0 1 0
409 doom 1 0 1 0
1582 7223 1 1 0 0
1581 108 1 1 0 0
1580 59904 1 1 0 0
1579 girl 1 0 0 1
1577 critter 1 0 0 1
1576 threat 1 0 0 1
1575 pose 1 0 0 1
414 adopt 1 0 1 0
416 lifestyl 1 0 1 0
1559 1pifnjinnscb 1 0 0 1
417 doesnt 1 0 1 0
1570 devic 1 0 0 1
418 vampir 1 0 1 0
1568 evolv 1 0 0 1
420 gilham 1 0 1 0
1566 hagerp 1 0 0 1
1565 hager 1 0 0 1
1564 sandman 1 0 0 1
421 csl 1 0 1 0
1562 1pdb6qinn3sl 1 0 0 1
1561 10843 1 0 0 1
1560 140529 1 0 0 1
2884 divid 1 0 1 0

2885 rows × 5 columns

More on similarities