%%capture
!pip install JATA
!pip install altair
from Text import *
from CJH import CJH_Archives
import altair as alt
Run the following cell if you want to work with meta-data. If not, skip over it.
collections = CJH_Archives('AJHS').get_meta_data('collections', 1, 2)
Finding aid descriptions are set as the default, but you can pick any column name from the imported data. You may also want to experiment with records data!
# Set initial quotes df from the collections metadata.
df_quotes = collections
li_quotes = df_quotes['Finding Aid & Administrative Information'].tolist()
stringV = li_quotes
print("Number of Rows", len(li_quotes))
a = ' '.join(stringV)          # join every description into one long string
b_meta = wordninja.split(a)    # split fused words into individual tokens
print(len(b_meta))
All possible fields we could analyze:
df_quotes.columns
The step that poses the most issues when analyzing journal articles or academic papers is converting the file from a PDF to plain text. A PDF carries a lot of information on each page beyond the actual text content: think page numbers, citation caveats, margin notes, or tables and graphs.
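If you are curious what that conversion looks like under the hood, here is a minimal sketch using PyPDF2. This is only an illustration: the parse_all_pdfs_in_curr_dir() helper used below handles this for you and likely adds extra cleanup, and PyPDF2 itself is an assumption here, not a JATA dependency.
import glob
import pandas as pd
from PyPDF2 import PdfReader

rows = []
for path in glob.glob('*.pdf'):
    reader = PdfReader(path)
    # Join the raw text of every page; page numbers, headers, and footers
    # come along for the ride and have to be cleaned up afterwards.
    text = ' '.join(page.extract_text() or '' for page in reader.pages)
    rows.append({'File': path, 'Text': text})

example_pdf_df = pd.DataFrame(rows)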
Set the variable in the following cell to True if you want to work with PDFs; if not, leave it be. Load your PDFs into the content folder on the left.
working_with_pdfs = False

if working_with_pdfs:
    fileDF = parse_all_pdfs_in_curr_dir()
    # Set initial quotes df from the parsed PDFs.
    df_quotes = fileDF
    li_quotes = df_quotes['Text'].tolist()
    stringV = li_quotes
    print("Number of Articles", len(li_quotes))
    a = ' '.join(stringV)
    b_pdf = wordninja.split(a)
    print(len(b_pdf))
else:
    pass
First, a note on the difference between stemming and lemmatization:
Stemming: shortening a word with simple regex rules
Lemmatization: finding the root word with linguistic rules (built on top of regex rules)
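For a quick feel for the difference, here is a small comparison using NLTK. This is purely illustrative; the stopStemLem() helper below may use different rules internally, and you may need nltk.download('wordnet') the first time.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['archives', 'studies', 'running']:
    # Stemming chops with rules ('studies' -> 'studi'), while lemmatization
    # maps to a real root word ('studies' -> 'study').
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))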
df_words = stopStemLem(li_quotes)
print("df_words.head(10):")
print(df_words.head(10))
#hide-input
df_words.head(50)
df_words = df_words[['lem', 'pos', 'counts']].head(200)
dfList_pos = format_stopstemlem(df_words)
#hide-input
dfList_pos[0]
dfList_pos[1]
dfList_pos[2]
dfList_pos[3]
source = df_words[df_words.counts > 1].sort_values(by=['counts'], ascending=False)

alt.Chart(source).mark_bar(opacity=0.7).encode(
    y=alt.Y('lem:N', sort=alt.EncodingSortField(field='counts', op='sum', order='descending')),
    x=alt.X('counts:Q', stack=None),
    color='pos:N',
)
When parsing PDFs, the most common things to see are page numbers and words that are stucktogetherlikethis. To handle this, and to make our training data more robust, we use a package called wordninja that uses English corpora and some fancy math to split them up correctly. We also remove all numbers that are not spelled out in the text.
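Here is a rough idea of what that cleanup does. The clean_plain_text_for_training() helper in the next cell handles this (and possibly more) for you, so treat this as a sketch, not a drop-in replacement.
import re
import wordninja

sample = 'page 42 recordsstucktogetherlikethis from 1903'
tokens = wordninja.split(sample)                             # split fused words
tokens = [t for t in tokens if not re.fullmatch(r'\d+', t)]  # drop bare numbers
print(' '.join(tokens))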
#collapse-hide
clean_text_for_training = clean_plain_text_for_training(stringV)
file = " ".join(clean_text_for_training)
x_data, X, y, chars = find_patterns(file)
JATA comes with a built-in params function, but feel free to override it in the custom function below if you like! Set the my_own_params flag to True if you want to use your own settings.
FYI - this can take some time (sometimes up to an hour with the built-in settings), so grab a snack or take a nap!
Reducing the epochs will reduce the time it takes to train, but it will also reduce the robustness of your output!
my_own_params = False

def set_model_params_custom(X, y):
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    filepath = "model_weights_saved.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
    desired_callbacks = [checkpoint]

    model.fit(X, y, epochs=1, batch_size=200, callbacks=desired_callbacks)
    return model
import time

start = time.time()
if my_own_params:
    model = set_model_params_custom(X, y)
else:
    model = set_model_params(X, y)
end = time.time()

print("Time Elapsed:")
print(end - start)
filepath = "model_weights_saved.hdf5"
print(what_does_the_robot_say(x_data,model, chars,filepath))
Play around with the training data and model params until you find your desired output!
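If you want finer control over the generated text than what_does_the_robot_say() offers, a character-level sampling loop usually looks something like the sketch below. It reuses x_data, chars, and model from the cells above, but the exact shapes and mappings inside JATA may differ, so treat it as a starting point rather than a guaranteed drop-in.
import numpy as np

num_to_char = dict(enumerate(chars))                     # assumed index -> character mapping
pattern = list(x_data[np.random.randint(len(x_data))])   # random seed sequence

generated = []
for _ in range(200):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(len(chars))
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))                   # pick the most likely next character
    generated.append(num_to_char[index])
    pattern.append(index)
    pattern = pattern[1:]                                # slide the window forward

print(''.join(generated))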