Textual Analysis for Archive Metadata or Journal Articles

In [1]:
%%capture
!pip install JATA
!pip install altair
from Text import *
from CJH import CJH_Archives
import altair as alt

Choosing the Data Source: PDFs or Metadata

Importing Metadata

Run the following cell if you want to work with metadata; otherwise, skip it.

In [2]:
collections= CJH_Archives('AJHS').get_meta_data('collections', 1, 2)
Creating CJHA Scraper Object for AJHS
Scraping Collections (Finding Aids)
Scraping Archive Index for Entry Links 1
Scraping Archive Index for Entry Links 2
Number of Objects Extracted:  60
Scraping entry meta data...
Record:  1 https://archives.cjh.org/repositories/3/resources/15236 
      The White Jew Newspaper
    
Record:  2 https://archives.cjh.org/repositories/3/resources/13248 
      Synagogue Council of America Records
    
Record:  3 https://archives.cjh.org/repositories/3/resources/15562 
      Admiral Lewis Lichtenstein Strauss Papers
    
Record:  4 https://archives.cjh.org/repositories/3/resources/15566 
      Meyer Greenberg Papers
    
Record:  5 https://archives.cjh.org/repositories/3/resources/15570 
      Louis Lipsky Papers
    
Record:  6 https://archives.cjh.org/repositories/3/resources/15623 
      Noah Benevolent Society Records
    
Record:  7 https://archives.cjh.org/repositories/3/resources/15557 
      Leo Hershkowitz Collection of Court Records
    
Record:  8 https://archives.cjh.org/repositories/3/resources/15770 
      E. Michael Bluestone Papers
    
Record:  9 https://archives.cjh.org/repositories/3/resources/18294 
      Oscar M Lifshutz (1916-1990) Papers
    
Record:  10 https://archives.cjh.org/repositories/3/resources/18296 
      Norman Hapgood (1868-1937) Papers
    
Record:  11 https://archives.cjh.org/repositories/3/resources/18357 
      Jonah J. Goldstein Papers
    
Record:  12 https://archives.cjh.org/repositories/3/resources/18366 
      Melvin Urofsky collection
    
Record:  13 https://archives.cjh.org/repositories/3/resources/19663 
      Jewish Music Forum (New York, N.Y.) records
    
Record:  14 https://archives.cjh.org/repositories/3/resources/5998 
      Aaron Kramer (1921-1997) Papers
    
Record:  15 https://archives.cjh.org/repositories/3/resources/6010 
      Chaim Weizmann Papers
    
Record:  16 https://archives.cjh.org/repositories/3/resources/6113 
      Lawrence Sampter collection
    
Record:  17 https://archives.cjh.org/repositories/3/resources/6114 
      Emanuel de la Motta prayerbook collection
    
Record:  18 https://archives.cjh.org/repositories/3/resources/6115 
      Israel Goldberg papers
    
Record:  19 https://archives.cjh.org/repositories/3/resources/6116 
      M.S. Polack collection
    
Record:  20 https://archives.cjh.org/repositories/3/resources/6117 
      David Lloyd George Paris Peace Conference autograph album
    
Record:  21 https://archives.cjh.org/repositories/3/resources/6118 
      Henry Hochheimer marriage record book
    
Record:  22 https://archives.cjh.org/repositories/3/resources/6119 
      Judah family (New York City and Richmond) papers
    
Record:  23 https://archives.cjh.org/repositories/3/resources/6196 
      Baron family papers
    
Record:  24 https://archives.cjh.org/repositories/3/resources/6197 
      Lewisohn family genealogy
    
Record:  25 https://archives.cjh.org/repositories/3/resources/6198 
      Ewenczyk family genealogy collection
    
Record:  26 https://archives.cjh.org/repositories/3/resources/6199 
      Sulzberger family collection
    
Record:  27 https://archives.cjh.org/repositories/3/resources/6200 
      Moses Alexander autograph
    
Record:  28 https://archives.cjh.org/repositories/3/resources/6201 
      Sholem Asch autograph photograph
    
Record:  29 https://archives.cjh.org/repositories/3/resources/6202 
      Simon Bamburger collection
    
Record:  30 https://archives.cjh.org/repositories/3/resources/6203 
      Simon Guggenheimer letter
    
Record:  31 https://archives.cjh.org/repositories/3/resources/6204 
      Henry M. Moos correspondence
    
Record:  32 https://archives.cjh.org/repositories/3/resources/6205 
      Joseph Austrian autobiographical and historical sketches
    
Record:  33 https://archives.cjh.org/repositories/3/resources/6206 
      Samuel Lawrence scrapbook
    
Record:  34 https://archives.cjh.org/repositories/3/resources/6225 
      Adolph J. Sabath papers
    
Record:  35 https://archives.cjh.org/repositories/3/resources/6226 
      Roy H. Millenson collection of Senator Jacob K. Javits
    
Record:  36 https://archives.cjh.org/repositories/3/resources/6227 
      Kuttenplum family legal records
    
Record:  37 https://archives.cjh.org/repositories/3/resources/6228 
      Selkind family Yiddish postcards
    
Record:  38 https://archives.cjh.org/repositories/3/resources/6229 
      John Gellman papers
    
Record:  39 https://archives.cjh.org/repositories/3/resources/6230 
      Elliott S. Shapiro biographical materials
    
Record:  40 https://archives.cjh.org/repositories/3/resources/6231 
      Joan Breslow Woodbine colony reference materials
    
Record:  41 https://archives.cjh.org/repositories/3/resources/6233 
      Halpern family papers
    
Record:  42 https://archives.cjh.org/repositories/3/resources/6234 
      Blu Greenberg papers
    
Record:  43 https://archives.cjh.org/repositories/3/resources/6236 
      Ellen Norman Stern, Collection of Elie Wiesel newsclippings
    
Record:  44 https://archives.cjh.org/repositories/3/resources/6237 
      Saralea Zohar Aaron papers
    
Record:  45 https://archives.cjh.org/repositories/3/resources/6238 
      Vivian White Soboleski papers
    
Record:  46 https://archives.cjh.org/repositories/3/resources/6120 
      Judah family (New York, Montreal, Indiana) papers
    
Record:  47 https://archives.cjh.org/repositories/3/resources/6121 
      Esther Levy estate inventory
    
Record:  48 https://archives.cjh.org/repositories/3/resources/6137 
      Solomon Eudovich papers
    
Record:  49 https://archives.cjh.org/repositories/3/resources/6139 
      Abendanone family papers
    
Record:  50 https://archives.cjh.org/repositories/3/resources/6140 
      Mark Levy estate inventory
    
Record:  51 https://archives.cjh.org/repositories/3/resources/6141 
      Ehrenreich family papers
    
Record:  52 https://archives.cjh.org/repositories/3/resources/6142 
      Selman A. Waksman papers
    
Record:  53 https://archives.cjh.org/repositories/3/resources/6143 
      Martin Van Buren papers
    
Record:  54 https://archives.cjh.org/repositories/3/resources/6145 
      Stephen Wise papers
    
Record:  55 https://archives.cjh.org/repositories/3/resources/6146 
      Philip Slomovitz United Hebrew Schools of Detroit collection
    
Record:  56 https://archives.cjh.org/repositories/3/resources/6147 
      Louis Arthur Ungar papers
    
Record:  57 https://archives.cjh.org/repositories/3/resources/6148 
      Morris Rosenfeld papers
    
Record:  58 https://archives.cjh.org/repositories/3/resources/6149 
      Herman W. Block papers
    
Record:  59 https://archives.cjh.org/repositories/3/resources/6150 
      Peter Gouled papers
    
Record:  60 https://archives.cjh.org/repositories/3/resources/6151 
      Meier Steinbrink papers
    

The 'Finding Aid & Administrative Information' column is used by default, but you can pick any column name from the imported data. You may also want to experiment with records data!

In [3]:
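# Set the initial quotes DataFrame.
df_quotes = collections
# Pick the column to analyze and flatten it into a list of strings.
li_quotes = df_quotes['Finding Aid & Administrative Information'].tolist()
stringV = li_quotes
print("Number of Rows", len(li_quotes))
# Join all rows into one string and split run-together words with wordninja.
a = ' '.join(stringV)
b_meta = wordninja.split(a)
print(len(b_meta))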
Number of Rows 60
5814

All possible fields we could analyze:

In [4]:
df_quotes.columns
Out[4]:
Index(['Additional Description', 'Creator', 'Dates', 'Extent',
       'Finding Aid & Administrative Information', 'Language of Materials',
       'Link', 'Name', 'Physical Storage Information', 'Related Names',
       'Repository Details', 'Scope and Content Note', 'Subjects', 'Use Terms',
       'Access Terms'],
      dtype='object')

Parsing and Converting a Group of PDFs to Plain Text

Considerations for analyzing this medium

The most error-prone step when analyzing journal articles or academic papers is converting the file from PDF to plain text. Besides the body text, each PDF page carries other content: page numbers, citation caveats, margin notes, tables, and graphs.
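As a small illustration of the kind of cleanup this implies, here is a minimal sketch that drops lines consisting only of a page number. The helper name drop_page_number_lines and the sample text are hypothetical; this is not what parse_all_pdfs_in_curr_dir does internally, which is not shown here.

```python
import re


def drop_page_number_lines(page_text):
    """Remove lines that contain only a page number such as '12' or 'Page 12'."""
    kept = []
    for line in page_text.splitlines():
        # A line made of an optional "page" label plus digits is treated as residue.
        if re.fullmatch(r"\s*(page\s+)?\d+\s*", line, flags=re.IGNORECASE):
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up sample page text for illustration only.
raw = "The archive holds 60 collections.\n12\nPage 13\nEach has a finding aid."
cleaned = drop_page_number_lines(raw)
print(cleaned)
```

Numbers inside a sentence are left alone; only lines that are nothing but a page number are removed.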

Load the article text from our parsed data

Set the variable in the following cell to True if you want to work with PDFs; otherwise, leave it as is. Load your PDFs into the content folder on the left.

In [5]:
working_with_pdfs = False
In [6]:
if working_with_pdfs:
  # Parse every PDF in the current directory into a DataFrame.
  fileDF = parse_all_pdfs_in_curr_dir()
  # Set the initial quotes DataFrame.
  df_quotes = fileDF

  li_quotes = df_quotes['Text'].tolist()
  stringV = li_quotes
  print("Number of Articles", len(li_quotes))

  # Join all articles into one string and split run-together words.
  a = ' '.join(stringV)
  b_pdf = wordninja.split(a)
  print(len(b_pdf))

Tokenize sentences and words, remove stopwords, use stemmer & lemmatizer

First, a note on the difference between stemming and lemmatization:

  • Stemming: shortening a word with simple regex rules

  • Lemmatization: finding the root word with linguistic rules (with the help of regex rules)
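To make the distinction concrete, here is a minimal toy sketch. It is not the implementation behind stopStemLem (which is not shown here); the suffix list and the small lemma table are assumptions made up for illustration.

```python
import re

# Toy suffix rules (an assumption for illustration; real stemmers such as
# Porter's use many more rules and conditions).
SUFFIX_PATTERN = re.compile(r"(ies|ing|ed|es|s)$")

# Tiny hand-made lemma dictionary (hypothetical; real lemmatizers consult a
# full lexicon plus part-of-speech information).
LEMMA_TABLE = {
    "written": "write",
    "inquiries": "inquiry",
    "finding": "find",
    "details": "detail",
}


def toy_stem(word):
    """Chop a known suffix off the end of the word, if one matches."""
    stripped = SUFFIX_PATTERN.sub("", word)
    # Keep very short results from collapsing to nothing.
    return stripped if len(stripped) >= 3 else word


def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["written", "inquiries", "finding", "details"]:
    print(f"{w:10s} stem={toy_stem(w):8s} lemma={toy_lemmatize(w)}")
```

The stemmer only chops characters, so an irregular form like "written" passes through unchanged, while the lemmatizer can map it to "write" because it looks the word up.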

In [7]:
df_words = stopStemLem(li_quotes)
df_token_lists.head(5):
       0      1   2    3        4          5          6          7             8        9       10      11         12      13       14        15      16         17        18      19         20      21      22        23        24           25           26          27        28    29           30            31           32            33           34           35            36            37      38            39      40           41            42           43        44           45           46    47           48         49          50         51          52       53         54              55        56            57          58          59      60         61          62         63   64    65       66          67              68     69            70     71          72       73          74         75          76          77    78          79          80          81          82          83        84          85        86       87       88           89       90           91           92        93    94          95       96          97        98         99       100      101       102     103         104         105         106         107      108         109      110   111       112     113         114      115         116      117         118      119      120      121        122   123     124     125   126     127   128   129     130     131        132     133        134   135     136
0  title  guide  to  the    white        jew  newspaper                   august                                    i  status       in  progress  author  processed        by   tanya      elder    date                    language           of  description     english    script    of  description         latin     language            of  description         note   description            is      in       english           repository       details   repository   details         part           of   the     american     jewish  historical    society  repository     http                    ajhsorg   contact                                  west      th     street         new       york   ny         united      states       inquiries               cjhorg   None        None     None        None       None        None        None  None        None        None        None        None        None      None        None      None     None     None         None     None         None         None      None  None        None     None        None      None       None      None     None      None    None        None        None        None        None     None        None     None  None      None    None        None     None        None     None        None     None     None     None       None  None    None    None  None    None  None  None    None    None       None    None       None  None    None
1  title  guide  to  the  records         of        the  synagogue       council       of  america                                       undated                                               i  status      in  progress    author    processed           by       tanya     elder  date            a                                 language           of  description  undetermined        script      of   description    code          for  undetermined       script  language           of  description  note  description         is          in    english              edition  statement            this   version           was     derived        from  scaxml   revision  statements      march                 ead     updated              by  tanya         elder         repository  details  repository    details        part          of   the    american      jewish  historical     society  repository      http               ajhsorg  contact                           west           th       street       new  york          ny               united    states  inquiries             cjhorg      None    None        None        None        None        None     None        None     None  None      None    None        None     None        None     None        None     None     None     None       None  None    None    None  None    None  None  None    None    None       None    None       None  None    None
2  title  guide  to  the   papers         of    admiral      lewis  lichtenstein  strauss                                                      p  status         in  progress  author  processed      by    mark         a                 raider         date               october           language            of  description  undetermined       script           of   description          code     for  undetermined  script     language            of  description      note  description           is    in      english                edition  statement        this  version        was         derived      from  llstraussxml    revision  statements   april                         converted   to   ead                              revised     as  llstraussxml     by       tanya    elder                removed  deprecated    elements   and  attributes                 updated  repository       codes                 added  language    codes               changed  doctype  declaration                    etc           january                       entities    removed      from      ead   finding     aid              repository     details  repository  details        part       of   the  american  jewish  historical  society  repository     http              ajhsorg  contact                      west      th  street   new    york    ny        united  states  inquiries             cjhorg  None    None
3  title  guide  to  the    meyer  greenberg     papers    undated                               p  status  completed  author  finding       aid     was    created        by  rachel  alexandra  tutera    date                      description        rules  describing  archives                  a       content     standard      language           of  description       english        script      of   description   latin     language            of  description      note  description           is    in      english                sponsor         as        part       of        the            leon      levy      archival  processing  initiative               made    possible         by  the  leon     levy  foundation                   this    collection    was   processed       by      rachel  alexandra      tutera        with   the  assistance          of       katie   rovanpera              revision  statements      june                         ehyman            postaspace    migration   cleanup        repository  details  repository   details       part        of      the  american  jewish  historical     society  repository        http              ajhsorg  contact                    west          th   street         new     york          ny            united   states  inquiries        cjhorg    None  None    None  None  None    None    None       None    None       None  None    None
4  title  guide  to  the   papers         of      louis     lipsky                                            undated                          p  status         in  progress  author  processed      by  louise  sandberg      date                  november              language    of  description  undetermined       script            of  description         code           for  undetermined  script      language      of  description          note  description        is           in      english            edition  statement        this    version         was  derived       from  louislipskyxml  revision    statements       april                      converted          to        ead             revised          as  louislipskyxml     by         tanya  elder              removed  deprecated   elements         and  attributes           updated  repository       codes                   added  language       codes            changed  doctype  declaration               removed  boilerplate  entities               etc              january                       entities  removed      from     ead     finding         aid              repository  details  repository  details  part        of     the    american   jewish  historical  society  repository     http           ajhsorg    contact                  west    th  street   new  york      ny             united  states  inquiries        cjhorg


df_lem_strings.head():
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             lem quote
0                                                                                                                                                                                                                                                                                                                                                                                          title guide - - white jew newspaper - august - - - - status - progress author process - tanya elder date - - language - description english script - description latin language - description note description - - english - repository detail repository detail part - - american jewish historical society repository http - ajhsorg contact - - west - street new york - - united state inquiry - cjhorg
1                                                                                                                                                                                                                            title guide - - record - - synagogue council - america - - - - undated - - - - - status - progress author process - tanya elder date - - - language - description undetermined script - description code - undetermined script language - description note description - - english - edition statement - version - derive - scaxml revision statement march - - ead update - tanya elder - repository detail repository detail part - - american jewish historical society repository http - ajhsorg contact - - west - street new york - - united state inquiry - cjhorg
2                       title guide - - paper - admiral lewis lichtenstein strauss - - - - - - status - progress author process - mark - - raider date - october - language - description undetermined script - description code - undetermined script language - description note description - - english - edition statement - version - derive - llstraussxml revision statement april - - convert - ead - - revise - llstraussxml - tanya elder - remove deprecate element - attribute - update repository code - add language code - change doctype declaration - etc - january - - entity remove - ead find aid - repository detail repository detail part - - american jewish historical society repository http - ajhsorg contact - - west - street new york - - united state inquiry - cjhorg
3                                                 title guide - - meyer greenberg paper undated - - - status complete author find aid - create - rachel alexandra tutera date - - description rule describe archive - - content standard language - description english script - description latin language - description note description - - english - sponsor - part - - leon levy archival processing initiative - make possible - - leon levy foundation - - collection - process - rachel alexandra tutera - - assistance - katie rovanpera - revision statement june - - ehyman - postaspace migration cleanup - repository detail repository detail part - - american jewish historical society repository http - ajhsorg contact - - west - street new york - - united state inquiry - cjhorg
4  title guide - - paper - louis lipsky - - - - undated - - - status - progress author process - louise sandberg date - november - language - description undetermined script - description code - undetermined script language - description note description - - english - edition statement - version - derive - louislipskyxml revision statement april - - convert - ead - - revise - louislipskyxml - tanya elder - remove deprecate element - attribute - update repository code - add language code - change doctype declaration - remove boilerplate entity - etc - january - - entity remove - ead find aid - repository detail repository detail part - - american jewish historical society repository http - ajhsorg contact - - west - street new york - - united state inquiry - cjhorg
Group by lemmatized words, add count and sort:
Get just the first row in each lemmatized group
In [8]:
print("df_words.head(10):")
print(df_words.head(10))
df_words.head(10):
           lem  index        token        stem pos  counts
0  description     14  description    descript  NN     217
1   repository     24   repository  repositori  NN     182
2     language     13     language     languag  NN     122
3       detail     25      details      detail  NN     120
4      english     15      english     english  JJ     116
5          aid    166          aid         aid  NN     106
6         find    165      finding        find  VB     106
7         part     28         part        part  NN      67
8       jewish     30       jewish      jewish  NN      65
9       script     16       script      script  NN      64
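The two steps printed above (group by lemma, count and sort, then keep the first row of each group) can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not the code inside stopStemLem; the (token, lemma) pairs are made up.

```python
from collections import Counter

# Hypothetical (token, lemma) pairs, standing in for the tokenized finding aids.
pairs = [
    ("details", "detail"),
    ("description", "description"),
    ("finding", "find"),
    ("detail", "detail"),
    ("description", "description"),
    ("description", "description"),
]

# Count how often each lemma occurs.
counts = Counter(lemma for _, lemma in pairs)

# Remember the first token seen for each lemma (the "first row" of its group).
first_token = {}
for token, lemma in pairs:
    first_token.setdefault(lemma, token)

# Sort lemmas by descending count.
rows = [(lemma, first_token[lemma], n) for lemma, n in counts.most_common()]
for row in rows:
    print(row)
```

Keeping only the first token per lemma is why the table shows one representative token (for example "details") next to each lemma ("detail").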

Frequency of Lemmatized Words Grouped by Parts of Speech.

In [9]:
#hide-input
df_words.head(50)
Out[9]:
               lem  index           token            stem  pos  counts
 0     description     14     description        descript   NN     217
 1      repository     24      repository      repositori   NN     182
 2        language     13        language         languag   NN     122
 3          detail     25         details          detail   NN     120
 4         english     15         english         english   JJ     116
 5             aid    166             aid             aid   NN     106
 6            find    165         finding            find   VB     106
 7            part     28            part            part   NN      67
 8          jewish     30          jewish          jewish   NN      65
 9          script     16          script          script   NN      64
10             new     39             new             new   JJ      63
11            york     40            york            york   NN      63
12            note     21            note            note   NN      62
13          united     41          united            unit   JJ      61
14      historical     31      historical          histor   JJ      61
15         society     32         society         societi   NN      61
16          cjhorg     44          cjhorg          cjhorg   NN      60
17          street     38          street          street   NN      60
18          status      6          status           statu   NN      60
19            date     12            date            date   NN      60
20          author      8          author          author   NN      60
21           title      0           title            titl   NN      60
22            west     37            west            west   NN      60
23         contact     36         contact         contact   NN      60
24        american     29        american        american   JJ      60
25         inquiry     43       inquiries         inquiri   NN      60
26         ajhsorg     35         ajhsorg         ajhsorg   NN      60
27            http     34            http            http   NN      60
28           state     42          states           state   NN      60
29           guide      1           guide            guid   NN      59
30           latin     18           latin           latin   NN      56
31           write    605         written         written   VB      50
32          create    199         created           creat   VB      48
33  marceadajhsxsl    783  marceadajhsxsl  marceadajhsxsl   NN      47
34       statement     73       statement       statement   NN      37
35        progress      7        progress        progress   NN      33
36        revision     77        revision           revis   NN      32
37           paper    107          papers           paper   NN      30
38         undated     51         undated           undat   JJ      28
39         archive    207        archives          archiv   NN      28
40        standard    209        standard        standard   NN      27
41         content    208         content         content   NN      27
42        describe    206      describing         describ   VB      27
43        complete    195       completed         complet   VB      27
44         cleanup    247         cleanup         cleanup   NN      27
45            rule    205           rules            rule   NN      27
46       migration    246       migration          migrat   NN      27
47          ehyman    244          ehyman          ehyman   NN      26
48      postaspace    245      postaspace       postaspac   NN      26
49           april    140           april           april   NN      26

Top 10 words per Part Of Speech (POS)

In [10]:
df_words = df_words[['lem', 'pos', 'counts']].head(200)
dfList_pos = format_stopstemlem(df_words)
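As a rough picture of what splitting the table by part of speech involves, here is a minimal sketch that groups rows by POS tag and keeps the most frequent lemmas of each. The function top_n_per_pos is hypothetical and is not the implementation of format_stopstemlem; the rows are a small sample taken from the frequency table.

```python
from collections import defaultdict

# A few (lem, pos, counts) rows copied from the frequency table.
rows = [
    ("description", "NN", 217),
    ("repository", "NN", 182),
    ("english", "JJ", 116),
    ("find", "VB", 106),
    ("new", "JJ", 63),
    ("write", "VB", 50),
]


def top_n_per_pos(rows, n=10):
    """Group rows by POS tag and keep the n most frequent lemmas of each."""
    groups = defaultdict(list)
    for lem, pos, count in rows:
        groups[pos].append((lem, count))
    return {
        pos: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for pos, items in groups.items()
    }


top = top_n_per_pos(rows, n=2)
print(top["NN"])
print(top["JJ"])
print(top["VB"])
```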

Nouns

In [11]:
#hide-input
dfList_pos[0]
Out[11]:
   index          lem  pos  counts
0      0  description   NN     217
1      1   repository   NN     182
2      2     language   NN     122
3      3       detail   NN     120
4      5          aid   NN     106
5      7         part   NN      67
6      8       jewish   NN      65
7      9       script   NN      64
8     11         york   NN      63
9     12         note   NN      62

Adjectives

In [12]:
dfList_pos[1]
Out[12]:
   index           lem  pos  counts
0      4       english   JJ     116
1     10           new   JJ      63
2     13        united   JJ      61
3     14    historical   JJ      61
4     24      american   JJ      60
5     38       undated   JJ      28
6     56  consolidated   JJ      10
7     58         mixed   JJ      10
8     63      physical   JJ      10
9     66  undetermined   JJ       8

Verbs

In [13]:
dfList_pos[2]
Out[13]:
   index       lem  pos  counts
0      6      find   VB     106
1     31     write   VB      50
2     32    create   VB      48
3     42  describe   VB      27
4     43  complete   VB      27
5     57   process   VB      10
6     68      make   VB       8
7     71       add   VB       6
8     73    remove   VB       6
9     78    derive   VB       5

Adverbs

In [14]:
dfList_pos[3]
Out[14]:
index lem pos counts

Frequency plot grouped by POS type

In [17]:
# Keep lemmas that occur more than once, most frequent first.
source = df_words[df_words.counts>1].sort_values(by=['counts'], ascending=False)
alt.Chart(source).mark_bar(opacity=0.7).encode(
    # Order the bars by descending count (the data has no 'sort_order' field).
    y=alt.Y('lem:N', sort='-x'),
    x=alt.X('counts:Q', stack=None),
    color="pos:N",
)
Out[17]:

Machine Learning Text Generation Model

When parsing PDFs, the most common artifacts are page numbers and words that are stucktogetherlikethis. To handle this and make our training data more robust, we use the wordninja package, which relies on English word frequencies to split run-together words probabilistically. We also remove all numbers that are not spelled out in the text.
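The idea behind frequency-based word splitting can be sketched with a small dynamic program. This is a simplified illustration, not wordninja's actual code: the vocabulary, its frequencies, and the penalty for unknown words are all made-up assumptions.

```python
import math

# Toy vocabulary with made-up frequencies (an assumption; wordninja ships its
# own, much larger, frequency-ranked word list).
WORD_FREQ = {"stuck": 5, "together": 20, "like": 50, "this": 80,
             "to": 90, "get": 40, "her": 30}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def word_cost(word):
    """Lower cost for more frequent words; unknown strings are very expensive."""
    if word in WORD_FREQ:
        return -math.log(WORD_FREQ[word] / TOTAL)
    return 10.0 * len(word)


def split_words(text):
    """Split text into the cheapest sequence of words using dynamic programming."""
    # best[i] = (cost of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        candidates = [
            (best[start][0] + word_cost(text[start:end]), start)
            for start in range(max(0, end - MAX_WORD_LEN), end)
        ]
        best.append(min(candidates))
    # Walk backwards through the stored start indices to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


result = split_words("stucktogetherlikethis")
print(result)
```

Because each word's cost is the negative log of its relative frequency, one moderately common word ("together") beats a chain of shorter ones ("to", "get", "her"), whose costs add up.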

In [18]:
#collapse-hide
clean_text_for_training = clean_plain_text_for_training(stringV)
file = " ".join(clean_text_for_training)
60
In [19]:
x_data, X, y, chars = find_patterns(file)
Total number of characters: 32060
Total vocab: 37
Total Patterns: 31960
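The counts above are consistent with a sliding window of 100 characters, since 32060 - 100 = 31960. The sketch below shows how such patterns could be built; the function build_patterns and the window length are assumptions inferred from those numbers, not the source of find_patterns, and the sample text is synthetic.

```python
def build_patterns(text, seq_length=100):
    """Return (inputs, targets, chars): each input is seq_length character ids
    and each target is the id of the character that follows it."""
    chars = sorted(set(text))  # the character vocabulary
    char_to_id = {c: i for i, c in enumerate(chars)}
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        window = text[i:i + seq_length]
        inputs.append([char_to_id[c] for c in window])
        targets.append(char_to_id[text[i + seq_length]])
    return inputs, targets, chars


# Synthetic text with the same length as the corpus reported above.
sample = ("finding aid written english " * 2000)[:32060]
inputs, targets, chars = build_patterns(sample)
print("Total number of characters:", len(sample))
print("Total vocab:", len(chars))
print("Total Patterns:", len(inputs))
```

Each pattern pairs 100 consecutive characters with the single character that comes next, which is what the model is trained to predict.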

Setting Parameters and Training

JATA comes with a built-in params function, but feel free to override it with the custom function below. Set the my_own_params flag to True to use your own settings!

FYI: training can take some time (sometimes up to an hour with the built-in settings), so grab a snack or take a nap!

Reducing the number of epochs shortens training, but it also reduces the robustness of your output!

In [20]:
my_own_params = False
In [21]:
def set_model_params_custom(X, y):
    # Three stacked LSTM layers, each followed by dropout to limit overfitting.
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    # Softmax over the character vocabulary predicts the next character.
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # Save the weights whenever the training loss improves.
    filepath = "model_weights_saved.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
    desired_callbacks = [checkpoint]
    # Raise epochs for better output at the cost of longer training.
    model.fit(X, y, epochs=1, batch_size=200, callbacks=desired_callbacks)
    return model

import time

# Time the training run with either the custom or the built-in parameters.
start = time.time()
if my_own_params:
  model = set_model_params_custom(X,y)
else:
  model = set_model_params(X,y)
end = time.time()
print("Time Elapsed:")
print(end - start)
Epoch 1/5
160/160 [==============================] - ETA: 0s - loss: 3.0644
Epoch 00001: loss improved from inf to 3.06437, saving model to model_weights_saved.hdf5
160/160 [==============================] - 479s 3s/step - loss: 3.0644
Epoch 2/5
160/160 [==============================] - ETA: 0s - loss: 3.0125
Epoch 00002: loss improved from 3.06437 to 3.01255, saving model to model_weights_saved.hdf5
160/160 [==============================] - 479s 3s/step - loss: 3.0125
Epoch 3/5
160/160 [==============================] - ETA: 0s - loss: 2.9992
Epoch 00003: loss improved from 3.01255 to 2.99917, saving model to model_weights_saved.hdf5
160/160 [==============================] - 479s 3s/step - loss: 2.9992
Epoch 4/5
160/160 [==============================] - ETA: 0s - loss: 2.9298
Epoch 00004: loss improved from 2.99917 to 2.92984, saving model to model_weights_saved.hdf5
160/160 [==============================] - 472s 3s/step - loss: 2.9298
Epoch 5/5
160/160 [==============================] - ETA: 0s - loss: 2.6665
Epoch 00005: loss improved from 2.92984 to 2.66646, saving model to model_weights_saved.hdf5
160/160 [==============================] - 475s 3s/step - loss: 2.6665
Time Elapsed:
2405.342127799988
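The numbers in this log can be checked with a little arithmetic, which also helps estimate how long a run with a different epoch count would take. The batch size of 200 is taken from set_model_params_custom and is only assumed to match the built-in settings; it is consistent with the 160 steps per epoch shown in the log.

```python
import math

total_patterns = 31960   # "Total Patterns" reported by find_patterns
batch_size = 200         # from set_model_params_custom; assumed for the built-in too
epochs = 5               # from the "Epoch 1/5" log lines
seconds_per_epoch = 479  # typical epoch time in the log

# Each step processes one batch, so a partial final batch still counts as a step.
steps_per_epoch = math.ceil(total_patterns / batch_size)
estimated_seconds = epochs * seconds_per_epoch

print("Steps per epoch:", steps_per_epoch)
print("Estimated training time (min):", round(estimated_seconds / 60, 1))
```

The estimate of roughly 40 minutes agrees with the elapsed time of about 2405 seconds reported above.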

Loading the Model and Generating Text

Getting Some Output

In [22]:
filepath = "model_weights_saved.hdf5"
In [24]:
print(what_does_the_robot_say(x_data,model, chars,filepath))
ry details part american jewish historical society repository http j hs org contact  west  th st
s   p  status completed author finding aid created marc  ead j hs xsl date  descripti
 papers undated    p  status progress author finding aid created marc  ead j hs xsl 
ety repository http j hs org contact  west  th street new york ny  united states inquiries 
eanup physical storage information container consolidated box p  folder p  mixed materials repos
uthor finding aid michael mont albano part cj h holocaust resource initiative made possible conferen
ing aid created marc  ead j hs xsl date  language description english script description latin 
ndated   p  status completed author processed yakov ill ich sk lar date  description 
scription english script description latin language description note finding aid written english rev
  status completed author finding aid created marc  ead j hs xsl date  description rules
pt description latin language description note finding aid written english repository details reposi
up repository details repository details part american jewish historical society repository http j h
ry details part american jewish historical society repository http j hs org contact west th sts p status completed author finding aid created marc ead j hs xsl date descripti papers undated p status progress author finding aid created marc ead j hs xsl ety repository http j hs org contact west th street new york ny united states inquiries eanup physical storage information container consolidated box p folder p mixed materials reposuthor finding aid michael mont albano part cj h holocaust resource initiative made possible conferening aid created marc ead j hs xsl date language description english script description latin ndated p status completed author processed yakov ill ich sk lar date description scription english script description latin language description note finding aid written english rev status completed author finding aid created marc ead j hs xsl date description rulespt description latin language description note finding aid written english repository details reposiup repository details repository details part american jewish historical society repository http j h
part historical society repository j contact west th p status author finding aid marc j date undated p status progress author finding aid marc j repository j contact west th street new york united physical storage information container consolidated box p folder p mixed finding aid part h holocaust resource initiative made possible aid marc j date language description script description p status author ill ich lar date description scription script description language description note finding aid written rev status author finding aid marc j date description description language description note finding aid written repository repository repository part historical society repository j h
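For intuition about how a character-level model produces text like this, here is a minimal sketch of a generation loop. It is an assumption about the general technique, not the source of what_does_the_robot_say; the function generate and the stand-in predictor are hypothetical and replace the trained model.

```python
def generate(seed_ids, predict_next, length):
    """Repeatedly predict the next character id and slide the window forward."""
    window = list(seed_ids)
    generated = []
    for _ in range(length):
        next_id = predict_next(window)
        generated.append(next_id)
        # Drop the oldest id and append the new one so the window size stays fixed.
        window = window[1:] + [next_id]
    return generated


chars = ["a", "b", "c"]


def cyclic_predictor(window):
    # Stand-in for a trained model: always predicts the id after the last one.
    return (window[-1] + 1) % len(chars)


ids = generate([0, 1, 2], cyclic_predictor, length=5)
text = "".join(chars[i] for i in ids)
print(text)
```

With a real model, predict_next would pick a character from the softmax output, for example the most probable one, so each generated character depends on the window of characters that came before it.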

Play around with the training data and model params until you find your desired output!