%%capture
!pip install JATA
!pip install altair
from Text import *
from CJH import CJH_Archives
import altair as alt
Run the following cell if you want to work with meta-data. If not, skip over it.
collections = CJH_Archives('AJHS').get_meta_data('collections', 1, 2)
Finding aid descriptions are set as the default, but you can pick any column name from the imported data. You may also want to experiment with records data!
# Set initial quotes df from the collections metadata.
df_quotes = collections
li_quotes = df_quotes['Finding Aid & Administrative Information'].tolist()
stringV = li_quotes
print("Number of Rows", len(li_quotes))
a = ' '.join(stringV)          # join every description into one long string
b_meta = wordninja.split(a)    # split fused words into individual tokens
print(len(b_meta))
All possible fields we could analyze:
df_quotes.columns
The step that poses the most issues when analyzing journal articles or academic papers is converting the file from a PDF to plain text. A PDF carries a lot of information on each page beyond the actual text content: think page numbers, citation caveats, margin notes, or tables and graphs.
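If you are curious what that conversion looks like under the hood, here is a minimal sketch using PyPDF2. This is only an illustration: the parse_all_pdfs_in_curr_dir() helper used below handles this for you and likely adds extra cleanup, and PyPDF2 itself is an assumption here, not a JATA dependency.
import glob
import pandas as pd
from PyPDF2 import PdfReader

rows = []
for path in glob.glob('*.pdf'):
    reader = PdfReader(path)
    # Join the raw text of every page; page numbers, headers, and footers
    # come along for the ride and have to be cleaned up afterwards.
    text = ' '.join(page.extract_text() or '' for page in reader.pages)
    rows.append({'File': path, 'Text': text})

example_pdf_df = pd.DataFrame(rows)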
Set the variable in the following cell to True if you want to work with PDFs; if not, leave it be. Load your PDFs into the content folder on the left.
working_with_pdfs = False

if working_with_pdfs:
    fileDF = parse_all_pdfs_in_curr_dir()
    # Set initial quotes df from the parsed PDFs.
    df_quotes = fileDF
    li_quotes = df_quotes['Text'].tolist()
    stringV = li_quotes
    print("Number of Articles", len(li_quotes))
    a = ' '.join(stringV)
    b_pdf = wordninja.split(a)
    print(len(b_pdf))
else:
    pass
First, a note on the difference between stemming and lemmatization:
Stemming: shortening a word with simple regex rules
Lemmatization: finding the root word with linguistic rules (built on top of regex rules)
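For a quick feel for the difference, here is a small comparison using NLTK. This is purely illustrative; the stopStemLem() helper below may use different rules internally, and you may need nltk.download('wordnet') the first time.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['archives', 'studies', 'running']:
    # Stemming chops with rules ('studies' -> 'studi'), while lemmatization
    # maps to a real root word ('studies' -> 'study').
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))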
df_words = stopStemLem(li_quotes)
print("df_words.head(10):")
print(df_words.head(10))
#hide-input
df_words.head(50)
df_words = df_words[['lem', 'pos', 'counts']].head(200)
dfList_pos = format_stopstemlem(df_words)
#hide-input
dfList_pos[0]
dfList_pos[1]
dfList_pos[2]
dfList_pos[3]
source = df_words[df_words.counts > 1].sort_values(by=['counts'], ascending=False)

alt.Chart(source).mark_bar(opacity=0.7).encode(
    y=alt.Y('lem:N', sort=alt.EncodingSortField(field='counts', op='sum', order='descending')),
    x=alt.X('counts:Q', stack=None),
    color='pos:N',
)
When parsing PDFs, the most common things to see are page numbers and words that are stucktogetherlikethis. To handle this, and to make our training data more robust, we use a package called wordninja that uses English corpora and some fancy math to split them up correctly. We also remove all numbers that are not spelled out in the text.
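Here is a rough idea of what that cleanup does. The clean_plain_text_for_training() helper in the next cell handles this (and possibly more) for you, so treat this as a sketch, not a drop-in replacement.
import re
import wordninja

sample = 'page 42 recordsstucktogetherlikethis from 1903'
tokens = wordninja.split(sample)                             # split fused words
tokens = [t for t in tokens if not re.fullmatch(r'\d+', t)]  # drop bare numbers
print(' '.join(tokens))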
#collapse-hide
clean_text_for_training = clean_plain_text_for_training(stringV)
file = " ".join(clean_text_for_training)
x_data, X, y, chars = find_patterns(file)
JATA comes with a built-in params function, but feel free to override it in the custom function below if you like! Set the my_own_params flag to True if you want to use your own settings.
FYI - this can take some time (sometimes up to an hour with the built-in settings), so grab a snack or take a nap!
Reducing the epochs will reduce the time it takes to train, but it will also reduce the robustness of your output!
my_own_params = False

def set_model_params_custom(X, y):
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    filepath = "model_weights_saved.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
    desired_callbacks = [checkpoint]

    model.fit(X, y, epochs=1, batch_size=200, callbacks=desired_callbacks)
    return model
import time

start = time.time()
if my_own_params:
    model = set_model_params_custom(X, y)
else:
    model = set_model_params(X, y)
end = time.time()

print("Time Elapsed:")
print(end - start)
filepath = "model_weights_saved.hdf5"
print(what_does_the_robot_say(x_data,model, chars,filepath))
Play around with the training data and model params until you find your desired output!
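If you want finer control over the generated text than what_does_the_robot_say() offers, a character-level sampling loop usually looks something like the sketch below. It reuses x_data, chars, and model from the cells above, but the exact shapes and mappings inside JATA may differ, so treat it as a starting point rather than a guaranteed drop-in.
import numpy as np

num_to_char = dict(enumerate(chars))                     # assumed index -> character mapping
pattern = list(x_data[np.random.randint(len(x_data))])   # random seed sequence

generated = []
for _ in range(200):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(len(chars))
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))                   # pick the most likely next character
    generated.append(num_to_char[index])
    pattern.append(index)
    pattern = pattern[1:]                                # slide the window forward

print(''.join(generated))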