Archive Scraping Tutorial

Learning Goals

  • Understand what archive scraping is
  • Understand how what a computer sees is different from what we see
  • Understand how to pick what to scrape from the AJHS archive
  • Understand what information archive scraping can provide researchers that a regular archive search can’t
  • Practice running a script that scrapes the AJHS or CJH archive in order to gain specific knowledge about the collection

In this advanced tutorial, you will learn about archive scraping and the kinds of data it provides that we don’t normally have access to. Web scraping uses automation to retrieve large numbers of data points that would take a human months or years to collect by hand. In the business world, web scraping is used for things such as price monitoring, market research, and news monitoring. Scholars have recently used web scraping to study the popularity of candidates in the 2012 and 2016 US presidential elections, to analyze women’s presence on YouTube in Spain, and to track COVID-19 cases in India. But how can historians use this tool?

Scraping the AJHS archive allows us to see what is in the archive without having to click through the whole website. In this sense, archive scraping gives viewers a “bird’s-eye view” of the archive and lets them analyze the AJHS website as an artifact in itself. This interactivity creates a dataset that people will be able to analyze further in the Advanced Data Visualization and Textual Analysis Interactivity. Questions one will be able to answer (and visualize) with this dataset include the geographic distribution of the materials in the archive and which keywords (subjects) are most important across the archive.
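
To make the idea concrete, here is a minimal sketch of how scraped records can be turned into a keyword count. It is an illustration only, not the script used in this activity: the HTML in SAMPLE_PAGE, the class names "record", "title", and "subject", and the helper names SubjectParser and count_subjects are all hypothetical, and the real AJHS pages are structured differently. The sketch reads a small page the way a computer does (as tags and text), collects every subject keyword, and tallies how often each one appears.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical markup standing in for one page of archive results.
# The real AJHS pages use different tags and class names.
SAMPLE_PAGE = """
<ul>
  <li class="record"><span class="title">Relief committee minutes</span>
      <span class="subject">Immigration</span><span class="subject">New York</span></li>
  <li class="record"><span class="title">Synagogue ledger</span>
      <span class="subject">New York</span></li>
  <li class="record"><span class="title">Family letters</span>
      <span class="subject">Immigration</span><span class="subject">Poland</span></li>
</ul>
"""


class SubjectParser(HTMLParser):
    """Collect the text inside every <span class="subject"> element."""

    def __init__(self):
        super().__init__()
        self.in_subject = False  # True while we are inside a subject span
        self.subjects = []       # every subject keyword found, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "subject":
            self.in_subject = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_subject = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_subject and text:
            self.subjects.append(text)


def count_subjects(html):
    """Return a Counter mapping each subject keyword to its frequency."""
    parser = SubjectParser()
    parser.feed(html)
    return Counter(parser.subjects)


subject_counts = count_subjects(SAMPLE_PAGE)
print(subject_counts.most_common())
```

On a real archive the same tally would run over thousands of records rather than three, which is what makes the “bird’s-eye view” possible.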

Run the code yourself in Google Colaboratory! Run all cells in order, or use the “Run all” option in the Runtime menu at the top left. This activity should take around 15 minutes.