Extension Projects

Extension layers are scholar-created datasets, with accompanying documentation, that enrich or augment library collections data (i.e., source data and base layers). They may improve the quality of the metadata in the source data and base layers, contribute data missing from library collections data, or introduce new elements or data points. They need not take the form of spreadsheet files, or only that form; for example, an extension layer might include data visualizations. Below is a listing of extension projects, by collection, as of August 2021.

American Left Ephemera Collection

Geotags Layer – Van Schenck

Creator: Reed Van Schenck (Email: crv18@pitt.edu; Twitter: Reed Van Schenck)

Last modified: August 4, 2021

Description: This extension layer contributes latitude, longitude, and address data for all of the events located in Pittsburgh and its surrounding metropolitan area that appear in items within the American Left Ephemera Collection.

View project »

Bob Nelkin Collection of Allegheny County Chapter of the Pennsylvania Association for Retarded Children (ACC-PARC) Records

Natural Language Processing Layer – Naismith

Creator: Ben Naismith (Email: bnaismith@pitt.edu)

Last modified: July 31, 2021

Description: This project offers a natural language processing extension layer for the Bob Nelkin Collection of Allegheny County Chapter of the Pennsylvania Association for Retarded Children (ACC-PARC) Records. There are five sections which may be of use to researchers interested in the collection:

Exploratory Data Analysis (EDA): Exploratory data analysis provides a standard first step in any data exploration and corpus analysis. In this case, the EDA looks at the contents of the source-data and base-layers folders to better understand the quantity and types of data present in the collection.
Pre-processing: Pre-processing is carried out to create a single dataframe with the texts and their metadata, with standardized fields and no missing data.
Processing: Processing involves manipulating the text into formats which may be of use to researchers and allows for greater analysis. This notebook carries out the following processes: tokenization, part-of-speech tagging, lemmatization, spelling correction, and genre tagging.
Text analysis: Text analysis in this case refers to the use of machine learning tools to extract information about the texts in terms of entities, topics, and sentiment. All of the text analysis tools use APIs from meaningcloud.com. This information can be used to select only those texts related to certain topics or containing certain sentiments.
Lexical analysis: The tools and data in this notebook are intended to allow for a greater understanding of the lexis used in the collection's texts through consideration of frequencies of lexical items and the contexts in which they occur. There are two sections to the notebook, Concordancing and Collocations.
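
To make the two lexical analysis techniques concrete, here is a minimal sketch of concordancing (keyword in context) and window-based collocate counting using only the Python standard library. It is not the notebook's own code: the function names, the regex tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simple regex tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, window=3):
    """Return keyword-in-context lines: up to `window` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample sentence, not taken from the collection.
sample = "The committee visited the school. The school board met the committee."
tokens = tokenize(sample)
kwic = concordance(tokens, "school", window=2)
top = collocates(tokens, "school", window=2).most_common(3)

for line in kwic:
    print(line)
print(top)
```

A fixed token window is the simplest way to define a collocate; a fuller analysis would typically also filter out function words or apply an association measure rather than relying on raw counts.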

View project »

OCR Layer – Naismith

Creator: Ben Naismith (Email: bnaismith@pitt.edu)

Last modified: July 31, 2021

Description: This extension layer contains text files for each of the texts in the Bob Nelkin Collection of Allegheny County Chapter of the Pennsylvania Association for Retarded Children (ACC-PARC) Records. The original source data OCR folder contained a number of blank files, so all PDF files were converted to text files again using the Optical Character Recognition (OCR) workstation at Pitt, which uses ABBYY FineReader 14 to convert images to text. These new text files are used throughout the Natural Language Processing Layer – Naismith.

View project »

Part-of-Speech Tags Layer – Naismith

Creator: Ben Naismith (Email: bnaismith@pitt.edu)

Last modified: July 31, 2021

Description: This extension layer contains part-of-speech (POS) tags for the Bob Nelkin Collection of Allegheny County Chapter of the Pennsylvania Association for Retarded Children (ACC-PARC) Records. The pos-tagged folder contains each of the text files from the collection after tokenization and POS tagging using the CLAWS7 tagset (see the bob-nelkin-collection_processing.ipynb notebook for details).
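
As a rough illustration of how POS-tagged text can be consumed, the sketch below parses tokens written as word_TAG and tallies the tags. The word_TAG layout is an assumption about the files in the pos-tagged folder, not something confirmed here, and the sample line and the function name parse_tagged are invented for the example.

```python
from collections import Counter

def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs; the format is an assumption."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last underscore, so words containing "_" survive.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens with no underscore separator
            pairs.append((word, tag))
    return pairs

# Invented example line using CLAWS7-style tag labels.
line = "The_AT committee_NN1 visited_VVD the_AT school_NN1"
pairs = parse_tagged(line)

# Tally tags, then pull out common nouns (tags beginning with "NN").
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(tag_counts)
print(nouns)
```

If the actual files use a different layout (for example, one token per line), only parse_tagged would need to change; the counting and filtering steps stay the same.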

View project »

Communist Collection of A.E. Forbes

Geotags Layer – Van Schenck

Creator: Reed Van Schenck (Email: crv18@pitt.edu; Twitter: Reed Van Schenck)

Last modified: August 4, 2021

Description: This extension layer contributes latitude and longitude coordinates as well as modern-day street addresses corresponding to events held in and around Pittsburgh that are mentioned in items within the Communist Collection of A.E. Forbes. This data helps researchers visualize the Communist network in Pittsburgh and identify neighborhoods with greater or lesser Communist activity. This layer also includes a "Precision" field that notes discrepancies in the locational data and ranges of confidence; entries with no precision data are high confidence.
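
The convention that an empty "Precision" field means high confidence can be applied when filtering the data. The following is a minimal sketch under stated assumptions: the layer is read as CSV, and every column name other than "Precision" is hypothetical, as are the two sample rows and the function name split_by_confidence.

```python
import csv
import io

# Invented rows; column names other than "Precision" are hypothetical.
SAMPLE = """title,latitude,longitude,Precision
Event A,40.4406,-79.9959,
Event B,40.4443,-79.9532,Approximate; block known but not street number
"""

def split_by_confidence(csv_text):
    """Separate rows with an empty Precision field (high confidence) from flagged rows."""
    high, flagged = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        note = (row.get("Precision") or "").strip()
        (flagged if note else high).append(row)
    return high, flagged

high, flagged = split_by_confidence(SAMPLE)
print(len(high), "high-confidence;", len(flagged), "flagged")
```

Keeping the flagged rows in a separate list, rather than discarding them, lets a researcher map the high-confidence points first and review the noted discrepancies individually.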

View project »

Event Location Data Layer – Stockton

Creator: Moira J. Stockton (Email: mas850@pitt.edu)

Last modified: November 9, 2021

Description: This extension layer describes a subset of the fliers and pamphlets from the Communist Collection of A.E. Forbes, containing information for 55 items that advertised events within the city of Pittsburgh. The data provides the ID and title of each item; creators, dates, and abstracts from ULS Digital Collections; and the street address, latitude, and longitude of each event location, with a note on the current status of the location (e.g., still standing).

View project »

Medieval and Early Modern Facsimiles and Original Materials at the University of Pittsburgh – Wipf

Enrichment Layer – Wipf

Creator: Briana J. Wipf (Email: brianawipf@gmail.com)

Last modified: September 1, 2021

Description: This extension layer contributes enrichments to the base layer of the Medieval and Early Modern Facsimiles and Original Materials at the University of Pittsburgh – Wipf collection. There are two datasets in this extension layer: the medical dataset, which describes 590 of the 1,239 items in the collection, and the nonmedical dataset, which describes 520 items:

Medical dataset: The medical dataset includes data that standardizes or supplements values for the publication_place, language, and publication_date elements. Empty and pre-existing values for these three elements have been replaced with more accurate and consistent data. Data for the other 25 elements remain the same. See the About the Data section for more information.
Nonmedical dataset: The nonmedical dataset includes data that standardizes or supplements values for the language, geographic_coverage, and temporal_coverage elements. Empty and pre-existing values for these three elements have been replaced with more accurate and consistent data. Data for the other 25 elements remain the same. See the About the Data section for more information.

View project »