About CaD@Pitt

From Collection Records to Data Layers:

A Critical Experiment in Collaborative Practice

made possible by Collections as Data: Part to Whole, an Andrew W. Mellon Foundation-funded initiative

Background and Rationale

To what degree can reframing how we think about collections as data change the culture that surrounds the production and distribution of that data? This is the central question of the CaD@Pitt project, a collaborative experiment involving the Archives and Special Collections unit and the Digital Scholarship Services unit in the University Library System (ULS) and the English Department at the University of Pittsburgh. Before taking on this project, the project team members all had experience with a pattern of scholarship that enriches existing library-generated collections data with research-driven, often interpretive, layers of additional data. There is great value in this method, but in practice it is often exceedingly difficult to execute because the workflows and tooling that connect library datastores to researchers' access and activity are lacking. Furthermore, there are significant lost opportunities to "close the circle," whether by directing appropriate researcher-created description back into library records or by having the library support the stewardship and scholarly reach of the newly created research datasets.

This project comes out of a longstanding desire on the part of the project team members to build better collaborative models for analyzing and improving the datasets behind the ULS digital collections. For several years, ULS units, including but not limited to Archives and Special Collections, Cataloguing, and Digital Scholarship Services, have been working on a unified platform for digital collections that would facilitate a range of digital projects. Simultaneously, Lavin's scholarship and teaching have focused on cultural analytics approaches to book historical research questions, especially the role of periodicals in the framing of literary taste and prestige. The members of this project team collectively represent the kinds of stakeholders who might come together on a range of collections as data implementations. Our project foremost serves communities of would-be collaborators by contributing to ongoing discussions about how the overlapping goals of teaching, research, and collections development can work in concert to proliferate and enrich digital materials, especially materials highlighting the voices of underrepresented groups, which are often given inadequate attention for structural or systemic reasons.

Objectives

The CaD@Pitt project aims to increase the visibility and discoverability of library collections, make library collections data accessible for computational use, and enable scholars to extend and enrich collections data with critical, research-driven layers of additional data. This project is targeted toward library workers, scholars (researchers, students, and instructors), and members of the public who are seeking to implement similar endeavors at their own institutions, acquire and generate collections data, or develop or incorporate collections data-centered curricula to teach critical and computationally minded data practices.

Collections

The collections in the CaD@Pitt project currently include digital collections from the Library's archives and general, special, and distinctive collections in the Library's catalog. Our project also enables researchers to curate new collections by selecting items from these collections. While our project is intended to support any and all library collections, it has prioritized those reflecting the perspectives of underrepresented groups. These collections have strong potential to tell as-yet untold stories and can increase institutional commitment to bringing underrepresented histories to light. Our target collections feature the following:

the voices of African and African Diasporic communities, American labor unionists, American left-wing organizations, the LGBTQ community, and feminists;
a diverse array of serials (e.g., journals, magazines, newspapers, newsletters) and ephemera (e.g., broadsides, flyers, cartoons).

Some of these materials have been digitized, such as the Shooting Star Review, a literary magazine that explores African American experiences through literature and art; materials in the American Left Ephemera Collection, a variety of textual and visual materials (including periodicals, photographs, letters, pamphlets, posters, flyers, labels, and pins) that document left-wing organizations in the U.S. as early as 1890 and during the 20th century; the Communist Collection of A.E. Forbes, which contains hand-rendered and mimeographed broadsides, newsletters, and flyers that capture Pittsburgh's radical politics during the 1930s and 1940s; and Fred Wright Cartoons, a collection of thousands of cartoons created between 1939 and 1984 by American labor activist and cartoonist Fred Wright, which reflected the politics and labor issues of the day and were featured in newspapers and union publications internationally. The majority of these materials, however, have not yet been digitized.

The ULS's Archives and Special Collections unit has curated a collection of periodicals and publications that subvert dominant political powers or religious institutions, referred to as the Underground Press. This collection comprises radical publications, student press, socialist press, alternative professional press, LGBTQ press, and feminist press. The library has also collected over a hundred publications that reflect the perspectives of African Americans.

Our project specifically targets (collections of) serials and ephemera because the ULS has many such materials, both digitized and non-digitized, in which the voices of marginalized and underrepresented groups are expressed.¹ These materials are of strong interest to student and faculty researchers looking for inclusive representation because historically, the conditions of their production and reception often made serials and ephemera more desirable venues for specific counterpublics than monograph publication.²

Data Layers

Our project takes a “data layers” approach, which diverges from monolithic data paradigms (i.e., those that treat data as singular, non-interpretive, and exhaustive). Instead, it presents an interface comprising data from multiple sources that vary in encoding scheme, granularity of description, and completeness or richness. This approach liberates data creation and curation from the expectation of perfection or singularity of authority, and it allows data to be enriched, augmented, and interpreted incrementally over time through a layering process. It also provides a practical and low-barrier entry point for scholars to create datasets, or data layers, based on their research priorities and to share those layers. Our project model specifies three types of data layers:

source data: snapshots of library collections metadata files in their original source format;
base layers: curated datasets (CSV files) derived and simplified from the source data layers; and,
extension layers: scholar-created datasets or outputs that enrich/augment library collections data (i.e., source data and base layers).

The base layers are created using a Python program that extracts and transforms source data into a flat data model (i.e., CSV). This data munging/wrangling process involves mapping the base layer elements to their respective nodes in source data XML files (i.e., EAD, MODS, MARCXML, or RELS-EXT/RDF), selecting only these mapped nodes for extraction, inputting (sometimes processed) values into rows and columns, and outputting one or two CSV files for collection-level and/or item-level data.
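The extraction step described above can be sketched as follows. This is a minimal illustration and not the project's actual program: the mapping in `FIELD_MAP`, the column names, the pipe separator for repeated values, and the sample record are all assumptions made for the example. Only the MODS version 3 namespace and the Python standard library calls are real.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by MODS version 3 records.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical mapping from base layer columns to nodes in the source XML.
FIELD_MAP = {
    "title": "mods:titleInfo/mods:title",
    "date_issued": "mods:originInfo/mods:dateIssued",
    "subject": "mods:subject/mods:topic",
}


def mods_to_row(xml_text):
    """Extract only the mapped nodes from one MODS record into a flat dict."""
    root = ET.fromstring(xml_text)
    row = {}
    for column, path in FIELD_MAP.items():
        values = [
            el.text.strip()
            for el in root.findall(path, NS)
            if el.text and el.text.strip()
        ]
        # Assumption: repeated nodes are joined with a pipe so that
        # each record still occupies a single CSV row.
        row[column] = "|".join(values)
    return row


def rows_to_csv(rows):
    """Serialize the flat rows as CSV text with one column per mapped element."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_MAP), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up record for illustration only; not taken from a ULS collection.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example Newsletter</title></titleInfo>
  <originInfo><dateIssued>1935</dateIssued></originInfo>
  <subject><topic>Labor unions</topic></subject>
  <subject><topic>Radicalism</topic></subject>
</mods>"""

csv_text = rows_to_csv([mods_to_row(SAMPLE)])
print(csv_text)
```

Keeping the element-to-column mapping in one dictionary means that adding a column to a base layer is a one-line change, and the same pattern could in principle be repeated for EAD or MARCXML with a different namespace and set of paths.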

The data layers and Python scripts are hosted in the CaD@Pitt Data Layers Repository. Scholars can download data layers and scripts from the repository or clone the repository, request source data and base layers for library collections not already represented in the repository, and share their own extension layers in the repository. Our project serves as a model for how other libraries can share collections data, and our repository can support data layers for collections from other libraries, though local and (inter)national element sets and controlled vocabularies may be used differently across institutions.
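As a rough sketch of how a downloaded base layer and a scholar-created extension layer might be combined, the example below joins two CSV files on a shared identifier. The `id` column, the other column names, and all of the values are hypothetical; the files in the repository may be organized differently.

```python
import csv
import io

# Made-up base layer rows (library-derived description).
BASE_CSV = """id,title,date_issued
item-001,Example Newsletter,1935
item-002,Example Flyer,1941
"""

# Made-up extension layer rows (scholar-created, interpretive description).
EXTENSION_CSV = """id,genre_note
item-001,union organizing appeal
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_layers(base_rows, extension_rows, key="id"):
    """Left-join extension columns onto base rows; unmatched items get blanks."""
    by_key = {row[key]: row for row in extension_rows}
    first = extension_rows[0] if extension_rows else {}
    extra_columns = [column for column in first if column != key]
    combined = []
    for base in base_rows:
        match = by_key.get(base[key], {})
        merged = dict(base)
        for column in extra_columns:
            merged[column] = match.get(column, "")
        combined.append(merged)
    return combined


combined = join_layers(read_rows(BASE_CSV), read_rows(EXTENSION_CSV))
for row in combined:
    print(row)
```

A left join keeps every item in the base layer even when the extension layer describes only some of them, which fits the idea that layers are incremental and need not be complete.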

Instructional Modules

Finally, this project facilitates scholars’ engagement with and enrichment of library collections data through five instructional modules that 1) orient learners to collections as data and as products of curation and 2) teach critical and computationally minded data practices:

Develop a Custom Collection: Create a collection, drawing from any combination of the Library’s general, special, distinctive and/or archival collections;
Design a Layer: Propose a data collection plan for a dataset (extension layer), based on a custom or pre-existing collection, that answers a research question or meets a particular need;
Critique a Layer: Critique the utility, feasibility, and ethicality of an extension layer;
Implement a Layer: Implement an extension layer by entering data into a spreadsheet;
Visualize a Layer: Visualize an extension layer using a visualization tool.

These modules have been designed as a sequence but may be used for individual lessons and otherwise modified to suit varying contexts.

Notes

¹ Serials here refers to any publication issued in successive parts which are intended to be continued indefinitely. We use the term periodicals as a synonym for serials when quoting or paraphrasing scholars who do not distinguish between the two categories.
² There is a preponderance of scholarship on this aspect of serials. Lavin describes some of these connections in his statement of support.