Link Search Menu Expand Document


Using the Data Layers Repository

The Data Layers Repository hosts primarily collections data, datasets and data extraction/transformation scripts.


This repository comprises the following directories:

  • base-layers: curated datasets derived from library metadata records
  • extension-layers: datasets that extend/enrich base layers
  • source-data: snapshots of library metadata records
  • transformation-scripts: Python scripts for extracting and transforming data from library metadata records

The “base-layers,” “extension-layers” and “source-data” directories each contain subdirectories, organized by collection. Subdirectories under “source-data” contain third-level directories, organized by metadata schema. Subdirectories under “extension-layers” contain third-level directories organized by project.

Data Dictionary

The CaD@Pitt Data Dictionary defines and describes the data layers in this repository, including the project’s related controlled vocabularies and encoding schemes, metadata element sets, and application profiles.

Getting Content from the Repository

Because the Data Layers Repository is published on GitHub, you have different options for accessing its content. You may download the repository contents or clone the entire repository. Determining which method will likely depend on how you plan to use the data and scripts.

Download the Repository Contents

If you are most interested in working with the base layers and/or extension layers that are published in the repository, downloading the repository is the easiest way to do so. From the Data Layers Repository main page, look for the green button, just above the file list, labeled “Code.” Click on this button and then look for the link that reads “Download Zip.” This option will give you a bundled Zip file that will decompress into an exact copy of the directories and files in the GitHub repository. You’ll find base layers in the “base-layers” directory, extension layers in the “extension-layers” directory, and so on.

One detail to note about the download option: it gives you a snapshot of the repository contents at that moment in time. If you are interested in working with any updates or additions to the content in the repository, you will have to repeat the download process again later, and manually merge the new files into your workspace, if necessary.

Clone the Repository

If you are interested in keeping your copy of the repository synced to the source, and/or plan on creating your own version of the entire Data Layers publishing framework within git or GitHub, you may wish to clone the repository.

If you are new to git or GitHub, the term “clone” may sound (understandably) strange. In git and GitHub, a clone is a special copy of a repository. Like a download, it obtains a complete copy of the repository for your local computer. However, unlike a simple download, which is disconnected from the source repository, a clone maintains the connection back to its source. It is then possible to pull down changes from the remote repository and merge them into your local copy, or push changes that you have made locally back to a repository on GitHub.

If you plan to develop your own custom version of the Data Layers framework, rather than only syncing with the CaD@Pitt repository, you may wish to fork the repository, which is similar to a clone but makes a separate personal copy for you to modify.

To clone the repository requires having git or GitHub Desktop installed on your local computer. That configuration is beyond the scope of this documentation, but there is plenty of guidance elsewhere. Once this is configured, you will look for the green “Code” button on the Data Layers Repository, and then choose “Clone” via SSH or GitHub Desktop.

Creating Base Layers

If you would like to extract data from source data (XML files) and transform them into base layers (CSV files), see the Data Extraction and Transformation Workflow documentation for instructions.

This content is available under the Attribution-NonCommercial 4.0 Unported (CC BY-NC 4.0), which provides users the right to "copy and redistribute the material in any medium or format" or "remix, transform, and build upon the material," provided that the usage gives appropriate credit, provides a link to the license, and indicates if changes were made. The NonCommercial aspect of this license means that you may not use the material for any commercial purposes.