California Digital Library (CDL) and Dryad recently announced a partnership to “move the needle” on digital data adoption throughout the scholarly research community by working together to understand and respond to researcher needs for high-quality scientific data publishing infrastructure. Metadata that supports access to and understanding of published datasets plays a critical role in driving this adoption.
I have been working with CDL’s University of California Curation Center (UC3) on Metadata 2020 and this new partnership provides a great opportunity to apply some of the ideas developed there to a challenging real-world metadata evaluation, guidance and improvement problem. This initial post explores current Dryad metadata models and usage to lay a foundation for understanding these metadata and forming data-driven recommendations for improvements in the future.
Dryad Metadata Model
The Dryad Metadata Model describes two types of objects: dataFiles, and dataPackages that contain them. This metadata hierarchy is common in other metadata models as well, sometimes called granule and collection metadata. All of these models include terms that are shared by both types of objects and others that are unique to one or the other.
Curated examples of how Dryad metadata are used are available on the Dryad Wiki. Existing Dryad metadata records also provide a rich source for understanding how the model is used in real life. To understand this usage, we examined a random sample of metadata for 114 dataPackages and 335 dataFiles from Dryad. This sample should be large enough to identify similarities and differences between the two types of records.
The Dryad metadata terms naturally divide into four groups, shown in the Table below along with the percentage of records of each type (dataPackage = blue, dataFile = orange) that include each term, plotted on the Y axis. For example, the first group (Data Files) includes four terms (isPartOf, provenance, rights, and format) listed on the X axis. Three of these are required (marked by *) and two are repeatable (marked by +). The term isPartOf occurs in 100% of the dataFile metadata records (orange) and 0% of the dataPackage records (blue). This usage is expected, as this term identifies the dataPackage that a dataFile is part of.
DataFile terms: Four terms that pertain only to dataFiles.
Three of these (isPartOf, provenance, and rights) are required (marked with *) and occur in virtually all dataFile records. Respectively, they contain the identifier of the dataPackage that the dataFile is part of (typically a DOI), the state of the dataFile in the Dryad workflow, and the license for the dataPackage (all Creative Commons Zero, CC0).
The format term is optional and occurs in 86% of the dataFile records. It typically contains the size of the dataFile in bytes.
Shared terms: Five optional terms that are used in roughly the same number of dataPackages and dataFiles to provide information about several types of keywords (time period, spatial location, scientificNames, and subject).
The similarity of these numbers reflects the fact that dataPackages aggregate the keywords from the dataFiles that they contain.
DataPackage terms: Five terms, three of which are required, that are used in essentially all dataPackages and varying numbers of dataFiles to provide submission history and descriptions and to connect dataPackages to dataFiles (hasPart) and to published papers (references).
All Files and Packages: Four required terms and the XML schemaLocation that provide basic discovery and access information (title, creator, identifier) and give the types for all dataPackages and dataFiles.
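The percentages described for these four groups can be reproduced with a few lines of code. The following is a minimal sketch, assuming each metadata record has already been parsed into a plain list of term names (with repeatable terms appearing more than once). The function name term_coverage and the toy records are invented for illustration and are not Dryad data.

```python
from collections import Counter


def term_coverage(records):
    """Return {term: percentage of records that contain the term at least once}."""
    counts = Counter()
    for record in records:
        # set() makes a repeatable term count only once per record
        counts.update(set(record))
    total = len(records)
    return {term: 100.0 * n / total for term, n in counts.items()}


# Toy dataFile records (hypothetical, for illustration only)
data_files = [
    ["isPartOf", "provenance", "rights", "format", "title"],
    ["isPartOf", "provenance", "rights", "title", "subject", "subject"],
]

coverage = term_coverage(data_files)
for term, pct in sorted(coverage.items()):
    print(f"{term}: {pct:.0f}%")
```

Terms that never occur in a collection are absent from the result, so coverage.get(term, 0.0) is a convenient way to compare dataPackage and dataFile collections term by term.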
Several of these terms can occur more than once in a single metadata record (marked with + in the Table above). The most numerous terms in the dataPackage and dataFile metadata records are creators, averaging ~5 / record. Subject terms come in a close second with an average of just under 5 / record. Most dataPackages in this sample (54%) contain a single dataFile and 77% contain fewer than three dataFiles. The maximum number of dataFiles in a dataPackage is 72 in this sample.
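Averages like the ones above come from counting how often a repeatable term appears in each record. Here is a small sketch under the same assumption that a record is a list of term names. The helper mean_occurrences and the sample records are hypothetical and only show the calculation.

```python
def mean_occurrences(records, term):
    """Average number of times `term` appears per record."""
    if not records:
        return 0.0
    return sum(record.count(term) for record in records) / len(records)


# Toy records (hypothetical, for illustration only)
sample = [
    ["title", "creator", "creator", "subject"],
    ["title", "creator", "subject", "subject", "subject"],
]

creators_per_record = mean_occurrences(sample, "creator")
subjects_per_record = mean_occurrences(sample, "subject")
print(creators_per_record, subjects_per_record)
```

The same helper applied to hasPart would give the average number of dataFiles per dataPackage, assuming each dataFile is listed as one hasPart entry.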
I have proposed several metrics for characterizing and comparing metadata collections. One of these measures collection homogeneity, calculated as the number of terms that are in all records / the total number of terms in the collection. The Dryad dataPackage sample contains fourteen terms and nine of them are included in all records, so homogeneity is 9 / 14, or about 64%. The dataFile sample includes seventeen terms and six are included in all records, so homogeneity is 6 / 17, or about 35%. This difference may reflect the fact that much of the dataPackage metadata are managed as part of the central repository workflow while the dataFile metadata are provided by the researchers that use Dryad to publish data.
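The homogeneity metric defined above is straightforward to compute from parsed records. This is a minimal sketch, again assuming each record is a list of term names. The function name homogeneity and the toy collection are assumptions made for illustration, not part of any Dryad tooling.

```python
def homogeneity(records):
    """Terms present in every record / all distinct terms in the collection."""
    term_sets = [set(record) for record in records]
    if not term_sets:
        return 0.0
    all_terms = set().union(*term_sets)         # every term used anywhere
    shared_terms = set.intersection(*term_sets)  # terms used in all records
    if not all_terms:
        return 0.0
    return len(shared_terms) / len(all_terms)


# Toy collection (hypothetical, for illustration only)
collection = [
    ["title", "creator", "subject"],
    ["title", "creator"],
    ["title", "creator", "format"],
]

score = homogeneity(collection)
print(f"homogeneity = {score:.0%}")
```

In this toy collection two of the four distinct terms (title and creator) appear in every record, so the score is one half. A score near 1 indicates that nearly every term is used in every record.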
This initial exploration identifies the two types of Dryad metadata, outlines some interesting similarities and differences between them, and helps form a baseline for understanding intended metadata use cases. The metadata were designed to connect elements of information packages that include scientific papers (identified in the references element) and related dataFiles. They provide identifiers and discovery information that includes titles, creators, paper abstracts, and four types of keywords (subject, scientific name, spatial, and temporal), and they track the data publication process (processing dates and provenance). The scientific papers are the primary discovery path for the data and are required for data users to understand details of how the data were collected and processed. This is different from the model used in many repositories, where these details can be included in the dataset metadata and users can access them independently.