The Big Picture - Has CrossRef metadata completeness improved?

I recently introduced a simple visualization of data from the CrossRef Participation Reports that provides quantitative insight into how the completeness of CrossRef metadata collections, measured against eleven key metadata elements, has changed between the backfile and the current record collections. I showed examples of how these numbers can raise awareness of specific areas in which collections are relatively complete and areas where completeness could be improved. An important goal of that work is to identify members with relatively complete metadata that can serve as examples for helping other members understand the benefits of improved metadata.

Before we explore and compare individual cases, it is useful to establish the big picture as background to keep in mind as we look at specifics. Continuing to focus on journal articles, as described in the earlier blog, I increased my sample size to include 8670 collections in the current time period and 7293 collections in the backfile, essentially all CrossRef members. This is an increase of almost 20% in the number of collections contributed to and managed by CrossRef, which is great. The next question is: are the more recent metadata more complete with respect to the elements included in the Participation Reports?

The first look at this comes from a standard metadata evolution plot for all journal-article collections. This plot shows the average completeness for eleven metadata elements during the backfile (orange) and current (blue) time periods. The first observation is that, on average, CrossRef collections are not very complete with respect to these metadata elements. The average completeness in the current period is 18%, up from 13% during the backfile. Note that the scale maximum in this plot is 50% rather than 100%, as it has been in plots in previous posts.

Average completeness % for all CrossRef Journal-Article collections during the backfile (orange) and current (blue) time periods. The difference between these two time periods, i.e. the change in completeness, is positive for all metadata elements.

The data indicate that Similarity Checking is the most commonly used metadata element (of the eleven included) during the backfile (31%) and the current (37%) time periods and that Open References increased the most (13%) between the two time periods. Of the remaining elements, most show increases of 5-8%, while Award Numbers, Funders, and Update Policies show increases of 1-2%, and References shows no increase.
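For readers who want to reproduce this kind of comparison, here is a minimal sketch of the tabulation. Only the Similarity Checking and Open References values come from the text above; the other elements and numbers are illustrative placeholders, not the actual Participation Report data.

```python
# Sketch of the completeness comparison. Only the Similarity Checking and
# Open References values below come from the text above; the other elements
# and values are illustrative placeholders. Each number is the average
# completeness (%) of an element across all journal-article collections.
backfile = {
    "Similarity Checking": 31,
    "Open References": 7,
    "ORCIDs": 10,
    "Licenses": 12,
    "Funders": 5,
}
current = {
    "Similarity Checking": 37,
    "Open References": 20,
    "ORCIDs": 16,
    "Licenses": 19,
    "Funders": 7,
}

# Change in completeness for each element between the two time periods.
change = {element: current[element] - backfile[element] for element in backfile}

# Overall average completeness in each period.
avg_backfile = sum(backfile.values()) / len(backfile)
avg_current = sum(current.values()) / len(current)

for element, delta in sorted(change.items(), key=lambda item: -item[1]):
    print(f"{element:20s} {backfile[element]:3d}% -> {current[element]:3d}%  (+{delta}%)")
print(f"average: {avg_backfile:.0f}% -> {avg_current:.0f}%")
```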

A second way to look at these data is to determine the % of collections that have any content for each of the metadata elements included in the Participation Reports during the two time periods. As in the earlier case, Similarity Checking occurs in more collections (43%) than any other metadata element in both time periods. The number of collections with Reference metadata is also high, but it decreased in the current period.

The % of CrossRef collections that include each metadata element during the backfile (orange) and current (blue) time periods. Note that this percentage has increased for all metadata elements except References. While the percentage of collections with References has decreased by 7%, the percentage with Open References has increased by 13%, indicating a migration of collections toward open references.

Another question of interest is: how many collections were almost complete (>90%) with respect to these elements in the two time periods? The figure below shows numbers that address this question. The numbers are, of course, smaller, and in this case the current period (blue) has more almost-complete collections for all elements. The largest increase in the % of almost-complete collections is 13%, for Open References.

The % of CrossRef collections that are almost complete (90+%) during the backfile (orange) and current (blue) time periods. The % of Open References shows the largest increase (13%).

These last two datasets can be combined to answer the question: how likely is it that a collection that includes a particular element has that element in all of the records in the collection? These data are shown in the final figure. The light blue region shows the current % of collections with content for each element, and the dark blue region shows the % of collections that are almost complete for each element. If these two numbers are close, it means that when a collection includes an element, the element is present in most of the records in the collection. This is essentially true for Open References and means that a metadata provider that provides open references tends to include them in all records.

A comparison of the % of collections that currently have content (light blue) and those that are almost complete (dark blue) for each element. When these two numbers are close (as in the Open References case), it indicates that collections that include an element include it in almost all of their records.
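A minimal sketch of how these two views could be computed from per-collection completeness values, using invented numbers and the 90% threshold described above:

```python
# Sketch: given the per-collection completeness (%) of one element, compute the
# share of collections with any content and the share that are almost complete
# (>= 90%). The per-collection values below are invented for illustration.
open_references = [0, 0, 100, 95, 0, 88, 100, 0, 97, 100]

def share_with_content(values):
    """% of collections in which the element occurs at all."""
    return 100 * sum(v > 0 for v in values) / len(values)

def share_almost_complete(values, threshold=90):
    """% of collections with the element in at least threshold% of their records."""
    return 100 * sum(v >= threshold for v in values) / len(values)

with_content = share_with_content(open_references)
almost_complete = share_almost_complete(open_references)

# When these two numbers are close, collections that include the element
# tend to include it in almost all of their records.
print(f"with content: {with_content:.0f}%  almost complete: {almost_complete:.0f}%")
```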

Conclusions

Overall, these data show that CrossRef metadata have improved and that there is room for much larger improvements in the future.

CrossRef introduced the CrossRef Participation Reports saying “Membership of Crossref is not just about getting a persistent identifier for your content, it’s about placing your content in context by providing as much metadata as possible and looking after it long-term”. A presentation by Stephanie Dawson from ScienceOpen showed that “articles with additional metadata had much higher average views than those without - depositing richer metadata helps you get the best value from your DOIs”. The focus of these quantitative reports is clearly on improvements to the CrossRef metadata that increase understanding of the context in which a particular resource exists, i.e. that connect resources into the web of scholarly communication.

The data presented here indicate that most CrossRef collections do not currently include most of the metadata elements measured in the CrossRef Participation Reports, i.e. there are many opportunities for members to increase the value of their CrossRef metadata by adding connections to other kinds of content. The importance of this type of information is highlighted in the FAIR Data Principles, which include “metadata include qualified references to other data”, so members also benefit from increased “FAIRness” of their collections by adding this content. My interest is in measuring those improvements to identify and highlight examples of CrossRef members that are implementing them. Those results are coming up.

Metadata Evolution - Metadata Completeness, Agility, and Collection Size

I recently introduced a simple metric for measuring metadata collection completeness with respect to elements in the CrossRef Participation Reports. The suggestion of this metric immediately led to speculation about relationships between collection size and completeness. Small collections include fewer records – are they more likely to be complete? Publishers with large collections have more resources – do they have more complete metadata? Are smaller publishers more agile - can they change more?


Metadata for Using and Understanding Software

All scientific communities have been linking research together for many years using references to related work in articles. More recently, these communities have been exploring options for linking to datasets and software. As part of this effort, the CodeMeta Project recently proposed a vocabulary for metadata for code based on schema.org.

A mapping between the codeMeta vocabulary and the ISO 19115-1 metadata standard was recently added to the codeMeta Git Repository. Creating this mapping surfaced some interesting differences between the two approaches, and an interesting similarity also emerged. The codeMeta vocabulary has been mapped to over twenty metadata dialects, listed along the bottom of the Figure below. On average, these dialects map 11.2 of the 68 codeMeta terms. The ISO mapping, shown near the left edge of the Figure, includes 64 of the 68 terms. This indicates that the codeMeta and ISO dialects are more similar to each other than codeMeta is to many of the other dialects.
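A rough sketch of how the per-dialect counts and the average could be derived from a crosswalk table; the term names and dialect mappings below are placeholders, not the actual crosswalk:

```python
# Sketch: count how many of the 68 codeMeta terms each dialect maps, given a
# crosswalk of {dialect: set of mapped codeMeta terms}. The term names and
# dialect mappings below are placeholders, not the actual crosswalk.
codemeta_terms = {f"term_{i:02d}" for i in range(68)}   # stand-ins for the 68 codeMeta terms
ordered_terms = sorted(codemeta_terms)

crosswalk = {
    "ISO 19115-1": set(ordered_terms[:64]),   # the ISO mapping covers 64 of 68 terms
    "Dialect A": set(ordered_terms[:12]),     # typical citation-focused dialect
    "Dialect B": set(ordered_terms[:9]),
}

counts = {dialect: len(terms & codemeta_terms) for dialect, terms in crosswalk.items()}
average = sum(counts.values()) / len(counts)   # compare with the reported 11.2 average

for dialect, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{dialect:12s} maps {n} of {len(codemeta_terms)} codeMeta terms")
print(f"average across these dialects: {average:.1f}")
```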

The difference likely reflects the fact that most of these dialects focus primarily on citation and dependency identification while ISO and codeMeta include metadata that supports use and understanding of data and software. This broader scope requires more metadata concepts, many of which codeMeta and ISO share. Check out the details in Mapping ISO 19115-1 geographic metadata standards to CodeMeta in PeerJ Computer Science.

This Figure shows the number of codeMeta terms that are included in mappings to many dialects. The red line shows the average number for many of these dialects (11.2). The two bars on the left show codeMeta and ISO 19115-1. The ISO mapping includes sixty-four of the sixty-eight codeMeta terms. This similarity reflects the fact that both of these dialects include metadata to help users use and understand software as well as cite it.


Dryad Data Packages and Files

California Digital Library (CDL) and Dryad recently announced a partnership to address researcher needs and to “move the needle” on digital data adoption throughout the scholarly research community by working together to understand and respond to researcher needs for high-quality scientific data publishing infrastructures. Metadata that supports accessibility to and understanding of published datasets plays a critical role in driving this adoption.

I have been working with CDL’s University of California Curation Center (UC3) on Metadata 2020 and this new partnership provides a great opportunity to apply some of the ideas developed there to a challenging real-world metadata evaluation, guidance and improvement problem. This initial post explores current Dryad metadata models and usage to lay a foundation for understanding these metadata and forming data-driven recommendations for improvements in the future.

Dryad Metadata Model

The Dryad Metadata Model describes two types of objects: dataFiles and the dataPackages that contain them. This hierarchy is common in other metadata models as well, where the two levels are sometimes called granule and collection metadata. All of these models include terms that are shared by both types of objects and others that are unique to one or the other.

Curated examples of how Dryad metadata are used are available on the Dryad Wiki. Existing Dryad metadata provide a rich source for understanding these metadata and how they are used in real life. To understand this usage, we examined a random sample of metadata for 114 dataPackages and 335 dataFiles from Dryad. Hopefully this sample is large enough to identify similarities and differences.

The Dryad metadata terms naturally divide into four groups, shown in the Table below along with the % of records of each type (dataPackage = blue, dataFile = orange) that include each term. For example, the first group (DataFile terms) includes four terms (isPartOf, provenance, rights, and format) listed on the X axis. Three of these are required (marked by *) and two are repeatable (marked by +). The term isPartOf occurs in 100% of the dataFile metadata records (shown in orange) and 0% of the dataPackage records (shown in blue). This usage is expected, as this term identifies the dataPackage that a dataFile is part of.

DataFile terms: Four terms that pertain only to dataFiles.

Three of these (isPartOf, provenance, and rights) are required (marked with *) and occur in virtually all dataFile records. They contain the identifier of the dataPackage that the dataFile is part of (typically a DOI), the state of the dataFile in the Dryad workflow, and the license for the dataPackage (all Creative Commons Zero, CC0).

The format term is optional and occurs in 86% of the dataFile records. It typically contains the size of the dataFile in bytes.

Shared terms: Five optional terms that are used in roughly the same number of dataPackages and dataFiles to provide information about several types of keywords (time period, spatial location, scientificNames, and subject).

The similarity of these numbers reflects the fact that dataPackages aggregate the keywords from the dataFiles that they contain.

DataPackage terms: Five terms, three of which are required, that are used in essentially all dataPackages and varying numbers of dataFiles to provide submission history and descriptions and to connect dataPackages to dataFiles (hasPart) and to published papers (references).

All Files and Packages: Four required terms and the XML schemaLocation that provide basic discovery and access information (title, creator, identifier) and give the types for all dataPackages and dataFiles.
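A small sketch of how the usage percentages behind these groups could be computed from the sampled records; the record contents below are hypothetical stand-ins:

```python
from collections import Counter

# Sketch: compute the % of dataPackage and dataFile records that include each
# metadata term, as in the groups above. The sample records are hypothetical
# stand-ins, each represented as the set of terms it contains.
data_packages = [
    {"title", "creator", "identifier", "type", "hasPart", "references", "subject"},
    {"title", "creator", "identifier", "type", "hasPart", "description"},
]
data_files = [
    {"title", "creator", "identifier", "type", "isPartOf", "provenance", "rights", "format"},
    {"title", "creator", "identifier", "type", "isPartOf", "provenance", "rights"},
]

def term_usage(records):
    """% of records in which each term occurs at least once."""
    counts = Counter(term for record in records for term in record)
    return {term: 100 * n / len(records) for term, n in counts.items()}

package_usage = term_usage(data_packages)
file_usage = term_usage(data_files)

for term in sorted(set(package_usage) | set(file_usage)):
    print(f"{term:12s} dataPackages: {package_usage.get(term, 0):5.1f}%   "
          f"dataFiles: {file_usage.get(term, 0):5.1f}%")
```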

Several of these terms can occur more than once in a single metadata record (marked with + in the Table above). The most numerous terms in the dataPackage and dataFile metadata records are creators, averaging ~5 per record. Subject terms come in a close second, averaging just under 5 per record. Most dataPackages in this sample (54%) contain a single dataFile, and 77% contain fewer than three dataFiles. The largest dataPackage in this sample contains 72 dataFiles.
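And a similar sketch, with hypothetical records, of the aggregate statistics for repeatable terms and package sizes described above:

```python
# Sketch of the aggregate statistics above, using hypothetical records in which
# repeatable terms (creator, subject, hasPart) are stored as lists.
sample_packages = [
    {"creator": ["A", "B", "C", "D", "E"], "subject": ["s1", "s2", "s3", "s4"], "hasPart": ["f1"]},
    {"creator": ["F", "G", "H", "I", "J"], "subject": ["s1", "s5", "s6", "s7", "s8"], "hasPart": ["f2", "f3"]},
]

avg_creators = sum(len(p["creator"]) for p in sample_packages) / len(sample_packages)
avg_subjects = sum(len(p["subject"]) for p in sample_packages) / len(sample_packages)
files_per_package = [len(p["hasPart"]) for p in sample_packages]

single_file = 100 * sum(n == 1 for n in files_per_package) / len(files_per_package)
under_three = 100 * sum(n < 3 for n in files_per_package) / len(files_per_package)

print(f"creators per record: {avg_creators:.1f}, subjects per record: {avg_subjects:.1f}")
print(f"packages with one dataFile: {single_file:.0f}%, with fewer than three: {under_three:.0f}%")
print(f"largest package in the sample: {max(files_per_package)} dataFiles")
```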

Collection Homogeneity

I proposed several metrics for characterizing and comparing metadata collections. One of these measures collection homogeneity, calculated as the number of terms that occur in all records divided by the total number of terms used in the collection. The Dryad dataPackage sample contains fourteen terms and nine of them are included in all records, so the homogeneity is 65%. The dataFile sample includes seventeen terms and six are included in all records, so the homogeneity is 35%. This difference may reflect the fact that much of the dataPackage metadata is managed as part of the central repository workflow, while the dataFile metadata are provided by the researchers who use Dryad to publish data.
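A minimal sketch of the homogeneity calculation, using a tiny hypothetical collection:

```python
# Sketch of the homogeneity metric described above: the number of terms present
# in every record divided by the total number of terms used in the collection.
def homogeneity(records):
    """records: a list of sets, each holding the term names used in one metadata record."""
    all_terms = set().union(*records)          # every term used anywhere in the collection
    shared_terms = set.intersection(*records)  # terms present in every record
    return 100 * len(shared_terms) / len(all_terms)

# A tiny hypothetical collection: five terms in total, two in every record.
sample = [
    {"title", "creator", "identifier", "subject"},
    {"title", "creator", "identifier"},
    {"title", "creator", "rights"},
]
print(f"{homogeneity(sample):.0f}%")   # 2 shared / 5 total terms = 40%
```

Applied to the counts reported above (9 of 14 dataPackage terms and 6 of 17 dataFile terms), the same calculation gives approximately the 65% and 35% figures.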

Conclusion

This initial exploration identifies the two types of Dryad metadata, outlines some interesting similarities and differences between them, and helps form a baseline for understanding intended metadata use cases. The metadata were designed to connect elements of information packages that include scientific papers (identified in the references element) and related dataFiles. They provide identifiers and discovery information that includes titles, creators, paper abstracts, and four types of keywords (subject, scientific name, spatial, and temporal), and they track the data publication process (processing dates and provenance). The scientific papers are the primary discovery path for the data and are required for data users to understand details of how the data were collected and processed. This differs from the model used in many repositories, where these details can be included in the dataset metadata and users can access them independently.


Talking and Thinking About Metadata

The idea that the language we use to talk about things shapes the way we think, or can think, about those things has been around since the 1800s and even has a name, the Sapir–Whorf hypothesis, coined in 1954. It was Whorf who said, “Language is not simply a reporting device for experience but a defining framework for it.” Last year Lera Boroditsky discussed a similar idea from the stage at TEDWomen with some nice examples and data from multiple languages and cultures. I have been thinking and writing about a universal documentation language for some time and bring together a couple of those ideas here.

Some of the terms I use to talk about metadata emerged from my metadata evaluation and guidance work with many partners. I described the concept of “metadata dialects” and suggested that many metadata standards are more like dialects of a universal documentation language than they are like separate languages. Some have questioned whether a universal “documentation language” really exists. I admit that it is really a concept that I hope exists rather than a real language described in an unabridged dictionary somewhere.

More recently, I introduced this dialect nomenclature to the Metadata 2020 community of metadata experts that advocate richer, connected, reusable, and open metadata for all research outputs. The terms are slowly creeping into some Metadata 2020 discussions, hopefully helping to build and cross bridges between different communities that are committed to better metadata in all contexts.

Documentation or Metadata?

Many datasets and products are documented using approaches and tools developed by data collectors to support their own analysis and understanding. This documentation exists in notebooks, scientific papers, web pages, user guides, word processing documents, spreadsheets, data dictionaries, PDFs, databases, custom binary and ASCII formats, and almost any other conceivable form, each with associated storage and preservation strategies. This custom, often unstructured, approach may work well for independent investigators or in the confines of a particular laboratory or community, but it makes it difficult for users outside of these small groups to discover, use, and understand the data without consulting its creators.

Metadata are standard and structured documentation.


Metadata, in contrast to documentation, help address discovery, use, and understanding by providing well-defined, structured content. This makes it possible for users to access and quickly understand many aspects of datasets that they did not collect. It also makes it possible to integrate information into discovery and analysis tools, and to provide consistent references from the metadata to external documentation.

Metadata standards provide standard element names and associated structures that can describe a wide variety of digital resources. The definitions and domain values are intended to be sufficiently generic to satisfy the metadata needs of various disciplines. These standards also include references to external documentation and well-defined mechanisms for adding structured information to address specific community needs.
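As a toy illustration (not any particular standard), well-defined element names and structures let an application answer questions that free-form documentation cannot. The element names and values below are hypothetical:

```python
# Toy illustration: structured metadata records with hypothetical element names.
# Because the elements are well defined, an application can query them directly.
records = [
    {"title": "Stream temperature, site A", "creator": ["Lee"],
     "temporal_coverage": "2016/2018", "references": ["https://doi.org/10.xxxx/example-1"]},
    {"title": "Stream temperature, site B", "creator": ["Cruz", "Park"],
     "temporal_coverage": "2019/2021", "references": []},
]

# Find the records that carry a qualified reference to other work.
linked = [record["title"] for record in records if record["references"]]
print(linked)
```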

Another important difference between documentation and metadata is the target audience. Documentation is targeted at humans and relies heavily on our ability to make sense of a variety of unstructured information. Metadata, on the other hand, are typically targeted at applications. Many of these applications facilitate searching metadata and displaying them in ways that support data discovery by humans. As tools mature and, more importantly, the breadth of existing metadata increases, we will see more and more applications creating and using metadata to support more sophisticated metadata- and data-driven discovery, comparisons between multiple datasets, and other analyses.

Of course, the audience is also very important when we create metadata. Humans like descriptions that help them understand the resources being described and citations to more, likely unstructured, information. Applications are generally much more demanding when it comes to consistency and completeness. It is important to consider both audiences when creating and improving metadata.

Note added: It is interesting to see that the word “documentation” has a much longer history than the word “metadata”. Metadata is really the new kid on the block.