Metadata Archeology: Hunting Affiliations and RORs in DataCite Metadata

In my last blog I introduced Metadata Archeology with a description of digging around in Crossref metadata for affiliations associated with authors of works published in the Dryad Data Repository. I showed how the Crossref Participation Reports can be used for an initial survey of Crossref metadata, much as satellite images are used in the GlobalXplorer project to find ancient sites in Peru.

Many people and organizations are already involved in affiliation archeology. Experiences from the American Physical Society, NASA, Dryad, and many others all indicate that it is a tricky business, or maybe a swamp. DataCite has rich potential for affiliation data. How can we survey that resource and begin to understand that potential?

Satellite image of archeological site in Peru from GlobalXplorer


DataCite Affiliations

The DataCite metadata model includes rich information about over 18 million items with DOIs. The schema includes at least four elements that describe organizations: publisher, creator, contributor, and fundingReference. Each of these elements has interesting characteristics:

  • The publisher element is required and, therefore, occurs in the vast majority of DataCite records. This guarantees lots of data.

  • The creator element is also required, but the affiliation sub-property is optional. Thus, creators exist in the vast majority of records, but how many of those have affiliations?

  • The contributor element is used to give credit to people or organizations that contribute to a research result in many different ways. In this case both the element and its affiliation information are optional. How does that affect the provision of affiliations?

  • The fundingReference element is optional, and it can include a funder identifier with a type. It has been in the DataCite schema since late 2016.

Do these differences affect the nature / quality of the affiliation data in the DataCite records? This is an initial look at those questions. 

The Data

The DataCite repository currently has metadata for ~18,000,000 registered items. That collection is huge and heterogeneous. Way too big for an initial survey! Can we break it into subsets that make sense?

DataCite has ~140 Providers whose holdings range from a single record to almost 2 million. Quite a variation in size! These Providers each manage DOIs for some number of Data Centers. Nearly half of the DataCite Providers include only one Data Center. I expect these Data Center collections to be (at least somewhat) homogeneous, so they are the initial set of collections for this survey. Altogether there are 1432 Data Centers in the DataCite repository. Some are big and some are small.

I used the very useful DataCite sample capability, which selects a random sample from any DataCite query, to draw 100 items from each Data Center. This seems like a very small sample size, but the median Data Center collection size is 113, so, for 687 of the 1411 collections with findable content, a sample of 100 is the complete collection. In other cases, it is a small sample. Hopefully it is representative enough for an initial survey.
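For repeatability, the sampling step can be sketched as a small helper that builds the per-Data-Center query URL. This is a hedged sketch: the `https://api.datacite.org/dois` endpoint and the `random`/`page[size]` parameters are my assumptions about how the DataCite sample capability is exposed, so verify them against the current API documentation before relying on them.

```python
# Build the per-Data-Center sample query for the DataCite REST API.
# ASSUMPTION: the endpoint and the `random`/`page[size]` parameters below
# reflect the sampling capability described in the text -- check the
# current DataCite API documentation before relying on them.
from urllib.parse import urlencode

API_BASE = "https://api.datacite.org/dois"  # assumed endpoint

def sample_query_url(client_id: str, size: int = 100) -> str:
    """Return a query URL for a random sample of `size` records from one Data Center."""
    params = urlencode({"client-id": client_id, "random": "true", "page[size]": size})
    return f"{API_BASE}?{params}"

# e.g. for the zbmed.ifado Data Center mentioned later in this post
url = sample_query_url("zbmed.ifado")
```

Repeating this per Data Center yields the sample described below.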

Altogether my sample included 94,032 records from 1411 Data Center collections.


The vast majority of the collections have one publisher per record, so I ended up with just over 93,000 publisher elements and just over 5,000 unique publishers.

Figure 1. The number of Publishers in DataCite Data Center metadata. Note that a large portion of the Data Centers have very few unique publishers. This will simplify the process of assigning identifiers to these organizations.

The distribution of the number of publishers per Data Center is shown in Figure 1 (note that the bin size along the y-axis is non-linear). The most common case (689) is one publisher per Data Center, with almost 1300 Data Centers having ten or fewer publishers.

The large number of Data Centers with small numbers of publishers is very exciting. It means that the process of transitioning to using persistent identifiers for publishers could be straightforward for many DataCite Data Centers. Almost 700 out of ~1400 Data Centers (49%) only need to know one publisher identifier. In fact, the picture is even better because many of the Data Centers actually have multiple representations of the same publisher in their current metadata. In many cases these multiple representations boil down to simple misspellings, acronyms, or differences in the addresses of organizations included in the affiliation strings. For example, the four publishers from the zbmed.ifado Data Center (shown below) are obviously all the same organization, with small variations in the name:

  1. IfADo - Leibniz Research Centre for Working Environment and Human Factor

  2. IfADo - Leibniz Research Centre for Working Environment and Human Factors

  3. IfADo - Leibniz Research Centre for Working Environment and Human Factors, Dortmund

  4. Leibniz Research Centre for Working Environment and Human Factors, Dortmund

The long-term goal of this work is assigning identifiers to organizations in the scholarly communications community. These data suggest that identifying these inconsistencies and cleaning them up in the process of assigning identifiers will immediately improve connectivity for over 90% of the DataCite Data Centers. In the case shown above, a ROR already exists for Leibniz Research Centre for Working Environment and Human Factors and that is the only ROR that zbmed.ifado needs to know to assign RORs to their 477 records.
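Collapsing name variants like these can be prototyped with nothing more than stdlib string similarity. This is a minimal sketch, not the matching pipeline actually used for this survey: the 0.9 similarity cutoff and the greedy single-linkage grouping are illustrative choices.

```python
# Cluster near-duplicate publisher strings with stdlib string similarity.
# ASSUMPTION: the 0.9 cutoff and greedy single-linkage grouping are
# illustrative, not the method actually used for the survey.
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster(names):
    """Each name joins the first group containing a sufficiently similar member."""
    groups = []
    for name in names:
        for group in groups:
            if any(similar(name, member) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

variants = [
    "IfADo - Leibniz Research Centre for Working Environment and Human Factor",
    "IfADo - Leibniz Research Centre for Working Environment and Human Factors",
    "IfADo - Leibniz Research Centre for Working Environment and Human Factors, Dortmund",
    "Leibniz Research Centre for Working Environment and Human Factors, Dortmund",
]
groups = cluster(variants)  # the four variants collapse into a single group
```

Once the variants are grouped, a single ROR can be assigned to the whole group.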

It is well known that the process of extracting meaningful organization names from affiliation information is a challenge. That remains true here. In the first pass I was able to identify RORs for roughly 25% of the publishers in this sample. This number will increase with a closer look at these strings and increased adoption of RORs across the scholarly communication community.

Creators and Contributors

As expected, the case for creator and contributor affiliations is different because this is optional information rather than required. Less than half of the Data Centers (591 / 1432) provide any affiliation information for creators or contributors. Figure 2 shows the number of creators / contributors per Data Center (note non-linear Y axis). Again, the most common number of affiliations per Data Center is one, which is the case for 108/591 (~18%) of the Data Centers. Once again, this helps make it possible to jump start the adoption process with these Data Centers. As in the publisher case, there is also inconsistency in some of the affiliations, which will decrease the number of RORs required as it is cleaned up.

Figure 2. The number of Creators / Contributors in DataCite Data Center metadata. Note that a large portion of the Data Centers have very few unique creators / contributors. This will simplify the process of assigning identifiers to these organizations.


Figure 3. The number of Funders in DataCite Data Center metadata. Note that a large portion of the Data Centers have very few unique funders. This will simplify the process of assigning identifiers to these organizations.

Acknowledging funding for research projects and results in a specific section of research articles has long been a standard part of scholarly communications. Adding funder names to the metadata makes it possible to index and search that information, and identifiers improve the consistency of identification and search results. In addition, they improve machine readability of the data.

The number of Data Centers that provide funder information remains small (178/1432) three years after this content was introduced. Figure 3 shows the distribution of the number of Funders / Data Center (note non-linear Y axis). As in the other cases, many of the Data Centers that have funder names have a small number of Funders. Eighty-five Data Centers currently have five or fewer Funders. Also, as in the previous cases, there are multiple representations of many of the funders that, once cleaned up, will help increase the adoption of the current Crossref identifiers, or others.

As mentioned above, the fundingReference section of the DataCite metadata includes an element for the funder identifier, and 46 of the Data Centers provide this information. The majority of the funder ids come from the Crossref Funder Registry, with most of the remainder from GRID or ISNI.

Can You Improve Your Affiliations?

Realizing the benefits of consistent organization names and permanent identifiers will take some time and effort across the whole community. The first-pass ROR identification process does slightly better with affiliations than in the publisher case, with RORs identified for 27% of the affiliation strings. Some patterns emerge that might help researchers and providers improve their affiliation names:

1.     Standard Names – each organization has a standard name that is used in the ROR database along with the identifier. Look up your organization in the ROR registry and use that standard name, along with the identifier of course.

2.     Acronyms – many of the affiliations in the DataCite metadata are just acronyms. Unexpanded acronyms are a well-known problem in the scientific literature. Using them alone to uniquely identify organizations can easily cause ambiguity and other problems. Probably best to avoid acronyms and abbreviations altogether in affiliations.

3.     Addresses – all addresses are made up of many parts that can be combined or omitted in many ways. The DataCite metadata schema does not include an element that holds physical addresses so providers add parts of them onto affiliation strings. This exacerbates the recognition challenge. Probably best to avoid address information in DataCite affiliations unless it is critical.

4.     University Names – University names, like addresses, can have many parts. Is it “University of X”, “X University”, or “The University of X”? Of course, the word university can be abbreviated in many ways (U, U., Univ.) and is different in different languages. Then there are state universities with campuses in many locations. Is the separator a space, a comma, a dash, or the word ‘at’? All of these are used, and sometimes they differ within a single state system. Probably best to fall back on suggestion #1: see how ROR already does it and stick with that.

5.     Funder Names – Stick with the Crossref Funder Registry for names of funding organizations. If this registry migrates to another identifier system (perhaps ROR), this will allow systematic migration from Crossref to the next identifier registry.

All of these suggestions are, at best, stopgap solutions. The ambiguity of these names is a principal motivation for identifiers in the first place. Fortunately, DataCite is including organizational identifiers for creators and contributors in the next release of its metadata schema (Version 4.3), which is imminent. Keep your eyes open for that announcement and integrate organizational identifiers (and others) into your DataCite metadata and tools as soon as possible.


The initial affiliation survey of DataCite metadata has uncovered some interesting results. First, the Data Center level seems reasonable for analysis and the DataCite API sampling capability is very useful for generating Data Center collections for analysis. Second, the number of unique publisher, creator/contributor, and funder affiliations per Data Center is generally small and, in many cases, only one. This greatly simplifies the process of ROR adoption for these Data Centers as they only need a few RORs to cover all of their metadata records. Improving the consistency of the affiliation strings will also help this process along.

Removing ambiguity and ensuring credit where it is due are important goals of the ROR community and we can all contribute to achieving success. Now is the time to ROR!

The Big Picture - Has CrossRef metadata completeness improved?

I recently introduced a simple visualization of data from the CrossRef Participation Reports that provides quantitative insight into how completeness of CrossRef metadata collections with respect to eleven key metadata elements has changed between the backfile and the current record collections. I showed examples of how these numbers could be used to help increase awareness of specific areas in which collections are more complete and areas where they could increase completeness. An important goal of that work is to identify members with relatively complete metadata that can provide examples for helping other members understand the benefits of improved metadata.

Before we explore and compare individual cases, it is interesting to establish the big picture as a background that we can keep in mind as we look at specifics. Continuing to focus on journal articles, as described in the earlier blog, I increased my sample size to include 8670 collections in the current time period and 7293 collections in the backfile, essentially all CrossRef members. This shows an increase of almost 20% in the number of collections contributed to and managed by CrossRef.  This is great. The next question is: are the more recent metadata more complete with respect to the elements included in the participation reports?

The first look at this comes from creating a standard metadata evolution plot for all journal-article collections. This plot shows the average completeness for eleven metadata elements during the backfile (orange) and current (blue) time periods. The first observation is that, on average, CrossRef collections are generally not very complete with respect to these metadata elements. The average completeness in the current period is 18%, up from 13% during the backfile. Note that the scale maximum in this plot is 50% rather than 100% as it has been in plots in the previous blogs.

Average completeness % for all CrossRef Journal-Article collections during the backfile (orange) and current (blue) time periods. The difference between these two time periods, i.e. the change in completeness, is positive for all metadata elements.

The data indicate that Similarity Checking is the most commonly used metadata element (of the eleven included) during the backfile (31%) and the current (37%) time periods and that Open References increased the most (13%) between the two time periods. Of the remaining elements, most show increases of 5-8%, while Award Numbers, Funders, and Update Policies show increases of 1-2%, and References shows no increase.
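These period-over-period changes can be recomputed from per-element percentages. In the sketch below, the Similarity Checking values and the Open References increase match the numbers quoted in this post; the remaining percentages are illustrative stand-ins, since the real values come from the Participation Reports.

```python
# Recompute average completeness and per-element change between the two
# time periods. Similarity Checking values match those quoted in this post;
# the other percentages are illustrative stand-ins.
backfile = {"Similarity Checking": 31, "Open References": 12, "Funders": 7}
current = {"Similarity Checking": 37, "Open References": 25, "Funders": 9}

def average_completeness(scores):
    """Mean completeness (%) across the measured metadata elements."""
    return sum(scores.values()) / len(scores)

def changes(before, after):
    """Per-element change in completeness between two time periods."""
    return {element: after[element] - before[element] for element in before}

delta = changes(backfile, current)  # e.g. Similarity Checking rose by 6 points
```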

A second way to look at these data is to determine the % of collections that have content for the metadata elements included in the Participation Reports for two time periods. Similar to the earlier case, Similarity Checking occurs in more collections (43%) than any other metadata element in both time periods. The number of collections with Reference metadata is also high, but decreased in the current period.

The % of CrossRef collections that include metadata elements during the backfile (orange) and current (blue) time periods. Note that this percentage has increased for all metadata elements except references. While completeness of references has decreased by 7%, completeness of Open References has increased by 13%. This indicates migration of the entire collection towards open-references.

Another question of interest is: how many collections were nearly complete (>90%) with respect to these elements in the two time periods? This Figure shows numbers that address this question. Of course, the numbers are smaller, and in this case the current period (blue) has more almost-complete collections for all elements. The largest increase in % almost-complete collections is 13% for Open References.

The % of CrossRef collections that are almost complete (90+%) during the backfile (orange) and current (blue) time periods. The % of Open References shows the largest increase (13%).

These last two datasets can be combined to answer the question: how likely is it that a collection that includes a particular element has that element in all of the records in the collection? These data are shown in the final figure. The light blue region shows the current data on % collections with content and the dark blue data show the % of collections which are almost complete for each element. If these two numbers are close, it means that if a collection includes an element, it is present in most of the records in the collection. This is essentially true for Open References and means that a metadata provider that provides open references tends to include them in all records.

A comparison of the % of collections that currently have content (light blue) and those that are almost complete (dark blue) for each element. When these two numbers are close (as in the Open References case), it indicates that collections that include an element include it in almost all records in the collection.
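This "all-or-nothing" comparison can be expressed as a simple per-element ratio: of the collections that contain an element at all, what fraction are almost complete in it? The percentages below are illustrative, not the measured values behind the figure.

```python
# For each element, compare the share of collections with any content to the
# share that are almost complete (>90%). Values here are illustrative, not
# the measured percentages behind the figure.
has_content = {"Open References": 30, "Abstracts": 25}
almost_complete = {"Open References": 28, "Abstracts": 5}

def all_or_nothing(element):
    """Fraction of the collections containing an element that are almost complete in it."""
    return almost_complete[element] / has_content[element]
```

A ratio near 1 (as in the hypothetical Open References numbers) means providers that deposit the element tend to deposit it for every record.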


Overall, these data show that CrossRef metadata have improved and that there is room for much larger improvements in the future.

CrossRef introduced the CrossRef Participation Reports saying “Membership of Crossref is not just about getting a persistent identifier for your content, it’s about placing your content in context by providing as much metadata as possible and looking after it long-term”. A presentation by Stephanie Dawson from ScienceOpen showed that “articles with additional metadata had much higher average views than those without - depositing richer metadata helps you get the best value from your DOIs”. The focus of these quantitative reports is clearly on improvements to the CrossRef metadata that increase understanding of the context in which a particular resource exists, i.e. connecting resources into the web of scholarly communication.

The data presented here indicate that most CrossRef collections do not currently include most of the metadata elements measured in the CrossRef Participation Reports, i.e. there are many opportunities for members to increase the value of their CrossRef metadata by adding connections to other kinds of content. The importance of this type of information is highlighted in the FAIR Data Principles, which include “metadata include qualified references to other data”, so members also benefit from increases in “FAIRness” of their collections by adding this content. My interest is in measuring those improvements to identify and highlight examples of CrossRef members that are implementing them. Those results are coming up.

Metadata Evolution - Metadata Completeness, Agility, and Collection Size

I recently introduced a simple metric for measuring metadata collection completeness with respect to elements in the CrossRef Participation Reports. The suggestion of this metric immediately led to speculation about relationships between collection size and completeness. Small collections include fewer records – are they more likely to be complete? Publishers with large collections have more resources – do they have more complete metadata? Are smaller publishers more agile - can they change more?


Metadata for Using and Understanding Software

All scientific communities have been linking research together for many years using references to related work in articles. Recently these communities have been exploring options for linking to datasets and software. As part of this effort, the CodeMeta Project recently proposed a vocabulary for code metadata.

A mapping between the codeMeta vocabulary and the ISO 19115-1 metadata standard was recently included on the codeMeta Git Repository. Creating this mapping surfaced some interesting differences between these two approaches. An interesting similarity also emerged. The codeMeta vocabulary has been mapped to over twenty metadata dialects listed along the bottom of this Figure. On average, these dialects mapped 11.2 of the 68 codeMeta terms. The ISO mapping is shown near the left edge of this Figure. It included 64 of the 68 items. This indicates that the codeMeta and ISO dialects are more similar than many of these other dialects.

The difference likely reflects the fact that most of these dialects focus primarily on citation and dependency identification while ISO and codeMeta include metadata that supports use and understanding of data and software. This broader scope requires more metadata concepts, many of which codeMeta and ISO share. Check out the details in Mapping ISO 19115-1 geographic metadata standards to CodeMeta in PeerJ Computer Science.

This Figure shows the number of codeMeta terms that are included in mappings to many dialects. The red line shows the average number for many of these dialects (11.2). The two bars on the left show codeMeta and ISO 19115-1. The ISO mapping includes sixty-four of sixty-eight codeMeta terms. This similarity reflects the fact that both of these dialects include metadata to help users use and understand software as well as cite it.


Dryad Data Packages and Files

California Digital Library (CDL) and Dryad recently announced a partnership to address researcher needs and to “move the needle” on digital data adoption throughout the scholarly research community by working together to understand and respond to researcher needs for high-quality scientific data publishing infrastructures. Metadata that supports accessibility to and understanding of published datasets plays a critical role in driving this adoption.

I have been working with CDL’s University of California Curation Center (UC3) on Metadata 2020 and this new partnership provides a great opportunity to apply some of the ideas developed there to a challenging real-world metadata evaluation, guidance and improvement problem. This initial post explores current Dryad metadata models and usage to lay a foundation for understanding these metadata and forming data-driven recommendations for improvements in the future.

Dryad Metadata Model

The Dryad Metadata Model describes two types of objects: dataFiles and the dataPackages that contain them. This metadata hierarchy is common in other metadata models as well, sometimes called granule and collection metadata. All of these models include terms that are shared by both types of objects and others that are unique to one or the other.

Curated examples of how Dryad metadata are used are available on the Dryad Wiki. Existing Dryad metadata provide a rich source for understanding these metadata and how they are used in real life. In order to understand the usage, we examined a random sample of metadata for 114 dataPackages and 335 dataFiles from Dryad. Hopefully this sample is large enough to identify similarities and differences.

The Dryad metadata terms naturally divide into four groups shown in the Table below along with the % of the records of each type (dataPackage = blue, dataFile = orange) that include each of the terms on the Y axis. For example, the first group (Data Files) includes four terms (isPartOf, provenance, rights, and format) listed on the X axis. Three of these are required (marked by *) and two are repeatable (marked by +). The term isPartOf occurs in 100% of the dataFile metadata records (shown as orange) and 0% of the dataPackage records (shown as blue). This usage is expected, as this term gives the dataPackage that a dataFile is part of.

DataFile terms: Four terms that pertain only to dataFiles.

Three of these (isPartOf, provenance, and rights) are required (marked with *) and occur in virtually all dataFile records. They contain the identifier of the dataPackage that the dataFile is part of (typically a DOI), the state of the dataFile in the Dryad workflow, and the license for the dataPackage (all Creative Commons Zero, CC0).

The format term is optional and occurs in 86% of the dataFile records. It typically contains the size of the dataFile in bytes.

Shared terms: Five optional terms that are used in roughly the same number of dataPackages and dataFiles to provide information about several types of keywords (time period, spatial location, scientificNames, and subject).

The similarity of these numbers reflects the fact that dataPackages aggregate the keywords from the dataFiles that they contain.

DataPackage terms: Five terms, three of which are required, that are used in essentially all dataPackages and varying numbers of dataFiles to provide submission history and descriptions and to connect dataPackages to dataFiles (hasPart) and to published papers (references).

All Files and Packages: Four required terms and the XML schemaLocation that provide basic discovery and access information (title, creator, identifier) and give the types for all dataPackages and dataFiles.

Several of these terms can occur more than once in a single metadata record (marked with + in the Table above). The most numerous terms in the dataPackage and dataFile metadata records are creators, averaging ~5 per record. Subject terms come in a close second with an average of just under 5 per record. Most dataPackages in this sample (54%) contain a single dataFile and 77% contain fewer than three dataFiles. The maximum number of dataFiles in a dataPackage is 72 in this sample.

Collection Homogeneity

I proposed several metrics for characterizing and comparing metadata collections. One of these measures the collection homogeneity calculated as the number of terms that are in all records / the total number of terms in the collection. The Dryad dataPackage sample contains fourteen terms and nine of them are included in all records so homogeneity is 65%. The dataFile sample includes seventeen terms and six are included in all records so homogeneity is 35%. This difference may reflect the fact that much of the dataPackage metadata is managed as part of the central repository workflow while the dataFile metadata are provided by the researchers that use Dryad to publish data.
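The homogeneity metric itself is easy to compute from per-record term sets. A minimal sketch, using a toy collection rather than the actual Dryad sample:

```python
# Homogeneity = (number of terms present in every record) / (total number of
# distinct terms in the collection). A toy collection stands in for the
# actual Dryad sample.
def homogeneity(records):
    """records: a list of sets of metadata term names, one set per record."""
    all_terms = set().union(*records)
    shared_terms = set.intersection(*records)
    return len(shared_terms) / len(all_terms)

# Three toy records with five distinct terms, two of which appear in every record.
sample = [
    {"title", "creator", "identifier", "subject"},
    {"title", "creator", "identifier"},
    {"title", "creator", "spatial"},
]
```

Applying the same calculation to the Dryad samples gives the dataPackage and dataFile values quoted above.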


This initial exploration identifies the two types of Dryad metadata, outlines some interesting similarities and differences between them, and helps form a baseline for understanding intended metadata use cases. The metadata were designed to connect elements of information packages that include scientific papers (identified in the references element) and related dataFiles. They provide identifiers and discovery information that includes titles, creators, paper abstracts, and four types of keywords (subject, scientific name, spatial, and temporal) and track the data publication process (processing dates and provenance). The scientific papers are the primary discovery path for the data and are required for data users to understand details of how the data were collected and processed. This is different from the model used in many repositories, where these details can be included in the dataset metadata and users can access them independently.

Talking and Thinking About Metadata

The idea that the language we use to talk about things shapes the way we think, or can think, about those things has been around since the 1800s and even has a name, the Sapir–Whorf hypothesis, coined in 1954. It was Whorf who said, “Language is not simply a reporting device for experience but a defining framework for it.” Last year Lera Boroditsky discussed a similar idea from the stage at TEDWomen with some nice examples and data from multiple languages and cultures. I have been thinking and writing about a universal documentation language for some time and bring together a couple of those ideas here.

Some metadata terms emerged from my metadata evaluation and guidance work with many partners. I described the concept of “metadata dialects” and suggested that many metadata standards are more like dialects of a universal documentation language than they are like separate languages. Some have questioned whether a universal “documentation language” really exists. I admit that it is really a concept that I hope exists rather than a real language described in an unabridged dictionary somewhere.

More recently, I introduced this dialect nomenclature to the Metadata 2020 community of metadata experts that advocate richer, connected, reusable, and open metadata for all research outputs. The terms are slowly creeping into some Metadata 2020 discussions, hopefully helping to build and cross bridges between different communities that are committed to better metadata in all contexts.

Documentation or Metadata?

Many datasets and products are documented using approaches and tools developed by data collectors to support their analysis and understanding. This documentation exists in notebooks, scientific papers, web pages, user guides, word processing documents, spreadsheets, data dictionaries, PDFs, databases, custom binary and ASCII formats, and almost any other conceivable form, each with associated storage and preservation strategies. This custom, often unstructured, approach may work well for independent investigators or in the confines of a particular laboratory or community, but it makes it difficult for users outside of these small groups to discover, use, and understand the data without consulting with its creators.

Metadata are standard and structured documentation.


Metadata, in contrast to documentation, helps address discovery, use, and understanding by providing well-defined, structured content. This makes it possible for users to access and quickly understand many aspects of datasets that they have not collected. It also makes it possible to integrate information into discovery and analysis tools, and to provide consistent references from the metadata to external documentation.

Metadata standards provide standard element names and associated structures that can describe a wide variety of digital resources. The definitions and domain values are intended to be sufficiently generic to satisfy the metadata needs of various disciplines. These standards also include references to external documentation and well-defined mechanisms for adding structured information to address specific community needs.

Another important difference between documentation and metadata is the target audience. Documentation is targeted at humans and relies heavily on our capability to make sense out of a variety of unstructured information. Metadata, on the other hand, are typically targeted at applications. Many of these applications facilitate searching metadata and displaying it in a way that facilitates data discovery by humans. As tools mature and, more importantly, the breadth of existing metadata increases, we will see more and more applications creating and using metadata to facilitate more sophisticated metadata- and data-driven discovery, comparisons between multiple datasets, and other analyses.

Of course, the audience is also very important when we create metadata. Humans like descriptions that help them understand the resources being described and citations to more, likely unstructured, information. Applications are generally much more demanding when it comes to consistency and completeness. It is important to consider both audiences when creating and improving metadata.

Note added: It is interesting to see that the word “documentation” has a much longer history than the word “metadata”. Metadata is really the new kid on the block.