The Big Picture - Has CrossRef metadata completeness improved?

I recently introduced a simple visualization of data from the CrossRef Participation Reports that provides quantitative insight into how completeness of CrossRef metadata collections with respect to eleven key metadata elements has changed between the backfile and the current record collections. I showed examples of how these numbers could be used to help increase awareness of specific areas in which collections are more complete and areas where they could increase completeness. An important goal of that work is to identify members with relatively complete metadata that can provide examples for helping other members understand the benefits of improved metadata.

Before we explore and compare individual cases, it is interesting to establish the big picture as a background that we can keep in mind as we look at specifics. Continuing to focus on journal articles, as described in the earlier blog, I increased my sample size to include 8670 collections in the current time period and 7293 collections in the backfile, essentially all CrossRef members. This is an increase of about 19% in the number of collections contributed to and managed by CrossRef. This is great. The next question is: are the more recent metadata more complete with respect to the elements included in the participation reports?
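The growth figure follows directly from the two collection counts given above. As a quick check, here is a minimal sketch of the arithmetic; the function name `percent_increase` is just an illustrative helper, not part of any CrossRef tooling.

```python
# Collection counts quoted in the text above.
backfile_collections = 7293
current_collections = 8670


def percent_increase(old, new):
    """Relative increase from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


growth = percent_increase(backfile_collections, current_collections)
print(f"Increase in number of collections: {growth:.1f}%")
```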

The first look at this comes from creating a standard metadata evolution plot for all journal-article collections. This plot shows the average completeness for eleven metadata elements during the backfile (orange) and current (blue) time periods. The first observation is that, on average, CrossRef collections are generally not very complete with respect to these metadata elements. The average completeness in the current period is 18%, up from 13% during the backfile. Note that the scale maximum in this plot is 50% rather than 100% as it has been in plots in the previous blogs.

Average completeness % for all CrossRef Journal-Article collections during the backfile (orange) and current (blue) time periods. The difference between these two time periods, i.e. the change in completeness, is positive for all metadata elements.
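For readers who want to reproduce this kind of summary, the sketch below shows one way the averages could be computed. It assumes each collection is represented as a dictionary mapping an element name to its completeness in percent, and that an element missing from a collection counts as 0%. Both the data layout and the sample values are made up for illustration; they are not taken from the Participation Reports.

```python
from statistics import mean


def average_completeness(collections, elements):
    """Average completeness (%) of each element across all collections.

    Assumption: an element missing from a collection counts as 0%.
    """
    if not collections:
        return {element: 0.0 for element in elements}
    return {
        element: mean(c.get(element, 0.0) for c in collections)
        for element in elements
    }


def change_in_completeness(backfile_avg, current_avg):
    """Current minus backfile average, in percentage points."""
    return {e: current_avg[e] - backfile_avg[e] for e in backfile_avg}


# Made-up values for illustration only, not data from the reports.
elements = ["References", "Open References"]
backfile = [
    {"References": 80.0, "Open References": 0.0},
    {"References": 20.0, "Open References": 10.0},
]
current = [
    {"References": 80.0, "Open References": 30.0},
    {"References": 20.0, "Open References": 20.0},
]

backfile_avg = average_completeness(backfile, elements)
current_avg = average_completeness(current, elements)
change = change_in_completeness(backfile_avg, current_avg)

for element in elements:
    print(f"{element}: {backfile_avg[element]:.0f}% -> "
          f"{current_avg[element]:.0f}% ({change[element]:+.0f})")
```

Note that this is a plain average over collections, so a small collection counts as much as a large one. Weighting by the number of records would be a different choice and would give different numbers.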

The data indicate that Similarity Checking is the most commonly used metadata element (of the eleven included) during the backfile (31%) and the current (37%) time periods and that Open References increased the most (13%) between the two time periods. Of the remaining elements, most show increases of 5-8%, while Award Numbers, Funders, and Update Policies show increases of 1-2%, and References shows no increase.

A second way to look at these data is to determine the % of collections that have content for the metadata elements included in the Participation Reports in the two time periods. Similar to the earlier case, Similarity Checking occurs in more collections (43%) than any other metadata element in both time periods. The number of collections with Reference metadata is also high, but decreased in the current period.

The % of CrossRef collections that include metadata elements during the backfile (orange) and current (blue) time periods. Note that this percentage has increased for all metadata elements except References. While completeness of References has decreased by 7%, completeness of Open References has increased by 13%. This suggests a migration across the collections towards Open References.

Another question of interest is: how many collections were nearly complete (>90%) in these two time periods with respect to these elements? The next figure shows numbers that address this question. Of course, the numbers are smaller, and in this case, the current period (blue) has more almost-complete collections for all elements. The largest increase in % of almost-complete collections is 13% for Open References.

The % of CrossRef collections that are almost complete (90+%) during the backfile (orange) and current (blue) time periods. The % of Open References shows the largest increase (13%).
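The two collection-level measures used above (the % of collections with any content for an element, and the % that are almost complete) can both be expressed as the share of collections passing a test. The sketch below is one way to do that. It assumes a list holding one completeness value per collection for a single element, and it treats "almost complete" as 90% or more; the text gives both ">90%" and "90+%", so the exact boundary is an assumption. The sample values are invented.

```python
def share_of_collections(values, predicate):
    """Percentage of collections whose completeness value passes predicate."""
    values = list(values)
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if predicate(v)) / len(values)


def has_content(completeness):
    """True if the element appears in at least some records."""
    return completeness > 0


def almost_complete(completeness, threshold=90.0):
    """True if completeness reaches the threshold (assumed inclusive)."""
    return completeness >= threshold


# Invented completeness values (%) for one element across five collections.
open_references = [0.0, 5.0, 95.0, 100.0, 40.0]

pct_with_content = share_of_collections(open_references, has_content)
pct_almost_complete = share_of_collections(open_references, almost_complete)

print(f"With content: {pct_with_content:.0f}%")
print(f"Almost complete: {pct_almost_complete:.0f}%")
```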

These last two datasets can be combined to answer the question: how likely is it that a collection that includes a particular element has that element in nearly all of the records in the collection? These data are shown in the final figure. The light blue region shows the current data on % of collections with content and the dark blue data show the % of collections which are almost complete for each element. If these two numbers are close, it means that if a collection includes an element, it is present in most of the records in the collection. This is essentially true for Open References and means that a metadata provider that provides open references tends to include them in all records.

A comparison of the % of collections that currently have content (light blue) and those that are almost complete (dark blue) for each element. When these two numbers are close (as in the Open References case), it indicates that collections that include an element include it in almost all records in the collection.
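One way to put a number on "these two numbers are close" is to divide the % of almost-complete collections by the % of collections with content. A ratio near 1 means that collections that include the element tend to include it in nearly all records. This ratio is my illustration of the comparison, not a measure from the Participation Reports, and the function name and sample values below are hypothetical.

```python
def adoption_ratio(pct_with_content, pct_almost_complete):
    """Fraction of collections with content that are also almost complete.

    Returns None when no collection has content, since the ratio is undefined.
    """
    if pct_with_content == 0:
        return None
    return pct_almost_complete / pct_with_content


# Hypothetical percentages for two unnamed elements, for illustration only.
element_a = adoption_ratio(pct_with_content=20.0, pct_almost_complete=18.0)
element_b = adoption_ratio(pct_with_content=40.0, pct_almost_complete=10.0)

print(f"Element A: {element_a:.2f}")  # close to 1: used in nearly all records
print(f"Element B: {element_b:.2f}")  # well below 1: often partially used
```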

Conclusions

Overall, these data show that CrossRef metadata have improved and that there is room for much larger improvements in the future.

CrossRef introduced the CrossRef Participation Reports saying “Membership of Crossref is not just about getting a persistent identifier for your content, it’s about placing your content in context by providing as much metadata as possible and looking after it long-term”. A presentation by Stephanie Dawson from ScienceOpen showed that “articles with additional metadata had much higher average views than those without - depositing richer metadata helps you get the best value from your DOIs”. The focus of these quantitative reports is clearly on improvements to the CrossRef metadata that increase understanding of the context in which a particular resource exists, i.e. connecting resources into the web of scholarly communication.

The data presented here indicate that most CrossRef collections do not currently include most of the metadata elements measured in the CrossRef Participation Reports, i.e. there are many opportunities for members to increase the value of their CrossRef metadata by adding connections to other kinds of content. The importance of this type of information is highlighted in the FAIR Data Principles, which include “metadata include qualified references to other data”, so members also benefit from increases in the “FAIRness” of their collections by adding this content. My interest is in measuring those improvements to identify and highlight examples of CrossRef members that are implementing them. Those results are coming up.