Metadata Evolution - CrossRef Participation Reports

Over the last several years CrossRef has grown into one of the largest and most significant metadata repositories in the world. It contains over 100,000,000 registered content items and offers many services to help members and other users take advantage of that content. In addition to managing registered content, CrossRef makes links between resources. It is the connective tissue for the web of scholarly communications.

Of course, CrossRef knows that complete and consistent metadata are the lifeblood of high-quality services. They recently introduced Participation Reports (with a help page) to help members and users get a handle on the completeness of CrossRef metadata collections. These reports provide completeness metrics (% of records) for ten key metadata elements for nineteen content types over three time-periods: current (past two years and year-to-date), backfile (older), and all.

With over 12,000 members registering content, these reports are helpful for CrossRef members and a mother lode of information about how metadata collections evolve over time. Even better, this mother lode is available in bulk through the CrossRef API, along with the temporal snapshots created by the Participation Reports.

I used the CrossRef API to download complete participation report data for a sample of 1684 members, together providing over 6400 member/content-type/time-period combinations, each termed a metadata collection. These data were retrieved in two requests, using the sets of parameters shown in the last row of Table 1.
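The download step can be sketched roughly as follows. This is a minimal illustration rather than the script I actually ran: the helper names (members_url, fetch_members, iter_collections) are hypothetical, and the layout of the member record (a "coverage-type" object keyed by time-period and then by content type) is an assumption about the API response that should be checked against a live record.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MEMBERS_URL = "https://api.crossref.org/members"  # CrossRef REST API members route


def members_url(rows, offset):
    """Build a paged request URL, e.g. rows=1000 and offset=0."""
    return f"{MEMBERS_URL}?{urlencode({'rows': rows, 'offset': offset})}"


def fetch_members(rows, offset):
    """Download one page of member records (needs network access)."""
    with urlopen(members_url(rows, offset)) as response:
        return json.load(response)["message"]["items"]


def iter_collections(member):
    """Yield (member name, time-period, content type, metrics) per collection.

    Assumes the record carries a "coverage-type" object keyed by time-period
    and then by content type; records without one yield nothing.
    """
    coverage = member.get("coverage-type") or {}
    for period, by_type in coverage.items():
        for content_type, metrics in (by_type or {}).items():
            yield member.get("primary-name"), period, content_type, metrics


# Hypothetical record, used only to show the flattening step.
example = {
    "primary-name": "Example Press",
    "coverage-type": {
        "current": {"journal-article": {"abstracts": 0.5}},
        "backfile": {"journal-article": {"abstracts": 0.1}},
    },
}
example_collections = list(iter_collections(example))
```

Calling fetch_members(1000, 0) and fetch_members(1000, 8001) would correspond to the two parameter sets in Table 1; each returned record can then be passed through iter_collections to produce the member/content-type/time-period combinations.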

Each half of Table 1 shows the number of occurrences of each time-period (all, backfile, current) for each coverage type in one of the two samples. For example, the first sample, in columns 1-4, included book metadata for 28 members covering all time, 27 members covering the backfile, and 9 members covering the current period. The second sample, in columns 5-8, included book metadata for 36 members covering all time, 16 members covering the backfile, and 31 members covering the current period.

Table 1. Coverage types, Time periods, and Collection Counts

Coverage Type

all

backfile

current

Coverage Type

all

backfile

current

book

28

27

9

book

36

16

31

book-chapter

26

16

7

book-chapter

16

2

14

book-series

3

book-series

1

book-set

2

component

48

component

16

dataset

74

9

9

dataset

7

1

1

dissertation

2

dissertation

6

journal

101

journal

368

journal-article

843

838

683

journal-article

757

310

747

journal-issue

47

40

15

journal-issue

383

128

372

journal-volume

1

journal-volume

6

monograph

12

12

5

monograph

21

10

16

other

6

5

3

posted-content

2

2

posted-content

3

2

3

proceedings

6

6

1

proceedings

24

1

24

proceedings-article

18

18

10

proceedings-article

33

6

33

proceedings-series

2

1

2

reference-book

3

3

reference-book

1

1

1

report

16

15

6

report

18

8

16

report-series

1

1

1

standard

1

Parameters (first sample, columns 1-4): rows = 1000, offset = 0

Parameters (second sample, columns 5-8): rows = 1000, offset = 8001

It is clear from these data that most resources in this sample are journal-articles, with hundreds of members reporting in each time-period. An analysis of metadata evolution requires samples from more than one time period, so this initial work focuses only on journal-articles and compares current and backfile metrics. This restriction yields more than 2600 collections for analysis.
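The counting and filtering described here could look something like the sketch below. The sample tuples are made up for illustration, and the function names are my own; the only thing taken from the analysis is the rule itself (keep journal-article collections from the current and backfile periods).

```python
from collections import Counter

# Hypothetical collections: (member, time-period, content type)
sample = [
    ("Member A", "all", "journal-article"),
    ("Member A", "backfile", "journal-article"),
    ("Member A", "current", "journal-article"),
    ("Member B", "all", "book"),
    ("Member B", "current", "journal-article"),
]


def tally(collections):
    """Count collections per (content type, time-period), as in Table 1."""
    return Counter((ctype, period) for _, period, ctype in collections)


def journal_article_subset(collections):
    """Keep journal-article collections from the current and backfile periods."""
    return [
        c for c in collections
        if c[2] == "journal-article" and c[1] in ("current", "backfile")
    ]


counts = tally(sample)
subset = journal_article_subset(sample)
```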

The member data retrieved using the API include eleven individual metrics, each with a possible range of zero to one. The names of these elements differ slightly from the field labels on the Participation Reports; both sets of names are listed in Table 2.

Table 2. Metadata item names in CrossRef API and Participation Reports

API Elements      Participation Reports    API Elements          Participation Reports
Abstracts         Abstracts                Orcids                ORCID IDs
Affiliations      -                        References            References
Award Numbers     Funding award numbers    Resource Links        Text mining URLs
Funders           Funder registry IDs      Similarity Checking   Similarity checking URLs
Licenses          License URLs             Update Policies       Crossmark enabled
Open References   Open References

An unweighted sum of these numbers yields a single completeness index for each collection, with a range between zero and eleven. This index was calculated for all of the collections in my sample. The largest observed index is 8.76, about 80% of the largest possible value (11).
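As a small sketch of this calculation: the element keys below are hypothetical spellings of the eleven API elements, and a missing element is assumed to count as zero.

```python
# Hypothetical key spellings for the eleven API elements.
ELEMENTS = [
    "abstracts", "affiliations", "award-numbers", "funders", "licenses",
    "open-references", "orcids", "references", "resource-links",
    "similarity-checking", "update-policies",
]


def completeness_index(metrics):
    """Unweighted sum of the element fractions; ranges from 0 to len(ELEMENTS)."""
    return sum(metrics.get(name, 0.0) for name in ELEMENTS)


# A made-up collection with licenses in every record and references in half.
example_index = completeness_index({"licenses": 1.0, "references": 0.5})

# The largest observed index as a fraction of the maximum possible value.
fraction_of_max = 8.76 / len(ELEMENTS)
```

Because the sum is unweighted, every element contributes equally; a weighted version would need a justification for the weights that I do not have yet.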

Subtracting the backfile index from the current index provides a quick comparison of the current and backfile time periods. This yields a set of change metrics with values between -5.57 and 8.00. The extremes of this range provide examples that help show how to visualize and interpret these numbers.
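In code, the change metric is a one-line subtraction. The sketch below adds one assumption of mine: a time period with no data is passed as None and treated as an index of zero, which is consistent with the extreme cases discussed next.

```python
def change_metric(current_index, backfile_index):
    """Current index minus backfile index; a period with no data (None) counts as 0."""
    return (current_index or 0.0) - (backfile_index or 0.0)


# The two extremes of the observed range, each with data in only one period.
largest_change = change_metric(8.00, None)
smallest_change = change_metric(None, 5.57)
```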

The largest value of the change metric (8.0) is observed for the University of Chitral, a Pakistani university dedicated to serving society by “developing in students a delicate intellectual, cultural, ethical, and humane sensitivity through a pleasant blend of ancient and modern wisdom”. We use a radar plot to visualize and compare the metrics for the backfile and current time-periods (Figure 1). The eleven metadata elements are arranged around the radar plot, and the fraction of records that include each element is shown along the radial axis, with zero at the center and the maximum value, one in this case, at the outer edge of the plot.

Figure 1. Data for University of Chitral - This is the CrossRef member in my sample with the largest difference between the backfile and the current time periods. Note that there is no data during the backfile period and that current content is complete (on the outside of the circle) for all elements that have content.

The University of Chitral is a relatively new member of CrossRef with twenty-nine journal articles registered in 2017 and 2018, so they have no backfile data. In this case, the radar plot has data only for the current time-period. All eight of the elements that have any content are present in 100% of the records, so all of the current data are around the outside edge of the plot and the current index is 8.0. This is an impressive starting point for this CrossRef member.
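For readers who want to draw similar plots, here is a rough sketch of one way to do it with matplotlib's polar axes. It is not the code behind the figures in this post; the function names, the output path, and the sample values are all hypothetical, and matplotlib is imported inside the plotting function so the geometry helper works without it.

```python
import math


def radar_polygon(values):
    """Return (angles, radii) for a closed polygon, one vertex per element."""
    n = len(values)
    angles = [2 * math.pi * i / n for i in range(n)]
    # Repeat the first vertex so the outline closes on itself.
    return angles + angles[:1], list(values) + list(values[:1])


def plot_radar(labels, series, path="radar.png"):
    """Draw one outline per time-period; series maps name -> (values, color)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    for name, (values, color) in series.items():
        angles, radii = radar_polygon(values)
        ax.plot(angles, radii, color=color, label=name)
        ax.fill(angles, radii, color=color, alpha=0.25)
    ax.set_xticks([2 * math.pi * i / len(labels) for i in range(len(labels))])
    ax.set_xticklabels(labels)
    ax.set_ylim(0, 1)  # zero at the center, one at the outer edge
    ax.legend(loc="upper right")
    fig.savefig(path)
    plt.close(fig)


# Made-up values for eleven elements: eight complete, three empty.
example_angles, example_radii = radar_polygon([1.0] * 8 + [0.0] * 3)
```

A call such as plot_radar(labels, {"current": (current_values, "blue"), "backfile": (backfile_values, "orange")}) would overlay the two time-periods on one plot.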


A longer-term member will include metadata from the backfile as well as the current time-period and provide information about the evolution of the collection. The largest overall metric in this group is from the Rockefeller University Press, which includes 57,526 registered items in the backfile and 1529 items in the current period. The detailed metrics for the current and backfile time-periods and the differences are shown in Table 3. The smallest differences are for Open References, Similarity Checking, and Update Policies, which are all included in ~100% of the records during each time period. The largest difference is for Licenses, which increased from 0.01 to 0.99. The overall metric increased from 3.40 to 8.76, an increase of 5.36, the largest increase observed among members in my sample with data in both time periods.

Table 3. Rockefeller University Press metrics for eleven metadata items in the current and backfile time-periods, and the change between them (current - backfile).

Element               current   backfile   change
Total                 8.76      3.40       5.36
Abstracts             0.94      0.05       0.89
Affiliations          0.25      0.00       0.25
Award Numbers         0.79      0.01       0.78
Funders               0.84      0.01       0.83
Licenses              0.99      0.01       0.98
Open References       1.00      1.00       0.00
Orcids                0.95      0.00       0.94
References            0.97      0.33       0.63
Resource Links        0.05      0.00       0.05
Similarity Checking   1.00      1.00       0.00
Update Policies       1.00      0.99       0.01

Figure 2 shows the backfile (orange) and current (blue) metrics for Rockefeller University Press. The title gives the member name, the content-type, the metrics for the current and backfile periods, and the difference, i.e. (current - backfile = change). The large increases in completeness for Abstracts, Award Numbers, Funders, Licenses, Orcids, and References seen in Table 3 are also apparent in this Figure, as is the similarity in coverage for Update Policies, Similarity Checking, and Open References across the two periods.

Figure 2. Rockefeller University Press shows the largest increase in completeness among members in my sample with data in both the backfile and current time periods. Note that content was introduced for many elements and continued at a high level of completeness for others.
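The element-by-element comparison for Rockefeller University Press can be reproduced from the two-decimal values in Table 3, as in the sketch below. Because the inputs are already rounded, a few recomputed numbers differ from the published ones by 0.01 (for example, the current values sum to 8.78 rather than 8.76).

```python
# Two-decimal values copied from Table 3.
current = {
    "Abstracts": 0.94, "Affiliations": 0.25, "Award Numbers": 0.79,
    "Funders": 0.84, "Licenses": 0.99, "Open References": 1.00,
    "Orcids": 0.95, "References": 0.97, "Resource Links": 0.05,
    "Similarity Checking": 1.00, "Update Policies": 1.00,
}
backfile = {
    "Abstracts": 0.05, "Affiliations": 0.00, "Award Numbers": 0.01,
    "Funders": 0.01, "Licenses": 0.01, "Open References": 1.00,
    "Orcids": 0.00, "References": 0.33, "Resource Links": 0.00,
    "Similarity Checking": 1.00, "Update Policies": 0.99,
}

# Per-element change, rounded to match the precision of the table.
change = {name: round(current[name] - backfile[name], 2) for name in current}

# Elements ordered from largest to smallest change.
ranked = sorted(change, key=change.get, reverse=True)
largest = ranked[0]
unchanged = [name for name in change if change[name] == 0.0]
```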

At the other end of the spectrum, members with large negative change metrics are those that have data during the backfile but not during the current period. The largest negative change is observed for Bayward Publishing Company, Inc. Figure 3 shows the radar plot for this member. In this case only the backfile has data so the plot shows just the orange data. This member has some very complete metadata during the backfile with four elements (Update Policies, Similarity Checking, Resource Links, and Licenses) included in all records and Affiliations and References included in over 70% of the records. It is interesting to note that the completeness in this collection occurs in generally different quadrants of the radar plot than in the other cases. At this point we do not know if this is a general pattern.

Figure 3. Bayward Publishing has very complete content during the backfile time period, but no data during the current time period. This results in a change of -5.57.
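The three end-member cases can be summarized with a small helper like the one below. The pattern labels are my own shorthand rather than CrossRef terminology, and the Bayward backfile index of 5.57 is inferred from its change of -5.57 with no current data.

```python
def change_and_pattern(current_index, backfile_index):
    """Return (change, pattern label); None means the period has no data."""
    change = (current_index or 0.0) - (backfile_index or 0.0)
    if current_index is None and backfile_index is None:
        pattern = "no data"
    elif backfile_index is None:
        pattern = "current only"
    elif current_index is None:
        pattern = "backfile only"
    else:
        pattern = "both periods"
    return round(change, 2), pattern


# (current index, backfile index) for the three cases discussed above.
end_members = {
    "University of Chitral": (8.00, None),
    "Rockefeller University Press": (8.76, 3.40),
    "Bayward Publishing Company, Inc.": (None, 5.57),
}

summary = {name: change_and_pattern(*pair) for name, pair in end_members.items()}
ranked = sorted(summary, key=lambda name: summary[name][0], reverse=True)
```

Separating the pattern label from the change value makes it easier to avoid comparing a new member with an empty backfile directly against a long-term member whose metadata actually improved.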

Conclusion – The CrossRef Participation Reports provide an opportunity for exploring snapshots of completeness for member metadata collections during several time periods. Data available using the CrossRef API make it possible to capture participation data for many members and to explore those data to identify patterns of metadata evolution. I suggest a simple set of metrics for describing and comparing metadata collections and show several end-member cases that demonstrate how those metrics can be applied to identify common patterns in evolution. Future posts will explore these data in more detail in order to quantitatively identify commonalities with the goal of understanding how CrossRef members are motivated to improve their metadata.