April | 2017 | Steven Carl Anderson's Blog

Disclaimer: According to DPLA’s Michael Bitta, their stats may be off a bit. To quote the Twitter exchange: “I guess the issue here is that we know that the mechanism that was tracking outlinks and exposing the referrer data was broken. We’re not 100% sure that that got fixed, so we would like to see a change in the data over time to verify. In any case, I think your general point about the value of hubs is important, didn’t mean to distract from that conversation. In the nearish-term, we plan to add an interstitial redirection page to make sure we’re getting the best stats about outlinks.”

Background

Getting into the Digital Public Library of America (DPLA) is seen as an important goal by many institutions within the United States. At the Digital Commonwealth, it is one of the “carrots” we use to convince people to contribute to the Massachusetts regional repository (Digital Commonwealth itself). Some institutions have even managed to bypass their regional hubs and have DPLA harvest directly from them. But I’d argue that skipping the “middle man” is a mistake in this case and that everyone in the library world is vastly underestimating the value of a well-run regional hub.

The Data Setup

What follows is a look at statistics from DPLA and Digital Commonwealth for the two-month period of January 14, 2017 to March 14, 2017. That may appear to be an odd date range, but DPLA had only just re-harvested our system the week before. As such, at the time of generating these statistics, that range made the most sense to ensure an apples-to-apples comparison.

Furthermore, these statistics only use items that Digital Commonwealth has harvested itself. All items hosted by Digital Commonwealth have been removed. Why? Once again, to keep the comparison like-for-like: the only records included are ones for which both DPLA and Digital Commonwealth hold nothing but metadata. To be more specific, it is metadata that neither system directly controls and that has to be placed into an aggregate system. Removing hosted items also mostly eliminates traffic from repeat visitors, since what gets hot-linked or cited is the page with the actual object on it.

I should clarify that none of these objects actually link to Digital Commonwealth. There is no “double dipping”, as we provide DPLA with the link directly to the source object and do not force the traffic through our application. After all, it would be a horrendous user experience if one found an object on DPLA, clicked to go to Digital Commonwealth, and then had to click another link to get to the actual object one was looking for. So while we act like a “middle man”, we don’t steal any traffic from our members’ objects and there is no worry that DPLA clickthroughs are affecting the Digital Commonwealth numbers.

The final note is the sample size. There were 239,051 harvested objects that existed in Digital Commonwealth as of January 14th.

The Initial TLDR Takeaway

During that two-month period, the following are the aggregate view statistics:

**Term Clarifications**:
“Total Views” column refers to people viewing the detail page of the item on their respective site over a two-month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Views Per Item” column is Total Views divided by Total Records.
	Total Views	Total Records	Average Views Per Item
Digital Commonwealth	23,375	239,051	0.0978
DPLA	2,337	239,051	0.0098
Both Sources	25,712	239,051	0.1076

For a Massachusetts institution, we can see that Digital Commonwealth brings about 10 times as many eyeballs to its digital objects as DPLA does! To put this in perspective, at this rate it would take about 2 years for the harvested items in Digital Commonwealth to average one view each. For those same items in DPLA, we are looking at a timeframe of roughly 20 years.
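The arithmetic behind those timeframes can be sketched in a few lines of Python. This is only an illustration built from the totals in the table above: the function and variable names are my own, and it assumes the view rate stays constant over time, which real traffic will not.

```python
# Totals from the views table (two-month window, harvested items only).
TOTAL_RECORDS = 239_051
VIEWS = {"Digital Commonwealth": 23_375, "DPLA": 2_337}
WINDOW_MONTHS = 2


def average_views_per_item(total_views, total_records=TOTAL_RECORDS):
    """Mean number of views per harvested record over the window."""
    return total_views / total_records


def years_until_one_view_each(total_views, total_records=TOTAL_RECORDS):
    """Years needed to average one view per record.

    Assumes (hypothetically) that the two-month view rate never changes.
    """
    windows_needed = total_records / total_views
    return windows_needed * WINDOW_MONTHS / 12


for source, views in VIEWS.items():
    avg = average_views_per_item(views)
    years = years_until_one_view_each(views)
    print(f"{source}: {avg:.4f} views per item, ~{years:.1f} years to one view each")
```

Run as-is, this works out to roughly 1.7 years for Digital Commonwealth and roughly 17 years for DPLA, which are the figures rounded to “about 2” and “about 20” years in the text.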

But views are only one metric… what kind of traffic is this aggregation actually sending to my digital repository? To understand that, we need the metric known as clickthroughs, which I just so happen to have for you! The following chart takes a look at clickthroughs:

**Term Clarifications**:
“Total Clickthroughs” column refers to people who clicked on the link to go to the original source repository view page of the digital object over a two-month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Clickthroughs Per Item” column is Total Clickthroughs divided by Total Records.
	Total Clickthroughs	Total Records	Average Clickthroughs Per Item
Digital Commonwealth	4,495	239,051	0.0188
DPLA	343	239,051	0.0014
Both Sources	4,838	239,051	0.0202

The division here is even more drastic than it was previously, with Digital Commonwealth generating about 13 times as many clickthroughs as DPLA (over 1,300% of DPLA’s total). That is more clickthroughs in Digital Commonwealth than views of this set of harvested items in DPLA! Interesting statistics, I think.
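For anyone who wants to check that comparison, here is a minimal sketch using only the totals from the two tables above. The names are illustrative and not part of any Digital Commonwealth or DPLA tooling.

```python
# Totals from the tables above (two-month window, harvested items only).
DC_CLICKTHROUGHS = 4_495
DPLA_CLICKTHROUGHS = 343
DPLA_VIEWS = 2_337


def as_percent_of(numerator, denominator):
    """Express numerator as a percentage of denominator."""
    return 100 * numerator / denominator


dc_vs_dpla = as_percent_of(DC_CLICKTHROUGHS, DPLA_CLICKTHROUGHS)
print(f"Digital Commonwealth clickthroughs are about {dc_vs_dpla:.0f}% of DPLA's")

# The second claim: Digital Commonwealth clickthroughs exceed DPLA views.
print("More DC clickthroughs than DPLA views:", DC_CLICKTHROUGHS > DPLA_VIEWS)
```

The percentage here is Digital Commonwealth’s clickthroughs measured against DPLA’s, not against the combined total of both sources.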

I’ll leave you with one final chart as a teaser for next time: what if we only include items that had both a subject topic and a subject geographic in our views table? What impact does that have on our average?

**Term Clarifications**:
“Total Views w/ Topic and Geo” column refers to people viewing the detail page of an item containing both a subject topic and a subject geographic on their respective site over a two-month period.
“Total Records w/ Topic and Geo” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system and that have both a subject topic and a subject geographic.
“Average Views Per Item” column is Total Views w/ Topic and Geo divided by Total Records w/ Topic and Geo.
	Total Views w/ Topic and Geo	Total Records w/ Topic and Geo	Average Views Per Item
Digital Commonwealth	14,534	52,455	0.2771
DPLA	1,138	52,455	0.0217
Both Sources	15,672	52,455	0.2988

That is quite an improvement to our average! It shouldn’t be unexpected that records with better quality metadata end up being more discoverable. Despite being only 22% of the total records, this subset of items makes up 62% of the views in Digital Commonwealth and 49% of the views in DPLA! The numbers don’t lie: quality metadata matters. But are there any reasons why the increase was slightly more dramatic in Digital Commonwealth compared to DPLA? The setup to this post ate up more time than I expected so we will delve into more breakdowns and analysis of these numbers next time!
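Those percentages can be reproduced with a short sketch like the one below. It uses only the totals from the tables above; the `lift` figure (share of views divided by share of records) is my own illustrative way of putting a number on “more dramatic” and is not a metric reported by either system.

```python
TOTAL_RECORDS = 239_051
SUBSET_RECORDS = 52_455  # records with both a subject topic and a subject geographic

# Per source: (views of the subset, views of all harvested records)
VIEWS = {
    "Digital Commonwealth": (14_534, 23_375),
    "DPLA": (1_138, 2_337),
}


def share(part, whole):
    """Fraction of the whole made up by the part."""
    return part / whole


record_share = share(SUBSET_RECORDS, TOTAL_RECORDS)
print(f"Subset is {record_share:.0%} of all harvested records")

lift = {}
for source, (subset_views, all_views) in VIEWS.items():
    view_share = share(subset_views, all_views)
    # How over-represented the subset is in views relative to its size.
    lift[source] = view_share / record_share
    print(f"{source}: {view_share:.0%} of views, lift of about {lift[source]:.1f}x")
```

By this measure the subset is over-represented in both systems, at roughly 2.8x in Digital Commonwealth versus roughly 2.2x in DPLA.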

Final Notes and Part 2

These numbers do seem small, but I want to stress that they don’t include the hosted items from Digital Commonwealth. For example, during a two-month period, Digital Commonwealth would have nearly 200,000 views of its 214,138 hosted objects, which is quite a step up from the harvested statistics. However, much of that difference would come from repeat traffic (bookmarked, shared, etc.)… and I just want to say: don’t evaluate the entire system based on this subset comparison. Blog posts about some of the other Digital Commonwealth comparisons will come in the future.

Additionally, the DPLA numbers are only a subset of one hub in their system and overall are much more impressive when looking at their system as a whole. We are but one small waterway feeding into their large ocean.

Lastly, I feel I should mention that the above is more of a “team discoverability effort”. Digital Commonwealth is a DPLA service hub, and the benefit of a regional hub like it is just part of what one gets by being in the Digital Public Library of America infrastructure. This analysis just breaks down the DPLA pipeline into its component parts. Hopefully it is useful for showing two things: that it isn’t to one’s benefit to attempt to be a special snowflake that skips the smaller regional aggregation, and that regional aggregation systems should have a user-facing front-end. The latter is something that I feel I need to stress, since so many DPLA hubs just view the goal as getting the data to DPLA (i.e. don’t build a front-end) and fail to offer their members the boost that local aggregation can give to harvested items’ discoverability.

Part 2 will take a look at the more granular level of this dataset. For example, what items get the most views in both systems? Do those most viewed items share similarities? What about clickthrough similarities? What effect do subjects have on the average views and clickthroughs of an item? Etc.

The Importance of Regional Aggregation Hubs for Digital Collections Part 1
