Steven Carl Anderson's Blog

The melody of logic always plays the notes of truth.

Monthly Archives: April 2017

The Importance of Regional Aggregation Hubs for Digital Collections Part 2

Posted on April 20, 2017 by scande3

Recap, Caveats, and Decisions

Last time I broke down some top level statistics comparing the traffic through the DPLA pipeline. As the notice above that post reads, there may be an issue with the DPLA numbers. It is worth noting that this isn’t directly the HTTP to HTTPS conversion: after all, none of the DPLA traffic for those statistics ever hits our site and the numbers come from DPLA’s Google Analytics. However, Michael Bitta of DPLA still feels there were broader statistical issues caused by some of their recent changes that could have affected their internal numbers as well.

With the above in mind, it is hard to do deep analysis of the dataset. It has become apparent that I don’t have it in me to complete a real “part 2” of this series… but that doesn’t mean that you can’t dig into the numbers as they exist! I’ve decided to provide the source data for anyone who might be curious.

All the qualifications from my part 1 blog post apply to these numbers. Additionally:

  • Not all items have a DPLA ID listed because those items had no clickthroughs or views in the report from DPLA. I could have still scripted a way to look them up but never implemented that piece. You should be able to get the DPLA ID from the DPLA API using the Digital Commonwealth PID though.
  • Some DPLA items show a clickthrough but no item view. This is not a bug. On the DPLA site, it is possible to click through to the place hosting an item without visiting the detailed item page on https://dp.la. Essentially, the lack of a view means the visitor did a search on DPLA and clicked through to the item at its source location directly from the search results view.

The Dataset Download

Download the dataset here: dpla_stats_2017_01_14-2017_03_14.xlsx

Posted in Digital Commonwealth, Metadata, Statistics

The Importance of Regional Aggregation Hubs for Digital Collections Part 1

Posted on April 5, 2017 by scande3

Disclaimer: From DPLA’s Michael Bitta, their stats may be off a bit. To quote the Twitter exchange: “I guess the issue here is that we know that the mechanism that was tracking outlinks and exposing the referrer data was broken. We’re not 100% sure that that got fixed, so we would like to see a change in the data over time to verify. In any case, I think your general point about the value of hubs is important, didn’t mean to distract from that conversation. In the nearish-term, we plan to add an interstitial redirection page to make sure we’re getting the best stats about outlinks.”

Background

Getting into the Digital Public Library of America (DPLA) is seen as an important goal by many institutions within the United States. At the Digital Commonwealth, it is one of the “carrots” we use to convince people to contribute to the Massachusetts regional repository (Digital Commonwealth itself). Some institutions have even managed to bypass their regional hubs and force DPLA to harvest directly from them. But I’d argue that ignoring the “middle man” is a mistake in this case and that everyone in the library world is vastly underestimating the value of a well-run regional hub.

The Data Setup

What follows is a look at statistics from DPLA and Digital Commonwealth for the two-month period of January 14, 2017 to March 14, 2017. That may appear to be an odd date range, but DPLA had only just re-harvested our system during the week before that January date. As such, at the time of generating these statistics, that range made the most sense to ensure an apples-to-apples comparison.

Furthermore, these statistics only use items that Digital Commonwealth has harvested. All items hosted by Digital Commonwealth have been removed. Why? To keep the comparison like for like: it covers only records for which both DPLA and Digital Commonwealth hold nothing but metadata. More specifically, this is metadata that neither system can directly control and that must be placed into an aggregate system. Removing hosted items also mostly eliminates traffic from repeat visitors to an object, since what gets hot-linked or cited is the page with the actual object on it.

I should clarify that none of these objects actually link to Digital Commonwealth. There is no “double dipping”, as we provide DPLA with the link directly to the source object and do not force the traffic through our application. After all, it would be a horrendous user experience if one found an object on DPLA, clicked to go to Digital Commonwealth, and then had to click another link to get to the actual object one was looking for. So while we act like a “middle man”, we don’t steal any traffic from our members’ objects and there is no worry that DPLA clickthroughs are affecting the Digital Commonwealth numbers.

The final note is the sample size. There were 239,051 harvested objects that existed in Digital Commonwealth as of January 14th.

The Initial TLDR Takeaway

During that two-month period, the following are the aggregate view statistics:

Term Clarifications:
“Total Views” column refers to people viewing the detail page of the item on their respective site over a two-month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Views Per Item” column is Total Views divided by Total Records.

| Source | Total Views | Total Records | Average Views Per Item |
| --- | --- | --- | --- |
| Digital Commonwealth | 23,375 | 239,051 | 0.0978 |
| DPLA | 2,337 | 239,051 | 0.0098 |
| Both Sources | 25,712 | 239,051 | 0.1076 |

For a Massachusetts institution, we can see that Digital Commonwealth puts about 10 times as many eyeballs on the institution’s digital objects as DPLA does! To put this in perspective, at this rate it would take about 2 years for the harvested items in Digital Commonwealth to average one view each. For those same items in DPLA, we are looking at a 20 year timeframe.
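For anyone who wants to check that arithmetic, here is a minimal sketch in Python. The figures are copied from the views table above; the function names are mine, and the time estimate assumes the view rate of this one two-month window stays constant, which is a simplification.

```python
# Figures copied from the views table (two-month window, harvested items only).
TOTAL_RECORDS = 239_051
VIEWS = {"Digital Commonwealth": 23_375, "DPLA": 2_337}
WINDOW_MONTHS = 2


def average_per_item(total, records=TOTAL_RECORDS):
    """Mean number of views per harvested record in the window."""
    return total / records


def years_to_one_view_each(total, records=TOTAL_RECORDS, window_months=WINDOW_MONTHS):
    """Years until the average reaches one view per record.

    Assumption: the view rate observed in this window never changes.
    """
    windows_needed = records / total
    return windows_needed * window_months / 12


for source, views in VIEWS.items():
    avg = average_per_item(views)
    years = years_to_one_view_each(views)
    print(f"{source}: {avg:.4f} views per item, about {years:.1f} years")
```

Unrounded, this works out to roughly 1.7 years for Digital Commonwealth and roughly 17 years for DPLA, which is where the rounded “2 years” and “20 years” come from.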

But views are only one metric… what kind of traffic is this aggregation giving my digital repository? To understand that, we would need the metric known as clickthroughs, which I just so happen to have for you! The following chart takes a look at clickthroughs:

Term Clarifications:
“Total Clickthroughs” column refers to people who clicked on the link to go to the original source repository view page of the digital object over a two-month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Clickthroughs Per Item” column is Total Clickthroughs divided by Total Records.

| Source | Total Clickthroughs | Total Records | Average Clickthroughs Per Item |
| --- | --- | --- | --- |
| Digital Commonwealth | 4,495 | 239,051 | 0.0188 |
| DPLA | 343 | 239,051 | 0.0014 |
| Both Sources | 4,838 | 239,051 | 0.0202 |

The division here is even more drastic than it was previously, with Digital Commonwealth generating roughly 13 times as many clickthroughs as DPLA (over 1,300%). That is more clickthroughs in Digital Commonwealth than views of this set of harvested items in DPLA! Interesting statistics, I think.
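The same kind of check works for clickthroughs. This is only a sketch using the per-source totals quoted in this post; the variable names are my own.

```python
# Per-source totals copied from the clickthrough and views tables.
TOTAL_RECORDS = 239_051
CLICKTHROUGHS = {"Digital Commonwealth": 4_495, "DPLA": 343}
DPLA_VIEWS = 2_337


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


dc = CLICKTHROUGHS["Digital Commonwealth"]
dpla = CLICKTHROUGHS["DPLA"]
# 4,495 + 343: this sum is what reproduces the 0.0202 combined average.
combined = dc + dpla

print(f"Digital Commonwealth vs DPLA clickthroughs: {ratio(dc, dpla):.1f}x")
print(f"Combined clickthroughs: {combined:,}")
print(f"Combined average per item: {combined / TOTAL_RECORDS:.4f}")
print(f"More DC clickthroughs than DPLA views: {dc > DPLA_VIEWS}")
```

A ratio of about 13.1 is the same thing as “over 1,300%” of DPLA’s clickthrough count.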

I’ll leave you with one final chart as a teaser for next time: what if we only include items that had both a subject topic and a subject geographic in our views table? What impact does that have on our average?

Term Clarifications:
“Total Views w/ Topic and Geo” column refers to people viewing the detail page of an item containing both a subject topic and subject geographic on their respective site over a two-month period.
“Total Records w/ Topic and Geo” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system and that have both a subject topic and a subject geographic.
“Average Views Per Item” column is Total Views w/ Topic and Geo divided by Total Records w/ Topic and Geo.

| Source | Total Views w/ Topic and Geo | Total Records w/ Topic and Geo | Average Views Per Item |
| --- | --- | --- | --- |
| Digital Commonwealth | 14,534 | 52,455 | 0.2771 |
| DPLA | 1,138 | 52,455 | 0.0217 |
| Both Sources | 15,672 | 52,455 | 0.2988 |

That is quite an improvement to our average! It shouldn’t be unexpected that records with better quality metadata end up being more discoverable. Despite being only 22% of the total records, this subset of items makes up 62% of the views in Digital Commonwealth and 49% of the views in DPLA! The numbers don’t lie: quality metadata matters. But are there any reasons why the increase was slightly more dramatic in Digital Commonwealth compared to DPLA? The setup to this post ate up more time than I expected so we will delve into more breakdowns and analysis of these numbers next time!
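To see where those percentages come from, and how much bigger the bump is on each site, here is a rough sketch. The numbers are copied from the tables in this post; “lift” is just my own label for the subset average divided by the overall average, not a term from either system.

```python
# Overall and subset figures copied from the two views tables.
ALL = {"records": 239_051, "views": {"Digital Commonwealth": 23_375, "DPLA": 2_337}}
SUBSET = {"records": 52_455, "views": {"Digital Commonwealth": 14_534, "DPLA": 1_138}}


def share(part, whole):
    """Fraction of the whole that the part represents."""
    return part / whole


def lift(source):
    """Subset average views per item divided by the overall average."""
    subset_avg = SUBSET["views"][source] / SUBSET["records"]
    overall_avg = ALL["views"][source] / ALL["records"]
    return subset_avg / overall_avg


print(f"Share of records: {share(SUBSET['records'], ALL['records']):.0%}")
for source in ALL["views"]:
    view_share = share(SUBSET["views"][source], ALL["views"][source])
    print(f"{source}: {view_share:.0%} of views, {lift(source):.2f}x the overall average")
```

By this measure the topic-and-geographic subset averages roughly 2.8 times the overall figure in Digital Commonwealth and roughly 2.2 times in DPLA.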

Final Notes and Part 2

These numbers do seem small, but I want to stress that they don’t include the hosted items from Digital Commonwealth. For example, during a two-month period, Digital Commonwealth would have nearly 200,000 views of its 214,138 hosted objects, which is quite a step up from the harvested statistics. However, much of that difference would come from repeat traffic (bookmarked, shared, and so on), so please don’t evaluate the entire system based on this subset comparison. Blog posts about some of the other Digital Commonwealth comparisons will come in the future.

Additionally, the DPLA numbers are only a subset of one hub in their system and overall are much more impressive when looking at their system as a whole. We are but one small waterway feeding into their large ocean.

Lastly, I feel I should mention that the above is more of a “team discoverability effort”. The benefit of a regional DPLA service hub like Digital Commonwealth is just part of what one gets by being in the Digital Public Library of America infrastructure. This analysis just breaks down the DPLA pipeline into its component parts. Hopefully it is useful for showing two things: that it isn’t to one’s benefit to attempt to be a special snowflake that skips the smaller regional aggregation, and that regional aggregation systems should have a user-facing front-end. The latter is something that I feel I need to stress, since so many DPLA hubs just view the goal as getting the data to DPLA (i.e. they don’t build a front-end) and fail to offer their members the boost that local aggregation can give to harvested items’ discoverability.

Part 2 will take a look at the more granular level of this dataset. For example, what items get the most views in both systems? Do those most viewed items share similarities? What about clickthrough similarities? What effect do subjects have on the average views and clickthroughs of an item? Etc.

Posted in Digital Commonwealth, Metadata, Statistics
