Steven Carl Anderson's Blog

The melody of logic always plays the notes of truth.

Category Archives: Statistics

The Importance of Regional Aggregation Hubs for Digital Collections Part 2

Posted on April 20, 2017 by scande3

Recap, Caveats, and Decisions

Last time I broke down some top level statistics comparing the traffic through the DPLA pipeline. As the notice above that post reads, there may be an issue with the DPLA numbers. It is worth noting that this isn’t directly the HTTP to HTTPS conversion: after all, none of the DPLA traffic for those statistics ever hits our site and the numbers come from DPLA’s Google Analytics. However, Michael Bitta of DPLA still feels there were broader statistical issues caused by some of their recent changes that could have affected their internal numbers as well.

With the above in mind, it is hard to do deep analysis of the dataset. It has become apparent that I don’t have it in me to complete a real “part 2” of this series… but that doesn’t mean that you can’t dig into the numbers as they exist! I’ve decided that I’ll provide the source data for anyone who might be curious.

All the qualifications from my part 1 blog post apply to these numbers. Additionally:

  • Not all items have a DPLA ID listed because those items had no clickthroughs or views in the report from DPLA. I could have still scripted a way to look them up but never implemented that piece. You should be able to get the DPLA ID from the DPLA API using the Digital Commonwealth PID though.
  • Some DPLA items show a clickthrough but no item view. This is not a bug. On the DPLA site, it is possible to click through to the place hosting an item without visiting the detailed item page on https://dp.la. Essentially, the lack of a view means the user ran a search on DPLA and clicked through to the item at its source location directly from the search results view.

The Dataset Download

Download the dataset here: dpla_stats_2017_01_14-2017_03_14.xlsx

Posted in Digital Commonwealth, Metadata, Statistics

The Importance of Regional Aggregation Hubs for Digital Collections Part 1

Posted on April 5, 2017 by scande3
Disclaimer: From DPLA’s Michael Bitta, their stats may be off a bit. To quote the twitter exchange: “I guess the issue here is that we know that the mechanism that was tracking outlinks and exposing the referrer data was broken. We’re not 100% sure that that got fixed, so we would like to see a change in the data over time to verify. In any case, I think your general point about the value of hubs is is important, didn’t mean to distract from that conversation. In the nearish-term, we plan to add an interstitial redirection page to make sure we’re getting the best stats about outlinks.”

Background

Getting into the Digital Public Library of America (DPLA) is seen as an important goal by many institutions within the United States. At the Digital Commonwealth, it is one of the “carrots” we use to convince people to contribute to the Massachusetts regional repository (Digital Commonwealth itself). Some institutions have even managed to bypass their regional hubs and force DPLA to harvest directly from them. But I’d argue that ignoring the “middle man” is a mistake in this case and that everyone in the library world is vastly underestimating the value of a well-run regional hub.

The Data Setup

What follows is a look at statistics from DPLA and Digital Commonwealth for the two month period of January 14, 2017 to March 14, 2017. That may appear to be an odd date range but the DPLA had only just re-harvested our system the previous week of that January. As such, at the time of generating these statistics, that range made the most sense to ensure an apples to apples comparison.

Furthermore, these statistics only use items that Digital Commonwealth has harvested itself. All hosted items from Digital Commonwealth have been removed. Why? To keep the comparison like for like and include only records for which both DPLA and Digital Commonwealth hold nothing but metadata. More specifically, this is metadata that neither system directly controls and that has to be placed into an aggregate system. Removing hosted items also mostly eliminates traffic from repeat visitors to an object, since what gets hot-linked or cited is the page with the actual object on it.

I should clarify that none of these objects actually link to Digital Commonwealth. There is no “double dipping” as we provide DPLA with the link directly to the source object and do not force the traffic through our application. After all, it would be a horrendous user experience if one found an object on DPLA, clicked to go to Digital Commonwealth, and then had to click another link to get to the actual object they were looking for. So while we act like a “middle man”, we don’t steal any traffic from our members’ objects and there is no worry that DPLA clickthroughs are affecting the Digital Commonwealth numbers.

The final note is the sample size. There were 239,051 harvested objects that existed in Digital Commonwealth as of January 14th.

The Initial TLDR Takeaway

During that two month period, the following are the aggregate view statistics:

Term Clarifications:
“Total Views” column refers to the number of times people viewed the detail page of the item on the respective site over the two month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Views Per Item” column is Total Views divided by Total Records.

Source                 Total Views   Total Records   Average Views Per Item
Digital Commonwealth   23,375        239,051         0.0978
DPLA                   2,337         239,051         0.0098
Both Sources           25,712        239,051         0.1076

For a Massachusetts institution, we can see that about 10 times as many of the eyeballs on its digital objects come from Digital Commonwealth as from DPLA! To put this in perspective, it would take about 2 years for each harvested item in Digital Commonwealth to have an average of one view each. For those same items in DPLA, we are looking at roughly a 20 year timeframe.
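As a rough check on that perspective, the timeframes can be recomputed from the table. This is a minimal sketch, not part of the original analysis: it assumes views keep arriving at the same constant rate seen in the two month window, and the function name `years_to_one_view` is introduced here purely for illustration.

```python
# Figures copied from the views table above.
PERIOD_MONTHS = 2
TOTAL_RECORDS = 239_051
TOTAL_VIEWS = {"Digital Commonwealth": 23_375, "DPLA": 2_337}


def years_to_one_view(total_views, total_records, period_months=PERIOD_MONTHS):
    """Years until the average item reaches one view.

    Assumption: the view rate observed during the period stays constant.
    """
    views_per_item_per_month = total_views / total_records / period_months
    months_needed = 1 / views_per_item_per_month
    return months_needed / 12


for source, total in TOTAL_VIEWS.items():
    average = total / TOTAL_RECORDS
    years = years_to_one_view(total, TOTAL_RECORDS)
    print(f"{source}: {average:.4f} views per item, ~{years:.1f} years to one view per item")
```

Under that constant-rate assumption the two timeframes come out at roughly 1.7 and 17 years, which the text rounds to 2 and 20.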

But views are only one metric… what kind of traffic is this aggregation giving my digital repository? To understand that, we need the metric known as clickthroughs, which I just so happen to have for you! The following chart takes a look at clickthroughs:

Term Clarifications:
“Total Clickthroughs” column refers to the number of times people clicked on the link to go to the original source repository view page of the digital object over the two month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Clickthroughs Per Item” column is Total Clickthroughs divided by Total Records.

Source                 Total Clickthroughs   Total Records   Average Clickthroughs Per Item
Digital Commonwealth   4,495                 239,051         0.0188
DPLA                   343                   239,051         0.0014
Both Sources           4,838                 239,051         0.0202

The division here is even more drastic than it was previously, with Digital Commonwealth generating over 13 times as many clickthroughs as DPLA (about 1,300%). That is more clickthroughs in Digital Commonwealth than views of this set of harvested items in DPLA! Interesting statistics, I think.

I’ll leave you with one final chart as a teaser for next time: what if we only include items that had both a subject topic and subject geographic in our views table? What impact does that have on our average?

Term Clarifications:
“Total Views w/ Topic and Geo” column refers to the number of times people viewed the detail page of an item containing both a subject topic and subject geographic on the respective site over the two month period.
“Total Records w/ Topic and Geo” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system and that have both a subject topic and a subject geographic.
“Average Views Per Item” column is Total Views w/ Topic and Geo divided by Total Records w/ Topic and Geo.

Source                 Total Views w/ Topic and Geo   Total Records w/ Topic and Geo   Average Views Per Item
Digital Commonwealth   14,534                         52,455                           0.2771
DPLA                   1,138                          52,455                           0.0217
Both Sources           15,672                         52,455                           0.2988

That is quite an improvement to our average! It shouldn’t be unexpected that records with better quality metadata end up being more discoverable. Despite being only 22% of the total records, this subset of items makes up 62% of the views in Digital Commonwealth and 49% of the views in DPLA! The numbers don’t lie: quality metadata matters. But are there any reasons why the increase was slightly more dramatic in Digital Commonwealth compared to DPLA? The setup to this post ate up more time than I expected so we will delve into more breakdowns and analysis of these numbers next time!
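The percentages quoted above can be reproduced from the two views tables. The sketch below is only an illustration: the dictionary keys and the “lift” measure (share of views divided by share of records) are names introduced here, not terms from the original analysis.

```python
# Totals copied from the first and third views tables in this post.
ALL_HARVESTED = {"records": 239_051, "dc_views": 23_375, "dpla_views": 2_337}
TOPIC_AND_GEO = {"records": 52_455, "dc_views": 14_534, "dpla_views": 1_138}


def percent_share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


record_share = percent_share(TOPIC_AND_GEO["records"], ALL_HARVESTED["records"])
print(f"Share of records with topic and geographic subjects: {record_share:.1f}%")

for site, key in (("Digital Commonwealth", "dc_views"), ("DPLA", "dpla_views")):
    view_share = percent_share(TOPIC_AND_GEO[key], ALL_HARVESTED[key])
    # "Lift" (hypothetical name): how over-represented the subset is in views.
    lift = view_share / record_share
    print(f"{site}: {view_share:.1f}% of views, lift of about {lift:.1f}x")
```

Read this way, the subset is over-represented by a factor of about 2.8 in Digital Commonwealth and about 2.2 in DPLA, which is one way to quantify the “slightly more dramatic” increase.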

Final Notes and Part 2

These numbers do seem small but I want to stress that this doesn’t include the hosted items from Digital Commonwealth. For example, during a two month period, Digital Commonwealth would have nearly 200,000 views of its 214,138 hosted objects, which is quite a step up from the harvested statistics. However, much of that difference would be from “repeat bookmarked / shared / etc” traffic… and I just want to say: don’t evaluate the entire system based on this subset comparison. Blog posts about some of the other Digital Commonwealth comparisons will come in the future.

Additionally, the DPLA numbers are only a subset of one hub in their system and overall are much more impressive when looking at their system as a whole. We are but one small waterway feeding into their large ocean.

Lastly I feel I should mention that the above is more of a “team discoverability effort”. Digital Commonwealth is a DPLA service hub, and the benefit of a regional hub like it is just part of what one gets by being in the Digital Public Library of America infrastructure. This analysis just breaks down the DPLA pipeline into its component parts. Hopefully this analysis is useful for showing that it isn’t to one’s benefit to attempt to be a special snowflake that skips the smaller regional aggregation, and that regional aggregation systems should have a user-facing front-end. The latter is something that I feel I need to stress since so many DPLA hubs just view the goal as getting the data to DPLA (i.e., they don’t build a front-end) and fail to offer their members the boost that local aggregation can give to the discoverability of harvested items.

Part 2 will take a look at the more granular level of this dataset. For example, what items get the most views in both systems? Do those most viewed items share similarities? What about clickthrough similarities? What effect do subjects have on the average views and clickthroughs of an item? Etc.

Posted in Digital Commonwealth, Metadata, Statistics

A Further Look at Metadata’s Effect on Discoverability

Posted on April 17, 2015 by scande3

This post continues a look at the effect metadata has on the number of views an object receives. Part 1 can be found at: http://scande3.com/2015/04/effect-of-metadata-subjects-on-a-digital-objects-discoverability/. The same criteria from part 1 still apply to these stats, and those overall global rules are:

  • A six month time period from October 1st, 2014 until March 31st, 2015 for objects that existed in the repository before December 31st, 2014.
  • The listed view counts come from the Google Analytics API and reflect views on the object’s main result page only.

The first exciting match puts “LCSH pre-coordinated” subject topics against those that lack the “–” concatenation.

LCSH Topic Subject Comparisons

Term Clarifications:
“LCSH Style Topic Subject Objects” column covers items where every topic subject contains “–”.
“Non-LCSH Style Topic Subject Objects” column covers items where no topic subject contains “–”.
“Mixture of Both Styles Topic Subject Objects” column covers items with at least one subject containing “–” and at least one subject without “–”.

Metric                  LCSH Style Topic Subject Objects   Non-LCSH Style Topic Subject Objects   Mixture of Both Styles Topic Subject Objects
Total Records           8,119                              122,353                                8,111
Average Views           0.938                              1.570                                  1.243
Percent with 1+ Views   30.8%                              42.9%                                  40.5%

It would appear the hypothesis in the previous analysis post is correct: normalized non-LCSH style subjects soundly defeat those items that use the concatenation. But there is a notable asterisk to this victory in that the number of objects using LCSH style subjects is significantly smaller. Natively in the Digital Commonwealth system, we do not generally pre-coordinate LCSH subjects as our “best practice” and thus that policy decision has an effect on how metadata was done for the vast majority of items. That doesn’t mean we don’t use “complex subjects” in LCSH that represent a complete topic. For example, we do have “best practice” objects that use “United States–History–Civil War, 1861-1865” as that is the single Library of Congress topic entry that defines that war. The majority of these cases are in the “Mixture” category in the above table. For example, the item I looked at with that string also had the subjects of “Monuments & memorials” and “Churches”.
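The three categories in the table above can be expressed as a small classifier. This is a hedged sketch rather than the code behind the post: it assumes each record exposes its topic subjects as a plain list of strings, and it treats both a double hyphen and an en dash as the LCSH separator, since the “–” shown here may be a rendering of “--”.

```python
# Assumption: either form may appear as the pre-coordination separator.
LCSH_SEPARATORS = ("--", "\u2013")


def is_precoordinated(subject):
    """True if the subject string contains an LCSH-style separator."""
    return any(separator in subject for separator in LCSH_SEPARATORS)


def classify_topic_subjects(subjects):
    """Place one record's topic subjects into the table's three categories."""
    if not subjects:
        return "no topic subjects"
    flags = [is_precoordinated(subject) for subject in subjects]
    if all(flags):
        return "LCSH style"
    if not any(flags):
        return "non-LCSH style"
    return "mixture"


# The Civil War item mentioned above, written with double hyphens.
example = [
    "United States--History--Civil War, 1861-1865",
    "Monuments & memorials",
    "Churches",
]
print(classify_topic_subjects(example))
```

Note that the single hyphen in the date range “1861-1865” does not count as a separator, so only the first subject in the example is treated as pre-coordinated and the record lands in the mixture category.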

Back on topic, this means the vast majority of “LCSH Style” topic subjects came to us from metadata sources we do not control, namely OAI feeds from institutions that use “pre-coordinated LCSH Subjects” as their metadata practice and that we were unable to break up on our end. This is an important note, as these records coming from a small, uniform set of sources in the system could indicate that other factors are at play in these numbers. Taking into account the numerous potential factors (such as the quality of source metadata in other areas of the record or how interesting the items are) is mostly beyond the scope of this blog post. I will provide a breakdown of these items comparing those that have a topic subject but no geographic subject to those objects that do contain that geographic subject. I must add the caveat that this breakdown makes the numbers much more volatile, as the number of records in a category becomes quite limited, and thus I’d avoid drawing conclusions from these:

Term Clarifications:
“LCSH Style Topic Subject Objects (no Geographic)” column covers items where every topic subject contains “–” and there is no geographic subject.
“Non-LCSH Style Topic Subject Objects (no Geographic)” column covers items where no topic subject contains “–” and there is no geographic subject.
“Mixture of Both Styles Topic Subject Objects (no Geographic)” column covers items with at least one subject containing “–”, at least one subject without “–”, and no geographic subject.

Metric                  LCSH Style Topic Subject Objects (no Geographic)   Non-LCSH Style Topic Subject Objects (no Geographic)   Mixture of Both Styles Topic Subject Objects (no Geographic)
Total Records           549                                                59,328                                                 1,224
Average Views           1.741                                              0.829                                                  1.622
Percent with 1+ Views   52.6%                                              28.9%                                                  53.1%

Term Clarifications:
“LCSH Style Topic Subject Objects (with Geographic)” column covers items where every topic subject contains “–” and there is at least one geographic subject.
“Non-LCSH Style Topic Subject Objects (with Geographic)” column covers items where no topic subject contains “–” and there is at least one geographic subject.
“Mixture of Both Styles Topic Subject Objects (with Geographic)” column covers items with at least one subject containing “–”, at least one subject without “–”, and at least one geographic subject.

Metric                  LCSH Style Topic Subject Objects (with Geographic)   Non-LCSH Style Topic Subject Objects (with Geographic)   Mixture of Both Styles Topic Subject Objects (with Geographic)
Total Records           7,570                                                63,025                                                   6,887
Average Views           0.880                                                2.267                                                    1.176
Percent with 1+ Views   29.2%                                                56.1%                                                    38.2%
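
The two breakdowns above should recombine into the overall LCSH comparison table, since each overall average is a record-weighted blend of its “no Geographic” and “with Geographic” halves. The following sketch checks that arithmetic using only figures from the tables; the function and variable names are introduced here for illustration.

```python
# (total records, average views) pairs copied from the two breakdown tables.
NO_GEOGRAPHIC = {
    "LCSH": (549, 1.741),
    "Non-LCSH": (59_328, 0.829),
    "Mixture": (1_224, 1.622),
}
WITH_GEOGRAPHIC = {
    "LCSH": (7_570, 0.880),
    "Non-LCSH": (63_025, 2.267),
    "Mixture": (6_887, 1.176),
}


def combined_average(*groups):
    """Record-weighted average of several (records, average views) groups."""
    total_records = sum(records for records, _ in groups)
    total_views = sum(records * average for records, average in groups)
    return total_records, total_views / total_records


for style in NO_GEOGRAPHIC:
    records, average = combined_average(NO_GEOGRAPHIC[style], WITH_GEOGRAPHIC[style])
    print(f"{style}: {records:,} records, {average:.3f} average views")
```

The recombined values match the overall table (8,119 records at 0.938, 122,353 at 1.570, and 8,111 at 1.243). The small “no Geographic” LCSH group of 549 records barely moves its overall average, which is why the subgroup numbers deserve the caution noted above.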

OAI Harvested (Metadata Only Records) vs Hosted Native Records

As a DPLA Hub, we offer both hosted and harvesting options for our member institutes. The majority of our content is hosted directly in the system and those items that are ingested almost always go through our Digitization department where the metadata is often either cleaned up, created by, or given advice on its creation by our Metadata Mob. This then theoretically creates much more uniform metadata for our system that will play well with other objects when searching or faceting. By contrast, while we do enrichment on metadata from OAI feeds, we often have much less control over the policies those institutions implement (such as the previous topic subject differences). As such, the variation in the standards used for those objects is likely much greater. This table hopes to quantify that difference… but does have one huge flaw. In cases of an OAI Harvested metadata item, we provide DPLA with a direct link to that object in its native system rather than forcing the user to go through Digital Commonwealth first. As such, OAI Harvested objects below will be missing statistics on those views, and the DPLA is one of our top sources for referral traffic. (As an aside, the site that sends us the most traffic is Facebook).

Term Clarifications:
“Hosted Object” column covers items that live natively in our system and have the image/audio/video file content in our Fedora Commons repository.
“OAI Harvested Object” column covers items that we ingest from a member institute’s repository and that contain metadata and a thumbnail image only.

Metric                  Hosted Object   OAI Harvested Object
Total Records           108,282         37,801
Average Views           1.921           0.364
Percent with 1+ Views   51.7%           15.0%

The results are as I would expect. I wish I could tell how much the missing DPLA traffic for the OAI Harvested records affected these results. Still, it does seem highly likely that the uniformity of the metadata has an effect on how often an object is discovered in our shared system.

The Fourth Dimension!

While knowing where a record is from and what it is about is quite important, I haven’t talked about the “when” aspect. I decided I’d run some quick stats that look at how having a date on an item might increase its findability.

Term Clarifications:
“Objects with a Date” column covers items with at least one valid and not “unknown” date associated with them.
“Objects with No Date” column covers items that either have no date or have a date of “unknown”.

Metric                  Objects with a Date   Objects with No Date
Total Records           143,097               2,986
Average Views           1.500                 2.405
Percent with 1+ Views   41.9%                 55.6%

The good news: 98% of our records have a date associated with them! That is actually higher than I would have expected. More objects in our system have a date than have a subject!

The bad news? This means I don’t have a large enough group of “no date” items to figure out what effect a date might have on the views of an object. From the stats above, it would seem that objects without a date have a significantly higher view average than those that contain a date, which does not make logical sense. So while the above table shows the actual stats, the only sense I can make of it is that individuals creating these records are doing an awesome job adding in a date. :)

Conclusion

It would appear the “exploded LCSH” or “non pre-coordinated LCSH” topic subject items are more discoverable in our system. However, it also appears likely that uniformity of metadata increases the odds of an object being discovered, so that could be a result of that being the primary policy we implemented for topic subjects. It would be interesting to see the same subject analysis that has been run here applied to all of the DPLA data to see if the same patterns hold up in an even larger pool of objects.

Thanks for reading once again! Next time will likely be a move away from stats and on to another aspect of the Digital Commonwealth system. Take care!

Posted in Digital Commonwealth, Metadata, Statistics

Effect of Metadata Subjects on a Digital Object’s Discoverability

Posted on April 8, 2015 by scande3

Inspired by Mark E. Phillips’s series of blog posts analyzing subject metadata from DPLA hubs and a conversation with Corey Harper at code4lib on an analysis he is doing on DPLA data, I decided to do some statistical digging into the Digital Commonwealth system. In particular: what effect does subject level metadata have on how discoverable an object is in a repository shared by over 100 institutions?

The following details apply to all of the tables below:

  • A six month time period from October 1st, 2014 until March 31st, 2015 for objects that existed in the repository before December 31st, 2014.
  • The listed view counts come from the Google Analytics API and reflect views on the object’s main result page only. An example of this page would be: http://arktest.bpl.org/ark:/50959/zs25x877c (of course, linked to the test server to prevent future object statistics from being manipulated by this post! :p).

My Initial Subject Stats Gathering Attempt Chart

Term Clarifications:
“Topic Subject Only” column has records that have no geographic subject element.
“Geographic Subjects Only” column has records that have no topic subject element.
“Top 5 Average Views” row is just the average views for the top 5 individual records of that category.

Metric                  No Subjects   Topic Subject Only   Geographic Subjects Only   Both Topic and Geographic
Total Records           3,958         61,101               3,438                      77,482
Average Views           1.098         0.853                2.170                      2.035
Percent with 1+ Views   42.1%         29.6%                48.9%                      51.9%
Top 5 Average Views     44            161                  71                         394

As we can see, Geographic Subjects and a combination of Topic / Geographic Subjects both do great things to our numbers! The average views in these cases show a strong increase over lacking those elements and the percentage of records with at least one view goes up a significant amount. Records with only a geographic subject have a low “Top 5 Average” but that could potentially be due to the very limited number of records in the system that fall in that category. But… Topic Subjects… what happened to you?!?! I had positive expectations for you! Why are you dragging your records into the darkness of oblivion? Only 29.6% of records were ever viewed even once and the average views are a paltry 0.853 per object… both significantly lower than records with no subjects at all!

I pondered this for a short period of time and began to form a theory on why these results came out as they did. To start with, in the Digital Commonwealth system, we link virtually all Geographic Topics to the TGN Controlled Linked Data Vocabulary using a gem known as Geomash. This means that all of those records are in the same hierarchical structure using the same place terms – which is absolutely awesome for faceting! Meanwhile, our topic subjects benefit from no such structure at this time. As Mark E. Phillips’s analysis showed, our system has a bunch of “unique subjects” and those are… less awesome for faceting. So I decided I needed to “stats smarter” and account for how widely shared the subjects on a record are. Thus the following methodology was introduced for the remaining charts:

  • For a digital object record, I would keep a count of the number of topic subjects that record contained that were shared with at least one of the other 113 institutions in the Digital Commonwealth system.
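
The counting step described above could look something like the sketch below. It is an illustration under stated assumptions, not the script used for these charts: the record layout (`institution` and `topics` keys), the sample data, and the bucket labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records; the real data lives in the repository.
records = [
    {"id": "rec1", "institution": "Library A", "topics": ["Churches", "Bridges", "Parade floats"]},
    {"id": "rec2", "institution": "Library B", "topics": ["Churches"]},
    {"id": "rec3", "institution": "Museum C", "topics": ["Bridges", "Churches"]},
    {"id": "rec4", "institution": "Library A", "topics": ["Parade floats"]},
]


def institutions_by_topic(all_records):
    """Map each topic subject to the set of institutions that use it."""
    owners = defaultdict(set)
    for record in all_records:
        for topic in record["topics"]:
            owners[topic].add(record["institution"])
    return owners


def shared_topic_count(record, owners):
    """Count the record's topics that at least one other institution also uses."""
    return sum(1 for topic in set(record["topics"]) if len(owners[topic]) > 1)


def bucket(count):
    """Translate a shared-topic count into a chart column (labels are made up)."""
    return {0: "unique", 1: "one shared", 2: "two shared"}.get(count, "3+ shared")


owners = institutions_by_topic(records)
for record in records:
    print(record["id"], bucket(shared_topic_count(record, owners)))
```

Because a record’s own institution is always in the set for each of its topics, a set size greater than one means some other institution shares that topic. A topic repeated only within a single institution, like “Parade floats” in the sample, still counts as unique.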

My Smarter Chart (Objects with a Topic Subject and no Geographic Subject)

Term Clarifications:
“Percent of Topic Only” row refers to the percentage that each broken-out group makes up of the original table’s “Topic Subject Only” column.

Metric                  Unique Topic Subject Objects   One Topic Subject Shared By Multiple Institutions   Two Topic Subjects Shared By Multiple Institutions   3+ Topic Subjects Shared By Multiple Institutions
Total Records           1,140                          34,528                                              16,077                                               9,356
Percent of Topic Only   1.9%                           56.5%                                               26.3%                                                15.3%
Average Views           0.419                          0.503                                               1.233                                                1.542
Percent with 1+ Views   17.9%                          19.2%                                               40.8%                                                50.1%
Top 5 Average Views     13                             122                                                 155                                                  99

This… looks a bit better and seems to validate that non-unique subjects have a positive effect on a digital object being discovered. As one adds more shared subjects, the average number of views an item could expect to receive increases. However, I find it interesting that the highest top 5 record average went to the “exactly two shared topics” category. Why is this the case?

[Image: meme. Source: memegenerator.net]

As valid as any other explanation I could give.

In seriousness, I am far from a statistician so this blog is mostly a raw data dump with my limited perspective on what it could mean. In keeping with hopefully offering interesting data, it would be neat to have that same breakdown for objects with both subject topics and subject geographical elements to see if the pattern holds up, right? As such, I now present:

The “Going Above and Beyond” Chart (Objects with both a Topic and Geographic Subject)

Term Clarifications:
“Percent of Topic Only” row refers to the percentage that each broken-out group makes up of the original table’s “Both Topic and Geographic” column.

Metric                  Unique Topic Subject Objects w/ Geographic   One Topic Subject Shared By Multiple Institutions w/ Geographic   Two Topic Subjects Shared By Multiple Institutions w/ Geographic   3+ Topic Subjects Shared By Multiple Institutions w/ Geographic
Total Records           352                                          32,009                                                            16,823                                                             28,298
Percent of Topic Only   0.5%                                         41.3%                                                             21.7%                                                              36.5%
Average Views           1.878                                        1.835                                                             1.959                                                              2.307
Percent with 1+ Views   19.6%                                        46.8%                                                             76.9%                                                              55.9%
Top 5 Average Views     31                                           308                                                               135                                                                284

It would appear that when a geographic element exists, the effect on the average views is less pronounced than when that element is missing. The trend is still of an overall increase in the average views an object will receive but the percentage jump from adding each shared subject is less. There are also some additional oddities introduced such as two shared topic subjects having an unusual boost in a record’s chance of having been viewed at least once. I double checked the numbers to be sure, and the “exactly two” category continues to taunt me.

It would appear the “Top 5 Average Views” category doesn’t show much of a trend in terms of what leads to our most popular items. My colleague said that statistic was not useful and overall just confusing and it appears I must concede that he can do a victory dance. One cannot even easily claim that it might be mainly related to the number of records that fall into a category as there is a large exception to that pattern in the “Topic Subject Breakdown” chart. I still included it in all of these charts as I find it interesting that the availability of metadata did not seem to have a consistent effect on what might be considered a “viral success” in our system. If I include an attempt to catch the “cream of the crop” statistic in the future, I likely would increase this to “top 50 average views” to see if that behaves more in a manner that I would expect and smooths out some of the extreme variance.
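
To illustrate why widening the window from 5 to 50 could smooth things out, here is a small sketch of a “top N average”. The view counts are invented for the example (one viral item among otherwise modest ones) and the function name is hypothetical.

```python
import heapq


def top_n_average(view_counts, n):
    """Average of the n largest view counts (or of all of them if fewer exist)."""
    top = heapq.nlargest(n, view_counts)
    return sum(top) / len(top) if top else 0.0


# Invented data: a single outlier plus a long tail of ordinary items.
view_counts = [1500, 60, 55, 50, 45] + [40] * 45

print("Top 5 average: ", top_n_average(view_counts, 5))
print("Top 50 average:", top_n_average(view_counts, 50))
```

With data shaped like this, a single outlier dominates the top 5 figure, while the top 50 figure sits much closer to what a typical popular item receives.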

Conclusion

Overall it would appear that Geographic Subjects are more important to have than Topic Subjects on a record. The average of even a highly shared Topic Subject Only record against a record with just a Geographic Subject still favors the latter. Additionally: more shared subjects had only a relatively limited effect on records with a Geographic element (although an overall pattern of increased findability was observed). Whether this is due to the extra effort we put in to enhance the geographic aspect of our metadata by programmatically linking it to a controlled vocabulary or if that is just a more useful facet is hard to tell. We would have to do more analysis on user behavior to add that element to this equation and may end up taking a look at that in the future.

One of our next focuses for metadata enrichment will be topic subjects and it will be interesting to see how any enhancements on that field affect these stats. My colleague has also suggested that we take a look at numbers comparing subject elements with LCSH “–” separators to those without the concatenation. I’d imagine this to be overall similar to the “shared topic subject” breakdowns since the more elements contained in that concatenated LCSH string, the more likely it is to be unique. But that is just a hypothesis at this point. Additionally such a followup blog post would look at the effect of an object being hosted by the Digital Commonwealth (74% of objects) compared to only having its metadata harvested in the system (26% of objects). There is also a thought about somehow taking the popularity of an overall collection into account in this analysis since a popular collection would likely boost all of its respective member items regardless of metadata quality. Feel free to leave a comment on other interesting analyses that might be worthwhile to do.

If you found this blog interesting, feel free to share it! Thanks for taking the time to read this post and take care.

Posted in Digital Commonwealth, Metadata, Statistics
