Inspired by Mark E. Phillips's series of blog posts analyzing subject metadata from DPLA hubs, and by a conversation with Corey Harper at code4lib about an analysis he is doing on DPLA data, I decided to do some statistical digging into the Digital Commonwealth system. In particular: what effect does subject-level metadata have on how discoverable an object is in a repository shared by over 100 institutions?
The following details apply to all of the tables below:
- The data covers a six-month period from October 1st, 2014 until March 31st, 2015, for objects that existed in the repository before December 31st, 2014.
- The listed view counts come from the Google Analytics API and reflect views on the object’s main result page only. An example of this page would be: http://arktest.bpl.org/ark:/50959/zs25x877c (of course, linked to the test server to prevent future object statistics from being manipulated by this post! :p).
My Initial Subject Stats Gathering Attempt Chart
Statistic | No Subjects | Topic Subject Only | Geographic Subjects Only | Both Topic and Geographic |
Total Records | 3,958 | 61,101 | 3,438 | 77,482 |
Average Views | 1.098 | 0.853 | 2.170 | 2.035 |
Percent with 1+ Views | 42.1% | 29.6% | 48.9% | 51.9% |
Top 5 Average Views | 44 | 161 | 71 | 394 |
As we can see, Geographic Subjects and a combination of Topic / Geographic Subjects both do great things to our numbers! The average views in these cases show a strong increase over records lacking those elements, and the percentage of records with at least one view goes up a significant amount. Records with only a geographic subject have a low “Top 5 Average”, but that could potentially be due to the very limited number of records in the system that fall in that category. But… Topic Subjects… what happened to you?!?! I had positive expectations for you! Why are you dragging your records into the darkness of oblivion? Only 29.6% of those records were viewed even once and the average is a paltry 0.853 views per object… both significantly lower than records with no subjects at all!
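For readers who want to reproduce these row labels on their own data, here is a minimal sketch of how the statistics for one column could be computed from a plain list of per-record view counts. The function name, the `top_n` parameter, and the sample numbers are all made up for illustration; this is not the script that produced the chart above.

```python
from statistics import mean


def summarize_views(view_counts, top_n=5):
    """Summarize one category of records from its per-record view counts."""
    if not view_counts:
        raise ValueError("need at least one record")
    total = len(view_counts)
    # Records that were viewed at least once in the reporting period.
    viewed = sum(1 for v in view_counts if v >= 1)
    # The most-viewed records, used for the "Top 5 Average Views" row.
    top = sorted(view_counts, reverse=True)[:top_n]
    return {
        "total_records": total,
        "average_views": mean(view_counts),
        "percent_viewed": 100.0 * viewed / total,
        "top_n_average": mean(top),
    }


# Made-up view counts for illustration only; not Digital Commonwealth data.
sample = [0, 0, 3, 1, 0, 12, 0, 2]
summary = summarize_views(sample, top_n=5)
print(summary)
```

Raising `top_n` (to 50, say) averages over more records, so a single runaway item has less influence on that row.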
I pondered this for a short while and began to form a theory on why the results came out as they did. To start with, in the Digital Commonwealth system, we link virtually all Geographic Subjects to the TGN Controlled Linked Data Vocabulary using a gem known as Geomash. This means that all of those records are in the same hierarchical structure using the same place terms, which is absolutely awesome for faceting! Meanwhile, our topic subjects currently benefit from no such structure. As Mark E. Phillips's analysis showed, our system has a bunch of “unique subjects” and those are… less awesome for faceting. So I decided I needed to “stats smarter” and account for how widely shared the subjects on a record are. Thus the following methodology was introduced for the remaining charts:
- For a digital object record, I would keep a count of the number of topic subjects that record contained that were shared with at least one of the other 113 institutions in the Digital Commonwealth system.
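The counting rule above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (`id`, `institution`, `topics`), the bucket labels, and the use of exact string matching to decide that two institutions share a subject are all hypothetical, not a description of the actual Digital Commonwealth code.

```python
from collections import defaultdict


def shared_subject_counts(records):
    """Map each record id to its number of topic subjects used by 2+ institutions.

    `records` is a list of dicts with hypothetical keys:
    'id', 'institution', and 'topics' (a list of subject strings).
    """
    # First pass: which institutions use each topic subject?
    institutions_by_subject = defaultdict(set)
    for rec in records:
        for topic in rec["topics"]:
            institutions_by_subject[topic].add(rec["institution"])

    # Second pass: count the record's subjects that another institution also uses.
    counts = {}
    for rec in records:
        counts[rec["id"]] = sum(
            1
            for topic in set(rec["topics"])
            if len(institutions_by_subject[topic]) > 1
        )
    return counts


def bucket(shared_count):
    """Assign a shared-subject count to one of the four chart columns."""
    if shared_count >= 3:
        return "3+ shared"
    return {0: "unique only", 1: "one shared", 2: "two shared"}[shared_count]


# Made-up records for illustration only.
records = [
    {"id": "a", "institution": "Library A", "topics": ["Dogs", "Bridges"]},
    {"id": "b", "institution": "Library B", "topics": ["Dogs"]},
    {"id": "c", "institution": "Library A", "topics": ["Bridges", "Lighthouses"]},
]
counts = shared_subject_counts(records)
buckets = {rec_id: bucket(n) for rec_id, n in counts.items()}
print(buckets)
```

Here “Dogs” counts as shared because two different institutions use it, while “Bridges” does not, since both records using it belong to the same institution.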
My Smarter Chart (Objects with a Topic Subject and no Geographic Subject)
Statistic | Unique Topic Subject Objects | One Topic Subject Shared By Multiple Institutions | Two Topic Subjects Shared By Multiple Institutions | 3+ Topic Subjects Shared By Multiple Institutions |
Total Records | 1,140 | 34,528 | 16,077 | 9,356 |
Percent of Topic Only | 1.9% | 56.5% | 26.3% | 15.3% |
Average Views | 0.419 | 0.503 | 1.233 | 1.542 |
Percent with 1+ Views | 17.9% | 19.2% | 40.8% | 50.1% |
Top 5 Average Views | 13 | 122 | 155 | 99 |
This… looks a bit better and seems to validate that non-unique subjects have a positive effect on a digital object being discovered. As one adds more shared subjects, the average number of views an item could expect to receive increases. However, I find it interesting that the highest top 5 average went to the “exactly two shared topics” category. Why is this the case?
In seriousness, I am far from a statistician so this blog is mostly a raw data dump with my limited perspective on what it could mean. In keeping with hopefully offering interesting data, it would be neat to have that same breakdown for objects with both subject topics and subject geographical elements to see if the pattern holds up, right? As such, I now present:
The “Going Above and Beyond” Chart (Objects with both a Topic and Geographic Subject)
Statistic | Unique Topic Subject Objects w/ Geographic | One Topic Subject Shared By Multiple Institutions w/ Geographic | Two Topic Subjects Shared By Multiple Institutions w/ Geographic | 3+ Topic Subjects Shared By Multiple Institutions w/ Geographic |
Total Records | 352 | 32,009 | 16,823 | 28,298 |
Percent of Topic and Geographic | 0.5% | 41.3% | 21.7% | 36.5% |
Average Views | 1.878 | 1.835 | 1.959 | 2.307 |
Percent with 1+ Views | 19.6% | 46.8% | 76.9% | 55.9% |
Top 5 Average Views | 31 | 308 | 135 | 284 |
It would appear that when a geographic element exists, the effect of shared topic subjects on the average views is less pronounced than when that element is missing. The trend is still an overall increase in the average views an object will receive, but the percentage jump from adding each shared subject is smaller. There are also some additional oddities, such as two shared topic subjects giving an unusual boost to a record’s chance of having been viewed at least once. I double-checked the numbers, and the “exactly two” category continues to taunt me.
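As a quick sanity check on the two breakdown charts, the per-column “Average Views” values, weighted by their “Total Records”, should recombine into the matching column of the first chart. The sketch below does that arithmetic using only numbers from the charts in this post; the helper name is my own.

```python
def weighted_average(record_counts, averages):
    """Combine per-bucket averages into one overall average, weighted by bucket size."""
    total = sum(record_counts)
    return sum(c * a for c, a in zip(record_counts, averages)) / total


# Objects with a Topic Subject and no Geographic Subject.
topic_only = weighted_average(
    [1140, 34528, 16077, 9356],
    [0.419, 0.503, 1.233, 1.542],
)

# Objects with both a Topic and a Geographic Subject.
topic_and_geo = weighted_average(
    [352, 32009, 16823, 28298],
    [1.878, 1.835, 1.959, 2.307],
)

print(topic_only, topic_and_geo)
```

Both results land within rounding of the 0.853 and 2.035 averages reported for “Topic Subject Only” and “Both Topic and Geographic” in the first chart.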
It would appear the “Top 5 Average Views” category doesn’t show much of a trend in terms of what leads to our most popular items. My colleague said that statistic was not useful and overall just confusing, and it appears I must concede that he can do a victory dance. One cannot even easily claim that it is mainly related to the number of records that fall into a category, as there is a large exception to that pattern in the topic-only breakdown chart. I still included it in all of these charts because I find it interesting that the availability of metadata did not seem to have a consistent effect on what might be considered a “viral success” in our system. If I include an attempt to catch the “cream of the crop” statistic in the future, I would likely increase this to “top 50 average views” to see if that behaves more in the manner I would expect and smooths out some of the extreme variance.
Conclusion
Overall it would appear that Geographic Subjects are more important to have than Topic Subjects on a record. Comparing the average views of even a highly shared Topic Subject Only record against a record with just a Geographic Subject still favors the latter. Additionally, more shared subjects had only a relatively limited effect on records with a Geographic element (although an overall pattern of increased findability was observed). Whether this is due to the extra effort we put into enhancing the geographic aspect of our metadata by programmatically linking it to a controlled vocabulary, or whether that is just a more useful facet, is hard to tell. We would have to do more analysis on user behavior to add that element to this equation and may end up taking a look at that in the future.
One of our next focuses for metadata enrichment will be topic subjects, and it will be interesting to see how any enhancements to that field affect these stats. My colleague has also suggested that we take a look at numbers comparing subject elements with LCSH “–” separators to those without the concatenation. I’d imagine this to be similar overall to the “shared topic subject” breakdowns, since the more elements contained in that concatenated LCSH string, the more likely it is to be unique. But that is just a hypothesis at this point. Additionally, such a follow-up blog post would look at the effect of an object being hosted by the Digital Commonwealth (74% of objects) compared to only having its metadata harvested into the system (26% of objects). There is also a thought about somehow factoring the popularity of an overall collection into this analysis, since a popular collection would likely boost all of its member items regardless of metadata quality. Feel free to leave a comment on other interesting analyses that might be worthwhile to do.
If you found this blog interesting, feel free to share it! Thanks for taking the time to read this post and take care.