Steven Carl Anderson's Blog » Digital Commonwealth

The Importance of Regional Aggregation Hubs for Digital Collections Part 2

scande3 — Thu, 20 Apr 2017 02:46:43 +0000

Recap, Caveats, and Decisions

Last time I broke down some top level statistics comparing the traffic through the DPLA pipeline. As the notice above that post reads, there may be an issue with the DPLA numbers. It is worth noting that this isn’t directly the HTTP to HTTPS conversion: after all, none of the DPLA traffic for those statistics ever hits our site and the numbers come from DPLA’s Google Analytics. However, Michael Bitta of DPLA still feels there were broader statistical issues caused by some of their recent changes that could have affected their internal numbers as well.

With the above in mind, it is hard to do deep analysis of the dataset. It has become apparent that I don’t have it in me to complete a real “part 2” of this series… but that doesn’t mean that you can’t dig into the numbers as they exist! I’ve decided that I’ll provide the source data for anyone who might be curious.

All the qualifications from my part 1 blog post apply to these numbers. Additionally:

Not all items have a DPLA ID listed because those items had no clickthroughs or views in the report from DPLA. I could have still scripted a way to look them up but never implemented that piece. You should be able to get the DPLA ID from the DPLA API using the Digital Commonwealth PID though.
Some DPLA items show a clickthrough but no item view. This is not a bug. On the DPLA site, it is possible to click through to the place hosting an item without visiting the detailed item page on https://dp.la. Essentially, the lack of a view means the visitor did a search on DPLA and clicked through to the item at its source location directly from the search results view.

The Dataset Download

Download the dataset here: dpla_stats_2017_01_14-2017_03_14.xlsx

The Importance of Regional Aggregation Hubs for Digital Collections Part 1

scande3 — Wed, 05 Apr 2017 17:33:25 +0000

Disclaimer: From DPLA’s Michael Bitta, their stats may be off a bit. To quote the twitter exchange: “I guess the issue here is that we know that the mechanism that was tracking outlinks and exposing the referrer data was broken. We’re not 100% sure that that got fixed, so we would like to see a change in the data over time to verify. In any case, I think your general point about the value of hubs is is important, didn’t mean to distract from that conversation. In the nearish-term, we plan to add an interstitial redirection page to make sure we’re getting the best stats about outlinks.”

Background

Getting into the Digital Public Library of America (DPLA) is seen as an important goal by many institutions within the United States. At the Digital Commonwealth, it is one of the “carrots” we use to convince people to contribute to the Massachusetts regional repository (Digital Commonwealth itself). Some institutions have even managed to bypass their regional hubs and force DPLA to harvest directly from them. But I’d argue that ignoring the “middle man” is a mistake in this case and that everyone in the library world is vastly underestimating the value of a well run regional hub.

The Data Setup

What follows is a look at statistics from DPLA and Digital Commonwealth for the two month period of January 14, 2017 to March 14, 2017. That may appear to be an odd date range but the DPLA had only just re-harvested our system the previous week of that January. As such, at the time of generating these statistics, that range made the most sense to ensure an apples to apples comparison.

Furthermore, these statistics only use items that Digital Commonwealth has harvested itself. All hosted items from Digital Commonwealth have been removed. Why? To keep things similar once again and compare only records that both DPLA and Digital Commonwealth have only metadata for. To be more specific, metadata that they cannot directly control and must be placed into an aggregate system. Removing hosted items also mostly eliminates any traffic from repeat visitors of that object since what is hot-linked or cited is the page with the actual object on it.

I should clarify that none of these objects actually link to Digital Commonwealth. There is no “double dipping” as we provide DPLA with the link directly to the source object and do not force the traffic through our application. After all, it would be a horrendous user experience if one found an object on DPLA, clicked to go to Digital Commonwealth, and then had to click another link to get to the actual object one was looking for. So while we act like a “middle man”, we don’t steal any traffic from our members’ objects and there is no worry that DPLA clickthroughs are affecting the Digital Commonwealth numbers.

The final note is the sample size. There were 239,051 harvested objects that existed in Digital Commonwealth as of January 14th.

The Initial TLDR Takeaway

During that two month period, the following are the aggregate view statistics:

**Term Clarifications**:
“Total Views” column refers to people viewing the detail page of the item on their respective site over a two month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Views Per Item” column is Total Views divided by Total Records.
	Total Views	Total Records	Average Views Per Item
Digital Commonwealth	23,375	239,051	0.0978
DPLA	2,337	239,051	0.0098
Both Sources	25,712	239,051	0.1076

For a Massachusetts institution, we can see that about 10X as many of the eyeballs on its harvested digital objects come from Digital Commonwealth as from DPLA! To put this in perspective, it would take about 2 years for the harvested items in Digital Commonwealth to average one view each. For those same items in DPLA, we are looking at a 20 year timeframe.
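Those timeframes follow directly from the averages in the table. A quick sketch of the arithmetic, assuming the two-month view rate stays constant (the helper name is made up for this example):

```ruby
# Two-month average views per item, copied from the table above.
AVERAGE_VIEWS = {
  'Digital Commonwealth' => 0.0978,
  'DPLA'                 => 0.0098
}.freeze

# Hypothetical helper: years until an item averages one view,
# assuming the two-month rate holds steady.
def years_to_one_view(two_month_average)
  periods = 1.0 / two_month_average # number of two-month periods needed
  (periods * 2 / 12).round(1)       # two months per period, twelve months per year
end

AVERAGE_VIEWS.each do |source, average|
  puts "#{source}: about #{years_to_one_view(average)} years"
end
```

That works out to roughly 1.7 years for Digital Commonwealth and 17 years for DPLA, which I rounded to 2 and 20.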

But views are only one metric… what kind of traffic is this aggregation giving my digital repository? To understand that, we would need the metric known as clickthroughs, which I just so happen to have for you! The following chart takes a look at clickthroughs:

**Term Clarifications**:
“Total Clickthroughs” column refers to people who clicked on the link to go to the original source repository view page of the digital object over a two month period.
“Total Records” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system.
“Average Clickthroughs Per Item” column is Total Clickthroughs divided by Total Records.
	Total Clickthroughs	Total Records	Average Clickthroughs Per Item
Digital Commonwealth	4,495	239,051	0.0188
DPLA	343	239,051	0.0014
Both Sources	4,838	239,051	0.0202

The division here is even more drastic than it was previously, with Digital Commonwealth receiving over 13 times (roughly 1,300% of) the clickthroughs that DPLA did. That is more clickthroughs in Digital Commonwealth than views of this set of harvested items in DPLA! Interesting statistics, I think.

I’ll leave you with one final chart as a teaser for next time: what if we only include items that had both a subject topic and subject geographic in our views table? What impact does that have on our average?

**Term Clarifications**:
“Total Views w/ Topic and Geo” column refers to people viewing the detail page of an item containing both a subject topic and subject geographic on their respective site over a two month period.
“Total Records w/ Topic and Geo” column refers to the number of harvested items in Digital Commonwealth that DPLA has in its system and that have both a subject topic and a subject geographic.
“Average Views Per Item” column is Total Views w/ Topic and Geo divided by Total Records w/ Topic and Geo.
	Total Views w/ Topic and Geo	Total Records w/ Topic and Geo	Average Views Per Item
Digital Commonwealth	14,534	52,455	0.2771
DPLA	1,138	52,455	0.0217
Both Sources	15,672	52,455	0.2988

That is quite an improvement to our average! It shouldn’t be unexpected that records with better quality metadata end up being more discoverable. Despite being only 22% of the total records, this subset of items makes up 62% of the views in Digital Commonwealth and 49% of the views in DPLA! The numbers don’t lie: quality metadata matters. But are there any reasons why the increase was slightly more dramatic in Digital Commonwealth compared to DPLA? The setup to this post ate up more time than I expected so we will delve into more breakdowns and analysis of these numbers next time!
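For anyone wanting to check the percentages, here is a small sketch that derives them from the two views tables in this post. The constant and method names are my own for illustration:

```ruby
# Figures copied from the two views tables in this post.
ALL_ITEMS = { records: 239_051, dc_views: 23_375, dpla_views: 2_337 }.freeze
TOPIC_GEO = { records: 52_455,  dc_views: 14_534, dpla_views: 1_138 }.freeze

# Share of the full set that the topic + geographic subset represents,
# rounded to a whole percent.
def percent_share(subset, total)
  (100.0 * subset / total).round
end

ALL_ITEMS.each_key do |key|
  puts "#{key}: #{percent_share(TOPIC_GEO[key], ALL_ITEMS[key])}%"
end
```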

Final Notes and Part 2

These numbers do seem small but I want to stress that this doesn’t include the hosted items from Digital Commonwealth. For example, during a two month period, Digital Commonwealth would have nearly 200,000 views of its 214,138 hosted objects, which is quite a step up from the harvested statistics. However, much of that difference would be from “repeat bookmarked / shared / etc” traffic… and I just want to state: don’t evaluate the entire system based on this subset comparison. Blog posts about some of the other Digital Commonwealth comparisons will come in the future.

Additionally, the DPLA numbers are only a subset of one hub in their system and overall are much more impressive when looking at their system as a whole. We are but one small waterway feeding into their large ocean.

Lastly, I feel I should mention that the above is more of a “team discoverability effort”. As a DPLA service hub, the benefit of regional hubs like Digital Commonwealth is just part of what one gets by being in the Digital Public Library of America infrastructure. This analysis just breaks down the DPLA pipeline to its component parts. Hopefully this analysis is useful for showing that it isn’t to one’s benefit to attempt to be a special snowflake that skips the smaller regional aggregation and that regional aggregation systems should have a user facing front-end. The latter is something that I feel I need to stress since so many DPLA hubs just view the goal as getting the data to DPLA (i.e. don’t build a front-end) and fail to offer their members the boost that local aggregation can have on harvested items’ discoverability.

Part 2 will take a look at the more granular level of this dataset. For example, what items get the most views in both systems? Do those most viewed items share similarities? What about clickthrough similarities? What effect do subjects have on the average views and clickthroughs of an item? Etc.

Software Engineer Interview Experiences

scande3 — Sun, 26 Mar 2017 19:43:53 +0000

Background

So… last Friday (March 24th) was sadly my last day with the Boston Public Library. I start at Akamai tomorrow (March 27th) as a Senior Software Engineer. I will deeply miss the Boston Public Library and the work my colleague and I completed building the Digital Commonwealth system. But unfortunately all things must end and I felt it was time to move on to a new opportunity. The following are some interview experiences that I believe might be interesting from my recent search.

The New HR Phone Screen Question

A 30 minute initial call from an HR representative has been in practice for quite some time now. However, what I did find new was how aggressively one will be required to give one’s salary requirements at this time. Nearly every place I applied required me to state the salary I was looking for and refused to allow me to weasel out of giving a number.

From the HR point of view, this is a smart move. With the interviewee not having interacted with the team yet, the interviewee loses the leverage of being the “first choice” in such a negotiation. It doesn’t matter how amazing of a match one is for the job or how much unexpected value above and beyond the job posting one might bring since those aren’t the HR representative’s primary concern. They are focused on filling X position for Y salary and they can make the cold calculation to eliminate the candidate up front by forcing this simple question.

I’m quite unsure of how to best handle the moving of the salary negotiation from the end of the interview process up to the forefront. A talk on negotiating salary was given recently at Code4Lib but lacked any advice on this situation. For example, does one give a low number to keep oneself in the running and then surprise them with a higher requirement at the end? After all, I had a few HR representatives from potential positions thank me for my time once my number was outside of their range – including one job that they won’t likely ever find a better match for. In these cases, they mostly didn’t even bother to ask if I might be willing to work for less than the number I gave. Or does one start off with a high number on the expectation they could offer a slight bit below that in the end and just give up on those that one will be eliminated from immediately? Beyond just doing one’s research and being prepared to give a number, I can’t really give further advice on how to handle this.

Assembly Line Programming

Let’s say one had a questionnaire with responses about where people live. If one wanted a report for all those who responded from Massachusetts, one would want to include those that responded with “Massachusetts” obviously. But, due to hierarchy, one would also want to include those that were more specific and answered “Boston” or “Cambridge”. That makes sense, right?

So when I was to do a code sample for a medical crowdsourcing site, I quickly saw what I wanted to do. Their site had little knowledge of medical hierarchy, which looked to be an obvious flaw. Those that responded with a symptom of “Myocardial Infarction” (commonly known as Heart Attack) were the only ones being included on a report for that symptom. Those that reported more specific versions of “Heart Attack” like “Non-STI Myocardial Infarction” would not show up. Similar to the location example above, this seemed to be a problem. Beyond coding up a sample of how to fix this, I came armed to talk about it during the in-person interview.
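To make the idea concrete, here is a minimal sketch of hierarchy-aware reporting. This is not the code sample I submitted. The parent links and method names are made up for illustration, and a real system would load the hierarchy from a controlled vocabulary:

```ruby
# Hypothetical parent links: each term points at its broader term (nil for a root).
BROADER_TERM = {
  'Myocardial Infarction'         => nil,
  'Non-STI Myocardial Infarction' => 'Myocardial Infarction',
  'Massachusetts'                 => nil,
  'Boston'                        => 'Massachusetts',
  'Cambridge'                     => 'Massachusetts'
}.freeze

# True when `term` is `ancestor` itself or sits anywhere beneath it.
def term_within?(term, ancestor)
  until term.nil?
    return true if term == ancestor
    term = BROADER_TERM[term] # walk one level up; unknown terms end the walk
  end
  false
end

# Keep every response that falls under the requested report term.
def responses_for(responses, report_term)
  responses.select { |response| term_within?(response, report_term) }
end

responses = ['Boston', 'Massachusetts', 'Non-STI Myocardial Infarction', 'Cambridge']
p responses_for(responses, 'Massachusetts')
p responses_for(responses, 'Myocardial Infarction')
```

The same walk up the tree serves both the location example and the symptom example, which is why the two looked like the same problem to me.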

Yet throughout the in-person interview process, I was the only one who considered this an issue. The argument against it generally boiled down to the fact that they got their requirements from a group of medical scientists. I seriously cannot figure out a counter-argument beyond that, since they could fix their system with about a week of a single developer’s time and make the data much more usable. Further questioning revealed that the conferences developers there generally attended were devoid of anything to do with the medical field. They were hired to implement checkboxes from a requirements document – to handle just the translation from pages of paper to digital ones and zeroes. Ability to understand what it is they are implementing and taking action to make the system work well? Not a desired skill.

Of course, perhaps I’m incorrect in my assessment of how medical hierarchy can be used and there is some valid reason that makes it different than my initial location analogy. I cannot come up with one and it frustrates me to this day how it makes so much sense to me but how much it bombed talking about it with people at that organization. (There were other things wrong with that place as well. Such as being told that making their data shareable wasn’t a priority they cared about since others can just deal with their custom format, how they didn’t like to collaborate with others in the medical world because they viewed themselves as doing things better than everyone else, etc).

This wasn’t just a one-off situation but rather the norm. Another position I interviewed for would have dealt with anti-virus protection. When I asked what conferences the team lead interviewing me attended, he seemed to be confused. I repeated the question and clarified that I meant security conferences to keep up on the latest trends, and he responded that he didn’t attend any. Developers there just don’t go to those. This lack of specialist knowledge really showed in their answer to how their product differed from the competition. Both he and another person I interviewed with touted a single killer feature that they claimed no other product on the market had. Except that feature had already been added to other security products recently – including one suite my girlfriend sells. I didn’t bother to correct their lack of domain knowledge.

This trend towards an assembly line where developers aren’t expected nor desired to understand the product is disturbing to me. I will never understand how one can create anything truly exceptional from such a philosophy. Of course, part of that is fueled by my own preferences that bias my opinion, but valid improvements can come from more than just product managers or researchers understanding the domain.

Government Employees Being Viewed As Lazy

In one phone screen, I was asked if I thought I was up to working at a startup. That I would have to “actually do work” and “really get things done” which is different than working for the government.

At another job, when I asked what advice they had about working at their company, an interviewer asked me to remind him where I was currently employed. After I responded that I was employed by the Boston Public Library, he went into a spiel about how it is easy to just “blend in without doing much work” at such a place. In a similar theme to the above, he stated how one “is responsible for making things work since no one else will for you”.

Sitting through both of those talks was nothing short of infuriating. It didn’t help that I could have presented my background better by linking it to terms they knew – a flaw my girlfriend pointed out. She advised that I should have pointed out whenever possible that my position was much like an early stage startup and that my colleague and I had to wear many hats and put in quite a great deal of work to get things off the ground. Regardless, it seems that the war on the competence of government employees has hit even the liberal bastion of Boston, Massachusetts. The assumption is that being a Software Engineer who chooses to do public service must mean that was the only job one could land with one’s limited skills or that one must just be lazy. By remaining in the public sector, I’m viewed as being less competent by private sector peers, which impacts how one will be judged during an interview and will make taking the next step in one’s career more difficult.

Hired.com

Beyond normal interview sites, I did give hired.com a try. I ended up getting four interview offers from the site that I would not have found on my own. Essentially a quick note that it seems to work well as an option to try even if it didn’t end up leading to an offer myself. (The job I ended up accepting was referred to me by a friend of a friend).

Accessing Excel Spreadsheet Files for Batch Uploads of Digital Objects:

scande3 — Thu, 28 May 2015 21:16:35 +0000

As it was asked how we handle reading content from Excel, this is a very quick blog post that goes over what we do for that. First you will need the following added to your Gemfile and then bundle install:

gem 'roo', :git => 'https://github.com/roo-rb/roo'
gem 'roo-xls', :git => 'https://github.com/roo-rb/roo-xls.git'

Now that we have the library that will allow us to read the spreadsheet, we can go ahead and setup a variable to hold the content of the spreadsheet. This assumes you have a “sheet_location” variable set that indicates where the file you are trying to read lives (be it uploaded or not) and assigns the content to @worksheet:

# Pick the roo reader based on the file extension (case-insensitive)
case File.extname(sheet_location).downcase
when '.xlsx'
  @worksheet = Roo::Excelx.new(sheet_location)
when '.xls'
  @worksheet = Roo::Excel.new(sheet_location)
when '.csv'
  @worksheet = Roo::CSV.new(sheet_location)
when '.ods'
  @worksheet = Roo::OpenOffice.new(sheet_location)
else
  raise ArgumentError, "Unsupported spreadsheet type: #{sheet_location}"
end

@worksheet.default_sheet = @worksheet.sheets.first #Sets to the first sheet in the workbook

The next thing we want to do is grab the header row that has our column headers. At the BPL, this is the third row in the spreadsheet (previous rows are for notes). As such, with the library using “1” as its first row, we would get the header row via the following:

header_row_index = 3
@header_row = @worksheet.row(header_row_index)

From here, I loop through each data row in my spreadsheet (which starts at index 5 for us) and pass that row value along with the header row to a method to process that row. It looks something like:

data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  if row.present? && @header_row.present?
    begin
      process_a_row(@header_row, row)
    rescue StandardError => e
      # Exception handling for when we encounter bad data...
    end
  end
end

Now we have each row in our spreadsheet being processed! But… how do we access each individual cell? In our case, our spreadsheet template has over 150 possible headers and having a spreadsheet with every header becomes unwieldy. As such, each one has some combination of potential headers and the order of those headers in the spreadsheet is not guaranteed. So we end up with something like the following to get the value of “title” out of our spreadsheet:

def process_a_row(header_row, row_value)
  # ...
  title = find_in_row(header_row, row_value, 'title_primary')
  # ...
end

Essentially this is calling a method called “find_in_row” from within the “process_a_row” method and adds a third argument of the row header identifier we are using to find that data element. The “find_in_row” method then looks like:

def find_in_row(header_row, row_value, column_identifier)
  # Array#index returns nil when no header matches, so a missing column yields nil
  row_pos = header_row.index(column_identifier)
  return nil if row_pos.nil?

  strip_value(row_value[row_pos])
end
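Since the header row is the same for every data row, one alternative is to build the header-to-position lookup once and reuse it. A minimal sketch, assuming each header appears only once. Note that `build_header_index` and `value_for` are names made up for this example and are not part of roo or our code base, and the call to strip_value is left out so the snippet stands alone:

```ruby
# Map each header name to its column position (first occurrence wins).
def build_header_index(header_row)
  header_row.each_with_index.each_with_object({}) do |(header, position), index|
    index[header] ||= position unless header.nil?
  end
end

# Fetch the cell under `column_identifier`, or nil when the column is absent.
def value_for(header_index, row_value, column_identifier)
  position = header_index[column_identifier]
  position.nil? ? nil : row_value[position]
end

# Made-up sample rows; the blank header mimics an unused spreadsheet column.
header_row = ['title_primary', nil, 'date_created']
row_value  = ['A Map of Boston', 'ignored', '1850']
header_index = build_header_index(header_row)

p value_for(header_index, row_value, 'title_primary')
p value_for(header_index, row_value, 'subject_topic')
```

With 150 possible headers this mostly buys readability rather than speed, so treat it as an option and not a correction.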

This has another new method: strip_value. The plan is to move this function into “Bpl_Enrich” in the near future but essentially this is to return our data elements as UTF-8 strings. The code for this looks like:

def strip_value(value)
  return nil if value.nil?

  if value.is_a?(Float)
    value = value.to_s
    value = value.gsub(/\.0\z/, '') #FIXME: Temporary. Bugged as see: https://github.com/roo-rb/roo/issues/86 , https://github.com/roo-rb/roo/issues/133 , https://github.com/zdavatz/spreadsheet/issues/41
  elsif value.is_a?(Integer) # Fixnum was folded into Integer in Ruby 2.4
    value = value.to_i.to_s #FIXME: to_i as otherwise non-existent values cause problems
  end

  # Make sure it is all UTF-8 and not character encodings or HTML tags and remove any carriage returns
  utf8Encode(value)
end

def utf8Encode(value)
  return HTMLEntities.new.decode(ActionView::Base.full_sanitizer.sanitize(value.to_s.gsub(/\r?\n?\t/, ' ').gsub(/\r?\n/, ' '))).strip
end

We can now repeat the calls to “find_in_row” within the “process_a_row” method for each of our data elements and insert them into our system as needed. But what do we do for multi-valued fields? We use a delimiter of a double pipe (“||”) to separate values in those cases. For example, if our title was allowed to be multi-valued, we could have the above return “title1||title2”. There is then another helper function to convert that into an array as the following:

def split_with_nils(value)
  if(value == nil)
    return ""
  else
    split_value = value.split("||")
    0.upto split_value.length-1 do |pos|
      split_value[pos] = strip_value(split_value[pos])
    end

    return split_value
  end
end

Why do I return “” on the nil case? To make processing easier for related pairs of multivalued column fields when doing indexing without an extra logic check in the inserts of “process_a_row”. As a full example, assume we have titles of “title1||title2||title3” and title types of “primary||||alternative” (i.e. the second title lacks a type in this made up example of bad MODS data). You would do something like:

title = find_in_row(header_row, row_value, 'title')
if title.present?
  title_list = split_with_nils(title)
  title_type_list = split_with_nils(find_in_row(header_row, row_value, 'title_type'))
  0.upto title_list.length-1 do |pos|
    @digital_object.descMetadata.insert_title(title_list[pos],title_type_list[pos])
  end
end

Of course, in the insert title method, you would need to check for blank values to not insert the empty title type value for the second title in our list. If there was no “title_type” specified at all as that was omitted as an optional field, our function would still work as indexing an empty string (the returned “” from split_with_nils) would just give us all blank values for the title_type as we index through it.
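To see the pairing behavior without the rest of our stack, here is a stripped-down sketch in plain Ruby. `split_values` is a simplified stand-in for split_with_nils (it skips the HTML and encoding cleanup), and `pair_titles` is a name invented for this example:

```ruby
# Simplified stand-in for split_with_nils: "" for nil, otherwise an array of values.
def split_values(value)
  return '' if value.nil?

  value.split('||').map(&:strip)
end

# Pair each title with the type at the same position (nil or "" when missing).
def pair_titles(titles, title_types)
  return [] if titles.nil?

  title_list = split_values(titles)
  type_list  = split_values(title_types)
  title_list.each_index.map { |pos| [title_list[pos], type_list[pos]] }
end

p pair_titles('title1||title2||title3', 'primary||||alternative')
p pair_titles('title1||title2', nil)
```

In the first call the second title is paired with an empty string. In the second call every type comes back as nil, because indexing into “” by position returns nil in Ruby.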

I hope this was somewhat useful, not a completely crazy approach, and made some sense. Take care!

A Further Look at Metadata’s Affect on Discoverability

scande3 — Fri, 17 Apr 2015 19:11:08 +0000

This post continues a look at the effect metadata has on the number of views an object receives. Part 1 can be found at: http://scande3.com/2015/04/effect-of-metadata-subjects-on-a-digital-objects-discoverability/. The same criteria from part 1 still apply to these stats and those overall global rules are:

A six month time period from October 1st, 2014 until March 31st, 2015 for objects that existed in the repository before December 31st, 2014.
The listed view counts come from the Google Analytics API and reflect views on the object’s main result page only.

The first exciting match puts “LCSH pre-coordinated” subject topics against those that lack the “–” concatenation.

LCSH Topic Subject Comparisons

**Term Clarifications**:
“LCSH Style Topic Subject Objects” column are items that all of their subjects have “–” in them.
“Non-LCSH Style Topic Subject Objects” column are items with all subjects lacking “–” in them.
“Mixture of Both Styles Topic Subject Objects” column is an item with at least one subject with “–” and at least one subject without “–”.
	LCSH Style Topic Subject Objects	Non-LCSH Style Topic Subject Objects	Mixture of Both Styles Topic Subject Objects
Total Records	8,119	122,353	8,111
Average Views	0.938	1.570	1.243
Percent with 1+ Views	30.8%	42.9%	40.5%

It would appear the hypothesis in the previous analysis post is correct: normalized non-LCSH style subjects soundly defeat those items that use the concatenation. But there is a notable asterisk to this victory in that the number of objects using LCSH style subjects is significantly smaller. Natively in the Digital Commonwealth system, we do not generally pre-coordinate LCSH subjects as our “best practice” and thus that policy decision has an effect on how metadata was done for the vast majority of items. That doesn’t mean we don’t use “complex subjects” in LCSH that represent a complete topic. For example, we do have “best practice” objects that use “United States–History–Civil War, 1861-1865” as that is the single Library of Congress topic entry that defines that war. The majority of these cases are in the “Mixture” category in the above table. For example, the item I looked at with that string also had the subjects of “Monuments & memorials” and “Churches”.

Back on topic, this means the vast majority of “LCSH Style” topic subjects came to us from metadata sources we do not control. That would namely be OAI feeds from institutions that use the “pre-coordinated LCSH Subjects” as their metadata practice and that we were unable to break up on our end. This is an important note as these records coming from a series of uniform minority sources in the system could indicate other factors are at play for these numbers. Taking into account the numerous potential factors (such as quality of source metadata in other areas of the record or how interesting the items are) is mostly beyond the scope of this blog post. I will provide a breakdown of these items comparing those that have a topic subject but no geographic subject to those objects that do contain that geographic subject. I must add a caveat, though, that this breakdown does make the numbers much more volatile as the number of records in a category becomes quite limited and thus I’d avoid coming to conclusions from these:

**Term Clarifications**:
“LCSH Style Topic Subject Objects (no Geographic)” column are items that all of their subjects have “–” in them and no geographic topic.
“Non-LCSH Style Topic Subject Objects (no Geographic)” column are items with all subjects lacking “–” in them and no geographic topic.
“Mixture of Both Styles Topic Subject Objects (no Geographic)” column is an item with at least one subject with “–” and at least one subject without “–” and no geographic topic.
	LCSH Style Topic Subject Objects (no Geographic)	Non-LCSH Style Topic Subject Objects (no Geographic)	Mixture of Both Styles Topic Subject Objects (no Geographic)
Total Records	549	59,328	1,224
Average Views	1.741	0.829	1.622
Percent with 1+ Views	52.6%	28.9%	53.1%

**Term Clarifications**:
“LCSH Style Topic Subject Objects (with Geographic)” column are items that all of their subjects have “–” in them and at least one geographic topic.
“Non-LCSH Style Topic Subject Objects (with Geographic)” column are items with all subjects lacking “–” in them and at least one geographic topic.
“Mixture of Both Styles Topic Subject Objects (with Geographic)” column is an item with at least one subject with “–” and at least one subject without “–” and at least one geographic topic.
	LCSH Style Topic Subject Objects (with Geographic)	Non-LCSH Style Topic Subject Objects (with Geographic)	Mixture of Both Styles Topic Subject Objects (with Geographic)
Total Records	7,570	63,025	6,887
Average Views	0.880	2.267	1.176
Percent with 1+ Views	29.2%	56.1%	38.2%
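As a sanity check, the two breakdown tables should recombine into the overall averages from the first table when each subgroup is weighted by its record count. A rough sketch of that check, using figures copied from the tables (the constant and method names are my own):

```ruby
# [total records, average views] per style, copied from the two tables above.
NO_GEOGRAPHIC = {
  'LCSH'     => [549,    1.741],
  'Non-LCSH' => [59_328, 0.829],
  'Mixture'  => [1_224,  1.622]
}.freeze

WITH_GEOGRAPHIC = {
  'LCSH'     => [7_570,  0.880],
  'Non-LCSH' => [63_025, 2.267],
  'Mixture'  => [6_887,  1.176]
}.freeze

# Record-weighted mean of the two subgroup averages for one style.
def combined_average(style)
  groups = [NO_GEOGRAPHIC[style], WITH_GEOGRAPHIC[style]]
  total_records = groups.sum { |records, _| records }
  total_views   = groups.sum { |records, average| records * average }
  (total_views / total_records).round(3)
end

NO_GEOGRAPHIC.each_key { |style| puts "#{style}: #{combined_average(style)}" }
```

The recombined values match the 0.938, 1.570, and 1.243 reported earlier. The LCSH column is dominated by its 7,570 records with a geographic subject, which is why its overall average sits so close to that group’s 0.880 despite the much higher average for the 549 records without one.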

OAI Harvested (Metadata Only Records) vs Hosted Native Records

As a DPLA Hub, we offer both hosted and harvesting options for our member institutes. The majority of our content is hosted directly in the system and those items that are ingested almost always go through our Digitization department where the metadata is often either cleaned up, created by, or given advice on its creation by our Metadata Mob. This then theoretically creates much more uniform metadata for our system that will play well with other objects when searching or faceting. Meanwhile, while we do enrichment on metadata from OAI feeds, we often have much less control over the policies those institutions implement (such as the previous topic subject differences). As such, the variation on the standards used for those objects is likely much greater. This table hopes to quantify that difference… but does have one huge flaw. In cases of an OAI Harvested metadata item, we provide DPLA with a direct link to that object in its native system rather than forcing the user to go through Digital Commonwealth first. As such, OAI Harvested objects below will be missing statistics on those views, and the DPLA is one of our top sources for referral traffic. (As an aside, the site we get the most traffic directed to us from is Facebook).

**Term Clarifications**:
“Hosted Object” column are items that live natively in our system and have the image/audio/video file content in our Fedora Commons repository.
“OAI Harvested Object” column are items that we ingest from a member institute’s repository and contain metadata and a thumbnail image only.
	Hosted Object	OAI Harvested Object
Total Records	108,282	37,801
Average Views	1.921	0.364
Percent with 1+ Views	51.7%	15.0%

The results are as I would expect. I wish I could tell how much of an effect the loss of the DPLA traffic on the stats for the OAI Harvested records had on these results. Still: it does seem highly likely that the uniformity of the metadata does have an effect on how often an object is discovered in our shared system.

The Fourth Dimension!

While knowing where a record is from and what it is about is quite important, I haven’t talked about the “when” aspect. I decided I’d run some quick stats that look at how having a date on a particular item might increase the findability of an item.

**Term Clarifications**:
The “Objects with a Date” column contains items with at least one date that is valid and not “unknown”.
The “Objects with No Date” column contains items that either have no date or have a date of “unknown”.
	Objects with a Date	Objects with No Date
Total Records	143,097	2,986
Average Views	1.500	2.405
Percent with 1+ Views	41.9%	55.6%

The good news: 98% of our records have a date associated with them! That is actually higher than I would have expected. More objects in our system have a date than have a subject!

The bad news? This means I don’t have a large enough group of “no date” items to figure out what effect a date might have on the views of an object. From the stats above, it would seem that objects without a date have a significantly higher view average than those with a date, which does not make logical sense. So while the above table shows the actual stats, the only sense I can make of it is that the individuals creating these records are doing an awesome job of adding dates.
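To put the size of the “no date” group in perspective, here is a small arithmetic sketch in Python using only the figures from the table above. The variable names are my own, and because the table's averages are rounded to three decimals the results are approximate.

```python
# Figures copied from the date table above.
dated = {"records": 143_097, "average_views": 1.500}
undated = {"records": 2_986, "average_views": 2.405}

total_records = dated["records"] + undated["records"]
dated_share = dated["records"] / total_records

# Average views across both groups, weighted by group size.
overall_average = (
    dated["records"] * dated["average_views"]
    + undated["records"] * undated["average_views"]
) / total_records

print(f"records with a date: {dated_share:.1%}")
print(f"overall average views: {overall_average:.3f}")
print(f"shift caused by undated records: {overall_average - dated['average_views']:+.3f}")
```

The undated group is so small that, despite its higher average, it barely moves the overall figure.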

Conclusion

It would appear the “exploded LCSH” or “non pre-coordinated LCSH” topic subject items are more discoverable in our system. However, it also appears likely that uniformity of metadata increases the odds of an object being discovered, so the result could simply stem from that style being the primary policy we implemented for topic subjects. It would be interesting to see the same subject analysis that has been run here run over all of the DPLA data to see if the same patterns hold up in an even larger pool of objects.
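For anyone unfamiliar with the term, “exploding” a pre-coordinated LCSH heading means splitting it on its “--” separators into individual terms. A minimal Python sketch of the idea follows; the function name and the sample heading are illustrative assumptions, not code or data from the Digital Commonwealth system.

```python
from typing import List


def explode_lcsh(heading: str) -> List[str]:
    """Split a pre-coordinated heading on "--" into its component terms."""
    # Strip whitespace around each part and drop any empty fragments.
    return [part.strip() for part in heading.split("--") if part.strip()]


# Hypothetical heading, used only to show the shape of the output.
print(explode_lcsh("Railroads--Massachusetts--Boston--History"))
```

Each resulting term is far more likely to match a term used by another institution than the full concatenated string is.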

Thanks for reading once again! Next time will likely be a move away from stats and on to another aspect of the Digital Commonwealth system. Take care!

Effect of Metadata Subjects on a Digital Object’s Discoverability

scande3 — Wed, 08 Apr 2015 06:14:01 +0000

Inspired by Mark E. Phillips’s series of blog posts analyzing subject metadata from DPLA hubs and a conversation with Corey Harper at code4lib about an analysis he is doing on DPLA data, I decided to do some statistical digging into the Digital Commonwealth system. In particular: what effect does subject level metadata have on how discoverable an object is in a repository shared by over 100 institutions?

The following details apply to all of the tables below:

A six month time period from October 1st, 2014 until March 31st, 2015 for objects that existed in the repository before December 31st, 2014.
The listed view counts come from the Google Analytics API and reflect views on the object’s main result page only. An example of this page would be: http://arktest.bpl.org/ark:/50959/zs25x877c (of course, linked to the test server to prevent future object statistics from being manipulated by this post! :p).

My Initial Subject Stats Gathering Attempt Chart

**Term Clarifications**:
“Topic Subject Only” column has records that have no geographic subject element.
“Geographic Subjects Only” column has records that have no topic subject element.
“Top 5 Average” row is just the average views for the top 5 individual records of that category.
	No Subjects	Topic Subject Only	Geographic Subjects Only	Both Topic and Geographic
Total Records	3,958	61,101	3,438	77,482
Average Views	1.098	0.853	2.170	2.035
Percent with 1+ Views	42.1%	29.6%	48.9%	51.9%
Top 5 Average Views	44	161	71	394
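For anyone wanting to reproduce rows like these from their own data, here is a minimal Python sketch. It assumes you already have a plain list of per-record view counts for one category; the function name, the rounding choices, and the sample numbers are my own assumptions, not the script actually used for this post.

```python
from statistics import mean


def summarize_views(view_counts, top_n=5):
    """Return the four table rows for one category of records."""
    if not view_counts:
        raise ValueError("category has no records")
    # The most-viewed records in the category, at most top_n of them.
    top = sorted(view_counts, reverse=True)[:top_n]
    viewed = sum(1 for views in view_counts if views > 0)
    return {
        "total_records": len(view_counts),
        "average_views": round(mean(view_counts), 3),
        "percent_with_views": round(100 * viewed / len(view_counts), 1),
        "top_n_average_views": round(mean(top)),
    }


# Hypothetical view counts for a tiny category of eight records.
sample = [0, 0, 3, 1, 0, 12, 0, 2]
print(summarize_views(sample, top_n=3))
```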

As we can see, Geographic Subjects and a combination of Topic / Geographic Subjects both do great things to our numbers! The average views in these cases show a strong increase over records lacking those elements, and the percentage of records with at least one view goes up by a significant amount. Records with only a geographic subject have a low “Top 5 Average” but that could potentially be due to the very limited number of records in the system that fall in that category. But… Topic Subjects… what happened to you?!?! I had positive expectations for you! Why are you dragging your records into the darkness of oblivion? Only 29.6% of records were ever viewed even once and the average views are a paltry 0.853 per object… both significantly lower than records with no subjects at all!

I pondered this for a short while and began to form a theory on why the results came out as they did. To start with, in the Digital Commonwealth system, we link virtually all Geographic Subjects to the TGN Controlled Linked Data Vocabulary using a gem known as Geomash. This means that all of those records are in the same hierarchical structure using the same place terms – which is absolutely awesome for faceting! Meanwhile, our topic subjects currently benefit from no such structure. As Mark E. Phillips’s analysis showed, our system has a bunch of “unique subjects” and those are… less awesome for faceting. So I decided I needed to “stats smarter” and account for how widely shared the different subjects on a record are. Thus the following methodology was introduced for the remaining charts:

For a digital object record, I would keep a count of the number of topic subjects that record contained that were shared with at least one of the other 113 institutions in the Digital Commonwealth system.
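The counting rule above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the script used for this post: the record layout, the field names, and the exact string matching of subjects are all hypothetical.

```python
from collections import defaultdict


def count_shared_subjects(records):
    """For each record, count topic subjects also used by another institution."""
    # First pass: note which institutions use each topic subject.
    institutions_by_subject = defaultdict(set)
    for record in records:
        for subject in set(record["topics"]):
            institutions_by_subject[subject].add(record["institution"])

    # Second pass: a subject is "shared" when some other institution uses it.
    counts = {}
    for record in records:
        counts[record["id"]] = sum(
            1
            for subject in set(record["topics"])
            if institutions_by_subject[subject] - {record["institution"]}
        )
    return counts


# Hypothetical records; the IDs, institutions, and subjects are invented.
records = [
    {"id": "a", "institution": "Library A", "topics": ["Lighthouses", "Harbors"]},
    {"id": "b", "institution": "Library B", "topics": ["Lighthouses"]},
    {"id": "c", "institution": "Library A", "topics": ["Harbors", "Fishing boats"]},
]
print(count_shared_subjects(records))
```

A count of zero corresponds to the “unique” column in the charts that follow, and counts of three or more fall into the “3+” column.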

My Smarter Chart (Objects with a Topic Subject and no Geographic Subject)

**Term Clarifications**:
“Percent of Topic Only” row refers to the percentage that the broken up objects make up of the original table’s “Topic Subject Only” column.
	Unique Topic Subject Objects	One Topic Subject Shared By Multiple Institutions	Two Topic Subjects Shared By Multiple Institutions	3+ Topic Subjects Shared By Multiple Institutions
Total Records	1,140	34,528	16,077	9,356
Percent of Topic Only	1.9%	56.5%	26.3%	15.3%
Average Views	0.419	0.503	1.233	1.542
Percent with 1+ Views	17.9%	19.2%	40.8%	50.1%
Top 5 Average Views	13	122	155	99

This… looks a bit better and seems to validate that non-unique subjects have a positive effect on a digital object being discovered. As one adds more shared subjects, the average number of views an item could expect to receive increases. However, I find it interesting that the highest top 5 record average went to the “exactly two shared topics” category. Why is this the case?

As valid as any other explanation I could give.

In seriousness, I am far from a statistician so this blog is mostly a raw data dump with my limited perspective on what it could mean. In keeping with hopefully offering interesting data, it would be neat to have that same breakdown for objects with both subject topics and subject geographical elements to see if the pattern holds up, right? As such, I now present:

The “Going Above and Beyond” Chart (Objects with both a Topic and Geographic Subject)

**Term Clarifications**:
“Percent of Topic Only” row refers to the percentage that the broken up objects make up of the original table’s “Both Topic and Geographic” column.
	Unique Topic Subject Objects w/ Geographic	One Topic Subject Shared By Multiple Institutions w/ Geographic	Two Topic Subjects Shared By Multiple Institutions w/ Geographic	3+ Topic Subjects Shared By Multiple Institutions w/ Geographic
Total Records	352	32,009	16,823	28,298
Percent of Topic Only	0.5%	41.3%	21.7%	36.5%
Average Views	1.878	1.835	1.959	2.307
Percent with 1+ Views	19.6%	46.8%	76.9%	55.9%
Top 5 Average Views	31	308	135	284

It would appear that when a geographic element exists, the effect on the average views is less pronounced than when that element is missing. The trend is still an overall increase in the average views an object will receive, but the percentage jump from adding each shared subject is smaller. There are also some additional oddities, such as two shared topic subjects giving an unusual boost to a record’s chance of having been viewed at least once. I double checked the numbers to make sure, and the “exactly two” category continues to taunt me.

It would appear the “Top 5 Average Views” category doesn’t show much of a trend in terms of what leads to our most popular items. My colleague said that statistic was not useful and overall just confusing, and it appears I must concede that he can do a victory dance. One cannot even easily claim that it is mainly related to the number of records that fall into a category, as there is a large exception to that pattern in the “Topic Subject Breakdown” chart. I still included it in all of these charts because I find it interesting that the availability of metadata did not seem to have a consistent effect on what might be considered a “viral success” in our system. If I include an attempt to catch the “cream of the crop” statistic in the future, I would likely increase this to “top 50 average views” to see if that behaves more in the manner I would expect and smooths out some of the extreme variance.
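To illustrate why a larger window should smooth things out, here is a small Python sketch with invented view counts, where one “viral” record sits among otherwise ordinary ones. The helper name and the numbers are hypothetical and are not drawn from the Digital Commonwealth data.

```python
def top_n_average(view_counts, n):
    """Average the views of the n most-viewed records in a category."""
    if not view_counts or n < 1:
        raise ValueError("need at least one record and n >= 1")
    top = sorted(view_counts, reverse=True)[:n]
    return sum(top) / len(top)


# Invented category: one viral record, nine popular ones, ninety ordinary ones.
views = [1500] + [40] * 9 + [10] * 90

print("top 5 average:", top_n_average(views, 5))
print("top 50 average:", top_n_average(views, 50))
```

With only five records in the window, a single outlier dominates the average; widening the window dilutes it considerably.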

Conclusion

Overall it would appear that Geographic Subjects are more important to have than Topic Subjects on a record. Comparing the average views of even a highly shared Topic Subject Only record against a record with just a Geographic Subject still favors the latter. Additionally: more shared subjects had only a relatively limited effect on records with a Geographic element (although an overall pattern of increased findability was observed). Whether this is due to the extra effort we put in to enhance the geographic aspect of our metadata by programmatically linking it to a controlled vocabulary, or whether that is just a more useful facet, is hard to tell. We would have to do more analysis on user behavior to add that element to this equation and may end up taking a look at that in the future.

One of our next focuses for metadata enrichment will be topic subjects and it will be interesting to see how any enhancements to that field affect these stats. My colleague has also suggested that we take a look at numbers comparing subject elements that use LCSH “–” separators against those without the concatenation. I’d imagine this to be similar overall to the “shared topic subject” breakdowns, since the more elements contained in a concatenated LCSH string, the more likely it is to be unique. But that is just a hypothesis at this point. Additionally, such a followup blog post would look at the effect of an object being hosted by the Digital Commonwealth (74% of objects) compared to only having its metadata harvested into the system (26% of objects). There is also a thought about somehow factoring the popularity of an overall collection into this analysis, since a popular collection would likely boost all of its member items regardless of metadata quality. Feel free to leave a comment on other interesting analysis that might be worthwhile to do.

If you found this blog interesting, feel free to share it! Thanks for taking the time to read this post and take care.