Caching and Metadata Enrichment Interfaces (Part 1 – Alpha)

You are likely here after seeing Eben's and my presentation on RDF and are interested in some of the more advanced coding topics. Sadly, I'm still scrambling to get things ready, so this will be done in parts over the next few days here at code4lib. There is also no denying that this takes a Rails / Hydra ecosystem viewpoint… but the design outlined here could be replicated in other systems. This first part goes over the following two items:

  1. Some initial Rails Linked Data Fragments setup with caching.
  2. A live demo site of one of the Metadata Enrichment Interfaces to play around with.

Why run a Rails Linked Data Fragments Instance?

What advantage does this have over directly accessing those URIs? Or why run this application at all? The following are the "wins" for this approach:

  1. You can pre-cache a resource, as in the Blazegraph setup steps below. If a requested resource isn't already cached (say you resolve "Berlin" from dbpedia and it wasn't loaded into Blazegraph), the server fetches it live and caches those triples automatically (see the sketch after this list). This equates to faster performance and greater reliability.
  2. While one could use the Blazegraph, Marmotta, etc. APIs directly, sharing code that interfaces with Linked Data then becomes a nightmare. I may be using Blazegraph, someone else Marmotta, a third person Apache Stanbol, and so on through a bunch more caching solutions. Requiring other gems to account for all of those implementations would be painful. This centralizes that negotiation layer in a single place, so neat stuff like Metadata Enrichment Interfaces or Sidecar Indexers (to keep the labels for your URIs up to date) only has to understand one API.
  3. Not all Linked Data sources are easy to parse. This centralizes any special negotiation cases rather than spreading those exceptions out elsewhere.
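To make the first point concrete, here is a minimal sketch of that cache-or-fetch behavior. This is not the gem's actual code: "cache_backend" and its "lookup" / "store" methods are hypothetical stand-ins for whatever backend you configure, while RDF::Graph comes from the rdf gem.

require 'rdf'

# Illustrative only: check the configured cache first, fall back to a live
# fetch on a miss, and cache the result for next time.
def fetch_resource(uri, cache_backend)
  cached = cache_backend.lookup(uri)   # hypothetical: e.g. a Blazegraph query
  return cached if cached

  graph = RDF::Graph.load(uri)         # live fetch from the remote source
  cache_backend.store(uri, graph)      # hypothetical: persist those triples
  graph
end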

Setting up a Rails Linked Data Fragments instance

Of the options for a caching backend, I personally recommend Blazegraph; its performance and ease of use seem much better than Marmotta's.

git clone https://github.com/ActiveTriples/linked-data-fragments.git
git fetch # Temporary step
git checkout repository_from_rdf_rb # Temporary step
bundle install

# To install a dev blazegraph / marmotta
rake ldfjetty:install
rake ldfjetty:start

# It should now be available at: http://localhost:8988. To stop, do:
rake ldfjetty:stop

# It is recommended to prepopulate your data with LCSH. To do this:
1. Download the latest subjects vocab from: http://id.loc.gov/download/ (the nt version of “LC Subject Headings (SKOS/RDF only)”)

2. Extract the above download into a directory.

3. Run the following command from within that directory: 

  curl -H 'Content-Type: text/turtle' --upload-file subjects-skos-20140306.nt -X POST "http://localhost:8988/blazegraph/sparql?context-uri=http://id.loc.gov/static/data/authoritiessubjects.nt.skos.zip"

If you used the ldfjetty scripts, you should be able to access both Blazegraph and Marmotta at http://localhost:8988. The next step is to configure which caching backend your Linked Data Fragments server will use (among other settings). There are sample configs in your config directory… pick the desired one and copy it as "ldf.yml". Make sure the settings are satisfactory (the defaults shouldn't need adjustment for testing). Once done, you can run the normal command to start a test server (albeit likely on a custom port, like below):

rails s -p 3005

Congrats! You have a running Linked Data Fragments server. Want to try some commands? The following are some examples with default settings:
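A word of caution on these: the exact endpoint and parameter names depend on your configuration, so treat the following as a sketch. It assumes a triple-pattern style lookup against my custom port of 3005 (the "subject" parameter name is my assumption from the triple pattern fragments style of interface; check the project README for the real one), and uses plain Net::HTTP so there is nothing extra to install:

require 'net/http'
require 'uri'

# Ask the server for the cached (or live-fetched) triples about a subject.
uri = URI('http://localhost:3005/')
uri.query = URI.encode_www_form(subject: 'http://dbpedia.org/resource/Berlin')
puts Net::HTTP.get(uri)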

Other Caching Options

The Linked Data Fragments interface is technically optional if you are only concerned with caching and not with sharing code via a standard API. In addition, there are caching layers we have yet to implement. For a lively discussion of these, and some notes, see the following two sources:

Metadata Enrichment Interface Live Demo

While a better test bed interface still needs to be provided, I've updated an old test server so people can try out the actual form itself. It is located via the following procedure:

While you won't be able to submit anything, you can try out the Metadata Enrichment Interface at that URL. The alternative interface from Villanova University has yet to be implemented as an option. Speeds are likely to be orders of magnitude slower, as this server space lacks the resources to host a caching layer and I am using one located elsewhere on the web.

Next Time

I'm hoping to wrap this up as I bring the code to a more stable place, so a part two should follow by Friday. Hope this was interesting or helpful!


Accessing Excel Spreadsheet Files for Batch Uploads of Digital Objects

Since it was asked how we handle reading content from Excel, this very quick post goes over what we do for that. First, you will need the following added to your Gemfile, followed by a bundle install:

gem 'roo', :git => 'https://github.com/roo-rb/roo'
gem 'roo-xls', :git => 'https://github.com/roo-rb/roo-xls.git'

Now that we have the library that lets us read the spreadsheet, we can set up a variable to hold its content. This assumes you have a "sheet_location" variable set that indicates where the file you are trying to read lives (be it uploaded or not); the content is assigned to @worksheet:

if sheet_location =~ /\.xlsx\z/
  @worksheet = Roo::Excelx.new(sheet_location)
elsif sheet_location =~ /\.xls\z/
  @worksheet = Roo::Excel.new(sheet_location)
elsif sheet_location =~ /\.csv\z/
  @worksheet = Roo::CSV.new(sheet_location)
elsif sheet_location =~ /\.ods\z/
  @worksheet = Roo::OpenOffice.new(sheet_location)
end

@worksheet.default_sheet = @worksheet.sheets.first # Sets the active sheet to the first one in the workbook

The next thing we want to do is grab the row that holds our column headers. At the BPL, this is the third row in the spreadsheet (the previous rows are for notes). As such, with the library treating "1" as its first row, we get the header row via the following:

header_row_index = 3
@header_row = @worksheet.row(header_row_index)

From here, I loop through each data row in my spreadsheet (which starts at row 5 for us) and pass that row, along with the header row, to a method that processes it. It looks something like:

data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  if row.present? && @header_row.present?
    begin
      process_a_row(@header_row, row)
    rescue Exception => e
      # Exception handling for when we encounter bad data...
    end
  end
end

Now we have each row in our spreadsheet being processed! But… how do we access each individual cell? In our case, our spreadsheet template has over 150 possible headers, and a spreadsheet containing every header becomes unwieldy. As such, each spreadsheet has some combination of the potential headers, and the order of those headers is not guaranteed. So we end up with something like the following to get the value of "title" out of our spreadsheet:

def process_a_row(header_row, row_value)
  # ...
  title = find_in_row(header_row, row_value, 'title_primary')
  # ...
end

Essentially this calls a method named "find_in_row" from within the "process_a_row" method, passing a third argument: the header identifier we are using to find that data element. The "find_in_row" method then looks like:

def find_in_row(header_row, row_value, column_identifier)
  0.upto(header_row.length - 1) do |row_pos|
    case header_row[row_pos]
    when column_identifier
      return strip_value(row_value[row_pos])
    end
  end
  nil
end

This introduces another new method: strip_value. The plan is to move this function into "Bpl_Enrich" in the near future, but essentially it returns our data elements as UTF-8 strings. The code for this looks like:

def strip_value(value)
  if value.nil?
    return nil
  else
    if value.class == Float
      value = value.to_s
      value = value.gsub(/\.0$/, '') # FIXME: Temporary. Bugged, see: https://github.com/roo-rb/roo/issues/86 , https://github.com/roo-rb/roo/issues/133 , https://github.com/zdavatz/spreadsheet/issues/41
    elsif value.class == Fixnum
      value = value.to_i.to_s # FIXME: to_i as otherwise non-existent values cause problems
    end

    # Make sure it is all UTF-8 without character entities or HTML tags, and remove any carriage returns
    return utf8Encode(value)
  end
end

def utf8Encode(value)
  # Collapse tabs and newlines to spaces, strip HTML tags, then decode HTML
  # entities and trim surrounding whitespace
  return HTMLEntities.new.decode(ActionView::Base.full_sanitizer.sanitize(value.to_s.gsub(/\r?\n?\t/, ' ').gsub(/\r?\n/, ' '))).strip
end
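As a quick illustration of what these helpers produce (assuming the htmlentities gem and Rails' sanitizer are loaded, as above):

strip_value(1910.0)                        # => "1910" (Float cruft removed)
strip_value("Caf&eacute;\r\n<b>Title</b>") # => "Café Title"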

We can now repeat the calls to "find_in_row" within the "process_a_row" method for each of our data elements and insert them into our system as needed. But what do we do for multi-valued fields? We use a double pipe ("||") as the delimiter between values in those cases. For example, if our title was allowed to be multi-valued, the above could return "title1||title2". There is then another helper function to convert that into an array, as follows:

def split_with_nils(value)
  if value.nil?
    return ""
  else
    split_value = value.split("||")
    0.upto(split_value.length - 1) do |pos|
      split_value[pos] = strip_value(split_value[pos])
    end

    return split_value
  end
end
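To see how this handles empty slots (using the helpers above):

split_with_nils("primary||||alternative") # => ["primary", "", "alternative"]
split_with_nils(nil)                      # => ""

Note that split keeps the interior empty value, which is exactly what lets paired multi-valued columns stay aligned.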

Why do I return "" in the nil case? To make processing related pairs of multi-valued columns easier when indexing, without an extra logic check in the inserts within "process_a_row". As a full example, assume we have titles of "title1||title2||title3" and title types of "primary||||alternative" (i.e. the second title lacks a type in this made-up example of bad MODS data). You would do something like:

title = find_in_row(header_row, row_value, 'title')
if title.present?
  title_list = split_with_nils(title)
  title_type_list = split_with_nils(find_in_row(header_row, row_value, 'title_type'))
  0.upto title_list.length-1 do |pos|
    @digital_object.descMetadata.insert_title(title_list[pos], title_type_list[pos])
  end
end

Of course, in the insert_title method, you would need to check for blank values so the empty title type for the second title in our list doesn't get inserted. If no "title_type" was specified at all (it being an optional field), our code would still work: indexing into the empty string returned by split_with_nils just gives nil (blank) values for the title_type at every position.
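For completeness, a minimal sketch of that blank-value guard. This is illustrative, not our actual implementation; the real insert_title lives in our descMetadata class and writes MODS XML:

def insert_title(title, title_type)
  return if title.blank?                 # nothing to insert for this slot
  title_type = nil if title_type.blank?  # e.g. the type-less second title above
  # ... append the title element (with optional type attribute) to the MODS XML ...
end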

I hope this was somewhat useful, not a completely crazy approach, and made some sense. Take care!