Since we were asked how we handle reading content from Excel, this is a very quick blog post that goes over what we do. First, you will need the following added to your Gemfile, followed by a bundle install:
gem 'roo', :git => 'https://github.com/roo-rb/roo'
gem 'roo-xls', :git => 'https://github.com/roo-rb/roo-xls.git'
Now that we have the library that will allow us to read the spreadsheet, we can go ahead and set up a variable to hold the content of the spreadsheet. This assumes you have a “sheet_location” variable set that points to where the file you are trying to read lives (be it uploaded or not); the content is assigned to @worksheet:
if sheet_location =~ /\.xlsx$/
  @worksheet = Roo::Excelx.new(sheet_location)
elsif sheet_location =~ /\.xls$/
  @worksheet = Roo::Excel.new(sheet_location)
elsif sheet_location =~ /\.csv$/
  @worksheet = Roo::CSV.new(sheet_location)
elsif sheet_location =~ /\.ods$/
  @worksheet = Roo::OpenOffice.new(sheet_location)
end

@worksheet.default_sheet = @worksheet.sheets.first # Sets to the first sheet in the workbook
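As an aside, depending on the version of roo you end up bundling from git, there is also a convenience method that picks the right reader class for you based on the file extension. A minimal sketch, assuming Roo::Spreadsheet.open is available in your bundled version:

require 'roo'
require 'roo-xls'

# Sketch only: Roo::Spreadsheet.open inspects the file extension and returns the
# matching reader (Excelx, Excel, CSV, OpenOffice), so the if/elsif chain above
# can collapse to a single call if your roo version supports it.
@worksheet = Roo::Spreadsheet.open(sheet_location)
@worksheet.default_sheet = @worksheet.sheets.first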
The next thing we want to do is grab the header row that has our column headers. At the BPL, this is the third row in the spreadsheet (previous rows are for notes). As such, with the library using “1” as its first row, we would get the header row via the following:
header_row_index = 3
@header_row = @worksheet.row(header_row_index)
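For illustration only (these header names are hypothetical, not our actual template), the header row comes back as a plain array of cell values, which you can sanity-check before going any further:

@header_row = @worksheet.row(3)
# => ["title_primary", "title_type", "creator", ...] (hypothetical values)

raise "Could not find a header row" if @header_row.nil? || @header_row.compact.empty?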
From here, I loop through each data row in my spreadsheet (which starts at index 5 for us) and pass that row value along with the header row to a method to process that row. It looks something like:
data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  if row.present? && @header_row.present?
    begin
      process_a_row(@header_row, row)
    rescue Exception => e
      # Exception handling for when we encounter bad data...
    end
  end
end
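If you want more visibility into bad rows, one option is to rescue StandardError rather than Exception and collect the failing row numbers so the person who made the spreadsheet can be told exactly which lines to fix. This is just a sketch under the same assumptions as the loop above, not part of our actual code:

# Sketch only: same loop as above, but track which rows failed and why.
failed_rows = []

data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  next unless row.present? && @header_row.present?

  begin
    process_a_row(@header_row, row)
  rescue StandardError => e
    failed_rows << { row: index, error: e.message }
  end
end
# failed_rows now lists each problem row and the reason it failed.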
Now we have each row in our spreadsheet being processed! But… how do we access each individual cell? In our case, our spreadsheet template has over 150 possible headers, and having a spreadsheet with every header becomes unwieldy. As such, each spreadsheet uses some combination of the potential headers, and the order of those headers in the spreadsheet is not guaranteed. So we end up with something like the following to get the value of “title” out of our spreadsheet:
def process_a_row(header_row, row_value)
  # ...
  title = find_in_row(header_row, row_value, 'title_primary')
  # ...
end
Essentially this calls a method called “find_in_row” from within the “process_a_row” method, passing a third argument: the header identifier we are using to find that data element. The “find_in_row” method then looks like:
def find_in_row(header_row, row_value, column_identifier)
  0.upto(header_row.length - 1) do |row_pos|
    case header_row[row_pos]
    when column_identifier
      return strip_value(row_value[row_pos])
    end
  end
  return nil
end
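Since the header row is just an array, an equivalent way to write this (shown only as an alternative sketch, not what we actually run) is to lean on Array#index:

# Alternative sketch: Array#index returns the position of the matching header,
# or nil when that column is not present in the spreadsheet.
def find_in_row(header_row, row_value, column_identifier)
  row_pos = header_row.index(column_identifier)
  row_pos ? strip_value(row_value[row_pos]) : nil
end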
This has another new method: strip_value. The plan is to move this function into “Bpl_Enrich” in the near future, but essentially its job is to return our data elements as clean UTF-8 strings. The code (along with the utf8Encode helper it relies on) looks like:
def strip_value(value)
  if value.nil?
    return nil
  else
    if value.class == Float
      value = value.to_f.to_s
      value = value.gsub(/\.0$/, '') # FIXME: Temporary. Bugged as see: https://github.com/roo-rb/roo/issues/86 , https://github.com/roo-rb/roo/issues/133 , https://github.com/zdavatz/spreadsheet/issues/41
    elsif value.class == Fixnum
      value = value.to_i.to_s # FIXME: to_i as otherwise non-existent values cause problems
    end

    # Make sure it is all UTF-8, free of other character encodings or HTML tags, and remove any carriage returns
    return utf8Encode(value)
  end
end

def utf8Encode(value)
  return HTMLEntities.new.decode(ActionView::Base.full_sanitizer.sanitize(value.to_s.gsub(/\r?\n?\t/, ' ').gsub(/\r?\n/, ' '))).strip
end
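To make the intent concrete, here is roughly what strip_value hands back for a few representative cell values (illustrative expectations only, given the branches above):

strip_value(nil)                    # => nil
strip_value(2015.0)                 # => "2015"  (Float from roo, trailing ".0" removed)
strip_value(42)                     # => "42"    (via the Fixnum branch)
strip_value("A <b>Title</b>\r\n")   # => "A Title" (tags stripped, line breaks collapsed)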
We can now repeat the calls to “find_in_row” within the “process_a_row” method for each of our data elements and insert them into our system as needed. But what do we do for multi-valued fields? We use a double pipe (“||”) as a delimiter to separate values in those cases. For example, if our title was allowed to be multi-valued, the above could return “title1||title2”. There is then another helper function to convert that into an array, as follows:
def split_with_nils(value)
  if value.nil?
    return ""
  else
    split_value = value.split("||")
    0.upto(split_value.length - 1) do |pos|
      split_value[pos] = strip_value(split_value[pos])
    end
    return split_value
  end
end
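For illustration (hypothetical input values), the helper behaves roughly like this:

split_with_nils(nil)                        # => ""
split_with_nils("title1||title2||title3")   # => ["title1", "title2", "title3"]
split_with_nils("primary||||alternative")   # => ["primary", "", "alternative"]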
Why do I return “” in the nil case? It makes processing easier for related pairs of multi-valued columns when indexing, without needing an extra logic check in the inserts of “process_a_row”. As a full example, assume we have titles of “title1||title2||title3” and title types of “primary||||alternative” (i.e. the second title lacks a type in this made-up example of bad MODS data). You would do something like:
title = find_in_row(header_row, row_value, 'title')
if title.present?
  title_list = split_with_nils(title)
  title_type_list = split_with_nils(find_in_row(header_row, row_value, 'title_type'))
  0.upto(title_list.length - 1) do |pos|
    @digital_object.descMetadata.insert_title(title_list[pos], title_type_list[pos])
  end
end
Of course, in the insert_title method, you would need to check for blank values so that the empty title type for the second title in our list does not get inserted. If no “title_type” was specified at all (it is an optional field), the function would still work: indexing into the empty string (the “” returned by split_with_nils) just gives us blank values for the title type as we step through the list.
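As a rough sketch of that blank check (treat these names as placeholders; the real insert_title lives in our MODS descMetadata datastream and takes care of building the actual XML), the guard might look like:

# Hypothetical sketch of the blank-value guard described above.
def insert_title(title_value, title_type)
  return if title_value.blank?   # nothing to insert at all

  attributes = {}
  attributes[:type] = title_type if title_type.present?  # skip the empty "" type

  # ...build and append the MODS title node using title_value and attributes...
end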
I hope this was somewhat useful, not a completely crazy approach, and made some sense. Take care!