As it was asked how we handle reading content from Excel, this is a very quick blog post that goes over what we do for that. First you will need the following added to your Gemfile and then bundle install:
gem 'roo', :git => 'https://github.com/roo-rb/roo'
gem 'roo-xls', :git => 'https://github.com/roo-rb/roo-xls.git'
Now that we have the library that will allow us to read the spreadsheet, we can go ahead and setup a variable to hold the content of the spreadsheet. This assumes you have a “sheet_location” variable set that indicates where the file you are trying to read lives (be it uploaded or not) and assigns the content to @worksheet:
if sheet_location =~ /\b.xlsx$\b/
@worksheet = Roo::Excelx.new(sheet_location)
elsif sheet_location =~ /\b.xls$\b/
@worksheet = Roo::Excel.new(sheet_location)
elsif sheet_location =~ /\b.csv\b/
@worksheet = Roo::CSV.new(sheet_location)
elsif sheet_location =~ /\b.ods\b/
@worksheet = Roo::OpenOffice.new(sheet_location)
end
@worksheet.default_sheet = @worksheet.sheets.first #Sets to the first sheet in the workbook
The next thing we want to do is grab the header row that has our column headers. At the BPL, this is the third row in the spreadsheet (previous rows are for notes). As such, with the library using “1” as its first row, we would get the header row via the following:
header_row_index = 3
@header_row = @worksheet.row(header_row_index)
From here, I loop through each data row in my spreadsheet (which starts at index 5 for us) and pass that row value along with the header row to a method to process that row. It looks something like:
data_start_row = 5
data_start_row_index.upto @worksheet.last_row do |index| row = @worksheet.row(index)
if row.present? && @header_row.present?
begin
process_a_row(@header_row, row)
rescue Exception => e
#Exception handling for when encounter bad data...
end
end
end
end
Now we have each row in our spreadsheet being processed! But… how do we access each individual cell? In our case, our spreadsheet template has over 150 possible headers and having a spreadsheet with every header becomes unwieldy. As such, each one has some combination of potential headers and the order of those headers in the spreadsheet is not guaranteed. So we end up with something like the following to get the value of “title” out of our spreadsheet:
def process_a_row(header_row, row_value)
# ...
title = find_in_row(header_row, row_value, 'title_primary')
# ...
end
Essentially this is calling a method called “find_in_row” from within the “process_a_row” block and adds a third argument of the row header identifier we are using to find that data element. The “find_in_row” method then looks like:
def find_in_row(header_row, row_value, column_identifier)
0.upto header_row.length do |row_pos|
case header_row[row_pos]
when column_identifier
return strip_value(row_value[row_pos])
end
end
return nil
end
This has another new method: strip_value. The plan is to move this function into “Bpl_Enrich” in the near future but essentially this is to return our data elements as UTF-8 strings. The code for this looks like:
def strip_value(value)
if(value == nil)
return nil
else
if value.class == Float
value = value.to_f.to_s
value = value.gsub(/.0$/, '') #FIXME: Temporary. Bugged as see: https://github.com/roo-rb/roo/issues/86 , https://github.com/roo-rb/roo/issues/133 , https://github.com/zdavatz/spreadsheet/issues/41
elsif value.class == Fixnum
value = value.to_i.to_s #FIXME: to_i as otherwise non-existant values cause problems
end
# Make sure it is all UTF-8 and not character encodings or HTML tags and remove any cariage returns
return utf8Encode(value)
end
end
def utf8Encode(value)
return HTMLEntities.new.decode(ActionView::Base.full_sanitizer.sanitize(value.to_s.gsub(/\r?\n?\t/, ' ').gsub(/\r?\n/, ' '))).strip
end
We can now repeat the calls to “find_in”row” within the “process_a_row” method for each of our data elements and insert them into our system as needed. But what do we do for multi-valued fields? We use a deliminator of a double pipe (“||”) to delineate values in those cases. For example, if our title was allowed to be multi-valued, we could have the above return “title1||title2″. There is then another helper function to convert that into an array as the following:
def split_with_nils(value)
if(value == nil)
return ""
else
split_value = value.split("||")
0.upto split_value.length-1 do |pos|
split_value[pos] = strip_value(split_value[pos])
end
return split_value
end
end
Why do I return “” on the nil case? To make processing easier for related pairs of multivalued column fields when doing indexing without an extra logic check in the inserts of “process_a_row”. As a full example, assume we have titles of “title1||title2||title3″ and title types of “primary||||alternative” (ie. the second title lacks a type in this made up example of bad MODS data). You would do something like:
title = find_in_row(header_row, row_value, 'title')
if title.present?
title_list = split_with_nils(title)
title_type_list = split_with_nils(find_in_row(header_row, row_value, 'title_type'))
0.upto title_list.length-1 do |pos|
@digital_object.descMetadata.insert_title(title_list[pos],title_type_list[pos])
end
end
Of course, in the insert title method, you would need to check for blank values to not insert the empty title type value for the second title in our list. If there was no “title_type” specified at all as that was omitted as an optional field, our function would still work as indexing an empty string (the returned “” from split_with_nils) would just give us all blank values for the title_type as we index through it.
I hope this was somewhat useful, not a completely crazy approach, and made some sense. Take care!