Since we were asked how we handle reading content from Excel, this is a very quick blog post that goes over what we do. First, you will need the following added to your Gemfile, followed by a bundle install:
gem 'roo', :git => 'https://github.com/roo-rb/roo'
gem 'roo-xls', :git => 'https://github.com/roo-rb/roo-xls.git'
Now that we have the library that will allow us to read the spreadsheet, we can go ahead and set up a variable to hold the content of the spreadsheet. This assumes you have a “sheet_location” variable set that points to where the file you are trying to read lives (be it uploaded or not); the content is assigned to @worksheet:
if sheet_location =~ /\.xlsx$/
  @worksheet = Roo::Excelx.new(sheet_location)
elsif sheet_location =~ /\.xls$/
  @worksheet = Roo::Excel.new(sheet_location)
elsif sheet_location =~ /\.csv$/
  @worksheet = Roo::CSV.new(sheet_location)
elsif sheet_location =~ /\.ods$/
  @worksheet = Roo::OpenOffice.new(sheet_location)
end

@worksheet.default_sheet = @worksheet.sheets.first # Sets to the first sheet in the workbook
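As an aside, depending on the version of roo you end up bundling from git, there is also a convenience method that picks the right reader class for you based on the file extension. A minimal sketch, assuming Roo::Spreadsheet.open is available in your bundled version:

require 'roo'
require 'roo-xls'

# Sketch only: Roo::Spreadsheet.open inspects the file extension and returns the
# matching reader (Excelx, Excel, CSV, OpenOffice), so the if/elsif chain above
# can collapse to a single call if your roo version supports it.
@worksheet = Roo::Spreadsheet.open(sheet_location)
@worksheet.default_sheet = @worksheet.sheets.first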
The next thing we want to do is grab the header row that has our column headers. At the BPL, this is the third row in the spreadsheet (previous rows are for notes). As such, with the library using “1” as its first row, we would get the header row via the following:
header_row_index = 3
@header_row = @worksheet.row(header_row_index)
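For illustration only (these header names are hypothetical, not our actual template), the header row comes back as a plain array of cell values, which you can sanity-check before going any further:

@header_row = @worksheet.row(3)
# => ["title_primary", "title_type", "creator", ...] (hypothetical values)

raise "Could not find a header row" if @header_row.nil? || @header_row.compact.empty?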
From here, I loop through each data row in my spreadsheet (which starts at index 5 for us) and pass that row value along with the header row to a method to process that row. It looks something like:
data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  if row.present? && @header_row.present?
    begin
      process_a_row(@header_row, row)
    rescue Exception => e
      # Exception handling for when we encounter bad data...
    end
  end
end
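If you want more visibility into bad rows, one option is to rescue StandardError rather than Exception and collect the failing row numbers so the person who made the spreadsheet can be told exactly which lines to fix. This is just a sketch under the same assumptions as the loop above, not part of our actual code:

# Sketch only: same loop as above, but track which rows failed and why.
failed_rows = []

data_start_row = 5
data_start_row.upto(@worksheet.last_row) do |index|
  row = @worksheet.row(index)
  next unless row.present? && @header_row.present?

  begin
    process_a_row(@header_row, row)
  rescue StandardError => e
    failed_rows << { row: index, error: e.message }
  end
end
# failed_rows now lists each problem row and the reason it failed.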
Now we have each row in our spreadsheet being processed! But… how do we access each individual cell? In our case, our spreadsheet template has over 150 possible headers, and having a spreadsheet with every header becomes unwieldy. As such, each spreadsheet uses some combination of the potential headers, and the order of those headers in the spreadsheet is not guaranteed. So we end up with something like the following to get the value of “title” out of our spreadsheet:
def process_a_row(header_row, row_value)
  # ...
  title = find_in_row(header_row, row_value, 'title_primary')
  # ...
end
Essentially this calls a method called “find_in_row” from within the “process_a_row” method, passing a third argument: the header identifier we are using to find that data element. The “find_in_row” method then looks like:
def find_in_row(header_row, row_value, column_identifier)
  0.upto(header_row.length - 1) do |row_pos|
    case header_row[row_pos]
    when column_identifier
      return strip_value(row_value[row_pos])
    end
  end
  return nil
end
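Since the header row is just an array, an equivalent way to write this (shown only as an alternative sketch, not what we actually run) is to lean on Array#index:

# Alternative sketch: Array#index returns the position of the matching header,
# or nil when that column is not present in the spreadsheet.
def find_in_row(header_row, row_value, column_identifier)
  row_pos = header_row.index(column_identifier)
  row_pos ? strip_value(row_value[row_pos]) : nil
end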
This has another new method: strip_value. The plan is to move this function into “Bpl_Enrich” in the near future, but essentially its job is to return our data elements as clean UTF-8 strings. The code (along with the utf8Encode helper it relies on) looks like:
def strip_value(value)
  if value.nil?
    return nil
  else
    if value.class == Float
      value = value.to_f.to_s
      value = value.gsub(/\.0$/, '') # FIXME: Temporary. Bugged as see: https://github.com/roo-rb/roo/issues/86 , https://github.com/roo-rb/roo/issues/133 , https://github.com/zdavatz/spreadsheet/issues/41
    elsif value.class == Fixnum
      value = value.to_i.to_s # FIXME: to_i as otherwise non-existent values cause problems
    end

    # Make sure it is all UTF-8, free of other character encodings or HTML tags, and remove any carriage returns
    return utf8Encode(value)
  end
end

def utf8Encode(value)
  return HTMLEntities.new.decode(ActionView::Base.full_sanitizer.sanitize(value.to_s.gsub(/\r?\n?\t/, ' ').gsub(/\r?\n/, ' '))).strip
end
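To make the intent concrete, here is roughly what strip_value hands back for a few representative cell values (illustrative expectations only, given the branches above):

strip_value(nil)                    # => nil
strip_value(2015.0)                 # => "2015"  (Float from roo, trailing ".0" removed)
strip_value(42)                     # => "42"    (via the Fixnum branch)
strip_value("A <b>Title</b>\r\n")   # => "A Title" (tags stripped, line breaks collapsed)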
We can now repeat the calls to “find_in_row” within the “process_a_row” method for each of our data elements and insert them into our system as needed. But what do we do for multi-valued fields? We use a double pipe (“||”) as a delimiter to separate values in those cases. For example, if our title was allowed to be multi-valued, the above could return “title1||title2”. There is then another helper function to convert that into an array, as follows:
def split_with_nils(value)
  if value.nil?
    return ""
  else
    split_value = value.split("||")
    0.upto(split_value.length - 1) do |pos|
      split_value[pos] = strip_value(split_value[pos])
    end
    return split_value
  end
end
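For illustration (hypothetical input values), the helper behaves roughly like this:

split_with_nils(nil)                        # => ""
split_with_nils("title1||title2||title3")   # => ["title1", "title2", "title3"]
split_with_nils("primary||||alternative")   # => ["primary", "", "alternative"]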
Why do I return “” in the nil case? It makes processing easier for related pairs of multi-valued columns when indexing, without needing an extra logic check in the inserts of “process_a_row”. As a full example, assume we have titles of “title1||title2||title3” and title types of “primary||||alternative” (i.e. the second title lacks a type in this made-up example of bad MODS data). You would do something like:
title = find_in_row(header_row, row_value, 'title')
if title.present?
  title_list = split_with_nils(title)
  title_type_list = split_with_nils(find_in_row(header_row, row_value, 'title_type'))
  0.upto(title_list.length - 1) do |pos|
    @digital_object.descMetadata.insert_title(title_list[pos], title_type_list[pos])
  end
end
Of course, in the insert_title method, you would need to check for blank values so that the empty title type for the second title in our list does not get inserted. If no “title_type” was specified at all (it is an optional field), the function would still work: indexing into the empty string (the “” returned by split_with_nils) just gives us blank values for the title type as we step through the list.
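As a rough sketch of that blank check (treat these names as placeholders; the real insert_title lives in our MODS descMetadata datastream and takes care of building the actual XML), the guard might look like:

# Hypothetical sketch of the blank-value guard described above.
def insert_title(title_value, title_type)
  return if title_value.blank?   # nothing to insert at all

  attributes = {}
  attributes[:type] = title_type if title_type.present?  # skip the empty "" type

  # ...build and append the MODS title node using title_value and attributes...
end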
I hope this was somewhat useful, not a completely crazy approach, and made some sense. Take care!