Reading complex Excel files

#r #rstats #tidyxl #unpivotr

The Problem

How would you import into R an Excel file like the one below?

The table has no text data in its cells but the fill colors have meanings.

The readxl package would not be of much help for a few reasons:

The data is not “tidy”. Merged cells to the left convey categorical information;
The formatting of the cells is where the information really is.

The Solution

Enter tidyxl and unpivotr packages.

The tidyxl package imports each cell on the excel file as a row in a data frame, with columns describing its position, content and formatting.

The unpivotr package makes use of the data frame generated by the tidyxl package and makes possible to tidy it up.

First, I converted the original PDF file to an Excel using a popular online service. The converted Excel file can be downloaded here.

With the aforementioned packages, reading the table above become as simple as:

library(dplyr)
library(purrr)
library(tidyxl)
library(unpivotr)
library(here)

filename <- here("raw-data/produtos_epoca-converted.xlsx")

# The workbook contains several sheets. We first import all tables to a
# list
tables_names <-
  c("Table 1", "Table 3", "Table 4", "Table 5", "Table 6", "Table 7")
tables_to_read <- map(tables_names, xlsx_cells, path = filename)

# We create a function to read each sheet
import_table <- function(df) {
  # Each fill color represent a different information. First, we create a
  # pallette of the fill colors in the sheet that can be indexed by the
  # `local_format_id` of a given cell to get the fill color of that cell
  fill_color_palette <-
    xlsx_formats(filename, "Table 1")$local$fill$patternFill$fgColor$rgb

  # Since the table has different headings, we have to filter out these
  # headings in order to have only the cells with data. Then, we create a
  # new column for the fill colors by looking up the `local_format_id` of
  # each cell in the pallette. Following, we create another column where
  # we codify this information.
  availability <-
    df %>%
    filter(row >= 2, col >= 3) %>%   # filter out headers
    mutate(fill_color = fill_color_palette[local_format_id]) %>%
    mutate(
      availability = case_when(
        fill_color == "FFFF7F7F" ~ "Low",
        fill_color == "FFFFFFCC" ~ "Medium",
        fill_color == "FFCCFFCC" ~ "High"
      )
    ) %>%
    select(availability)

  # We now transform all the headings so we can have a tidy data
  df %>%
    behead("left-up", category) %>%
    behead("left", produce) %>%
    behead("up", month) %>%
    bind_cols(availability) %>%
    select(category, produce, month, availability)
}

# Let's apply the function to our list of sheets
availability_ceagesp <- map_dfr(tables_to_read, import_table)

Done!

Result

So, we imported and Excel file that contained un-tidy data and with formatting that conveys meaning.

Now the data is already prepared for further processing and to be explored as needed:

availability_ceagesp %>%
  filter(month == "Set",
         availability == "High")

## # A tibble: 89 x 4
##    category produce                 month availability
##    <chr>    <chr>                   <chr> <chr>       
##  1 Frutas   Abacate Breda/Margarida Set   High        
##  2 Frutas   Abiu                    Set   High        
##  3 Frutas   Acerola                 Set   High        
##  4 Frutas   Banana Maçã             Set   High        
##  5 Frutas   Banana Prata            Set   High        
##  6 Frutas   Caju                    Set   High        
##  7 Frutas   Graviola                Set   High        
##  8 Frutas   Jabuticaba              Set   High        
##  9 Frutas   Kiwi Estrangeiro        Set   High        
## 10 Frutas   Laranja Lima            Set   High        
## # ... with 79 more rows