The Problem
How would you import into R an Excel file like the one below?
The table has no text data in its cells but the fill colors have meanings.
The readxl
package would not be of much help for a few reasons:
- The data is not “tidy”. Merged cells to the left convey categorical information;
- The formatting of the cells is where the information really is.
The Solution
Enter tidyxl
and unpivotr
packages.
The tidyxl
package imports each cell on the excel file as a row in a data frame, with columns describing its position, content and formatting.
The unpivotr
package makes use of the data frame generated by the tidyxl package and makes possible to tidy it up.
First, I converted the original PDF file to an Excel using a popular online service. The converted Excel file can be downloaded here.
With the aforementioned packages, reading the table above become as simple as:
library(dplyr)
library(purrr)
library(tidyxl)
library(unpivotr)
library(here)
filename <- here("raw-data/produtos_epoca-converted.xlsx")
# The workbook contains several sheets. We first import all tables to a
# list
tables_names <-
c("Table 1", "Table 3", "Table 4", "Table 5", "Table 6", "Table 7")
tables_to_read <- map(tables_names, xlsx_cells, path = filename)
# We create a function to read each sheet
import_table <- function(df) {
# Each fill color represent a different information. First, we create a
# pallette of the fill colors in the sheet that can be indexed by the
# `local_format_id` of a given cell to get the fill color of that cell
fill_color_palette <-
xlsx_formats(filename, "Table 1")$local$fill$patternFill$fgColor$rgb
# Since the table has different headings, we have to filter out these
# headings in order to have only the cells with data. Then, we create a
# new column for the fill colors by looking up the `local_format_id` of
# each cell in the pallette. Following, we create another column where
# we codify this information.
availability <-
df %>%
filter(row >= 2, col >= 3) %>% # filter out headers
mutate(fill_color = fill_color_palette[local_format_id]) %>%
mutate(
availability = case_when(
fill_color == "FFFF7F7F" ~ "Low",
fill_color == "FFFFFFCC" ~ "Medium",
fill_color == "FFCCFFCC" ~ "High"
)
) %>%
select(availability)
# We now transform all the headings so we can have a tidy data
df %>%
behead("left-up", category) %>%
behead("left", produce) %>%
behead("up", month) %>%
bind_cols(availability) %>%
select(category, produce, month, availability)
}
# Let's apply the function to our list of sheets
availability_ceagesp <- map_dfr(tables_to_read, import_table)
Done!
Result
So, we imported and Excel file that contained un-tidy data and with formatting that conveys meaning.
Now the data is already prepared for further processing and to be explored as needed:
availability_ceagesp %>%
filter(month == "Set",
availability == "High")
## # A tibble: 89 x 4
## category produce month availability
## <chr> <chr> <chr> <chr>
## 1 Frutas Abacate Breda/Margarida Set High
## 2 Frutas Abiu Set High
## 3 Frutas Acerola Set High
## 4 Frutas Banana Maçã Set High
## 5 Frutas Banana Prata Set High
## 6 Frutas Caju Set High
## 7 Frutas Graviola Set High
## 8 Frutas Jabuticaba Set High
## 9 Frutas Kiwi Estrangeiro Set High
## 10 Frutas Laranja Lima Set High
## # ... with 79 more rows
Top comments (1)
My God, what a work! Why do people do this to us, right? Learned a new package today! Thanks!