Dipti M
Importing Data into R: A Practical Reference

Data import is a surprisingly large part of any data project. Below is a compact but practical reference of common file types, recommended packages/functions, example code, and real-world tips (encoding, memory, speed, cloud, DBs, troubleshooting).

Quick workspace prep (basic, but worth repeating)

Check working dir

getwd()

Set working dir (if you must)

setwd("/path/to/project")

Start with a clean environment (use with care)

rm(list = ls())

Helpful: show files

list.files()

Base R vs modern tidyverse / high-performance packages
Base R (read.table, read.csv, read.delim) works everywhere and is flexible.
readr (read_csv, read_delim) — faster, consistent parsing, tidyverse-friendly.
data.table::fread — extremely fast for large CSV-like files.
readxl — best for Excel (no external dependencies).
haven — best for SAS/SPSS/Stata (via ReadStat).
jsonlite — modern JSON parsing.
DBI + specific DB backends — standard interface for relational DBs.
arrow — for Feather/Parquet and interop with Python/Spark; excellent for large data and columnar formats.

CSV / delimited text
Recommended
Small-medium files: readr::read_csv()
Very large: data.table::fread() or vroom::vroom()
Examples

readr - tidy defaults, prints parsing problems

install.packages("readr")
library(readr)
df <- read_csv("data/sales.csv") # comma
df2 <- read_delim("data/data.txt", "\t") # tab separated

data.table: blazing fast, automatic type detection

install.packages("data.table")
library(data.table)
dt <- fread("data/sales.csv")

vroom: loads lazily and is fast for many files

install.packages("vroom")
library(vroom)
df_v <- vroom("data/sales.csv")

Useful arguments
locale = locale(encoding = "UTF-8") — handle encodings
col_types = cols(...) — force column types (avoid mis-parsing)
na = c("", "NA", "NULL") — specify missing tokens
n_max — read only first N rows (quick inspect)
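
Putting those arguments together, here is a minimal sketch of a defensive read_csv() call; the file path and column names are assumptions for illustration:

library(readr)

# inspect only the first 1,000 rows, with explicit types, NA tokens, and encoding
df <- read_csv(
  "data/sales.csv",                      # hypothetical file
  col_types = cols(
    order_id = col_integer(),            # assumed column names
    amount   = col_double(),
    region   = col_character()
  ),
  na     = c("", "NA", "NULL"),
  locale = locale(encoding = "UTF-8"),
  n_max  = 1000
)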

Excel (XLS / XLSX)
Recommended
readxl::read_excel() — no Java dependency, reliable
tidyxl, openxlsx — if you need formulas or writing
Example
install.packages("readxl")
library(readxl)

read first sheet

df <- read_excel("data/book.xlsx")

read a specific sheet by name or index

df_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")
df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)

read a range

df_range <- read_excel("data/book.xlsx", range = "A1:F100")

JSON
Recommended
jsonlite::fromJSON() — robust, converts arrays to data frames where appropriate
install.packages("jsonlite")
library(jsonlite)

from local file

j <- fromJSON("data/doc.json", flatten = TRUE)

from URL

j2 <- fromJSON("https://api.example.com/data")

Tip: If JSON contains nested lists, use flatten=TRUE or manually tidyr::unnest() to tidy.
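
A short sketch of both approaches; the list-column name "items" is a hypothetical example:

library(jsonlite)
library(tidyr)

# flatten = TRUE turns nested objects into prefixed columns at parse time
j <- fromJSON("data/doc.json", flatten = TRUE)

# alternatively, keep the nesting and unnest a list-column afterwards
# j_tidy <- tidyr::unnest(j, items)   # "items" is an assumed column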

XML / HTML tables / Web scraping (static HTML)
Recommended
xml2 + rvest for HTML scraping / table extraction
the older XML package for more advanced XML parsing (xml2 is recommended for most work)
install.packages(c("xml2","rvest"))
library(xml2); library(rvest)

page <- read_html("https://example.com/page-with-tables")
tables <- html_table(html_nodes(page, "table")) # list of tibbles
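
For raw XML files rather than HTML tables, xml2 on its own covers most needs. A minimal sketch, assuming a hypothetical file containing <record> nodes:

library(xml2)

doc     <- read_xml("data/records.xml")              # hypothetical file
records <- xml_find_all(doc, "//record")             # XPath query for all record nodes
ids     <- xml_text(xml_find_all(doc, "//record/id"))
types   <- xml_attr(records, "type")                 # read an attribute from each node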

Excel-like & legacy formats (ODS, Google Sheets)
Google Sheets: googlesheets4 (supports OAuth & public sheets)
ODS: readODS package (example below)

Google Sheets (follow the OAuth flow when prompted)

install.packages("googlesheets4")
library(googlesheets4)
sheet <- read_sheet("https://docs.google.com/spreadsheets/...")
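
For the ODS files mentioned above, readODS works much like readxl; a sketch assuming a hypothetical file:

install.packages("readODS")
library(readODS)

df_ods <- read_ods("data/book.ods", sheet = 1)   # hypothetical file, first sheet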

SAS / SPSS / Stata
Recommended
haven (fast, preserves labelled variables)
foreign is older; haven is preferred now
install.packages("haven")
library(haven)

df_sas <- read_sas("data/data.sas7bdat")
df_spss <- read_sav("data/data.sav")
df_stata <- read_dta("data/data.dta")

Tip: haven keeps value labels (labelled class). Use labelled::to_factor() to convert labeled columns to factors.
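
A minimal sketch of that conversion, reusing the df_spss object from the example above (the column name is hypothetical):

library(labelled)

# convert every labelled column in the imported data frame to a factor
df_factors <- labelled::to_factor(df_spss)

# haven::as_factor() is an alternative for single columns or whole data frames
# df_spss$region <- haven::as_factor(df_spss$region)   # assumed column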

MATLAB / Octave
R.matlab::readMat() for .mat files
foreign::read.octave() for Octave text data
install.packages("R.matlab")
library(R.matlab)
m <- readMat("data/matrix.mat")

Parquet, Feather, Arrow — modern columnar formats
Use arrow package to read/write Parquet and Feather; very efficient for large datasets and cross-language sharing.
install.packages("arrow")
library(arrow)

read parquet

tbl <- read_parquet("data/data.parquet")

write parquet

write_parquet(df, "out/data.parquet")

Benefit: Arrow supports out-of-memory workflows and zero-copy reads, and is ideal for large-scale pipelines.
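
The out-of-memory workflow goes through arrow::open_dataset(), which scans files lazily and pushes dplyr filters down to the scan. A sketch assuming a hypothetical directory of Parquet files and assumed column names:

library(arrow)
library(dplyr)

# point at a folder of Parquet files without loading them into memory
ds <- open_dataset("data/parquet_dir")   # hypothetical path

# filtering and column selection happen before any data is pulled into R
recent <- ds %>%
  filter(year == 2023) %>%               # assumed column
  select(order_id, amount) %>%           # assumed columns
  collect()                              # materialise only the filtered result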

ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)
Recommended approach
DBI + driver package (RPostgres, RMySQL, odbc) — standard interface
pool for connection pooling in apps
Example: Postgres via DBI
install.packages(c("DBI","RPostgres"))
library(DBI)
con <- dbConnect(RPostgres::Postgres(),
dbname = "mydb", host = "db.example.com",
port = 5432, user = "me", password = "pw")

run a query

df <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")

fetch a big table in chunks (loop until the result set is exhausted)

res <- dbSendQuery(con, "SELECT * FROM big_table")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)
  # process each 10,000-row chunk here
}
dbClearResult(res)

dbDisconnect(con)

ODBC (Windows / DSN)
install.packages("odbc")
library(DBI)
con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")
dbListTables(con)
dbDisconnect(con)

Tip: Avoid SELECT * on huge tables — use column selection and SQL-side filtering.
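
With dbplyr installed, dplyr verbs are translated to SQL, so filtering and column selection happen on the database side. A sketch reusing a DBI connection like the con object above; the table and column names are assumptions:

library(DBI)
library(dplyr)   # dbplyr must also be installed for SQL translation

sales <- tbl(con, "sales")                        # lazy reference, no data pulled yet

df <- sales %>%
  filter(order_date >= "2023-01-01") %>%          # assumed columns
  select(order_id, amount, region) %>%
  collect()                                       # runs the query and fetches the result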

Cloud storage (S3, GCS, Azure)
S3: aws.s3, arrow can read directly from S3 URIs
Google Cloud Storage: googleCloudStorageR, or use arrow's GCS (gs://) support
Azure: AzureStor
Example with aws.s3:
install.packages("aws.s3")
library(aws.s3)

must configure AWS credentials (env vars or ~/.aws/credentials)

s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")

arrow::read_parquet("s3://bucket/path/file.parquet") also works when configured correctly.

Reading in chunks / streaming large files
If data is too big to fit in memory:
Use database imports (load into DB and query subsets)
Use data.table::fread() with select= to read only columns
Use vroom + dplyr for delayed reads
For complex chunking: readr::read_lines_chunked() or process via LaF or iotools packages
Example chunked processing with LaF:
install.packages("LaF")
library(LaF)
laf <- laf_open_csv("big.csv",
                    column_types = c("integer", "double", "string"))
block <- laf[1:100000, ] # read only the first 100,000 rows
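
readr's chunked readers follow the same idea with a callback applied to each chunk. A sketch that sums a hypothetical amount column without ever holding the full file in memory:

library(readr)

totals <- read_csv_chunked(
  "big.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(chunk_sum = sum(chunk$amount, na.rm = TRUE))   # assumed column
  }),
  chunk_size = 100000          # rows per chunk
)
sum(totals$chunk_sum)          # combine the per-chunk results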

Parsing dates, times, and types
Parsing types properly avoids nasty surprises later.
Use col_types in readr or fread's colClasses
Use lubridate for robust date parsing
install.packages("lubridate")
library(lubridate)

convert common formats

df$date <- ymd(df$date_string) # "2023-07-01"
df$datetime <- ymd_hms(df$ts) # "2023-07-01 08:45:00"
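
Types can also be fixed at read time rather than afterwards. A sketch using readr's col_types; the file path and column names are assumptions:

library(readr)

df <- read_csv(
  "data/sales.csv",                                        # hypothetical file
  col_types = cols(
    order_date = col_date(format = "%Y-%m-%d"),            # parsed directly as Date
    ts         = col_datetime(format = "%Y-%m-%d %H:%M:%S"),
    amount     = col_double(),
    .default   = col_character()                           # everything else as character
  )
)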

Encoding & locale issues
If you see weird characters:

readr

df <- read_csv("file.csv", locale = locale(encoding = "latin1"))

base read.table

df <- read.table("file.txt", fileEncoding = "UTF-8")

Check Encoding() and iconv() to convert.
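
A quick sketch of checking and converting the encoding of an already-imported character column (the column name is an assumption):

# inspect the declared encodings of a character vector
Encoding(df$city)                                   # assumed column

# convert from Latin-1 to UTF-8 in place
df$city <- iconv(df$city, from = "latin1", to = "UTF-8")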

Common import pitfalls and troubleshooting
Headers misaligned / comments: use skip = n, comment = "#", or skip_empty_rows = TRUE.
Inconsistent column types: pre-specify column types (using col_types or colClasses) to avoid type coercion.
NA tokens: set na = c("", "NA", "n/a").
Thousands separators / decimal: specify locale = locale(decimal_mark = ",", grouping_mark = ".").
Memory errors: filter on SQL side, load a subset, or use arrow/Parquet to read columns only.
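
For the thousands/decimal case specifically, here is a sketch parsing European-style numbers such as "1.234,56" with readr (the file is hypothetical):

library(readr)

# read_csv2() assumes ";" as the separator and "," as the decimal mark
df_eu <- read_csv2("data/eu_numbers.csv")

# or spell out the locale explicitly with read_delim()
df_eu2 <- read_delim(
  "data/eu_numbers.csv",
  delim  = ";",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)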

Quick hacks & productivity tips
Clipboard: quick copy-paste into R (Windows; on macOS read from pbpaste):
df <- read.table("clipboard", header = TRUE, sep = "\t") # Windows
df_mac <- read.table(pipe("pbpaste"), header = TRUE, sep = "\t") # macOS

Peek first rows before full import:
readr::read_lines("file.csv", n_max = 10)

Inspect parsing problems with readr::problems() after read_csv().
Use col_types = cols() to force types and prevent guessing mistakes.
Use projects & relative paths (RStudio projects) to avoid brittle setwd().

Recommended import pipeline (small project)
Inspect the file (read_lines, head, file.info()).
Try a small sample read (n_max, skip).
Specify explicit col_types or colClasses.
Load with a high-performance reader (fread/vroom/arrow) for big data.
Convert dates and factor levels with lubridate and forcats.
Persist as Parquet (arrow) if reusing: arrow::write_parquet(df, "data.parquet").

Example: end-to-end — CSV → cleaned parquet
library(data.table)
library(lubridate)
library(arrow)

fast read

dt <- fread("raw/sales.csv", na.strings = c("", "NA"))

clean types

dt[, order_date := ymd(order_date)]
dt[, amount := as.numeric(amount)]

sample check

head(dt)

persist for faster reads later

write_parquet(dt, "clean/sales.parquet")

Useful packages summary (short cheat-sheet)
Fast CSV read: data.table::fread()
Tidy read: readr::read_csv()
Excel: readxl::read_excel()
JSON: jsonlite::fromJSON()
SPSS/Stata/SAS: haven::read_sav(), read_dta(), read_sas()
Parquet / Arrow: arrow::read_parquet() / write_parquet()
DB connections: DBI + RPostgres / odbc
S3: aws.s3, arrow
Google Sheets: googlesheets4
HTML tables: rvest::html_table()
Chunked reading: LaF, readr::read_lines_chunked()
Labelled variables: labelled

Final notes & next steps
Choose the right tool for your dataset size: fread/vroom/arrow for big files; readr/readxl for comfort and parsing diagnostics.
Preserve raw files and create a reproducible import script (save col_types, date parsing logic).
For production pipelines, prefer Parquet/Arrow or a DB as canonical storage (fast, typed, efficient).
As a next step, consider turning the examples above into a runnable R Markdown notebook with sample files and tests.

At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients—from Fortune 500 companies to mid-sized firms—to solve complex data analytics challenges. Our services include Tableau consultancy and advanced analytics consulting, turning data into strategic insight. We would love to talk to you. Do reach out to us.
