Vivek Kumar

Posted on May 6

SQL String Functions: Clean, Transform, and Tame Messy Data Like a Pro

#sql #database #postgres #dataengineering

Real-world data is messy. Users type their names in ALL CAPS, phone numbers come in five different formats, email addresses have trailing spaces, and city names are spelled three different ways. Before you can analyze or display that data meaningfully, you need to clean and reshape it — and SQL string functions are one of the most powerful tools in your toolkit for doing exactly that.

In this guide, we'll go beyond the basics. You'll see how string functions work together to solve real data problems — the kind that show up constantly in reporting pipelines, backend APIs, and data migrations.

The Sample Database

We'll use a realistic customers table throughout this article:

CREATE TABLE customers (
  customer_id   SERIAL PRIMARY KEY,
  first_name    VARCHAR(100),
  last_name     VARCHAR(100),
  email         VARCHAR(255),
  phone         VARCHAR(30),
  city          VARCHAR(100),
  country_code  CHAR(2)
);

And here's the kind of data you might find in it after a messy data import:

customer_id	first_name	last_name	email	phone	city
1	ALICE	Johnson	alice@example.com	(555) 123-4567	new york
2	bob	Smith	BOB@EXAMPLE.COM	555.987.6543	Los Angeles
3	María	García	maria@example.com	5552345678	CHICAGO
4	James	NULL	james@example.com	+1-555-876-5432	Houston

Classic import chaos. Let's fix it.

TRIM: Killing the Invisible Enemies

Leading and trailing whitespace is the cockroach of data problems — invisible, persistent, and it breaks things at the worst moments. TRIM removes it.

-- Remove spaces from both sides (default behavior)
SELECT TRIM(last_name) AS clean_last_name
FROM customers;

clean_last_name
---------------
Johnson
Smith
García
(null)

You can also target only one side:

SELECT
  LTRIM(email) AS no_leading_space,
  RTRIM(email) AS no_trailing_space
FROM customers;

Gotcha: TRIM only removes spaces by default. If your data has tabs (\t) or newlines (\n) — common in CSV imports — you need REGEXP_REPLACE (covered below).

UPPER and LOWER: Standardizing Case

Mixed-case data makes comparisons a nightmare. alice@example.com and ALICE@example.com will fail an equality check even though they're the same address.

-- Normalize all emails to lowercase
SELECT
  customer_id,
  LOWER(TRIM(email)) AS normalized_email
FROM customers;

customer_id | normalized_email
------------+-------------------
1           | alice@example.com
2           | bob@example.com
3           | maria@example.com
4           | james@example.com

For display names, you often want title case. SQL doesn't have a built-in INITCAP function in MySQL, but PostgreSQL does:

-- PostgreSQL only
SELECT
  INITCAP(LOWER(first_name)) AS display_first,
  INITCAP(LOWER(last_name))  AS display_last
FROM customers;

display_first | display_last
--------------+-------------
Alice         | Johnson
Bob           | Smith
María         | García
James         | (null)

In MySQL, you'd build your own with CONCAT(UPPER(LEFT(name,1)), LOWER(SUBSTRING(name,2))).

CONCAT: Building Strings From Parts

CONCAT joins strings together. Use it for display names, slugs, full addresses — anywhere you need to assemble a value from multiple columns.

-- Build a full display name
SELECT
  CONCAT(
    INITCAP(LOWER(TRIM(first_name))),
    ' ',
    INITCAP(LOWER(TRIM(last_name)))
  ) AS full_name
FROM customers
WHERE last_name IS NOT NULL;

full_name
-----------
Alice Johnson
Bob Smith
María García

The || operator is the ANSI-standard alternative to CONCAT() and works in PostgreSQL:

SELECT TRIM(first_name) || ' ' || TRIM(last_name) AS full_name FROM customers;

Gotcha: In MySQL, CONCAT(value, NULL) returns NULL — a single null column poisons the whole concatenation. Use CONCAT_WS (concat with separator) or COALESCE to guard against this:

-- Safe version that handles NULLs
SELECT CONCAT(first_name, ' ', COALESCE(last_name, '')) AS full_name
FROM customers;

REPLACE: Fixing Formatted Junk

Our phone numbers are a mess — parentheses, dots, dashes, country codes. REPLACE lets you strip unwanted characters one at a time.

-- Strip common phone number formatting characters
SELECT
  phone AS original,
  REPLACE(
    REPLACE(
      REPLACE(
        REPLACE(
          REPLACE(phone, '(', ''),
        ')', ''),
      '-', ''),
    '.', ''),
  ' ', '') AS digits_only
FROM customers;

original	digits_only
(555) 123-4567	5551234567
555.987.6543	5559876543
5552345678	5552345678
+1-555-876-5432	+15558765432

For the last row, you'd still need to handle the +1 country prefix — which is where REGEXP_REPLACE shines.

REGEXP_REPLACE: The Power Tool

When REPLACE isn't enough, regular expressions let you target patterns instead of literal characters.

-- PostgreSQL: strip ALL non-digit characters from phone numbers
SELECT
  phone,
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS digits_only
FROM customers;

phone            | digits_only
-----------------+------------
(555) 123-4567   | 5551234567
555.987.6543     | 5559876543
5552345678       | 5552345678
+1-555-876-5432  | 15558765432

In MySQL 8+, use REGEXP_REPLACE(phone, '[^0-9]', '').

Gotcha: Regex is powerful but slow on large tables. Don't run REGEXP_REPLACE on a column inside a WHERE clause if you can help it — you'll kill index usage and force a full scan. Clean your data in a migration, not on every query.

SUBSTRING: Extracting Exactly What You Need

SUBSTRING (or SUBSTR) extracts part of a string by position.

-- Extract domain from email addresses
SELECT
  email,
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM customers;

email                   | domain
------------------------+------------
alice@example.com       | example.com
bob@example.com         | example.com
maria@example.com       | example.com
james@example.com       | example.com

A more practical example — extracting year from a date stored as text (not ideal, but it happens):

-- Pull the year from 'YYYY-MM-DD' stored as text
SELECT
  order_ref,
  SUBSTRING(order_ref, 1, 4) AS order_year
FROM orders
WHERE order_ref LIKE '____-__-__%';

SUBSTRING(string, start, length) is the MySQL/SQL-standard syntax; PostgreSQL also supports SUBSTRING(string FROM start FOR length).

LENGTH and POSITION: Measuring and Locating

LENGTH tells you how many characters a string has. POSITION (or CHARINDEX in SQL Server) tells you where a substring appears.

-- Find customers with suspiciously short email addresses
SELECT customer_id, email, LENGTH(email) AS email_length
FROM customers
WHERE LENGTH(TRIM(email)) < 10;

-- Find where the '@' sits in each email
SELECT email, POSITION('@' IN email) AS at_position
FROM customers;

Together, POSITION and SUBSTRING let you split strings on a delimiter without regex — useful for parsing values like "Smith, John" or "Houston, TX":

-- Split "city, state" into two columns
SELECT
  location,
  TRIM(SUBSTRING(location FROM 1 FOR POSITION(',' IN location) - 1)) AS city,
  TRIM(SUBSTRING(location FROM POSITION(',' IN location) + 1))       AS state
FROM offices;

Putting It All Together: A Real Cleanup Query

Here's a single query that normalizes the entire customers table for a reporting export:

SELECT
  customer_id,
  INITCAP(LOWER(TRIM(first_name)))                         AS first_name,
  INITCAP(LOWER(TRIM(COALESCE(last_name, ''))))            AS last_name,
  LOWER(TRIM(email))                                       AS email,
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g')                 AS phone_digits,
  INITCAP(LOWER(TRIM(city)))                               AS city,
  UPPER(country_code)                                      AS country_code
FROM customers;

The result:

customer_id	first_name	last_name	email	phone_digits	city
1	Alice	Johnson	alice@example.com	5551234567	New York
2	Bob	Smith	bob@example.com	5559876543	Los Angeles
3	María	García	maria@example.com	5552345678	Chicago
4	James		james@example.com	15558765432	Houston

Common Mistakes to Avoid

Assuming NULLs are empty strings. LENGTH(NULL) returns NULL, not 0. Always wrap nullable columns in COALESCE(column, '') before passing them to string functions.

Chaining REPLACE instead of using REGEXP_REPLACE. Five nested REPLACE calls are harder to read and maintain than a single regex pattern. If you find yourself nesting more than two, switch to regex.

Running string transformations in WHERE clauses on large tables. WHERE LOWER(email) = 'alice@example.com' prevents the database from using an index on email. Either store emails already lowercased, or create a functional index: CREATE INDEX idx_email_lower ON customers (LOWER(email)).

Forgetting character vs. byte length. In MySQL, LENGTH() returns byte count — a multi-byte UTF-8 character (like é or 中) counts as 2–3 bytes. Use CHAR_LENGTH() when you care about the number of visible characters.

Key Takeaways

SQL string functions aren't just cosmetic — they're essential data engineering tools. A few patterns worth remembering:

Clean at storage time, not query time — normalize data when it enters your database, not on every SELECT
Layer functions — INITCAP(LOWER(TRIM(name))) is perfectly valid and readable SQL
Use COALESCE defensively — any function receiving NULL will return NULL, so guard early
Prefer REGEXP_REPLACE for pattern-based cleanup — it's more maintainable than chained REPLACE calls
Watch performance — function calls in WHERE clauses can kill index efficiency on large tables

What's Next?

String functions pair beautifully with SQL's UPDATE statement — you can run a one-time data cleanup across your whole table:

UPDATE customers
SET
  email      = LOWER(TRIM(email)),
  first_name = INITCAP(LOWER(TRIM(first_name))),
  city       = INITCAP(LOWER(TRIM(city)));

Have you run into a particularly gnarly data cleaning challenge in SQL? Drop it in the comments — I'd love to see how you solved it, or we can work through it together.

DEV Community