DEV Community

Vivek Kumar
Vivek Kumar

Posted on

SQL String Functions: Clean, Transform, and Tame Messy Data Like a Pro

Real-world data is messy. Users type their names in ALL CAPS, phone numbers come in five different formats, email addresses have trailing spaces, and city names are spelled three different ways. Before you can analyze or display that data meaningfully, you need to clean and reshape it — and SQL string functions are one of the most powerful tools in your toolkit for doing exactly that.

In this guide, we'll go beyond the basics. You'll see how string functions work together to solve real data problems — the kind that show up constantly in reporting pipelines, backend APIs, and data migrations.


The Sample Database

We'll use a realistic customers table throughout this article:

CREATE TABLE customers (
  customer_id   SERIAL PRIMARY KEY,
  first_name    VARCHAR(100),
  last_name     VARCHAR(100),
  email         VARCHAR(255),
  phone         VARCHAR(30),
  city          VARCHAR(100),
  country_code  CHAR(2)
);
Enter fullscreen mode Exit fullscreen mode

And here's the kind of data you might find in it after a messy data import:

customer_id first_name last_name email phone city
1 ALICE Johnson alice@example.com (555) 123-4567 new york
2 bob Smith BOB@EXAMPLE.COM 555.987.6543 Los Angeles
3 María García maria@example.com 5552345678 CHICAGO
4 James NULL james@example.com +1-555-876-5432 Houston

Classic import chaos. Let's fix it.


TRIM: Killing the Invisible Enemies

Leading and trailing whitespace is the cockroach of data problems — invisible, persistent, and it breaks things at the worst moments. TRIM removes it.

-- Remove spaces from both sides (default behavior)
SELECT TRIM(last_name) AS clean_last_name
FROM customers;
Enter fullscreen mode Exit fullscreen mode
clean_last_name
---------------
Johnson
Smith
García
(null)
Enter fullscreen mode Exit fullscreen mode

You can also target only one side:

SELECT
  LTRIM(email) AS no_leading_space,
  RTRIM(email) AS no_trailing_space
FROM customers;
Enter fullscreen mode Exit fullscreen mode

Gotcha: TRIM only removes spaces by default. If your data has tabs (\t) or newlines (\n) — common in CSV imports — you need REGEXP_REPLACE (covered below).


UPPER and LOWER: Standardizing Case

Mixed-case data makes comparisons a nightmare. alice@example.com and ALICE@example.com will fail an equality check even though they're the same address.

-- Normalize all emails to lowercase
SELECT
  customer_id,
  LOWER(TRIM(email)) AS normalized_email
FROM customers;
Enter fullscreen mode Exit fullscreen mode
customer_id | normalized_email
------------+-------------------
1           | alice@example.com
2           | bob@example.com
3           | maria@example.com
4           | james@example.com
Enter fullscreen mode Exit fullscreen mode

For display names, you often want title case. SQL doesn't have a built-in INITCAP function in MySQL, but PostgreSQL does:

-- PostgreSQL only
SELECT
  INITCAP(LOWER(first_name)) AS display_first,
  INITCAP(LOWER(last_name))  AS display_last
FROM customers;
Enter fullscreen mode Exit fullscreen mode
display_first | display_last
--------------+-------------
Alice         | Johnson
Bob           | Smith
María         | García
James         | (null)
Enter fullscreen mode Exit fullscreen mode

In MySQL, you'd build your own with CONCAT(UPPER(LEFT(name,1)), LOWER(SUBSTRING(name,2))).


CONCAT: Building Strings From Parts

CONCAT joins strings together. Use it for display names, slugs, full addresses — anywhere you need to assemble a value from multiple columns.

-- Build a full display name
SELECT
  CONCAT(
    INITCAP(LOWER(TRIM(first_name))),
    ' ',
    INITCAP(LOWER(TRIM(last_name)))
  ) AS full_name
FROM customers
WHERE last_name IS NOT NULL;
Enter fullscreen mode Exit fullscreen mode
full_name
-----------
Alice Johnson
Bob Smith
María García
Enter fullscreen mode Exit fullscreen mode

The || operator is the ANSI-standard alternative to CONCAT() and works in PostgreSQL:

SELECT TRIM(first_name) || ' ' || TRIM(last_name) AS full_name FROM customers;
Enter fullscreen mode Exit fullscreen mode

Gotcha: In MySQL, CONCAT(value, NULL) returns NULL — a single null column poisons the whole concatenation. Use CONCAT_WS (concat with separator) or COALESCE to guard against this:

-- Safe version that handles NULLs
SELECT CONCAT(first_name, ' ', COALESCE(last_name, '')) AS full_name
FROM customers;
Enter fullscreen mode Exit fullscreen mode

REPLACE: Fixing Formatted Junk

Our phone numbers are a mess — parentheses, dots, dashes, country codes. REPLACE lets you strip unwanted characters one at a time.

-- Strip common phone number formatting characters
SELECT
  phone AS original,
  REPLACE(
    REPLACE(
      REPLACE(
        REPLACE(
          REPLACE(phone, '(', ''),
        ')', ''),
      '-', ''),
    '.', ''),
  ' ', '') AS digits_only
FROM customers;
Enter fullscreen mode Exit fullscreen mode
original digits_only
(555) 123-4567 5551234567
555.987.6543 5559876543
5552345678 5552345678
+1-555-876-5432 +15558765432

For the last row, you'd still need to handle the +1 country prefix — which is where REGEXP_REPLACE shines.


REGEXP_REPLACE: The Power Tool

When REPLACE isn't enough, regular expressions let you target patterns instead of literal characters.

-- PostgreSQL: strip ALL non-digit characters from phone numbers
SELECT
  phone,
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS digits_only
FROM customers;
Enter fullscreen mode Exit fullscreen mode
phone            | digits_only
-----------------+------------
(555) 123-4567   | 5551234567
555.987.6543     | 5559876543
5552345678       | 5552345678
+1-555-876-5432  | 15558765432
Enter fullscreen mode Exit fullscreen mode

In MySQL 8+, use REGEXP_REPLACE(phone, '[^0-9]', '').

Gotcha: Regex is powerful but slow on large tables. Don't run REGEXP_REPLACE on a column inside a WHERE clause if you can help it — you'll kill index usage and force a full scan. Clean your data in a migration, not on every query.


SUBSTRING: Extracting Exactly What You Need

SUBSTRING (or SUBSTR) extracts part of a string by position.

-- Extract domain from email addresses
SELECT
  email,
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM customers;
Enter fullscreen mode Exit fullscreen mode
email                   | domain
------------------------+------------
alice@example.com       | example.com
bob@example.com         | example.com
maria@example.com       | example.com
james@example.com       | example.com
Enter fullscreen mode Exit fullscreen mode

A more practical example — extracting year from a date stored as text (not ideal, but it happens):

-- Pull the year from 'YYYY-MM-DD' stored as text
SELECT
  order_ref,
  SUBSTRING(order_ref, 1, 4) AS order_year
FROM orders
WHERE order_ref LIKE '____-__-__%';
Enter fullscreen mode Exit fullscreen mode

SUBSTRING(string, start, length) is the MySQL/SQL-standard syntax; PostgreSQL also supports SUBSTRING(string FROM start FOR length).


LENGTH and POSITION: Measuring and Locating

LENGTH tells you how many characters a string has. POSITION (or CHARINDEX in SQL Server) tells you where a substring appears.

-- Find customers with suspiciously short email addresses
SELECT customer_id, email, LENGTH(email) AS email_length
FROM customers
WHERE LENGTH(TRIM(email)) < 10;

-- Find where the '@' sits in each email
SELECT email, POSITION('@' IN email) AS at_position
FROM customers;
Enter fullscreen mode Exit fullscreen mode

Together, POSITION and SUBSTRING let you split strings on a delimiter without regex — useful for parsing values like "Smith, John" or "Houston, TX":

-- Split "city, state" into two columns
SELECT
  location,
  TRIM(SUBSTRING(location FROM 1 FOR POSITION(',' IN location) - 1)) AS city,
  TRIM(SUBSTRING(location FROM POSITION(',' IN location) + 1))       AS state
FROM offices;
Enter fullscreen mode Exit fullscreen mode

Putting It All Together: A Real Cleanup Query

Here's a single query that normalizes the entire customers table for a reporting export:

SELECT
  customer_id,
  INITCAP(LOWER(TRIM(first_name)))                         AS first_name,
  INITCAP(LOWER(TRIM(COALESCE(last_name, ''))))            AS last_name,
  LOWER(TRIM(email))                                       AS email,
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g')                 AS phone_digits,
  INITCAP(LOWER(TRIM(city)))                               AS city,
  UPPER(country_code)                                      AS country_code
FROM customers;
Enter fullscreen mode Exit fullscreen mode

The result:

customer_id first_name last_name email phone_digits city
1 Alice Johnson alice@example.com 5551234567 New York
2 Bob Smith bob@example.com 5559876543 Los Angeles
3 María García maria@example.com 5552345678 Chicago
4 James james@example.com 15558765432 Houston

Common Mistakes to Avoid

Assuming NULLs are empty strings. LENGTH(NULL) returns NULL, not 0. Always wrap nullable columns in COALESCE(column, '') before passing them to string functions.

Chaining REPLACE instead of using REGEXP_REPLACE. Five nested REPLACE calls are harder to read and maintain than a single regex pattern. If you find yourself nesting more than two, switch to regex.

Running string transformations in WHERE clauses on large tables. WHERE LOWER(email) = 'alice@example.com' prevents the database from using an index on email. Either store emails already lowercased, or create a functional index: CREATE INDEX idx_email_lower ON customers (LOWER(email)).

Forgetting character vs. byte length. In MySQL, LENGTH() returns byte count — a multi-byte UTF-8 character (like é or ) counts as 2–3 bytes. Use CHAR_LENGTH() when you care about the number of visible characters.


Key Takeaways

SQL string functions aren't just cosmetic — they're essential data engineering tools. A few patterns worth remembering:

  • Clean at storage time, not query time — normalize data when it enters your database, not on every SELECT
  • Layer functionsINITCAP(LOWER(TRIM(name))) is perfectly valid and readable SQL
  • Use COALESCE defensively — any function receiving NULL will return NULL, so guard early
  • Prefer REGEXP_REPLACE for pattern-based cleanup — it's more maintainable than chained REPLACE calls
  • Watch performance — function calls in WHERE clauses can kill index efficiency on large tables

What's Next?

String functions pair beautifully with SQL's UPDATE statement — you can run a one-time data cleanup across your whole table:

UPDATE customers
SET
  email      = LOWER(TRIM(email)),
  first_name = INITCAP(LOWER(TRIM(first_name))),
  city       = INITCAP(LOWER(TRIM(city)));
Enter fullscreen mode Exit fullscreen mode

Have you run into a particularly gnarly data cleaning challenge in SQL? Drop it in the comments — I'd love to see how you solved it, or we can work through it together.

Top comments (0)