Real-world data is messy. Users type their names in ALL CAPS, phone numbers come in five different formats, email addresses have trailing spaces, and city names are spelled three different ways. Before you can analyze or display that data meaningfully, you need to clean and reshape it — and SQL string functions are one of the most powerful tools in your toolkit for doing exactly that.
In this guide, we'll go beyond the basics. You'll see how string functions work together to solve real data problems — the kind that show up constantly in reporting pipelines, backend APIs, and data migrations.
The Sample Database
We'll use a realistic customers table throughout this article:
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
phone VARCHAR(30),
city VARCHAR(100),
country_code CHAR(2)
);
And here's the kind of data you might find in it after a messy data import:
| customer_id | first_name | last_name | phone | city | |
|---|---|---|---|---|---|
| 1 | ALICE | Johnson | alice@example.com | (555) 123-4567 | new york |
| 2 | bob | Smith | BOB@EXAMPLE.COM | 555.987.6543 | Los Angeles |
| 3 | María | García | maria@example.com | 5552345678 | CHICAGO |
| 4 | James | NULL | james@example.com | +1-555-876-5432 | Houston |
Classic import chaos. Let's fix it.
TRIM: Killing the Invisible Enemies
Leading and trailing whitespace is the cockroach of data problems — invisible, persistent, and it breaks things at the worst moments. TRIM removes it.
-- Remove spaces from both sides (default behavior)
SELECT TRIM(last_name) AS clean_last_name
FROM customers;
clean_last_name
---------------
Johnson
Smith
García
(null)
You can also target only one side:
SELECT
LTRIM(email) AS no_leading_space,
RTRIM(email) AS no_trailing_space
FROM customers;
Gotcha: TRIM only removes spaces by default. If your data has tabs (\t) or newlines (\n) — common in CSV imports — you need REGEXP_REPLACE (covered below).
UPPER and LOWER: Standardizing Case
Mixed-case data makes comparisons a nightmare. alice@example.com and ALICE@example.com will fail an equality check even though they're the same address.
-- Normalize all emails to lowercase
SELECT
customer_id,
LOWER(TRIM(email)) AS normalized_email
FROM customers;
customer_id | normalized_email
------------+-------------------
1 | alice@example.com
2 | bob@example.com
3 | maria@example.com
4 | james@example.com
For display names, you often want title case. SQL doesn't have a built-in INITCAP function in MySQL, but PostgreSQL does:
-- PostgreSQL only
SELECT
INITCAP(LOWER(first_name)) AS display_first,
INITCAP(LOWER(last_name)) AS display_last
FROM customers;
display_first | display_last
--------------+-------------
Alice | Johnson
Bob | Smith
María | García
James | (null)
In MySQL, you'd build your own with CONCAT(UPPER(LEFT(name,1)), LOWER(SUBSTRING(name,2))).
CONCAT: Building Strings From Parts
CONCAT joins strings together. Use it for display names, slugs, full addresses — anywhere you need to assemble a value from multiple columns.
-- Build a full display name
SELECT
CONCAT(
INITCAP(LOWER(TRIM(first_name))),
' ',
INITCAP(LOWER(TRIM(last_name)))
) AS full_name
FROM customers
WHERE last_name IS NOT NULL;
full_name
-----------
Alice Johnson
Bob Smith
María García
The || operator is the ANSI-standard alternative to CONCAT() and works in PostgreSQL:
SELECT TRIM(first_name) || ' ' || TRIM(last_name) AS full_name FROM customers;
Gotcha: In MySQL, CONCAT(value, NULL) returns NULL — a single null column poisons the whole concatenation. Use CONCAT_WS (concat with separator) or COALESCE to guard against this:
-- Safe version that handles NULLs
SELECT CONCAT(first_name, ' ', COALESCE(last_name, '')) AS full_name
FROM customers;
REPLACE: Fixing Formatted Junk
Our phone numbers are a mess — parentheses, dots, dashes, country codes. REPLACE lets you strip unwanted characters one at a time.
-- Strip common phone number formatting characters
SELECT
phone AS original,
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(phone, '(', ''),
')', ''),
'-', ''),
'.', ''),
' ', '') AS digits_only
FROM customers;
| original | digits_only |
|---|---|
| (555) 123-4567 | 5551234567 |
| 555.987.6543 | 5559876543 |
| 5552345678 | 5552345678 |
| +1-555-876-5432 | +15558765432 |
For the last row, you'd still need to handle the +1 country prefix — which is where REGEXP_REPLACE shines.
REGEXP_REPLACE: The Power Tool
When REPLACE isn't enough, regular expressions let you target patterns instead of literal characters.
-- PostgreSQL: strip ALL non-digit characters from phone numbers
SELECT
phone,
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS digits_only
FROM customers;
phone | digits_only
-----------------+------------
(555) 123-4567 | 5551234567
555.987.6543 | 5559876543
5552345678 | 5552345678
+1-555-876-5432 | 15558765432
In MySQL 8+, use REGEXP_REPLACE(phone, '[^0-9]', '').
Gotcha: Regex is powerful but slow on large tables. Don't run REGEXP_REPLACE on a column inside a WHERE clause if you can help it — you'll kill index usage and force a full scan. Clean your data in a migration, not on every query.
SUBSTRING: Extracting Exactly What You Need
SUBSTRING (or SUBSTR) extracts part of a string by position.
-- Extract domain from email addresses
SELECT
email,
SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM customers;
email | domain
------------------------+------------
alice@example.com | example.com
bob@example.com | example.com
maria@example.com | example.com
james@example.com | example.com
A more practical example — extracting year from a date stored as text (not ideal, but it happens):
-- Pull the year from 'YYYY-MM-DD' stored as text
SELECT
order_ref,
SUBSTRING(order_ref, 1, 4) AS order_year
FROM orders
WHERE order_ref LIKE '____-__-__%';
SUBSTRING(string, start, length) is the MySQL/SQL-standard syntax; PostgreSQL also supports SUBSTRING(string FROM start FOR length).
LENGTH and POSITION: Measuring and Locating
LENGTH tells you how many characters a string has. POSITION (or CHARINDEX in SQL Server) tells you where a substring appears.
-- Find customers with suspiciously short email addresses
SELECT customer_id, email, LENGTH(email) AS email_length
FROM customers
WHERE LENGTH(TRIM(email)) < 10;
-- Find where the '@' sits in each email
SELECT email, POSITION('@' IN email) AS at_position
FROM customers;
Together, POSITION and SUBSTRING let you split strings on a delimiter without regex — useful for parsing values like "Smith, John" or "Houston, TX":
-- Split "city, state" into two columns
SELECT
location,
TRIM(SUBSTRING(location FROM 1 FOR POSITION(',' IN location) - 1)) AS city,
TRIM(SUBSTRING(location FROM POSITION(',' IN location) + 1)) AS state
FROM offices;
Putting It All Together: A Real Cleanup Query
Here's a single query that normalizes the entire customers table for a reporting export:
SELECT
customer_id,
INITCAP(LOWER(TRIM(first_name))) AS first_name,
INITCAP(LOWER(TRIM(COALESCE(last_name, '')))) AS last_name,
LOWER(TRIM(email)) AS email,
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS phone_digits,
INITCAP(LOWER(TRIM(city))) AS city,
UPPER(country_code) AS country_code
FROM customers;
The result:
| customer_id | first_name | last_name | phone_digits | city | |
|---|---|---|---|---|---|
| 1 | Alice | Johnson | alice@example.com | 5551234567 | New York |
| 2 | Bob | Smith | bob@example.com | 5559876543 | Los Angeles |
| 3 | María | García | maria@example.com | 5552345678 | Chicago |
| 4 | James | james@example.com | 15558765432 | Houston |
Common Mistakes to Avoid
Assuming NULLs are empty strings. LENGTH(NULL) returns NULL, not 0. Always wrap nullable columns in COALESCE(column, '') before passing them to string functions.
Chaining REPLACE instead of using REGEXP_REPLACE. Five nested REPLACE calls are harder to read and maintain than a single regex pattern. If you find yourself nesting more than two, switch to regex.
Running string transformations in WHERE clauses on large tables. WHERE LOWER(email) = 'alice@example.com' prevents the database from using an index on email. Either store emails already lowercased, or create a functional index: CREATE INDEX idx_email_lower ON customers (LOWER(email)).
Forgetting character vs. byte length. In MySQL, LENGTH() returns byte count — a multi-byte UTF-8 character (like é or 中) counts as 2–3 bytes. Use CHAR_LENGTH() when you care about the number of visible characters.
Key Takeaways
SQL string functions aren't just cosmetic — they're essential data engineering tools. A few patterns worth remembering:
-
Clean at storage time, not query time — normalize data when it enters your database, not on every
SELECT -
Layer functions —
INITCAP(LOWER(TRIM(name)))is perfectly valid and readable SQL -
Use
COALESCEdefensively — any function receivingNULLwill returnNULL, so guard early -
Prefer
REGEXP_REPLACEfor pattern-based cleanup — it's more maintainable than chainedREPLACEcalls -
Watch performance — function calls in
WHEREclauses can kill index efficiency on large tables
What's Next?
String functions pair beautifully with SQL's UPDATE statement — you can run a one-time data cleanup across your whole table:
UPDATE customers
SET
email = LOWER(TRIM(email)),
first_name = INITCAP(LOWER(TRIM(first_name))),
city = INITCAP(LOWER(TRIM(city)));
Have you run into a particularly gnarly data cleaning challenge in SQL? Drop it in the comments — I'd love to see how you solved it, or we can work through it together.
Top comments (0)