CITEXT vs LOWER

#postgres #citext #btree #sql

Today we going to compare the differences between two approaches for case insensitive text search.

Starting with CITEXT

CITEXT is a case-insensitive character string type, it stores the value as its input. It's does not convert the values to lower case.

But how does it work?
It actually uses LOWER() behind the scenes whenever it does a comparison.

/*
 * citextcmp()
 * Internal comparison function for citext strings.
 * Returns int32 negative, zero, or positive.
 */
static int32
citextcmp(text *left, text *right, Oid collid)
{
    char       *lcstr,
               *rcstr;
    int32       result;

    /*
     * We must do our str_tolower calls with DEFAULT_COLLATION_OID, not the
     * input collation as you might expect.  This is so that the behavior of
     * citext's equality and hashing functions is not collation-dependent.  We
     * should change this once the core infrastructure is able to cope with
     * collation-dependent equality and hashing functions.
     */

    lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left), DEFAULT_COLLATION_OID);
    rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right), DEFAULT_COLLATION_OID);

    result = varstr_cmp(lcstr, strlen(lcstr),
                        rcstr, strlen(rcstr),
                        collid);

    pfree(lcstr);
    pfree(rcstr);

    return result;
}

So, does it mean that using CITEXT will add overhead over the query?
... actually, it depends. Depend on your Database Collation the performance will vary, while if you use an index the results will be close.

Let's dig deep, my current collation is "en_US.utf8"

SHOW lc_collate;
// = en_US.utf8

TEXT

CREATE TABLE text_data(t text NOT NULL);

INSERT INTO text_data
   SELECT i||'text'
   FROM generate_series(1, 1000000) AS i;

VACUUM (FREEZE, ANALYZE) text_data;

explain analyze (
  SELECT * FROM text_data WHERE t = 'ted'
)

//output
"Gather  (cost=1000.00..11613.43 rows=1 width=10) (actual time=45.528..49.233 rows=0 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Parallel Seq Scan on text_data  (cost=0.00..10613.33 rows=1 width=10) (actual time=42.385..42.385 rows=0 loops=3)"
"        Filter: (t = 'ted'::text)"
"        Rows Removed by Filter: 333333"
"Planning Time: 0.112 ms"
"Execution Time: 49.258 ms"

CITEXT

CREATE TABLE citext_data(t citext NOT NULL);

INSERT INTO citext_data
   SELECT i||'text'
   FROM generate_series(1, 1000000) AS i;

VACUUM (FREEZE, ANALYZE) citext_data;

explain analyze (
  SELECT * FROM citext_data WHERE t = 'ted'
)

//output
"Gather  (cost=1000.00..11613.43 rows=1 width=10) (actual time=291.318..294.635 rows=0 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Parallel Seq Scan on citext_data  (cost=0.00..10613.33 rows=1 width=10) (actual time=288.713..288.714 rows=0 loops=3)"
"        Filter: (t = 'ted'::citext)"
"        Rows Removed by Filter: 333333"
"Planning Time: 0.045 ms"
"Execution Time: 294.655 ms"

So basically, CITEXT was performing 6 times slower than the TEXT.
Thats actually make sense because it performs sequential scan with a million comparisons. While if we were using index, they should be close to each other.

TEXT

CREATE INDEX ON text_data (t);

VACUUM (FREEZE, ANALYZE) text_data;

explain analyze (
SELECT * FROM text_data WHERE t = 'ted'
)

//output
"Index Only Scan using text_data_t_idx on text_data  (cost=0.42..4.44 rows=1 width=10) (actual time=0.101..0.101 rows=0 loops=1)"
"  Index Cond: (t = 'ted'::text)"
"  Heap Fetches: 0"
"Planning Time: 0.113 ms"
"Execution Time: 0.116 ms"

CITEXT

CREATE INDEX ON citext_data (t);

VACUUM (FREEZE, ANALYZE) citext_data;

explain analyze (
SELECT * FROM citext_data WHERE t = 'ted'
)

//output
"Index Only Scan using citext_data_t_idx on citext_data  (cost=0.42..4.44 rows=1 width=10) (actual time=0.091..0.091 rows=0 loops=1)"
"  Index Cond: (t = 'ted'::citext)"
"  Heap Fetches: 0"
"Planning Time: 0.111 ms"
"Execution Time: 0.105 ms"

So where is LOWER() from all of this?

LOWER() is just a string function. It converts the string to all lower case, according to the rules of the database's locale.

So, by default LOWER() does not use the b-tree index of what we already created on the text_data table.
Confirming this by the following example:

explain analyze (
  SELECT * FROM text_data WHERE lower(t) = 'ted'
)

//output
"Gather  (cost=1000.00..13155.00 rows=5000 width=10) (actual time=200.007..202.697 rows=0 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  ->  Parallel Seq Scan on text_data  (cost=0.00..11655.00 rows=2083 width=10) (actual time=197.425..197.426 rows=0 loops=3)"
"        Filter: (lower(t) = 'ted'::text)"
"        Rows Removed by Filter: 333333"
"Planning Time: 0.109 ms"
"Execution Time: 202.716 ms"

To improve its performance, we will have to add a functional index.

CREATE INDEX lower_text_data ON text_data (lower(t));

VACUUM (FREEZE, ANALYZE) text_data;

explain analyze (
  SELECT * FROM text_data WHERE lower(t) = 'ted'
)

//output
"Index Scan using lower_text_data on text_data  (cost=0.42..8.44 rows=1 width=10) (actual time=0.057..0.057 rows=0 loops=1)"
"  Index Cond: (lower(t) = 'ted'::text)"
"Planning Time: 0.189 ms"
"Execution Time: 0.068 ms"

By performing multiple examples, both CITEXT and LOWER() are close to each other with the right applied index.

CITEXT is using the B-tree index. So, if you already have one you don't need to create another while LOWER() needs the functional index in-order to perform.

LOWER() makes your SQL statements verbose, and you always have to remember to use LOWER() on the column and the value.

CITEXT allow the primary key to be case-insensitive while LOWER() does not allow that.

DEV Community

CITEXT vs LOWER

Starting with CITEXT

So where is LOWER() from all of this?

Top comments (0)

Read next

Text to pdf using jsPDF

Being a developer is fashionable!

Load testing 3scale

OTEL Tracing demo with AWS X-Ray and Grafana