Based on the amount of posts in the web and questions from my colleagues, this seems to be a quite common challenge. That's why I decided to write this post with the goal of going beyond what you typically find in stackoverflow answers and comments.
So what is the problem? We have a column that contains some list and we want to turn it into rows. So from this ...
col |
---|
a,b,c |
we want to get this ...
result |
---|
a |
b |
c |
I already hear you saying: "But this is not a good design!". True, first normal form (1.NF) does not allow the source table to look like above. But the reality check says, that not all developers (and data modellers) know 1.NF and even less follow it, often for good reasons, e.g. to build some kind of staging area in their database.
Let us start with the test data. We will use Oracle dual
"table" for this purpose
SELECT 'a,b,c' AS col
FROM dual
and to improve readability of the SQL statements below, we'll pack it into a WITH
clause
WITH test AS (
SELECT 'a,b,c' AS col
FROM dual )
SELECT * FROM test
The most common solution is to use REGEXP_SUBSTR
and REGEXP_COUNT
functions to achieve the expected result
WITH test AS (
SELECT 'a,b,c' AS col
FROM dual )
SELECT LEVEL AS n, REGEXP_SUBSTR(col, '[^,]+', 1, LEVEL) AS out
FROM test
CONNECT BY LEVEL <= REGEXP_COUNT(col, ',') + 1
We get
N | OUT |
---|---|
1 | a |
2 | b |
3 | c |
But this "solution" has one big drawback. Let's try to put more than just one row into test data
WITH test AS (
SELECT 'a,b,c' AS col
FROM dual
UNION ALL
SELECT 'd,e,f' AS col
FROM dual )
SELECT LEVEL AS n, REGEXP_SUBSTR(col, '[^,]+', 1, LEVEL) AS out
FROM test
CONNECT BY LEVEL <= REGEXP_COUNT(col, ',') + 1
ORDER BY n, out
NOTE: ORDER BY was also added as last line to the query above to improve readability of the output.
The output is not what is expected
N | OUT |
---|---|
1 | a |
1 | d |
2 | b |
2 | b |
2 | e |
2 | e |
3 | c |
3 | c |
3 | c |
3 | c |
3 | f |
3 | f |
3 | f |
3 | f |
What's going on here? Data in second level is duplicated and on third level it even quadrupled. What's wrong? In order to find out, we add function call for SYS_CONNECT_BY_PATH
WITH test AS (
SELECT 'a,b,c' AS col
FROM dual
UNION ALL
SELECT 'd,e,f' AS col
FROM dual )
SELECT LEVEL AS n, REGEXP_SUBSTR(col, '[^,]+', 1, LEVEL) AS out,
SYS_CONNECT_BY_PATH(LEVEL || '+' || REGEXP_SUBSTR(col, '[^,]+', 1, LEVEL), '/') as path
FROM test
CONNECT BY LEVEL <= REGEXP_COUNT(col, ',') + 1
ORDER BY n, out
N | OUT | PATH |
---|---|---|
1 | a | /1+a |
1 | d | /1+d |
2 | b | /1+a/2+b |
2 | b | /1+d/2+b |
2 | e | /1+d/2+e |
2 | e | /1+a/2+e |
3 | c | /1+d/2+b/3+c |
3 | c | /1+a/2+b/3+c |
3 | c | /1+d/2+e/3+c |
3 | c | /1+a/2+e/3+c |
3 | f | /1+d/2+b/3+f |
3 | f | /1+d/2+e/3+f |
3 | f | /1+a/2+b/3+f |
3 | f | /1+a/2+e/3+f |
Output shows, that the query is traversing the path and is jumping from one test data row to the other. I tried using some features like NOCYCLE or adding a unique column to the test data to fix this, but without success. If you have a solution, please let me know.
But there is a solution, in my opinion even more elegant. Welcome to CROSS APPLY join clause.
WITH test AS (
SELECT 'a,b,c' AS col
FROM dual
UNION ALL
SELECT 'd,e,f' AS col
FROM dual )
SELECT *
FROM test t
CROSS APPLY
(SELECT LEVEL AS n, REGEXP_SUBSTR(t.col, '[^,]+', 1, LEVEL) AS out
FROM dual
CONNECT BY LEVEL <= REGEXP_COUNT(t.col, ',') + 1)
ORDER BY n, out
The output is exactly as expected
COL | N | OUT |
---|---|---|
a,b,c | 1 | a |
d,e,f | 1 | d |
a,b,c | 2 | b |
d,e,f | 2 | e |
a,b,c | 3 | c |
d,e,f | 3 | f |
I have a follow up to this article in my mind on how to enhance the query for source data containing strings (or even CLOBs), that cannot be easily parsed by regular expressions. So stay tuned!
That's it, folks! at least for now.
Top comments (0)