Jialei Jin

How LingoDB does column pruning

This article talks about LingoDB's column pruning and how it differs from the traditional approach, thanks to MLIR making it easy.

I am an external contributor to LingoDB, not affiliated with TUM. For this article's content, I am just a learner, not a contributor.

Typical way of column pruning

Column pruning is a very common optimization in databases. It lets a TableScan fetch data only from the columns actually needed by upstream operators, such as a projection. It is one of the optimizations that decreases disk IO.

Let's walk through column pruning with a very simple example. Say we have the query select abs(name) from student. The query plan structure is simple and can be described from top to bottom like this:

[QUERY PLAN]
Projection{student::name => abs(student::name)} -> TableScan{student} 

[REQUIRED COLUMN ANALYZER]
- Analyze Projection: collect REQUIRED column student::name
- Analyzer reach TableScan: set [required columns] as attributes to TableScan

[QUERY PLAN AFTER COLUMN PRUNING]
Projection{student::name => abs(student::name)} -> TableScan{student, columns{name}} 

For more complex cases with join or where, the [REQUIRED COLUMN ANALYZER] is applied to those operators to collect the used/required columns before it reaches the TableScan.
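
To make this top-down flow concrete, here is a minimal standalone sketch in C++. It is not LingoDB code: the PlanNode, Projection, and TableScan types are made up for illustration, and the analyzer simply pushes the accumulated set of required columns from the root down to the scan.

#include <iostream>
#include <memory>
#include <set>
#include <string>

// Toy plan tree for the traditional top-down analysis (illustration only, not LingoDB code).
// Each operator adds the columns it uses itself and pushes the set down to its child
// until the TableScan is reached.
struct PlanNode {
   virtual ~PlanNode() = default;
   virtual void pruneColumns(std::set<std::string> required) = 0;
};

struct TableScan : PlanNode {
   std::string table;
   std::set<std::string> columnsToFetch; // filled in by the analyzer
   explicit TableScan(std::string t) : table(std::move(t)) {}
   void pruneColumns(std::set<std::string> required) override {
      columnsToFetch = std::move(required); // keep only what the parents asked for
   }
};

struct Projection : PlanNode {
   std::set<std::string> usedColumns;
   std::unique_ptr<PlanNode> child;
   void pruneColumns(std::set<std::string> required) override {
      required.insert(usedColumns.begin(), usedColumns.end());
      child->pruneColumns(std::move(required)); // push the requirements downwards
   }
};

int main() {
   // select abs(name) from student
   Projection proj;
   proj.usedColumns = {"student::name"};
   auto scan = std::make_unique<TableScan>("student");
   TableScan* scanPtr = scan.get();
   proj.child = std::move(scan);

   proj.pruneColumns({}); // the analyzer starts at the top with an empty requirement set
   for (const auto& col : scanPtr->columnsToFetch)
      std::cout << "scan column: " << col << "\n"; // prints student::name
}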

How LingoDB does column pruning

If you search the code, you will not find anything explicitly named column pruning. Column pruning is not a standalone optimization/pass in LingoDB; LingoDB just does it, very handily, inside a rewrite pass. I will simply post the code to show how LingoDB does column pruning. It's very simple.

class BaseTableLowering : public OpConversionPattern<relalg::BaseTableOp> {
   public:
   using OpConversionPattern<relalg::BaseTableOp>::OpConversionPattern;
   LogicalResult matchAndRewrite(relalg::BaseTableOp baseTableOp, OpAdaptor adaptor, ConversionPatternRewriter& rewriter) const override {
      // Column pruning happens right here: only the required columns are lowered.
      auto required = getRequired(baseTableOp);
      // ... rest of the lowering omitted
   }
};

BaseTableLowering is the pattern that rewrites a TableScan from the relalg dialect to the subop dialect. The full rewriting process is not what we care about in this article.

A BaseTableOp is the TableScan definition. The getRequired(baseTableOp) call collects the required columns for the TableScan. This required-column analysis runs from bottom to top, from def to users, which is the opposite of the typical way of column pruning.

The crucial thing that lets LingoDB achieve column pruning from def to users is that MLIR provides def/use information by nature. Def/use info is crucial in compiler optimizations such as liveness analysis, and in a database compilation pipeline it is a perfect fit for column pruning. The [REQUIRED COLUMN ANALYZER] is defined as follows; read the comments along with the code.

static relalg::ColumnSet getRequired(Operator op) {
   // 1. Calculate the available columns. For a BaseTableOp like student, available could be [name, birth, gender, addr].
   auto available = op.getAvailableColumns();

   relalg::ColumnSet required;
   // 2. Iterate over the users. For a BaseTableOp, getUsers() could return [projection, filter, join, ...].
   //    Required columns are calculated recursively, so columns that are only indirectly required are covered too;
   //    an example is a projection that uses a column of a join which joins two tables.
   for (auto* user : op->getUsers()) {
      if (auto consumingOp = mlir::dyn_cast_or_null<Operator>(user)) {
         required.insert(getRequired(consumingOp));
         required.insert(consumingOp.getUsedColumns());
      }
      if (auto materializeOp = mlir::dyn_cast_or_null<relalg::MaterializeOp>(user)) {
         required.insert(relalg::ColumnSet::fromArrayAttr(materializeOp.getCols()));
      }
   }
   // 3. Intersect: required may contain used columns that do not come from the current BaseTableOp.
   return available.intersect(required);
}
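
To contrast with the top-down sketch above, here is an equally minimal def-to-users model in plain C++. Again, this is not LingoDB code: the Op struct only mimics the def/use information MLIR gives you for free, and it ignores the subtleties of getAvailableColumns for intermediate operators, which is enough for the single-projection example.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Toy model of the def-to-users analysis (illustration only, not LingoDB code).
// Every operator knows its users and the columns it uses itself, mirroring the
// def/use information that MLIR maintains for every operation.
struct Op {
   std::set<std::string> availableColumns; // what this op provides (filled for table scans here)
   std::set<std::string> usedColumns;      // what this op consumes itself
   std::vector<Op*> users;                 // MLIR-style use list
};

// Walk upwards from the defining op through its users, collect every column they
// use (directly or transitively), then keep only the columns this op provides.
std::set<std::string> getRequired(const Op& op) {
   std::set<std::string> required;
   for (const Op* user : op.users) {
      auto fromUser = getRequired(*user);
      required.insert(fromUser.begin(), fromUser.end());
      required.insert(user->usedColumns.begin(), user->usedColumns.end());
   }
   std::set<std::string> result;
   for (const auto& col : op.availableColumns)
      if (required.count(col)) result.insert(col);
   return result;
}

int main() {
   // select abs(name) from student
   Op scan{{"student::name", "student::birth", "student::gender", "student::addr"}, {}, {}};
   Op projection{{}, {"student::name"}, {}};
   scan.users.push_back(&projection);

   for (const auto& col : getRequired(scan))
      std::cout << "scan column: " << col << "\n"; // prints student::name only
}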

Let me show how the TableScan changes after getRequired is called, using LingoDB's relalg IR for a more concrete understanding. We use a more complex SQL query like the following; the tables used are all TPC-H tables. You can also try it here: https://www.lingo-db.com/interface/

select n_name, c_name from
(
  select * from region tb1 inner join nation 
  on n_regionkey = r_regionkey 
  where r_name = 'EUROPE'
) tb2
inner join customer
on c_nationkey = n_nationkey
limit 10;

[SUBOP TABLESCAN] below shows the TableScans after rewriting.

  • region: The attribute @scan_u_3::@ref({type = !subop.table_entry_ref<[r_name$0 : !db.char<25>, r_regionkey$0 : i32]>}) shows that only two columns need to be scanned: r_name is required by the predicate and r_regionkey is required by the inner join.
  • nation: The attribute @scan_u_2::@ref({type = !subop.table_entry_ref<[n_name$1 : !db.char<25>, n_nationkey$0 : i32, n_regionkey$0 : i32]>}) {parallel} shows that only three columns need to be scanned: n_name is required by the outer projection, n_nationkey by the outer join, and n_regionkey by the inner join.
  • customer: The attribute @scan_u_1::@ref({type = !subop.table_entry_ref<[c_name$1 : !db.string, c_nationkey$0 : i32]>}) shows that only two columns need to be scanned: c_name is required by the outer projection and c_nationkey by the outer join.
[RELALG TABLESCAN]
%1 = relalg.basetable {rows = 5.000000e+00 : f64, table_identifier = "region"} columns: {r_comment => @tb1_::@r_comment({type = !db.string}), r_name => @tb1_::@r_name({type = !db.char<25>}), r_regionkey => @tb1_::@r_regionkey({type = i32})}
%2 = relalg.basetable {rows = 2.500000e+01 : f64, table_identifier = "nation"} columns: {n_comment => @nation::@n_comment({type = !db.string}), n_name => @nation::@n_name({type = !db.char<25>}), n_nationkey => @nation::@n_nationkey({type = i32}), n_regionkey => @nation::@n_regionkey({type = i32})}
  ...
%5 = relalg.basetable {rows = 1.500000e+05 : f64, table_identifier = "customer"} columns: {c_acctbal => @customer::@c_acctbal({type = !db.decimal<12, 2>}), c_address => @customer::@c_address({type = !db.string}), c_comment => @customer::@c_comment({type = !db.string}), c_custkey => @customer::@c_custkey({type = i32}), c_mktsegment => @customer::@c_mktsegment({type = !db.char<10>}), c_name => @customer::@c_name({type = !db.string}), c_nationkey => @customer::@c_nationkey({type = i32}), c_phone => @customer::@c_phone({type = !db.char<15>})}

[SUBOP TABLESCAN]
%15 = subop.scan_refs %arg0 : !subop.table<[r_comment$0 : !db.string, r_name$0 : !db.char<25>, r_regionkey$0 : i32]> @scan_u_3::@ref({type = !subop.table_entry_ref<[r_name$0 : !db.char<25>, r_regionkey$0 : i32]>}) {parallel}
  ...
%15 = subop.scan_refs %arg0 : !subop.table<[n_comment$0 : !db.string, n_name$1 : !db.char<25>, n_nationkey$0 : i32, n_regionkey$0 : i32]> @scan_u_2::@ref({type = !subop.table_entry_ref<[n_name$1 : !db.char<25>, n_nationkey$0 : i32, n_regionkey$0 : i32]>}) {parallel}
  ...
%15 = subop.scan_refs %arg0 : !subop.table<[c_acctbal$0 : !db.decimal<12, 2>, c_address$0 : !db.string, c_comment$0 : !db.string, c_custkey$0 : i32, c_mktsegment$0 : !db.char<10>, c_name$1 : !db.string, c_nationkey$0 : i32, c_phone$0 : !db.char<15>]> @scan_u_1::@ref({type = !subop.table_entry_ref<[c_name$1 : !db.string, c_nationkey$0 : i32]>}) {parallel}

Conclusion

I hope I have shown how differently, and how handily, LingoDB handles this query optimization compared to other databases. MLIR provides a compilation infrastructure that makes such compilation tasks easy. I avoided the more delicate parts of LingoDB, like how getAvailableColumns inside getRequired is implemented. On top of the MLIR infrastructure, LingoDB still needs a lot of great design to make this work, but I would like to keep this article simple and just show the general idea of how column pruning is handled in LingoDB.

Another thing you may notice is that this getRequired implementation may not be the fastest. For example, in the join case getRequired visits all users for each TableScan, which can visit the same operator multiple times, whereas the typical way of column pruning visits every operator only once. However, LingoDB is not chasing extreme compilation speed right now; its main objective can be seen on the website https://www.lingo-db.com/. The current design is totally acceptable.
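
Purely to illustrate that cost argument, a hypothetical variant could memoize getRequired per operator so that every operator is analyzed once. A sketch over the same kind of toy model (again, not LingoDB code) could look like this:

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model again (not LingoDB code): memoizing the per-operator result makes the
// bottom-up analysis visit every operator only once, like the top-down approach.
struct Op {
   std::set<std::string> availableColumns;
   std::set<std::string> usedColumns;
   std::vector<Op*> users;
};

std::set<std::string> getRequired(const Op& op, std::map<const Op*, std::set<std::string>>& cache) {
   auto it = cache.find(&op);
   if (it != cache.end()) return it->second; // already analyzed: reuse the result
   std::set<std::string> required;
   for (const Op* user : op.users) {
      auto fromUser = getRequired(*user, cache);
      required.insert(fromUser.begin(), fromUser.end());
      required.insert(user->usedColumns.begin(), user->usedColumns.end());
   }
   std::set<std::string> result;
   for (const auto& col : op.availableColumns)
      if (required.count(col)) result.insert(col);
   return cache[&op] = result;
}

int main() {
   // Two scans feeding one join: without the cache, the join would be analyzed twice.
   Op join{{}, {"region::r_regionkey", "nation::n_regionkey"}, {}};
   Op regionScan{{"region::r_name", "region::r_regionkey", "region::r_comment"}, {}, {&join}};
   Op nationScan{{"nation::n_name", "nation::n_regionkey", "nation::n_comment"}, {}, {&join}};

   std::map<const Op*, std::set<std::string>> cache;
   for (const auto& col : getRequired(regionScan, cache)) std::cout << "region: " << col << "\n";
   for (const auto& col : getRequired(nationScan, cache)) std::cout << "nation: " << col << "\n";
}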
