Setting up TPC-H on Amazon Aurora MySQL

#aws #database #testing

Over the past 20 years I have spent a significant amount of time doing all kinds of benchmarking tests. This week I was looking to kick the tyres of Amazon Aurora, and explore the performance gains that can be had using some of the features that customers can use, such as parallel query. I was looking to reproduce some of the findings of this blog post from Jeff Barr.

To do that, I needed to setup some benchmarking tools, and for databases there are plenty out there. One of the standard benchmarks for testing queries on databases however, is the TPC benchmarking tests. There are quite a few of them, and I certainly do not fully understand the full breadth of each of them but it looked like one in particular was useful in benchmarking the kinds of things I was looking for, and that benchmark was TPC-H.

I thought it was going to take only a short amount of time to get together, but one weekend later and I was still struggling. Much of that was probably my lack of knowledge in all things SQL, but I thought I would document/share how to get this up and running in the hope it will save someone else some time.

What you will need before starting

The only thing I needed was my linux desktop (I run an Ubuntu instance on AWS, but others will work) with the standard build-essential/gcc tools installed. All the steps below were performed on this workstation.

You will also need an Amazon Aurora database cluster which has been configured with a parallel query engine.

I also configured so my workstation instance had secure network connectivity to the Amazon Aurora databases, and had all the MySQL client tools needed. I used Amazon Session Manager to get access to my Ubuntu workstation.

Downloading and generating sample data

The first thing you will need to do is to download the TPC-H files from TPC and then configure and compile a program called dbgen that will generate your sample data. This was pretty straight forward, although I needed to do the following:

1.One the files have been decompressed, in the dbgen folder you need to rename makefile.suite to makefile

2.You need to edit makefile to configure the dbgen to generate MySQL, and specify your compiler. There is only a few changes you need to do:

CC = gcc 
DATABASE= MYSQL #database format is mysql
MACHINE = LINUX #System environment is linux
WORKLOAD = TPCH #Workloaded as tpch

3.You now need to add a MYSQL section to the header file, tpcd.h and add the following (I added it after the other database providers listed)

#ifdef MYSQL
#define GEN_QUERY_PLAN  ""
#define START_TRAN      "START TRANSACTION"
#define END_TRAN        "COMMIT"
#define SET_OUTPUT      ""
#define SET_ROWCOUNT    "limit %d;\n"
#define SET_DBASE       "use %s;\n"
#endif

4.Once you have made the changes, you can run make to compile the code and you should end up with a dbgen executable which you can now run using the following command to generate some sample data. (The following command will generate 1GB of data, [-s 1] and you can view the manpage to see more options)

$ ./dbgen -s 1

If it runs ok, you will now see the following in your folder. We will use these later.

customer.tbl
lineitem.tbl
nation.tbl
orders.tbl
part.tbl
partsupp.tbl
region.tbl
supplier.tbl

Creating database and importing sample data

5.Now that I had the data, I just needed to create the database and schema so it could be useful and I could starting running queries to test Amazon Aurora.

There is a file called dss.ddl that is in the dbgen folder that you can use to do that. Connect to the Aurora endpoint using the mysql client, using your user/password

$ mysql -h$DB -u$DBUSER -p"$DBPASS"

Once connected, create the database.

mysql > create database tpch;

6.And down create the schema using the dss.ddl file (I ran the mysql client command from the dbgen folder, so I new that this file would be in the 'root' folder)

mysql > \. dss.ddl

It should create your tables (you can check by running a "show tables" command), and now you should need to run some further sql to configure indexes and keys. The TPC documentation tells you to use the dss.ri file, but I had issues with this. This is the file that I ended up using:

-- ALTER TABLE REGION DROP PRIMARY KEY;
-- ALTER TABLE NATION DROP PRIMARY KEY;
-- ALTER TABLE PART DROP PRIMARY KEY;
-- ALTER TABLE SUPPLIER DROP PRIMARY KEY;
-- ALTER TABLE PARTSUPP DROP PRIMARY KEY;
-- ALTER TABLE ORDERS DROP PRIMARY KEY;
-- ALTER TABLE LINEITEM DROP PRIMARY KEY;
-- ALTER TABLE CUSTOMER DROP PRIMARY KEY;


-- For table REGION
ALTER TABLE REGION
ADD PRIMARY KEY (R_REGIONKEY);

-- For table NATION
ALTER TABLE NATION
ADD PRIMARY KEY (N_NATIONKEY);

ALTER TABLE NATION
ADD FOREIGN KEY NATION_FK1 (N_REGIONKEY) references REGION(R_REGIONKEY);


COMMIT WORK;

ALTER TABLE PART
ADD PRIMARY KEY (P_PARTKEY);

COMMIT WORK;

ALTER TABLE SUPPLIER
ADD PRIMARY KEY (S_SUPPKEY);

ALTER TABLE SUPPLIER
ADD FOREIGN KEY SUPPLIER_FK1 (S_NATIONKEY) references NATION(N_NATIONKEY);

COMMIT WORK;

ALTER TABLE PARTSUPP
ADD PRIMARY KEY (PS_PARTKEY,PS_SUPPKEY);

COMMIT WORK;

ALTER TABLE CUSTOMER
ADD PRIMARY KEY (C_CUSTKEY);

ALTER TABLE CUSTOMER
ADD FOREIGN KEY CUSTOMER_FK1 (C_NATIONKEY) references NATION(N_NATIONKEY);

COMMIT WORK;

ALTER TABLE LINEITEM
ADD PRIMARY KEY (L_ORDERKEY,L_LINENUMBER);

COMMIT WORK;

ALTER TABLE ORDERS
ADD PRIMARY KEY (O_ORDERKEY);

COMMIT WORK;

ALTER TABLE PARTSUPP
ADD FOREIGN KEY PARTSUPP_FK1 (PS_SUPPKEY) references SUPPLIER(S_SUPPKEY);

COMMIT WORK;

ALTER TABLE PARTSUPP
ADD FOREIGN KEY PARTSUPP_FK2 (PS_PARTKEY) references PART(P_PARTKEY);

COMMIT WORK;

ALTER TABLE ORDERS
ADD FOREIGN KEY ORDERS_FK1 (O_CUSTKEY) references CUSTOMER(C_CUSTKEY);

COMMIT WORK;

ALTER TABLE LINEITEM
ADD FOREIGN KEY LINEITEM_FK1 (L_ORDERKEY)  references ORDERS(O_ORDERKEY);

COMMIT WORK;

ALTER TABLE LINEITEM
ADD FOREIGN KEY LINEITEM_FK2 (L_PARTKEY,L_SUPPKEY) references
        PARTSUPP(PS_PARTKEY,PS_SUPPKEY);

COMMIT WORK;

7.Now you will import that data that you created with dbgen. This is where I encountered a lot of problems as it looks like you need to import that data in a specific order. This is the order that worked for me.

LOAD DATA LOCAL INFILE 'part.tbl' INTO TABLE PART FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'region.tbl' INTO TABLE REGION FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'nation.tbl' INTO TABLE NATION FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'customer.tbl' INTO TABLE CUSTOMER FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'orders.tbl' INTO TABLE ORDERS FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'supplier.tbl' INTO TABLE SUPPLIER FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'partsupp.tbl' INTO TABLE PARTSUPP FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INFILE 'lineitem.tbl' INTO TABLE LINEITEM FIELDS TERMINATED BY '|';

You should get no errors - if you do, then you should stop and review what you have done. Things will not work if these setup of the database, the keys/index and data have not been setup right.

8.The final piece is to rename the tables so they are all lower case. I used the following to do this - again, once logged into the mysql client;

alter table CUSTOMER    rename to customer ;
alter table LINEITEM    rename to lineitem ;
alter table NATION      rename to nation   ;
alter table ORDERS      rename to orders   ;
alter table PART        rename to part     ;
alter table PARTSUPP    rename to partsupp ;
alter table REGION      rename to region   ;
alter table SUPPLIER    rename to supplier ;

And that should be it. You are now ready for running queries against the TPC-H dataset.

Testing this with a query

9.I ran the following query,

SELECT
  l_orderkey,
  SUM(l_extendedprice * (1-l_discount)) AS revenue,
  o_orderdate,
  o_shippriority

FROM customer, orders, lineitem

WHERE
  c_mktsegment='AUTOMOBILE'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < date '1995-03-13'
  AND l_shipdate > date '1995-03-13'

GROUP BY
  l_orderkey,
  o_orderdate,
  o_shippriority

ORDER BY
  revenue DESC,
  o_orderdate LIMIT 15;

and got this output.


+------------+-------------+-------------+----------------+
| l_orderkey | revenue     | o_orderdate | o_shippriority |
+------------+-------------+-------------+----------------+
|    4791171 | 452497.4729 | 1995-02-23  |              0 |
|    4163074 | 437267.7799 | 1995-02-13  |              0 |
|    4676933 | 412072.0035 | 1995-02-07  |              0 |
|    3778628 | 399682.3516 | 1995-02-25  |              0 |
|    2773540 | 398691.5039 | 1995-02-27  |              0 |
|    2692902 | 394554.8742 | 1995-03-06  |              0 |
|     181414 | 393083.4426 | 1995-03-08  |              0 |
|    4178471 | 390099.4832 | 1995-02-28  |              0 |
|    4739141 | 385442.5444 | 1995-03-07  |              0 |
|    1310209 | 381336.0532 | 1995-02-23  |              0 |
|    1757283 | 380264.2700 | 1995-02-16  |              0 |
|    4011108 | 377256.1264 | 1995-03-10  |              0 |
|    5235941 | 373361.7169 | 1995-03-07  |              0 |
|     900581 | 367622.8600 | 1995-02-25  |              0 |
|    2729412 | 367211.1485 | 1995-03-10  |              0 |
+------------+-------------+-------------+----------------+

If you run EXPLAIN, you get the following:

+----+-------------+----------+------+--------------------+------+---------+------+---------+-----------------------------------------------------------------+
| id | select_type | table    | type | possible_keys      | key  | key_len | ref  | rows    | Extra                                                           |
+----+-------------+----------+------+--------------------+------+---------+------+---------+-----------------------------------------------------------------+
|  1 | SIMPLE      | customer | ALL  | PRIMARY            | NULL | NULL    | NULL |  150000 | Using where; Using temporary; Using filesort                    |
|  1 | SIMPLE      | orders   | ALL  | PRIMARY,ORDERS_FK1 | NULL | NULL    | NULL | 1488084 | Using where; Using join buffer (Hash Join Outer table orders)   |
|  1 | SIMPLE      | lineitem | ALL  | PRIMARY            | NULL | NULL    | NULL | 5642734 | Using where; Using join buffer (Hash Join Outer table lineitem) |
+----+-------------+----------+------+--------------------+------+---------+------+---------+-----------------------------------------------------------------+

10.I found that I needed to flush the cache in order to reproduce testing of new queries/benchmarking too. I did this using the following command.

mysql > reset query cache;

11.Testing Parallel Query was done following Jeff's blog post, which you can see when you enable this:

+----+-------------+----------+------+--------------------+------+---------+------+---------+--------------------------------------------------------------------------------------------------------------------------------+
| id | select_type | table    | type | possible_keys      | key  | key_len | ref  | rows    | Extra                                               |
+----+-------------+----------+------+--------------------+------+---------+------+---------+--------------------------------------------------------------------------------------------------------------------------------+
|  1 | SIMPLE      | customer | ALL  | PRIMARY            | NULL | NULL    | NULL |  150000 | Using where; Using temporary; Using filesort; Using parallel query (2 columns, 1 filters, 0 exprs; 0 extra)                    |
|  1 | SIMPLE      | orders   | ALL  | PRIMARY,ORDERS_FK1 | NULL | NULL    | NULL | 1488084 | Using where; Using join buffer (Hash Join Outer table orders); Using parallel query (4 columns, 1 filters, 1 exprs; 0 extra)   |
|  1 | SIMPLE      | lineitem | ALL  | PRIMARY            | NULL | NULL    | NULL | 5642734 | Using where; Using join buffer (Hash Join Outer table lineitem); Using parallelquery (4 columns, 1 filters, 1 exprs; 0 extra) |
+----+-------------+----------+------+--------------------+------+---------+------+---------+--------------------------------------------------------------------------------------------------------------------------------+