
Daemonxiao for TiDB Cloud Ecosystem


A simple way to import TiSpark into Databricks to load TiDB data

TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

TiSpark is a thin layer built for running Apache Spark on top of TiDB/TiKV to answer complex OLAP queries. It combines the strengths of the Spark platform and the distributed TiKV cluster, and it integrates seamlessly with TiDB, the distributed OLTP database, to provide a one-stop Hybrid Transactional/Analytical Processing (HTAP) solution for both online transactions and analysis.

Databricks is a cloud-based collaborative data science, data engineering, and data analytics platform that combines the best of data warehouses and data lakes into a lakehouse architecture.

With Databricks' flexible and powerful extension capabilities, you can install TiSpark in Databricks and benefit from TiSpark's strengths, such as faster reads and writes and transaction support. This article shows how to use TiSpark in Databricks to handle TiDB data.

Step 1: Deploy a TiDB cluster in your own space

TiDB provides a tool named TiUP that quickly builds a test cluster on a single machine.

Note:
Your machine must have a public IP so that Databricks can access it. This article also uses a single-instance TiDB cluster, so you don't need to configure hosts for TiDB, PD, and TiKV.

  1. Install TiUP.

    curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
    
  2. Reload your shell profile to declare the global environment variables.

    source ${your_shell_profile}
    
  3. Start the TiDB cluster.

    tiup playground --host 0.0.0.0
    

    After the TiDB cluster is deployed, you will see output like this:

    CLUSTER START SUCCESSFULLY, Enjoy it ^-^
    To connect TiDB: mysql --comments --host 127.0.0.1 --port 4001 -u root -p (no password)
    To connect TiDB: mysql --comments --host 127.0.0.1 --port 4000 -u root -p (no password)
    To view the dashboard: http://127.0.0.1:2379/dashboard
    PD client endpoints: [127.0.0.1:2379 127.0.0.1:2382 127.0.0.1:2384]
    To view Prometheus: http://127.0.0.1:9090
    To view Grafana: http://127.0.0.1:3000
    
  4. Query the PD address via the MySQL client and record the PD address shown in the INSTANCE column. This address, usually an intranet IP, is used for communication between internal instances.

    mysql -u root -P 4000 -h 127.0.0.1
    mysql> select * from INFORMATION_SCHEMA.CLUSTER_INFO;
    +---------+-----------------+-----------------+-------------+------------------------------------------+---------------------------+-----------------+-----------+
    | TYPE    | INSTANCE        | STATUS_ADDRESS  | VERSION     | GIT_HASH                                 | START_TIME                | UPTIME          | SERVER_ID |
    +---------+-----------------+-----------------+-------------+------------------------------------------+---------------------------+-----------------+-----------+
    ...
    | pd      | 172.*.*.*:2379  | 172.*.*.*:2379  | 6.1.0       | d82f4fab6cf37cd1eca9c3574984e12a7ae27c42 | 2022-07-13T13:25:54+08:00 | 2h31m44.004814s |         0 |
    ...
    +---------+-----------------+-----------------+-------------+------------------------------------------+---------------------------+-----------------+-----------+
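    For illustration, the INSTANCE value has the form host:port; the host part is the intranet IP you will reuse later if host mapping is needed in Step 3. A minimal Python sketch with a made-up address:

    ```python
    # Sketch: the INSTANCE column holds "host:port"; the host part is the
    # intranet IP used later for host mapping. The address below is an
    # example value only -- substitute what your cluster reports.
    pd_instance = "172.16.0.10:2379"        # value read from the INSTANCE column
    pd_host, pd_port = pd_instance.rsplit(":", 1)
    print(pd_host)   # intranet IP
    print(pd_port)   # PD client port
    ```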
    
  5. Import sample data.

    tiup bench tpcc --warehouses 1 prepare
    
  6. Check the import result.

    mysql -u root -P 4000 -h 127.0.0.1
    mysql> use test;
    mysql> show tables;
    +----------------+
    | Tables_in_test |
    +----------------+
    | customer       |
    | district       |
    | history        |
    | item           |
    | new_order      |
    | order_line     |
    | orders         |
    | stock          |
    | warehouse      |
    +----------------+
    9 rows in set (0.00 sec)
    

Step 2: Install TiSpark in Databricks

Prerequisites:
A Databricks account, used to log in to the Databricks workspace. If you don't have one, you can sign up for a free trial of Databricks.

Databricks supports custom init scripts: shell scripts that run during the startup of each cluster node, before the Apache Spark driver or worker JVM starts. Here we use an init script to install TiSpark.

  1. Log in to Databricks and open your workspace.

  2. Create a new Python notebook.

  3. Copy the following script into the notebook.

    dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
    dbutils.fs.put(
    "/databricks/scripts/tispark-install.sh",
    """
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/tispark-assembly-3.2_2.12-3.1.0.jar https://github.com/pingcap/tispark/releases/download/v3.1.0/tispark-assembly-3.2_2.12-3.1.0.jar
    """,
    True)
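    To make the naming convention explicit, here is a small Python sketch (not part of the notebook) showing how the jar name, download URL, and install path in the script above are composed. The version numbers match the example above; adjust them to your Spark, Scala, and TiSpark versions:

    ```python
    # Sketch: compose the TiSpark jar name, release download URL, and the
    # path where Databricks picks up extra jars. Versions are examples.
    TISPARK_VERSION = "3.1.0"
    SPARK_MAJOR = "3.2"
    SCALA_VERSION = "2.12"

    jar_name = f"tispark-assembly-{SPARK_MAJOR}_{SCALA_VERSION}-{TISPARK_VERSION}.jar"
    download_url = (
        "https://github.com/pingcap/tispark/releases/download/"
        f"v{TISPARK_VERSION}/{jar_name}"
    )
    install_path = f"/mnt/driver-daemon/jars/{jar_name}"
    print(download_url)
    print(install_path)
    ```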
    
  4. Attach the notebook to a Spark cluster and run the cell. The script used to install TiSpark is then stored in DBFS.

  5. Click Compute on the sidebar.

  6. Choose a cluster that you want to run with TiSpark.

  7. Click Edit to configure the cluster.

  8. In Configuration panel, set Databricks Runtime Version to "10.4 LTS".

  9. In Advanced options, add dbfs:/databricks/scripts/tispark-install.sh to Init Scripts.

    Advanced options

Step 3: Set TiSpark Configurations in Spark Config

After setting the init script, you need to add some TiSpark configurations in the Spark config.

  1. Add the following configuration. {pd address} is the PD address we recorded in Step 1. For more information about TiSpark configuration, see the TiSpark Configurations list.

    spark.sql.extensions org.apache.spark.sql.TiExtensions
    spark.tispark.pd.addresses {pd address}
    spark.sql.catalog.tidb_catalog org.apache.spark.sql.catalyst.catalog.TiCatalog
    spark.sql.catalog.tidb_catalog.pd.addresses {pd address}
    
  2. (Optional) If the intranet IP in {pd address} is different from the public IP of the machine, you need to add one more configuration that maps the intranet IP to the public IP.

    spark.tispark.tikv.host_mapping {intranet IP}:{Public IP}
    

    Advanced options

  3. Click Confirm and restart the cluster to apply the configuration.
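    As a summary of this step, the configuration can be sketched as plain key/value pairs in Python; pd_address and public_ip below are made-up placeholder values:

    ```python
    # Sketch: the Spark conf entries from this step as key/value pairs.
    # pd_address is the intranet PD address recorded in Step 1; public_ip
    # is the machine's public IP. Both are placeholder example values.
    pd_address = "172.16.0.10:2379"
    public_ip = "203.0.113.5"

    spark_conf = {
        "spark.sql.extensions": "org.apache.spark.sql.TiExtensions",
        "spark.tispark.pd.addresses": pd_address,
        "spark.sql.catalog.tidb_catalog": "org.apache.spark.sql.catalyst.catalog.TiCatalog",
        "spark.sql.catalog.tidb_catalog.pd.addresses": pd_address,
        # Optional: only needed when the intranet IP differs from the public IP.
        "spark.tispark.tikv.host_mapping": f"{pd_address.rsplit(':', 1)[0]}:{public_ip}",
    }
    for key, value in spark_conf.items():
        print(key, value)
    ```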

Step 4: Handle your TiDB data in Databricks with TiSpark

After the cluster with TiSpark has been started, you can create a new notebook, attach it to this cluster, and start operating TiDB data with TiSpark in Databricks directly.

  1. Create a Scala notebook and attach it to the Spark cluster with TiSpark.

  2. Use tidb_catalog to enable TiSpark in SparkSession.

    spark.sql("use tidb_catalog")
    
  3. Use SELECT SQL to read TiDB data.

    spark.sql("select * from test.stock limit 10").show
    

    query result

  4. Use Spark DataSource API to write TiDB data.

    a. Because TiSpark doesn't support DDL, you need to create the table in TiDB before writing from Databricks. Here, use the MySQL client to create a best_stock table in your own environment.

    mysql -uroot -P4000 -h127.0.0.1
    mysql> use test;
    mysql> create table best_stock (s_i_id int(11), s_quantity int(11));
    

    b. Set TiDB options, such as address, password, port and so on.

    Descriptions of fields
    tidb.addr: The TiDB address. It is the same as the PD IP we recorded in Step 1.
    tidb.password: The password of the TiDB user.
    tidb.port: The port of TiDB.
    tidb.user: The user used to connect to the TiDB cluster.

    val tidbOptions: Map[String, String] = Map(
      "tidb.addr" -> "{tidb address}",
      "tidb.password" -> "",
      "tidb.port" -> "4000",
      "tidb.user" -> "root")
    

    c. Select data and write back to TiDB with the specified option.

    Descriptions of fields
    format: Specify "tidb" for TiSpark.
    database: The destination database of the write.
    table: The destination table of the write.
    mode: The data source write mode.

    val DF = spark.sql("select s_i_id, s_quantity from test.stock where s_quantity > 99")
    DF.write
      .format("tidb")
      .option("database", "test")
      .option("table", "best_stock")
      .options(tidbOptions)
      .mode("append")
      .save()
    

    insert

    d. Check the written data with SELECT SQL.

    spark.sql("select * from test.best_stock limit 10").show

    check result

  5. Use DELETE SQL to delete TiDB data and check with SELECT SQL.

    spark.sql("delete from test.best_stock where s_quantity > 99")
    spark.sql("select * from test.best_stock").show
    

    check result

Conclusion

In this article, we used TiSpark in Databricks to access TiDB data. The key steps are:

  1. Install TiSpark in Databricks via init scripts.

  2. Set TiSpark Configurations in Databricks.

If you are interested in TiSpark, you can find more information on the TiSpark homepage. Feel free to share ideas or submit a PR on the TiSpark GitHub repository.
