This post walks through setting up a distributed Hadoop cluster from scratch and loading data from HDFS into GBase 8a, an MPP database developed in China by GBASE. The full pipeline covers environment prep, config tweaks, cluster verification, and the final load command.
1. Environment Setup
Hadoop User and SSH
Create a hadoop user on all nodes and configure passwordless SSH.
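The post doesn't list the exact commands; a minimal sketch, assuming the three hostnames defined later in /etc/hosts, would be:
# on every node, as root: create the hadoop user
useradd hadoop
passwd hadoop
# as the hadoop user on hadoopnode1: generate a key and push it to every node (including itself)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for host in hadoopnode1 hadoopnode2 hadoopnode3; do ssh-copy-id hadoop@$host; done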
Environment Variables
Add Java and Hadoop paths to ~/.bash_profile on every node:
export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333
export HADOOP_HOME=/data1/hadoop/hadoop-3.4.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
...
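After editing the profile, reload it and make sure both tools resolve; a quick sanity check, not from the original post:
source ~/.bash_profile
java -version
hadoop version
# both should report the versions installed above (JDK 1.8.0_333, Hadoop 3.4.2)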
/etc/hosts
All nodes — including the GBase 8a cluster — must be able to resolve hostnames:
192.168.28.201 hadoopnode1
192.168.28.202 hadoopnode2
192.168.28.203 hadoopnode3
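A quick resolution check from each machine (including the GBase nodes) is worth the few seconds; the loop below is an assumed helper, not part of the original post:
for host in hadoopnode1 hadoopnode2 hadoopnode3; do ping -c 1 $host; done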
2. Hadoop Configuration
All config files live under ${HADOOP_HOME}/etc/hadoop/.
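One detail the profile-based setup above may not cover: daemons launched over SSH by the start scripts usually don't source ~/.bash_profile, so it is common practice (assumed here, since the post doesn't show it) to also set JAVA_HOME in hadoop-env.sh:
# ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333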
core-site.xml — Default FS and temp directory
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopnode1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data1/hadoop/data/tmp</value>
</property>
hdfs-site.xml — Storage directories and replication
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data1/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data1/hadoop/data/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
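The local paths referenced in core-site.xml and hdfs-site.xml should exist and be writable by the hadoop user on every node before the first start; a sketch of that prep (assumed, not shown in the post):
# run as root on each node
mkdir -p /data1/hadoop/data/tmp /data1/hadoop/data/namenode /data1/hadoop/data/data
chown -R hadoop:hadoop /data1/hadoop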
mapred-site.xml — YARN as the execution framework
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
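The post doesn't show a yarn-site.xml, but for the ResourceManager and NodeManagers checked in section 3 to find each other, a minimal one along these lines is typically needed; treat it as an assumed addition rather than part of the original configs:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopnode1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>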
workers — DataNode list
hadoopnode1
hadoopnode2
hadoopnode3
Push the configs to all nodes with scp.
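For example, something along these lines (the exact command isn't shown in the post; the path matches the layout above):
for host in hadoopnode2 hadoopnode3; do scp -r ${HADOOP_HOME}/etc/hadoop/* hadoop@$host:${HADOOP_HOME}/etc/hadoop/; done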
3. Start and Verify the Cluster
- Format the NameNode:
hdfs namenode -format
- Start everything:
start-all.sh
- Check with jps: you should see NameNode, DataNode, ResourceManager, and NodeManagers across the three nodes.
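Two extra checks come in handy here (standard Hadoop tooling, not shown in the original post): the dfsadmin report should list three live DataNodes, and the web UIs should be reachable.
hdfs dfsadmin -report | grep "Live datanodes"
# NameNode UI: http://hadoopnode1:9870 | YARN ResourceManager UI: http://hadoopnode1:8088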
4. HDFS Smoke Test
hdfs dfs -mkdir -p /mytest
echo "1234567" > hdfs_put_test.txt
hdfs dfs -put /home/hadoop/hdfs_put_test.txt /mytest
hadoop fs -ls hdfs://hadoopnode1:9000/mytest/hdfs_put_test.txt
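Reading the file back is a cheap extra confirmation beyond the post's ls:
hdfs dfs -cat /mytest/hdfs_put_test.txt
# expected output: 1234567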
5. Loading Data into GBase 8a
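The post doesn't include the table definition. Since the test file holds a single line of digits, a one-column table is enough; the table name comes from the load command below, while the column name and type here are assumptions:
CREATE TABLE hdfs_load_test (c1 varchar(32));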
Once the table exists in your gbase database, a single LOAD DATA INFILE command pulls the data from HDFS:
LOAD DATA INFILE 'hdfs://hadoop:hadoop@hadoopnode1:9870/mytest/hdfs_put_test.txt'
INTO TABLE hdfs_load_test
DATA_FORMAT 3;
Result: 1 record loaded, confirming that the full path from HDFS to GBase 8a works end-to-end. This integration pattern is a good fit for batch ETL workflows on independently controlled, domestically built data platforms.