DEV Community

Michael

Posted on • Originally published at gbase.cn

End-to-End: Hadoop Deployment and Data Loading into GBase 8a

This post walks through setting up a distributed Hadoop cluster from scratch and loading data from HDFS into GBase 8a, the MPP database developed independently in China by GBASE. The full pipeline covers environment preparation, configuration, cluster verification, and the final load command.

1. Environment Setup

Hadoop User and SSH

Create a hadoop user on all nodes and configure passwordless SSH.
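These steps can be sketched as follows. The hostnames are the three nodes from this walkthrough, and the commands assume the hadoop user already exists on each node; this is a minimal outline, not the only way to distribute keys:

```shell
# As the hadoop user on hadoopnode1: generate a key pair (no passphrase)
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q

# Push the public key to every node, including the local one
for node in hadoopnode1 hadoopnode2 hadoopnode3; do
  ssh-copy-id "hadoop@${node}"
done

# Verify: should print the remote hostname without a password prompt
ssh hadoop@hadoopnode2 hostname
```

Repeat the key generation on any other node that needs to initiate SSH; for start-all.sh, the NameNode host is usually enough.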

Environment Variables

Add Java and Hadoop paths to ~/.bash_profile on every node:

export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333
export HADOOP_HOME=/data1/hadoop/hadoop-3.4.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
...
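A quick sanity check after editing the profile. The exports mirror the example lines above (adjust to your actual install locations); `hadoop version` will only resolve once the tarball is unpacked at `$HADOOP_HOME`:

```shell
# Re-apply the profile settings and confirm the new directories landed on PATH
export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333
export HADOOP_HOME=/data1/hadoop/hadoop-3.4.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Each of the three appended entries contains "hadoop", so this prints >= 3
echo "$PATH" | tr ':' '\n' | grep -c hadoop
```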

/etc/hosts

All nodes — including the GBase 8a cluster — must be able to resolve hostnames:

192.168.28.201 hadoopnode1
192.168.28.202 hadoopnode2
192.168.28.203 hadoopnode3

2. Hadoop Configuration

All config files live under ${HADOOP_HOME}/etc/hadoop/.

core-site.xml — Default FS and temp directory

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoopnode1:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>file:/data1/hadoop/data/tmp</value>
</property>

hdfs-site.xml — Storage directories and replication

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/data1/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/data1/hadoop/data/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

mapred-site.xml — Yarn as the execution framework

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

workers — DataNode list

hadoopnode1
hadoopnode2
hadoopnode3

Push the configs to all nodes with scp.
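A sketch of that push, assuming passwordless SSH from section 1 and an identical ${HADOOP_HOME} layout on every node:

```shell
# Distribute the edited config files from hadoopnode1 to the other nodes
for node in hadoopnode2 hadoopnode3; do
  scp "${HADOOP_HOME}/etc/hadoop/"*.xml "${HADOOP_HOME}/etc/hadoop/workers" \
      "hadoop@${node}:${HADOOP_HOME}/etc/hadoop/"
done
```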

3. Start and Verify the Cluster

  1. Format the NameNode: hdfs namenode -format
  2. Start everything: start-all.sh
  3. Check with jps — you should see NameNode, DataNode, ResourceManager, and NodeManagers across the three nodes.
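The steps above can be double-checked with standard Hadoop tooling:

```shell
jps                    # per node: NameNode/DataNode/ResourceManager/NodeManager
hdfs dfsadmin -report  # cluster-wide: should report 3 live datanodes
yarn node -list        # should list 3 NodeManagers in RUNNING state
```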

4. HDFS Smoke Test

hdfs dfs -mkdir -p /mytest
echo "1234567" > /home/hadoop/hdfs_put_test.txt
hdfs dfs -put /home/hadoop/hdfs_put_test.txt /mytest
hadoop fs -ls hdfs://hadoopnode1:9000/mytest/hdfs_put_test.txt

5. Loading Data into GBase 8a

Once the table exists in your gbase database, a single LOAD DATA INFILE command pulls the data from HDFS:
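If you still need to create the table, a hypothetical schema matching the one-line test file could be as simple as the following (the column name and type are assumptions; only the table name hdfs_load_test comes from the load command):

```sql
CREATE TABLE hdfs_load_test (c1 varchar(32));
```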

LOAD DATA INFILE 'hdfs://hadoop:hadoop@hadoopnode1:9870/mytest/hdfs_put_test.txt'
INTO TABLE hdfs_load_test
DATA_FORMAT 3;

Result: Loaded 1 records, confirming that the full path from HDFS to GBase 8a works end-to-end. This integration pattern is a good fit for batch ETL workflows on independently controlled data platforms.
