This post walks through setting up a distributed Hadoop cluster from scratch and loading data from HDFS into GBase 8a, an MPP database developed in China by GBASE. The full pipeline covers environment prep, config tweaks, cluster verification, and the final load command.
1. Environment Setup
Hadoop User and SSH
Create a hadoop user on all nodes and configure passwordless SSH.
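The post doesn't list the exact commands; a minimal sketch, assuming the three hostnames defined later in /etc/hosts, would be:
# on every node, as root: create the hadoop user
useradd hadoop
passwd hadoop
# as the hadoop user on hadoopnode1: generate a key and push it to every node (including itself)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for host in hadoopnode1 hadoopnode2 hadoopnode3; do ssh-copy-id hadoop@$host; done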
Environment Variables
Add Java and Hadoop paths to ~/.bash_profile on every node:
export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333
export HADOOP_HOME=/data1/hadoop/hadoop-3.4.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
...
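After editing the profile, reload it and make sure both tools resolve; a quick sanity check, not from the original post:
source ~/.bash_profile
java -version
hadoop version
# both should report the versions installed above (JDK 1.8.0_333, Hadoop 3.4.2)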
/etc/hosts
All nodes — including the GBase 8a cluster — must be able to resolve hostnames:
192.168.28.201 hadoopnode1
192.168.28.202 hadoopnode2
192.168.28.203 hadoopnode3
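A quick resolution check from each machine (including the GBase nodes) is worth the few seconds; the loop below is an assumed helper, not part of the original post:
for host in hadoopnode1 hadoopnode2 hadoopnode3; do ping -c 1 $host; done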
2. Hadoop Configuration
All config files live under ${HADOOP_HOME}/etc/hadoop/.
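One detail the profile-based setup above may not cover: daemons launched over SSH by the start scripts usually don't source ~/.bash_profile, so it is common practice (assumed here, since the post doesn't show it) to also set JAVA_HOME in hadoop-env.sh:
# ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/hadoop/hadoop/jdk1.8.0_333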
core-site.xml — Default FS and temp directory
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopnode1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data1/hadoop/data/tmp</value>
</property>
hdfs-site.xml — Storage directories and replication
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data1/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data1/hadoop/data/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
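The local paths referenced in core-site.xml and hdfs-site.xml should exist and be writable by the hadoop user on every node before the first start; a sketch of that prep (assumed, not shown in the post):
# run as root on each node
mkdir -p /data1/hadoop/data/tmp /data1/hadoop/data/namenode /data1/hadoop/data/data
chown -R hadoop:hadoop /data1/hadoop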
mapred-site.xml — YARN as the execution framework
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
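The post doesn't show a yarn-site.xml, but for the ResourceManager and NodeManagers checked in section 3 to find each other, a minimal one along these lines is typically needed; treat it as an assumed addition rather than part of the original configs:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopnode1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>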
workers — DataNode list
hadoopnode1
hadoopnode2
hadoopnode3
Push the configs to all nodes with scp.
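For example, something along these lines (the exact command isn't shown in the post; the path matches the layout above):
for host in hadoopnode2 hadoopnode3; do scp -r ${HADOOP_HOME}/etc/hadoop/* hadoop@$host:${HADOOP_HOME}/etc/hadoop/; done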
3. Start and Verify the Cluster
- Format the NameNode:
hdfs namenode -format
- Start everything:
start-all.sh
- Check with jps: you should see NameNode, DataNode, ResourceManager, and NodeManagers across the three nodes.
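Two extra checks come in handy here (standard Hadoop tooling, not shown in the original post): the dfsadmin report should list three live DataNodes, and the web UIs should be reachable.
hdfs dfsadmin -report | grep "Live datanodes"
# NameNode UI: http://hadoopnode1:9870 | YARN ResourceManager UI: http://hadoopnode1:8088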
4. HDFS Smoke Test
hdfs dfs -mkdir -p /mytest
echo "1234567" > hdfs_put_test.txt
hdfs dfs -put /home/hadoop/hdfs_put_test.txt /mytest
hadoop fs -ls hdfs://hadoopnode1:9000/mytest/hdfs_put_test.txt
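Reading the file back is a cheap extra confirmation beyond the post's ls:
hdfs dfs -cat /mytest/hdfs_put_test.txt
# expected output: 1234567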
5. Loading Data into GBase 8a
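The post doesn't include the table definition. Since the test file holds a single line of digits, a one-column table is enough; the table name comes from the load command below, while the column name and type here are assumptions:
CREATE TABLE hdfs_load_test (c1 varchar(32));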
Once the table exists in your gbase database, a single LOAD DATA INFILE command pulls the data from HDFS:
LOAD DATA INFILE 'hdfs://hadoop:hadoop@hadoopnode1:9870/mytest/hdfs_put_test.txt'
INTO TABLE hdfs_load_test
DATA_FORMAT 3;
Result: 1 record loaded, confirming that the full path from HDFS to GBase 8a works end-to-end. This integration pattern is a good fit for batch ETL workflows on independently controlled, domestically built data platforms.