- Ubuntu 14.04 LTS x 3 (1 Master, 2 Slaves)
- Hadoop 2.6.5
- Java 1.8.0_141
Hadoop consists of one master node and several slave nodes.
Hadoop can be divided into HDFS (Hadoop Distributed File System), which stores and manages files, and YARN, which executes MapReduce jobs. In practice, the two start and stop independently.
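Although start-all.sh (used later in this article) launches both at once, Hadoop ships separate scripts in sbin for each stack:
$ start-dfs.sh # HDFS: NameNode, SecondaryNameNode, DataNodes
$ start-yarn.sh # YARN: ResourceManager, NodeManagers
$ stop-yarn.sh
$ stop-dfs.sh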
In this article, I will cover everything from the basics to actual use, including SSH key registration, Java installation, Hadoop installation and configuration, and running and testing Hadoop.
$ sudo apt-get update
If the update fails because of DNS resolution, change the nameserver IP:
$ sudo vi /etc/resolv.conf
nameserver 8.8.8.8
$ sudo apt-get update
When connecting from one node to another, you need to create and register an SSH key so that you can log in without a password.
When Hadoop starts, it logs in to every node over SSH and launches the daemons there. Since it cannot type a password at that point, each node's SSH key must be registered in advance.
$ ssh-keygen -t rsa
$ cd ~/.ssh
$ ls -al
drwx------ 2 ubuntu ubuntu 4096 Jul 20 07:25 ./
drwxr-xr-x 8 ubuntu ubuntu 4096 Jul 20 08:40 ../
-rw-r--r-- 1 ubuntu ubuntu 1608 Jul 20 05:20 authorized_keys
-rw------- 1 ubuntu ubuntu 1679 Jul 20 05:16 id_rsa
-rw-r--r-- 1 ubuntu ubuntu 402 Jul 20 05:19 id_rsa.pub
-rw-r--r-- 1 ubuntu ubuntu 1554 Jul 20 07:33 known_hosts
If you do not have a known_hosts file, create an empty one:
$ touch known_hosts
$ chmod 700 ./
$ chmod 755 ../
$ chmod 600 id_rsa
$ chmod 644 id_rsa.pub
$ chmod 644 authorized_keys
$ chmod 644 known_hosts
Copy the contents of each node's id_rsa.pub into the authorized_keys file of every other node.
In particular, the master node's id_rsa.pub must go into the authorized_keys file of every slave node.
Then verify that each node can connect to the others through ssh.
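A minimal sketch of registering and testing the key, assuming the ubuntu user and the slave hostnames used later in this article:
$ cat ~/.ssh/id_rsa.pub | ssh ubuntu@yechan-slave1 'cat >> ~/.ssh/authorized_keys'
$ cat ~/.ssh/id_rsa.pub | ssh ubuntu@yechan-slave2 'cat >> ~/.ssh/authorized_keys'
$ ssh yechan-slave1
(should now log in without a password prompt)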
$ sudo apt-get install default-jdk
$ sudo vi /etc/environment
Add JAVA_HOME="/usr/lib/jvm/default-java"
$ source /etc/environment
$ java -version
java version "1.7.0_131"
OpenJDK Runtime Environment (IcedTea 2.6.9) (7u131-2.6.9-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
$ wget http://apache.mesi.com.ar/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
$ tar -xvzf hadoop-2.6.5.tar.gz
$ mv hadoop-2.6.5 hadoop
$ vi .bashrc
############################################################
### Hadoop
############################################################
export HADOOP_HOME=$HOME/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
export CLASSPATH=$HADOOP_HOME/share/hadoop/common/hadoop-common-2.6.5.jar:$HADOOP_HOME/share/hadoop/mapreduce/:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar
$ source .bashrc
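A quick check that the Hadoop binaries are now on the PATH:
$ hadoop version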
In your home directory, create the directories HDFS will use for NameNode and DataNode data:
$ mkdir -p hadoop_tmp/hdfs/namenode
$ mkdir -p hadoop_tmp/hdfs/datanode
Repeat this setup on all the other nodes.
$ cd ~/hadoop/etc/hadoop
You only need to modify the following files in that directory:
hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves, masters
- hadoop-env.sh
$ vi hadoop-env.sh
export JAVA_HOME=${JAVA_HOME}
export HADOOP_HOME=/home/ubuntu/hadoop
export HADOOP_PREFIX=${HADOOP_HOME}
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_PID_DIR=$HADOOP_LOG_DIR
export HADOOP_ROOT_LOGGER=INFO,console
export HADOOP_SECURITY_LOGGER=INFO,NullAppender
export HDFS_AUDIT_LOGGER=INFO,NullAppender
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
- core-site.xml
$ vi core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://yechan-master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/hadoop_tmp</value>
  </property>
</configuration>
- hdfs-site.xml
$ vi hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/ubuntu/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/ubuntu/hadoop_tmp/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
    <value>true</value>
  </property>
</configuration>
- mapred-site.xml
mapred-site.xml does not exist by default; create it from the bundled template first:
$ cp mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.hosts.exclude.filename</name>
    <value>/home/ubuntu/hadoop/etc/hadoop/exclude</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.hosts.filename</name>
    <value>/home/ubuntu/hadoop/etc/hadoop/include</value>
  </property>
</configuration>
- yarn-site.xml
$ vi yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/hadoop-2.6.5/data/yarn/nm-local-dir</value>
  </property>
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>/home/hadoop-2.6.5/data/yarn/system/rmstore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>yechan-master</value>
  </property>
</configuration>
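The yarn.nodemanager.local-dirs path above must exist and be writable by the user running YARN; a minimal sketch, assuming the ubuntu user from earlier:
$ sudo mkdir -p /home/hadoop-2.6.5/data/yarn/nm-local-dir
$ sudo chown -R ubuntu:ubuntu /home/hadoop-2.6.5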
- slaves
$ vi slaves
yechan-master
yechan-slave1
yechan-slave2
- masters
$ vi masters
yechan-master
Copy the same configuration files to all the other nodes.
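A sketch of pushing the configuration to the slaves with scp, assuming the ubuntu user and the hostnames above:
$ scp ~/hadoop/etc/hadoop/* ubuntu@yechan-slave1:~/hadoop/etc/hadoop/
$ scp ~/hadoop/etc/hadoop/* ubuntu@yechan-slave2:~/hadoop/etc/hadoop/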
- format namenode
On the master node:
$ hadoop namenode -format
- execution (hdfs and yarn)
$ start-all.sh
After a moment, running the jps command shows the following on the master node:
$ jps
25240 DataNode
25443 SecondaryNameNode
25069 NameNode
25592 ResourceManager
29776 Jps
25739 NodeManager
And on each slave (DataNode):
$ jps
11799 DataNode
12617 Jps
11947 NodeManager
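You can also check the cluster through the web UIs that Hadoop 2.x serves by default:
http://yechan-master:50070 (NameNode UI)
http://yechan-master:8088 (ResourceManager UI)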
- End
To shut down Hadoop:
$ stop-all.sh
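- Test (wordcount)
First prepare a sample input file; a minimal sketch, assuming the word.txt path used below:
$ echo "hello hadoop hello yarn hello hdfs" > /home/ubuntu/word.txt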
$ hdfs dfs -mkdir /input
$ hdfs dfs -ls -R /
$ hdfs dfs -put /home/ubuntu/word.txt /input
$ hadoop jar /home/ubuntu/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar wordcount /input/word.txt output
$ hdfs dfs -cat /user/ubuntu/output/part-r-00000
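Because the output path was given as a relative path, the results land under the user's HDFS home directory (/user/ubuntu/output), containing a _SUCCESS marker and one part-r-00000 file per reducer:
$ hdfs dfs -ls /user/ubuntu/output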