Installing Hadoop on CentOS

Hadoop Installation Steps:
Prerequisite: JDK 1.6

1) Download and extract Hadoop Stable release

a) Go to: http://apache.techartifact.com/mirror/hadoop/common/stable/

Select hadoop-1.0.4.tar.gz and download it.

I have downloaded it to /root/hadoop.
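
If you prefer the command line, something like the following should fetch the tarball (wget is assumed to be installed; the URL is the mirror listed above):

# mkdir -p /root/hadoop
# cd /root/hadoop
# wget http://apache.techartifact.com/mirror/hadoop/common/stable/hadoop-1.0.4.tar.gz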

b) Extract it.

# tar -xvf hadoop-1.0.4.tar.gz
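
To confirm the archive unpacked correctly, list the new directory; you should see subdirectories such as bin and conf:

# ls /root/hadoop/hadoop-1.0.4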

2) Set Environment Variables

a) Set the JAVA_HOME environment variable to the Java installation directory,
or set the Java installation that Hadoop uses by editing conf/hadoop-env.sh
and specifying the JAVA_HOME variable there.

b) Set the HADOOP_INSTALL environment variable to point to the directory created after extracting Hadoop.

# cd /root
# vi .bash_profile

Now add the environment variables to the .bash_profile file:

#######################################################
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi

# User specific environment and startup programs
JAVA_HOME=/usr/java/jdk1.6.0_43
export JAVA_HOME

HADOOP_INSTALL=/root/hadoop/hadoop-1.0.4
export HADOOP_INSTALL

PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_INSTALL/bin

export PATH

#####################################################
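
Reload the profile in the current shell and check that the variables took effect (a quick sanity check, not part of the original write-up):

# source ~/.bash_profile
# echo $JAVA_HOME
# echo $HADOOP_INSTALL
# which hadoop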

3) Check the Hadoop version

[root@localhost ~]# hadoop version
Hadoop 1.0.4
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290
Compiled by hortonfo on Wed Oct 3 05:13:58 UTC 2012
From source with checksum fe2baea87c4c81a2c505767f3f9b71f4

Hadoop is installed; now we need to configure it.
Hadoop can run in 3 modes:
a) Standalone
b) Pseudo-distributed
c) Fully distributed


For standalone mode we are done; there is nothing more to do.

For pseudo-distributed mode, follow the steps below.

4) Set the following content in the files inside the conf folder:

# cd /root/hadoop/hadoop-1.0.4/conf

core-site.xml

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
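
If libxml2 is installed (it usually is on CentOS), xmllint can optionally confirm that the three files are still well-formed XML after editing:

# cd /root/hadoop/hadoop-1.0.4/conf
# xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml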

5) Install and configure ssh

a) Install ssh (Secure Shell):

# yum -y install openssh-server openssh-clients
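
Also make sure the ssh daemon is running and starts on boot (standard CentOS service commands; a common gotcha when ssh localhost fails later):

# service sshd start
# chkconfig sshd on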

b) Configure RSA keys. I am generating a key pair with an empty passphrase:

[root@localhost ~]# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
8a:f2:0a:36:21:ec:b0:cb:d7:7c:b6:0a:9d:6c:8a:e4 root@localhost.localdomain
The key's randomart image is:
(randomart image output omitted)

c) Add the public key to the set of authorized keys:

# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now you should be able to ssh to localhost without specifying a password.
Try:

# ssh localhost
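
If ssh still asks for a password, file permissions are the usual culprit, since sshd ignores keys that are too open. A typical fix (not part of the original steps):

# chmod 700 ~/.ssh
# chmod 600 ~/.ssh/authorized_keys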


6) Formatting the HDFS filesystem

[root@localhost ~]# hadoop namenode -format
13/04/02 06:10:32 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
13/04/02 06:10:34 INFO util.GSet: VM type = 64-bit
13/04/02 06:10:34 INFO util.GSet: 2% max memory = 19.33375 MB
13/04/02 06:10:34 INFO util.GSet: capacity = 2^21 = 2097152 entries
13/04/02 06:10:34 INFO util.GSet: recommended=2097152, actual=2097152
13/04/02 06:10:36 INFO namenode.FSNamesystem: fsOwner=root
13/04/02 06:10:37 INFO namenode.FSNamesystem: supergroup=supergroup
13/04/02 06:10:37 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/04/02 06:10:37 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/04/02 06:10:37 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/04/02 06:10:37 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/04/02 06:10:38 INFO common.Storage: Image file of size 110 saved in 0 seconds.
13/04/02 06:10:38 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
13/04/02 06:10:38 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

7) Starting and stopping the daemons

a) Start HDFS:

[root@localhost ~]# start-dfs.sh
starting namenode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out

If you get the error "JAVA_HOME is not set", that is, output like this:

starting namenode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: Error: JAVA_HOME is not set.
localhost: starting secondarynamenode, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out
localhost: Error: JAVA_HOME is not set.

Then follow these steps:

# cd /root/hadoop/hadoop-1.0.4/conf
# vi hadoop-env.sh

Set JAVA_HOME in hadoop-env.sh

Here is my hadoop-env.sh:

#######################################################
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr/java/jdk1.6.0_43

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
# export HADOOP_CLIENT_OPTS

# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"

# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1

# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids

# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER

# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
############################################################


b) Start MapReduce daemons

[root@localhost ~]# start-mapred.sh
starting jobtracker, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /root/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-root-tasktracker-localhost.localdomain.out

Between start-dfs.sh and start-mapred.sh, five daemons should now be running on your local
machine: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker.

c) Check whether the daemons started successfully.

Open the following URLs:

For the JobTracker: http://localhost:50030/
For the NameNode: http://localhost:50070/
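
On a console without a browser, curl (if installed) gives a quick check that the web UIs are up; an HTTP 200 means the daemon is serving:

# curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
# curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/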

OR

[root@localhost ~]# jps
5992 NameNode
6231 SecondaryNameNode
6344 JobTracker
6454 TaskTracker
6530 Jps
6106 DataNode
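
Before stopping anything, you can optionally run a quick smoke test with the examples jar that ships inside the extracted hadoop-1.0.4 directory (the pi estimator), and then list the HDFS root:

# hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.4.jar pi 2 10
# hadoop fs -ls /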


d) Stop daemons:

# stop-dfs.sh
# stop-mapred.sh
