
Apache Hadoop Pseudo-Distributed Cluster on an Ubuntu Virtual Machine

Hi, here I am going to show you how to set up a pseudo-distributed (single-node) Hadoop cluster on an Ubuntu VM.

Prerequisites -

  1.  Basic understanding of Hadoop
  2.  VMware Player
  3.  Ubuntu

Things you need 

  1. VMware Player. A simple Google search will take you to the VMware website, which has info and a download link for the latest version of the player. Download and install it.
  2. An Ubuntu VM image; again, Google will help you here (the version I used is ubuntu-14.04).
  3. And of course a laptop :-).

Setting up the VM -

After installing VMware Player and extracting the Ubuntu image to a directory of your choice, double-click the VMware icon on your desktop, click on "Open a Virtual Machine", and go to the directory where you extracted Ubuntu. You will find a ubuntu.vmx file; double-click on it and then play the VM. (You can edit the settings of the VM later if you want to.)

Updating Ubuntu

Go to the terminal and give this command - ' $>sudo apt-get update '.


               Type 'password' when prompted to enter the password. (Note that for this VM image the user name is 'user' and the password is 'password'.)

Installing JDK

       The next thing you need is to install the Java Development Kit (JDK), as Hadoop requires a Java installation. For a Hadoop 1.x cluster the JDK 6 version works best; we are looking at the installation of Hadoop 1.2.0 here, as Hadoop 2.x needs some extra configuration (YARN settings etc.).
             Command is - '$>sudo apt-get install openjdk-6-jdk'


 Check the Java installation by giving the command - ' $>java -version '
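
 You can also confirm that the compiler was installed along with the runtime; it should report a matching 1.6.x OpenJDK version (the exact build string varies from system to system):

 ' $>javac -version '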

SSH Settings

 Hadoop needs password-less SSH access, as the jobtracker and namenode need to communicate frequently with the tasktrackers and datanodes. So let's install an OpenSSH server now.
         Command is - ' $>sudo apt-get install openssh-server '


  It might prompt you about disk space and ask whether you want to continue or not; just type 'y'.
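
  Once the installation finishes, you can confirm that the SSH daemon is up (on Ubuntu the service is simply named 'ssh'):

  Command is - ' $>sudo service ssh status '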

Download Hadoop

 Now download and extract Hadoop - either go to http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0
or type this command in the terminal
' $>wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz '

 Wait for a few minutes till it gets downloaded, then extract it.
Command is - ' $>tar -xvf hadoop-1.2.0.tar.gz '
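
If the extraction went fine, you should see the usual Hadoop 1.x layout (bin, conf, lib and so on) inside the new directory:

Command is - ' $>ls hadoop-1.2.0 '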

Configuring Hadoop

There are several files that control the configuration of a Hadoop installation, but at this point we only need the following four:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • hadoop-env.sh
     We will configure them one by one.

 core-site.xml 

 It contains configuration information for the default values of core Hadoop properties.
 Command- ' $>sudo gedit hadoop-1.2.0/conf/core-site.xml '

and just type this into the gedit window for core-site.xml:

<configuration>
  <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020</value>
  </property>
</configuration>

 hdfs-site.xml

This holds configuration settings for the HDFS daemons - namenode, secondary namenode and datanode. Since it is a single-node cluster, the replication factor is set to '1'.

Command - ' $>sudo gedit hadoop-1.2.0/conf/hdfs-site.xml '

Add the below properties inside the <configuration> block:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
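
For reference, with both properties added, the <configuration> block of hdfs-site.xml should look roughly like this:

<configuration>
  <property>
      <name>dfs.replication</name>
      <value>1</value>
  </property>
  <property>
      <name>dfs.permissions</name>
      <value>false</value>
  </property>
</configuration>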


mapred-site.xml

This holds configuration settings for the MapReduce daemons - jobtracker and tasktrackers.


Command- ' $>sudo gedit hadoop-1.2.0/conf/mapred-site.xml '

Add the below property inside the <configuration> block:

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
</property>


Getting the IP address of your machine 

Command- ' $>ifconfig '
Then note down the IP address written after 'inet addr:' in the output.

Now edit /etc/hosts; the command for it is as follows
Command - ' $>sudo gedit /etc/hosts '

In the hosts file, type the IP address you noted down a moment ago, followed by a space and then localhost. Save and close the window.
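
For example, if ifconfig reported 192.168.80.128 (your IP will almost certainly be different), the line you add would look like this:

192.168.80.128    localhost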

Creating the SSH key

Command: ssh-keygen -t rsa -P ""

Now append the public key to the authorized keys with the following command 

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
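
At this point you can test the password-less login; the very first connection may ask you to confirm the host key, but it should not ask for a password (type 'exit' to come back out of the SSH session):

Command: ssh localhost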

Reboot the VM 

Configuring hadoop-env.sh

Command: sudo gedit hadoop-1.2.0/conf/hadoop-env.sh

Now set the JAVA_HOME path in the hadoop-env.sh file: un-comment the JAVA_HOME export line in it and set it to the path below.


export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
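
The exact directory name depends on your Ubuntu image - on a 64-bit VM it is usually java-6-openjdk-amd64 rather than java-6-openjdk-i386. List the installed JVMs and point JAVA_HOME at whichever directory is actually there:

Command: ls /usr/lib/jvm/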


Now change the directory to your Hadoop installation, e.g. ' $>cd hadoop-1.2.0 '

Format the namenode with the command 

bin/hadoop namenode -format

Now type the command 

bin/start-dfs.sh

This will start the HDFS daemons. After that, start the MapReduce daemons (jobtracker and tasktracker); the command is as follows 

bin/start-mapred.sh

That's it :-) If you want to check whether your Hadoop cluster has started correctly or not, just shoot the command ' $>jps '.
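
If all the daemons came up, jps should list the five Hadoop processes along with Jps itself - the process IDs on the left are just examples and will be different on your machine:

2896 NameNode
3012 DataNode
3290 SecondaryNameNode
3398 JobTracker
3511 TaskTracker
3620 Jps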

That's it for the single-node Hadoop 1 cluster setup on VMware Player. Next we will try to set up a multi-node Hadoop 2.6 cluster; watch out for the next post!
