
Apache Hadoop Pseudo-Distributed Cluster on an Ubuntu Virtual Machine

Hi, here I am going to show you how to set up a pseudo-distributed (single-node) Hadoop cluster on an Ubuntu VM.

Prerequisites -

  1.  Basic understanding of Hadoop
  2.  VMware Player
  3.  Ubuntu

Things you need 

  1. VMware Player. A simple Google search will take you to the VMware website, which has info and a download link for the latest version of the player. Download and install it.
  2. An Ubuntu VM image; again, Google will help you here (the version I used is ubuntu-14.04).
  3. And of course a laptop :-).

Setting up the VM -

After installing VMware Player and extracting the Ubuntu image to a directory of your choice, double-click the VMware icon on your desktop, click on "Open a Virtual Machine", and go to the directory where you extracted Ubuntu. You will find a ubuntu.vmx file; double-click on it and then play the VM. (You can edit the settings of the VM later if you want to.)

Updating Ubuntu

Go to the terminal and give this command - ' $>sudo apt-get update '.


               Type 'password' when prompted to enter the password. (Note that for this VM image the user name is 'user' and the password is 'password'.)

Installing JDK

       The next thing you need is to install the Java Development Kit (JDK), as Hadoop requires a Java installation. For a Hadoop 1.x cluster the JDK 6 version works best; we are looking at the installation of Hadoop 1.2.0 here, as Hadoop 2.x needs some extra configuration (YARN settings etc.).
             Command is - '$>sudo apt-get install openjdk-6-jdk'


 Check the Java installation by giving the command - ' $>java -version '
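
 You can also confirm that the compiler was installed along with the runtime; it should report a matching 1.6.x OpenJDK version (the exact build string varies from system to system):

 ' $>javac -version '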

SSH Settings

 Hadoop needs password-less SSH access, as the jobtracker and namenode need to communicate frequently with the tasktrackers and datanodes. So let's install an OpenSSH server now.
         Command is - ' $>sudo apt-get install openssh-server '


  It might prompt you about disk space and ask whether you want to continue or not; just type 'y'.
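
  Once the installation finishes, you can confirm that the SSH daemon is up (on Ubuntu the service is simply named 'ssh'):

  Command is - ' $>sudo service ssh status '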

Download Hadoop

 Now download and extract Hadoop - either go to http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0
or type this command in the terminal
' $>wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz '

 Wait for a few minutes till it gets downloaded, then extract it.
Command is - ' $>tar -xvf hadoop-1.2.0.tar.gz '
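
If the extraction went fine, you should see the usual Hadoop 1.x layout (bin, conf, lib and so on) inside the new directory:

Command is - ' $>ls hadoop-1.2.0 '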

Configuring Hadoop

There are several files that control the configuration of a Hadoop installation, but at this point we only need the following four:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • hadoop-env.sh
     We will configure them one by one.

 core-site.xml 

 It contains configuration information for the default values of core Hadoop properties.
 Command- ' $>sudo gedit hadoop-1.2.0/conf/core-site.xml '

and just type this into the gedit window for core-site.xml:

<configuration>
  <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020</value>
  </property>
</configuration>

 hdfs-site.xml

This holds configuration settings for the HDFS daemons - namenode, secondary namenode and datanode. Since it is a single-node cluster, the replication factor is set to '1'.

Command - ' $>sudo gedit hadoop-1.2.0/conf/hdfs-site.xml '

Add the below properties inside the <configuration> block:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
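
For reference, with both properties added, the <configuration> block of hdfs-site.xml should look roughly like this:

<configuration>
  <property>
      <name>dfs.replication</name>
      <value>1</value>
  </property>
  <property>
      <name>dfs.permissions</name>
      <value>false</value>
  </property>
</configuration>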


mapred-site.xml

This holds configuration settings for the MapReduce daemons - jobtracker and tasktrackers.


Command- ' $>sudo gedit hadoop-1.2.0/conf/mapred-site.xml '

Add the below property inside the <configuration> block:

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
</property>


Getting the IP address of your machine 

Command- ' $>ifconfig '
Then note down the IP address written after 'inet addr:' in the output.

Now edit /etc/hosts; the command for it is as follows
Command - ' $>sudo gedit /etc/hosts '

In the hosts file, type the IP address you noted down a moment ago, followed by a space and then localhost. Save and close the window.
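
For example, if ifconfig reported 192.168.80.128 (your IP will almost certainly be different), the line you add would look like this:

192.168.80.128    localhost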

Creating the SSH key

Command: ssh-keygen -t rsa -P ""

Now append the public key to the authorized keys with the following command 

Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
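
At this point you can test the password-less login; the very first connection may ask you to confirm the host key, but it should not ask for a password (type 'exit' to come back out of the SSH session):

Command: ssh localhost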

Reboot the VM 

Configuring hadoop-env.sh

Command: sudo gedit hadoop-1.2.0/conf/hadoop-env.sh

Now set the JAVA_HOME path in the hadoop-env.sh file: un-comment the JAVA_HOME export line in it and set it to the path below.


export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
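
The exact directory name depends on your Ubuntu image - on a 64-bit VM it is usually java-6-openjdk-amd64 rather than java-6-openjdk-i386. List the installed JVMs and point JAVA_HOME at whichever directory is actually there:

Command: ls /usr/lib/jvm/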


Now change the directory to your Hadoop installation, e.g. ' $>cd hadoop-1.2.0 '

Format the namenode with the command 

bin/hadoop namenode -format

Now type the command 

bin/start-dfs.sh

This will start the HDFS daemons. After that, start the MapReduce daemons (jobtracker and tasktracker); the command is as follows 

bin/start-mapred.sh

That's it :-) If you want to check whether your Hadoop cluster has started correctly or not, just shoot the command ' $>jps '.
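
If all the daemons came up, jps should list the five Hadoop processes along with Jps itself - the process IDs on the left are just examples and will be different on your machine:

2896 NameNode
3012 DataNode
3290 SecondaryNameNode
3398 JobTracker
3511 TaskTracker
3620 Jps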

That's it for the single-node Hadoop 1 cluster setup on VMware Player. Next we will try to set up a multi-node Hadoop 2.6 cluster; watch out for the next post!
