Friday, October 23, 2009

Hadoop 0.20 on Ubuntu 9.04


Hadoop is a combination of the Hadoop Distributed File System (HDFS) and a Java-based Map/Reduce framework. In other words, you can have a cluster of servers that share their hard drives as one big file system (HDFS) and then process that data using the Hadoop Map/Reduce APIs. This tool can sort and summarize data at a speed that commercial and open source databases cannot reach. Hadoop is an Apache project, was inspired by Google's MapReduce and GFS papers, and is used by companies such as Yahoo! and Facebook.


With all that potential, I decided to give it a try and see how I could build a cluster of 4 machines.


I used 4 machines (1 CPU, 1 GB RAM, 10 GB hard drive) running Ubuntu 9.04 Server:
hadoop01, hadoop02, hadoop03, and hadoop04


Install Java 6 on Ubuntu
#apt-get install sun-java6-jre
#apt-get install sun-java6-jdk
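To confirm the JVM is in place:
#java -version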

Create group hadoop
Create user hadoop
root@hadoop01:~# addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
root@hadoop01:~# adduser --ingroup hadoop hadoop

Configure an SSH key for the hadoop user (run as the hadoop user)
$ssh-keygen -t rsa -P ""
$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
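To check that the key works (and to add the host to known_hosts), try a passwordless login:
$ssh localhost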

Get the latest version of hadoop

http://apache.multihomed.net/hadoop/core/stable/hadoop-0.20.1.tar.gz

Install it under /usr/local/hadoop.
Make sure everything under /usr/local/hadoop is owned by hadoop:hadoop.
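A possible sequence for the download, unpack and ownership steps (run as root; the mirror URL and tarball name come from the link above, adjust them if your mirror differs):

#cd /usr/local
#wget http://apache.multihomed.net/hadoop/core/stable/hadoop-0.20.1.tar.gz
#tar xzf hadoop-0.20.1.tar.gz
#mv hadoop-0.20.1 hadoop
#chown -R hadoop:hadoop /usr/local/hadoop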
Edit /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Define the data partition for the node, for example /data.
Make sure everything under /data is owned by hadoop:hadoop.
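For example, as root (assuming /data is simply a directory rather than a separate partition):

#mkdir /data
#chown -R hadoop:hadoop /data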
Make sure /etc/hosts or DNS contains all the cluster node names

127.0.0.1 localhost localhost.localdomain
x.x.x.x   servername
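As a sketch, the hosts file on every machine could list the whole cluster (the 192.168.1.x addresses are placeholders for your own network):

127.0.0.1    localhost localhost.localdomain
192.168.1.11 hadoop01
192.168.1.12 hadoop02
192.168.1.13 hadoop03
192.168.1.14 hadoop04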

Configuring Hadoop Master Node 

Since Hadoop consists of two services, HDFS and Map/Reduce, the master node runs both server processes.
For HDFS, the service is called namenode.
For Map/Reduce, the service is called jobtracker.


Slave nodes will use the following processes:
For HDFS, the service will be the datanode.
For Map/Reduce, the service will be the tasktracker.


Edit core-site.xml and mapred-site.xml (these files are used by Hadoop 0.20; older versions used hadoop-site.xml).


For core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
     <name>hadoop.tmp.dir</name>
     <value>/data/tmp</value>
     <description>A base for other temporary directories.</description>
</property>
<property>
     <name>fs.default.name</name>
     <value>hdfs://hadoop01:54310</value>
     <description>The name of the default file system. A URI whose
     scheme and authority determine the FileSystem implementation. The
     uri's scheme determines the config property (fs.SCHEME.impl) naming
     the FileSystem implementation class. The uri's authority is used to
     determine the host, port, etc. for a filesystem.</description>
</property>
<property>
     <name>mapred.job.tracker</name>
     <value>hadoop01:54311</value>
     <description>The host and port that the MapReduce job tracker runs
     at. If "local", then jobs are run in-process as a single map and reduce task.
     </description>
</property>
<property>
     <name>dfs.replication</name>
     <value>3</value>
     <description>Default block replication.
     The actual number of replications can be specified when the file is created.
     The default is used if replication is not specified in create time.
     </description>
</property>
</configuration>



For mapred-site.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
     <name>mapred.job.tracker</name>
     <value>hadoop01:54311</value>
     <description>The host and port that the MapReduce job tracker runs
     at.  If "local", then jobs are run in-process as a single map
     and reduce task.
     </description>
</property>
</configuration>


These files assume that hadoop01 is the master node for both services.
They assume that /data will contain the data for HDFS.
They should be identical on all nodes and owned by the hadoop user.


Configuring Master and Slaves


On the master node, you will find two files, masters and slaves, in the /usr/local/hadoop/conf directory.
These files only need to be modified on the master node.

The masters file contains the name of the master (in Hadoop 0.20 this is the host that will also run the secondary namenode; here it is the master itself).

hadoop@hadoop01:/usr/local/hadoop/bin$ cat /usr/local/hadoop/conf/masters
hadoop01

The slaves file contains the list of slave servers.
hadoop@hadoop01:/usr/local/hadoop/bin$ cat /usr/local/hadoop/conf/slaves
hadoop01
hadoop02
hadoop03
hadoop04



Adding a node to the Hadoop Cluster (Ubuntu)
Each node in the cluster is set up essentially the same way as the master node.

Install Java 6
#apt-get install sun-java6-jre
#apt-get install sun-java6-jdk


Create group hadoop
Create user hadoop


root@hadoop02:~# addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
root@hadoop02:~# adduser --ingroup hadoop hadoop

Set up DNS or /etc/hosts for the cluster so any machine can be reached by name.
Copy the hadoop user's SSH keys (.ssh/*) from the master node so the master can log in to the new node without a password, as shown below.
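One way to do this, assuming password authentication is still enabled for the hadoop user on the new node:

hadoop@hadoop01:~$ scp -r $HOME/.ssh hadoop03: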

Install the Hadoop package.
You probably already have the tarball on another Hadoop node.
Untar it under /usr/local/hadoop and make sure it is owned by hadoop:hadoop.

Define your data directory, for example /data.
Copy the configuration files from the master node:

hadoop@hadoop01:/usr/local/hadoop/conf$ scp hadoop-env.sh hadoop03:/usr/local/hadoop/conf
hadoop-env.sh                                                                                                100% 2278     2.2KB/s   00:00
hadoop@hadoop01:/usr/local/hadoop/conf$ scp core-site.xml  hadoop03:/usr/local/hadoop/conf
core-site.xml                                                                                                100% 1321     1.3KB/s   00:00
hadoop@hadoop01:/usr/local/hadoop/conf$ scp mapred-site.xml   hadoop03:/usr/local/hadoop/conf
mapred-site.xml                                                                                              100%  455     0.4KB/s   00:00
hadoop@hadoop01:/usr/local/hadoop/conf$

Go to the master node and add the new node to the slaves file:


hadoop@hadoop01:/usr/local/hadoop/conf$ vi slaves
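To bring the new node online without restarting the whole cluster, one option is to start its daemons directly on that node (a sketch, run as the hadoop user):

hadoop@hadoop03:~$ /usr/local/hadoop/bin/hadoop-daemon.sh start datanode
hadoop@hadoop03:~$ /usr/local/hadoop/bin/hadoop-daemon.sh start tasktracker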




Starting the cluster

Make sure that /data is empty on all the nodes. Then, on the master, format the namenode:

$/usr/local/hadoop/bin/hadoop namenode -format

Start the HDFS cluster

$/usr/local/hadoop/bin/start-dfs.sh

Verify that the DataNode process has started on all the nodes by running jps.

Master:

hadoop@hadoop01:/usr/local/hadoop/bin$ jps
410 Jps
32161 SecondaryNameNode
32054 DataNode
31944 NameNode

Node:

hadoop@hadoop03:~$ jps
5832 Jps
5785 DataNode

Finally, start the Map/Reduce cluster:

$/usr/local/hadoop/bin/start-mapred.sh


Note: to stop the cluster, first run $/usr/local/hadoop/bin/stop-mapred.sh and then $/usr/local/hadoop/bin/stop-dfs.sh

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://hadoop01:50070/ - web UI of the HDFS namenode
http://hadoop01:50030/ - web UI of the Map/Reduce jobtracker
http://hadoop01:50060/ - web UI of the tasktracker (one per node)

Finally, you can start to use the Hadoop cluster. However, it requires learning the Hadoop API or some higher-level language. If you are looking for something more like SQL, try Hive.
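As a quick smoke test, you can run the wordcount example that ships with the distribution (a sketch, assuming the hadoop-0.20.1-examples.jar name from the 0.20.1 tarball; other releases name the jar differently):

$/usr/local/hadoop/bin/hadoop fs -mkdir input
$/usr/local/hadoop/bin/hadoop fs -put /usr/local/hadoop/conf/*.xml input
$/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-0.20.1-examples.jar wordcount input output
$/usr/local/hadoop/bin/hadoop fs -cat output/part-r-00000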
