With all that potential, I decided to give it a try and see how I could build a cluster of 4 machines.
I used 4 machines (1 CPU, 1 GB RAM, 10 GB hard drive each) running Ubuntu 9.04 Server:
hadoop01, hadoop02, hadoop03, and hadoop04
Install Java 6 on Ubuntu
#apt-get install sun-java6-jre
#apt-get install sun-java6-jdk
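To confirm the JDK is in place, you can check the version; it should report a Sun Java 1.6 build (the exact update number will vary):
$java -version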
Create group hadoop
Create user hadoop
root@hadoop01:~# addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
root@hadoop01:~# adduser --ingroup hadoop hadoop
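adduser will prompt for a password and some account details. A quick way to verify the account ended up in the right group (the uid/gid numbers may differ on your system):
root@hadoop01:~# id hadoop
uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)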
Configure ssh key for hadoop user
$ssh-keygen -t rsa -P ""
$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
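Then test that the hadoop user can ssh to the local machine without being asked for a password (the first connection will only ask you to accept the host key):
$ssh localhost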
Get the latest version of hadoop
http://apache.multihomed.net/hadoop/core/stable/hadoop-0.20.1.tar.gz
Install it under /usr/local/hadoop.
Make sure everything under /usr/local/hadoop is owned by hadoop:hadoop.
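A minimal sequence for the download and install, assuming you run the tar, mv and chown steps as root (the mirror URL is the one above; any Apache mirror works):
$wget http://apache.multihomed.net/hadoop/core/stable/hadoop-0.20.1.tar.gz
#tar xzf hadoop-0.20.1.tar.gz -C /usr/local
#mv /usr/local/hadoop-0.20.1 /usr/local/hadoop
#chown -R hadoop:hadoop /usr/local/hadoop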
Edit /usr/local/hadoop/conf/hadoop-env.sh and set:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Define the data partition for the node, for example /data.
Make sure everything under /data is owned by hadoop:hadoop.
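For example, assuming /data is the mount point you want to use:
#mkdir -p /data
#chown -R hadoop:hadoop /data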
Make sure /etc/hosts or DNS contains all the cluster node names
127.0.0.1 localhost localhost.localdomain
x.x.x.x servername
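For this 4-node cluster, /etc/hosts would look something like the following (the 192.168.1.x addresses are placeholders; use your real ones):
127.0.0.1 localhost localhost.localdomain
192.168.1.11 hadoop01
192.168.1.12 hadoop02
192.168.1.13 hadoop03
192.168.1.14 hadoop04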
Configuring Hadoop Master Node
Since Hadoop consists of two subsystems, HDFS and Map/Reduce, the master node runs both server processes.
For HDFS, the service is called namenode.
For Map/Reduce, the service is called jobtracker.
Slave nodes will run the following processes:
For HDFS, the service is called datanode.
For Map/Reduce, the service is called tasktracker.
Edit core-site.xml and mapred-site.xml. (These files are used by Hadoop 0.20; older versions used hadoop-site.xml.)
For core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop01:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map and reduce task.
    </description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>
For mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
</configuration>
These files assume that hadoop01 is the master node for both processes.
These files assume that /data will contain the data for HDFS.
These files will be the same on all nodes and should be owned by the hadoop user.
Configuring Master and Slaves
On the master node, you will find two files, masters and slaves, in the /usr/local/hadoop/conf directory.
These files must be modified on the master node.
The masters file contains the name of the master.
hadoop@hadoop01:/usr/local/hadoop/bin$ cat /usr/local/hadoop/conf/masters
hadoop01
The slaves file contains the list of slave servers.
hadoop@hadoop01:/usr/local/hadoop/bin$ cat /usr/local/hadoop/conf/slaves
hadoop01
hadoop02
hadoop03
hadoop04
Adding a node to the Hadoop cluster (Ubuntu)
New nodes must be installed basically the same way as the master node.
Install Java 6
#apt-get install sun-java6-jre
#apt-get install sun-java6-jdk
Create group hadoop
Create user hadoop
root@hadoop02:~# addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
root@hadoop02:~# adduser --ingroup hadoop hadoop
Have DNS or /etc/hosts set up for the cluster so any machine can be reached by name.
Copy the hadoop user's SSH keys (.ssh/*) from the master node so the master can log in without a password.
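The simplest way is to copy the whole .ssh directory from the master; this reuses the same key pair on every node, which is acceptable for a small test cluster (you will be asked for the hadoop password on the new node this one time):
hadoop@hadoop01:~$ scp -r .ssh hadoop03:~/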
Get the Hadoop package; you probably already have the tarball on another Hadoop datanode.
Untar the contents into /usr/local/hadoop.
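For example, assuming the hadoop-0.20.1.tar.gz tarball is still sitting in the hadoop home directory on the master:
hadoop@hadoop01:~$ scp hadoop-0.20.1.tar.gz hadoop03:~/
root@hadoop03:~# tar xzf /home/hadoop/hadoop-0.20.1.tar.gz -C /usr/local
root@hadoop03:~# mv /usr/local/hadoop-0.20.1 /usr/local/hadoop
root@hadoop03:~# chown -R hadoop:hadoop /usr/local/hadoop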
Define your data directory, for example /data.
Copy the configuration files from the master node:
hadoop@hadoop01:/usr/local/hadoop/conf$ scp hadoop-env.sh hadoop03:/usr/local/hadoop/conf
hadoop-env.sh 100% 2278 2.2KB/s 00:00
hadoop@hadoop01:/usr/local/hadoop/conf$ scp core-site.xml hadoop03:/usr/local/hadoop/conf
core-site.xml 100% 1321 1.3KB/s 00:00
hadoop@hadoop01:/usr/local/hadoop/conf$ scp mapred-site.xml hadoop03:/usr/local/hadoop/conf
mapred-site.xml 100% 455 0.4KB/s 00:00
hadoop@hadoop01:/usr/local/hadoop/conf$
Go to the master node and add the new node to the slaves file:
hadoop@hadoop01:/usr/local/hadoop/conf$ vi slaves
Starting the cluster
Make sure that /data is empty on all the nodes. Then, on the master, format the namenode:
$/usr/local/hadoop/bin/hadoop namenode -format
Start the HDFS cluster
$/usr/local/hadoop/bin/start-dfs.sh
Verify that the DataNode process has started on all the nodes by running jps.
Master:
hadoop@hadoop01:/usr/local/hadoop/bin$ jps
410 Jps
32161 SecondaryNameNode
32054 DataNode
31944 NameNode
Node:
hadoop@hadoop03:~$ jps
5832 Jps
5785 DataNode
Finally, start the Map/Reduce cluster.
$/usr/local/hadoop/bin/start-mapred.sh
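Running jps again should now also list JobTracker on the master and a TaskTracker on every node, in addition to the HDFS processes:
hadoop@hadoop01:/usr/local/hadoop/bin$ jps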
Note: to stop the cluster, first run $/usr/local/hadoop/bin/stop-mapred.sh and then $/usr/local/hadoop/bin/stop-dfs.sh
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are available by default at these locations (replace localhost with the appropriate node name, e.g. hadoop01 for the job tracker and namenode):
- http://localhost:50030/ - web UI for MapReduce job tracker(s)
- http://localhost:50060/ - web UI for task tracker(s)
- http://localhost:50070/ - web UI for HDFS name node(s)
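As a quick smoke test of the new cluster, you can copy a local text file into HDFS and run the wordcount example that ships with the release. The file name somefile.txt is just a placeholder for any text file you have handy:
$/usr/local/hadoop/bin/hadoop dfs -mkdir input
$/usr/local/hadoop/bin/hadoop dfs -put somefile.txt input
$/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-0.20.1-examples.jar wordcount input output
$/usr/local/hadoop/bin/hadoop dfs -cat output/part-*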
Finally, you can start to use the Hadoop cluster. However, this requires learning the Hadoop API or a higher-level language. If you are looking for something closer to SQL, try Hive.