Hadoop Self-study Diary --- 2. Hadoop Cluster Environment Construction

Setting up the Environment

I use an ordinary Windows 7 laptop and VirtualBox to create CentOS virtual machines for installing Hadoop.

VirtualBox: 6.0.8 r130520 (Qt5.6.2)
CentOS: CentOS Linux release 7.6.1810 (Core)
jdk: 1.8.0_202
hadoop: 2.6.5
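
To double-check that the template machine carries these versions (assuming java and hadoop are already on the PATH, as set up in part 1), you can run:

[root@ ~]# java -version
[root@ ~]# hadoop version
[root@ ~]# cat /etc/centos-release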

Cluster planning

  • One master node, master, serves as the NameNode in HDFS and the ResourceManager in YARN.
  • Three data nodes, data1, data2 and data3, serve as DataNodes in HDFS and NodeManagers in YARN.

Name     IP               HDFS        YARN
master   192.168.37.200   NameNode    ResourceManager
data1    192.168.37.201   DataNode    NodeManager
data2    192.168.37.202   DataNode    NodeManager
data3    192.168.37.203   DataNode    NodeManager

Create data nodes

Clone the virtual machine created previously, using it as a template, to create data1. Be careful to regenerate the MAC addresses of all network adapters during cloning.
See: Hadoop Self-study Diary --- 1. Single Hadoop Environment Construction
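
If you prefer the command line to the VirtualBox GUI, the clone can also be made with VBoxManage on the Windows host. A minimal sketch, assuming the template machine is named CentOS-single (substitute your own VM name); a full clone regenerates MAC addresses by default unless --options keepallmacs is passed:

VBoxManage clonevm "CentOS-single" --name data1 --register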

After cloning completes, start the data1 virtual machine and modify its configuration.

1.core-site.xml

Modify the global configuration core-site.xml, changing localhost to master:

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/core-site.xml

<configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://master:9000</value>
</property>
</configuration>
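
After saving, a quick sanity check is to ask Hadoop for the effective value, which should print hdfs://master:9000 (fs.default.name is the deprecated alias of fs.defaultFS, so either key reflects the same setting):

[root@ ~]# hdfs getconf -confKey fs.defaultFS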

2.yarn-site.xml

Modify the YARN configuration yarn-site.xml, adding the connection addresses used by the NodeManagers, the ApplicationMasters, and clients respectively.
The detailed structure is as follows:

Reference resources: Apache Hadoop YARN

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/yarn-site.xml

<configuration>
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>master:8025</value>
</property>
<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>master:8030</value>
</property>
<property>
   <name>yarn.resourcemanager.address</name>
   <value>master:8050</value>
</property>
</configuration>
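
These three addresses are used by the NodeManagers (resource tracker, 8025), by ApplicationMasters (scheduler, 8030), and by clients submitting jobs (8050). Once the ResourceManager is running on master, one way to confirm it is listening on these ports is:

[root@master ~]# ss -tlnp | grep -E '8025|8030|8050'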

3.mapred-site.xml

Modify the job-tracking configuration mapred-site.xml, changing the JobTracker link address to master:

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/mapred-site.xml

<configuration>
<property>
   <name>mapred.job.tracker</name>
   <value>master:54311</value>
</property>
</configuration>

4.hdfs-site.xml

Modify the HDFS configuration hdfs-site.xml to remove the NameNode configuration information:

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/hdfs-site.xml

<configuration>
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/software/hadoop-2.6.5/hadoop_data/hdfs/datanode</value>
</property>
</configuration>

Then clone the modified data1 machine to create data2, data3, and master.

Configure network

According to the plan, the host-only network adapter of each virtual machine is configured with a static IP. Take the data1 machine as an example:

View the status of the network:

[root@localhost ~]# ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.11.91.160  netmask 255.255.255.0  broadcast 10.11.91.255
        inet6 fe80::cfb8:10a7:ca04:61ea  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:41:1e:cf  txqueuelen 1000  (Ethernet)
        RX packets 23859  bytes 2131126 (2.0 MiB)
        RX errors 0  dropped 3  overruns 0  frame 0
        TX packets 438  bytes 58284 (56.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.37.104  netmask 255.255.255.0  broadcast 192.168.37.255
        inet6 fe80::7327:497d:15cd:6be0  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:14:8a:70  txqueuelen 1000  (Ethernet)
        RX packets 667  bytes 73822 (72.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 39  bytes 6314 (6.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16  base 0xd240  

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Here, enp0s3 is the bridged network adapter used for Internet access and can be ignored. enp0s8 is the host-only network adapter, which needs to be configured:

Copy a configuration file:

[root@localhost ~]# cd /etc/sysconfig/network-scripts/
[root@localhost network-scripts]# ls
ifcfg-enp0s3  ifdown-eth   ifdown-post    ifdown-Team      ifup-aliases  ifup-ipv6   ifup-post    ifup-Team      init.ipv6-global
ifcfg-lo      ifdown-ippp  ifdown-ppp     ifdown-TeamPort  ifup-bnep     ifup-isdn   ifup-ppp     ifup-TeamPort  network-functions
ifdown        ifdown-ipv6  ifdown-routes  ifdown-tunnel    ifup-eth      ifup-plip   ifup-routes  ifup-tunnel    network-functions-ipv6
ifdown-bnep   ifdown-isdn  ifdown-sit     ifup             ifup-ippp     ifup-plusb  ifup-sit     ifup-wireless
[root@localhost network-scripts]# cp ifcfg-enp0s3 ifcfg-enp0s8

Edit the following configuration:

[root@localhost network-scripts]# vim ifcfg-enp0s8

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
PEERDNS="yes"
PEERROUTES="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_PEERDNS="yes"
IPV6_PEERROUTES="yes"
IPV6_FAILURE_FATAL="no"
NAME="enp0s8"
UUID="06cae6b3-98ea-4177-8de1-acee3e51f9a2"
DEVICE="enp0s8"
ONBOOT="yes"
IPADDR=192.168.37.201
NETMASK=255.255.255.0
GATEWAY=192.168.37.1
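
To apply the new address without a full reboot, the network service can be restarted and the adapter checked (CentOS 7):

[root@localhost network-scripts]# systemctl restart network
[root@localhost network-scripts]# ip addr show enp0s8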

Configure hosts

[root@localhost network-scripts]# vim /etc/hosts

192.168.37.200  master
192.168.37.201  data1
192.168.37.202  data2
192.168.37.203  data3

After restart, the new configuration takes effect.
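
Once all four machines are configured and running, a simple connectivity check from any one of them is:

[root@localhost ~]# for h in master data1 data2 data3; do ping -c 1 $h; done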

Configure master node

Because the master node acts only as a management node and does not store data, its configuration needs to be modified.

hdfs-site.xml

The master is only a NameNode, not a DataNode, so the HDFS configuration needs to be modified:

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/hdfs-site.xml

<configuration>
<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/software/hadoop-2.6.5/hadoop_data/hdfs/namenode</value>
</property>
</configuration>

masters and slaves

The masters file tells Hadoop which machine is the master node (in Hadoop 2.x it is actually read to decide where the SecondaryNameNode runs):

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/masters

master

The slaves file tells Hadoop which machines are DataNodes:

[root@ ~]# vim /software/hadoop-2.6.5/etc/hadoop/slaves

data1
data2
data3

Restart the master node after the modifications.

Create hdfs directory

Take data1 as an example:

1. Connect to data1

[root@ ~]# ssh data1

2. Delete the hdfs directory

[root@ ~]# rm -rf /software/hadoop-2.6.5/hadoop_data/hdfs/

3. Create the DataNode directory

[root@ ~]# mkdir -p /software/hadoop-2.6.5/hadoop_data/hdfs/datanode

The same operation is performed in data2, data3.
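
If passwordless SSH from master to the data nodes is already set up (as in part 1), the three nodes can also be handled in one loop from master instead of logging in to each machine:

[root@master ~]# for h in data1 data2 data3; do
>   ssh $h "rm -rf /software/hadoop-2.6.5/hadoop_data/hdfs && mkdir -p /software/hadoop-2.6.5/hadoop_data/hdfs/datanode"
> done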

Format NameNode directory

Recreate the NameNode directory on master:

[root@ ~]# rm -rf /software/hadoop-2.6.5/hadoop_data/hdfs
[root@ ~]# mkdir -p /software/hadoop-2.6.5/hadoop_data/hdfs/namenode

Format:

[root@ ~]# hadoop namenode -format
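
hadoop namenode -format still works in 2.6.5 but is marked as deprecated; the equivalent current form is:

[root@ ~]# hdfs namenode -format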

Configure hostname

Take data1 as an example:

[root@localhost network-scripts]# vim /etc/hostname

data1

Restart the machine after the modification.
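
On CentOS 7 the hostname can also be set with hostnamectl, which writes /etc/hostname and takes effect immediately (a reboot or re-login is still the easiest way to refresh the shell prompt):

[root@localhost ~]# hostnamectl set-hostname data1
[root@localhost ~]# hostname
data1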

Start the cluster

Start HDFS and YARN directly with start-all.sh:

[root@master ~]# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
The authenticity of host 'master (192.168.37.200)' can't be established.
ECDSA key fingerprint is SHA256:sVo82fntVBJ6mhn1+oSp+1lLVknmE7s4JcMg4MVoLO0.
ECDSA key fingerprint is MD5:89:6d:c0:42:b1:21:79:07:c4:41:19:a2:0a:45:19:43.
Are you sure you want to continue connecting (yes/no)? yes
master: Warning: Permanently added 'master,192.168.37.200' (ECDSA) to the list of known hosts.
master: starting namenode, logging to /software/hadoop-2.6.5/logs/hadoop-root-namenode-..out
data2: starting datanode, logging to /software/hadoop-2.6.5/logs/hadoop-root-datanode-..out
data1: starting datanode, logging to /software/hadoop-2.6.5/logs/hadoop-root-datanode-..out
data3: starting datanode, logging to /software/hadoop-2.6.5/logs/hadoop-root-datanode-..out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /software/hadoop-2.6.5/logs/hadoop-root-secondarynamenode-..out
starting yarn daemons
starting resourcemanager, logging to /software/hadoop-2.6.5/logs/yarn-root-resourcemanager-..out
data3: starting nodemanager, logging to /software/hadoop-2.6.5/logs/yarn-root-nodemanager-..out
data1: starting nodemanager, logging to /software/hadoop-2.6.5/logs/yarn-root-nodemanager-..out
data2: starting nodemanager, logging to /software/hadoop-2.6.5/logs/yarn-root-nodemanager-..out

View the master node processes:

[root@master ~]# jps
3766 ResourceManager
3626 SecondaryNameNode
3452 NameNode
4061 Jps

View the data node processes (take data1 as an example):

[root@master ~]# ssh data1 "/software/jdk1.8.0_202/bin/jps"
3462 NodeManager
3606 Jps
3369 DataNode
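
The same check can be run against all three data nodes in one loop (the JDK path is identical on every node because they are clones):

[root@master ~]# for h in data1 data2 data3; do echo $h; ssh $h "/software/jdk1.8.0_202/bin/jps"; done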

View the web interface

1.ResourceManager interface

Address: http://10.11.91.122:8088/

You can see that there are three nodes data1, data2, data3.

2.hdfs interface

Address: http://10.11.91.122:50070/

You can see that Live Nodes shows three.

Switch to the DataNode interface:

You can see the complete information of three nodes.
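
The same information is also available from the command line; a quick way to confirm that all three DataNodes have registered is:

[root@master ~]# hdfs dfsadmin -report | grep -E 'Live datanodes|Hostname:'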

Problems encountered in building clusters

1. DataNode cannot start

At the beginning of the build, the NameNode failed to start because of configuration errors. After reformatting the NameNode, I found that the DataNodes could not start. Looking at the logs, I found that the cluster ID of the NameNode was not consistent with that of the DataNodes.
The problem was resolved after manually changing them to the same value.
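
For reference, the cluster IDs live in the VERSION files under each node's current/ directory (paths as configured above), so comparing them makes the mismatch obvious:

[root@master ~]# grep clusterID /software/hadoop-2.6.5/hadoop_data/hdfs/namenode/current/VERSION
[root@master ~]# ssh data1 "grep clusterID /software/hadoop-2.6.5/hadoop_data/hdfs/datanode/current/VERSION"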

2. The DataNode interface displays only one machine

I configured three machines, but no matter how I started the cluster, only one DataNode (seemingly chosen at random) appeared on the interface:

Strangely enough, the Overview page of the interface still showed three machines:

In the end, the reason was that I had not configured the hostnames, so all three machines used the same default hostname.
After modifying the hostnames of the three machines, the problem was solved.
