[Big Data: Hadoop] cluster environment construction

1 Introduction to Hadoop

1.1 advantages

1) High reliability: Hadoop maintains multiple copies of data at the bottom layer, so the failure of a single computing element or storage unit does not lead to data loss.
2) High scalability: Hadoop distributes task data across the cluster and can easily scale to thousands of nodes.
3) High efficiency: following the MapReduce model, Hadoop processes tasks in parallel to speed them up.
4) High fault tolerance: it can automatically reassign failed tasks.

1.2 composition


In the Hadoop 1.x era, MapReduce handled both business logic computation and resource scheduling, so the two were tightly coupled.
In the Hadoop 2.x era, YARN was added: YARN is responsible only for resource scheduling, and MapReduce is responsible only for computation.
Hadoop 3.x makes no changes to this composition.

1.1.1 overview of HDFS architecture

1) NameNode(NN): stores file metadata, such as file name, file directory structure, file attributes (generation time, number of copies, file permissions), block list of each file, DataNode where the block is located, etc.
2) DataNode(DN): stores file block data and the checksum of block data in the local file system.
3) Secondary NameNode(2NN): backs up NameNode metadata at regular intervals.

1.1.2 overview of yarn architecture

Yet Another Resource Negotiator, YARN for short, is the resource manager of Hadoop.

1) ResourceManager (RM): manages the resources (memory, CPU, etc.) of the entire cluster.
2) NodeManager (NM): manages the resources of a single node server.
3) ApplicationMaster (AM): manages a single task (application).
4) Container: a container, equivalent to an independent server; it encapsulates the resources required to run a task, such as memory, CPU, disk, and network.

Notes:

  1. There can be multiple clients
  2. Multiple ApplicationMasters can run on a cluster
  3. There can be multiple containers on each NodeManager

1.1.3 overview of MapReduce architecture

MapReduce divides the calculation process into two stages: Map and Reduce.
1) The Map phase processes the input data in parallel.
2) In the Reduce phase, the Map results are summarized.

1.1.4 relationship among HDFS, YARN and MapReduce

2. Setting up Hadoop operating environment

2.1 template virtual machine environment preparation

2.1.1 installing template virtual machine

IP address 192.168.10.100, hostname hadoop100, 4 GB of memory, 50 GB hard disk.

1) Create a new virtual machine

2) Customize new virtual machines



3) Select the system that the virtual machine will install in the future

4) Configure the name and storage location of the computer

5) Configure the number of CPUs of the computer
As a rule of thumb, allocate as many CPUs as the physical machine has, but no more.
(1) View the number of CPUs on the physical machine

(2) Set the number of virtual machine processors

(3) Memory size: 4 GB is recommended. Do not allocate too much, because multiple virtual machines will run at the same time later.

6) Select virtual Internet access mode

Select NAT mode

7) Select the IO mode of the corresponding file system

8) Select disk type

9) Select disk type and disk size


10) Storage location of virtual machine files

11) Complete configuration

12) Installation system
(1) Select the system image file to install


(2) Start installation


(3) Select language

(4) Configuration time


(5) Minimize installation system

(6) Configure disk partitions

(7) Add partition manually

Add a 1 GB /boot partition


Add a 4 GB swap partition

The root directory (/) gets the remaining space: 50 - 4 - 1 = 45 GB.

Click Done to finish partitioning.

(8) Turn off kdump to reduce memory consumption.

(9) Modify host name

(10) Start installation

(11) Installation takes a while; set the root user password in the meantime (this is mandatory).

(12) After installation, restart the virtual machine

13) Configure network
Configure the network for the installed VMware so that the virtual machine can connect to the network. NAT mode is recommended: the host's Windows and the virtual machine's Linux must be able to reach each other, and the virtual machine's Linux accesses the Internet through the host's Windows.

(1) Edit network configuration for VMware




Then click OK to complete the network configuration of VMware.

(2) Network configuration for Windows

Double-click the VMnet8 network adapter.

Modify the IPv4 settings.

Modify the IP information (address, gateway, DNS server) accordingly. After modification, click OK to exit.

(3) Modify virtual machine network IP configuration
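
A minimal sketch of the static IP configuration for the template machine hadoop100; the UUID and device name depend on your own installation, and the key values mirror the clone example shown later in section 2.2 (NAT subnet 192.168.10.0/24 assumed from the VMware settings above):

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

# Change BOOTPROTO to static and add the static address information
BOOTPROTO="static"
IPADDR=192.168.10.100
GATEWAY=192.168.10.2
DNS1=192.168.10.2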

(4) Execute the systemctl restart network command to restart the network service.

[root@hadoop100 ~]# systemctl restart network

(5) Modify host name

[root@hadoop100 ~]# vim /etc/hostname
hadoop100

(6) Configure hosts file

[root@hadoop100 ~]# vim /etc/hosts
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108

(7) Restart

[root@hadoop100 ~]# reboot

(8) Modify Windows hosts file
Enter the C:\Windows\System32\drivers\etc path;
Open the hosts file, add the following, and then save:

192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108

2.1.2 install necessary software

1) Install epel-release: Extra Packages for Enterprise Linux is an additional package repository for Red Hat family operating systems, applicable to RHEL, CentOS, and Scientific Linux. It provides many rpm packages that cannot be found in the official repositories.

[root@hadoop100 ~]# yum install -y epel-release

2) If the minimal Linux system was installed, the following tools also need to be installed; if the Linux Desktop Standard Edition was installed, the following operations are not necessary.

  1. net-tools: a toolkit collection that includes ifconfig and other commands
[root@hadoop100 ~]# yum install -y net-tools
  2. vim: editor
[root@hadoop100 ~]# yum install -y vim

2.1.3 close the firewall

[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld.service
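
To confirm the firewall is stopped and will not start on boot, its status can be checked (a simple verification, not part of the original steps):

[root@hadoop100 ~]# systemctl status firewalld
[root@hadoop100 ~]# systemctl is-enabled firewalld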

2.1.4 configure root (sudo) privileges for the custom user

[root@hadoop100 ~]# vim /etc/sudoers

## Allow root to run any commands anywhere 
root    ALL=(ALL)       ALL

## Allows members of the 'sys' group to run networking, software, 
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, LOCATE, DRIVERS

## Allows people in group wheel to run all commands
%wheel  ALL=(ALL)       ALL

# Add a row
liyibin    ALL=(ALL)       NOPASSWD:ALL

## Same thing without a password
# %wheel        ALL=(ALL)       NOPASSWD: ALL

2.1.5 create folders in the /opt directory and modify their owner and group

1) Create the module and software folders in the /opt directory

[root@hadoop100 ~]# mkdir /opt/module
[root@hadoop100 ~]# mkdir /opt/software

2) Change the owner and group of the module and software folders to the custom user

[root@hadoop100 ~]# chown liyibin:liyibin /opt/module 
[root@hadoop100 ~]# chown liyibin:liyibin /opt/software

2.1.6 uninstall the JDK of the virtual machine and restart the virtual machine

[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
[root@hadoop100 ~]# reboot
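
To verify the removal, the same query should now produce no output (a quick check, assuming no other Java packages remain):

[root@hadoop100 ~]# rpm -qa | grep -i java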

2.2 cloning virtual machines

1) Using the template machine hadoop100, clone three virtual machines: hadoop102, hadoop103, and hadoop104.

2) Modify the IP of each cloned machine. The following takes hadoop102 as an example.
(1) Modify the static IP of the cloned virtual machine

[root@hadoop100 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static" 
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="84251f4a-50b7-41df-9dc2-6682548df157"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.10.102
GATEWAY=192.168.10.2
DNS1=192.168.10.2

(2) Modify the host name of the clone machine

[root@hadoop100 ~]# vim /etc/hostname
hadoop102

(3) Restart the clone machine

[root@hadoop100 ~]# reboot

2.3 install JDK on hadoop102

1) Uninstall the existing JDK. Refer to step 2.1.6.
2) Upload the JDK package to the /opt/software directory and install it.

[liyibin@hadoop102 ~]$ cd /opt/software/
[liyibin@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/

3) Configure environment variables
(1) Create a new environment variable file

[liyibin@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh

# Add the following
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin

(2) Save and exit
(3) Make the new environment variable effective

[liyibin@hadoop102 ~]$ source /etc/profile

4) Test whether the JDK is installed successfully

[liyibin@hadoop102 ~]$ java -version
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)

2.4 installing Hadoop on hadoop102

Hadoop download address: https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/

1) Upload the archive hadoop-3.1.3.tar.gz to /opt/software
2) Extract and install

[liyibin@hadoop102 ~]$ cd /opt/software/
[liyibin@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/

3) Add hadoop to the environment variable

[liyibin@hadoop102 ~]$ sudo vim /etc/profile.d/my_env.sh

# Add the following
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

(2) Save and exit
(3) Make the new environment variable effective

[liyibin@hadoop102 ~]$ source /etc/profile

4) Test for successful installation

[liyibin@hadoop102 ~]$ hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar

5) Restart (restart the virtual machine if the Hadoop command cannot be used)

[liyibin@hadoop102 ~]$ reboot

2.5 Hadoop directory structure

1) View Hadoop directory structure

[liyibin@hadoop102 hadoop-3.1.3]$ ll
total 180
drwxr-xr-x. 2 liyibin liyibin    183 Sep 12  2019 bin
drwxrwxr-x. 4 liyibin liyibin     37 Oct  7 20:12 data
drwxr-xr-x. 3 liyibin liyibin     20 Sep 12  2019 etc
drwxr-xr-x. 2 liyibin liyibin    106 Sep 12  2019 include
drwxr-xr-x. 3 liyibin liyibin     20 Sep 12  2019 lib
drwxr-xr-x. 4 liyibin liyibin    288 Sep 12  2019 libexec
-rw-rw-r--. 1 liyibin liyibin 147145 Sep  4  2019 LICENSE.txt
drwxrwxr-x. 3 liyibin liyibin   4096 Oct 24 17:37 logs
-rw-rw-r--. 1 liyibin liyibin  21867 Sep  4  2019 NOTICE.txt
-rw-rw-r--. 1 liyibin liyibin   1366 Sep  4  2019 README.txt
drwxr-xr-x. 3 liyibin liyibin   4096 Sep 12  2019 sbin
drwxr-xr-x. 4 liyibin liyibin     31 Sep 12  2019 share
drwxrwxr-x. 2 liyibin liyibin     22 Oct  5 16:43 wcinput
drwxr-xr-x. 2 liyibin liyibin     88 Oct  5 16:44 wcoutput

2) Important directory
(1) bin directory: stores scripts for operating Hadoop related services (hdfs, yarn, mapred).
(2) etc Directory: Hadoop configuration file directory, which stores Hadoop configuration files.
(3) lib Directory: Hadoop's native libraries (used for compressing and decompressing data).
(4) sbin Directory: stores scripts for starting or stopping Hadoop related services.
(5) share Directory: stores the dependent jar packages, documents, and official cases of Hadoop.

3 Hadoop operation mode

1) Hadoop official website: http://hadoop.apache.org/
2) Hadoop operation modes include: local mode, pseudo-distributed mode, and fully distributed mode.
(1) Local mode: runs on a single machine, mainly to demonstrate the official examples (see the example after this list). Not used in production.
(2) Pseudo-distributed mode: also runs on a single machine, but has all the functions of a Hadoop cluster; one server simulates a distributed environment. Used for testing by companies with limited budget; not used in production.
(3) Fully distributed mode: multiple servers form a distributed environment. Used in production.
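
As a concrete illustration of local mode, the official wordcount example can be run directly from the Hadoop installation directory. This is only a sketch: the wcinput/wcoutput directories correspond to those visible in the listing in section 2.5, and the input file name word.txt is assumed.

[liyibin@hadoop102 hadoop-3.1.3]$ mkdir wcinput
[liyibin@hadoop102 hadoop-3.1.3]$ echo "hadoop yarn hadoop mapreduce" > wcinput/word.txt
[liyibin@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput wcoutput
[liyibin@hadoop102 hadoop-3.1.3]$ cat wcoutput/part-r-00000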

3.1 construction of fully distributed operation mode

3.1.1 establishment of operation environment

See the previous chapter for virtual machine configuration.

3.1.2 writing cluster distribution script

1) scp (secure copy)
(1) scp definition: scp can copy data between servers (from server1 to server2).
(2) Basic syntax:
scp    -r    $pdir/$fname    $user@$host:$pdir/$fname
command  recursive  path/name of the file to copy  destination user@host:destination path/name
(3) Copy the JDK and Hadoop from hadoop102 to hadoop103 and hadoop104.

You first need to create the /opt/module and /opt/software directories on hadoop103 and hadoop104, and change the owner and group of these two directories to liyibin:liyibin.

[liyibin@hadoop102 hadoop-3.1.3]$ scp -r /opt/module/* liyibin@hadoop103:/opt/module
[liyibin@hadoop102 hadoop-3.1.3]$ scp -r /opt/module/* liyibin@hadoop104:/opt/module

2) rsync remote synchronization tool
rsync is mainly used for backup and mirroring. Its advantages are high speed, avoiding copying identical content, and supporting symbolic links.
Difference between rsync and scp: copying files with rsync is faster than with scp; rsync only transfers files that differ, while scp copies all files.

(1) Basic syntax:
rsync    -av    $pdir/$fname    $user@$host:$pdir/$fname
command  options  path/name of the file to copy  destination user@host:destination path/name

Parameter Description:
-a archive copy
-v displays the copy process
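
For example, to synchronize the Hadoop installation from hadoop102 to hadoop103 with the syntax above (a sketch; rsync must be installed on both machines):

[liyibin@hadoop102 module]$ rsync -av hadoop-3.1.3/ liyibin@hadoop103:/opt/module/hadoop-3.1.3/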

3) xsync cluster distribution script
(1) Requirement: copy files in a loop to the same directory on all nodes
(2) Script implementation

[liyibin@hadoop102 bin]$ cd /home/liyibin/bin/
[liyibin@hadoop102 bin]$ vim xsync 

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

#2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ====================  $host  ====================
    #3. Traverse all directories and send them one by one

    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
            then
                #5. Get parent directory
                pdir=$(cd -P $(dirname $file); pwd)

                #6. Get the name of the current file
                fname=$(basename $file)
                ssh $host "mkdir -p $pdir"
                rsync -av $pdir/$fname $host:$pdir
            else
                echo $file does not exist!
        fi
    done
done

(3) Give the xsync script execution permission

[liyibin@hadoop102 bin]$ chmod +x xsync

(4) Synchronize environment variables

[liyibin@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh 

(5) Make environment variables effective

[liyibin@hadoop103 bin]$ source /etc/profile
[liyibin@hadoop104 opt]$ source /etc/profile
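
After sourcing the profile, the distributed environment can be verified on the other nodes (a quick check):

[liyibin@hadoop103 ~]$ java -version
[liyibin@hadoop103 ~]$ hadoop version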

3.1.3 SSH non secret login

1) Configure ssh
Basic syntax: ssh followed by the hostname or IP address of another computer

[liyibin@hadoop102 ~]$ ssh hadoop103
Are you sure you want to continue connecting (yes/no)? 

Type yes and press Enter.

2) Password-free (key-based) configuration
(1) Generate public and private keys

[liyibin@hadoop102 ~]$ cd /home/liyibin/.ssh/
[liyibin@hadoop102 .ssh]$ ssh-keygen -t rsa

Then press Enter three times, and two files will be generated: id_rsa (private key) and id_rsa.pub (public key).
(2) Copy the public key to the target machine for password free login

[liyibin@hadoop102 .ssh]$ ssh-copy-id hadoop102
[liyibin@hadoop102 .ssh]$ ssh-copy-id hadoop103
[liyibin@hadoop102 .ssh]$ ssh-copy-id hadoop104

Similarly, the same configuration needs to be performed on hadoop103 and hadoop104.
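
Passwordless login can then be verified from hadoop102; it should no longer prompt for a password:

[liyibin@hadoop102 .ssh]$ ssh hadoop103
[liyibin@hadoop103 ~]$ exit
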
(3) Explanation of the files in the ~/.ssh folder

File               Function
known_hosts        Records the public keys of hosts accessed via ssh
id_rsa             Generated private key
id_rsa.pub         Generated public key
authorized_keys    Stores the authorized public keys for password-free login

3.1.4 cluster configuration

1) Cluster deployment planning
(1) NameNode and SecondaryNameNode should not be installed on the same server
(2) ResourceManager also consumes a lot of memory; it should not be configured on the same machine as the NameNode or SecondaryNameNode.

        hadoop102             hadoop103                       hadoop104
HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode
YARN    NodeManager           ResourceManager, NodeManager    NodeManager

2) Configuration file description
Hadoop configuration files come in two types: default configuration files and custom configuration files. When users want to change a default configuration value, they only need to modify the custom configuration file and set the corresponding property.
(1) Default configuration files

Default file         Location
core-default.xml     hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml     hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml     hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml   hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

(2) Custom configuration files
core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop; users can modify their configuration according to project requirements.
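
If you want to inspect a default configuration file, it can be extracted from the corresponding jar listed above, for example (a sketch, assuming the jar location under share/hadoop/common):

[liyibin@hadoop102 ~]$ cd /tmp
[liyibin@hadoop102 tmp]$ jar xf /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-common-3.1.3.jar core-default.xml
[liyibin@hadoop102 tmp]$ less core-default.xml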

3) Configure cluster
(1) Core configuration file core-site.xml

[liyibin@hadoop102 ~]$ cd /opt/module/hadoop-3.1.3/etc/hadoop/
[liyibin@hadoop102 hadoop]$ vim core-site.xml 

The contents of the document are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
    <!-- Configure liyibin as the static user for HDFS web UI login -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>liyibin</value>
    </property>
</configuration>

(2) HDFS configuration file

[liyibin@hadoop102 hadoop]$ vim hdfs-site.xml 

The contents of the document are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- NameNode (nn) web UI access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode (2nn) web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>

(3) YARN configuration file

[liyibin@hadoop102 hadoop]$ vim yarn-site.xml 

The contents of the document are as follows:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
    <!-- Specify that MR uses the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Set log aggregation server address -->
    <property>  
        <name>yarn.log.server.url</name>  
        <value>http://hadoop102:19888/jobhistory/logs</value>
    </property>
    <!-- Set the log retention time to 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>

(4) MapReduce configuration file

[liyibin@hadoop102 hadoop]$ vim mapred-site.xml 

The contents of the document are as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Specify that MapReduce programs run on YARN (the default is local) -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
   <!-- History server address -->
   <property>
       <name>mapreduce.jobhistory.address</name>
       <value>hadoop102:10020</value>
   </property>

   <!-- History server web UI address -->
   <property>
       <name>mapreduce.jobhistory.webapp.address</name>
       <value>hadoop102:19888</value>
   </property>
</configuration>

4) Distribute the configured Hadoop configuration file

[liyibin@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5) View the distributed files on hadoop103 and hadoop104

[liyibin@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[liyibin@hadoop104 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

6) Configure workers

[liyibin@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

# Add the following contents to this document
hadoop102
hadoop103
hadoop104

Distribute the modified file

[liyibin@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/workers

3.1.5 starting the cluster

1) If the cluster is started for the first time, you need to format the NameNode on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID; if the NameNode and DataNode cluster IDs become inconsistent, the cluster cannot find its past data. If the cluster reports errors while running and the NameNode needs to be reformatted, be sure to stop the NameNode and DataNode processes first, and delete the data and logs directories on all machines before formatting.)

[liyibin@hadoop102 hadoop-3.1.3]$ hdfs namenode -format

2) Start HDFS

[liyibin@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh

3) Start YARN on the node where the ResourceManager is configured (hadoop103)

[liyibin@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

4) View HDFS's NameNode on the web
Enter in the browser: http://hadoop102:9870

5) View YARN's ResourceManager on the Web
Enter in the browser: http://hadoop103:8088
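
Since the history server is configured in mapred-site.xml and yarn-site.xml above, it can also be started on hadoop102 and checked in the browser (the same command is used later in the myhadoop.sh script):

[liyibin@hadoop102 hadoop-3.1.3]$ bin/mapred --daemon start historyserver

Enter in the browser: http://hadoop102:19888/jobhistory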

3.1.6 summary of cluster start / stop modes

1) Each module starts / stops separately (ssh configuration is the premise)
(1) Overall start / stop HDFS

start-dfs.sh/stop-dfs.sh

(2) Overall start / stop of YARN

start-yarn.sh/stop-yarn.sh

2) Each service component starts / stops one by one
(1) Start / stop HDFS components separately

hdfs --daemon start/stop namenode/datanode/secondarynamenode

(2) Start / stop YARN

yarn --daemon start/stop  resourcemanager/nodemanager

3.1.7 writing common Hadoop cluster scripts

1) Hadoop cluster start/stop script (including HDFS, YARN, and HistoryServer): myhadoop.sh

[liyibin@hadoop102 ~]$ cd /home/liyibin/bin
[liyibin@hadoop102 bin]$ vim myhadoop.sh

The contents of the document are as follows

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== start-up hadoop colony ==================="

        echo " --------------- start-up hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- start-up yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- start-up historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== close hadoop colony ==================="

        echo " --------------- close historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- close yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- close hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac

Exit after saving, and then grant script execution permission

[liyibin@hadoop102 bin]$ chmod +x myhadoop.sh
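
A usage sketch (assuming /home/liyibin/bin is on the PATH, which CentOS adds for a user's ~/bin after re-login):

[liyibin@hadoop102 ~]$ myhadoop.sh start
[liyibin@hadoop102 ~]$ myhadoop.sh stop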

2) Script to view the Java processes on the three servers: jpsall

[liyibin@hadoop102 ~]$ cd /home/liyibin/bin
[liyibin@hadoop102 bin]$ vim jpsall

The contents of the document are as follows

#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps 
done

Exit after saving, and then grant script execution permission

[liyibin@hadoop102 bin]$ chmod +x jpsall
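
A usage sketch; if the cluster is running, the processes shown should match the deployment plan in section 3.1.4 (NameNode/DataNode/NodeManager on hadoop102, ResourceManager/DataNode/NodeManager on hadoop103, SecondaryNameNode/DataNode/NodeManager on hadoop104, plus JobHistoryServer on hadoop102 if started):

[liyibin@hadoop102 ~]$ jpsall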

4. Common Hadoop ports

Port name                               Hadoop 2.x    Hadoop 3.x
NameNode internal communication port    8020/9000     8020/9000/9820
NameNode web UI port                    50070         9870
SecondaryNameNode web UI port           9868          9868
MapReduce task execution view port      8088          8088
History server communication port       10020         10020
History server web UI port              19888         19888
