Getting Started with Hadoop: Fully Distributed Running Mode (Development Focus)

This article is part of the Learning Guide for Big Data Specialists from Zero (Full Upgrade), Hadoop section.

0. Introduction

Analysis:

1) Prepare 3 machines (firewall turned off, static IP, hostnames configured)

2) Install JDK

3) Configuring environment variables

4) Install Hadoop

5) Configuring environment variables

6) Configuring clusters

7) Start single nodes

8) Configure ssh

9) Start the cluster as a whole and test it

1. Virtual machine preparation

2. Write cluster distribution script xsync

1) scp (secure copy)

(1) scp definition

scp copies data between servers (e.g., from server1 to server2).

(2) Basic syntax

scp      -r           $pdir/$fname                  $user@$host:$pdir/$fname
command  recursive    path/name of file to copy     destination user@host:destination path/name

(3) Case Practice

  • Prerequisite: the /opt/module and /opt/software directories have already been created on hadoop102, hadoop103, and hadoop104, and their ownership has been changed to atguigu:atguigu

[atguigu@hadoop102 ~]$ sudo chown atguigu:atguigu -R /opt/module

(a) On hadoop102, copy the /opt/module/jdk1.8.0_212 directory on hadoop102 to hadoop103.

[atguigu@hadoop102 ~]$ scp -r /opt/module/jdk1.8.0_212  atguigu@hadoop103:/opt/module

(b) On hadoop103, copy the /opt/module/hadoop-3.1.3 directory on hadoop102 to hadoop103.

[atguigu@hadoop103 ~]$ scp -r atguigu@hadoop102:/opt/module/hadoop-3.1.3 /opt/module/

(c) On hadoop103, copy all directories under /opt/module on hadoop102 to hadoop104.

[atguigu@hadoop103 opt]$ scp -r atguigu@hadoop102:/opt/module/* atguigu@hadoop104:/opt/module

2) rsync Remote Synchronization Tool

rsync is mainly used for backup and mirroring. It is fast, avoids re-copying content that is already identical, and supports symbolic links.

Difference between rsync and scp: rsync is faster than scp and transfers only the files that differ, while scp copies all files every time.

(1) Basic syntax

rsync    -av          $pdir/$fname                  $user@$host:$pdir/$fname
command  options      path/name of file to copy     destination user@host:destination path/name

Description of option parameters

option    function
-a        archive copy
-v        show the copy process

(2) Case Practice

(a) Delete /opt/module/hadoop-3.1.3/wcinput on hadoop103

[atguigu@hadoop103 hadoop-3.1.3]$ rm -rf wcinput/

(b) On hadoop102, synchronize /opt/module/hadoop-3.1.3 to hadoop103

[atguigu@hadoop102 module]$ rsync -av hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/
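
A quick way to see the "only differences" behavior (a suggestion, not part of the original steps): rsync's -n (dry run) flag previews what would be transferred, so after the first sync a dry run of the same command should list almost nothing:

[atguigu@hadoop102 module]$ rsync -avn hadoop-3.1.3/ atguigu@hadoop103:/opt/module/hadoop-3.1.3/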

3) xsync cluster distribution script

(1) Requirement: copy files in a loop to the same directory on all nodes

(2) Requirement analysis:

(a) The original rsync command:

rsync  -av     /opt/module    atguigu@hadoop103:/opt/

(b) Expected script usage:

xsync <name of file to sync>

(c) The script should work from any path (place it in a directory that is on the global PATH)

[atguigu@hadoop102 ~]$ echo $PATH
/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/atguigu/.local/bin:/home/atguigu/bin:/opt/module/jdk1.8.0_212/bin

(3) Script implementation

(a) Create an xsync file in the /home/atguigu/bin directory

[atguigu@hadoop102 opt]$ cd /home/atguigu
[atguigu@hadoop102 ~]$ mkdir bin
[atguigu@hadoop102 ~]$ cd bin
[atguigu@hadoop102 bin]$ vim xsync

Write the following code in this file

#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi

#2. Loop over all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ====================  $host  ====================
    #3. Loop over all files and directories, sending them one by one

    for file in "$@"
    do
        #4. Check whether the file exists
        if [ -e "$file" ]
            then
                #5. Get the parent directory (resolving symlinks with -P)
                pdir=$(cd -P "$(dirname "$file")"; pwd)

                #6. Get the name of the current file
                fname=$(basename "$file")
                ssh $host "mkdir -p $pdir"
                rsync -av "$pdir/$fname" $host:"$pdir"
            else
                echo "$file does not exist!"
        fi
    done
done

(b) Give the xsync script execute permission

[atguigu@hadoop102 bin]$ chmod +x xsync

(c) Test scripts

[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin

(d) Copy the script into /bin so it can be invoked globally

[atguigu@hadoop102 bin]$ sudo cp xsync /bin/

(e) Synchronize environment variable configuration (root owner)

[atguigu@hadoop102 ~]$ sudo ./bin/xsync /etc/profile.d/my_env.sh

Note: if you run the script with sudo, you must give the full path to xsync.

Let environment variables take effect

[atguigu@hadoop103 bin]$ source /etc/profile

[atguigu@hadoop104 opt]$ source /etc/profile

3. SSH Passwordless Login Configuration

1) Configure ssh

(1) Basic syntax

ssh <IP address or hostname of the other machine>

(2) What to do when an ssh connection prompts about an unknown host key

[atguigu@hadoop102 ~]$ ssh hadoop103

  • If the following appears

Are you sure you want to continue connecting (yes/no)? 

  • Type yes and press Enter

(3) Back to hadoop102

[atguigu@hadoop103 ~]$ exit

2) Passwordless login configuration

(1) Principle of passwordless login

(2) Generate public and private keys

[atguigu@hadoop102 .ssh]$ pwd
/home/atguigu/.ssh

[atguigu@hadoop102 .ssh]$ ssh-keygen -t rsa

Then press Enter three times; two files will be generated: id_rsa (private key) and id_rsa.pub (public key).

(3) Copy the public key to the target machines to enable passwordless login

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop102
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop103
[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop104
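
For reference, ssh-copy-id essentially appends the local public key to the remote ~/.ssh/authorized_keys. A rough manual equivalent (only needed if ssh-copy-id is unavailable) looks like this:

[atguigu@hadoop102 .ssh]$ cat ~/.ssh/id_rsa.pub | ssh hadoop103 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"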

Note:

You also need to configure passwordless login from the atguigu account on hadoop103 to hadoop102, hadoop103, and hadoop104.

You also need to configure passwordless login from the atguigu account on hadoop104 to hadoop102, hadoop103, and hadoop104.

You also need to configure passwordless login from the root account on hadoop102 to hadoop102, hadoop103, and hadoop104.

3) Files in the ~/.ssh directory

known_hosts        records the public keys of hosts that ssh has accessed
id_rsa             generated private key
id_rsa.pub         generated public key
authorized_keys    stores the public keys authorized for passwordless login to this server

4. Cluster Configuration

1) Cluster Deployment Planning

Note:

  • Do not install NameNode and SecondaryNameNode on the same server.
  • ResourceManager also consumes a lot of memory, so do not configure it on the same machine as NameNode or SecondaryNameNode.

        hadoop102             hadoop103                       hadoop104

HDFS    NameNode, DataNode    DataNode                        SecondaryNameNode, DataNode

YARN    NodeManager           ResourceManager, NodeManager    NodeManager

2) Configuration file description

Hadoop configuration files come in two categories: default configuration files and custom configuration files. Users only need to modify the custom configuration files, changing the corresponding property values, when they want to override a default value.

(1) Default configuration files:

Default file           Location inside the Hadoop jar packages
core-default.xml       hadoop-common-3.1.3.jar/core-default.xml
hdfs-default.xml       hadoop-hdfs-3.1.3.jar/hdfs-default.xml
yarn-default.xml       hadoop-yarn-common-3.1.3.jar/yarn-default.xml
mapred-default.xml     hadoop-mapreduce-client-core-3.1.3.jar/mapred-default.xml

(2) Custom configuration files:

The four configuration files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml are stored under $HADOOP_HOME/etc/hadoop, and users can modify them according to project requirements.
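
To look up a default value, the default files can be read straight out of the jars. Assuming the standard layout where hadoop-common-3.1.3.jar sits under $HADOOP_HOME/share/hadoop/common, something like the following prints core-default.xml:

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/share/hadoop/common
[atguigu@hadoop102 common]$ unzip -p hadoop-common-3.1.3.jar core-default.xml | less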

3) Configuring clusters

(1) Core configuration file

Configure core-site.xml

[atguigu@hadoop102 ~]$ cd $HADOOP_HOME/etc/hadoop

[atguigu@hadoop102 hadoop]$ vim core-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify the NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>

    <!-- Specify the Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>

    <!-- Set the static user for HDFS web UI login to atguigu -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>atguigu</value>
    </property>
</configuration>

(2) HDFS configuration file

Configure hdfs-site.xml

[atguigu@hadoop102 hadoop]$ vim hdfs-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- NameNode web UI access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode web UI access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>

(3) YARN configuration file

Configure yarn-site.xml

[atguigu@hadoop102 hadoop]$ vim yarn-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify that MR uses the shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Specify the ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>

    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

(4) MapReduce configuration file

Configure mapred-site.xml

[atguigu@hadoop102 hadoop]$ vim mapred-site.xml

The contents of the file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

4) Distribute the configured Hadoop configuration files to the cluster

[atguigu@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/

5) Check the file distribution on hadoop103 and hadoop104

[atguigu@hadoop103 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
[atguigu@hadoop104 ~]$ cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml

5. Start the Cluster

1) Configure workers

[atguigu@hadoop102 hadoop]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/workers

Add the following to the file:

hadoop102
hadoop103
hadoop104

Note: No spaces are allowed at the end of the content added to the file, and no empty lines are allowed in the file.
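
If in doubt, a quick check (a suggestion beyond the original steps) is cat -A, which marks each line end with $ and exposes trailing spaces or stray blank lines; each line should end immediately with $:

[atguigu@hadoop102 hadoop]$ cat -A /opt/module/hadoop-3.1.3/etc/hadoop/workers
hadoop102$
hadoop103$
hadoop104$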

Synchronize the configuration files to all nodes

[atguigu@hadoop102 hadoop]$ xsync /opt/module/hadoop-3.1.3/etc

2) Start the cluster

(1) If the cluster is being started for the first time, the NameNode must be formatted on the hadoop102 node. (Note: formatting the NameNode generates a new cluster ID; if old data remains, the cluster IDs of the NameNode and the DataNodes will no longer match and the cluster will not find its previous data. If the cluster fails while running and the NameNode needs to be reformatted, be sure to stop the namenode and datanode processes first and delete the data and logs directories on all machines before formatting. A sketch of this recovery sequence follows the format command below.)

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs namenode -format
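
For later reference, a minimal sketch of the clean-up sequence described in the note above, should a reformat ever be needed (paths assume the install locations used in this guide):

# stop HDFS and YARN first (on hadoop102 and hadoop103 respectively)
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/stop-dfs.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
# delete the data and logs directories on every node, then format again
[atguigu@hadoop102 ~]$ for host in hadoop102 hadoop103 hadoop104; do ssh $host "rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs"; done
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs namenode -format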

(2) Start HDFS

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh

(3) Start YARN on the node where ResourceManager is configured (hadoop103)

[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
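
At this point it helps to confirm with jps that each node is running the daemons from the deployment plan in section 4 (a quick sanity check; the actual output will also include a Jps entry and process IDs):

[atguigu@hadoop102 ~]$ jps    # expect NameNode, DataNode, NodeManager
[atguigu@hadoop103 ~]$ jps    # expect ResourceManager, NodeManager, DataNode
[atguigu@hadoop104 ~]$ jps    # expect SecondaryNameNode, DataNode, NodeManager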

(4) Web-side view of NameNode for HDFS

(a) Enter in the browser: http://hadoop102:9870

(b) View data information stored on HDFS

(5) View YARN's ResourceManager on the Web

(a) Enter in the browser: http://hadoop103:8088

(b) View Job information running on YARN

3) Cluster Basic Testing

(1) Upload files to the cluster

  • Upload small files
[atguigu@hadoop102 ~]$ hadoop fs -mkdir /input
[atguigu@hadoop102 ~]$ hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
  • Upload large files

[atguigu@hadoop102 ~]$ hadoop fs -put  /opt/software/jdk-8u212-linux-x64.tar.gz  /

(2) After uploading the file, check where the file is stored

  • View HDFS file storage path

[atguigu@hadoop102 subdir0]$ pwd

/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-1436128598-192.168.10.102-1610603650062/current/finalized/subdir0/subdir0

  • View HDFS file contents stored on disk
[atguigu@hadoop102 subdir0]$ cat blk_1073741825
hadoop yarn
hadoop mapreduce 
atguigu
atguigu

(3) Concatenate the block files (restoring the original file from its blocks)

-rw-rw-r--. 1 atguigu atguigu 134217728 May 23 16:01 blk_1073741836
-rw-rw-r--. 1 atguigu atguigu   1048583 May 23 16:01 blk_1073741836_1012.meta
-rw-rw-r--. 1 atguigu atguigu  63439959 May 23 16:01 blk_1073741837
-rw-rw-r--. 1 atguigu atguigu    495635 May 23 16:01 blk_1073741837_1013.meta

[atguigu@hadoop102 subdir0]$ cat blk_1073741836>>tmp.tar.gz
[atguigu@hadoop102 subdir0]$ cat blk_1073741837>>tmp.tar.gz
[atguigu@hadoop102 subdir0]$ tar -zxvf tmp.tar.gz

(4) Download

[atguigu@hadoop104 software]$ hadoop fs -get /jdk-8u212-linux-x64.tar.gz ./

(5) Executing the wordcount program

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
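
Once the job finishes, the result can also be checked from the command line (the reducer output normally lands in files such as /output/part-r-00000):

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /output
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /output/part-r-00000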

6. Configure History Server

In order to view the history of the program, you need to configure the history server. The specific configuration steps are as follows:

1) Configure mapred-site.xml

[atguigu@hadoop102 hadoop]$ vim mapred-site.xml

Add the following configuration to the file.

<!-- JobHistory server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
</property>

<!-- JobHistory server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
</property>

2) Distribution Configuration

[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/mapred-site.xml

3) Start the history server at hadoop102

[atguigu@hadoop102 hadoop]$ mapred --daemon start historyserver

4) Check whether the history server is started

[atguigu@hadoop102 hadoop]$ jps

5) View JobHistory

http://hadoop102:19888/jobhistory

7. Configure aggregation of logs

Log aggregation concept: Upload program run log information to the HDFS system after the application has finished running.

Benefits of log aggregation: It is easy to see the details of program operation and facilitate development and debugging.

Note: To turn on log aggregation, you need to restart NodeManager, ResourceManager, and HistoryServer.

The steps to turn on log aggregation are as follows:

1) Configure yarn-site.xml

[atguigu@hadoop102 hadoop]$ vim yarn-site.xml

Add the following configuration to the file.

<!-- Turn on log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Set Log Aggregation Server Address -->
<property>  
    <name>yarn.log.server.url</name>  
    <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- Set log retention time to 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

2) Distribution Configuration

[atguigu@hadoop102 hadoop]$ xsync $HADOOP_HOME/etc/hadoop/yarn-site.xml

3) Close NodeManager, ResourceManager and HistoryServer

[atguigu@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh

[atguigu@hadoop103 hadoop-3.1.3]$ mapred --daemon stop historyserver

4) Start NodeManager, ResourceManage, and HistoryServer

[atguigu@hadoop103 ~]$ start-yarn.sh

[atguigu@hadoop102 ~]$ mapred --daemon start historyserver

5) Delete existing output files on HDFS

[atguigu@hadoop102 ~]$ hadoop fs -rm -r /output

6) Execute WordCount program

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output

7) View logs

(1) History server address

http://hadoop102:19888/jobhistory

(2) List of historical tasks

(3) View the task run log

(4) Run log details
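
Besides the web UI, aggregated logs can also be pulled from the command line with yarn logs; the application ID below is only a placeholder, the real ID appears in the ResourceManager web UI or in the job output:

[atguigu@hadoop102 ~]$ yarn application -list -appStates FINISHED
[atguigu@hadoop102 ~]$ yarn logs -applicationId application_1610603650062_0001   # placeholder ID; substitute your own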

8. Summary of cluster start/stop methods

1) Start/stop whole modules (SSH configuration is a prerequisite); this is the most common approach

(1) Start/stop HDFS as a whole

start-dfs.sh/stop-dfs.sh

(2) Start/Stop YARN as a whole

start-yarn.sh/stop-yarn.sh

2) Start/Stop each service component one by one

(1) Start/stop HDFS components separately

hdfs --daemon start/stop namenode/datanode/secondarynamenode

(2) Start/stop YARN components separately

yarn --daemon start/stop  resourcemanager/nodemanager
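
For example, to restart only the DataNode on hadoop104 or the NodeManager on hadoop103 without touching the rest of the cluster (illustrative commands following the templates above):

[atguigu@hadoop104 ~]$ hdfs --daemon stop datanode
[atguigu@hadoop104 ~]$ hdfs --daemon start datanode
[atguigu@hadoop103 ~]$ yarn --daemon stop nodemanager
[atguigu@hadoop103 ~]$ yarn --daemon start nodemanager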

9. Write common scripts for Hadoop clusters

1) Hadoop cluster start-stop script (including HDFS, Yarn, Historyserver): myhadoop.sh

[atguigu@hadoop102 ~]$ cd /home/atguigu/bin

[atguigu@hadoop102 bin]$ vim myhadoop.sh

  • Enter the following
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi

case $1 in
"start")
        echo " =================== Starting the Hadoop cluster ==================="

        echo " --------------- starting hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== Stopping the Hadoop cluster ==================="

        echo " --------------- stopping historyserver ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
  • Exit after saving and grant script execution privileges

[atguigu@hadoop102 bin]$ chmod +x myhadoop.sh

2) Script to view the Java processes on all three servers: jpsall

[atguigu@hadoop102 ~]$ cd /home/atguigu/bin

[atguigu@hadoop102 bin]$ vim jpsall

  • Enter the following
#!/bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo =============== $host ===============
        ssh $host jps 
done
  • Exit after saving and grant script execution privileges

[atguigu@hadoop102 bin]$ chmod +x jpsall

3) Distribute the /home/atguigu/bin directory so that the custom scripts are available on all three machines

[atguigu@hadoop102 ~]$ xsync /home/atguigu/bin/
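
After the distribution, a typical session (assuming /home/atguigu/bin is on the PATH, as shown earlier) looks like this:

[atguigu@hadoop102 ~]$ myhadoop.sh start
[atguigu@hadoop102 ~]$ jpsall
[atguigu@hadoop102 ~]$ myhadoop.sh stop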

10. Common Port Numbers

Port name                                 Hadoop 2.x          Hadoop 3.x
NameNode internal communication port      8020 / 9000         8020 / 9000 / 9820
NameNode HTTP UI                          50070               9870
MapReduce task execution view port        8088                8088
History server communication port         19888               19888

11. Cluster Time Synchronization

If the servers are in a public network environment (able to reach the Internet), cluster time synchronization is not strictly necessary, because each server periodically calibrates itself against network time.

If the servers are in an intranet environment, cluster time synchronization must be configured; otherwise the clocks drift apart over time and the cluster ends up executing tasks at inconsistent times.

1) Requirements

Designate one machine as the time server, and have all other machines in the cluster synchronize with it periodically. In production, the synchronization interval depends on how sensitive the tasks are to time accuracy; the test environment synchronizes every minute so the effect can be seen quickly.

2) Time server configuration (root user is required)

(1) Check the ntpd service status and whether it is enabled at boot on all nodes

[atguigu@hadoop102 ~]$ sudo systemctl status ntpd
[atguigu@hadoop102 ~]$ sudo systemctl start ntpd
[atguigu@hadoop102 ~]$ sudo systemctl is-enabled ntpd

(2) Modify hadoop102 ntp.conf configuration file

[atguigu@hadoop102 ~]$ sudo vim /etc/ntp.conf

The modifications are as follows

(a) Modification 1 (authorize all machines on the 192.168.10.0/24 segment to query and synchronize time from this server): change

#restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap

to

restrict 192.168.10.0 mask 255.255.255.0 nomodify notrap

(b) Modification 2 (the cluster is on a LAN; do not use time servers on the Internet): change

server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst

to

#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst

(c) Addition 3 (when this node loses its network connection, it can still use its local clock as the time source for the other nodes in the cluster): add

server 127.127.1.0

fudge 127.127.1.0 stratum 10

(3) Modify the /etc/sysconfig/ntpd file on hadoop102

[atguigu@hadoop102 ~]$ sudo vim /etc/sysconfig/ntpd

Add the following (synchronize hardware time with system time)

SYNC_HWCLOCK=yes

(4) Restart ntpd service

[atguigu@hadoop102 ~]$ sudo systemctl start ntpd

(5) Set ntpd service startup

[atguigu@hadoop102 ~]$ sudo systemctl enable ntpd

3) Other machine configurations (must be root user)

(1) Stop the ntpd service and disable its auto-start on the other nodes

[atguigu@hadoop103 ~]$ sudo systemctl stop ntpd
[atguigu@hadoop103 ~]$ sudo systemctl disable ntpd
[atguigu@hadoop104 ~]$ sudo systemctl stop ntpd
[atguigu@hadoop104 ~]$ sudo systemctl disable ntpd

(2) Configure one minute synchronization with the time server on other machines

[atguigu@hadoop103 ~]$ sudo crontab -e

Write timer tasks as follows:

*/1 * * * * /usr/sbin/ntpdate hadoop102

(3) Change the time on one of the other machines (to test the synchronization)

[atguigu@hadoop103 ~]$ sudo date -s "2021-9-11 11:11:11"

(4) Check if the machine is synchronized with the time server after 1 minute

[atguigu@hadoop103 ~]$ sudo date
