Hadoop pseudo-distributed deployment and common operations

hadoop pseudo-distributed deployment

Having deployed hadoop 2.x before, this time the deployment is of hadoop 3.x.

hadoop has three components: hdfs for storing data, mapreduce for computation (jobs), and yarn for resource management (cpu, memory) and job scheduling.

In fact, the official hadoop website has deployment steps:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

The following is a pseudo-distributed deployment based on the official site.

1. Supported Platforms


Supported platforms for hadoop: GNU/Linux and Windows. Hadoop has been demonstrated on GNU/Linux clusters with up to 2000 nodes.
The deployment here is done on a Linux machine.

2. Necessary Software


Java must be installed. Note that the hadoop website mentions some Java versions with known bugs that are not recommended, so before deploying it is best to check which Java versions should be avoided and what the bugs are.
Installing Java means downloading the Java 8 package, extracting it into the /usr/java/ directory, changing the owner with chown, configuring the environment variables, and making them take effect.
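
A minimal sketch of those Java installation steps (the tarball name and download location are assumptions; substitute your own JDK 8 package):

# as root; assumes the JDK 8 tarball was already downloaded to the current directory
[root@hadoop001 ~]# mkdir -p /usr/java
[root@hadoop001 ~]# tar -xzvf jdk-8u45-linux-x64.tar.gz -C /usr/java/
[root@hadoop001 ~]# chown -R root:root /usr/java/jdk1.8.0_45
# add to /etc/profile:
#   export JAVA_HOME=/usr/java/jdk1.8.0_45
#   export PATH=$JAVA_HOME/bin:$PATH
[root@hadoop001 ~]# source /etc/profile
[root@hadoop001 ~]# java -version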

SSH must also be installed, because the hadoop scripts manage the remote hadoop daemons over ssh. You can check with which ssh whether ssh is already present; if it is not, it needs to be installed.
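
A quick way to check (a sketch assuming a CentOS-style system; the yum package names are the usual ones there):

[root@hadoop001 ~]# which ssh
[root@hadoop001 ~]# systemctl status sshd
# if ssh or sshd is missing:
[root@hadoop001 ~]# yum install -y openssh-clients openssh-server
[root@hadoop001 ~]# systemctl enable --now sshd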

3. Create a user to deploy hadoop

[root@hadoop001 ~]# useradd hadoop
[root@hadoop001 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
[root@hadoop001 ~]# su - hadoop
[hadoop@hadoop001 ~]$ mkdir sourcecode software app log data lib tmp

4. Configure ssh passwordless access

Following the official website:

First verify that ssh access to the local machine works (it will ask for a password).
Then configure passwordless access:
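
As the official site suggests, you can first check whether passwordless ssh already works:

[hadoop@hadoop001 ~]$ ssh localhost
# if this prompts for a password, passwordless access is not yet configured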

# Press Enter three times (accept the defaults and an empty passphrase)
[hadoop@hadoop001 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
91:35:29:64:ad:91:28:5d:6b:fa:60:06:b2:90:d0:f5 hadoop@hadoop001
The key's randomart image is:
+--[ RSA 2048]----+
|.. ... ++oo.     |
|... ..o.++o.     |
|o . ..E =+       |
| . o . o..       |
|  .   = S        |
|     o o         |
|        .        |
|                 |
|                 |
+-----------------+

[hadoop@hadoop001 ~]$ cd .ssh
[hadoop@hadoop001 .ssh]$ ll -a
total 12
drwx------.  2 hadoop hadoop   36 Dec  2 16:46 .
drwx------. 10 hadoop hadoop 4096 Dec  2 16:46 ..
-rw-------.  1 hadoop hadoop 1675 Dec  2 16:46 id_rsa
-rw-r--r--.  1 hadoop hadoop  397 Dec  2 16:46 id_rsa.pub

[hadoop@hadoop001 .ssh]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

[hadoop@hadoop001 .ssh]$ ll
total 12
-rw-rw-r--. 1 hadoop hadoop  397 Dec  2 16:46 authorized_keys
-rw-------. 1 hadoop hadoop 1675 Dec  2 16:46 id_rsa
-rw-r--r--. 1 hadoop hadoop  397 Dec  2 16:46 id_rsa.pub

[hadoop@hadoop001 .ssh]$ chmod 0600 ~/.ssh/authorized_keys

[hadoop@hadoop001 .ssh]$ ll
total 12
-rw-------. 1 hadoop hadoop  397 Dec  2 16:46 authorized_keys
-rw-------. 1 hadoop hadoop 1675 Dec  2 16:46 id_rsa
-rw-r--r--. 1 hadoop hadoop  397 Dec  2 16:46 id_rsa.pub
[hadoop@hadoop001 .ssh]$ 
[hadoop@hadoop001 .ssh]$ ssh hadoop001
The authenticity of host 'hadoop001 (192.168.14.128)' can't be established.
ECDSA key fingerprint is f9:95:75:87:df:44:35:2b:8e:0f:dc:eb:87:1b:57:ec.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoop001,192.168.14.128' (ECDSA) to the list of known hosts.
Last login: Thu Dec  2 15:37:15 2021
[hadoop@hadoop001 ~]$  exit
logout
Connection to hadoop001 closed.
[hadoop@hadoop001 .ssh]$ 

5. Download the hadoop installation package and configure it


Download and extract:

[hadoop@hadoop001 ~]$  cd software
[hadoop@hadoop001 software]$  wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
[hadoop@hadoop001 software]$ tar -xzvf hadoop-3.2.2.tar.gz -C ../app/
[hadoop@hadoop001 software]$ cd ../app
[hadoop@hadoop001 app]$  ln -s hadoop-3.2.2 hadoop

Configuration file hadoop-env.sh

Then configure the etc/hadoop/hadoop-env.sh file to add the JAVA_HOME environment variable inside it:

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ vi hadoop-env.sh
# Add the following line:
export JAVA_HOME=/usr/java/jdk1.8.0_45
# then save

There are three modes for a hadoop cluster:

Local (standalone) mode: no daemons are started; rarely used.
Pseudo-distributed mode: each daemon runs as a single process on one machine; used for learning.
Cluster (fully distributed) mode: the daemons run as many processes spread across multiple machines (masters plus many workers); used in production.
Pseudo-distributed mode is used here.

Configuration file core-site.xml

Pseudo-distributed mode, requires etc/hadoop/core-site.xml file.

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ vi core-site.xml
# Add the following:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value>
    </property>
</configuration>
# then save

Here hadoop001 is the machine's hostname; this setting means the namenode starts on hadoop001.

Configuration file hdfs-site.xml

In pseudo-distributed mode, the etc/hadoop/hdfs-site.xml file needs to be edited.

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ vi hdfs-site.xml
# Add the following:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
# then save

This sets the HDFS replication factor (the number of copies of each block) to 1.

Make the three hdfs processes start with the machine name hadoop001

Configuration file /etc/hosts

This file maps this machine's ip to its hostname.
Configure it so that the namenode, secondarynamenode, and datanode all start with hadoop001.
The following:

[root@hadoop001 ~]# vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.14.128 hadoop001
Configuration file core-site.xml

As configured above, fs.defaultFS is hdfs://hadoop001:9000.
This makes the namenode start with hadoop001.

Configuration file workers

etc/hadoop/workers (in hadoop 2.x this file is named slaves)

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ vi workers 
hadoop001
# originally localhost; change it to the machine name

Let the datanode start with hadoop001.

Configuration file hdfs-site.xml

Configure this to have secondarynamenode start with hadoop001:

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ vi hdfs-site.xml
# Add the following:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop001:9868</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>hadoop001:9869</value>
    </property>

</configuration>
# then save

These settings can be found in the default configuration files on the official website.

The point of the above configuration is to have all three daemons (namenode, secondarynamenode, and datanode) start with the same machine name, rather than one with hadoop001 and another with localhost.
The advantage is that if the machine's ip changes later, you only need to update the ip in the hosts file.
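
A quick sanity check (a sketch) that the hostname and the /etc/hosts mapping line up:

[hadoop@hadoop001 ~]$ hostname
hadoop001
[hadoop@hadoop001 ~]$ ping -c 1 hadoop001
# should resolve to 192.168.14.128 as written in /etc/hosts above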

Modifying the /tmp directory in the hadoop configuration files

By default, both the HDFS data and the pid files of the three processes live under /tmp/, and files in /tmp/ that have not been accessed within 30 days are cleaned up. For example, if the pid files of the three processes are deleted, then when you stop or restart the processes the corresponding pids cannot be found; you think you restarted them, but the old processes are still running with the previous configuration, so changing the hdfs configuration and "restarting" does not take effect.
Keeping the data directories under /tmp is just as risky.
Therefore, the /tmp/ locations in the corresponding configuration files need to be changed to another directory.
Here the /tmp/ directory is changed to another directory such as /home/hadoop/tmp.

Look at the default configuration on the official website:
core-default.xml:
hadoop.tmp.dir /tmp/hadoop-${user.name}

hdfs-default.xml:
dfs.namenode.name.dir file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir file://${hadoop.tmp.dir}/dfs/namesecondary

As the default configuration above shows, the data directories of the namenode, secondarynamenode, and datanode all sit under /tmp, so the core-site.xml file needs to be configured:
/home/hadoop/app/hadoop/etc/hadoop/core-site.xml
Edit it and add the following:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp/hadoop-${user.name}</value>
    </property>
  # note that the value is not just /home/hadoop/tmp; it is followed by hadoop-${user.name}

In addition, where is the directory for the pid files changed?
In the configuration file hadoop-env.sh.
Edit the /home/hadoop/app/hadoop/etc/hadoop/hadoop-env.sh file
and find the following two lines:

# Where pid files are stored.  /tmp by default.
# export HADOOP_PID_DIR=/tmp

Modify to:

# Where pid files are stored.  /tmp by default.
 export HADOOP_PID_DIR=/home/hadoop/tmp
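
Later, once the hdfs daemons have been started (step 6), a quick check (a sketch) that the pid files now land in the new directory instead of /tmp:

[hadoop@hadoop001 hadoop]$ ls /home/hadoop/tmp/
# expect files like hadoop-hadoop-namenode.pid alongside the hadoop-${user.name} data directory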

In addition, the above is a pseudo-distributed deployment; in production there will be more than one node, so look again at
hdfs-default.xml:
dfs.namenode.name.dir file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir file://${hadoop.tmp.dir}/dfs/namesecondary
In particular, dfs.datanode.data.dir should point at more than one directory. For example, a blade server with 10 physical disks and 5T of space in total would look like:
dfs.datanode.data.dir : /data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn...

If one disk writes at 30M/s, then 10 disks give 300M/s.
Multiple disks provide more storage space and better read/write IO throughput, definitely faster than a single disk.
Therefore, the DN data directory parameter in production must never be left at the default under ${hadoop.tmp.dir}; it needs to be written out explicitly according to your actual disk layout.
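
For reference, a sketch of how that multi-disk setting might look in hdfs-site.xml on a production datanode (the /data01 to /data03 mount points are just example names standing in for your real disks):

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn</value>
    </property>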

6.namenode formatting

Execute the command bin/hdfs namenode -format to format the file system:

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@hadoop001 hadoop]$ bin/hdfs namenode -format
WARNING: /home/hadoop/app/hadoop-3.2.2/logs does not exist. Creating.
2021-12-02 22:59:25,107 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop001/192.168.14.128
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.2.2
.....
2021-12-02 22:59:25,905 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1643070568-192.168.14.128-1638457165899
2021-12-02 22:59:25,916 INFO common.Storage: Storage directory /home/hadoop/tmp/hadoop-hadoop/dfs/name has been successfully formatted.
2021-12-02 22:59:25,933 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-12-02 22:59:25,992 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 400 bytes saved in 0 seconds .
2021-12-02 22:59:26,000 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-12-02 22:59:26,006 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2021-12-02 22:59:26,006 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop001/192.168.14.128
************************************************************/
[hadoop@hadoop001 hadoop]$ 

Then start the NameNode and DataNode daemons by executing sbin/start-dfs.sh:

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@hadoop001 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [hadoop001]
Starting datanodes
Starting secondary namenodes [hadoop001]
[hadoop@hadoop001 hadoop]$

You can view the related logs here:

[hadoop@hadoop001 logs]$ pwd
/home/hadoop/app/hadoop/logs
[hadoop@hadoop001 logs]$  ll
total 120
-rw-rw-r--. 1 hadoop hadoop 33392 Dec  2 23:04 hadoop-hadoop-datanode-hadoop001.log
-rw-rw-r--. 1 hadoop hadoop   691 Dec  2 23:04 hadoop-hadoop-datanode-hadoop001.out
-rw-rw-r--. 1 hadoop hadoop 38007 Dec  2 23:04 hadoop-hadoop-namenode-hadoop001.log
-rw-rw-r--. 1 hadoop hadoop   691 Dec  2 23:04 hadoop-hadoop-namenode-hadoop001.out
-rw-rw-r--. 1 hadoop hadoop 30465 Dec  2 23:04 hadoop-hadoop-secondarynamenode-hadoop001.log
-rw-rw-r--. 1 hadoop hadoop   691 Dec  2 23:04 hadoop-hadoop-secondarynamenode-hadoop001.out
-rw-rw-r--. 1 hadoop hadoop     0 Dec  2 22:59 SecurityAuth-hadoop.audit

If startup went fine, you can access the hdfs web interface at http://localhost:9870/
(replace localhost with the appropriate ip).
In Hadoop 2.x the hdfs web interface port was 50070; in the 3.x versions it is 9870.
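
A quick command-line check (a sketch) that the NameNode web UI is actually listening on 9870:

[hadoop@hadoop001 hadoop]$ curl -s -o /dev/null -w "%{http_code}\n" http://hadoop001:9870/
# a 200 response means the hdfs web interface is up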

7.hdfs and mapreduce validation

Verify as follows:

[hadoop@hadoop001 hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -mkdir input
#Note that writing 'input' without a leading / is a relative hdfs path; it is created under the user's hdfs home directory, i.e. /user/<username>/input.
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -put etc/hadoop/*.xml input
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:20 /user
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /user/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:20 /user/hadoop
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /user/hadoop/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:21 /user/hadoop/input
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /user/hadoop/input/
Found 9 items
-rw-r--r--   1 hadoop supergroup       9213 2021-12-02 23:21 /user/hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 hadoop supergroup       1012 2021-12-02 23:21 /user/hadoop/input/core-site.xml
-rw-r--r--   1 hadoop supergroup      11392 2021-12-02 23:21 /user/hadoop/input/hadoop-policy.xml
-rw-r--r--   1 hadoop supergroup       1124 2021-12-02 23:21 /user/hadoop/input/hdfs-site.xml
-rw-r--r--   1 hadoop supergroup        620 2021-12-02 23:21 /user/hadoop/input/httpfs-site.xml
-rw-r--r--   1 hadoop supergroup       3518 2021-12-02 23:21 /user/hadoop/input/kms-acls.xml
-rw-r--r--   1 hadoop supergroup        682 2021-12-02 23:21 /user/hadoop/input/kms-site.xml
-rw-r--r--   1 hadoop supergroup        758 2021-12-02 23:21 /user/hadoop/input/mapred-site.xml
-rw-r--r--   1 hadoop supergroup        690 2021-12-02 23:21 /user/hadoop/input/yarn-site.xml
[hadoop@hadoop001 hadoop]$
[hadoop@hadoop001 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar grep input output 'dfs[a-z.]+'
......
[hadoop@hadoop001 hadoop]$
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /user/hadoop/
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:21 /user/hadoop/input
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:24 /user/hadoop/output
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -ls /user/hadoop/output/
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2021-12-02 23:24 /user/hadoop/output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         90 2021-12-02 23:24 /user/hadoop/output/part-r-00000
[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -cat /user/hadoop/output/*
1	dfsadmin
1	dfs.replication
1	dfs.namenode.secondary.https
1	dfs.namenode.secondary.http
[hadoop@hadoop001 hadoop]$ 
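
Alternatively, as the official guide shows, you can copy the output directory from hdfs to the local filesystem and examine it there:

[hadoop@hadoop001 hadoop]$ bin/hdfs dfs -get output output
[hadoop@hadoop001 hadoop]$ cat output/*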

To check whether the three hdfs processes are running, and to stop them, the commands are as follows:

[hadoop@hadoop001 hadoop]$ jps
6017 Jps
4860 SecondaryNameNode
4525 NameNode
4671 DataNode
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ ps -ef|grep hadoop
hadoop      4525      1  0 23:04 ?        00:00:10 /usr/java/jdk1.8.0_45/bin/java -Dproc_namenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dyarn.log.file=hadoop-hadoop-namenode-hadoop001.log -Dyarn.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dyarn.root.logger=INFO,console -Djava.library.path=/home/hadoop/app/hadoop-3.2.2/lib/native -Dhadoop.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dhadoop.log.file=hadoop-hadoop-namenode-hadoop001.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop      4671      1  0 23:04 ?        00:00:09 /usr/java/jdk1.8.0_45/bin/java -Dproc_datanode -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dyarn.log.file=hadoop-hadoop-datanode-hadoop001.log -Dyarn.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dyarn.root.logger=INFO,console -Djava.library.path=/home/hadoop/app/hadoop-3.2.2/lib/native -Dhadoop.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dhadoop.log.file=hadoop-hadoop-datanode-hadoop001.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode
hadoop      4860      1  0 23:04 ?        00:00:05 /usr/java/jdk1.8.0_45/bin/java -Dproc_secondarynamenode -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dyarn.log.file=hadoop-hadoop-secondarynamenode-hadoop001.log -Dyarn.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dyarn.root.logger=INFO,console -Djava.library.path=/home/hadoop/app/hadoop-3.2.2/lib/native -Dhadoop.log.dir=/home/hadoop/app/hadoop-3.2.2/logs -Dhadoop.log.file=hadoop-hadoop-secondarynamenode-hadoop001.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-3.2.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
hadoop      6033   4057  0 23:29 pts/2    00:00:00 grep --color=auto hadoop
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ sbin/stop-dfs.sh
Stopping namenodes on [hadoop001]
Stopping datanodes
Stopping secondary namenodes [hadoop001]
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ jps
6523 Jps
[hadoop@hadoop001 hadoop]$

That's about deploying hdfs.

8. YARN deployment

In pseudo-distributed mode, you can run a MapReduce job on YARN by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.
Set the parameters below.
Edit the file: etc/hadoop/mapred-site.xml, add the following:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

Edit the file: etc/hadoop/yarn-site.xml, add the following:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Then start the two yarn daemons with sbin/start-yarn.sh, as follows:

[hadoop@hadoop001 hadoop]$ vi etc/hadoop/mapred-site.xml
[hadoop@hadoop001 hadoop]$ vi etc/hadoop/yarn-site.xml
[hadoop@hadoop001 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [hadoop001]
Starting datanodes
Starting secondary namenodes [hadoop001]
[hadoop@hadoop001 hadoop]$ jps
6713 NameNode
6861 DataNode
7181 Jps
7054 SecondaryNameNode
[hadoop@hadoop001 hadoop]$ 
[hadoop@hadoop001 hadoop]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[hadoop@hadoop001 hadoop]$ jps
7457 NodeManager
7777 Jps
7316 ResourceManager
6713 NameNode
6861 DataNode
7054 SecondaryNameNode
[hadoop@hadoop001 hadoop]$ 

yarn's ResourceManager can then be accessed through the web interface.
The address is: http://localhost:8088/
(replace localhost with the corresponding ip address).

I am on a virtual machine, so I can access it directly. If it is a cloud host, you need to open the corresponding port in your cloud provider's console.
Also note that on a cloud host an exposed port 8088 is easily hijacked for cryptocurrency mining, for example: https://segmentfault.com/a/1190000015264170
If necessary, you can change yarn's web interface port by editing the yarn-site.xml file and adding the following:

 <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop001:8123</value>
</property>

These settings can be found in the default configuration files on the official website.
Then restart the yarn processes.
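
A sketch of that restart, assuming the port change above was made:

[hadoop@hadoop001 hadoop]$ sbin/stop-yarn.sh
[hadoop@hadoop001 hadoop]$ sbin/start-yarn.sh
# the web interface is then reached on http://hadoop001:8123/ instead of port 8088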

Environment variable configuration

Configured environment variables for hadoop:

# Add the following to ~/.bashrc:
#export HADOOP_HOME=/home/hadoop/app/hadoop
#export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

[hadoop@hadoop001 ~]$ vi .bashrc
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ source .bashrc 
[hadoop@hadoop001 ~]$ which hdfs
~/app/hadoop/bin/hdfs
[hadoop@hadoop001 ~]$ which yarn
~/app/hadoop/bin/yarn
[hadoop@hadoop001 ~]$ which hadoop
~/app/hadoop/bin/hadoop
[hadoop@hadoop001 ~]$ 
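
With the environment variables in place, one more quick check that the hadoop command runs from anywhere:

[hadoop@hadoop001 ~]$ hadoop version
# should print Hadoop 3.2.2 plus the build information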

Validation of yarn

Run a MapReduce job.
You could repeat the hdfs validation method used above; here the wordcount example is used instead.

[hadoop@hadoop001 ~]$ vi wordcount.txt
word count
zhangsan lisi
word
word zhangsan happy hadppy
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ pwd
/home/hadoop
[hadoop@hadoop001 ~]$ hdfs dfs -mkdir /input
[hadoop@hadoop001 ~]$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-03 00:29 /input
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:20 /user
[hadoop@hadoop001 ~]$ hdfs dfs -put wordcount.txt /input
[hadoop@hadoop001 ~]$ hdfs dfs -ls /input/
Found 1 items
-rw-r--r--   1 hadoop supergroup         58 2021-12-03 00:30 /input/wordcount.txt
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ hdfs dfs -cat /input/wordcount.txt
word count
zhangsan lisi
word
word zhangsan happy hadppy

[hadoop@hadoop001 ~]$ find ./ -name *example*.jar
./app/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./app/hadoop-3.2.2/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-test-sources.jar
./app/hadoop-3.2.2/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-sources.jar
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ yarn jar ./app/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /input /output
2021-12-03 00:33:19,888 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
...
2021-12-03 00:33:28,218 INFO mapreduce.Job:  map 0% reduce 0%
2021-12-03 00:33:33,293 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-03 00:33:37,321 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-03 00:33:39,339 INFO mapreduce.Job: Job job_1638461493821_0001 completed successfully
....
[hadoop@hadoop001 ~]$
[hadoop@hadoop001 ~]$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2021-12-03 00:30 /input
drwxr-xr-x   - hadoop supergroup          0 2021-12-03 00:33 /output
drwx------   - hadoop supergroup          0 2021-12-03 00:33 /tmp
drwxr-xr-x   - hadoop supergroup          0 2021-12-02 23:20 /user
[hadoop@hadoop001 ~]$ 
[hadoop@hadoop001 ~]$ hdfs dfs -ls /output/
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2021-12-03 00:33 /output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         50 2021-12-03 00:33 /output/part-r-00000
[hadoop@hadoop001 ~]$ hdfs dfs -cat /output/*
count	1
hadppy	1
happy	1
lisi	1
word	3
zhangsan	2
[hadoop@hadoop001 ~]$ 

At this point, both hdfs and yarn are deployed.
