1, Process analysis
(1) Configure one virtual machine, for example with host name hadoop102 (turn off the firewall and configure a static IP), create the directories /opt/software and /opt/module, and change the owner and group of both directories to the current user, e.g. user zlc;
(2) Clone two more virtual machines from it, e.g. hadoop103 and hadoop104;
(3) On hadoop102 only: install the JDK and Hadoop (installation packages under /opt/software, extracted to /opt/module) and configure their environment variables;
(4) Copy the JDK and Hadoop directories and their configuration from hadoop102 to hadoop103 and hadoop104;
(5) Configure the cluster;
(6) Start the daemons one by one (single-point start);
(7) Configure SSH password-free login;
(8) Start the whole cluster at once and test it.
2, Virtual machine preparation
1. Configure one virtual machine, for example with host name hadoop102 (turn off the firewall and configure a static IP), and create the directories /opt/software and /opt/module;
2. Clone two more virtual machines, e.g. hadoop103 and hadoop104;
3. On hadoop102 only: install the JDK and Hadoop (installation packages under /opt/software, extracted to /opt/module) and configure their environment variables.
3, Copying the JDK and Hadoop directories: the scp command
1. Since the JDK and Hadoop are already installed on hadoop102, they do not need to be installed again on hadoop103 and hadoop104; the scp command can simply copy the JDK and Hadoop directories and their configuration from hadoop102 to the other two machines.
2. scp (secure copy) copies data between servers. Basic syntax: scp -r <source path/file> <destination user>@<destination host>:<destination path>. For example:
# Copy the JDK from hadoop102 to hadoop103
scp -r /opt/module/jdk1.8.0_212 zlc@hadoop103:/opt/module
# Copy Hadoop from hadoop102 to hadoop103
scp -r hadoop-3.1.3/ zlc@hadoop103:/opt/module/
(Note: scp supports not only pushing from the source machine to the destination machine, but also pulling: you can push files from hadoop102 to hadoop103, or, while logged in on hadoop103, pull files from hadoop102. You can even, while on hadoop103, copy files from hadoop102 to hadoop104.)
3. The following figure shows copying the JDK from hadoop102 to hadoop103:
Similarly, Hadoop is copied from hadoop102 to hadoop103.
4. On hadoop103, copy the files of hadoop102 to hadoop104:
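A minimal sketch of the pull and three-machine copies described above, assuming user zlc and the same paths as on hadoop102:

# On hadoop103: pull the JDK from hadoop102
scp -r zlc@hadoop102:/opt/module/jdk1.8.0_212 /opt/module/
# On hadoop103: copy the contents of /opt/module on hadoop102 straight to hadoop104
scp -r zlc@hadoop102:/opt/module/* zlc@hadoop104:/opt/module/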
Extended supplement
Remote synchronization tool: rsync command
rsync is mainly used for backup and mirroring. It is fast, avoids copying identical content, and supports symbolic links. The difference between rsync and scp: rsync is faster because it transfers only the files that differ, while scp copies every file.
Syntax: rsync -av <source path/file> <destination user>@<destination host>:<destination path>
For example:
# On hadoop103: delete /opt/module/hadoop-3.1.3/wcinput
rm -rf wcinput/
# Synchronize /opt/module/hadoop-3.1.3 on hadoop102 to hadoop103
rsync -av hadoop-3.1.3/ zlc@hadoop103:/opt/module/hadoop-3.1.3/
4, Distributing the environment variable configuration: writing the cluster distribution script xsync
1. Write the xsync script: a cluster distribution script that loops over all nodes and copies files to the same directory on each of them.
We want to be able to call the script from any path, so it must live in a directory that is on the global PATH:
therefore, create the script xsync under /home/zlc/bin.
Its content is as follows (following the Atguigu (尚硅谷) Hadoop tutorial):
#!/bin/bash

#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Argument!
    exit;
fi

#2. Traverse all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    #3. Traverse all files/directories and send them one by one
    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
        then
            #5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            #6. Get the name of the current file
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
However, the script does not yet have execute permission; grant it with chmod 777 xsync (chmod +x xsync also works).
2. Use the xsync script to synchronize /home/zlc/bin (i.e. the script itself) to the other two machines (hadoop103 and hadoop104):
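For example, a minimal sketch of this step:

xsync /home/zlc/bin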
On hadoop103:
On hadoop104:
3. Synchronize the custom environment variable configuration file to the same directory on the other two machines (root privileges are needed here because the file lives in a system directory); see the sketch after step 4.
4. Reload the environment configuration on the other two machines (hadoop103 and hadoop104) and verify:
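A minimal sketch of steps 3 and 4, assuming the custom environment variables were placed in /etc/profile.d/my_env.sh (a hypothetical filename; adjust it to whatever you used when installing the JDK and Hadoop):

# On hadoop102: distribute the file (sudo because /etc/profile.d is root-owned)
sudo /home/zlc/bin/xsync /etc/profile.d/my_env.sh
# On hadoop103 and hadoop104: reload the environment and verify
source /etc/profile
java -version
hadoop version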
5, SSH password-free login
1. First SSH from hadoop102 to hadoop103.
After this, a hidden .ssh directory exists under the home directory (~) on hadoop102,
and likewise a hidden .ssh directory exists under the home directory on hadoop103.
2. Enter the .ssh directory on hadoop102 and execute
ssh-keygen -t rsa
to generate the public and private key pair for hadoop102.
3. Send hadoop102's public key to hadoop103 and hadoop104 (you can also configure password-free access to hadoop102 itself); see the sketch below.
The machine that is accessed without a password generates an authorized_keys authentication file, so configuring hadoop102 to access itself without a password also generates an authorized_keys file on the local machine.
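A minimal sketch of key generation and distribution, assuming ssh-copy-id is available:

# On hadoop102, in ~/.ssh
ssh-keygen -t rsa          # press Enter to accept the defaults
ssh-copy-id hadoop102      # password-free access to itself (creates ~/.ssh/authorized_keys locally)
ssh-copy-id hadoop103
ssh-copy-id hadoop104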
4. Similarly, repeat the above configuration on hadoop103 and hadoop104.
5. Run a distribution test: from hadoop102, distribute files to hadoop103 and hadoop104 (xsync should now run without prompting for passwords).
6, Cluster configuration
1. Cluster deployment plan: the NameNode and the SecondaryNameNode should not be installed on the same server.
The ResourceManager also consumes a lot of memory, so it should not be placed on the same machine as the NameNode or the SecondaryNameNode. For example, my plan (which matches the configuration files below) puts the NameNode on hadoop102, the ResourceManager on hadoop103 and the SecondaryNameNode on hadoop104, with a DataNode and a NodeManager on every node:
2. Configuration file description: Hadoop configuration files come in two types, default configuration files and custom configuration files. A user only needs to modify a custom configuration file, changing the corresponding property value, when a default value should be overridden.
- Default configuration files: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml, bundled inside the corresponding Hadoop jars; they hold the default values of all properties.
- Custom configuration files: core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml, located under $HADOOP_HOME/etc/hadoop; users modify them according to project requirements.
3. Configure the cluster. The custom contents of each configuration file are as follows:
Core configuration file core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify the address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Specify the storage directory for Hadoop data -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>
    </property>
    <!-- Configure the static user for HDFS web page login as atguigu -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>atguigu</value>
    </property>
</configuration>
HDFS configuration file hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- NameNode web access address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode (2nn) web access address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>
YARN configuration file yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify mapreduce_shuffle as the auxiliary service for MapReduce -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the address of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Inheritance of environment variables -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
MapReduce configuration file mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Specify that MapReduce programs run on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
4. Distribute the configured Hadoop configuration files to the whole cluster:
5. Check that the files arrived on hadoop103 and hadoop104:
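A sketch of the distribution and the check, assuming the xsync script from section 4:

# On hadoop102: distribute the configured files
xsync /opt/module/hadoop-3.1.3/etc/hadoop/
# On hadoop103 / hadoop104: verify that the configuration arrived
cat /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml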
7, Start the whole cluster and test it
1. Configure the workers file (be careful not to leave extra spaces or blank lines).
2. Distribute this file to all nodes of the cluster.
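A minimal sketch of these two steps, assuming the three hosts above and the xsync script from section 4. The file is /opt/module/hadoop-3.1.3/etc/hadoop/workers and should contain exactly one hostname per line:

hadoop102
hadoop103
hadoop104

Then distribute it:

xsync /opt/module/hadoop-3.1.3/etc/hadoop/workers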
3. Start the cluster:
(1) If the cluster is being started for the first time, you need to format the NameNode on the hadoop102 node.
After formatting, the data and logs directories are generated in the Hadoop installation directory.
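A sketch of the one-time formatting step, assuming it is run from the Hadoop installation directory on hadoop102:

cd /opt/module/hadoop-3.1.3
hdfs namenode -format
ls    # the data and logs directories should now exist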
(2) Start HDFS
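A sketch, assuming HDFS is started from the Hadoop installation directory (/opt/module/hadoop-3.1.3) on hadoop102:

# On hadoop102 (the NameNode host)
sbin/start-dfs.sh
# Then check the running daemons on every machine
jps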
On hadoop102:
On hadoop103:
On hadoop104:
Comparing against the cluster deployment plan above, the HDFS daemons are all on the planned hosts:
HDFS can also be managed visually from the web UI:
http://hadoop102:9870
(3) Start YARN on the node where the ResourceManager is configured (hadoop103).
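A sketch, assuming YARN is started from the Hadoop installation directory on hadoop103:

# On hadoop103 (the ResourceManager host)
sbin/start-yarn.sh
# Then check the running daemons on every machine
jps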
On hadoop102:
On hadoop103:
On hadoop104:
Comparing against the cluster deployment plan above, both the HDFS and YARN daemons are now on the planned hosts:
YARN can also be managed visually from the web UI:
http://hadoop103:8088
4. Basic cluster tests
(1) Create a directory
Create a directory on hadoop104, for example:
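A sketch, assuming a hypothetical /input directory (any path works):

hadoop fs -mkdir /input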
View the HDFS graphical management page from hadoop102:
(2) Upload a local file to HDFS:
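A sketch, assuming a small local file word.txt (a hypothetical name) and the /input directory created above:

hadoop fs -put word.txt /input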
The uploaded file can be seen on the visualization page.
(3) Upload a large local file to HDFS:
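A sketch, assuming the JDK installation package placed in /opt/software earlier; the exact filename is an assumption:

hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /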
(4) View the HDFS file storage path
After uploading a file, check where it is actually stored on disk:
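A sketch, assuming the data directory configured in core-site.xml (hadoop.tmp.dir = /opt/module/hadoop-3.1.3/data); the BP-* block-pool directory name is generated when the NameNode is formatted, so yours will differ:

# On a DataNode
cd /opt/module/hadoop-3.1.3/data/dfs/data/current
ls    # shows a BP-<id> directory; the block files sit in its finalized/subdir*/subdir* subdirectories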
(5) View the replicas of HDFS files
On hadoop102:
On hadoop103:
On hadoop104:
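As an alternative to inspecting each DataNode's data directory, the replication can also be checked from the command line; a sketch, assuming the hypothetical /input path used above:

hdfs fsck /input -files -blocks -locations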