Painlessly setting up a Hadoop cluster and running the WordCount program

Preparation

First, start the virtual machine. I use CentOS 7, but the steps are largely the same on other systems.

View local network information

Open the virtual network editor

Open the NAT settings and note the information shown there

View network connection status

You can see that the network is connected

Run the ifconfig command. If there is no eth0 interface, which is not what we are used to, and remote ssh connections are not possible, rename the interface configuration file as follows (if you already have eth0, you can skip this step).



cd /etc/sysconfig/network-scripts/
mv ifcfg-ens33 ifcfg-eth0

Change network information

If you already have eth0, you can start from here.
Switch to the root user first; otherwise you will not be able to save the file.

su

vim /etc/sysconfig/network-scripts/ifcfg-eth0

Change the following settings. Note that the IP address and gateway must be the ones you recorded for your own network.
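The screenshot with the exact values is not reproduced here, so as a rough sketch, a static configuration for ifcfg-eth0 might look like what the helper below prints. The function name render_ifcfg_eth0 and any addresses you pass in are placeholders, not values from this setup, and using the gateway as DNS1 is an assumption that usually holds for a NAT network.

```shell
# Print a static-IP ifcfg-eth0 body.
# Assumption: the addresses passed in are placeholders; take the real
# IP and gateway from your own NAT settings.
render_ifcfg_eth0() {
  local ip="$1" gateway="$2" netmask="${3:-255.255.255.0}"
  cat <<EOF
TYPE=Ethernet
BOOTPROTO=static
NAME=eth0
DEVICE=eth0
ONBOOT=yes
IPADDR=${ip}
NETMASK=${netmask}
GATEWAY=${gateway}
DNS1=${gateway}
EOF
}
```

For example, render_ifcfg_eth0 192.168.10.100 192.168.10.2 prints a file body that you can compare against your own ifcfg-eth0 before saving it.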

Restart the network service, and you can see that the changes take effect



service network restart

If the network service fails to restart, edit the GRUB defaults:

vim /etc/default/grub

Add the following
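The lines to add are not reproduced here. A common fix on CentOS 7, which I am only assuming is what is meant, is to append net.ifnames=0 biosdevname=0 to GRUB_CMDLINE_LINUX so that the kernel goes back to the eth0 naming scheme. The helper below is a sketch of that edit; its name is hypothetical and it relies on GNU sed.

```shell
# Append the legacy interface naming arguments to GRUB_CMDLINE_LINUX.
# Assumption: this is the intended change; the original screenshot is missing.
add_eth_naming_args() {
  local grub_file="$1"
  # Do nothing if the arguments are already present.
  if grep -q 'net.ifnames=0' "$grub_file"; then
    return 0
  fi
  # Insert the arguments just before the closing quote of the value.
  sed -i 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 net.ifnames=0 biosdevname=0"/' "$grub_file"
}
```

As root, this would be called as add_eth_naming_args /etc/default/grub. Calling it a second time leaves the file unchanged.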

Run the following command to apply the configuration

grub2-mkconfig -o /boot/grub2/grub.cfg


If it still does not work, reboot:

reboot

Change host name

vim /etc/hostname


Reboot the computer by executing the reboot command

Finally, check whether the configuration is correct:

Clone virtual machine to obtain slave1 and slave2 nodes

Shut down the virtual machine and open the clone wizard

Click Next

Click Next again

Choose to create a full clone

Select the installation location

Click Finish and wait.

Configure parameter information of slave1 and slave2

Use the same method as above to configure slave1 and slave2. Remember to give each node a different IP address, and use slave1 and slave2 as the host names.

vim /etc/sysconfig/network-scripts/ifcfg-eth0

vim /etc/hostname


Reboot and check whether the configuration took effect




Mapping host names to IP addresses

vim /etc/hosts

Add the following

Check whether the configuration is successful
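As a sketch of what the entries could look like: each line of /etc/hosts pairs an IP address with a host name. The helper below appends such a line unless the name is already present. The function name and the addresses used with it are placeholders, not values from this setup.

```shell
# Append an "IP hostname" line to a hosts file, skipping names already present.
# Assumption: the IP addresses you pass in are the ones you assigned to each node.
add_host_entry() {
  local hosts_file="$1" ip="$2" name="$3"
  # -w matches the name as a whole word, so "slave1" does not match "slave10".
  if ! grep -qwF -- "$name" "$hosts_file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}
```

A possible use, as root on every node, is add_host_entry /etc/hosts 192.168.10.100 master, and likewise for slave1 and slave2 with their own addresses. Afterwards, ping -c 3 slave1 should resolve the name.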



Configure passwordless ssh login

For convenience, I have combined the steps into a single command. Press Enter at every prompt while it runs.

ssh-keygen -t rsa&&cd  ~/.ssh/&&cat  ~/.ssh/id_rsa.pub >>  ~/.ssh/authorized_keys&&chmod 600 ~/.ssh/authorized_keys&&cat ~/.ssh/authorized_keys&&ls

The results are as follows

Run this command on master, slave1, and slave2, then copy the contents of authorized_keys from each slave (the part highlighted in red in the screenshot above) into authorized_keys on the master node.
The final result looks like this:

Copy the combined authorized_keys file to the slave nodes, then test whether passwordless login works.




scp ~/.ssh/authorized_keys root@slave1:~/.ssh/
scp ~/.ssh/authorized_keys root@slave2:~/.ssh/
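To test the passwordless login without being dropped into a password prompt, a small loop like the following can help. It is only a sketch: the function name is hypothetical, and it assumes you log in as root, as in the scp commands above. BatchMode=yes makes ssh fail instead of asking for a password.

```shell
# Try a non-interactive ssh login to each host and report the outcome.
# Assumption: logins use the root account, as in the scp commands.
check_passwordless_ssh() {
  local host failed=0
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so a missing key shows up as a failure.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@${host}" hostname >/dev/null 2>&1; then
      echo "${host}: ok"
    else
      echo "${host}: FAILED"
      failed=1
    fi
  done
  return "$failed"
}
```

On the master node, check_passwordless_ssh slave1 slave2 prints one line per node and returns a non-zero status if any login failed.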

Turn off firewall and SELinux

Do the following on all nodes:

yum install iptables-services
systemctl stop firewalld

On the master node, do the following:

vim /etc/selinux/config
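The change to make in this file is not shown here. Turning SELinux off normally means setting SELINUX=disabled, which is what I assume is intended. The sketch below makes that edit with GNU sed; the function name is hypothetical.

```shell
# Set SELINUX=disabled in an SELinux config file.
# Assumption: disabling (rather than permissive mode) is the intended setting.
set_selinux_disabled() {
  local config_file="$1"
  # The pattern requires "=" right after SELINUX, so SELINUXTYPE= is left alone.
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$config_file"
}
```

As root, set_selinux_disabled /etc/selinux/config applies the change, which takes effect after a reboot. To relax SELinux for the current session as well, setenforce 0 switches it to permissive mode.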

Install JDK

Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Go to the folder that contains the JDK archive and run the following (remember to substitute the name of your own archive)

mkdir -p /usr/local/java # Create the folder you want
tar -vzxf jdk-8u251-linux-x64.tar.gz -C /usr/local/java/ # Extract to the specified location

Check the name of the extracted folder

Add the following at the bottom of /etc/profile (or another profile file of your choice)

export JAVA_HOME=/usr/local/java/jdk1.8.0_251
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin

Apply the environment variables:

source /etc/profile

To see if the installation was successful:

java -version

Create a new user

adduser hadoop

Do the following

Output like the following indicates that the configuration was successful

Give the new user superuser privileges:



vim /etc/sudoers

Change it to the following form

Hadoop environment configuration

Download and install

Download website: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Place the Hadoop archive in a folder of your choice, go to that directory, and run the following command.

tar -vzxf hadoop-3.1.3.tar.gz -C /usr/ && cd /usr/hadoop-3.1.3 && mkdir -p dfs/name && mkdir -p dfs/data && mkdir temp && ls

Environment configuration

cd ./etc/hadoop/&&vim hadoop-env.sh

Add the following environment variables

export JAVA_HOME=/usr/local/java/jdk1.8.0_251/
export HADOOP_HOME=/usr/hadoop-3.1.3

vim yarn-env.sh

Add the following:

if [ "$JAVA_HOME" != "" ];then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=/usr/local/java/jdk1.8.0_251/
fi


Open the workers file (named slaves in Hadoop 2.x) in the current folder

vim workers

Delete the default hostname (localhost) and add your own node names, one per line.

vim /etc/profile

Add the following environment variables

export HADOOP_HOME=/usr/hadoop-3.1.3
export PATH=$HADOOP_HOME/bin:$PATH

Reload the profile so that the new variables take effect (source /etc/profile, as before)

Change the sh files

cd /usr/hadoop-3.1.3/sbin/

In start-dfs.sh and stop-dfs.sh, add the following parameters at the top of both files:

HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root

In start-yarn.sh and stop-yarn.sh, add the following at the top as well:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Change the xml files

cd /usr/hadoop-3.1.3/etc/hadoop/
vim core-site.xml

Add the following information

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000</value>
        </property>
        <property>
                <name>io.file.buffer.size</name>
                <value>131072</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/usr/hadoop-3.1.3/temp</value>
                <description>A base for other temporary directories.</description>
        </property>
        <property>
                <name>hadoop.proxyuser.hduser.hosts</name>
                <value>*</value>
        </property>
        <property>
                <name>hadoop.proxyuser.hduser.groups</name>
                <value>*</value>
        </property>
</configuration>

vim hdfs-site.xml

Add the following information


<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>master:9001</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/hadoop-3.1.3/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/hadoop-3.1.3/dfs/data</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
        <property>
                <name>dfs.webhdfs.enabled</name>
                <value>true</value>
        </property>
</configuration>

vim mapred-site.xml

Add the following information

<configuration>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>master:19888</value>
        </property>
</configuration>

vim yarn-site.xml

Add the following information

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>master:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>master:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>master:8031</value>
        </property>
        <property>
                <name>yarn.resourcemanager.admin.address</name>
                <value>master:8033</value>
        </property>
        <property>
                <name>yarn.resourcemanager.webapp.address</name>
                <value>master:8088</value>
        </property>
</configuration>

Check that the installation works

hadoop version


Then run:

hadoop classpath


Copy the printed output and add it to yarn-site.xml, as I do below.

       <property>
        <name>yarn.application.classpath</name>
        <value>/usr/hadoop-3.1.3/etc/hadoop:/usr/hadoop-3.1.3/share/hadoop/common/lib/*:/usr/hadoop-3.1.3/share/hadoop/common/*:/usr/hadoop-3.1.3/share/hadoop/hdfs:/usr/hadoop-3.1.3/share/hadoop/hdfs/lib/*:/usr/hadoop-3.1.3/share/hadoop/hdfs/*:/usr/hadoop-3.1.3/share/hadoop/mapreduce/lib/*:/usr/hadoop-3.1.3/share/hadoop/mapreduce/*:/usr/hadoop-3.1.3/share/hadoop/yarn:/usr/hadoop-3.1.3/share/hadoop/yarn/lib/*:/usr/hadoop-3.1.3/share/hadoop/yarn/*</value>
       </property>

Transfer and connect

Copy Hadoop to the two slave nodes

scp -r /usr/hadoop-3.1.3/ root@slave1:/usr/&&scp -r /usr/hadoop-3.1.3/ root@slave2:/usr/

Format namenode

/usr/hadoop-3.1.3/bin/hdfs namenode -format

Start the cluster

/usr/hadoop-3.1.3/sbin/stop-all.sh&&/usr/hadoop-3.1.3/sbin/start-dfs.sh&&/usr/hadoop-3.1.3/sbin/start-yarn.sh

Check whether it started successfully

hdfs dfsadmin -report

If the number of live datanodes is not 0, the cluster started successfully
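If you would rather check this in a script than read the report by eye, the count can be pulled out of the report text. The sketch below assumes the report contains a line of the form "Live datanodes (N):" and prints 0 when no such line is found; the function name is hypothetical.

```shell
# Read an "hdfs dfsadmin -report" on stdin and print the number of live datanodes.
# Assumption: the report contains a line like "Live datanodes (2):".
count_live_datanodes() {
  local n
  n="$(sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p' | head -n 1)"
  # Fall back to 0 when the line is missing.
  echo "${n:-0}"
}
```

It is meant to be used as hdfs dfsadmin -report | count_live_datanodes, and the printed number can then be compared with the number of nodes you expect.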

If unsuccessful: solution 1
If running jps on a slave node does not show a DataNode process, the likely cause is that /usr/hadoop-3.1.3/bin/hdfs namenode -format was run more than once:

Open hdfs-site.xml, find the two storage paths (dfs.namenode.name.dir and dfs.datanode.data.dir), and delete everything under them on the master and slave nodes.
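With the paths configured in hdfs-site.xml above, the cleanup can be sketched as follows. The function name is hypothetical. Be aware that this erases everything stored in HDFS, so it is only suitable for a fresh cluster like this one.

```shell
# Delete the contents of the NameNode and DataNode storage directories.
# Assumption: the directories are <hadoop_dir>/dfs/name and <hadoop_dir>/dfs/data,
# as configured in hdfs-site.xml. This wipes all HDFS data.
clear_hdfs_storage() {
  local hadoop_dir="$1" sub
  for sub in dfs/name dfs/data; do
    # ${hadoop_dir:?} aborts on an empty argument, so this never expands to /dfs/...
    rm -rf "${hadoop_dir:?}/${sub}"/*
  done
}
```

On each node this would be run as clear_hdfs_storage /usr/hadoop-3.1.3, with the cluster stopped first. The directories themselves are kept; only their contents are removed.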

Run the following commands again

/usr/hadoop-3.1.3/bin/hdfs namenode -format
/usr/hadoop-3.1.3/sbin/stop-all.sh&&/usr/hadoop-3.1.3/sbin/start-dfs.sh&&/usr/hadoop-3.1.3/sbin/start-yarn.sh
hdfs dfsadmin -report

If unsuccessful: solution 2
If the report lists datanodes like the following

the cause is probably that the firewall was not turned off.
Run the following command on all nodes:

systemctl stop firewalld

Run the following commands again

/usr/hadoop-3.1.3/sbin/stop-all.sh&&/usr/hadoop-3.1.3/sbin/start-dfs.sh&&/usr/hadoop-3.1.3/sbin/start-yarn.sh
hdfs dfsadmin -report

Run the Wordcount program

Pick a few txt files and upload them to the specified HDFS path (replace /path/to/your/txt/files with the local folder that holds them)

hdfs dfs -mkdir -p /usr/hadoop-3.1.3/input && hdfs dfs -put /path/to/your/txt/files/* /usr/hadoop-3.1.3/input && hdfs dfs -ls /usr/hadoop-3.1.3/input

Note that the output path must not exist in advance. If it exists, delete it with the following command:

hdfs dfs -rm -r /usr/hadoop-3.1.3/output

Run the Wordcount program:

hadoop jar /usr/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /usr/hadoop-3.1.3/input /usr/hadoop-3.1.3/output

Output like the following shows that the job ran successfully

View the output folder

hdfs dfs -ls /usr/hadoop-3.1.3/output

Print the results

hdfs dfs -cat /usr/hadoop-3.1.3/output/part-r-00000
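The result file holds one word per line followed by a tab and its count, so it is easy to rank. As a small sketch (the function name is hypothetical), the following sorts by count in descending order and keeps the first few lines:

```shell
# Read "word<TAB>count" lines on stdin and print the N most frequent words.
top_words() {
  local n="${1:-10}" tab
  tab="$(printf '\t')"
  # Sort numerically by the count column, highest first; break ties by word.
  sort -t "$tab" -k2,2nr -k1,1 | head -n "$n"
}
```

For example, hdfs dfs -cat /usr/hadoop-3.1.3/output/part-r-00000 | top_words 10 lists the ten most frequent words.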

Tags: Hadoop vim network ssh

Posted on Sun, 07 Jun 2020 06:55:31 -0400 by jrolands