Once Ubuntu is installed, download Hadoop and proceed with the installation.
1. Default environment
Ubuntu 18.04 64-bit as the system environment (Ubuntu 14.04 or 16.04, 32-bit or 64-bit, also works)
Download address for the hadoop-3.1.3.tar.gz file: Portal
You can use Thunder (Xunlei) to download it, which is faster.
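Alternatively, a minimal sketch of downloading it directly from the Apache archive on the command line (assuming the archive mirror still hosts this version; adjust the URL if the layout changes):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz -P ~/Desktop   # Save to the Desktop directory used later in this tutorial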
2. Preparations
Create a user named "hadoop" and use /bin/bash as its shell:
sudo useradd -m hadoop -s /bin/bash
Use the following command to set the password, which can simply be set to 123456, entering it twice as prompted:
sudo passwd hadoop
Grant the hadoop user administrator privileges to make deployment easier:
sudo adduser hadoop sudo
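To confirm that the hadoop user is now in the sudo group, you can optionally check its group membership:
groups hadoop   # the output should include "sudo"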
3. Update apt
After logging in as the hadoop user, first update apt; we will use apt to install software later, and without an update some packages may fail to install. Run the following command:
sudo apt-get update
2. Install SSH and configure SSH passwordless login
Both cluster and single-node modes require SSH login (similar to remote login: you log in to a Linux host and run commands there). Ubuntu has the SSH client installed by default; you also need to install the SSH server:
sudo apt-get install openssh-server
After installation, you can log in to this machine with the following command:
ssh localhost
At this point you will see the SSH first-login prompt; type yes, then enter the password 123456 as prompted, and you will be logged in to the local machine.
Logging in this way requires a password every time, so it is more convenient to configure SSH passwordless login.
First exit the ssh session opened above to return to the original terminal window, then use ssh-keygen to generate a key and add it to the authorized keys:
exit                                    # Exit the ssh localhost session from above
cd ~/.ssh/                              # If this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa                       # Press Enter at every prompt (3-4 times)
cat ./id_rsa.pub >> ./authorized_keys   # Add the key to the authorized keys
Then use the ssh localhost command to log on without entering a password
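To verify, you can log in and out once more; no password prompt should appear:
ssh localhost   # should log in without asking for a password
exit            # return to the original terminal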
3. Install JAVA Environment
Hadoop 3.1.3 requires JDK version 1.8 or above.
- Installation package jdk-8u162-linux-x64.tar.gz for JDK1.8
- You can download the JDK 1.8 installation package (extract code: ziyu) from Baidu Cloud Disk by clicking here. Please download the compressed file jdk-8u162-linux-x64.tar.gz to your local computer; we assume it is stored in the directory '/home/hadoop/Desktop/'.
- Execute the following command
cd /usr/lib
sudo mkdir jvm        # Create the /usr/lib/jvm directory to store the JDK files
cd ~                  # Enter the home directory of the hadoop user
cd Desktop            # Note case sensitivity; the JDK package jdk-8u162-linux-x64.tar.gz was just uploaded to this directory by FTP software
sudo tar -zxvf ./jdk-8u162-linux-x64.tar.gz -C /usr/lib/jvm   # Unpack the JDK into the /usr/lib/jvm directory
- Setting environment variables
Open the hadoop user's environment variable file ~/.bashrc (for example with vim ~/.bashrc) and add the following lines at the beginning of the file:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Save the .bashrc file and exit the vim editor, then run the following command to make the .bashrc configuration take effect immediately:
source ~/.bashrc
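To confirm that the variable took effect, you can print it:
echo $JAVA_HOME   # should print /usr/lib/jvm/jdk1.8.0_162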
Use java -version to check whether the Java environment was installed successfully.
4. Install Hadoop 3.1.3
- Install Hadoop into /usr/local/:
sudo tar -zxf ~/Desktop/hadoop-3.1.3.tar.gz -C /usr/local   # Unpack into /usr/local
cd /usr/local/
sudo mv ./hadoop-3.1.3/ ./hadoop                            # Rename the folder to hadoop
sudo chown -R hadoop ./hadoop                               # Change the owner of the files to the hadoop user
- Hadoop is ready to use once decompressed. Enter the following command to check whether Hadoop is available; on success it displays the Hadoop version information:
cd /usr/local/hadoop
./bin/hadoop version
4.1 Hadoop Single-Machine Configuration (Non-Distributed)
Hadoop's default mode is non-distributed (local) mode, which runs without any additional configuration. In this mode Hadoop runs as a single Java process, which makes debugging easy.
- Let's run the grep example: it takes all files in the input folder as input, extracts the words that match the regular expression dfs[a-z.]+, counts their occurrences, and writes the results to the output folder.
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # Use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*                  # View the run results
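For reference, with the stock Hadoop 3.1.3 configuration files as input, the regular expression typically matches only one term, so the output usually looks roughly like:
1       dfsadmin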
4.2 Hadoop Pseudo-Distributed Installation
Hadoop can run in a pseudo-distributed manner on a single node: the Hadoop daemons run as separate Java processes, and the node acts as both NameNode and DataNode while reading files from HDFS. The Hadoop configuration files are located in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two configuration files, core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format; each configuration item is declared by setting a property's name and value.
- Modify the configuration file core-site.xml (it is easier to edit with gedit: gedit ./etc/hadoop/core-site.xml) by changing
<configuration>
</configuration>
to:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
- Similarly, modify the configuration file hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
- Notes on the Hadoop configuration files
Hadoop's running mode is determined by its configuration files (which are read when Hadoop starts), so to switch back from pseudo-distributed to non-distributed mode you need to delete the configuration items added to core-site.xml.
Additionally, pseudo-distributed mode can run with only fs.defaultFS and dfs.replication configured (as in the official tutorial). However, if the hadoop.tmp.dir parameter is not configured, the default temporary directory is /tmp/hadoop-hadoop, which may be cleaned by the system on reboot, forcing you to run the format step again. So we set it, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise errors may occur in later steps.
Once the configuration is complete, format the NameNode:
cd /usr/local/hadoop
./bin/hdfs namenode -format
If successful, you will see the message "successfully formatted".
If this step reports the error "Error: JAVA_HOME is not set and could not be found.", the JAVA_HOME environment variable was not set earlier; set JAVA_HOME first as described above, otherwise the subsequent steps cannot proceed. If you have already set JAVA_HOME in the .bashrc file as described above and still get "Error: JAVA_HOME is not set and could not be found.", go to the Hadoop installation directory and edit the configuration file **"/usr/local/hadoop/etc/hadoop/hadoop-env.sh"**: find the line "export JAVA_HOME=${JAVA_HOME}" and change it to the concrete path of your Java installation, for example "export JAVA_HOME=/usr/lib/jvm/default-java", then start Hadoop again.
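With the JDK installed earlier in this tutorial, the modified line in hadoop-env.sh would be:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162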
4.3 Start the NameNode and DataNode Daemons
cd /usr/local/hadoop
./sbin/start-dfs.sh   # start-dfs.sh is a complete executable name, with no spaces in it
Once startup is complete, you can use the jps command to check whether it succeeded.
If startup succeeded, the following processes are listed: "NameNode", "DataNode", and "SecondaryNameNode".
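A quick check (process IDs will differ from machine to machine):
jps   # the listing should include NameNode, DataNode, SecondaryNameNode, and Jps itself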
5. Running a Hadoop Pseudo-Distributed Instance
Pseudo-distributed mode reads data from HDFS. To use HDFS, first create a user directory in HDFS:
./bin/hdfs dfs -mkdir -p /user/hadoop
The command above begins with "./bin/hdfs dfs". There are actually three shell command styles:
1. hadoop fs
2. hadoop dfs
3. hdfs dfs
hadoop fs works with any file system, such as the local file system and HDFS; hadoop dfs applies only to the HDFS file system; hdfs dfs, like hadoop dfs, applies only to the HDFS file system.
Next, copy the XML files under ./etc/hadoop into the distributed file system as input files, that is, copy /usr/local/hadoop/etc/hadoop into /user/hadoop/input on the distributed file system. Since we are using the hadoop user and the corresponding user directory /user/hadoop has already been created, we can use relative paths such as input in the commands; the corresponding absolute path is /user/hadoop/input:
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
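With the input files in place, you can run the same grep example as in the single-machine section, this time reading from and writing to HDFS, and then view the results stored in HDFS (a sketch mirroring the commands used above):
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep input output 'dfs[a-z.]+'   # Run the grep example with HDFS input and output
./bin/hdfs dfs -cat output/*   # View the results on HDFS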
You can also retrieve the results to the local file system:
rm -r ./output                        # First delete the local output folder (if it exists)
./bin/hdfs dfs -get output ./output   # Copy the output folder from HDFS to the local machine
cat ./output/*
When Hadoop runs a program, the output directory must not already exist, otherwise it raises the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". So, to run the example again, **you need to execute the following command to delete the output folder first:**
./bin/hdfs dfs -rm -r output # Delete output folder
The output directory must not exist when running a program
When running Hadoop programs, the output directory specified by the program (such as output) must not exist; this prevents results from being overwritten, and Hadoop reports an error otherwise, so the output directory has to be deleted before each run. When developing applications, consider adding the following code to the program to delete the output directory automatically on each run and avoid tedious command-line operations:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf);
/* Delete the output directory if it already exists */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);
6. Last but not least, you must shut down Hadoop correctly, otherwise the next startup may fail and you may need to reformat the NameNode
To shut down Hadoop, run
./sbin/stop-dfs.sh
Note
The next time you start Hadoop, there is no need to initialize the NameNode again; just run ./sbin/start-dfs.sh!