Using IDEA to submit MapReduce jobs to pseudo distributed Hadoop remotely

Environment

  1. VirtualBox 6.1
  2. IntelliJ IDEA 2020.1.1
  3. Ubuntu-18.04.4-live-server-amd64
  4. jdk-8u251-linux-x64
  5. hadoop-2.7.7

Install pseudo distributed Hadoop

For installing pseudo-distributed Hadoop, refer to: Hadoop Installation Tutorial: Standalone / Pseudo-Distributed Configuration, Hadoop 2.6.0 (2.7.1) / Ubuntu 14.04 (16.04).

The steps are not repeated here; just note that you also need to configure and start YARN.

For networking, the virtual machine uses the host-only network mode, so the Windows host can reach the Hadoop server directly.

After a successful start, running jps should show the following processes:
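For reference, a typical jps listing for this setup (with HDFS, YARN, and the JobHistory server running) looks roughly like the following; the process IDs will differ:

12305 NameNode
12451 DataNode
12632 SecondaryNameNode
12880 ResourceManager
13011 NodeManager
13299 JobHistoryServer
13400 Jps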

Modify configuration

First, run ifconfig on the server to find its IP address. In my case it is 192.168.56.101, which is used in all the examples below.

Modify core-site.xml, changing localhost to the server's IP:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.56.101:9000</value>
</property>

Modify mapred-site.xml and add mapreduce.jobhistory.address:

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>192.168.56.101:10020</value>
</property>

If this property is missing, the following error is reported:

[main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)  

Modify yarn-site.xml and add the following:

<property>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.56.101:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>192.168.56.101:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>192.168.56.101:8031</value>
</property>

If these properties are missing, a similar error is reported:

INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

After changing the configuration, restart HDFS, YARN, and the JobHistory server.
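For example, assuming Hadoop is installed under /usr/local/hadoop on the server (adjust to your own installation directory), the restart looks roughly like this:

cd /usr/local/hadoop
./sbin/stop-yarn.sh
./sbin/stop-dfs.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
./sbin/start-dfs.sh
./sbin/start-yarn.sh
./sbin/mr-jobhistory-daemon.sh start historyserver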

Configure the Hadoop runtime environment on Windows

First, extract the same hadoop-2.7.7.tar.gz used on Linux into a directory on Windows; in this article, D:\ProgramData\hadoop.

Then configure the environment variables:

HADOOP_HOME=D:\ProgramData\hadoop

HADOOP_BIN_PATH=%HADOOP_HOME%\bin

HADOOP_PREFIX=D:\ProgramData\hadoop

In addition, append %HADOOP_HOME%\bin to the PATH variable.

Then download winutils from https://github.com/cdarlint/winutils, picking the version that matches your Hadoop; here, 2.7.7.

Copy winutils.exe into the %HADOOP_HOME%\bin directory and hadoop.dll into C:\Windows\System32.

Write WordCount

First create the data file wc.txt:

hello world
dog fish
hadoop 
spark
hello world
dog fish
hadoop 
spark
hello world
dog fish
hadoop 
spark

Then copy the file to the Linux server and use hdfs dfs -put /path/wc.txt ./input to put the data file into HDFS (see the commands below).
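Concretely, assuming wc.txt has been copied to the hadoop user's home directory on the server, the upload might look like this:

hdfs dfs -mkdir -p input
hdfs dfs -put ~/wc.txt input
hdfs dfs -ls input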

Then create a Maven project in IDEA and modify the pom.xml as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>WordCount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>aliyun</id>
            <name>aliyun</name>
            <url>https://maven.aliyun.com/repository/central/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.1.1</version>
        </dependency>
    </dependencies>


    <build>
        <finalName>${project.artifactId}</finalName>
    </build>

</project>

The next step is to write the WordCount program. Here I refer to:

https://www.cnblogs.com/frankdeng/p/9256254.html
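The driver below references a WordcountMapper and a WordcountReducer. A minimal sketch of these two classes, adapted from the standard Hadoop WordCount example (the linked article has the versions actually used), could look like this:

package cabbage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// WordcountMapper.java: emit (word, 1) for every whitespace-separated token in a line
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// WordcountReducer.java (a separate file): sum the counts for each word
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}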

Then modify WordcountDriver as follows:

package cabbage;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as the client of the YARN cluster:
 * it packages the run parameters of the MR program, specifies the jar,
 * and finally submits the job to YARN.
 */
public class WordcountDriver {
    /**
     * remove directory specified
     *
     * @param conf
     * @param dirPath
     * @throws IOException
     */
    private static void deleteDir(Configuration conf, String dirPath) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path targetPath = new Path(dirPath);
        if (fs.exists(targetPath)) {
            boolean delResult = fs.delete(targetPath, true);
            if (delResult) {
                System.out.println(targetPath + " has been deleted sucessfullly.");
            } else {
                System.out.println(targetPath + " deletion failed.");
            }
        }

    }


    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        // 1 get configuration information or job object instance
        Configuration configuration = new Configuration();
        System.setProperty("hadoop.home.dir", "D:\\ProgramData\\hadoop");
        configuration.set("mapreduce.framework.name", "yarn");
        configuration.set("fs.default.name", "hdfs://192.168.56.101:9000");
        configuration.set("mapreduce.app-submission.cross-platform", "true");//Cross platform submission
        configuration.set("mapred.jar","D:\\Work\\Study\\Hadoop\\WordCount\\target\\WordCount.jar");
        // 8. The job is submitted to YARN; Windows and Linux environment settings differ, hence the cross-platform flag above
//        configuration.set("mapreduce.framework.name", "yarn");
//        configuration.set("yarn.resourcemanager.hostname", "node22");
        //Delete the output directory first
        deleteDir(configuration, args[args.length - 1]);

        Job job = Job.getInstance(configuration);

        // 6 specify the local path of the jar package of this program
//        job.setJar("/home/admin/wc.jar");
        job.setJarByClass(WordcountDriver.class);

        // 2. Specify the Mapper/Reducer classes used by this job
        job.setMapperClass(WordcountMapper.class);
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);

        // 3 specify kv type of mapper output data
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 4 specifies the kv type of the final output data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 5 specify the directory of the original input file of the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job configuration and the jar containing the job classes to YARN and wait for completion
//        job.submit();
        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }
}

Key code

System.setProperty("HADOOP_USER_NAME", "hadoop");

If you do not add this line, you will get a permission error:

org.apache.hadoop.ipc.RemoteException: Permission denied: user=administration, access=WRITE, inode="/":root:supergroup:drwxr-xr-x

If the error persists after this change, you can consider changing the HDFS file permissions to 777, as shown below.
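For example, as a quick (but insecure) workaround, permissions on the target HDFS directory can be opened up like this (the /user/hadoop path is an assumption based on the setup above):

hdfs dfs -chmod -R 777 /user/hadoop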

Here I mainly refer to several articles

https://www.cnblogs.com/acmy/archive/2011/10/28/2227901.html

https://blog.csdn.net/jzy3711/article/details/85003606

System.setProperty("hadoop.home.dir", "D:\\ProgramData\\hadoop");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("fs.default.name", "hdfs://192.168.56.101:9000");
configuration.set("mapreduce.app-submission.cross-platform", "true");//Cross platform submission
configuration.set("mapred.jar","D:\\Work\\Study\\Hadoop\\WordCount\\target\\WordCount.jar");

If you don't add these lines, you will get an error such as:

Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class cabbage.WordcountMapper not found

Here I mainly refer to

https://blog.csdn.net/u011654631/article/details/70037219

//Delete the output directory first
deleteDir(configuration, args[args.length - 1]);

The output directory is not overwritten between runs; if it already exists and is not deleted first, Hadoop refuses to start the job, so the directory is removed here before submission.

Add dependency

Then add the dependent library references. Right-click the project and choose Open Module Settings (or press F12) to open the module properties.

Then click the plus sign on the right of Dependencies -> Library.

Then import all the corresponding packages under %HADOOP_HOME%.

Then import %HADOOP_HOME%\share\hadoop\tools\lib.

Then run Maven's package goal to build the jar, as shown below.
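From the project root (or via IDEA's Maven tool window), this is simply:

mvn clean package

With finalName set to ${project.artifactId}, the jar ends up at target\WordCount.jar, which is the path the driver passes to mapred.jar.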

Add resources

Create log4j.properties in resources and add the following:

log4j.rootLogger=INFO, stdout
#log4j.logger.org.springframework=INFO
#log4j.logger.org.apache.activemq=INFO
#log4j.logger.org.apache.activemq.spring=WARN
#log4j.logger.org.apache.activemq.store.journal=INFO
#log4j.logger.org.activeio.journal=INFO
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n

Then copy core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the Linux server into the resources directory. The final project structure is shown in the figure below.

Configure IDEA

After the above configuration, you can set the run configuration parameters.

Pay attention to two places:

  1. Program arguments: specify the input file and output folder, in the form hdfs://ip:9000/user/hadoop/xxx (see the example after this list)

  2. Working directory: set it to %HADOOP_HOME% (D:\ProgramData\hadoop in this article)
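For example, with the paths used in this article, the Program arguments field could look like this (the output directory is deleted automatically by the driver before each run):

hdfs://192.168.56.101:9000/user/hadoop/input hdfs://192.168.56.101:9000/user/hadoop/output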

Run

Click Run. If an error reports a missing dependency (in my case, the slf4j logging package), add it to the dependencies as described above.

After running, IDEA is shown as follows:

Then check the output: run hdfs dfs -cat ./output/* and the word counts are displayed, which confirms the job ran correctly.
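Given the wc.txt created above (each word appears three times), the expected output is:

dog	3
fish	3
hadoop	3
hello	3
spark	3
world	3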

If you have any questions, you can put them up in the comment area and discuss them together.

References

  1. http://dblab.xmu.edu.cn/blog/install-hadoop/
  2. https://blog.csdn.net/u011654631/article/details/70037219
  3. https://www.cnblogs.com/yjmyzz/p/how-to-remote-debug-hadoop-with-eclipse-and-intellij-idea.html
  4. https://www.cnblogs.com/frankdeng/p/9256254.html
  5. https://www.cnblogs.com/acmy/archive/2011/10/28/2227901.html
  6. https://blog.csdn.net/djw745917/article/details/88703888
  7. https://www.jianshu.com/p/7a1f131469f5
