Submitting Spark jobs remotely to a YARN cluster

Reference article: How to submit Spark tasks to a YARN cluster remotely from IDEA

Spark jobs can be run in several modes:

1. Local mode: write code in IDEA and run it directly.

2. Standalone mode: package the program as a jar, upload it to the cluster, and run it with spark-submit.

3. YARN mode (client or cluster deploy mode): as above, the jar must be packaged and submitted to the cluster.

For your own testing, the methods above are cumbersome: every code change means repackaging, uploading to the cluster, and resubmitting with spark-submit, which wastes time. Here is how to submit to the YARN cluster remotely from your local IDEA.

Let's look directly at the demo below (the code itself is straightforward):

package spark

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RemoteSubmitApp {
  def main(args: Array[String]): Unit = {
    // Set the user that submits the job
    System.setProperty("HADOOP_USER_NAME", "root")
    val conf = new SparkConf()
      .setAppName("WordCount")
      // Submit in yarn-client mode
      .setMaster("yarn")
      // Hostname of the ResourceManager
      .set("yarn.resourcemanager.hostname", "master")
      // Number of executors
      .set("spark.executor.instances", "2")
      // Memory per executor
      .set("spark.executor.memory", "1024m")
      // YARN queue to submit the job to
      .set("spark.yarn.queue", "spark")
      // IP address of the driver (the local machine)
      .set("spark.driver.host", "192.168.17.1")
      // Path to the application jar; extra dependency jars can be added, separated by commas
      .setJars(List("D:\\develop_soft\\idea_workspace_2018\\sparkdemo\\target\\sparkdemo-1.0-SNAPSHOT.jar"))
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val scc = new StreamingContext(conf, Seconds(1))
    scc.sparkContext.setLogLevel("WARN")
    //scc.checkpoint("/spark/checkpoint")
    val topic = "jason_flink"
    val topicSet = Set(topic)
    val kafkaParams = Map[String, Object](
      "auto.offset.reset" -> "latest",
      "value.deserializer" -> classOf[StringDeserializer],
      "key.deserializer" -> classOf[StringDeserializer],
      "bootstrap.servers" -> "master:9092,storm1:9092,storm2:9092",
      "group.id" -> "jason_",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )
    val kafkaStreams = KafkaUtils.createDirectStream[String, String](
      scc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicSet, kafkaParams))
    kafkaStreams.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        rdd.foreachPartition(partition => {
          partition.foreach(record => {
            println(record.value().toString)
          })
        })
      }
    })
    scc.start()
    scc.awaitTermination()
  }
}

Then we run it with a right-click and look at the printed log:

...
19/08/16 23:17:35 INFO Client:
   client token: N/A
   diagnostics: AM container is launched, waiting for AM container to Register with RM
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: spark
   start time: 1565997454105
   final status: UNDEFINED
   tracking URL: http://master:8088/proxy/application_1565990507758_0020/
   user: root
19/08/16 23:17:36 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:37 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:38 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:39 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:40 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
19/08/16 23:17:40 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master, PROXY_URI_BASES -> http://master:8088/proxy/application_1565990507758_0020), /proxy/application_1565990507758_0020
19/08/16 23:17:40 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
19/08/16 23:17:40 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:41 INFO Client: Application report for application_1565990507758_0020 (state: RUNNING)
19/08/16 23:17:41 INFO Client:
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: 192.168.17.145
   ApplicationMaster RPC port: 0
   queue: spark
   start time: 1565997454105
   final status: UNDEFINED
   tracking URL: http://master:8088/proxy/application_1565990507758_0020/
   user: root
 ... 

You can see that the submission succeeded. Now open the YARN monitoring page and check whether the job is there.

 

You can see a Spark program running. Click into it to see how it is doing.

 

You can see that it is working properly. Pick a job and look at the logs printed by the executors.

This is the Kafka data we wrote, so everything is fine. To stop the program, just click stop in IDEA, which makes testing much easier.

Problems that may be encountered during operation:

1. First, put yarn-site.xml, core-site.xml, and hdfs-site.xml under the resources directory, because the program needs these configurations to run.
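If one of these files is missing from the classpath, the failure usually shows up later as an obscure connection error. As a minimal sketch (not from the original article, names are my own), you can fail fast at startup by checking that the files are actually visible on the classpath:

```scala
// Hypothetical startup check: verify the Hadoop config files copied into
// src/main/resources are visible on the runtime classpath.
object ConfigFilesCheck {
  val required = Seq("yarn-site.xml", "core-site.xml", "hdfs-site.xml")

  // Names of required files that cannot be found on the classpath
  def missing: Seq[String] =
    required.filter(name => getClass.getClassLoader.getResource(name) == null)

  def main(args: Array[String]): Unit = {
    if (missing.nonEmpty)
      sys.error(s"Missing Hadoop config files on classpath: ${missing.mkString(", ")}")
    else
      println("All Hadoop config files found on the classpath")
  }
}
```

Call this at the top of main, before building the SparkConf, so the error message points straight at the missing file.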

2. Permission issues

  Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
  Permission denied: user=JasonLee, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:342)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1744)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1687)
    at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:60)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2980)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1096)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)

This happens because the job was submitted locally, so the user name is JasonLee, which has no write access to HDFS. The easiest fix is to set the user to root in the code:

System.setProperty("HADOOP_USER_NAME", "root")
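Note that this property is only read once, so it has to be set before any Spark or Hadoop class runs. A small helper (hypothetical, my own naming) makes that ordering explicit:

```scala
// Hypothetical helper: set the Hadoop user before any Spark/Hadoop code runs.
// If this is called after Hadoop's UserGroupInformation has initialized, the
// local OS user (JasonLee in the error above) may already be cached.
object HadoopUser {
  def impersonate(user: String): Unit =
    System.setProperty("HADOOP_USER_NAME", user)
}
```

Call `HadoopUser.impersonate("root")` as the very first line of main, as the demo above does with System.setProperty directly.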

3. Missing environment variables

  Exception in thread "main" java.lang.IllegalStateException:
  Library directory 'D:\develop_soft\idea_workspace_2018\sparkdemo\assembly\target\scala-2.11\jars' does not exist; make sure Spark is built.
    at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
    at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:347)
    at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:526)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:814)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
    at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:839)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
    at spark.RemoteSubmitApp$.main(RemoteSubmitApp.scala:31)
    at spark.RemoteSubmitApp.main(RemoteSubmitApp.scala)

This error occurs because the SPARK_HOME environment variable is not configured. Setting SPARK_HOME directly in the environment variables of the run configuration in IDEA fixes it, as shown in the following figure:
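If you want a clearer failure than the stack trace above, you can check for the variable yourself before building the SparkConf. This is a sketch of my own, not part of the original article:

```scala
// Hypothetical check: surface a readable error if SPARK_HOME is not visible
// to the JVM, instead of the "Library directory ... does not exist" failure.
object SparkHomeCheck {
  def sparkHome: Option[String] = sys.env.get("SPARK_HOME")

  def requireSparkHome(): String =
    sparkHome.getOrElse(
      sys.error("SPARK_HOME is not set; add it to the IDEA run configuration's environment variables"))
}
```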

4. Driver IP not set

19/08/17 07:52:45 ERROR ApplicationMaster: Failed to connect to driver at 169.254.42.204:64010, retrying ...
19/08/17 07:52:48 ERROR ApplicationMaster: Failed to connect to driver at 169.254.42.204:64010, retrying ...
19/08/17 07:52:48 ERROR ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!
  at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:577)
  at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:433)
  at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:256)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
  at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
  at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:785)
  at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)

This error occurs because the driver host was not set. In yarn-client mode the driver runs on our local machine, so the local IP must be set, otherwise the ApplicationMaster cannot find the driver:

.set("spark.driver.host","192.168.17.1")
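Hard-coding 192.168.17.1 breaks whenever your local address changes. A common trick (a sketch of my own, not from the article) is to discover the outbound local IP by connecting a UDP socket toward the cluster host; connect() on a UDP socket only selects a route, no packets are sent:

```scala
import java.net.{DatagramSocket, InetAddress}

// Discover which local IP would be used to reach the given host, so that
// spark.driver.host need not be hard-coded. "master" below is the RM
// hostname from the demo; replace it with your own.
object DriverHost {
  def localIpToward(host: String, port: Int = 9000): String = {
    val socket = new DatagramSocket()
    try {
      socket.connect(InetAddress.getByName(host), port)
      socket.getLocalAddress.getHostAddress
    } finally socket.close()
  }
}
```

Then the conf line becomes `.set("spark.driver.host", DriverHost.localIpToward("master"))`.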

5. Also make sure your computer and the virtual machines are on the same network segment, and turn off your computer's firewall, otherwise the connection may fail.
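To tell firewall or network-segment problems apart from configuration problems, a quick reachability probe helps (hypothetical helper of my own; 8088 is the RM web UI port and 8032 the default RM RPC port):

```scala
import java.net.{InetSocketAddress, Socket}

// Check whether host:port is reachable within a timeout. A false result for
// the ResourceManager ports suggests a firewall or network-segment problem.
object Reachability {
  def reachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: java.io.IOException => false
    } finally socket.close()
  }
}
```

For example, `Reachability.reachable("master", 8088)` and `Reachability.reachable("master", 8032)` should both be true before submitting.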

I submit in yarn-client mode. My YARN setup has two queues, so the queue name must be set when submitting.

Many other parameters can be set in the code, such as executor memory, the number of executors, driver memory and so on; set them according to your own situation. Of course, the same approach can also be used to submit to a standalone cluster.

Tags: Apache Spark Java Hadoop

Posted on Thu, 21 May 2020 20:09:40 -0400 by twilightnights