Reference article: How to submit Spark tasks to a YARN cluster remotely from IDEA
Several modes of running Spark tasks:
1. Local mode: write the code in IDEA and run it directly.
2. Standalone mode: package the program into a jar, upload it to the cluster, and submit it with spark-submit.
3. YARN mode (yarn-client or yarn-cluster): as above, this also requires packaging a jar and submitting it to the cluster.
If you are just testing things yourself, the methods above are cumbersome: every time you change the code you have to repackage it, upload it to the cluster, and then submit it with spark-submit, which wastes a lot of time. Here is how to submit to the YARN cluster remotely from your local IDEA.
Look directly at the demo below (the code is written fairly casually, just for demonstration):
package spark

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import spark.wordcount.kafkaStreams

object RemoteSubmitApp {
  def main(args: Array[String]): Unit = {
    // Set the user that submits the task
    System.setProperty("HADOOP_USER_NAME", "root")
    val conf = new SparkConf()
      .setAppName("WordCount")
      // Submit in yarn-client mode
      .setMaster("yarn")
      // Set the hostname of the ResourceManager
      .set("yarn.resourcemanager.hostname", "master")
      // Set the number of executors
      .set("spark.executor.instances", "2")
      // Set the memory size of each executor
      .set("spark.executor.memory", "1024M")
      // Set the YARN queue to submit the task to
      .set("spark.yarn.queue", "spark")
      // Set the IP address of the driver (the local machine)
      .set("spark.driver.host", "192.168.17.1")
      // Set the path of the application jar; additional dependency jars can be added, separated by commas
      .setJars(List("D:\\develop_soft\\idea_workspace_2018\\sparkdemo\\target\\sparkdemo-1.0-SNAPSHOT.jar"))
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val scc = new StreamingContext(conf, Seconds(1))
    scc.sparkContext.setLogLevel("WARN")
    //scc.checkpoint("/spark/checkpoint")

    val topic = "jason_flink"
    val topicSet = Set(topic)
    val kafkaParams = Map[String, Object](
      "auto.offset.reset" -> "latest",
      "value.deserializer" -> classOf[StringDeserializer],
      "key.deserializer" -> classOf[StringDeserializer],
      "bootstrap.servers" -> "master:9092,storm1:9092,storm2:9092",
      "group.id" -> "jason_",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    // kafkaStreams is a var defined in the spark.wordcount object (imported above)
    kafkaStreams = KafkaUtils.createDirectStream[String, String](
      scc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicSet, kafkaParams))

    kafkaStreams.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        rdd.foreachPartition(fp => {
          fp.foreach(f => {
            println(f.value().toString)
          })
        })
      }
    })

    scc.start()
    scc.awaitTermination()
  }
}
Then right-click and run it, and look at the printed log:
... 19/08/16 23:17:35 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: spark
start time: 1565997454105
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1565990507758_0020/
user: root
19/08/16 23:17:36 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:37 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:38 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:39 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:40 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
19/08/16 23:17:40 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master, PROXY_URI_BASES -> http://master:8088/proxy/application_1565990507758_0020), /proxy/application_1565990507758_0020
19/08/16 23:17:40 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
19/08/16 23:17:40 INFO Client: Application report for application_1565990507758_0020 (state: ACCEPTED)
19/08/16 23:17:41 INFO Client: Application report for application_1565990507758_0020 (state: RUNNING)
19/08/16 23:17:41 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 192.168.17.145
ApplicationMaster RPC port: 0
queue: spark
start time: 1565997454105
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1565990507758_0020/
user: root
...
You can see that the submission was successful. Then open the YARN monitoring page to check whether the job is there.
You can see one Spark program running. Click into it to see how it is running.
You can see that it is running normally. Pick a job and look further down at the log printed by the executors.
This is the data we wrote to Kafka, so everything is fine. To stop it, just click stop inside IDEA, which makes testing much more convenient.
Problems that may be encountered while running:
1. First, you need to put yarn-site.xml, core-site.xml and hdfs-site.xml under the resources directory, because the program needs these configurations to run (a quick sanity check is sketched below).
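As a quick sanity check (a hypothetical snippet of my own, not part of the original article), you can load a plain Hadoop/YARN configuration and confirm that the *-site.xml files on the classpath are actually being picked up; if they are not, the built-in defaults show up instead:

import org.apache.hadoop.yarn.conf.YarnConfiguration

object CheckHadoopConf {
  def main(args: Array[String]): Unit = {
    // YarnConfiguration loads core-site.xml and yarn-site.xml from the classpath
    val conf = new YarnConfiguration()
    println(conf.get("fs.defaultFS"))                  // falls back to file:/// if core-site.xml is missing
    println(conf.get("yarn.resourcemanager.hostname")) // falls back to 0.0.0.0 if yarn-site.xml is missing
  }
}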
2. Permission issues
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=JasonLee, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:342)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:189)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1744)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1687)
at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:60)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2980)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1096)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)
This happens because the job is submitted from the local machine, so the user name is the local OS user JasonLee, which has no write permission on HDFS. The simplest workaround is to set the user to root in the code:
System.setProperty("HADOOP_USER_NAME", "root")
3. Missing environment variables
Exception in thread "main" java.lang.IllegalStateException: Library directory 'D:\develop_soft\idea_workspace_2018\sparkdemo\assembly\target\scala-2.11\jars' does not exist; make sure Spark is built.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:347)
at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:526)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:814)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:839)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
at spark.RemoteSubmitApp$.main(RemoteSubmitApp.scala:31)
at spark.RemoteSubmitApp.main(RemoteSubmitApp.scala)
This error occurs because the SPARK_HOME environment variable is not configured. Just set SPARK_HOME in the environment variables of the Run Configuration in IDEA, as shown in the following figure:
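If you prefer not to keep a full Spark installation locally, an alternative worth trying (my own suggestion, not the article's fix; the HDFS path is hypothetical and the behaviour depends on your Spark version) is to point spark.yarn.jars at Spark jars already uploaded to HDFS, so the YARN client does not go looking for a local SPARK_HOME/jars directory:

// hypothetical HDFS location holding the contents of $SPARK_HOME/jars
.set("spark.yarn.jars", "hdfs://master:9000/spark-jars/*.jar")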
4. Driver IP not set
19/08/17 07:52:45 ERROR ApplicationMaster: Failed to connect to driver at 169.254.42.204:64010, retrying ...
19/08/17 07:52:48 ERROR ApplicationMaster: Failed to connect to driver at 169.254.42.204:64010, retrying ...
19/08/17 07:52:48 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:577)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:433)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:256)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:785)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
This error is because the driver host was not set. Since we run in yarn-client mode, the driver is our local machine, so the local IP has to be set, otherwise the ApplicationMaster cannot find the driver:
.set("spark.driver.host","192.168.17.1")
5. You also need to make sure that your computer and the virtual machines are on the same network segment, and turn off your computer's firewall, otherwise the connection may fail.
I submit in yarn-client mode here. My YARN cluster has two queues, so the queue name has to be set when submitting. There are many other parameters that can be set in the code, such as the executor memory, the number of executors, the driver memory and so on; set them according to your own situation. Of course, the same approach can also be used to submit to a standalone cluster (see the sketch below).
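For reference, a minimal sketch of the tunables mentioned above (the values are only examples; adjust them to your cluster, and the standalone master URL is an assumption):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("yarn")                     // or e.g. "spark://master:7077" for a standalone cluster
  .set("spark.yarn.queue", "spark")      // YARN queue to submit to
  .set("spark.executor.instances", "2")  // number of executors
  .set("spark.executor.memory", "1024m") // memory per executor
  .set("spark.driver.memory", "1024m")   // driver memory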