Submitting Spark tasks remotely to a YARN cluster

Reference article: How to submit Spark tasks to a YARN cluster remotely from IDEA. There are several modes for running Spark tasks: 1. local mode: write code in IDEA and run it directly; 2. standalone mode: package the program as a jar, upload it to the cluster, and submit it with spark-submit; 3. yarn mode (client or cluster deploy mode): as above, this also requires a jar packa ...
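
To make the remote-submit idea concrete, here is a minimal PySpark sketch of driving a YARN job from a development machine; the app name is invented, and it assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster's configuration files:

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch: submit to YARN from the IDE/driver machine, assuming
# HADOOP_CONF_DIR (or YARN_CONF_DIR) is exported and the cluster is reachable.
conf = (SparkConf()
        .setMaster("yarn")                          # run on the cluster, not local[*]
        .setAppName("remote-submit-demo")           # hypothetical app name
        .set("spark.submit.deployMode", "client"))  # driver stays on the local machine
sc = SparkContext(conf=conf)

print(sc.parallelize(range(100)).sum())  # trivial job to verify the connection
sc.stop()
```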

Posted on Thu, 21 May 2020 20:09:40 -0400 by twilightnights

Notes on common Spark RDD operators

Hello everyone! Here are the Spark operator notes I took during the epidemic holiday. I just spent the whole afternoon sorting them out to share with you; writing them up wasn't easy, so if they help you, remember to like the post! Article catalog: 1. Spark action operators; 2. Spark single-value types; 3. Spark double-value types; 4. Spa ...
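
As a quick taste of the operator categories the catalog names, here is a small PySpark sketch; the specific operators shown are illustrative picks of mine, not the article's own list:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-operator-notes")  # hypothetical app name

nums = sc.parallelize([1, 2, 3, 4, 5])

# Single-value transformations: map and filter are lazy, nothing runs yet.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Double-value (two-RDD) transformation: union combines two RDDs.
combined = evens.union(sc.parallelize([100, 200]))

# Action operators trigger the actual computation.
print(combined.collect())                # [4, 16, 100, 200]
print(nums.reduce(lambda a, b: a + b))   # 15

sc.stop()
```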

Posted on Mon, 18 May 2020 04:02:17 -0400 by Mattyspatty

Machine Learning Model Training Schemes for Massive-Data Scenarios

When actually tackling engineering problems in machine learning, it is very difficult to train a model on a single machine. Typical scenarios include online recommendation, CTR estimation, lookalike marketing, and so on. When there are hundreds of millions of records and tens of thousands of feature dimensions ...

Posted on Mon, 11 May 2020 23:50:09 -0400 by les48

Analysis of a Hadoop YARN ResourceManager crash caused by the ZooKeeper znode data limit

This problem has hit us again. It happens infrequently, but once it does, it crashes the ResourceManager service and causes issues such as too many watches registered on ZooKeeper. Leaving it without a complete solution has always been a nuisance, so, building on the previous two rounds of analysis and a reading of the latest Hadoop 3.2.1 c ...

Posted on Sun, 10 May 2020 10:38:53 -0400 by FireWhizzle

Several ways of reading and writing HBase: the Spark article

1. Overview of how HBase is read and written. The approaches mainly divide into: reading and writing HBase with the pure Java API; reading and writing HBase from Spark; reading and writing HBase from Flink; and reading and writing HBase through Phoenix. The first is the more primitive and efficient mode provided by HBase itself. The second and ...
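
For the Spark route specifically, a minimal PySpark read sketch looks roughly like the following; the ZooKeeper quorum and table name are placeholders, and it assumes the HBase client jars plus Spark's example Python converters are on the classpath:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "hbase-read-demo")  # hypothetical app name

# Placeholder connection settings; point these at a real cluster.
conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapreduce.inputtable": "test_table",
}

# Read HBase rows through the MapReduce TableInputFormat; the converters
# ship with the Spark examples jar and turn HBase types into strings.
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)
print(rdd.take(5))
sc.stop()
```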

Posted on Sat, 09 May 2020 23:05:48 -0400 by israfel

Common Flink source and sink operations in stream processing

Flink's sources for stream processing are basically the same as those for batch processing. There are four categories: collection-based sources, file-based sources, socket-based sources, and custom sources. Collection-based source: import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _} import scala.collection.immutable.{Queue, Stack} ...

Posted on Thu, 07 May 2020 10:54:12 -0400 by bachx

Analysis of Spark SQL source code: the analysis phase

Earlier posts in this Spark SQL series: Analysis of Spark SQL source code (1): overview of the Catalyst SQL parsing framework; Analysis of Spark SQL source code (2): ANTLR4 parses SQL and generates a tree. Analysis phase overview: first, we need to introduce a new concept here. In the SQL parse phase, ANTLR4 is used to parse an SQL statement int ...
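
If you want to watch this phase from user code while following along, extended explain in PySpark prints the parsed (unresolved) logical plan followed by the analyzed one; the view name below is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analysis-phase-demo").getOrCreate()

spark.range(10).createOrReplaceTempView("t")  # hypothetical view name

# explain(True) prints the Parsed Logical Plan (output of the ANTLR4 parse
# phase, with unresolved attributes) followed by the Analyzed Logical Plan,
# where names have been resolved against the catalog and types bound.
spark.sql("SELECT id * 2 AS doubled FROM t").explain(True)

spark.stop()
```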

Posted on Tue, 28 Apr 2020 06:53:52 -0400 by harkonenn

Big data: getting started with parallel computing - using PySpark

A Spark application runs as an independent set of processes, coordinated by the SparkContext in the driver. It can be created automatically (for example, if you launch pyspark from the shell, the Spark context already exists as sc). But we have not established one yet, so we need to define it: from pyspark import SparkContext sc = SparkContext('local[2]', ' ...
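
The excerpt is cut off mid-call; a complete, runnable version of the same construction might look like this, with the application name being my guess:

```python
from pyspark import SparkContext

# 'local[2]' runs the driver with two local worker threads;
# the second argument is the application name (invented here).
sc = SparkContext('local[2]', 'pyspark-intro')

rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x + 1).collect())  # [1, 2, ..., 10]

sc.stop()
```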

Posted on Thu, 16 Apr 2020 06:06:31 -0400 by aleigh

"partition.assignment.strategy" exception handling caused by spark streaming connection kafka

Server environment: Spark 2.4.4 + Scala 2.11.12 + Kafka 2.2.2. Because the business is relatively simple and Kafka has only fixed topics, the following script has always been used to run the real-time streaming computation: spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 --py-files /data/service/xxx.zip /data/service/x ...
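
The excerpt ends before the article's own fix; one workaround commonly cited for the "Missing required configuration 'partition.assignment.strategy'" error is to set the strategy explicitly in the consumer parameters. A sketch against the 0-8 Python API (broker address, topic, and app name are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-stream-demo")  # hypothetical app name
ssc = StreamingContext(sc, 10)                  # 10-second batches

# Supplying the strategy explicitly is a common workaround for the
# "partition.assignment.strategy" exception; values here are placeholders.
kafka_params = {
    "metadata.broker.list": "broker-host:9092",
    "partition.assignment.strategy": "range",
}
stream = KafkaUtils.createDirectStream(ssc, ["some_topic"], kafka_params)
stream.map(lambda kv: kv[1]).pprint()  # print message values per batch

ssc.start()
ssc.awaitTermination()
```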

Posted on Tue, 17 Mar 2020 23:48:00 -0400 by echo64

Spark MLlib algorithm operations - Part 2

0. Spark MLlib basic statistics: correlation, hypothesis testing, Summarizer. 1. Correlation. Calculating the correlation between two series of data is a common operation in statistics. spark.ml provides plenty of flexibility for calculating pairwise correlations among multiple series. At present ...
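
To illustrate, a minimal spark.ml correlation sketch in PySpark; the toy data is mine:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("correlation-demo").getOrCreate()

# Toy dataset: one feature vector per row.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 3.0]),),
     (Vectors.dense([4.0, 3.0, 5.0]),),
     (Vectors.dense([6.0, 7.0, 8.0]),)],
    ["features"])

# Pearson correlation matrix over the vector's columns;
# pass method="spearman" for rank correlation instead.
matrix = Correlation.corr(df, "features").head()[0]
print(matrix)

spark.stop()
```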

Posted on Mon, 16 Mar 2020 05:32:04 -0400 by nuklehed