Spark phase summary


  • Kafka data consumption

At any given time, a partition's data in Kafka can be consumed by only one consumer within a consumer group.

Kafka consumers are organized into groups when they consume data, and consumption in one group does not affect another. Within a single group, note the following: with 3 partitions and 3 consumers, each consumer consumes one partition; with 2 consumers, one consumer consumes one partition and the other consumes two; with more than 3 consumers, at most 3 of them can consume data at the same time, and the rest stay idle.
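The 3-partition cases above can be reproduced with a small simulation. This is only a sketch in plain Java, assuming a simple round-robin spread of partition ids over consumers; the class and method names are hypothetical, and Kafka's real assignors (range, round-robin, sticky) are more involved.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupAssignmentSketch {
    // Spread partition ids 0..numPartitions-1 over the consumers of ONE group.
    // Assumes numConsumers >= 1. Each partition gets exactly one owner.
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            owned.get(p % numConsumers).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, 3)); // [[0], [1], [2]]
        System.out.println(assign(3, 2)); // [[0, 2], [1]]
        System.out.println(assign(3, 4)); // [[0], [1], [2], []]  the 4th consumer is idle
    }
}
```

The empty list in the last case is the idle consumer: adding consumers beyond the partition count does not increase parallelism.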

  • Ordering of data in Kafka

If you save data to multiple partitions, ordering is guaranteed only within each partition, not globally.

To get global ordering, send all data to a single partition.

  • Kafka producer: producing data

Specify topic+value: the data is distributed across partitions in round-robin mode.

Specify topic+key+value: the partition is chosen from the hash of the key. A fixed key always maps to the same partition; varying keys are spread across partitions according to their hashes.

Specify topic+partition+key+value: if a partition number is specified, all data is sent to that partition.
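The three cases above amount to a small decision rule, sketched below in plain Java. The class and method names are hypothetical. `String.hashCode()` is only a stand-in: the real Kafka client hashes the serialized key bytes with murmur2, and newer clients do not use plain round-robin for records without a key.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionChoiceSketch {
    // Shared counter used for the round-robin case (records without a key).
    private static final AtomicInteger counter = new AtomicInteger();

    // partition and key may be null, mirroring the three ways of building a record.
    public static int choose(Integer partition, String key, int numPartitions) {
        if (partition != null) {
            return partition; // topic+partition+key+value: explicit partition wins
        }
        if (key != null) {
            // topic+key+value: same key -> same partition (stand-in hash, see above)
            return Math.floorMod(key.hashCode(), numPartitions);
        }
        // topic+value: rotate over the partitions
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(choose(2, "order-1", 3));    // explicit partition: 2
        System.out.println(choose(null, "order-1", 3)); // decided by the key hash
        System.out.println(choose(null, null, 3));      // round-robin
    }
}
```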

The producer sends data to the broker, which stores it. To prevent sent data from being lost, there are three ack settings:

0: the producer sends data and continues with the next batch regardless of whether the leader has saved it or the followers have synchronized it;

1: the producer waits until the leader has saved the data successfully, then continues with the next batch regardless of whether the followers have synchronized it;

-1: the producer waits until the leader has saved the data and the followers have also synchronized it successfully, and only then sends the next batch.

## Kafka server address
## Serializer for the key
## Serializer for the value
acks=[0|-1|1|all] ## Message acknowledgment mechanism
	0: just send the message without waiting for any acknowledgment
	-1|all: the leader must write the data to its local log and acknowledge it, and must also wait for the in-sync followers to acknowledge
	1: only the leader needs to acknowledge the message; followers can synchronize from the leader afterwards
batch.size=1024 # Size of the buffer that caches unsent records for each partition
## Even if the batch is not full (there is still unused space), the request is still sent. To reduce the number of requests, the send delay can be set to a value greater than 0; ## for example, with a delay of 10 ms the request is sent after 10 ms whether the batch is full or not
buffer.memory=10240 # Total buffer memory available to one producer
retries=0 # Number of retries after a failed send

The parallelism of Kafka consumption equals the number of partitions of the topic. In other words, the partition count determines the maximum number of consumers in the same consumer group that can consume data at the same time.

  Offset: the identification of each message in the partition in kafka's topic. This offset is used to distinguish the position of the message in the partition corresponding to kafka. The data type of offset is Long, with a length of 8 bytes. Offsets are ordered within partitions, but not necessarily between partitions.

Multi node partition storage distribution

Replica allocation algorithm:

1. Sort all n brokers and the partitions to be allocated.
2. Assign the i-th partition to broker (i mod n).
3. Assign the j-th replica of the i-th partition to broker ((i + j) mod n).
At the same time, load balance should be taken into account: try to store the same number of partitions and replicas on every node.
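The formula above can be checked with a few lines of Java. This is a sketch of the simplified rule as stated in these notes, with replica index j = 0 standing for the first copy of a partition; the names are hypothetical, and the real broker-side algorithm also applies a random starting offset and rack awareness.

```java
public class ReplicaAssignmentSketch {
    // Broker index for replica j of partition i in a cluster of n sorted brokers.
    public static int brokerFor(int partition, int replica, int numBrokers) {
        return (partition + replica) % numBrokers;
    }

    public static void main(String[] args) {
        int numBrokers = 3;
        int numPartitions = 3;
        int replicationFactor = 2;
        for (int i = 0; i < numPartitions; i++) {
            for (int j = 0; j < replicationFactor; j++) {
                System.out.println("partition " + i + ", replica " + j
                        + " -> broker " + brokerFor(i, j, numBrokers));
            }
        }
    }
}
```

With 3 brokers, 3 partitions, and 2 replicas, every broker ends up holding exactly two replicas, which is the balance the last sentence asks for.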

Segment files

In Kafka, a topic can have multiple partitions, and a partition can have multiple segments. Each segment consists of two files: a .log file and an .index file.

The .log file stores the data, and the .index file stores an index into that data, which is a sparse index.

A segment stores 1 GB of data by default. When a segment reaches 1 GB, a second segment is started, and so on.

First segment file segment:

-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000000000.index

-rw-r--r-- 1 root root   654696 Oct 13 17:14 00000000000000000000.log

Second segment file segment:

-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000004356.index

-rw-r--r-- 1 root root   654696 Oct 13 17:14 00000000000000004356.log

The third segment file segment:

-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000752386.index

-rw-r--r-- 1 root root   654696 Oct 13 17:14 00000000000000752386.log

Segment file naming rules:

A segment file is named after the offset of the first record in the .log file of that segment.
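Because each file name is the base offset of its segment, locating the segment that holds a given offset is a "largest base offset not greater than the target" lookup. The sketch below illustrates that idea with a `TreeMap`, using the three base offsets from the listing above; the class and method names are hypothetical and this is not Kafka's internal code.

```java
import java.util.TreeMap;

public class SegmentLookupSketch {
    // Segment file name: the base offset left-padded with zeros to 20 digits.
    public static String fileName(long baseOffset) {
        return String.format("%020d.log", baseOffset);
    }

    // Index the segments by their base offset.
    public static TreeMap<Long, String> build(long... baseOffsets) {
        TreeMap<Long, String> segments = new TreeMap<>();
        for (long base : baseOffsets) {
            segments.put(base, fileName(base));
        }
        return segments;
    }

    // Assumes offset >= the smallest base offset, so floorEntry never returns null.
    public static String segmentFor(long offset, TreeMap<Long, String> segments) {
        return segments.floorEntry(offset).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> segments = build(0L, 4356L, 752386L);
        System.out.println(segmentFor(100L, segments));  // first segment
        System.out.println(segmentFor(5000L, segments)); // second segment
    }
}
```

Inside the chosen segment, the sparse .index file then narrows the search to a position in the .log file.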

push and pull in Kafka

Push mode is difficult to adapt to consumers with different consumption rates, because the message sending rate is determined by the broker. The goal of push mode is to deliver messages as fast as possible, but this can easily cause consumers to have no time to process messages. The typical manifestations are denial of service and network congestion. The pull mode can consume messages at an appropriate rate according to the consumption capacity of consumers.
The disadvantage of pull mode is that if Kafka has no data, consumers may fall into a loop and return empty data all the time. In view of this, Kafka consumers will pass in a duration parameter timeout when consuming data. If there is no data available for consumption, the consumer will wait for a period of time before returning. This duration is timeout.

Why is kafka so fast

  1. Sequential read-write disk
  2. Using pageCache page cache technology
  3. Multi directory

Kafka command

Start cluster

Start Kafka on the hadoop102, hadoop103, and hadoop104 nodes in turn
[atguigu@hadoop102 kafka]$ bin/ config/ &
[atguigu@hadoop103 kafka]$ bin/ config/ &
[atguigu@hadoop104 kafka]$ bin/ config/ &

Shutdown cluster

[atguigu@hadoop102 kafka]$ bin/ stop
[atguigu@hadoop103 kafka]$ bin/ stop
[atguigu@hadoop104 kafka]$ bin/ stop

View all topics on the current server

[atguigu@hadoop102 kafka]$ bin/ --zookeeper hadoop102:2181 --list

Create topic

[atguigu@hadoop102 kafka]$ bin/ --zookeeper hadoop102:2181 \
--create --replication-factor 3 --partitions 1 --topic first

Option description: --topic defines the topic name; --replication-factor defines the number of replicas; --partitions defines the number of partitions

Delete topic

[atguigu@hadoop102 kafka]$ bin/ --zookeeper hadoop102:2181 \
--delete --topic first

Send messages

[atguigu@hadoop102 kafka]$ bin/ \
--broker-list hadoop102:9092 --topic first
>hello world
>atguigu atguigu

Consume messages

[atguigu@hadoop103 kafka]$ bin/ \
--zookeeper hadoop102:2181 --from-beginning --topic first

--from-beginning: reads all earlier data in the topic first. Add this option or not depending on the business scenario.

View the details of a Topic

[atguigu@hadoop102 kafka]$ bin/ --zookeeper hadoop102:2181 \
--describe --topic first

Start consumer

[atguigu@hadoop102 kafka]$ bin/ \
--zookeeper hadoop102:2181 --topic first --consumer.config config/
[atguigu@hadoop103 kafka]$ bin/ --zookeeper hadoop102:2181 \
--topic first --consumer.config config/

Kafka API

package com.atguigu.kafka;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
public class NewProducer {
public static void main(String[] args) {
Properties props = new Properties();
// Host name and port number of Kafka server
props.put("bootstrap.servers", "hadoop103:9092");
// Wait for responses from all replica nodes
props.put("acks", "all");
// Maximum number of attempts to send a message
props.put("retries", 0);
// Batch message processing size
props.put("batch.size", 16384);
// Request delay
props.put("", 1);
// Send buffer memory size
props.put("buffer.memory", 33554432);
// key serialization
// value serialization
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 50; i++) {
producer.send(new ProducerRecord<String, String>("first", 
Integer.toString(i), "hello world-" + i));
}
// Flush pending records and release resources
producer.close();
}
}

Custom partition producer

Define a class to implement the Partitioner interface and override the methods inside (outdated API)

package com.atguigu.kafka;
import java.util.Map;
import kafka.producer.Partitioner;
public class CustomPartitioner implements Partitioner {
public CustomPartitioner() {
}
public int partition(Object key, int numPartitions) {
// Control partition: here every record is sent to partition 0
return 0;
}
}

Kafka consumer API

#1. Address
#2. Serialization 
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
#3. A specific topic (order) needs to be specified.
#4. Consumer
public class OrderConsumer {
public static void main(String[] args) {
// 1. Connect to the cluster
Properties props = new Properties(); 
props.put("bootstrap.servers", "node01:9092"); 
props.put("", "test");
// The following two lines make the consumer commit offsets automatically
props.put("", "true"); 
props.put("",  "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(props);
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(1000);
    for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
        String value = consumerRecord.value();
        int partition = consumerRecord.partition();
        long offset = consumerRecord.offset();
        String key = consumerRecord.key();
        System.out.println("key:" + key + "value:" + value + "partition:" + partition + "offset:" + offset);
    }
}
}
}

Specify partition data for consumption

public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
        props.put("", "test");
        props.put("", "true");
        props.put("", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
        TopicPartition topicPartition = new TopicPartition("test", 0);
        TopicPartition topicPartition1 = new TopicPartition("test", 1);
        kafkaConsumer.assign(Arrays.asList(topicPartition, topicPartition1));
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(1000);
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                String value = consumerRecord.value();
                int partition = consumerRecord.partition();
                long offset = consumerRecord.offset();
                String key = consumerRecord.key();
                System.out.println("key:" + key + "value:" + value + "partition:" + partition + "offset:" + offset);
    }   }

Kafka and Flume integration

Flume is mainly used to collect log data (offline or real-time).

  Configure the flume.conf file

#Name the source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#Specify which channel the source sends its collected data to
a1.sources.r1.channels = c1
#Specify the data collection policy of the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/flumedata
a1.sources.r1.deletePolicy = never
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.ignorePattern = ^(.)*\\.tmp$
a1.sources.r1.inputCharset = UTF-8
#Setting the channel type to memory means that all data is buffered in memory
a1.channels.c1.type = memory
#Use a Kafka sink, and specify which channel the sink reads data from = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test
a1.sinks.k1.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1


[offcn@bd-offcn-02 kafka]$ bin/ \
--topic test \
--bootstrap-server node01:9092,node02:9092,node03:9092

[root@node01 flume]$ bin/flume-ng agent --conf conf --conf-file conf/flume_kafka.conf --name a1 -Dflume.root.logger=INFO,console



Spark official website component description

  Spark general operation simple process


Spark's Driver is the process that executes the main method of the application. It runs the code written by the developer to create the SparkContext, create RDDs, and perform RDD transformations and actions.

A Spark Executor is a worker process responsible for running the tasks of Spark jobs. Tasks are independent of each other. Executors are started when the Spark application starts and live for the whole lifetime of the application.

Through its own Block Manager, the Executor also provides in-memory storage for the RDDs that the user program asks to cache. RDDs are cached directly inside the Executor process, so tasks can make full use of cached data to speed up operations at run time.

RDD (Resilient Distributed Dataset) (key)

  • What is an RDD? What are its characteristics? Can it carry data?

RDD: Resilient Distributed Dataset

Features: immutable and partitioned. Its elements can be computed in parallel.

It does not carry data itself, much like an interface in Java. It carries metadata.

  • Dependency

Narrow dependency: each partition of the parent RDD is depended on by at most one partition of the child RDD (an "only child").

Wide dependency: a partition of the parent RDD is depended on by multiple partitions of the child RDD ("multiple children").



WordCount execution flowchart

  • RDD operator classification

RDD operators are divided into two types: transformation operators and action operators.

Transformation: conversion operator with lazy evaluation. It only records the lineage and does not compute anything; the transformations are executed only when an action is encountered.

  map (one to one), flatMap (one to many), filter (one to N (0, 1)), join, leftOuterJoin, rightOuterJoin, fullOuterJoin, sortBy, sortByKey, groupBy, groupByKey, reduceBy, reduceByKey, sample, union, mapPartitions, mapPartitionsWithIndex, zip, zipWithIndex.


//Create an RDD and specify the number of partitions
val rdd: RDD[Int] = sc.parallelize(Array(1,2,3,4),2)
//Join the elements of each partition with "-"
val result: RDD[String] = rdd.mapPartitions(x=>Iterator(x.mkString("-")))


val rdd: RDD[Int] = sc.parallelize(1 to 16,4)
//See what data is saved in each partition
val result: RDD[String] = rdd.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(",")))

  sample, union and join operators:

sample operator:


sample(withReplacement, fraction, seed): random sampling operator. sample is mainly used to study the data itself in place of a full-scale study. When problems such as data skew (dataSkew) make a full-scale study impractical, we can only use a sample to evaluate the whole.
withReplacement: Boolean: whether to sample with replacement (true) or without replacement (false)

fraction: Double: the proportion of the sample in the overall data volume, in the range [0, 1], for example 0.2 or 0.65

seed: Long: the seed of the random number generator. It has a default value and usually does not need to be passed

def sampleOps(sc: SparkContext): Unit = {
    val list = sc.parallelize(1 to 100000)
    val sampled1 = list.sample(true, 0.01)
    println("sampled1 count: " + sampled1.count())
    val sampled2 = list.sample(false, 0.01)
    println("sampled2 count: " + sampled2.count())
}

union operator:

Equivalent to union all in SQL: it combines the data of two RDDs. Note that union is a narrow-dependency operation: if rdd1 has N partitions and rdd2 has M partitions, the result of union has N+M partitions.

join operator:

rdd1.join(rdd2) is equivalent to the join operation in SQL

  A(id) a, B(aid) b

  select * from A a join B b on = b.aid

Cross join: cross join

  select * from A a cross join B ====> this produces the Cartesian product

Inner join: inner join, which extracts the intersection of the left and right tables

  select * from A a inner join B on = b.aid or

  select * from A a, B b where = b.aid

Outer join: outer join

  Left outer join (left outer join): returns all rows of the left table; matching rows of the right table are returned, and null is returned where there is no match

    select * from A a left outer join B on = b.aid

    //leftOuterJoin operation

  val result1: RDD[(Int, (String, Option[Int]))] = rdd1.leftOuterJoin(rdd2)

  Right outer join (right outer join): the exact opposite of the left outer join

    select * from A a right outer join B on = b.aid

    val result2: RDD[(Int, (Option[String], Int))] = rdd1.rightOuterJoin(rdd2)

Full join: full join

  Full outer join (full outer join) = left outer join + right outer join

    val result3: RDD[(Int, (Option[String], Option[Int]))] = rdd1.fullOuterJoin(rdd2)
Prerequisite: to use join, both RDDs must be of key-value (K-V) type

coalesce and repartition(numPartitions) operators:

coalesce(numPartition, shuffle=false): merges partitions

    numPartition: the number of partitions after the operation

    shuffle: whether to enable a shuffle for this repartitioning, which determines whether the operation is wide (true) or narrow (false)

For example, 100 original partitions are merged into 10, or 2 partitions are repartitioned into 4.

coalesce is a narrow-dependency operator by default. If everything is compressed into one partition, it is necessary to turn on shuffle=true; coalesce is then a wide-dependency operator.

If you try to increase the number of partitions, shuffle=false will not change the partition count. You can increase it only with shuffle=true.

repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true)

Action: execution operator. It triggers the transformation operators to run and outputs the result.

  count, collect (pulls the task results back to the Driver), foreach (does not pull task results back. Principle: the function passed in by the user is pushed to each node for execution, and the results are visible only on the compute nodes), saveAsTextFile(path), reduce, foreachPartition, take, first, takeOrdered(n).

  • Which is more efficient, map or mapPartitions? Give an example

mapPartitions is more efficient.

Example: saving data to a database. With the map operator, a database connection is opened every time an element is saved and closed afterwards. If the data volume is large, repeatedly connecting and disconnecting puts great pressure on the database. With mapPartitions, the data of a whole partition is handled at once: to save a partition you only need to connect and disconnect once, which puts much less pressure on the database.

  • Which is more efficient, reduceByKey or groupByKey? Why?

reduceByKey is more efficient because it pre-aggregates the data within each partition before the shuffle, which reduces network transmission.
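The effect of pre-aggregation can be illustrated by counting how many records would cross the network in each case. The sketch below is plain Java on in-memory lists, not Spark code, and the names are hypothetical: without a map-side combine every record is shuffled, while with it each partition sends at most one record per distinct key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class MapSideCombineSketch {
    // groupByKey-like behaviour: every (key, value) record is shuffled.
    public static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    // reduceByKey-like behaviour: one partial result per distinct key per partition.
    public static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += new HashSet<>(partition).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "a", "b"),
                Arrays.asList("a", "b", "b", "b"));
        System.out.println(recordsShuffledWithoutCombine(partitions)); // 7
        System.out.println(recordsShuffledWithCombine(partitions));    // 4
    }
}
```

The gap grows with the number of repeated keys per partition, which is why the difference matters most on skewed or highly repetitive data.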

  • Of reduceByKey and reduce, which is an action operator and which is a transformation operator?

reduceByKey is a transformation operator.

reduce is an action operator.

  • Persistence mode

There are two persistence methods: cache and persist

  One of the most important functions in Spark is to persist (or cache) data sets in memory across operations. When you persist an RDD, each node stores any partitions it computes in memory and reuses them in other operations on that dataset (or datasets derived from that dataset). This makes future actions faster (usually more than 10 times). Caching is a key tool for iterative algorithms and fast interactive use.

How to persist

   RDDs can be marked as persistent using the persist() or cache() methods. The first time an RDD is computed in an action, it is kept in memory on the nodes. Spark's cache is fault-tolerant: if any partition of the RDD is lost, it is automatically recomputed using the transformations that originally created it.

The persistence method is rdd.persist() or rdd.cache()



  • Relationship and difference between cache and persist

Under the hood, cache() calls the no-argument persist(), which caches data in memory by default.

persist() lets you choose the storage level.

Shared variable

It would be inefficient to support common read-write shared variables across tasks. However, Spark does provide two limited types of shared variables for two common usage patterns: Broadcast variables and accumulators.

  In other words, in order to share data more efficiently between driver and operator, spark provides two limited shared variables, one is broadcast variable and the other is accumulator

Notes on defining broadcast variables

Once a variable is defined as a broadcast variable, it can only be read and cannot be modified


The concept of an accumulator is similar to that of a counter in MapReduce: it accumulates data that has certain characteristics. One advantage of an accumulator is that data can be accumulated without modifying the business logic of the program and without triggering an additional action job. Without an accumulator, you would have to add new business logic and trigger a new action job to do the accumulation. The accumulator therefore clearly performs better.

  • Is join a narrow dependency or a wide dependency?

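join may be either a narrow or a wide dependency. It is narrow when both RDDs are already partitioned by the same partitioner; otherwise it requires a shuffle and is wide.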

RDD partitioning

Spark currently supports hash partitioning and range partitioning, and users can also define a custom partitioner. Hash partitioning is the current default. The partitioner in Spark directly determines the number of partitions of an RDD, which partition each record belongs to after a shuffle, and the number of reduce tasks.

The partitioning decision is made in wide dependencies. Because a narrow dependency is one-to-one and the partition is already determined, no partitioning operation needs to be specified for it.
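Hash partitioning comes down to a non-negative modulo of the key's hash code. The following plain-Java sketch mirrors that rule for the keys 1, 2, and 3 with two partitions; the class and method names are hypothetical, and it is an illustration of the rule rather than Spark's source.

```java
public class HashPartitionSketch {
    // Partition index for a key: null keys go to partition 0,
    // all other keys to hashCode mod numPartitions, kept non-negative.
    public static int partitionFor(Object key, int numPartitions) {
        if (key == null) {
            return 0;
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (int key : new int[] {1, 2, 3}) {
            System.out.println("key " + key + " -> partition " + partitionFor(key, 2));
        }
        // key 1 -> partition 1
        // key 2 -> partition 0
        // key 3 -> partition 1
    }
}
```

`Math.floorMod` is used instead of `%` so that keys with a negative hash code still map to a valid partition index.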

def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    //Load data
    val rdd = sc.parallelize(List((1,3),(1,2),(2,4),(2,3),(3,6),(3,8)),8)
    //Partition by Hash
    val result: RDD[(Int, Int)] = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
    //Get partition mode
    //Get number of partitions
  • Job

An action operator forms a job

  • DAG directed acyclic graph

Describes the execution process of RDD.


The DAG (directed acyclic graph) is formed when an action operator is encountered.

RDDs support only coarse-grained transformations, that is, a single operation applied to a large number of records. The series of transformations that created an RDD is recorded as its lineage so that lost partitions can be recovered. The lineage of an RDD records its metadata and transformation behavior. When some partitions of the RDD are lost, they can be recomputed and recovered from this information.



Spark environment startup command

Run the %SPARK_HOME%\bin\spark-shell.cmd script

  Spark distributed environment


Submit task & execute procedure

[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node01:7077 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
./examples/jars/spark-examples_2.11-2.4.7.jar \

Spark distributed HA environment installation

  To configure Zookeeper based ha, you need to add a sentence in

Comment out the following:
 Add the following contents. When configuring, make sure the statement is on a single line, otherwise the configuration will not take effect; separate each -D parameter with a space

Because with HA the master is not necessarily started on node01, comment out

export SPARK_MASTER_HOST=node01, synchronize the file to the other machines, and restart the Spark cluster, starting a master on both node01 and node02.

Task submission & execution procedure:

[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node01:7077,node02:7077 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
./examples/jars/spark-examples_2.11-2.4.7.jar \

Dynamically adding and removing slaves (workers)

spark]# sbin/ node01:7077 -c 4 -m 1024M
spark]# sbin/ node01:7077 -c 4 -m 1024M

Spark distributed YARN environment

Modify the hadoop configuration file yarn-site.xml

[root@node01 hadoop]$ vi yarn-site.xml
        <!--Whether to start a thread to check the amount of physical memory being used by each task. If the task exceeds the allocated value, it will be killed directly. The default is true -->
        <!--Whether to start a thread to check the amount of virtual memory being used by each task. If the task exceeds the allocated value, it will be killed directly. The default is true -->


[root@node01 conf]# vi

client mode

[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.4.7.jar \

  cluster mode

[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.4.7.jar \

Spark code writing

WordCount program written with Spark Core

package com.atguigu
import org.apache.spark.{SparkConf, SparkContext}
object WordCount{
 def main(args: Array[String]): Unit = {
//1. Create SparkConf and set App name
 val conf = new SparkConf().setAppName("WC")
//2. Create SparkContext, which is the entry to submit Spark App
 val sc = new SparkContext(conf)
 //3. Use sc to create RDD and execute corresponding transformation and action
 sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 
1)).reduceByKey(_+_, 1).sortBy(_._2, false).saveAsTextFile(args(1))
//4. Close the connection
 sc.stop()
 }
}

Package to cluster test

bin/spark-submit \
--class WordCount \
--master spark://hadoop102:7077 \
WordCount.jar \
/word.txt \

Efficient write to database

def saveInfoMySQLByForeachPartition(rdd: RDD[(String, Int)]): Unit = {
	rdd.foreachPartition(partition => {
		//This runs inside the partition, local to the executor that processes it
		val url = "jdbc:mysql://localhost:3306/test"
		val connection = DriverManager.getConnection(url, "mark", "sorry")
		val sql =
			"""
			  |insert into wordcounts(word, `count`) Values(?, ?)
			""".stripMargin
		val ps = connection.prepareStatement(sql)
		partition.foreach{case (word, count) => {
			ps.setString(1, word)
			ps.setInt(2, count)
			//Queue the row; the whole partition is written in one batch
			ps.addBatch()
		}}
		ps.executeBatch()
		//One connection per partition: close it once the partition is done
		ps.close()
		connection.close()
	})
}

Spark SQL

It provides two programming abstractions: DataFrame and DataSet, and acts as a distributed SQL query engine.

DataFrame has one more header information (Schema: constraint information) than RDD



Compared with RDD, dataset provides strong type support (generics) and adds type constraints to each row of data in RDD. Figure 1-7 shows the description of dataset in the official website.


  • sparkSql query style

There are two styles: one is DSL style and the other is SQL style.

DSL: using operators for data analysis has certain requirements for programming ability.

SQL: use SQL statements to analyze data.

  • schema constraint information

Refers to structured information.

  • SparkCore and SparkSql

SparkCore: underlying abstraction: RDD; program entry: SparkContext

SparkSql: underlying abstraction: DataFrame and DataSet; program entry: SparkSession

  • RDD,DataFrame,DataSet

DataFrame = RDD - generics + schema + SQL + optimization

DataSet = RDD + schema + SQL + optimization

File save options



SparkSQL basic programming

Construction of SparkSession

val spark = SparkSession.builder()
           //.enableHiveSupport() // supports Hive-related operations

How to build DataFrame

package chapter1
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
object Create_DataFrame {
    def main(args: Array[String]): Unit = {
        //Create program entry
        val spark: SparkSession = SparkSession.builder().appName("createDF").master("local[*]").getOrCreate()
        //Call sparkContext
        val sc: SparkContext = spark.sparkContext
        //Set console log output level
        //Create DataFrame from data source
        val personDF: DataFrame ="examples/src/main/resources/people.json")
        //Display data

Convert from RDD:

        val personDF: DataFrame = personRDD.toDF("id","name","age")

Create a DataFrame through reflection:

        val personDF: DataFrame = personRDD.toDF()

Programmatic (dynamic) construction

val df = spark.createDataFrame(row, schema)

        val list = List(
            new Student(1, "Wang Shengpeng", 1, 19),
            new Student(2, "Li Jinbao", 1, 49),
            new Student(3, "Zhang Haibo", 1, 39),
            new Student(4, "Zhang Wenyue", 0, 29)
        )
        import spark.implicits._
        val ds = spark.createDataset[Student](list)

Row: represents a row of records in a two-dimensional table, or a Java object

Data loading

spark.read.format(data file format).load(path)

//Import implicit conversions
    import spark.implicits._
//The first way
//Load json file
val personDF: DataFrame ="json").load("E:\\data\\people.json")
//Load parquet file
val personDF1: DataFrame ="parquet").load("E:\\data\\people.parquet")
//Load the csv file. The csv file is special. If you want to bring the header, you must call the option method
val person2: DataFrame ="csv").option("header","true").load("E:\\data\\people.csv")
//Load tables in the database
val personDF3: DataFrame =
    .option("url", "jdbc:mysql://localhost:3306/bigdata")
    .option("user", "root")
    .option("password", "root")
    .option("dbtable", "person")

//The second way
//Load json file
val personDF4: DataFrame ="E:\\data\\people.json")
//Load parquet file
val personDF5: DataFrame ="E:\\data\\people.parquet")
//Load the csv file. The csv file is special. If you want to bring the header, you must call the option method
val person6: DataFrame ="header","true").csv("E:\\data\\people.csv")
//Load tables in the database
val properties = new Properties()
properties.put("user", "root")
properties.put("password", "root")
val personDF7: DataFrame ="jdbc:mysql://localhost:3306/bigdata", "person", properties)

Saving data

    //The first way
    //Save as json file
    //Save as parquet file
    //Save it as a csv file. If you want to bring the header, call the option method
    //Save as a table in the database
        .option("url", "jdbc:mysql://localhost:3306/bigdata")
        .option("user", "root")
        .option("password", "root")
        .option("dbtable", "person").save()

 //The second way
//Save as parquet file
//Save as csv file
personDF.write.option("header", "true").csv("E:\\data\\csv")
//Save as json file
//Table saved as database
val props = new Properties()

SparkSQL and Hive integration

1. You need to import hive-site.xml of hive, add it under the classpath directory, or put it in $SPARK_HOME/conf

2. In order to parse the HDFS path in hive-site.xml normally, hdfs-site.xml and core-site.xml need to be under the classpath

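package chapter5
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
object Hive_Support {
  def main(args: Array[String]): Unit = {
    // Create the Spark SQL entry point. enableHiveSupport() is required to access Hive tables.
    // The app name and master are assumed; the original builder settings were lost.
    val spark: SparkSession = SparkSession.builder()
      .appName("Hive_Support").master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    // Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    // Set the log level (the level itself is assumed)
    sc.setLogLevel("WARN")
    // Import implicit conversions
    import spark.implicits._
    // List the tables in Hive
    spark.sql("show tables").show()
    // Create a table
    spark.sql("CREATE TABLE person (id int, name string, age int) row format delimited fields terminated by ' '")
    // Load data
    spark.sql("load data local inpath './person.txt' into table person")
    // Query the data in the table
    spark.sql("select * from person").show()
    // Release resources
    spark.stop()
  }
}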

User defined function

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SparkSession}
object UDF_Demo {
  def main(args: Array[String]): Unit = {
    // Create the Spark SQL entry point
    val spark: SparkSession = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    // Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    // Set the log level (the level itself is assumed)
    sc.setLogLevel("WARN")
    // Import implicit conversions
    import spark.implicits._
    // Load the file
    val personDF: DataFrame = spark.read.json("E:\\data\\people.json")
    // Display the data
    personDF.show()
    // Register as a temporary view
    personDF.createOrReplaceTempView("t_person")
    // Define what the function does. The original body was lost; adding a prefix is an assumed example.
    val fun = (x:String)=>{
      "Name:" + x
    }
    // addName is not a built-in function, so register it
    spark.udf.register("addName", fun)
    spark.sql("select name,addName(name) from t_person").show()
    // Release resources
    spark.stop()
  }
}

Windowing function

row_number() over (partition by XXX order by XXX)

rank(): ranking with gaps. If two rows tie for second place, the next row is ranked fourth.

dense_rank(): ranking without gaps. If two rows tie for second place, the next row is ranked third.

row_number(): consecutive numbering. Rows with equal values still receive different numbers.
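
To make the difference between the three functions concrete, here is a minimal sketch in plain Scala (no Spark involved). The object and function names are made up for illustration, and the input is assumed to be already sorted in descending order, as it would be inside one `partition by ... order by score desc` window.

```scala
object RankingSketch {
  // row_number(): 1, 2, 3, ... regardless of ties
  def rowNumber(sorted: Seq[Int]): Seq[(Int, Int)] =
    sorted.zipWithIndex.map { case (s, i) => (s, i + 1) }

  // rank(): tied rows share a rank, and the next rank skips ahead (gaps)
  def rank(sorted: Seq[Int]): Seq[(Int, Int)] =
    sorted.map(s => (s, sorted.indexOf(s) + 1))

  // dense_rank(): tied rows share a rank, with no gaps afterwards
  def denseRank(sorted: Seq[Int]): Seq[(Int, Int)] = {
    val distinctScores = sorted.distinct
    sorted.map(s => (s, distinctScores.indexOf(s) + 1))
  }

  def main(args: Array[String]): Unit = {
    val scores = Seq(95, 90, 90, 85) // two rows tie for second place
    println(rowNumber(scores)) // positions 1, 2, 3, 4
    println(rank(scores))      // positions 1, 2, 2, 4
    println(denseRank(scores)) // positions 1, 2, 2, 3
  }
}
```

Because of the gaps, a filter such as `rank <= 3` can return more or fewer than three rows per partition when there are ties, which is worth keeping in mind for top-N queries.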

package com.zg.d03
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
case class StudentScore(name:String,clazz:Int,score:Int)
object SparkSqlOverDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sparksqlover")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(conf).getOrCreate()
    // Only the first sample row survives in the source; the remaining rows were lost
    val arr01 = Array(("a",1,88))
    import spark.implicits._
    val scoreRDD = sc.makeRDD(arr01).map(x=>StudentScore(x._1,x._2,x._3)).toDS
    // Register the Dataset as a temporary view so that it can be queried with SQL
    scoreRDD.createOrReplaceTempView("t_score")
    // Query the t_score data
    spark.sql("select * from t_score").show()
    // Rank the scores within each class. rank() leaves gaps: two rows tied for second place are followed by fourth place
    spark.sql("select name,clazz,score, rank() over( partition by clazz order by score desc ) rownum from t_score ").show()
    // Treat the window function result as a temporary table that holds the score ranking of each class, then take the top three
    spark.sql("select * from (select name,clazz,score, rank() over( partition by clazz order by score desc ) rownum from t_score) t1 where rownum <=3 ").show()
    // Release resources
    spark.stop()
  }
}

SQL explode function

        // SQL-style operation
        val sql =
            """
              |select position.workName as workNames,count(*) as counts
              |from (
              |select explode(data.list) as position
              |from t_position)
              |group by workNames
              |order by counts desc
            """.stripMargin

Spark Streaming is a second-generation stream processing framework: it groups incoming data into micro-batches over short intervals and submits a job for each batch. It is quasi-real-time, with slightly higher latency, on the order of seconds or sub-seconds.

Spark Streaming is an extension of the Spark Core API. It uses the DStream (discretized stream) as its data model to process continuous data streams in memory. In essence, it is RDD-based in-memory computing.

A DStream is essentially a sequence of RDDs.

  • Real time computing and quasi real time computing

Real-time computing: event-driven. Each record triggers processing as soon as it arrives.

Quasi-real-time computing: time-driven. Processing runs at every scheduled interval, whether or not any data has arrived.

SparkStreaming architecture


SparkStreaming operator


The main ones to learn are transform, updateStateByKey, and the window functions.

SparkStreaming basic programming

Entry class StreamingContext

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkStreamingWordCountOps {
    def main(args: Array[String]): Unit = {
        /*
            Initializing a StreamingContext requires at least two parameters: SparkConf and batchDuration.
            SparkConf: the usual Spark configuration.
            batchDuration: the interval between two job submissions. At each interval, the data in the DStream is turned into one batch (an RDD) and submitted.
            In other words, batchDuration determines how often Spark Streaming computes over the data.
         */
        // App name and master are assumed; the original settings were lost
        val conf = new SparkConf().setAppName("SparkStreamingWordCountOps").setMaster("local[*]")
        val duration = Seconds(2)
        val ssc = new StreamingContext(conf, duration)
        // The word-count logic was lost from the original; this socket-based version is an assumed reconstruction
        val lines = ssc.socketTextStream("node01", 9999)
        lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()
        // To run the streaming computation, you must call start()
        ssc.start()
        // start() returns immediately, so call awaitTermination() to keep the program running until stop() is called or an exception occurs
        ssc.awaitTermination()
    }
}


To keep the streaming computation running continuously, you must call the awaitTermination method so that the driver stays alive.

Monitor local

Files that are manually copied or cut into the monitored directory cannot be read. Only files written into the directory through a stream can be read.

SparkStreaming integrates with HDFS

Under normal circumstances, files uploaded with put and files copied with cp can be read, but files moved with mv cannot.

This approach needs no separate Receiver that would occupy a thread, so the master can be set to local.

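object SparkStreamingHDFS {
    def main(args: Array[String]): Unit = {
        // App name and master are assumed; the original settings were lost
        val conf = new SparkConf().setAppName("SparkStreamingHDFS").setMaster("local")
        val duration = Seconds(2)
        val ssc = new StreamingContext(conf, duration)
        // Reading from a local directory: the files must be written through a stream
//        val lines = ssc.textFileStream("file:///E:/data/monitored")
        val lines = ssc.textFileStream("hdfs://node01:9000/data/spark")
        // The rest of the original program was lost; printing each batch is an assumed minimal completion
        lines.print()
        ssc.start()
        ssc.awaitTermination()
    }
}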

SparkStreaming integrates Kafka

Receiver mode:

Spark pulls data from Kafka and stores it in executor memory. If a failure occurs, data held only in memory can be lost. The remedy is to enable the write-ahead log (WAL), so that data received from Kafka is saved both in executor memory and in the WAL. If the in-memory data is lost, it can be recovered from the WAL, but this leads to data redundancy.

Direct mode:

At every batch interval, Spark queries Kafka for the latest offset range of each partition of each topic, and the data itself stays in Kafka. If a computation fails, it can be rerun as long as Kafka still retains the data, without introducing data redundancy.


Features of direct mode: at each batch interval, a batch of data is read from Kafka and then consumed

          Simplified parallelism: number of partitions in the RDD = number of partitions in the topic

          The data stays in Kafka, so there is no data redundancy

          There is no single point of failure


          Exactly-once consumption semantics can be achieved

Integrated coding

earliest: if a partition has a committed offset, consumption starts from that offset; if there is no committed offset, consumption starts from the beginning of the partition  
latest: if a partition has a committed offset, consumption starts from that offset; if there is no committed offset, only data newly produced to the partition is consumed  
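
The two rules above correspond to the `earliest` and `latest` values of Kafka's `auto.offset.reset` setting. As a rough sketch in plain Scala (this is not Kafka client code; the function and parameter names are hypothetical), the starting position for one partition can be modeled like this:

```scala
object OffsetResetSketch {
  // committed: offset committed by the consumer group for this partition, if any
  // beginning: first offset still available in the partition
  // end:       offset just past the last record, where new data will be appended
  def startingOffset(committed: Option[Long], reset: String, beginning: Long, end: Long): Long =
    committed.getOrElse {
      reset match {
        case "earliest" => beginning // no committed offset: start from the beginning
        case "latest"   => end       // no committed offset: only consume new data
        case other      => throw new IllegalArgumentException(s"unsupported auto.offset.reset: $other")
      }
    }
}
```

For example, `OffsetResetSketch.startingOffset(None, "earliest", 0L, 100L)` gives `0`, while `OffsetResetSketch.startingOffset(Some(42L), "latest", 0L, 100L)` gives `42`, because a committed offset always takes precedence over the reset policy.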

(1) Read data from kafka

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
/*
    Spark Streaming reads data from Kafka
            The rate settings all control how fast the streaming program consumes from Kafka:
                max: the maximum number of records read from each partition per second
                    With max=10, 3 partitions, and a batch interval of 2s,
                        the maximum number of records that can be read in one batch is 10 * 3 * 2 = 60
                        If the setting is 0 or not set, there is no upper limit on the rate
                min: the minimum number of records read from each partition per second
              So the rate at which streaming reads data from Kafka falls within [min, max]
    Serialization problem during execution:
        serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord
        Spark has two serialization mechanisms:
            The default is Java serialization, where the class implements the Serializable interface. It is very stable, but its obvious drawback is poor performance
            The other is Kryo serialization, which is much faster. The official figure is up to 10 times the performance of Java serialization. To use it, you only need to register the classes:
                sparkConf.set("spark.serializer", classOf[KryoSerializer].getName) // specify the serializer
                    .registerKryoClasses(Array(classOf[ConsumerRecord[String, String]])) // register the classes to serialize
 */
object StreamingFromKafkaOps {
    def main(args: Array[String]): Unit = {
        // App name and master are assumed; the original settings were lost
        val conf = new SparkConf().setAppName("StreamingFromKafkaOps").setMaster("local[*]")
//            .set("spark.serializer", classOf[KryoSerializer].getName)
//            .registerKryoClasses(Array(classOf[ConsumerRecord[String, String]]))
        // Interval between two streaming computations (batchInterval)
        val batchDuration = Seconds(2) // submit a Spark Streaming job every 2s
        val ssc = new StreamingContext(conf, batchDuration)
        val topics = Set("hadoop")
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "node01:9092,node02:9092,node03:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "spark-kafka-grou-0817",
            "auto.offset.reset" -> "earliest",
            "enable.auto.commit" -> "false"
        )
        /*
            Read data from Kafka
            locationStrategy: location strategy
                LocationStrategies determines how the consumers for given topic partitions are scheduled onto executors.
                Since Kafka 0.10, consumers prefetch data, so keeping cached consumers on the appropriate executors is very important for performance.
                PreferBrokers:
                    Use this only if your executors run on the same nodes as the Kafka brokers.
                PreferConsistent:
                    Use this in most cases; it distributes partitions evenly across all executors.
                PreferFixed:
                    When network or machine performance is uneven, use this to pin specific partitions to specific hosts.
                    Partitions that are not in the map fall back to the PreferConsistent policy.
            consumerStrategy: consumer strategy
                Configures the Kafka consumers created on the driver and on the executors. The interface encapsulates the consumer setup and the related checkpoint data.
                Strategies for subscribing:
                    Subscribe           :  subscribe to multiple topics, passed in as a collection
                    SubscribePattern    :  subscribe to multiple topics by regular expression. For example, the topics aaa, aab, aac, and adc can all be matched by a[a-d]{2}
                    Assign   :   consume specific partitions of specific topics; this is a more fine-grained strategy
         */
        val message:InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
                                                                    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
//        message.print() // printing directly triggers the serialization problem described above
        message.foreachRDD((rdd, bTime) => {
            if(!rdd.isEmpty()) {
                println(s"Time: $bTime")
                rdd.foreach(record => {
                    // The original loop body was lost; printing the record fields is an assumed reconstruction
                    println(s"${record.topic()} ${record.partition()} ${record.offset()} ${record.key()} ${record.value()}")
                })
            }
        })
        ssc.start()
        ssc.awaitTermination()
    }
}
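
The rate arithmetic in the comments above (max=10, 3 partitions, 2s batches, so at most 60 records per batch) can be checked with a small plain-Scala sketch. The names are hypothetical and this is not Spark's own logic. In Spark the per-partition cap is, as far as I know, the `spark.streaming.kafka.maxRatePerPartition` setting, but treat that mapping as an assumption, since the original setting names were lost from the post.

```scala
object KafkaRateSketch {
  // Upper bound on the number of records in one batch, given a
  // per-partition, per-second limit. A limit of 0 (or less) is
  // treated as "no limit" and represented here as None.
  def maxRecordsPerBatch(maxRatePerPartition: Long, partitions: Int, batchSeconds: Long): Option[Long] =
    if (maxRatePerPartition <= 0) None
    else Some(maxRatePerPartition * partitions * batchSeconds)
}
```

With the numbers from the comment, `KafkaRateSketch.maxRecordsPerBatch(10, 3, 2)` yields `Some(60)`, and a rate of `0` yields `None`.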

transform is a transformation operator.

    transform is a transformation operation
        transform(p: (RDD[A]) => RDD[B]): DStream[B]
     A similar operation
        foreachRDD(p: (RDD[A]) => Unit)
     transform is a very important DStream operation: it lets you build operations that DStream itself does not provide. Most DStream operations can be expressed with transform
     For example, map(p: (A) => B)  ->  transform(rdd => rdd.map(p))
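
The relationship between map and transform can be illustrated without Spark. The sketch below is a toy model, not the real API: `ToyStream` is a made-up stand-in for a DStream in which each inner `Seq` plays the role of one batch RDD.

```scala
object TransformSketch {
  // Toy stand-in for a DStream: one Seq per batch (each Seq plays the role of an RDD)
  final case class ToyStream[A](batches: Seq[Seq[A]]) {
    // transform: apply an arbitrary batch-to-batch function to every batch
    def transform[B](f: Seq[A] => Seq[B]): ToyStream[B] = ToyStream(batches.map(f))
    // map expressed through transform, mirroring transform(rdd => rdd.map(p))
    def map[B](p: A => B): ToyStream[B] = transform(batch => batch.map(p))
  }

  // An operation with no per-element equivalent: top N per batch by count
  def topN(stream: ToyStream[(String, Int)], n: Int): ToyStream[(String, Int)] =
    stream.transform(batch => batch.sortBy(-_._2).take(n))
}
```

The `topN` helper shows why transform matters: sorting and truncating a whole batch cannot be written as a per-element map, but it is a one-liner once you have access to the batch as a whole.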

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object Advertising_ranking {
    def main(args: Array[String]): Unit = {
        // Create the program entry point
        val conf: SparkConf = new SparkConf().setAppName("advertising").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc,Seconds(5))
        // Set the log level (the level itself is assumed)
        sc.setLogLevel("WARN")
        // Receive data
        val data: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
        // Split the data
        val spliData: DStream[String] = data.flatMap(_.split(" "))
        // Count each clickstream record once
        val pageAndOne: DStream[(String, Int)] = spliData.map((_,1))
        // Aggregate identical clickstream entries
        val pageAndCount: DStream[(String, Int)] = pageAndOne.reduceByKey(_+_)
        // Operate on the RDD wrapped in each batch of the DStream
        val resultSorted: DStream[(String, Int)] = pageAndCount.transform(rdd => {
            // Sort the data in the RDD in descending order of count
            val sorted: RDD[(String, Int)] = rdd.sortBy(_._2, false)
            // Take the top 3 from the sorted data
            val topThree: Array[(String, Int)] = sorted.take(3)
            // transform must return an RDD, so turn the top 3 back into one
            rdd.sparkContext.makeRDD(topThree)
        })
        // Print the overall ranking
        resultSorted.print()
        // Start Spark Streaming
        ssc.start()
        // Keep running until the program is stopped
        ssc.awaitTermination()
    }
}


import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object UpdateStateByKey_Demo {
  // currentValue: the values for a key in the current batch; historyValue: the previously saved state for that key
  def updateFunc(currentValue:Seq[Int],historyValue:Option[Int]): Option[Int] = {
    val result: Int = currentValue.sum+historyValue.getOrElse(0)
    Some(result)
  }
  def main(args: Array[String]): Unit = {
    // Create the Spark Streaming entry point
    val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc,Seconds(5))
    // Set the log level (the level itself is assumed)
    sc.setLogLevel("WARN")
    // Set a checkpoint directory to save the historical state (the path is assumed)
    ssc.checkpoint("./checkpoint")
    // Receive data
    val file: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
    val spliFile: DStream[String] = file.flatMap(_.split(" "))
    // Count each word once
    val wordAndOne: DStream[(String, Int)] = spliFile.map((_,1))
    // Perform the stateful transformation
    val wordAndCount: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc)
    // Print the running totals
    wordAndCount.print()
    // Start Spark Streaming
    ssc.start()
    // Keep running until the program is stopped
    ssc.awaitTermination()
  }
}
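
To see how the update function carries state from one batch to the next, here is a plain-Scala sketch that replays a few batches by hand. It only models the idea: `applyBatch` and `run` are hypothetical helpers, and the real updateStateByKey additionally relies on checkpointing to persist the state.

```scala
object StateSketch {
  // Same shape as the update function passed to updateStateByKey:
  // (values for a key in this batch, previous state) => new state
  def updateFunc(currentValues: Seq[Int], history: Option[Int]): Option[Int] =
    Some(currentValues.sum + history.getOrElse(0))

  // Apply one batch of (word, 1) pairs to the running state
  def applyBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Keys that already have state are updated even when the batch has no new values for them
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  // Fold a sequence of batches (each a list of words) into the final word counts
  def run(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, words) =>
      applyBatch(state, words.map((_, 1)))
    }
}
```

For the batches `Seq("a", "b", "a")`, `Seq("b")`, and `Seq("a", "c")`, `StateSketch.run` accumulates `a -> 3`, `b -> 2`, `c -> 1`, which is the cumulative count across all batches rather than a per-batch count.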


/**
 * Window operations
 *    A stream is unbounded, so global statistics over all of it are impossible. Instead, we segment the unbounded data set,
 *    and each small segment can be understood as a window.
 *    Spark Streaming is quasi-real-time, micro-batch stream computing, and each batch can itself be seen as a special window.
 * In theory, a window can be defined in two ways: by number of records or by time.
 * Spark Streaming supports only the latter, that is, time-based windows, and two parameters must be provided.
 * One parameter is the length of the window: window_length
 * The other is the computation frequency of the window: sliding_interval, that is, how often the window operation is computed
 * Another interval in a streaming program is batchInterval. How do these intervals relate?
 * batchInterval is how often a Spark job is started, that is, how often a batch of data is submitted to the program
 * Special attention should be paid to:
 *  window_length and sliding_interval must be integer multiples of batchInterval.
 *  Summary:
 *      window operation
 *          Every M units of time, compute over the data produced in the last N units of time
 *          M is called sliding_interval, the sliding frequency of the window
 *          N is called window_length, the length of the window
 *     The window is a sliding window.
 *   When sliding_interval > window_length, there are gaps between windows
 *   When sliding_interval < window_length, the windows overlap
 *   When sliding_interval = window_length, the windows fit together exactly
 *   batchInterval=2s
 *   sliding_interval=4s
 *   window_length=6s
 */
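object  WindowOps {
    def main(args: Array[String]): Unit = {
        // App name and master are assumed; the original settings were lost
        val conf = new SparkConf().setAppName("WindowOps").setMaster("local[*]")
        // Interval between two streaming computations (batchInterval)
        val batchDuration = Seconds(2) // submit a Spark Streaming job every 2s
        val ssc = new StreamingContext(conf, batchDuration)
        val lines = ssc.socketTextStream("node01", 9999)
        val words = lines.flatMap(line => line.split("\\s+"))
        val pairs = words.map(word => (word, 1))
        // Every 4s, count the words received in the last 6s
        val ret = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, windowDuration = Seconds(6), slideDuration = Seconds(4))
        ret.print()
        ssc.start()
        ssc.awaitTermination()
    }
}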
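
The interplay of batchInterval, sliding_interval, and window_length is easier to follow on concrete batches. The following plain-Scala sketch is a simplified model (not Spark's scheduler; `windowedCounts` is a hypothetical helper) that slides a window over a list of batches and counts words in each window.

```scala
object WindowSketch {
  // batches(i) holds the words received in batch i; each batch covers batchSec seconds.
  // Returns, for each time the window fires, the fire time in seconds and the word counts in that window.
  def windowedCounts(batches: Seq[Seq[String]], batchSec: Int, windowSec: Int, slideSec: Int): Seq[(Int, Map[String, Int])] = {
    require(windowSec % batchSec == 0 && slideSec % batchSec == 0,
      "window length and sliding interval must be multiples of the batch interval")
    val batchesPerWindow = windowSec / batchSec
    val batchesPerSlide = slideSec / batchSec
    // The window fires after every batchesPerSlide batches
    (batchesPerSlide to batches.length by batchesPerSlide).map { end =>
      val start = math.max(0, end - batchesPerWindow)
      val counts = batches.slice(start, end).flatten
        .groupBy(identity).map { case (w, ws) => (w, ws.size) }
      (end * batchSec, counts)
    }
  }
}
```

With batchInterval=2s, window_length=6s, sliding_interval=4s and the four batches `Seq("a")`, `Seq("a", "b")`, `Seq("b")`, `Seq("c")`, the window fires at 4s with `a -> 2, b -> 1` and at 8s with `a -> 1, b -> 2, c -> 1`. The second batch is counted in both windows, which is the overlap that occurs whenever sliding_interval < window_length.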

Tags: Big Data kafka Spark

Posted on Wed, 17 Nov 2021 11:07:48 -0500 by Cantaloupe