Kafka
- Kafka data consumption
Within one consumer group, each partition in Kafka is consumed by only one consumer.
Kafka consumers consume data as members of consumer groups. Different groups consume independently of each other. Within the same group, note the following: if a topic has 3 partitions and the group has 3 consumers, each consumer reads one partition; with 2 consumers, one reads a single partition and the other reads two; with more than 3 consumers, at most 3 of them consume data at the same time and the extra consumers sit idle.
- Ordering of data in Kafka
If data is written to multiple partitions, ordering is guaranteed only within each partition; globally the data is unordered.
For global ordering, send all data to a single partition.
- Kafka producer: producing data
Specify topic + value: the data is distributed across partitions in round-robin fashion.
Specify topic + key + value: the partition is chosen by hashing the key; a fixed key always lands in the same partition, while a varying key spreads data across partitions.
Specify topic + partition + key + value: when a partition is specified, all data is sent to that partition.
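As a rough illustration of the three cases above, the following Scala sketch builds a ProducerRecord each way (the broker address localhost:9092 and the topic name test are assumptions for the example; the three constructor overloads are part of the standard Kafka client API).
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerRoutingDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // topic + value: partition chosen by the client (round-robin, or sticky in newer client versions)
    producer.send(new ProducerRecord[String, String]("test", "v1"))
    // topic + key + value: partition chosen by hashing the key
    producer.send(new ProducerRecord[String, String]("test", "user-1", "v2"))
    // topic + partition + key + value: always goes to the given partition (0 here)
    producer.send(new ProducerRecord[String, String]("test", Integer.valueOf(0), "user-1", "v3"))

    producer.close()
  }
}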
The producer sends data to the broker to be stored. To guard against loss of sent data, there are three ack settings:
0: the producer sends data and moves on to the next batch without waiting to learn whether the leader stored it or whether the followers synchronized it;
1: the producer waits until the leader has stored the data successfully, then sends the next batch, regardless of whether the followers have synchronized it;
-1 (all): the producer waits until the leader has stored the data and the followers have synchronized it before sending the next batch.
Producer configuration (producer.properties):
## Kafka brokers
bootstrap.servers=bd-offcn-01:9092,bd-offcn-02:9092,bd-offcn-03:9092
## Serializer for the key
key.serializer=org.apache.kafka.common.serialization.IntegerSerializer
## Serializer for the value
value.serializer=org.apache.kafka.common.serialization.StringSerializer
## Message confirmation mechanism
##   0: just send the message, no confirmation required
##   1: only the leader needs to write the data and confirm; followers can synchronize from the leader afterwards
##   -1|all: the leader must write the data to its local log and the synchronized followers must confirm as well
acks=[0|1|-1|all]
## Buffer size per partition for unsent records
batch.size=1024
## Even if a batch buffer is not full, the request is still sent; to reduce the number of requests,
## configure linger.ms greater than 0, so the request is delayed by up to that many milliseconds
linger.ms=10
## Total cache space available to one producer
buffer.memory=10240
## Number of retries after a send failure
retries=0
The parallelism of Kafka consumption equals the number of partitions of the topic; in other words, the partition count determines the maximum number of consumers in one consumer group that can consume data at the same time.
Offset: the identifier of each message within a partition of a Kafka topic. The offset marks the position of a message inside its partition. Its data type is Long (8 bytes). Offsets are ordered within a partition, but not across partitions.
Multi node partition storage distribution
Replica allocation algorithm:
Sort the n brokers and the partitions to be allocated.
Assign the i-th partition to broker (i mod n).
Assign the j-th replica of the i-th partition to broker ((i + j) mod n).
At the same time, balance must be taken into account: try to distribute partitions and replicas evenly across all nodes.
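A minimal Scala sketch of this assignment rule; it is illustrative only, since Kafka's real assignment logic also uses a random starting index and, in newer versions, rack awareness.
object ReplicaAssignment {
  // Returns, for each partition, the list of broker indices holding its replicas
  def assign(numBrokers: Int, numPartitions: Int, replicationFactor: Int): Seq[Seq[Int]] =
    (0 until numPartitions).map { i =>
      (0 until replicationFactor).map(j => (i + j) % numBrokers)
    }

  def main(args: Array[String]): Unit = {
    // 3 brokers, 4 partitions, 2 replicas each
    assign(3, 4, 2).zipWithIndex.foreach { case (brokers, p) =>
      println(s"partition $p -> brokers ${brokers.mkString(",")}")
    }
  }
}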
Segment file segment
In Kafka, a topic can have multiple partitions, and a partition consists of multiple segments. Each segment has two files: a .log file and a .index file.
The .log file stores the data; the .index file stores a sparse index into that data.
A segment stores 1 GB of data by default. Once a segment reaches 1 GB, a second segment is started, and so on.
First segment file segment:
-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000000000.index
-rw-r--r-- 1 root root 654696 Oct 13 17:14 00000000000000000000.log
Second segment file segment:
-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000004356.index
-rw-r--r-- 1 root root 654696 Oct 13 17:14 00000000000000004356.log
The third segment file segment:
-rw-r--r-- 1 root root 10485760 Oct 13 17:14 00000000000000752386.index
-rw-r--r-- 1 root root 654696 Oct 13 17:14 00000000000000752386.log
Segment file segment naming rules:
A segment is named after the offset of the first message stored in its .log file.
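In other words, the file name is the base offset left-padded with zeros to 20 digits; a small Scala sketch reproduces the names shown above.
object SegmentName {
  // Format a segment base offset as a 20-digit, zero-padded .log file name
  def logFileName(baseOffset: Long): String = f"$baseOffset%020d.log"

  def main(args: Array[String]): Unit = {
    println(logFileName(0L))       // 00000000000000000000.log
    println(logFileName(4356L))    // 00000000000000004356.log
    println(logFileName(752386L))  // 00000000000000752386.log
  }
}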
push and pull in Kafka
Push mode has trouble adapting to consumers with different consumption rates, because the broker decides how fast to send. The goal of push is to deliver messages as fast as possible, which easily overwhelms consumers; the typical symptoms are denial of service and network congestion. Pull mode lets each consumer fetch messages at a rate matched to its own capacity.
The drawback of pull mode is that if Kafka has no data, the consumer may spin in a loop, repeatedly getting empty responses. To avoid this, Kafka consumers pass a timeout parameter when polling: if no data is available, the consumer waits up to that long before returning.
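A minimal Scala sketch of such a poll loop with a timeout (the broker address localhost:9092, group id demo-group and topic test are assumptions for the example):
import java.time.Duration
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object PollWithTimeout {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "demo-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("test"))
    while (true) {
      // If no data is available, poll blocks for up to 1 second before returning an empty batch
      val records = consumer.poll(Duration.ofSeconds(1))
      records.forEach(r => println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}"))
    }
  }
}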
Why is Kafka so fast
- Sequential disk reads and writes
- Use of the OS page cache (PageCache)
- Multiple data directories
Kafka commands
Start the cluster
Start Kafka on nodes hadoop102, hadoop103 and hadoop104 in turn:
[atguigu@hadoop102 kafka]$ bin/kafka-server-start.sh config/server.properties &
[atguigu@hadoop103 kafka]$ bin/kafka-server-start.sh config/server.properties &
[atguigu@hadoop104 kafka]$ bin/kafka-server-start.sh config/server.properties &
Shutdown cluster
[atguigu@hadoop102 kafka]$ bin/kafka-server-stop.sh stop
[atguigu@hadoop103 kafka]$ bin/kafka-server-stop.sh stop
[atguigu@hadoop104 kafka]$ bin/kafka-server-stop.sh stop
View all topics on the current server
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181 --list
Create topic
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181 \
--create --replication-factor 3 --partitions 1 --topic first
Option description: --topic defines the topic name, --replication-factor defines the number of replicas, --partitions defines the number of partitions.
Delete topic
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181 \
--delete --topic first
Send messages
[atguigu@hadoop102 kafka]$ bin/kafka-console-producer.sh \
--broker-list hadoop102:9092 --topic first
>hello world
>atguigu atguigu
Consume messages
[atguigu@hadoop103 kafka]$ bin/kafka-console-consumer.sh \
--zookeeper hadoop102:2181 --from-beginning --topic first
--from-beginning: reads all existing data in the first topic from the beginning. Add this option or not depending on the business scenario.
View the details of a Topic
[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --zookeeper hadoop102:2181 \
--describe --topic first
Start consumer
[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh \
--zookeeper hadoop102:2181 --topic first --consumer.config config/consumer.properties
[atguigu@hadoop103 kafka]$ bin/kafka-console-consumer.sh \
--zookeeper hadoop102:2181 --topic first --consumer.config config/consumer.properties
Kafka API
package com.atguigu.kafka;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Host name and port number of a Kafka server
        props.put("bootstrap.servers", "hadoop103:9092");
        // Wait for responses from all replica nodes
        props.put("acks", "all");
        // Maximum number of attempts to send a message
        props.put("retries", 0);
        // Batch message processing size
        props.put("batch.size", 16384);
        // Request delay
        props.put("linger.ms", 1);
        // Send buffer memory size
        props.put("buffer.memory", 33554432);
        // Key serialization
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Value serialization
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 50; i++) {
            producer.send(new ProducerRecord<String, String>("first", Integer.toString(i), "hello world-" + i));
        }
        producer.close();
    }
}
Custom partition producer
Define a class that implements the Partitioner interface and override its methods (outdated API)
package com.atguigu.kafka;

import java.util.Map;
import kafka.producer.Partitioner;

public class CustomPartitioner implements Partitioner {

    public CustomPartitioner() {
        super();
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // Control which partition is used
        return 0;
    }
}
Kafka consumer API
#1. Address
bootstrap.servers=node01:9092
#2. Serialization
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
#3. A specific topic (order) needs to be specified
#4. Consumer group
group.id=test

public class OrderConsumer {
    public static void main(String[] args) {
        // 1. Connect to the cluster
        Properties props = new Properties();
        props.put("bootstrap.servers", "node01:9092");
        props.put("group.id", "test");
        // The following two lines make the consumer commit offsets automatically
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(props);
        kafkaConsumer.subscribe(Arrays.asList("test"));
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(1000);
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                String value = consumerRecord.value();
                int partition = consumerRecord.partition();
                long offset = consumerRecord.offset();
                String key = consumerRecord.key();
                System.out.println("key:" + key + " value:" + value + " partition:" + partition + " offset:" + offset);
            }
        }
    }
}
Consume data from specified partitions
public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "node01:9092,node02:9092,node03:9092");
    props.put("group.id", "test");
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "1000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
    // Consume only partitions 0 and 1 of the "test" topic
    TopicPartition topicPartition = new TopicPartition("test", 0);
    TopicPartition topicPartition1 = new TopicPartition("test", 1);
    kafkaConsumer.assign(Arrays.asList(topicPartition, topicPartition1));
    while (true) {
        ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(1000);
        for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
            String value = consumerRecord.value();
            int partition = consumerRecord.partition();
            long offset = consumerRecord.offset();
            String key = consumerRecord.key();
            System.out.println("key:" + key + " value:" + value + " partition:" + partition + " offset:" + offset);
        }
        kafkaConsumer.commitSync();
    }
}
Kafka and Flume integration
Flume mainly collects log data (offline or in real time).
Configure the flume.conf file
#Name our source, channel and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#Specify which channel the data collected by our source is sent to
a1.sources.r1.channels = c1
#Specify the collection strategy of our source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/servers/flumedata
a1.sources.r1.deletePolicy = never
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.ignorePattern = ^(.)*\\.tmp$
a1.sources.r1.inputCharset = UTF-8
#Use a memory channel, i.e. all data is buffered in memory
a1.channels.c1.type = memory
#Use a Kafka sink and specify which channel the sink reads data from
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = test
a1.sinks.k1.kafka.bootstrap.servers = node01:9092,node02:9092,node03:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
test
[offcn@bd-offcn-02 kafka]$ bin/kafka-console-consumer.sh \
--topic test \
--bootstrap-server node01:9092,node02:9092,node03:9092 \
--from-beginning

[root@node01 flume]$ bin/flume-ng agent --conf conf --conf-file conf/flume_kafka.conf --name a1 -Dflume.root.logger=INFO,console
Spark
SparkCore
Spark component overview (official website)
General Spark execution flow
The Spark driver is the process that runs the main method of the application. It executes the user code that creates the SparkContext, creates RDDs, applies transformations and triggers actions.
A Spark Executor is a worker process responsible for running the tasks of a Spark job; tasks are independent of one another. Executors are started when the Spark application starts and live for the whole lifetime of the application.
Through its own Block Manager, the Executor provides in-memory storage for RDDs that the user program asks to cache. The RDD is cached directly inside the Executor process, so tasks can take full advantage of cached data to speed up computation at run time.
RDD (Resilient Distributed Dataset) (key point)
- What is an RDD? What are its characteristics? Does it carry data?
RDD: Resilient Distributed Dataset.
Features: immutable and partitioned; its elements can be computed in parallel.
It does not carry the data itself; much like an interface in Java, it carries metadata describing how to compute the data.
- Dependency
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD ("only child").
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD ("multiple children"); this requires a shuffle.
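One way to see the difference is to inspect rdd.dependencies: map yields a narrow OneToOneDependency, while reduceByKey yields a ShuffleDependency. A sketch, assuming a local SparkContext:
import org.apache.spark.{SparkConf, SparkContext}

object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))
    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val mapped  = pairs.map { case (k, v) => (k, v * 2) }  // narrow dependency
    val reduced = pairs.reduceByKey(_ + _)                 // wide dependency (shuffle)

    println(mapped.dependencies)   // List(org.apache.spark.OneToOneDependency@...)
    println(reduced.dependencies)  // List(org.apache.spark.ShuffleDependency@...)
    sc.stop()
  }
}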
WordCount execution flowchart
- RDD operator classification
RDD operators are divided into two types: transformation operators and action operators.
Transformation: conversion operators, lazily evaluated; they only build the lineage without doing any computation. Only when an action is encountered are the transformations driven to execute.
map (one to one), flatMap (one to many), filter (one to N, N in {0, 1}), join, leftOuterJoin, rightOuterJoin, fullOuterJoin, sortBy, sortByKey, groupBy, groupByKey, reduceByKey, sample, union, mapPartitions, mapPartitionsWithIndex, zip, zipWithIndex.
mapPartitions:
//Create an RDD with a specified number of partitions
val rdd: RDD[Int] = sc.parallelize(Array(1,2,3,4), 2)
//Join the data inside each partition with "-"
val result: RDD[String] = rdd.mapPartitions(x => Iterator(x.mkString("-")))
//Print the output
println(result.collect().toBuffer)
mapPartitionsWithIndex:
val rdd: RDD[Int] = sc.parallelize(1 to 16, 4)
//See what data is stored in each partition
val result: RDD[String] = rdd.mapPartitionsWithIndex((index, item) => Iterator(index + ":" + item.mkString(",")))
//Print the output
result.foreach(println)
sample, union and join operators:
sample operator:
sample(withReplacement, fraction, seed): random sampling operator. sample is used to study the data itself in place of a full scan, for example when investigating problems such as data skew; since a full study is impossible, a sample is used to estimate the whole.
withReplacement: Boolean, sampling with or without replacement
fraction: Double, the proportion of the sample relative to the full data set, in [0, 1], e.g. 0.2 or 0.65
seed: Long, the seed of the random number generator; it has a default value and usually does not need to be passed

def sampleOps(sc: SparkContext): Unit = {
  val list = sc.parallelize(1 to 100000)
  val sampled1 = list.sample(true, 0.01)
  println("sampled1 count: " + sampled1.count())
  val sampled2 = list.sample(false, 0.01)
  println("sampled2 count: " + sampled2.count())
}

union operator:
rdd1.union(rdd2) is equivalent to union all in SQL and combines the data of the two RDDs. Note that union is a narrow-dependency operation: if rdd1 has N partitions and rdd2 has M partitions, the union has N + M partitions.

join operator:
rdd1.join(rdd2) is equivalent to the join operation in SQL, e.g. with tables A(id) a and B(aid) b:
select * from A a join B b on a.id = b.aid
Cross join: cross join
  select * from A a cross join B b ====> produces the Cartesian product
Inner join: inner join, returns the intersection of the left and right tables
  select * from A a inner join B b on a.id = b.aid
  or: select * from A a, B b where a.id = b.aid
Outer join: outer join
  Left outer join (left outer join): returns every row of the left table; matching rows of the right table are returned, non-matching ones return null
  select * from A a left outer join B b on a.id = b.aid
  //leftOuterJoin operation
  val result1: RDD[(Int, (String, Option[Int]))] = rdd1.leftOuterJoin(rdd2)
  Right outer join (right outer join): the opposite of the left outer join
  select * from A a right outer join B b on a.id = b.aid
  //rightOuterJoin
  val result2: RDD[(Int, (Option[String], Int))] = rdd1.rightOuterJoin(rdd2)
  Full outer join: full outer join = left outer join + right outer join
  //fullOuterJoin
  val result3: RDD[(Int, (Option[String], Option[Int]))] = rdd1.fullOuterJoin(rdd2)
Prerequisite: to join, the RDDs must be of K-V type.
coalesce and repartition(numPartitions) operators:
coalesce(numPartitions, shuffle=false): merges partitions.
numPartitions: the number of partitions after the operation
shuffle: whether this repartitioning performs a shuffle, which determines whether the operation is a wide (true) or narrow (false) dependency
Typical uses: merging 100 partitions into 10, or repartitioning 2 partitions into 4.
By default coalesce is a narrow-dependency operator. To compress down to a single partition it is usually necessary to set shuffle=true, in which case coalesce becomes a wide-dependency operator.
To increase the number of partitions, shuffle=false has no effect; the partition count can only be increased with shuffle=true.
repartition(numPartitions) is equivalent to coalesce(numPartitions, shuffle = true).
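A small sketch of the difference, assuming a local SparkContext:
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceRepartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("coalesce").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 100, 8)

    println(rdd.coalesce(2).getNumPartitions)                  // 2, no shuffle (narrow)
    println(rdd.coalesce(16).getNumPartitions)                 // still 8: cannot grow without shuffle
    println(rdd.coalesce(16, shuffle = true).getNumPartitions) // 16, with shuffle (wide)
    println(rdd.repartition(4).getNumPartitions)               // 4, same as coalesce(4, shuffle = true)
    sc.stop()
  }
}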
Action: execution operators; they drive the transformations to run and produce output.
count, collect (pulls the task results back to the Driver), foreach (does not collect results back to the Driver; the function passed in by the user is pushed to each node for execution, so results are visible only on the compute nodes), saveAsTextFile(path), reduce, foreachPartition, take, first, takeOrdered(n).
- Which is more efficient, map or mapPartitions? Give an example.
mapPartitions is more efficient.
Example: saving data to a database. With the map operator, a database connection is opened and closed for every single element; with a large amount of data, repeatedly connecting and disconnecting puts great pressure on the database. With mapPartitions, the data of one partition is handled in one call, so only one connection per partition needs to be opened and closed, which puts far less pressure on the database.
- Which is more efficient, reduceByKey or groupByKey? Why?
reduceByKey is more efficient, because it pre-aggregates within each partition before the shuffle, which reduces network transfer.
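A word-count style sketch of the two approaches, assuming a local SparkContext; both give the same result, but groupByKey shuffles every record while reduceByKey shuffles only per-partition partial sums:
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroupByKey {
  def main(args: Array[String]): Unit = {
    val sc    = new SparkContext(new SparkConf().setAppName("rbk-vs-gbk").setMaster("local[*]"))
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))

    // Pre-aggregates inside each partition, then shuffles the partial sums
    val viaReduce = words.reduceByKey(_ + _)
    // Shuffles every (word, 1) record, then sums on the reduce side
    val viaGroup  = words.groupByKey().mapValues(_.sum)

    println(viaReduce.collect().toMap)  // Map(a -> 3, b -> 2, c -> 1)
    println(viaGroup.collect().toMap)
    sc.stop()
  }
}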
- Which of reduceByKey and reduce is an action operator? Which is a transformation operator?
reduceByKey is a transformation operator;
reduce is an action operator.
- Persistence mode
There are two persistence methods: cache and persist
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node keeps the partitions it computes in memory and reuses them in other operations on that dataset (or datasets derived from it). This makes subsequent actions much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
How to persist
An RDD can be marked for persistence with the persist() or cache() method. The first time it is computed in an action, it is kept in the nodes' memory. Spark's cache is fault-tolerant: if any partition of the RDD is lost, it is automatically recomputed using the transformations that originally created it.
The persistence method is rdd.persist() or rdd.cache()
- Relationship and difference between cache and persist
cache() internally calls persist() with no arguments, which caches the data in memory by default (MEMORY_ONLY).
persist() lets you choose a storage level.
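For example (a sketch, assuming a local SparkContext; StorageLevel comes from org.apache.spark.storage):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY)
    rdd.cache()
    // persist() lets you pick another storage level, e.g. spill to disk when memory is short:
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)

    println(rdd.sum())    // first action: computes and caches the partitions
    println(rdd.count())  // second action: reuses the cached partitions
    rdd.unpersist()       // release the cached data when it is no longer needed
    sc.stop()
  }
}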
Shared variable
Supporting general read-write shared variables across tasks would be inefficient. However, Spark provides two restricted types of shared variables for two common usage patterns: broadcast variables and accumulators.
In other words, to share data efficiently between the driver and the executors, Spark provides two restricted shared variables: broadcast variables and accumulators.
Notes on defining broadcast variables
Once a variable is defined as a broadcast variable, it can only be read on the executors; it cannot be modified there.
accumulator
The accumulator is similar to the counter in MapReduce: it accumulates data with certain characteristics. One advantage of the accumulator is that accumulation does not require changing the business logic of the program or triggering an extra action job; doing the same by hand would require new business logic and an additional action job, so the accumulator clearly performs better.
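A small sketch using both shared variables, assuming a local SparkContext: the broadcast variable ships a lookup set to the executors once, and a long accumulator counts filtered-out records without an extra job.
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[*]"))

    val stopWords = sc.broadcast(Set("the", "a", "of"))  // read-only on executors
    val dropped   = sc.longAccumulator("droppedWords")   // written on executors, read on the driver

    val words = sc.parallelize(Seq("the", "spark", "of", "kafka", "a", "flume"))
    val kept = words.filter { w =>
      val keep = !stopWords.value.contains(w)
      if (!keep) dropped.add(1)
      keep
    }

    println(kept.collect().mkString(","))  // spark,kafka,flume
    println("dropped: " + dropped.value)   // read only after the action has run
    sc.stop()
  }
}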
- Is join a narrow dependency or a wide dependency?
join can be either a narrow or a wide dependency.
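One way to see this: if both RDDs have already been partitioned with the same partitioner (and the same number of partitions), the join can reuse that partitioning and is narrow; otherwise both sides must be shuffled and the join is wide. A sketch, assuming a local SparkContext:
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDependencyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-deps").setMaster("local[*]"))
    val p  = new HashPartitioner(4)

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p)
    val right = sc.parallelize(Seq((1, 10), (2, 20))).partitionBy(p)

    // Co-partitioned inputs: the join reuses the existing partitioning (narrow)
    val coPartitioned = left.join(right)
    // Un-partitioned inputs: the join itself must shuffle both sides (wide)
    val shuffled = sc.parallelize(Seq((1, "a"))).join(sc.parallelize(Seq((1, 10))))

    println(coPartitioned.toDebugString)  // the only shuffles in the lineage come from partitionBy
    println(shuffled.toDebugString)       // the join introduces its own shuffle stage
    sc.stop()
  }
}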
RDD partition
Spark currently supports hash partitioning and range partitioning, and users can also implement custom partitioners. Hash partitioning is the default. The partitioner of an RDD determines the number of partitions, which partition each record goes to during a shuffle, and the number of reduce tasks.
Partitioning decisions are made during wide-dependency (shuffle) operations; narrow dependencies are one-to-one with already-determined partitions, so no partitioner needs to be specified.
HashPartitioner
def main(args: Array[String]): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
  val sc = new SparkContext(conf)
  sc.setLogLevel("WARN")
  //Load data
  val rdd = sc.parallelize(List((1,3),(1,2),(2,4),(2,3),(3,6),(3,8)), 8)
  //Partition by hash
  val result: RDD[(Int, Int)] = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
  //Get the partitioner
  println(result.partitioner)
  //Get the number of partitions
  println(result.getNumPartitions)
}
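Since custom partitioners were mentioned above, here is a minimal sketch of one (the class name and the even/odd rule are made up for illustration): it sends even keys to partition 0 and odd keys to partition 1.
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Illustrative custom partitioner: even keys -> partition 0, odd keys -> partition 1
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key match {
    case k: Int => if (k % 2 == 0) 0 else 1
    case _      => 0
  }
}

object CustomPartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("custom-partitioner").setMaster("local[*]"))
    val rdd = sc.parallelize(List((1, "a"), (2, "b"), (3, "c"), (4, "d")))
    val partitioned = rdd.partitionBy(new EvenOddPartitioner)
    //Show which keys ended up in which partition
    partitioned.mapPartitionsWithIndex((idx, it) => Iterator(s"$idx -> ${it.toList}")).foreach(println)
    sc.stop()
  }
}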
- Job
Each action operator produces one job.
- DAG directed acyclic graph
Describes the execution process of RDD.
How is it formed?
The DAG is formed when an action operator is encountered.
Lineage
RDDs support only coarse-grained transformations, i.e. single operations applied to many records. The series of transformations that created an RDD is recorded as its lineage so that lost partitions can be recovered. An RDD's lineage records its metadata and transformation behavior; when some partitions of the RDD are lost, this information is used to recompute and recover them.
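The lineage of an RDD can be inspected with toDebugString, for example (a sketch assuming a local SparkContext):
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))
    val wordCounts = sc.parallelize(Seq("a b", "a c"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Prints the chain of RDDs (the lineage) Spark would replay to rebuild lost partitions
    println(wordCounts.toDebugString)
    sc.stop()
  }
}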
Spark environment startup command
Run the %SPARK_HOME%\bin\spark-shell.cmd script.
Spark distributed environment
sbin/start-all.sh
sbin/stop-all.sh
Submit task & execute procedure
[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node01:7077 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
./examples/jars/spark-examples_2.11-2.4.7.jar \
100
Spark distributed HA environment installation
To configure ZooKeeper-based HA, add the following to spark-env.sh:
Comment out the following:
#SPARK_MASTER_HOST=node01
export SPARK_MASTER_PORT=7077
Add the following (keep it on one line when configuring, otherwise the configuration fails; separate each -D parameter with a space):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node01:2181,node02:2181,node03:2181 -Dspark.deploy.zookeeper.dir=/spark"
Because with HA it is not certain which node the master starts on, comment out export SPARK_MASTER_HOST=node01, synchronize spark-env.sh to the other machines, restart the Spark cluster, and start a master on both node01 and node02.
Task submission & execution procedure:
[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node01:7077,node02:7077 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 2 \
--queue default \
./examples/jars/spark-examples_2.11-2.4.7.jar \
100
Dynamically adding and removing a worker
spark]# sbin/start-slave.sh node01:7077 -c 4 -m 1024M
spark]# sbin/stop-slave.sh node01:7077 -c 4 -m 1024M
Spark on YARN environment
Modify the hadoop configuration file yarn-site.xml
[root@node01 hadoop]$ vi yarn-site.xml
<!-- Whether to start a thread that checks the physical memory used by each task; if a task exceeds its allocation it is killed directly. Default: true -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Whether to start a thread that checks the virtual memory used by each task; if a task exceeds its allocation it is killed directly. Default: true -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Modify spark-env.sh
[root@node01 conf]# vi spark-env.sh
YARN_CONF_DIR=/export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop
HADOOP_CONF_DIR=/export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop
client mode
[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.4.7.jar \
100
cluster mode
[root@node01 spark]# bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.4.7.jar \
100
Spark code writing
WordCount program written by Sparkcore
package com.atguigu

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    //1. Create SparkConf and set the app name
    val conf = new SparkConf().setAppName("WC")
    //2. Create SparkContext, the entry point for submitting a Spark app
    val sc = new SparkContext(conf)
    //3. Use sc to create the RDD and run the transformations and action
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      .saveAsTextFile(args(1))
    //4. Close the connection
    sc.stop()
  }
}
Package to cluster test
bin/spark-submit \
--class WordCount \
--master spark://hadoop102:7077 \
WordCount.jar \
/word.txt \
/out
Efficient write to database
def saveInfoMySQLByForeachPartition(rdd: RDD[(String, Int)]): Unit = {
  rdd.foreachPartition(partition => {
    //This code runs inside one partition, local to that partition
    Class.forName("com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/test"
    val connection = DriverManager.getConnection(url, "mark", "sorry")
    val sql =
      """
        |insert into wordcounts(word, `count`) values(?, ?)
        |""".stripMargin
    val ps = connection.prepareStatement(sql)
    partition.foreach { case (word, count) =>
      ps.setString(1, word)
      ps.setInt(2, count)
      ps.execute()
    }
    ps.close()
    connection.close()
  })
}
Spark SQL
Spark SQL provides two programming abstractions, DataFrame and DataSet, and acts as a distributed SQL query engine.
A DataFrame carries more header information than an RDD: a schema, i.e. the structural constraints on the data.
Dataset
Compared with an RDD, a Dataset provides strong typing (generics), adding a type constraint to each row of data; the official website describes the Dataset in more detail.
- sparkSql query style
There are two styles: DSL style and SQL style.
DSL: analyze data with operators; this requires some programming skill.
SQL: analyze data with SQL statements.
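A short sketch showing the same query in both styles; it assumes the people.json sample file (with name and age fields) used elsewhere in these notes.
import org.apache.spark.sql.SparkSession

object DslVsSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dsl-vs-sql").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = spark.read.json("examples/src/main/resources/people.json")

    // DSL style: operators on the DataFrame
    people.select($"name", $"age").filter($"age" > 20).show()

    // SQL style: register a temporary view and query it with SQL
    people.createOrReplaceTempView("t_people")
    spark.sql("select name, age from t_people where age > 20").show()

    spark.stop()
  }
}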
- schema constraint information
It refers to the structural information of the data (column names and types).
- SparkCore and SparkSql
SparkCore: underlying abstraction RDD; program entry SparkContext
SparkSQL: underlying abstractions DataFrame and DataSet; program entry SparkSession
- RDD,DataFrame,DataSet
DataFrame = RDD (without generics) + schema + SQL + optimization
DataSet = RDD (with generics) + schema + SQL + optimization
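A sketch of converting between the three abstractions (the Person case class and its values are made up for illustration):
import org.apache.spark.sql.SparkSession

case class Person(id: Int, name: String, age: Int)

object RddDfDsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-df-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person(1, "zhangsan", 20), Person(2, "lisi", 30)))

    val df = rdd.toDF()       // RDD -> DataFrame (schema inferred via reflection)
    val ds = rdd.toDS()       // RDD -> Dataset[Person] (keeps the generic type)
    val backToRdd = df.rdd    // DataFrame -> RDD[Row]
    val dfFromDs = ds.toDF()  // Dataset -> DataFrame

    df.printSchema()
    ds.show()
    spark.stop()
  }
}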
File save options
SparkSQL basic programming
Construction of SparkSession
val spark = SparkSession.builder()
  .appName("SparkSQLOps")
  .master("local[*]")
  //.enableHiveSupport() // enables Hive-related operations
  .getOrCreate()
How to build DataFrame
package chapter1

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

object Create_DataFrame {
  def main(args: Array[String]): Unit = {
    //Create the program entry point
    val spark: SparkSession = SparkSession.builder().appName("createDF").master("local[*]").getOrCreate()
    //Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    //Set the console log level
    sc.setLogLevel("WARN")
    //Create a DataFrame from a data source
    val personDF: DataFrame = spark.read.json("examples/src/main/resources/people.json")
    //Show the data
    personDF.show()
  }
}
Convert from RDD:
val personDF: DataFrame = personRDD.toDF("id","name","age")
Create a DataFrame through reflection:
val personDF: DataFrame = personRDD.toDF()
Dynamic programming
val df = spark.createDataFrame(row, schema)

val list = List(
  new Student(1, "Wang Shengpeng", 1, 19),
  new Student(2, "Li Jinbao", 1, 49),
  new Student(3, "Zhang Haibo", 1, 39),
  new Student(4, "Zhang Wenyue", 0, 29)
)
import spark.implicits._
val ds = spark.createDataset[Student](list)
Row: represents a row of records in a two-dimensional table, or a Java object
Data loading
spark.read.format("data file format").load(path)
//Import implicits
import spark.implicits._
//The first way
//Load a json file
val personDF: DataFrame = spark.read.format("json").load("E:\\data\\people.json")
//Load a parquet file
val personDF1: DataFrame = spark.read.format("parquet").load("E:\\data\\people.parquet")
//Load a csv file; csv is special: to keep the header you must call the option method
val person2: DataFrame = spark.read.format("csv").option("header","true").load("E:\\data\\people.csv")
//Load a table from a database
val personDF3: DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/bigdata")
  .option("user", "root")
  .option("password", "root")
  .option("dbtable", "person")
  .load()
spark.read.json(path)
//The second way
//Load a json file
val personDF4: DataFrame = spark.read.json("E:\\data\\people.json")
//Load a parquet file
val personDF5: DataFrame = spark.read.parquet("E:\\data\\people.parquet")
//Load a csv file; to keep the header you must call the option method
val person6: DataFrame = spark.read.option("header","true").csv("E:\\data\\people.csv")
//Load a table from a database
val properties = new Properties()
properties.put("user", "root")
properties.put("password", "root")
val personDF7: DataFrame = spark.read.jdbc("jdbc:mysql://localhost:3306/bigdata", "person", properties)
Data writing (saving)
//The first way
//Save as a json file
personDF.write.format("json").save("E:\\data\\json")
//Save as a parquet file
personDF.write.format("parquet").save("E:\\data\\parquet")
//Save as a csv file; to keep the header, call the option method
personDF.write.format("csv").option("header","true").save("E:\\data\\csv")
//Save as a table in the database
personDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/bigdata")
  .option("user", "root")
  .option("password", "root")
  .option("dbtable", "person")
  .save()
//The second way
//Save as a parquet file
personDF.write.parquet("E:\\data\\parquet")
//Save as a csv file
personDF.write.option("header", "true").csv("E:\\data\\csv")
//Save as a json file
personDF.write.format("json").save("E:\\data\\json")
//Save as a table in the database
val props = new Properties()
props.put("user","root")
props.put("password","root")
personDF.write.jdbc("jdbc:mysql://localhost:3306/bigdata","person",props)
SparkSQL and Hive integration
1. Import Hive's hive-site.xml: add it to the classpath, or put it in $SPARK_HOME/conf.
2. So that the HDFS paths in hive-site.xml can be resolved, hdfs-site.xml and core-site.xml must also be on the classpath.
package chapter5

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object Hive_Support {
  def main(args: Array[String]): Unit = {
    //Create the Spark SQL entry point
    val spark: SparkSession = SparkSession.builder()
      .appName("demo")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    //Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    //Set the log level
    sc.setLogLevel("WARN")
    //Import implicits
    import spark.implicits._
    //List the tables in Hive
    spark.sql("show tables").show()
    //Create a table
    spark.sql("CREATE TABLE person (id int, name string, age int) row format delimited fields terminated by ' '")
    //Load data
    spark.sql("load data local inpath './person.txt' into table person")
    //Query the data in the table
    spark.sql("select * from person").show()
  }
}
User defined function
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SparkSession}

object UDF_Demo {
  def main(args: Array[String]): Unit = {
    //Create the Spark SQL entry point
    val spark: SparkSession = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
    //Get the SparkContext
    val sc: SparkContext = spark.sparkContext
    //Set the log level
    sc.setLogLevel("WARN")
    //Import implicits
    import spark.implicits._
    //Load the file
    val personDF: DataFrame = spark.read.json("E:\\data\\people.json")
    //Show the data
    //personDF.show()
    //Register it as a table
    personDF.createOrReplaceTempView("t_person")
    //The function to apply
    val fun = (x: String) => {
      "Name:" + x
    }
    //Register the addName function
    spark.udf.register("addName", fun)
    //Query
    spark.sql("select name, addName(name) from t_person").show()
    //Release resources
    spark.stop()
  }
}
Window functions
row_number() over (partition by XXX order by XXX)
rank(): ranking with gaps; if there are two second places, the next rank is fourth
dense_rank(): ranking without gaps; if there are two second places, the next rank is third
row_number(): consecutive numbering; rows with equal values still get different numbers
package com.zg.d03

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

case class StudentScore(name: String, clazz: Int, score: Int)

object SparkSqlOverDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("sparksqlover")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().config(conf).getOrCreate()

    val arr01 = Array(("a",1,88), ("b",1,78), ("c",1,95), ("d",2,74), ("e",2,92),
                      ("f",3,99), ("g",3,99), ("h",3,45), ("i",3,53), ("j",3,78))

    import spark.implicits._
    val scoreRDD = sc.makeRDD(arr01).map(x => StudentScore(x._1, x._2, x._3)).toDS
    scoreRDD.createOrReplaceTempView("t_score")

    //Query the t_score table
    spark.sql("select * from t_score").show()
    //Use the window function to get the topN; rank() leaves gaps: two second places are followed by a fourth
    spark.sql("select name,clazz,score, rank() over( partition by clazz order by score desc ) rownum from t_score ").show()
    //Use the windowed query result as a temporary table holding each class's score ranking, then take the top three
    spark.sql("select * from (select name,clazz,score, rank() over( partition by clazz order by score desc ) rownum from t_score) t1 where rownum <=3 ").show()
  }
}
SQL explode function
//SQL-style operation
/*
positionDF.createOrReplaceTempView("t_position")
val sql =
  """
    |select position.workName as workNames, count(*) as counts
    |from (
    |  select explode(data.list) as position
    |  from t_position)
    |group by workNames
    |order by counts desc
  """.stripMargin
spark.sql(sql).show()
*/
SparkStreaming
SparkStreaming is a second-generation streaming framework: it collects data into micro-batches over short intervals and submits a job for each. It is quasi real-time, with slightly higher latency (seconds or sub-second).
SparkStreaming is an extension of the SparkCore API. It uses DStream (discretized stream) as its data model and processes continuous data streams in memory; in essence it is in-memory computation on RDDs.
A DStream is essentially a sequence of RDDs.
- Real time computing and quasi real time computing
Real-time computation: event-driven; each piece of data is processed as soon as it arrives.
Quasi real-time computation: time-driven; processing happens at each time interval, whether or not data has arrived.
SparkStreaming architecture
SparkStreaming operator
Mainly covers transform, updateStateByKey and the window functions.
SparkStreaming basic programming
Entry class StreamingContext
object SparkStreamingWordCountOps {
  def main(args: Array[String]): Unit = {
    /*
      Initializing a StreamingContext requires at least two parameters: SparkConf and batchDuration.
      SparkConf needs no explanation.
      batchDuration: the time interval between two job submissions; each interval the received data is
      turned into a batch -> RDD and one job is submitted for that DStream micro-batch.
      So batchDuration controls how often sparkStreaming computes the data.
    */
    val conf = new SparkConf()
      .setAppName("SparkStreamingWordCount")
      .setMaster("local[*]")
    val duration = Seconds(2)
    val ssc = new StreamingContext(conf, duration)

    //business logic

    //To run the streaming computation, start() must be called
    ssc.start()
    //To keep the program alive after start(), call awaitTermination(); it waits until the program
    //finishes, is stopped with stop(), or fails with an exception.
    ssc.awaitTermination()
  }
}
awaitTermination
To keep the streaming computation running, awaitTermination must be called so that the driver stays resident in the background.
Monitoring a local directory
Files manually copied or moved (cut) into the monitored directory cannot be read; only files written through a stream can be read.
SparkStreaming consolidates HDFS
Normally, files uploaded with put and files copied with cp can be read, but files moved with mv cannot.
This approach needs no extra Receiver thread, so the master can be set to local.
object SparkStreamingHDFS {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.spark_project").setLevel(Level.WARN)

    val conf = new SparkConf()
      .setAppName("SparkStreamingHDFS")
      .setMaster("local")
    val duration = Seconds(2)
    val ssc = new StreamingContext(conf, duration)

    //Reading local data --> the files must be written through a stream
    //val lines = ssc.textFileStream("file:///E:/data/monitored")

    //hdfs
    val lines = ssc.textFileStream("hdfs://node01:9000/data/spark")
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
SparkStreaming integrates Kafka
Receiver mode:
Spark pulls data from Kafka and stores it in executor memory; if the downstream computation fails, data can be lost. The remedy is to enable the WAL (write-ahead log), so that data from Kafka is kept both in executor memory and in the WAL: if the in-memory copy is lost it can be recovered from the WAL, but this introduces data redundancy.
Direct mode:
Every batch interval, Spark reads the latest offset range of each partition of each topic from Kafka; the data itself stays in Kafka. If a computation fails, as long as Kafka retains the data long enough, it can be re-read at any time, with no data redundancy.
Characteristics of direct mode: every batch interval, a batch of data is read from Kafka and then consumed.
Simplified parallelism: number of RDD partitions = number of topic partitions.
The data stays in Kafka, so there is no data redundancy.
There is no single point of failure.
Efficient.
Exactly-once consumption semantics can be achieved.
Integrated coding
earliest
When a partition has a committed offset, consumption starts from that offset; when there is no committed offset, consumption starts from the beginning of the partition.
latest
When a partition has a committed offset, consumption starts from that offset; when there is no committed offset, only data newly produced to the partition is consumed.
(1) Read data from kafka
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

/*
  SparkStreaming reads data from Kafka.

  PerPartitionConfig:
    spark.streaming.kafka.maxRatePerPartition
    spark.streaming.kafka.minRatePerPartition
  Both control the rate at which the streaming program consumes Kafka.
    max: the maximum number of records read per second from each partition.
         With max=10, 3 partitions and a 2s interval, a batch reads at most 10 * 3 * 2 = 60 records.
         If it is 0 or unset, there is no upper limit on the rate.
    min: the minimum number of records read per second from each partition.
  So streaming reads data from Kafka at a rate between [min, max].

  Serialization problem during execution:
    serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord ...)
  Spark has two serialization mechanisms:
    The default is Java serialization (implement the Serializable interface): very stable but poor performance.
    The alternative is high-performance Kryo serialization; the official docs claim roughly 10x the
    performance of Java serialization, and it only needs a declarative registration:
      sparkConf.set("spark.serializer", classOf[KryoSerializer].getName) // choose the serializer
               .registerKryoClasses(Array(classOf[ConsumerRecord[String, String]])) // register the classes to serialize
*/
object StreamingFromKafkaOps {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("StreamingFromKafkaOps")
      // .set("spark.serializer", classOf[KryoSerializer].getName)
      // .registerKryoClasses(Array(classOf[ConsumerRecord[String, String]]))

    //Time interval between two micro-batches, the batchInterval
    val batchDuration = Seconds(2) // submit the sparkStreaming job every 2s
    val ssc = new StreamingContext(conf, batchDuration)

    val topics = Set("hadoop")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node01:9092,node02:9092,node03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-kafka-grou-0817",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> "false"
    )

    /*
      Read data from Kafka.

      locationStrategy: location strategy
        Decides how consumers are scheduled onto executors for a given topic and partition
        (LocationStrategies). Since Kafka 0.10 consumers pre-fetch data, so caching the consumer on an
        appropriate executor is important for performance.
        PreferBrokers: usable when the executors run on the same nodes as the Kafka brokers.
        PreferConsistent: in most cases; distributes partitions evenly across all executors.
        PreferFixed: when network or machine performance is unbalanced, map specific partitions to
        specific executors; partitions not in that map fall back to the PreferConsistent policy.

      consumerStrategy: consumer strategy
        Configures the Kafka consumers created on the driver or the executors; the interface wraps the
        consumer process information and related checkpoint data.
        Subscribe: subscribe to several topics, given as a collection.
        SubscribePattern: subscribe by regular expression, e.g. topics aaa, aab, aac and adc can be
        matched with a[abc]{2}.
        Assign: consume specific partitions of specific topics, a more fine-grained strategy.
    */
    val message: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(topics, kafkaParams)
    )

    // message.print() // printing directly hits the serialization problem described above
    message.foreachRDD((rdd, bTime) => {
      if (!rdd.isEmpty()) {
        println("-------------------------------------------")
        println(s"Time: $bTime")
        println("-------------------------------------------")
        rdd.foreach(record => {
          println(record)
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
transform is a transformation operator.
/*
  transform is a transformation operation:
    transform(p: (RDD[A]) => RDD[B]): DStream[B]
  A similar operation is foreachRDD(p: (RDD[A]) => Unit).

  transform is a very important operation on DStream: it lets you build operations that DStream does not
  provide directly. Most DStream operations can be simulated with transform, for example
  map(p: (A) => B) -> transform(rdd => rdd.map(p: (A) => B)).
*/
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object Advertising_ranking {
  def main(args: Array[String]): Unit = {
    //Create the program entry point
    val conf: SparkConf = new SparkConf().setAppName("advertising").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    //Set the log level
    sc.setLogLevel("WARN")
    //Receive data
    val data: ReceiverInputDStream[String] = ssc.socketTextStream("node01", 9999)
    //Split the data
    val spliData: DStream[String] = data.flatMap(_.split(" "))
    //Count each click-stream log record once
    val pageAndOne: DStream[(String, Int)] = spliData.map((_, 1))
    //Aggregate identical click streams
    val pageAndCount: DStream[(String, Int)] = pageAndOne.reduceByKey(_ + _)
    //Operate on the RDD wrapped inside the DStream
    val resultSorted: DStream[(String, Int)] = pageAndCount.transform(rdd => {
      //Sort the data in the RDD in descending order
      val sorted: RDD[(String, Int)] = rdd.sortBy(_._2, false)
      //Take the top 3 of the ranking
      val topThree: Array[(String, Int)] = sorted.take(3)
      //Print them
      topThree.foreach(println)
      //transform must return a value
      sorted
    })
    //Print the overall ranking
    resultSorted.print()
    //Start sparkStreaming
    ssc.start()
    //Keep it running and wait for the program to be closed
    ssc.awaitTermination()
  }
}
updateStateByKey
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object UpdateStateByKey_Demo {
  def updateFunc(currentValue: Seq[Int], historyValue: Option[Int]): Option[Int] = {
    val result: Int = currentValue.sum + historyValue.getOrElse(0)
    Some(result)
  }

  def main(args: Array[String]): Unit = {
    //Create the sparkStreaming entry point
    val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    //Set the log level
    sc.setLogLevel("WARN")
    //Set a checkpoint directory to save the historical state
    ssc.checkpoint("./999")
    //Receive data
    val file: ReceiverInputDStream[String] = ssc.socketTextStream("node01", 9999)
    //Split the data
    val spliFile: DStream[String] = file.flatMap(_.split(" "))
    //Count each word once
    val wordAndOne: DStream[(String, Int)] = spliFile.map((_, 1))
    //Perform the stateful transformation
    val wordAndCount: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc)
    //Print the output
    wordAndCount.print()
    //Start sparkStreaming
    ssc.start()
    //Keep it running and wait for it to be closed
    ssc.awaitTermination()
  }
}
Window
/**
 * window operations
 *
 * A stream is unbounded, so global statistics are impossible; instead the unbounded data is cut into
 * intervals, and each small interval can be understood as a window.
 * sparkStreaming itself is quasi real-time, small-batch processing, which can be seen as a special
 * window operation.
 *
 * In theory a window can be defined either by a number of records or by time. Spark Streaming supports
 * only the latter, i.e. time-based windows, which need two parameters:
 *   window_length: the length of the window
 *   sliding_interval: how often the window operation is computed
 *
 * There is another interval in a streaming program, the batchInterval: how often a spark job is started,
 * i.e. how often a batch of data is submitted to the program.
 *
 * Special note: window_length and sliding_interval must both be integer multiples of batchInterval.
 *
 * Summary of the window operation: every M (sliding_interval), compute the data produced during the
 * last N (window_length). The window slides:
 *   when sliding_interval > window_length, there are gaps between windows;
 *   when sliding_interval < window_length, windows overlap;
 *   when sliding_interval = window_length, the windows line up exactly.
 *
 * Here: batchInterval = 2s, sliding_interval = 4s, window_length = 6s
 */
object WindowOps {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("WindowOps")
    //Time interval between two micro-batches, the batchInterval
    val batchDuration = Seconds(2) // submit the sparkStreaming job every 2s
    val ssc = new StreamingContext(conf, batchDuration)

    val lines = ssc.socketTextStream("node01", 9999)
    val words = lines.flatMap(line => line.split("\\s+"))
    val pairs = words.map(word => (word, 1))
    val ret = pairs.reduceByKeyAndWindow(_ + _, windowDuration = Seconds(6), slideDuration = Seconds(4))
    ret.print

    ssc.start()
    ssc.awaitTermination()
  }
}