Spark learning road 3 - the core of spark - Advanced RDD
1, Spark optimization
1.1 description of common parameters
## Driver memory. 4g is usually enough when there are no broadcast variables; with broadcast variables, 6g, 8g, 12g, etc. can be set as appropriate
--driver-memory 4g
## Memory of each executor. 4g is usually enough, but jobs that process large volumes of data can easily run out of memory, so apply for more, e.g. 6g
--executor-memory 4g
## Total number of executors. A dozen or a few dozen is enough for ordinary jobs; apply for more (100-200, etc.) when processing massive data, such as data above 100 GB or at TB scale
--num-executors 15
## Number of cores per executor, i.e. the number of tasks in each executor. Set to 2 here, meaning 2 tasks share the executor memory set above; each task handles a map or reduce portion of the work.
## Parallelism is num-executors * executor-cores. A YARN cluster usually has an upper limit on resource requests, e.g. executor-memory * num-executors < 400G, so pay attention to this when tuning
--executor-cores 2
## Default parallelism of a Spark job. 500~1000 is usually appropriate. If it is not set, Spark derives the number of tasks from the number of underlying HDFS blocks, which can reduce parallelism and leave resources underutilized. Setting this parameter to 2~3 times num-executors * executor-cores is recommended
--spark.default.parallelism 200
## Maximum proportion of executor memory used for RDD persistence. The default is 0.6
--spark.storage.memoryFraction 0.6
## Proportion of executor memory that a task can use for aggregation after pulling the output of the previous stage's tasks during a shuffle. The default is 0.2.
## If the memory used for shuffle aggregation exceeds this 20% limit, the excess data is spilled to intermediate disk files, reducing shuffle performance
--spark.shuffle.memoryFraction 0.2
## During execution an executor may use more memory than executor-memory, so extra memory is reserved for it;
## spark.yarn.executor.memoryOverhead represents this part of memory
--spark.yarn.executor.memoryOverhead 1G
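For reference, most of these settings can also be supplied programmatically. The sketch below is illustrative only: the values are placeholders that mirror the flags above, and driver memory in particular normally still has to be passed to spark-submit rather than set in code.

import org.apache.spark.sql.SparkSession

object TuningConfigSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative values only; tune per cluster, data volume and YARN limits
    val spark = SparkSession.builder()
      .appName("tuning-sketch")
      .config("spark.executor.memory", "4g")              // --executor-memory
      .config("spark.executor.instances", "15")           // --num-executors
      .config("spark.executor.cores", "2")                // --executor-cores
      .config("spark.default.parallelism", "200")
      .config("spark.storage.memoryFraction", "0.6")
      .config("spark.shuffle.memoryFraction", "0.2")
      .config("spark.yarn.executor.memoryOverhead", "1g")
      .getOrCreate()
    // ... job logic ...
    spark.stop()
  }
}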
1.2 common programming suggestions for spark
- Avoid creating duplicate RDDs; reuse the same RDD for the same data as much as possible
- Avoid shuffle operators as much as possible, because the shuffle is the most performance-consuming part of Spark. Operators such as reduceByKey, join, distinct and repartition trigger a shuffle; prefer non-shuffle operators of the map class
- Use aggregateByKey and reduceByKey instead of groupByKey: the first two pre-aggregate, combining records with the same key locally on each node first, so when other nodes pull that key from all nodes, disk IO and network overhead are greatly reduced (see the sketch after this list)
- repartition is applicable to RDD[V], and partitionBy is applicable to RDD[K,V]
- Use mapPartitions instead of plain map, and foreachPartition instead of foreach
- A coalesce after a filter operation can reduce the number of RDD partitions
- If an RDD is reused, especially if it takes a long time to compute, it is recommended to cache it. If each partition of the RDD consumes a lot of memory, it is recommended to turn on the Kryo serialization mechanism (said to save 2 to 5 times the space). If the memory overhead is still large, set the cache storage level to MEMORY_AND_DISK_SER
- Try to avoid processing all logic in one Transformation, and try to decompose it into operations such as map and filter
- When unioning multiple RDDs, avoid rdd.union(rdd).union(rdd).union(rdd); union is only suitable for merging two RDDs at a time. To merge many RDDs, use SparkContext.union(Array(RDD)) and avoid deeply nested unions, which produce overly long call chains, long runtimes, and StackOverflowError
- Group/join/XXXByKey and other operations in spark can specify the number of partitions without using repartition and partitionBy functions
- Try to ensure that the amount of data processed by each task in each Stage is > 128M
- If two RDDs are joined and one of them is small, a Broadcast Join can be used: collect the small RDD's data into driver memory and broadcast it to the executors processing the other RDD
- When taking the Cartesian product of two RDDs, pass the small RDD in as the parameter, e.g. BigRDD.cartesian(smallRDD)
- If you need to broadcast a large object to the remote end as a lookup dictionary, use more executor cores and larger executor memory. If the memory-hungry object were instead stored in an external system, executor-cores = 1 and executor-memory = m (the default is 2g) would run normally; when the large dictionary occupies size (GB), set executor-memory to about 2 * size and executor-cores to size / m (rounded up)
- If the object is too large to BroadCast to the remote end, and the requirement is to index the key in the small RDD according to the key in the large RDD, you can use zipPartitions to hash join. For the specific principle, refer to the shuffle process in the next section
- If you need to sort after repartitioning, use repartitionAndSortWithinPartitions directly; it is more efficient than repartitioning and then sorting separately because it sorts while shuffling
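To make a couple of these tips concrete, here is a minimal sketch (paths and data are hypothetical) of replacing groupByKey with reduceByKey and of a broadcast join where the small RDD is collected to the driver:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object RddTipsSketch {
  // Word count: reduceByKey pre-aggregates on each node before the shuffle
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)        // prefer this over groupByKey().mapValues(_.sum)

  // Broadcast join: ship the small RDD to every executor instead of shuffling the big one
  def broadcastJoin(sc: SparkContext,
                    big: RDD[(Int, String)],
                    small: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
    val smallMap = sc.broadcast(small.collectAsMap())
    big.flatMap { case (k, v) =>
      smallMap.value.get(k).map(sv => (k, (v, sv)))   // inner-join semantics
    }
  }
}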
2, Two dependencies in Spark
2.1 wide dependence
Each Partition of the child RDD depends on all partitions of the parent RDD
Reorganize and reduce a single RDD based on key, such as groupByKey and reduceByKey
Joining and reorganizing two RDDs based on key, such as join
It is recommended to cache RDDs produced by heavy shuffles; this avoids the cost of recomputation after a failure.
2.2 narrow dependence
Each Partition of the child RDD depends on only one or a few Partitions of the parent RDD
Operators whose input and output are one-to-one and whose resulting RDD keeps the partition structure unchanged, mainly map and flatMap
Operators whose input and output are one-to-one but whose resulting RDD changes the partition structure, such as union and coalesce
Operators that select some elements from the input, such as filter, distinct, subtract and sample
2.3 DAG
DAG(Directed Acyclic Graph) is called a directed acyclic graph. The original RDD forms a DAG through a series of transformations. The DAG is divided into different stages according to the dependencies between RDDS. For narrow dependencies, the conversion processing of partition is completed in the Stage. For wide dependency, due to the existence of Shuffle, the next calculation can only be started after the parent RDD processing is completed. Therefore, wide dependency is the basis for dividing stages.
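A small sketch of how a pipeline is cut into stages (the file path is hypothetical): the narrow flatMap/map transformations stay in one stage, while reduceByKey introduces a shuffle and therefore a new stage.

// Stage 0: textFile -> flatMap -> map (all narrow dependencies)
val pairs = sc.textFile("data/words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
// reduceByKey is a wide dependency: the DAG is cut here and Stage 1 begins
val counts = pairs.reduceByKey(_ + _)
counts.collect()   // the action submits one job made of the two stages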
2.4 task division (key points)
RDD task division has the following levels: Application, Job, Stage and Task.
- Application: initializing a SparkContext generates an Application
- Job: an Action operator generates a Job
- Stage: a Job is divided into different Stages according to the dependencies between RDDs; a new Stage is created whenever a wide dependency is encountered
- Task: a Stage corresponds to a TaskSet; each piece of work sent to an Executor after Stage division is a Task
Note: each level of Application -> Job -> Stage -> Task has a one-to-N relationship.
2.5 RDD caching and checkpointing
2.5.1 RDD cache
- The RDD cache stores earlier computation results via the persist method or the cache method; by default persist() keeps the data in the JVM heap (for an RDD the default storage level is MEMORY_ONLY)
- However, these two methods are not cached immediately when called, but when the subsequent action operator is triggered, the RDD will be cached in the memory of the computing node for later use
- The cache may be lost, or data stored in memory may be evicted because memory is insufficient. The RDD cache fault-tolerance mechanism still guarantees correct execution even if the cache is lost: the lost data is recomputed through the RDD's chain of transformations. Since each partition of an RDD is relatively independent, only the lost partitions need to be recomputed, not all of them (see the sketch below)
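A minimal caching sketch (the input path is hypothetical), using persist with an explicit storage level as suggested above:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("data/app.log")
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)  // cache() is just persist(MEMORY_ONLY)
errors.count()    // the first action materializes the cache
errors.take(10)   // later actions reuse the cached partitions
errors.unpersist()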
2.5.2 RDD checkpoint
- Besides persistence for saving data, Spark also provides a checkpoint mechanism. A checkpoint essentially writes an RDD to disk as a check point. Fault tolerance is normally done through lineage, but an overly long lineage makes fault tolerance too expensive, so it is better to add checkpoint fault tolerance at intermediate points: if a node fails and partitions are lost later, the lineage is redone starting from the checkpointed RDD rather than from the beginning, which reduces the overhead. Checkpointing implements this by writing the data to the HDFS file system.
- RDD caching and checkpointing are generally used when the RDD lineage is long.
- The cache is stored in memory, while the checkpoint is stored on disk (see the sketch below).
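A minimal checkpoint sketch (the HDFS directory is hypothetical); caching before checkpointing avoids recomputing the RDD when it is written out:

sc.setCheckpointDir("hdfs://linux121:9000/spark/checkpoint")  // hypothetical path
val cleaned = sc.textFile("data/app.log")
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
cleaned.cache()        // avoid computing the RDD twice (once for the job, once for the checkpoint)
cleaned.checkpoint()   // truncates the lineage; data is written to HDFS on the next action
cleaned.count()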
2.6 accumulator and broadcast variables
2.6.1 accumulator
- The accumulator is a write-only shared variable
- Accumulators are used to aggregate information. Normally, functions passed to Spark (for example a condition passed to map() or filter()) can use variables defined in the driver program, but each task running on the cluster gets a new copy of those variables, and updating the copies does not affect the corresponding variables in the driver. If we want shared variables that are updated across all partitions during processing, accumulators meet this requirement (see the sketch below).
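A minimal accumulator sketch (the data is hypothetical): tasks only add to it, and the driver reads the final value after an action.

val badRecords = sc.longAccumulator("badRecords")
val rows = sc.parallelize(Seq("a,1", "b,2", "broken"))
val parsed = rows.flatMap { line =>
  val fields = line.split(",")
  if (fields.length != 2) { badRecords.add(1); None }
  else Some((fields(0), fields(1).toInt))
}
parsed.count()                                  // accumulator values are reliable only after an action
println(s"bad records: ${badRecords.value}")    // read on the driver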
2.6.2 broadcast variables
- Broadcast variables are read-only shared variables
- Broadcast variables are used to distribute large objects efficiently: a large read-only value is sent to all worker nodes once and used by one or more Spark operations. For example, they are very convenient if your application needs to send a large read-only lookup table to all nodes, or a large feature vector in a machine learning algorithm. Without broadcasting, if the same variable is used in multiple parallel operations, Spark sends it separately for each task (see the sketch below).
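A minimal broadcast sketch (the lookup table is hypothetical): the dictionary is shipped to each executor once instead of once per task.

val deptNames = Map(1 -> "sales", 2 -> "engineering")   // assumed small lookup table
val bcDepts = sc.broadcast(deptNames)
val events = sc.parallelize(Seq((1, 100.0), (2, 50.0), (1, 30.0)))
val named = events.map { case (deptId, amount) =>
  (bcDepts.value.getOrElse(deptId, "unknown"), amount)  // read-only access on the executor
}
named.collect().foreach(println)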
3, Principle of Spark
3.1 Spark operation process
The Spark application runs on the cluster as a collection of independent processes and interacts with the cluster through the SparkContext object created in the main method of the driver program.
- Through the SparkContext, Spark requests the resources needed for execution (CPU, memory, etc.) from the Cluster Manager
- Cluster manager allocates the resources required for application execution and creates an Executor on the Worker node
- SparkContext sends program code (jar package or python file) and Task tasks to the Executor for execution, and collects the results to the Driver
3.2 concepts involved in spark operation
3.2.1 Application: Spark application
It refers to the Spark application written by users, including Driver function code and Executor code running on multiple nodes in the distributed cluster.
A Spark application consists of one or more Jobs, as shown in the following figure:
3.2.2 Driver: Driver
The Driver in Spark runs the Main() function of the Application above and creates the SparkContext. The purpose of creating the SparkContext is to prepare the running environment of the Spark application. In Spark, the SparkContext is responsible for communicating with the ClusterManager, applying for resources, and allocating and monitoring tasks; when the Executors have finished running, the Driver is also responsible for closing the SparkContext. Usually the SparkContext is used to represent the Driver, as shown in the following figure.
3.2.3 Cluster Manager: Resource Manager
It refers to the external service that acquires resources on the cluster. Commonly used ones are: Standalone, Spark's native resource manager, where the Master is responsible for resource allocation; Hadoop Yarn, where the ResourceManager in Yarn is responsible for resource allocation; and Mesos, where the Mesos Master is responsible for resource management.
3.2.4 Executor: executor
The Executor is a process that runs on a Worker node for an Application. It is responsible for running Tasks and storing data in memory or on disk. Each Application has its own batch of Executors, as shown in the following figure.
3.2.5 Worker: calculation node
Any node in the cluster that can run Application code, similar to the NodeManager node in Yarn. In Standalone mode it refers to the Worker nodes configured through the slaves file, in Spark on Yarn mode it refers to the NodeManager nodes, and in Spark on Mesos mode it refers to the Mesos Slave nodes, as shown in the following figure.
3.2.6 DAGScheduler: directed acyclic graph scheduler
Divides the DAG into Stages and submits them to the TaskScheduler in the form of TaskSets; responsible for splitting the Job into batches of Tasks with dependencies across different Stages. One of its most important jobs is computing the dependencies between Jobs and Tasks and determining the scheduling logic. It is instantiated during SparkContext initialization; one SparkContext corresponds to one DAGScheduler.
3.2.7 TaskScheduler: task scheduler
Submits TaskSets to the Worker (cluster) to run and reports the results; responsible for the physical scheduling of each specific Task. As shown in the figure:
3.2.8 Job: job
A computing JOB composed of one or more scheduling stages; Parallel computing composed of multiple tasks is often spawned by Spark Action. A JOB contains multiple RDDS and various operations acting on the corresponding RDDS. As shown in the figure:
3.2.9 Stage: scheduling stage
Scheduling stage corresponding to a Task set; each Job is divided into groups of Tasks, and each group of Tasks is called a Stage, or TaskSet. A Job is divided into multiple Stages; Stages come in two types, ShuffleMapStage and ResultStage. As shown in the figure:
Multiple Jobs and multiple Stages in an Application: a Spark Application can trigger many Jobs because of its different Actions, so an Application can contain many Jobs, each composed of one or more Stages. Later Stages depend on earlier ones, i.e. a later Stage runs only after the Stages it depends on have been computed.
Division basis: Stages are divided at wide dependencies; operators such as reduceByKey and groupByKey produce wide dependencies.
Core algorithm: backtrack from the back to the front, join this stage in case of narrow dependency, and perform stage segmentation in case of wide dependency. The Spark kernel will start from the RDD that triggers the Action operation and push back. First, it will create a stage for the last RDD, and then continue to push back. If it is found that there is a wide dependency on an RDD, it will create a new stage for the wide dependency RDD, which is the last RDD of the new stage. Then, by analogy, continue to push backward, and divide the stage according to the narrow dependency or wide dependency until all RDDS are traversed.
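The backtracking idea can be sketched roughly as follows; this is a simplified illustration of the dependency walk, not the actual DAGScheduler code.

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Walk back from the final RDD: narrow dependencies stay in the current stage,
// while a ShuffleDependency (wide) marks the last RDD of a new parent stage.
def shuffleParents(finalRdd: RDD[_]): Seq[RDD[_]] = {
  val parents = mutable.ArrayBuffer[RDD[_]]()
  var toVisit: List[RDD[_]] = List(finalRdd)
  while (toVisit.nonEmpty) {
    val current = toVisit.head
    toVisit = toVisit.tail
    current.dependencies.foreach {
      case shuffle: ShuffleDependency[_, _, _] => parents += shuffle.rdd   // cut a new stage here
      case narrow                              => toVisit = narrow.rdd :: toVisit // same stage
    }
  }
  parents
}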
3.2.10 TaskSet: task set
A task set composed of a group of tasks that are associated but have no Shuffle dependency on each other. As shown in the figure:
Tips:
1) Create a TaskSet for a Stage;
2) Create a Task for each RDD partition of the Stage; the multiple Tasks are encapsulated into a TaskSet
3.2.11 Task: task
Work tasks sent to an Executor; The smallest processing flow unit on a single partitioned dataset (a single stage is divided into multiple tasks according to the number of partitions of operation data). As shown in the figure.
Summary:
4, Spark's Shuffle
5, Spark SQL
5.1 Spark SQL overview
5.1.1 Spark SQL official overview
5.1.1.1 official website address
5.1.1.2 what is Spark SQL
Spark SQL is a module used by spark to process structured data
Spark SQL also provides a variety of usage methods, including DataFrames API and DataSets API
But no matter which API or programming language is used, they are all based on the same execution engine, so you can switch between the different APIs at will; each has its own strengths
5.1.1.3 features of spark SQL
- Easy integration: SQL queries mix seamlessly with Spark programs, and the API can be used from Java, Scala, Python, R and other languages
- Unified data access: connect to any data source in the same way
- Compatible with Hive: supports Hive HQL syntax and is compatible with Hive (metastore, SQL syntax, UDFs, serialization and deserialization mechanisms)
- Standard data connections: industry-standard JDBC or ODBC connections can be used
5.1.1.4 advantages and disadvantages of spark SQL
- Advantages: clear expression, low difficulty, easy to learn
- Disadvantages: complex analyses lead to deeply nested SQL; machine learning is difficult to express
5.1.2 Spark SQL data abstraction
5.1.2.1 DataFrame
Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: besides the data itself it also records the structure information of the data, namely the schema.
At the same time, similar to hive, DataFrame also supports nested data types (struct,array,map).
From the perspective of API ease of use, DataFrame API provides a set of high-level relational operations, which is more friendly and lower threshold than functional RDD API.
5.1.2.2 DataSet
Compared with RDD, it saves more description information and is conceptually equivalent to a two-dimensional table in a relational database
It is an extension of DataFrame API and the latest data abstraction of spark
A user-friendly API style, with both compile-time type-safety checks and the query-optimization features of DataFrame
DataSet supports encoders; when off-heap data needs to be accessed, it can avoid deserializing the whole object, which improves efficiency
The sample class is used to define the structure information of data in the DataSet. The name of each attribute in the sample class is directly mapped to the field name in the DataSet
Calling the DataSet method will generate a logical plan, which will be optimized by the spark optimizer, and finally generate a physical plan, which will be submitted to the cluster for operation
DataSet includes the functionality of DataFrame; the two were unified in Spark 2.0. A DataFrame is represented as DataSet[Row], i.e. DataFrame is simply DataSet[Row]
5.1.2.3 creation of DF and DS
- Creation of a DS
- Created by range
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.{Dataset, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    // range creates a DS
    val ds: Dataset[lang.Long] = spark.range(5, 100, 3)
    // The show method displays 20 values by default
    ds.orderBy(desc("id")).show()
    // Statistical information
    ds.describe().show()
    // Display statistics using the underlying rdd
    println(ds.rdd.map(_.toInt).stats)
    // Display the number of partitions
    println(ds.rdd.getNumPartitions)
    // Display schema information
    ds.printSchema()

    // Close resources
    spark.stop()
  }
}

+---+
| id|
+---+
| 98|
| 95|
| 92|
| 89|
| 86|
| 83|
| 80|
| 77|
| 74|
| 71|
| 68|
| 65|
| 62|
| 59|
| 56|
| 53|
| 50|
| 47|
| 44|
| 41|
+---+
only showing top 20 rows

+-------+------------------+
|summary|                id|
+-------+------------------+
|  count|                32|
|   mean|              51.5|
| stddev|28.142494558940577|
|    min|                 5|
|    max|                98|
+-------+------------------+

(count: 32, mean: 51.500000, stdev: 27.699278, max: 98.000000, min: 5.000000)
8
root
 |-- id: long (nullable = false)
- Create DS from collection
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.{Dataset, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
case class OPerson(name: String, age: Int, height: Int)

object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    import spark.implicits._
    val seq1 = Seq(OPerson("Tom", 22, 180), OPerson("Jackie", 29, 177))
    val ds: Dataset[OPerson] = spark.createDataset(seq1)
    ds.printSchema()
    ds.show()

    val seq2 = Seq(("Marry", 23, 172), ("Lily", 22, 167))
    val ds2: Dataset[(String, Int, Int)] = spark.createDataset(seq2)
    ds2.show()
    ds2.printSchema()

    // Close resources
    spark.stop()
  }
}

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- height: integer (nullable = false)

+------+---+------+
|  name|age|height|
+------+---+------+
|   Tom| 22|   180|
|Jackie| 29|   177|
+------+---+------+

+-----+---+---+
|   _1| _2| _3|
+-----+---+---+
|Marry| 23|172|
| Lily| 22|167|
+-----+---+---+

root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)
 |-- _3: integer (nullable = false)
- Creation of a DF
- DF created from an RDD
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
case class OPerson(name: String, age: Int, height: Int)

object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    import spark.implicits._
    val seq2 = Seq(("Marry", 23, 172), ("Lily", 22, 167))
    val value: RDD[(String, Int, Int)] = sc.parallelize(seq2)
    val frame: DataFrame = value.toDF("name", "age", "height")
    frame.printSchema()
    frame.show()

    // Close resources
    spark.stop()
  }
}

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- height: integer (nullable = false)

+-----+---+------+
| name|age|height|
+-----+---+------+
|Marry| 23|   172|
| Lily| 22|   167|
+-----+---+------+
- Create a DF from a file
- A DF created from a formatted file (a formatted file is one that carries a schema, such as a JSON file)
{"empno":1001,"ename":"zhangsan","job":"salesman","mgr":1002,"hiredate":"2010-09-11","sal":5000,"comm":500,"deptno":1} {"empno":1002,"ename":"lisi","job":"manager","hiredate":"2009-09-01","sal":13000,"comm":10000,"deptno":1} {"empno":1003,"ename":"wangwu","job":"wenyuan","mgr":1008,"hiredate":"2010-09-11","sal":5000,"comm":500,"deptno":2} {"empno":1004,"ename":"zhaoliu","job":"wenyuan","mgr":1008,"hiredate":"2011-09-11","sal":5000,"comm":500,"deptno":2} {"empno":1005,"ename":"zhuqi","job":"salesman","mgr":1002,"hiredate":"2012-08-11","sal":5000,"comm":500,"deptno":1} {"empno":1006,"ename":"ford","job":"analyst","mgr":1002,"hiredate":"2014-07-12","sal":5000,"comm":500,"deptno":1} {"empno":1007,"ename":"adams","job":"clerk","mgr":1008,"hiredate":"2013-06-13","sal":500,"comm":500,"deptno":2} {"empno":1008,"ename":"jack","job":"manager","hiredate":"2007-09-18","sal":13000,"comm":8000,"deptno":2}
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
case class OPerson(name: String, age: Int, height: Int)

object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    val df: DataFrame = spark.read.json("data/emp.json")
    // spark.read.json("data/emp.json") == spark.read.format("json").load("data/emp.json")
    df.show()
    df.printSchema()

    // Close resources
    spark.stop()
  }
}

+-----+------+-----+--------+----------+--------+----+-----+
| comm|deptno|empno|   ename|  hiredate|     job| mgr|  sal|
+-----+------+-----+--------+----------+--------+----+-----+
|  500|     1| 1001|zhangsan|2010-09-11|salesman|1002| 5000|
|10000|     1| 1002|    lisi|2009-09-01| manager|null|13000|
|  500|     2| 1003|  wangwu|2010-09-11| wenyuan|1008| 5000|
|  500|     2| 1004| zhaoliu|2011-09-11| wenyuan|1008| 5000|
|  500|     1| 1005|   zhuqi|2012-08-11|salesman|1002| 5000|
|  500|     1| 1006|    ford|2014-07-12| analyst|1002| 5000|
|  500|     2| 1007|   adams|2013-06-13|   clerk|1008|  500|
| 8000|     2| 1008|    jack|2007-09-18| manager|null|13000|
+-----+------+-----+--------+----------+--------+----+-----+

root
 |-- comm: long (nullable = true)
 |-- deptno: long (nullable = true)
 |-- empno: long (nullable = true)
 |-- ename: string (nullable = true)
 |-- hiredate: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: long (nullable = true)
 |-- sal: long (nullable = true)
- Create DF from text file
Directly using val frame: DataFrame = spark.read.text("data/emp.data") gives a result with only a single value column, which is inconvenient to use later. Therefore this approach is not recommended; converting an RDD into a DF is recommended instead
+--------------------+
|               value|
+--------------------+
|empno,ename,job,m...|
|1001,zhangsan,sal...|
|1002,lisi,manager...|
|1003,wangwu,wenyu...|
|1004,zhaoliu,weny...|
|1005,zhuqi,salesm...|
|1006,ford,analyst...|
|1007,adams,clerk,...|
|1008,jack,manager...|
+--------------------+
There are two ways to convert an RDD into a DF:
- Reflection: RDD[case class].toDF()
Note: the disadvantage of this method is that a case class supports at most 22 fields, so it cannot handle too many fields, and the fields must be known in advance.
1001,zhangsan,salesman,lisi,2010-09-11,5000,500,1
1002,lisi,manager,,2009-09-01,13000,10000,1
1003,wangwu,wenyuan,jack,2010-09-11,5000,500,2
1004,zhaoliu,wenyuan,jack,2011-09-11,5000,500,2
1005,zhuqi,salesman,lisi,2012-08-11,5000,500,1
1006,ford,analyst,lisi,2014-07-12,5000,500,1
1007,adams,clerk,jack,2013-06-13,500,500,2
1008,jack,manager,,2007-09-18,13000,8000,2
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
case class OPerson(name: String, age: Int, height: Int)
case class Emp(empno: Int, ename: String, job: String, mgr: String, hiredate: String, comm: Double, salary: Double)

object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    import spark.implicits._
    val empDF: DataFrame = sc.textFile("data/emp.data")
      .map(_.split(","))
      .map(x => Emp(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toDouble, x(6).toDouble))
      .toDF()
    empDF.show()
    empDF.printSchema()

    // Close resources
    spark.stop()
  }
}

+-----+--------+--------+----+----------+-------+-------+
|empno|   ename|     job| mgr|  hiredate|   comm| salary|
+-----+--------+--------+----+----------+-------+-------+
| 1001|zhangsan|salesman|lisi|2010-09-11| 5000.0|  500.0|
| 1002|    lisi| manager|    |2009-09-01|13000.0|10000.0|
| 1003|  wangwu| wenyuan|jack|2010-09-11| 5000.0|  500.0|
| 1004| zhaoliu| wenyuan|jack|2011-09-11| 5000.0|  500.0|
| 1005|   zhuqi|salesman|lisi|2012-08-11| 5000.0|  500.0|
| 1006|    ford| analyst|lisi|2014-07-12| 5000.0|  500.0|
| 1007|   adams|   clerk|jack|2013-06-13|  500.0|  500.0|
| 1008|    jack| manager|    |2007-09-18|13000.0| 8000.0|
+-----+--------+--------+----+----------+-------+-------+

root
 |-- empno: integer (nullable = false)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: string (nullable = true)
 |-- hiredate: string (nullable = true)
 |-- comm: double (nullable = false)
 |-- salary: double (nullable = false)
- Programmatically: RDD[Row] + schema, or RDD[Class] + classOf[Class]
1001,zhangsan,salesman,lisi,2010-09-11,5000,500,1
1002,lisi,manager,,2009-09-01,13000,10000,1
1003,wangwu,wenyuan,jack,2010-09-11,5000,500,2
1004,zhaoliu,wenyuan,jack,2011-09-11,5000,500,2
1005,zhuqi,salesman,lisi,2012-08-11,5000,500,1
1006,ford,analyst,lisi,2014-07-12,5000,500,1
1007,adams,clerk,jack,2013-06-13,500,500,2
1008,jack,manager,,2007-09-18,13000,8000,2
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import java.lang

/**
 * @className Create_DS_DF
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 13:47
 */
case class OPerson(name: String, age: Int, height: Int)
case class Emp(empno: Int, ename: String, job: String, mgr: String, hiredate: String, comm: Double, salary: Double)

object Create_DS_DF {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    // Read the file as an RDD
    val value: RDD[Array[String]] = sc.textFile("data/emp.data").map(_.split(","))
    // Convert the original RDD into RDD[Row]
    val emp: RDD[Row] = value.map(x => Row(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toDouble, x(6).toDouble, x(7).toInt))

    // Define the schema
    // StructType is a case class; the parameter passed in is Array[StructField]
    // StructField is a case class; the parameters passed in are name, type and nullable (the field name, its type and whether it may be null)
    // For string fields pass StringType, for int fields pass IntegerType
    val structType = StructType(Array(
      StructField("empno", IntegerType, nullable = false),
      StructField("ename", StringType, nullable = true),
      StructField("job", StringType, nullable = true),
      StructField("mgr", StringType, nullable = true),
      StructField("hiredate", StringType, nullable = true),
      StructField("comm", DoubleType, nullable = false),
      StructField("salary", DoubleType, nullable = false),
      StructField("deptid", IntegerType, nullable = false)
    ))

    // Create the DF
    val df: DataFrame = spark.createDataFrame(emp, structType)
    df.show()

    // Close resources
    spark.stop()
  }
}

+-----+--------+--------+----+----------+-------+-------+------+
|empno|   ename|     job| mgr|  hiredate|   comm| salary|deptid|
+-----+--------+--------+----+----------+-------+-------+------+
| 1001|zhangsan|salesman|lisi|2010-09-11| 5000.0|  500.0|     1|
| 1002|    lisi| manager|    |2009-09-01|13000.0|10000.0|     1|
| 1003|  wangwu| wenyuan|jack|2010-09-11| 5000.0|  500.0|     2|
| 1004| zhaoliu| wenyuan|jack|2011-09-11| 5000.0|  500.0|     2|
| 1005|   zhuqi|salesman|lisi|2012-08-11| 5000.0|  500.0|     1|
| 1006|    ford| analyst|lisi|2014-07-12| 5000.0|  500.0|     1|
| 1007|   adams|   clerk|jack|2013-06-13|  500.0|  500.0|     2|
| 1008|    jack| manager|    |2007-09-18|13000.0| 8000.0|     2|
+-----+--------+--------+----+----------+-------+-------+------+
5.1.2.4 commonalities and differences of RDD, DataFrame and DataSet
- Commonalities
  - All three are distributed resilient datasets on the Spark platform, which make processing very large data convenient
  - All three are lazy: they are not executed immediately during creation and transformation (for example map), and only start traversing and executing when an Action operator such as foreach is encountered
  - All three automatically cache operations according to Spark's memory situation, so even with large data volumes there is no need to worry about memory overflow
  - All three have the concept of partitions
  - When operating on DataFrame and Dataset, the implicits must be imported: import spark.implicits._
  - Both DataFrame and Dataset can use pattern matching to obtain the value and type of each field
- Differences
  - DataSet = DataFrame + type = RDD + structure + type
  - DataFrame = RDD + structure (see the sketch below)
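A small spark-shell-style sketch of the difference in practice (assuming the SparkSession is named spark as in the examples below, and a hypothetical Person case class):

import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._

case class Person(name: String, age: Int, height: Int)

val ds: Dataset[Person] = Seq(Person("zhangsan", 22, 178)).toDS()
val df: DataFrame = ds.toDF()                // DataFrame is just Dataset[Row]
ds.map(p => p.age + 1)                       // typed: fields checked at compile time
df.map(row => row.getAs[Int]("age") + 1)     // untyped Row: fields looked up by name at runtime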
5.1.2.5 conversion among RDD, DS and DF
package cn.lagou.sparksql

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

/**
 * @className RDD_DF_DS
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 10:20
 */
case class Person(name: String, age: Int, height: Int)

object RDD_DF_DS {
  def main(args: Array[String]): Unit = {
    // Prepare environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("spark_sql_demo")
      .master("local[*]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("warn")

    // Load data
    val lines: RDD[String] = sc.textFile("data/person2.csv")
    // Process data
    val personRDD: RDD[Person] = lines.map(line => {
      val arr: Array[String] = line.split(",")
      Person(arr(0), arr(1).toInt, arr(2).toInt)
    })

    // Conversion operations
    // This import is required when programming in the IDE;
    // the prefix in prefix.implicits._ must match the SparkSession val name,
    // e.g. val spark = SparkSession.builder().appName("Demo1").master("local[*]").getOrCreate()
    import spark.implicits._
    // RDD -> DF
    val df: DataFrame = personRDD.toDF()
    // RDD -> DS
    val ds: Dataset[Person] = personRDD.toDS()
    // DF -> RDD; note that a DF has no generic type, so the RDD element type is Row
    val rdd: RDD[Row] = df.rdd
    // DS -> RDD
    val rdd1: RDD[Person] = ds.rdd
    // DF -> DS
    val ds1: Dataset[Person] = df.as[Person]
    // DS -> DF
    val df1: DataFrame = ds.toDF()

    // Output results
    df.printSchema()
    df.show()
    ds.printSchema()
    ds.show()
    rdd.foreach(println)
    rdd1.foreach(println)

    // Close resources
    spark.stop()
  }
}

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- height: integer (nullable = false)

+--------+---+------+
|    name|age|height|
+--------+---+------+
|zhangsan| 22|   178|
|    lisi| 25|   175|
|  wangwu| 22|   170|
+--------+---+------+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- height: integer (nullable = false)

+--------+---+------+
|    name|age|height|
+--------+---+------+
|zhangsan| 22|   178|
|    lisi| 25|   175|
|  wangwu| 22|   170|
+--------+---+------+

[wangwu,22,170]
[zhangsan,22,178]
[lisi,25,175]
Person(wangwu,22,170)
Person(zhangsan,22,178)
Person(lisi,25,175)
5.2 Spark SQL related operations
5.3 user defined functions in spark SQL
5.3.1 UDF user defined functions
5.3.1.1 definitions
A UDF (user-defined function) is the most basic kind of function: it performs per-field transformations in SQL and involves no aggregation, for example converting a date type to a string type or formatting a field.
5.3.1.2 usage
zhangsan,22,178
lisi,25,175
wangwu,22,170
package cn.lagou.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * @className UdfDemo
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 16:10
 */
object UdfDemo {
  case class UdfPerson(name: String, age: Int, height: Int)

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getCanonicalName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val personRDD: RDD[String] = sc.textFile("data/person.txt")

    /**
     * Register a UDF:
     * "toString" is the name the custom function is referenced by in SQL;
     * (str: String) => str + "I am UDF Custom function" is the function body, an anonymous function
     */
    spark.udf.register("toString", (str: String) => str + "I am UDF Custom function")

    // Import implicit conversions
    import spark.implicits._
    // Convert the RDD to a DF using reflection
    val df: DataFrame = personRDD.map(_.split(",")).map(line => UdfPerson(line(0), line(1).toInt, line(2).toInt)).toDF()
    // Register the DF as a table
    df.createOrReplaceTempView("person")
    // Query with Spark SQL; toString is our custom UDF
    spark.sql("select toString(name),age,height from person").show()

    spark.stop()
  }
}

+---------------------------------+---+------+
|              UDF:toString(name)|age|height|
+---------------------------------+---+------+
|zhangsan I am UDF Custom function| 22|   178|
|    lisi I am UDF Custom function| 25|   175|
|  wangwu I am UDF Custom function| 22|   170|
+---------------------------------+---+------+
5.3.2 UDAF user defined aggregate function
5.3.2.1 definitions
The UDAF mechanism lets users define custom aggregation functions for Spark SQL that aggregate over data sets, similar to built-in functions such as max(), min() and count(); the custom function's behaviour is determined by the specific business need. Because DataFrame is weakly typed and DataSet is strongly typed, custom UDAFs also come in two implementations, a weakly typed one and a strongly typed one.
5.3.2.2 usage
- For weak type usage, you need to inherit UserDefinedAggregateFunction and implement its methods
package cn.lagou.sparksql

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, IntegerType, StructField, StructType}

/**
 * @className UDAFDemo1
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 16:37
 */
object UDAFDemo1 extends UserDefinedAggregateFunction {
  // Input schema; :: Nil collects the StructFields into a List
  override def inputSchema: StructType = StructType(StructField("age", IntegerType) :: Nil)

  // Buffer schema: the variables shared within each partition
  override def bufferSchema: StructType = StructType(StructField("sum", IntegerType) :: StructField("count", IntegerType) :: Nil)

  // Output data type of the UDAF
  override def dataType: DataType = IntegerType

  // Whether the function is deterministic: the same input always produces the same output
  override def deterministic: Boolean = true

  // Initialize the shared variables within a partition
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    // Initialize the sum of ages on each partition to 0
    buffer(0) = 0
    // Initialize the count of people on each partition to 0
    buffer(1) = 0
  }

  // Called for every record aggregated within a partition
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Merge the incoming record with the result accumulated so far
    // buffer(0) is the sum of ages defined above, i.e. the sum of ages on this partition
    buffer(0) = buffer.getInt(0) + input.getInt(0)
    // buffer(1) is the count defined above, i.e. the number of people on this partition
    buffer(1) = buffer.getInt(1) + 1
  }

  // Merge the results of different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    // buffer1(0) accumulates the sum of ages across all partitions
    buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
    // buffer1(1) accumulates the number of people across all partitions
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }

  // Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getInt(0) / buffer.getInt(1)
  }
}
package cn.lagou.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * @className UDAFDemo1Main
 * @Description $description
 * @Author liut
 * @Date 2021/10/28 16:47
 */
case class UDAFPerson(name: String, age: Int)

object UDAFDemo1Main {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getCanonicalName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Get an RDD from the file
    val personRDD: RDD[String] = sc.textFile("data/person.txt")

    import spark.implicits._
    // Convert the RDD to a DataFrame using reflection
    val personDF: DataFrame = personRDD.map(_.split(",")).map(line => UDAFPerson(line(0), line(1).toInt)).toDF()

    spark.udf.register("UDAFDemo1Main", UDAFDemo1)
    personDF.createOrReplaceTempView("person")
    spark.sql("select UDAFDemo1Main(age) from person").show()

    spark.stop()
  }
}

+---------------+
|udafdemo1$(age)|
+---------------+
|             23|
+---------------+
- Strongly typed usage: you need to extend Aggregator and implement its methods. Since it is strongly typed, case class objects are necessarily involved
package cn.lagou.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Column, DataFrame, Dataset, Encoder, Encoders, SparkSession, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator

/**
 * @className SafeUDAFDemo
 * @Description $description
 * @Author liut
 * @Date 2021/10/18 16:11
 */
// Input type
case class Sales(id: Int, name1: String, sales: Double, discount: Double, name2: String, sTime: String)
// Buffer type, i.e. the intermediate aggregation state
case class SalesBuffer(var sales2019: Double, var sales2020: Double)

class TypeSafeUDAF extends Aggregator[Sales, SalesBuffer, Double] {
  // Define the initial value
  override def zero: SalesBuffer = SalesBuffer(0.0, 0.0)

  // Merge within a partition
  override def reduce(buffer: SalesBuffer, input: Sales): SalesBuffer = {
    val sales: Double = input.sales
    val year: String = input.sTime.take(4)
    year match {
      case "2019" => buffer.sales2019 += sales
      case "2020" => buffer.sales2020 += sales
      case _ => println("!ERROR")
    }
    buffer
  }

  // Merge between partitions
  override def merge(b1: SalesBuffer, b2: SalesBuffer): SalesBuffer = {
    SalesBuffer(b1.sales2019 + b2.sales2019, b1.sales2020 + b2.sales2020)
  }

  // Compute the final value
  override def finish(reduction: SalesBuffer): Double = {
    if (math.abs(reduction.sales2019) < 0.0000001) 0.0
    else (reduction.sales2020 - reduction.sales2019) / reduction.sales2019
  }

  // Define the encoders
  override def bufferEncoder: Encoder[SalesBuffer] = Encoders.product
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object SafeUDAFDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getCanonicalName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val sales = Seq(
      Sales(1, "Widget Co", 1000.00, 0.00, "AZ", "2019-01-02"),
      Sales(2, "Acme Widgets", 1000.00, 500.00, "CA", "2019-02-01"),
      Sales(3, "Widgetry", 1000.00, 200.00, "CA", "2020-01-11"),
      Sales(4, "Widgets R Us", 2000.00, 0.00, "CA", "2020-02-19"),
      Sales(5, "Ye Olde Widget", 3000.00, 0.00, "MA", "2020-02-28")
    )

    import spark.implicits._
    val ds = spark.createDataset(sales)
    // ds.show()
    val rate: TypedColumn[Sales, Double] = new TypeSafeUDAF().toColumn.name("rate")
    ds.select(rate).show()
  }
}

+----+
|rate|
+----+
| 2.0|
+----+
5.4 data sources in spark SQL
The default data source of Spark SQL is the Parquet format. When the data source is a Parquet file, Spark SQL can easily perform all operations. The default data source format can be changed via spark.sql.sources.default.
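A minimal sketch with hypothetical paths, showing that no explicit format is needed for Parquet:

// Parquet is the default, so load()/save() without format() read and write Parquet
val usersDF = spark.read.load("data/users.parquet")   // same as spark.read.parquet(...)
usersDF.select("name", "age")
  .write.mode("overwrite")
  .save("data/users_out.parquet")
// The default format itself can be changed at runtime, e.g.
// spark.conf.set("spark.sql.sources.default", "json")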
5.4.1 reading and saving of documents
- read file
// The usual way to read json files
spark.read.json("file path")
// The usual way to read files (local files)
spark.read.format("json").load("file path")
// Read files on HDFS
spark.read.format("json").load("hdfs://hdfs-ip:port/file path/file name")
- Write file to local or HDFS
// Write files locally
df.write.format("json").save("storage path")
// Write files to HDFS
df.write.format("json").save("hdfs path")
// When writing a file, the save mode can be specified
df.write.format("json").mode("append").save("path")
// mode: error (default), append (append), overwrite (overwrite), ignore (ignore if data exists)
5.4.2 MYSQL reading and saving
- Read data from database
spark.read.format("jdbc") // Set the connection url of the database .option("url", "jdbc:mysql://8.140.21.123:3306/ebiz") // Set database tables to access .option("dbtable", "user") // Set login user name .option("user", "hive") // Set login password .option("password", "12345678") .load()
package cn.lagou.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}

object AccessMysql {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getCanonicalName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val df: DataFrame = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://8.140.21.123:3306/ebiz")
      .option("dbtable", "user")
      .option("user", "hive")
      .option("password", "12345678")
      .load()
    df.show()

    spark.stop()
  }
}

+---+---------+--------------------+----------------+-----------+-------------------+-------------------+
| id| username|            password|           email|      phone|        create_time|        update_time|
+---+---------+--------------------+----------------+-----------+-------------------+-------------------+
| 22|    admin|21232F297A57A5A74...| admin@lagou.com|13811110000|2020-09-02 14:34:12|2020-09-02 14:34:12|
| 23|Zhang San|202CB962AC59075B9...|zhangsan@163.com|13211111111|2020-09-02 14:36:16|2020-09-02 14:36:16|
| 24|     adsf|202CB962AC59075B9...|        1@qq.com|01081290817|2020-09-02 14:37:12|2020-09-02 14:37:12|
+---+---------+--------------------+----------------+-----------+-------------------+-------------------+
- Write data to database
package cn.lagou.sparksql

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import java.util.Properties

object AccessMysql {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName(this.getClass.getCanonicalName).master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val df: DataFrame = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://8.140.21.123:3306/ebiz")
      .option("dbtable", "user")
      .option("user", "hive")
      .option("password", "12345678")
      .load()
    df.show()

    val url = "jdbc:mysql://linux123:3306/ebiz?useUnicode=true&characterEncoding=UTF-8"
    val conn = new Properties()
    val driver = "com.mysql.jdbc.Driver"
    conn.setProperty("user", "hive")
    conn.setProperty("password", "12345678")
    conn.setProperty("driver", driver)

    // When the target table does not exist it is created automatically, but the created table's
    // default encoding is latin1, so inserting Chinese fails until the encoding is changed
    df.write.mode(saveMode = SaveMode.Append)
      .jdbc(url, "user_bak", conn)

    spark.stop()
  }
}

// With the latin1 encoding, the Chinese username is written back garbled as "??":
+---+--------+--------------------+----------------+-----------+-------------------+-------------------+
| id|username|            password|           email|      phone|        create_time|        update_time|
+---+--------+--------------------+----------------+-----------+-------------------+-------------------+
| 22|   admin|21232F297A57A5A74...| admin@lagou.com|13811110000|2020-09-02 14:34:12|2020-09-02 14:34:12|
| 23|      ??|202CB962AC59075B9...|zhangsan@163.com|13211111111|2020-09-02 14:36:16|2020-09-02 14:36:16|
| 24|    adsf|202CB962AC59075B9...|        1@qq.com|01081290817|2020-09-02 14:37:12|2020-09-02 14:37:12|
+---+--------+--------------------+----------------+-----------+-------------------+-------------------+
5.4.3 Spark operation hive database
5.4.3.1 basic introduction
By default, spark comes with hive. You can directly write spark.sql("...") to operate the built-in hive database
5.4.3.2 using external hive
- Using spark-shell to operate Hive
- Delete the Hive built into Spark, i.e. delete the metastore_db and spark-warehouse folders
- Copy the hive-site.xml file of the external Hive into the spark/conf directory of the project
- Restart spark-shell
- At this point spark.sql("...") accesses the external Hive
- You can also use the bin/spark-sql command to operate Hive
- Accessing Hive from a compiled program (IDE code)
- Add the Hive dependency
<dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-hive_2.12</artifactId> <version>2.4.5</version> </dependency>
- Copy hive's configuration file to the resources directory
<configuration> <!-- hive metastore Service address --> <property> <name>hive.metastore.uris</name> <!--Configuring 121 and 123--> <value>thrift://linux121:9083,thrift://linux123:9083</value> </property> </configuration>
- Start hive's metastore service
- Create a program to access hive
package cn.lagou.sparksql

import org.apache.spark.sql.SparkSession

object AccessHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Demo1").master("local[*]")
      .enableHiveSupport()
      // spark uses the same convention as hive to write parquet data
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("warn")

    spark.sql("show databases").show()

    spark.close()
  }
}

+------------+
|databaseName|
+------------+
|         ads|
|     default|
|         dim|
|         dwd|
|         dws|
|         ods|
|        test|
|         tmp|
+------------+