Spark Learning Journey: Spark SQL

Article directory

1. Spark SQL overview

  • Spark SQL is a Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.
  • Why learn Spark SQL?
    We have already learned Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster, greatly simplifying MapReduce programming; the drawback is that MapReduce executes relatively slowly. Spark SQL came into being to address this: it converts SQL into RDD operations and submits them to the cluster, where they execute much faster (see the minimal sketch after this list). Its selling points:
    • Easy integration
    • Unified data access
    • Compatible with Hive
    • Standard data connections
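
To make this concrete, here is a minimal sketch (not from the original post) of a query going through Spark SQL; it builds a small DataFrame from an in-memory collection instead of an external file, so all names and values are illustrative:

import org.apache.spark.sql.SparkSession

object SparkSqlHello {
  def main(args: Array[String]): Unit = {
    // Spark SQL entry point
    val spark = SparkSession.builder()
      .appName("spark sql hello")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame directly from a local collection
    val df = Seq((1, "tom", 5), (2, "jerry", 3)).toDF("id", "name", "age")
    df.createOrReplaceTempView("people")

    // The SQL text is parsed, optimized, and executed as RDD operations on the cluster
    spark.sql("select name, age from people where age > 4").show()

    spark.stop()
  }
}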

2. The relationship and differences between RDD, DataFrame and Dataset in Spark
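
A minimal sketch (the local SparkSession and the Person case class are invented for this example) of how the three abstractions convert into one another: an RDD of objects becomes a DataFrame with .toDF(), a DataFrame becomes a typed Dataset with .as[Person], and .rdd goes back to an RDD of Rows.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Person(id: Int, name: String, age: Int)

object RddDfDsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd df ds").master("local").getOrCreate()
    import spark.implicits._

    // RDD: a distributed collection of objects with no schema attached
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person(1, "tom", 5), Person(2, "jerry", 3)))

    // RDD -> DataFrame: the schema is derived from the case class fields
    val personDF: DataFrame = personRDD.toDF()

    // DataFrame -> Dataset: rows become strongly typed Person objects again
    val personDS: Dataset[Person] = personDF.as[Person]

    // DataFrame -> RDD: back to an untyped RDD of Rows
    val rowRDD: RDD[Row] = personDF.rdd

    personDS.filter(_.age > 4).show()
    println(rowRDD.count())

    spark.stop()
  }
}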

3. Overview of DataFrame

  • Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: besides the data itself, it also records the structure of that data, i.e. its schema. Like Hive, DataFrame supports nested data types (struct, array, and map). In terms of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Similar to the DataFrames of R and pandas, Spark's DataFrame inherits the development experience of traditional single-machine data analysis.
  • An RDD is a distributed collection of Java objects, whereas a DataFrame is a distributed collection of Row objects.

3.1. Official API
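
The official API centers on untyped DataFrame operations; a small sketch (column names and data invented for illustration) of the common printSchema/select/filter/groupBy calls:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataFrameApiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe api").master("local").getOrCreate()
    import spark.implicits._

    val df = Seq(("tom", 5), ("jerry", 3), ("kitty", 6)).toDF("name", "age")

    df.printSchema()                  // show the inferred schema
    df.select(col("name")).show()     // project a single column
    df.filter(col("age") > 4).show()  // keep rows with age > 4
    df.groupBy("age").count().show()  // aggregate by age

    spark.stop()
  }
}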

3.2. Infer Schema by reflection

import org.apache.spark.sql.{DataFrame, SparkSession}

case class Person(id: Int, name: String, age: Int)

/**
 * Infer Schema by reflection
 */
object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    // Initialization
    val spark = SparkSession.builder()
      .appName("spark sql basic example")
      .master("local")
      .getOrCreate()
    // RDD --> DataFrame
    import spark.implicits._
    val peopleDF : DataFrame = spark.sparkContext.textFile("hdfs://192.168.1.28:9000/person.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt))
      .toDF()
    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")

    val valueDF: DataFrame = spark.sql("select id, name, age from people")
    // Access a column by index
    valueDF.map(p => "name: " + p(1)).show()
    // Access a column by field name
    valueDF.map(p => "name: " + p.getAs[String]("name")).show()

    spark.stop()
  }
}

3.3. Specify Schema directly through StructType

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

/**
 * Specify Schema directly through StructType
 */
object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    // Initialize environment
    val spark = SparkSession.builder().master("local").appName("spark sql").getOrCreate()
    // Convert file content to RDD
    val peopleRDD: RDD[String] = spark.sparkContext.textFile("hdfs://192.168.1.28:9000/person.txt")
    // Define the schema as a string of field names
    val schemaString = "id name age"

    val fields: Array[StructField] = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // Convert each line into a Row of string fields
    val rowRDD: RDD[Row] = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim, attributes(2)))
    val peopleDF: DataFrame = spark.createDataFrame(rowRDD, schema)

    peopleDF.createOrReplaceTempView("people")

    spark.sql("select * from people").show()

  }
}

4. Connecting Spark SQL to MySQL

Spark SQL can create a DataFrame by reading data from a relational database over JDBC. After performing a series of computations on the DataFrame, the results can also be written back to the relational database.

4.1. Load data from MySQL (Spark Shell mode)
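
A minimal sketch for the spark-shell, assuming the same MySQL instance and credentials as the write example in 4.2 and an existing person table. Start the shell with the MySQL driver on the classpath (e.g. spark-shell --jars mysql-connector-java-5.1.35.jar); inside the shell the SparkSession is already available as spark:

// Connection properties; user and password are placeholders
val prop = new java.util.Properties()
prop.put("user", "root")
prop.put("password", "123456")

// Load the person table from MySQL into a DataFrame over JDBC
val personDF = spark.read.jdbc("jdbc:mysql://mini3:3306/bigdata", "person", prop)

personDF.printSchema()
personDF.show()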

4.2. Write data to MySQL

package cn.xiaofan.spark

import java.util.Properties

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object JdbcRDD {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MySQL-Demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Create an RDD by parallelizing an in-memory collection
    val personRDD = sc.parallelize(Array("1 tom 5", "2 jerry 3", "3 kitty 6")).map(_.split(" "))
    // Specify the schema of each field directly through StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map the RDD of string arrays to an RDD of Rows
    val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
    // Apply the schema to the row RDD to build a DataFrame
    val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Properties holding the database connection settings
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "123456")
    // Append the data to the MySQL table
    personDataFrame.write.mode("append").jdbc("jdbc:mysql://mini3:3306/bigdata", "bigdata.person", prop)
    // Stop the SparkContext
    sc.stop()
  }
}

5. Connecting Spark SQL to Hive

  • link

  • Gradle build file: note that an error is reported when the hadoop-client dependency is bumped to 2.9.2

version '1.0.0'

apply plugin: 'java'  // apply the Java plugin
apply plugin: 'scala' // apply the Scala plugin

sourceCompatibility = 1.8

repositories {
    // Maven Central may not be reachable from some networks in China, so two mirror repositories that are usually accessible are listed here
    maven {
        url 'http://maven.aliyun.com/nexus/content/groups/public/'
    }
    maven {
        url 'https://maven.ibiblio.org/maven2/'
    }
}
dependencies {

    compile group: 'org.scala-lang', name: 'scala-library', version: '2.12.6'
    compile group: 'org.apache.spark', name: 'spark-core_2.12', version: '2.4.1'
    compile group: 'org.apache.spark', name: 'spark-sql_2.12', version: '2.4.1'
    compile group: 'org.apache.spark', name: 'spark-hive_2.12', version: '2.4.1'
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.7.6'
    compile group: 'mysql', name: 'mysql-connector-java', version: '5.1.35'
    compile group: 'org.postgresql', name: 'postgresql', version: '9.4-1201-jdbc41'
    testCompile group: 'org.scalatest', name: 'scalatest_2.12', version: '3.0.5'
}
  • main program
import org.apache.spark.sql.SparkSession

object TestSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark Hive")
      .master("local")
      .config("hadoop.home.dir", "/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()
    spark.sql("select * from default.s_province").show()
  }
}

  • Some errors encountered while reading from Hive

    • MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection
    • Solution
    • IDEA console output garbled (encoding issue): https://blog.csdn.net/qq_/article/details/80945140
  • A second small example of Spark reading from Hive

import org.apache.spark.sql.SparkSession

object TestSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark Hive")
      .master("local")
      .config("hadoop.home.dir", "/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()
//    spark.sql("select * from default.s_province").show()

    import spark.implicits._
    import spark.sql

    sql("create table if not exists db_test.src (sid int, sname string) row format delimited fields terminated by '-' stored as textfile")
    sql("LOAD DATA LOCAL INPATH 'D://FifthScala/src/main/resources/kv1.txt' INTO TABLE db_test.src")

    // Queries are expressed in HiveQL
    sql("SELECT * FROM db_test.src").show()

  }
}

6. The relationship between Spark RDD, SchemaRDD (DataFrame) and Spark SQL
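
As a rough illustration (not from the original post): SchemaRDD was the pre-1.3 name of what is now the DataFrame, and whether a query is written as SQL text or as DataFrame calls, Spark SQL plans it the same way and ultimately runs it as RDD operations. The sketch below, over an invented in-memory table, prints the plan of both forms and exposes the underlying RDD:

import org.apache.spark.sql.SparkSession

object SqlVsDataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql vs dataframe").master("local").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "tom", 5), (2, "jerry", 3)).toDF("id", "name", "age")
    df.createOrReplaceTempView("people")

    // The same logical query written as SQL text and as DataFrame calls
    val viaSql = spark.sql("select name from people where age > 4")
    val viaApi = df.filter($"age" > 4).select("name")

    // Both go through the same Catalyst optimizer...
    viaSql.explain()
    viaApi.explain()

    // ...and execute as RDD operations underneath
    println(viaSql.rdd.getNumPartitions)

    spark.stop()
  }
}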

7. The sword's edge comes from grinding; the plum blossom's fragrance comes from bitter cold.

Tags: Spark SQL Apache hive

Posted on Fri, 21 Feb 2020 04:49:14 -0500 by alohatofu