Spark RDD creation APIs: MySQL and HBase

Generally speaking, each Spark application contains a Driver that runs the user's main method and performs various parallel operations on the cluster.
Spark's main abstraction is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
An RDD is created by starting from a file in the Hadoop file system (or any other Hadoop-supported file system) or from an existing Scala collection in the Driver program, and then transforming it with RDD operators. You can also ask Spark to persist an RDD in memory, so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
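As a small illustration of the persistence point above, here is a minimal sketch (the master URL is an assumption and the input path is reused from the later example, purely for illustration) that caches an RDD so that two actions can reuse it without re-reading the file:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheSketch {
    def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("RddCacheSketch"))
        //Illustrative input path; replace with a file that exists in your environment
        val words = sc.textFile("hdfs:///demo/words").flatMap(_.split("\\s+"))
        //Persist in memory so both actions below reuse the RDD instead of re-reading the file
        words.persist(StorageLevel.MEMORY_ONLY)
        println("total words: " + words.count())
        println("distinct words: " + words.distinct().count())
        sc.stop()
    }
}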

Development environment

  • Import the Maven dependencies
<!--Spark RDD dependency-->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
<!--HDFS integration-->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.2</version>
</dependency>
  • Scala compiler plug-in
<!--Scala compiler plug-in-->
<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>4.0.1</version>
    <executions>
        <execution>
            <id>scala-compile-first</id>
            <phase>process-resources</phase>
            <goals>
                <goal>add-source</goal>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
  • Fat jar packaging plug-in
<!--fat jar packaging plug-in-->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>
  • JDK compiler version plug-in (optional)
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>3.2</version>
    <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>UTF-8</encoding>
    </configuration>
    <executions>
        <execution>
            <phase>compile</phase>
            <goals>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
  • Write the Driver
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object SparkWordCountApplication1 {
    // Driver
    def main(args: Array[String]): Unit = {
        //1. Create a SparkContext
        val conf = new SparkConf()
            .setMaster("spark://CentOS:7077")
            .setAppName("SparkWordCountApplication")
        val sc = new SparkContext(conf)

        //2. Create the RDD
        val linesRDD: RDD[String] = sc.textFile("hdfs:///demo/words")

        //3. RDD -> RDD transformations (lazy, run in parallel)
        var resultRDD: RDD[(String, Int)] = linesRDD.flatMap(line => line.split("\\s+"))
                                                    .map(word => (word, 1))
                                                    .reduceByKey((v1, v2) => v1 + v2)

        //4. RDD -> Unit or a local collection (Array|List); an action triggers job execution
        val resultArray: Array[(String, Int)] = resultRDD.collect()

        //Plain Scala collection operation on the Driver, no longer on Spark
        resultArray.foreach(t => println(t._1 + "->" + t._2))

        //5. Close the SparkContext
        sc.stop()
    }
}
  • Run mvn package to build the fat jar and upload it to CentOS
  • Submit the job with spark-submit
[root@CentOS spark-2.4.5]# ./bin/spark-submit \
                            --master spark://CentOS:7077 \
                            --deploy-mode client \
                            --class com.baizhi.quickstart.SparkWordCountApplication1 \
                            --name wordcount \
                            --total-executor-cores 6 /root/spark-rdd-1.0-SNAPSHOT.jar

Spark also provides a local test mode

object SparkWordCountApplication2 {
    // Driver
    def main(args: Array[String]): Unit = {
        //1. Create a SparkContext
        val conf = new SparkConf()
                        .setMaster("local[6]")
                        .setAppName("SparkWordCountApplication")
        val sc = new SparkContext(conf)

        //Turn down log output
        sc.setLogLevel("ERROR")

        //2. Create the RDD
        val linesRDD: RDD[String] = sc.textFile("hdfs://CentOS:9000/demo/words")

        //3. RDD -> RDD transformations (lazy, run in parallel)
        var resultRDD: RDD[(String, Int)] = linesRDD.flatMap(line => line.split("\\s+"))
                                                    .map(word => (word, 1))
                                                    .reduceByKey((v1, v2) => v1 + v2)

        //4. RDD -> Unit or a local collection (Array|List); an action triggers job execution
        val resultArray: Array[(String, Int)] = resultRDD.collect()

        //Plain Scala collection operation on the Driver, no longer on Spark
        resultArray.foreach(t => println(t._1 + "->" + t._2))

        //5. Close the SparkContext
        sc.stop()
    }
}

A log4j.properties file needs to be added under the resources directory:

log4j.rootLogger = FATAL,stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = %p %d{yyyy-MM-dd HH:mm:ss} %c %m%n

Parallelized Collections

Parallelized collections are created by calling SparkContext's parallelize or makeRDD method on an existing collection (a Scala Seq) in the Driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

A parallelized collection accepts a partition parameter that sets the degree of parallelism; Spark runs one task for each partition. If the number of partitions is not specified, sc chooses it automatically based on the resources allocated to the application. For example:

[root@CentOS spark-2.4.5]# ./bin/spark-shell --master spark://CentOS:7077 --total-executor-cores 6
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app-20200208013551-0006).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

Because six executor cores were allocated, the collection is automatically split into 6 partitions when it is parallelized. You can also specify the number of partitions manually:

scala> val distData = sc.parallelize(data,10)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
scala> distData.getNumPartitions
res1: Int = 10
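The paragraph above also mentions makeRDD. For a Seq it behaves like parallelize and likewise accepts an optional partition count; a minimal sketch in spark-shell (the values and partition count are only illustrative):

//makeRDD is equivalent to parallelize for a Seq; the second argument is the partition count
val madeRDD = sc.makeRDD(Seq(1, 2, 3, 4, 5), 3)
madeRDD.getNumPartitions   //expected to return 3
madeRDD.reduce(_ + _)      //expected to return 15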

External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, HBase, Amazon S3, RDBMS, etc.

  • Local file system
      scala> sc.textFile("file:///root/t_word").collect
      res6: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)
  • Read HDFS
    textFile
    textFile converts the file into an RDD[String]; each line of the file becomes one element of the RDD.
      scala> sc.textFile("hdfs:///demo/words/t_word").collect
      res7: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)

textFile also accepts an optional second parameter that controls the number of partitions. Note that the result cannot have fewer partitions than the file has blocks, so in most cases the parameter can simply be omitted.
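For example, a minimal sketch of passing the partition hint (the value 6 is only an illustrative assumption):

//The second argument is a minimum partition count; the actual number will not be lower than the file's block count
val linesRDD = sc.textFile("hdfs:///demo/words/t_word", 6)
linesRDD.getNumPartitions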

wholeTextFiles

wholeTextFiles converts files into an RDD[(String, String)], where each tuple in the RDD represents one file: the first element is the file name and the second is the file content.

scala> sc.wholeTextFiles("hdfs:///demo/words",1).collect
res26: Array[(String, String)] =
Array((hdfs://CentOS:9000/demo/words/t_word,"this is a demo
hello spark
good good study
day day up
come on baby
"))
scala> sc.wholeTextFiles("hdfs:///demo/words",1).map(t=>t._2).flatMap(context=>context.split("\n")).collect
res25: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)

newAPIHadoopRDD

MySQL

<!--MySQL dependency-->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
import java.sql.{Date, PreparedStatement, ResultSet}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.db.{DBConfiguration, DBInputFormat, DBWritable}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkNewHadoopAPIMySQL {
    // Driver
    def main(args: Array[String]): Unit = {
        //1. Create a SparkContext
        val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("SparkWordCountApplication")
        val sc = new SparkContext(conf)

        val hadoopConfig = new Configuration()
        //Configure the database connection parameters
        DBConfiguration.configureDB(hadoopConfig,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost:3306/test",
            "root",
            "root"
        )
        //Set the query-related properties
        hadoopConfig.set(DBConfiguration.INPUT_QUERY, "select id,name,password,birthDay from t_user")
        hadoopConfig.set(DBConfiguration.INPUT_COUNT_QUERY, "select count(id) from t_user")
        hadoopConfig.set(DBConfiguration.INPUT_CLASS_PROPERTY, "com.baizhi.createrdd.UserDBWritable")

        //Read the external data source through the InputFormat provided by Hadoop
        val jdbcRDD: RDD[(LongWritable, UserDBWritable)] = sc.newAPIHadoopRDD(
            hadoopConfig,                           //Hadoop configuration
            classOf[DBInputFormat[UserDBWritable]], //Input format class
            classOf[LongWritable],                  //Key type read by the Mapper
            classOf[UserDBWritable]                 //Value type read by the Mapper
        )
        jdbcRDD.map(t => (t._2.id, t._2.name, t._2.password, t._2.birthDay))
               .collect() //Action operator that brings the remote data back to the Driver; generally used only on small test data sets
               .foreach(t => println(t))
        //jdbcRDD.foreach(t => println(t))           //action operator executed remotely on the executors, works fine
        //jdbcRDD.collect().foreach(t => println(t)) //fails, because neither UserDBWritable nor LongWritable is serializable

        //5. Close the SparkContext
        sc.stop()
    }
}
class UserDBWritable extends DBWritable {
    var id: Int = _
    var name: String = _
    var password: String = _
    var birthDay: Date = _
    //Only needed by DBOutputFormat when writing; since we only read, it can be left empty
    override def write(preparedStatement: PreparedStatement): Unit = {}
    //Called by DBInputFormat when reading: copy the values from the result set into the member fields
    override def readFields(resultSet: ResultSet): Unit = {
        id = resultSet.getInt("id")
        name = resultSet.getString("name")
        password = resultSet.getString("password")
        birthDay = resultSet.getDate("birthDay")
    }
}
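As the comments in the example above point out, LongWritable and UserDBWritable are not serializable, so collecting them directly fails. A minimal sketch of one alternative, assuming a hypothetical case class User defined alongside UserDBWritable, is to map each row into that plain case class right after reading, so later operators only handle serializable values:

//Hypothetical serializable value class for rows of t_user
case class User(id: Int, name: String, password: String, birthDay: java.sql.Date)

//Placed inside main, after jdbcRDD has been created:
val userRDD: RDD[User] = jdbcRDD.map { case (_, w) => User(w.id, w.name, w.password, w.birthDay) }
userRDD.collect().foreach(println)   //safe: case class instances are serializable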

HBase

<!--HBase dependencies; note the order-->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.4</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.2.4</version>
</dependency>
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object SparkNewHadoopAPIHbase {
    // Driver
    def main(args: Array[String]): Unit = {
        //1. Create a SparkContext
        val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("SparkWordCountApplication")
        val sc = new SparkContext(conf)

        val hadoopConf = new Configuration()
        hadoopConf.set(HConstants.ZOOKEEPER_QUORUM, "CentOS")   //HBase connection parameter (ZooKeeper quorum)
        hadoopConf.set(TableInputFormat.INPUT_TABLE, "baizhi:t_user")
        val scan = new Scan()                                   //Build the query
        val pro = ProtobufUtil.toScan(scan)
        hadoopConf.set(TableInputFormat.SCAN, Base64.encodeBytes(pro.toByteArray))

        val hbaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(
            hadoopConf,                      //Hadoop configuration
            classOf[TableInputFormat],       //Input format
            classOf[ImmutableBytesWritable], //Mapper key type
            classOf[Result]                  //Mapper value type
        )
        hbaseRDD.map(t => {
            val rowKey = Bytes.toString(t._1.get())
            val result = t._2
            val name = Bytes.toString(result.getValue("cf1".getBytes(), "name".getBytes()))
            (rowKey, name)
        }).foreach(t => println(t))

        //5. Close the SparkContext
        sc.stop()
    }
}
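The Scan built in the example above fetches entire rows. As a minimal sketch, the scan could be narrowed before it is serialized into the configuration (the cf1:name column comes from the example; the row-key bounds below are purely hypothetical):

val scan = new Scan()
//Only fetch the cf1:name column instead of whole rows
scan.addColumn("cf1".getBytes(), "name".getBytes())
//Optionally restrict the row-key range (hypothetical keys)
scan.setStartRow("001".getBytes())
scan.setStopRow("100".getBytes())
hadoopConf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))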