1. Overview of Spark SQL
1.1 What is Spark SQL?
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine, similar to what Hive does.
1.2 Features of Spark SQL
1. Easy to integrate: Spark SQL is already integrated when you install Spark. N ...
Posted on Sat, 16 Nov 2019 01:24:56 -0500 by HokieTracks
1. Overview of Spark Streaming
1.1 Common real-time computing engines
Real-time computing engines, also called streaming computing engines, currently come in three commonly used varieties:
1. Apache Storm: true streaming.
2. Spark Streaming: strictly speaking, not true streaming (real-time computing); it processes continuous streaming data as discrete RD ...
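The "discrete" point above is the key design difference: Spark Streaming slices a continuous stream into small time-bucketed micro-batches (a DStream is a sequence of RDDs). A conceptual, Spark-free sketch of that micro-batching, with made-up event timestamps:

```python
# Conceptual sketch in plain Python (no Spark): a continuous stream is
# chopped into micro-batches by time window, and each batch is then
# processed as an ordinary (batch) dataset.

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time buckets."""
    batches = {}
    for ts, value in events:
        bucket = ts // batch_interval  # which micro-batch this event falls into
        batches.setdefault(bucket, []).append(value)
    return [batches[b] for b in sorted(batches)]

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
print(micro_batches(events, batch_interval=2))  # [['a', 'b'], ['c', 'd'], ['e']]
```

This is why Spark Streaming's latency is bounded below by the batch interval, whereas a per-event engine like Storm can react to each record as it arrives.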
Posted on Sat, 16 Nov 2019 01:20:27 -0500 by Snart
Spark machine learning: decision trees
-------- For personal study notes and R/Python code organization only
This project uses decision trees in a Spark environment, and uses the ml functions of R and Python's sklearn package for regression.
ml is not convenient for draw ...
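For the sklearn side the post mentions, a minimal decision-tree example looks like the following; the iris dataset and the hyperparameters are illustrative choices, not taken from the post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset and hold out 30% for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A shallow tree: max_depth limits overfitting and keeps the tree drawable.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

sklearn also provides `sklearn.tree.plot_tree(clf)` for visualizing the fitted tree, which is one reason the post reaches for it alongside Spark's ml.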
Posted on Tue, 12 Nov 2019 15:04:59 -0500 by InternetX
Spark Task Scheduling
Several key components of Spark
Concepts and characteristics of RDD
Two ways of generating RDDs
Two types of RDD operators
DAG: Directed Acyclic Graph
Custom Partiti ...
Posted on Mon, 11 Nov 2019 22:46:13 -0500 by 2oMst
Spark SQL supports two ways to convert an existing RDD to a DataFrame. The first uses reflection to infer the RDD's schema, create a Dataset, and then convert it to a DataFrame. This reflection-based approach is simple, but works only if you already know the RDD's schema when you write your Spark application. The second approach is to use the Struc ...
Posted on Sun, 03 Nov 2019 04:10:18 -0500 by truCido
Building on the previous article, flume+springboot+kafka integration, this article also integrates Spark Streaming. As Kafka's consumer, Spark Streaming receives Kafka's data and performs real-time computation over the error and warning entries in the logs.
(1) The environment is the same as in the previous article; only one Spark Streaming ...
Posted on Tue, 29 Oct 2019 17:32:40 -0400 by zushiba
This is my first blog post. During this period of work I learn something new almost every day. As an ordinary person with an ordinary memory, I want to record the technical points I encounter in my daily work by blogging, and share them when I am ...
Posted on Tue, 29 Oct 2019 12:59:38 -0400 by fahrvergnuugen
First, we go to the Maven repository to find the BoneCP dependency:
Add to pom.xml
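A pom.xml fragment of the kind the post means; the version shown is a published BoneCP release, but it should be checked against the Maven repository rather than taken as the one the author used:

```xml
<dependency>
    <groupId>com.jolbox</groupId>
    <artifactId>bonecp</artifactId>
    <!-- Version is illustrative; pick a current one from the Maven repository -->
    <version>0.8.0.RELEASE</version>
</dependency>
```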
My Kafka version is 0.10 and my Spark version is 2.4.2. Because this is experimental, I don't package it for Linux; I run it directly in IDEA.
First, I set up a new table, kafka_test_tbl, in the G6 database ...
Posted on Mon, 28 Oct 2019 14:27:34 -0400 by Mr P!nk
Spark reads data from a database through JDBC. If the data is large, the read must be partitioned, otherwise it runs slowly. The number of partitions can be seen in the web UI; it equals the number of tasks. If some tasks finish quickly and others slowly after pa ...
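Spark splits a partitioned JDBC read into `numPartitions` range predicates over a `partitionColumn` between `lowerBound` and `upperBound`, and each predicate becomes one task. A plain-Python sketch approximating that stride logic (the column name and bounds are made up; Spark's actual clause generation differs in minor details):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses a partitioned JDBC read generates."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also sweeps up NULLs and anything below lowerBound.
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended above, so nothing is lost.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

for p in jdbc_partition_predicates("id", 0, 100, 4):
    print(p)
```

The corresponding read would be something like `spark.read.jdbc(url, table, column="id", lowerBound=0, upperBound=100, numPartitions=4, properties=props)` (arguments illustrative). Note the bounds only shape the partition ranges; they do not filter rows, and a skewed `partitionColumn` produces exactly the fast-task/slow-task imbalance the excerpt describes.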
Posted on Tue, 22 Oct 2019 15:04:42 -0400 by Sven70
Big data day 33 - Spark Java operator practice
import org.a ...
Posted on Sun, 20 Oct 2019 14:42:09 -0400 by TCovert