Generally speaking, each Spark application contains a driver, which runs the user's main method and performs various parallel operations on the cluster.
Spark's main abstraction is the Resilient Distributed Dataset (RDD), a collection of elements partitioned across the nodes of the cluster
that can be operated ...
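The driver-plus-RDD model described above can be sketched in a few lines of Scala. This is a minimal local-mode sketch, assuming `spark-core` is on the classpath; `local[*]` runs executors as local threads, so no cluster is needed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver (this program) runs main-method logic and coordinates
// parallel operations; local[*] uses local threads as executors.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// parallelize() splits a local collection into a partitioned RDD,
// the "collection of elements partitioned across the cluster".
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// A parallel operation: map and reduce run per partition,
// then the partial results are combined on the driver.
val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)
println(s"sum of squares = $sumOfSquares")

sc.stop()
```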
Posted on Sun, 23 Feb 2020 06:15:35 -0500 by ramesh_iridium
The example in this article comes from the official Delta Lake course. The official tutorial is based on the commercial Databricks Community Edition; although the features used in the tutorial are all available in the open-source Delta Lake release, considering the domestic network environment, the threshold for register ...
Posted on Sun, 23 Feb 2020 00:35:26 -0500 by GroundZeroStudios
1. Spark SQL overview
2. The relationship and differences among RDD, DataFrame, and Dataset in Spark
3. Overview of DataFrame
3.1. [official api](http://spark.apache.org/docs/2.4.1/sql-getting-started.html)
3.2. Infer Schema by reflection
3.3. Specify Schema directly through StructType
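Item 3.3 above can be sketched as follows. This is an illustrative example, not the post's own code: the column names and sample rows are invented for the sketch, and `spark-sql` is assumed to be on the classpath:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("schema-sketch").master("local[*]").getOrCreate()

// Specify the schema directly via StructType instead of
// inferring it by reflection from a case class.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// Rows must match the declared schema column-for-column.
val rows = spark.sparkContext.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
val df = spark.createDataFrame(rows, schema)
df.printSchema()

val fieldNames = df.schema.fieldNames
val rowCount = df.count()
spark.stop()
```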
Posted on Fri, 21 Feb 2020 04:49:14 -0500 by alohatofu
Recently we needed to build a visual presentation project for base stations. We first tried a three-dimensional visualization approach, but the customer's feedback was that their client machines are relatively weak, so performance and efficiency would suffer, and some of them are even virtual machines. So we first made a 3D demo with ...
Posted on Mon, 17 Feb 2020 22:19:44 -0500 by sara_kovai
1, Theoretical basis
2, Code test wordCount
2. Test data
3. Results display
1, Theoretical basis
1. In stream computing there is often a need for stateful computation; that is, the current result depends not only on the currently received data but also needs to merge the p ...
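The state-merge idea above is what Spark Streaming's `updateStateByKey` implements for a stateful word count. A minimal sketch of the merge function, with its semantics simulated locally over two batches (the wiring into a `StreamingContext` is shown only as a comment):

```scala
// The state-merge function passed to updateStateByKey:
// the new values for a key in the current batch, plus the
// previous running state, produce the new state.
def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

// In a StreamingContext it would be wired up roughly as:
//   val stateDstream = wordDstream.updateStateByKey[Int](updateFunc _)

// Semantics, simulated locally: batch 1 sees a word twice,
// batch 2 sees it once more; state carries across batches.
val afterBatch1 = updateFunc(Seq(1, 1), None)     // Some(2)
val afterBatch2 = updateFunc(Seq(1), afterBatch1) // Some(3)
```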
Posted on Sun, 16 Feb 2020 01:04:55 -0500 by godwisam
1, Dependent package configuration
The related dependencies for Scala and Spark. The version suffix after the underscore in the Spark artifact name must match the first two digits of the Scala version, that is, 2.11.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http: ...
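The version-matching rule above can be illustrated with the dependency coordinates; the version numbers here are examples consistent with the 2.11 note and the Spark 2.4.1 docs linked earlier, not the post's exact pom:

```xml
<!-- Scala 2.11.x: the _2.11 suffix on Spark artifact IDs must match -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.1</version>
</dependency>
```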
Posted on Thu, 30 Jan 2020 10:39:11 -0500 by max_power
When the DAGScheduler splits a job into stages, the job is divided into ShuffleMapStages according to its internal shuffle dependencies. When the resulting ResultStage is submitted, it iterates through its parent stages, adding itself to the DAGScheduler's waiting set and executing the child s ...
Posted on Fri, 24 Jan 2020 21:03:30 -0500 by douceur
The Resilient Distributed Dataset (RDD) is not only an immutable collection of JVM objects but also the core of Apache Spark. The dataset is partitioned by key and distributed to executor nodes, which enables such datasets to be operated on at high speed. In addition, RDD applies tracin ...
Posted on Wed, 22 Jan 2020 00:59:17 -0500 by gassaz
Spark is a lightning-fast unified analytics engine (computing framework) for large-scale data processing. For batch computation, Spark's performance is roughly 10-100 times that of Hadoop MapReduce, because Spark uses advanced DAG-based task scheduli ...
Posted on Sun, 19 Jan 2020 02:17:34 -0500 by varsha
Spark Core of big data technology spark (3)
reduce(func): aggregates all elements of the RDD using the func function, first aggregating the data within each partition and then aggregating the data across partitions.
collect(): returns all elements of the dataset to the driver as an array.
count(): functio ...
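The three actions listed above can be sketched in local mode. A minimal example with invented sample data, assuming `spark-core` is on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[2]"))
val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5), numSlices = 2)

// reduce: aggregates within each partition first, then across partitions.
val total = rdd.reduce(_ + _)

// collect: returns every element to the driver as an array.
val all = rdd.collect()

// count: the number of elements in the RDD.
val n = rdd.count()

println(s"total=$total, n=$n, all=${all.mkString(",")}")
sc.stop()
```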
Posted on Sat, 18 Jan 2020 00:39:06 -0500 by poltort