Spark MLlib algorithm operations - Part 2

0. Spark MLlib basic statistics

  • Correlation
  • Hypothesis test
  • Summarizer

1. Correlation

Calculating the correlation between two series of data is a common operation in statistics. spark.ml provides the flexibility to calculate pairwise correlations among many series. At present, the Pearson and Spearman methods are supported.

Correlation computes the correlation matrix for the input vector data set using the specified method. The output will be a DataFrame containing the correlation matrix of the vector columns.

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// Assumes a SparkSession named spark is in scope (as in spark-shell)
import spark.implicits._

val data = Seq(
  // Sparse vector
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  // Dense vector
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  // Dense vector
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  // Sparse vector
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

// Convert the data to a DataFrame with a single vector column
val df = data.map(Tuple1.apply).toDF("features")

// Calculate the correlation - defaults to the Pearson coefficient
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

// Calculate the correlation - use the Spearman coefficient
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")
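
To make the difference between the two methods concrete, here is a minimal plain-Scala sketch (no Spark required) that computes both coefficients for a single pair of series. The object and function names are made up for this illustration and are not part of the Spark API. Spearman is computed here as the Pearson coefficient of the average ranks, which is the textbook definition. That Spark arrives at the same numbers internally is an assumption.

```scala
object CorrelationSketch {
  // Pearson coefficient of two equally long, non-constant series
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "series must be non-empty and of equal length")
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // 1-based ranks; tied values share the average of their positions
  def ranks(v: Seq[Double]): Seq[Double] = {
    val sorted = v.zipWithIndex.sortBy(_._1).toVector
    val out = Array.ofDim[Double](v.length)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j) / 2.0 + 1.0
      (i to j).foreach(k => out(sorted(k)._2) = avg)
      i = j + 1
    }
    out.toSeq
  }

  // Spearman coefficient: Pearson applied to the ranks
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))
}
```

For example, calling CorrelationSketch.pearson and CorrelationSketch.spearman on Seq(1.0, 4.0, 6.0, 9.0) and Seq(-2.0, 3.0, 8.0, 1.0), which are the first and fourth components of the four vectors above, gives the value for one off-diagonal entry of each matrix.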

2. Hypothesis test

Hypothesis testing is a powerful statistical tool for determining whether a result is statistically significant or whether it occurred by chance. spark.ml currently supports Pearson's chi-square (χ²) test for independence.

ChiSquareTest conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, for which the chi-square statistic is calculated. All label and feature values must be categorical.

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
// Assumes a SparkSession named spark is in scope (as in spark-shell)
import spark.implicits._

// Prepare the data as (label, features) pairs
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

// Convert the data to a DataFrame
val df = data.toDF("label", "features")
// Run the hypothesis test for every feature against the label
val chi = ChiSquareTest.test(df, "features", "label").head

// Print the test results
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics ${chi.getAs[Vector](2)}")
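
The contingency-matrix step described above can be sketched in plain Scala without Spark. The names below are hypothetical and only illustrate the textbook formula: sum over all cells of (observed - expected)² / expected, where expected = row total × column total / n. The p-value is left out because it needs the chi-square distribution function, which the Scala standard library does not provide.

```scala
object ChiSquareSketch {
  // Returns (statistic, degreesOfFreedom) for Pearson's independence test
  // on a sequence of categorical (feature, label) pairs
  def chiSquare(pairs: Seq[(Double, Double)]): (Double, Int) = {
    require(pairs.nonEmpty, "at least one (feature, label) pair is required")
    val n = pairs.length.toDouble
    // Marginal counts per feature value and per label value
    val featureCounts = pairs.groupBy(_._1).map { case (f, ps) => f -> ps.length.toDouble }
    val labelCounts = pairs.groupBy(_._2).map { case (l, ps) => l -> ps.length.toDouble }
    // Observed counts per cell of the contingency matrix
    val observed = pairs.groupBy(identity).map { case (k, ps) => k -> ps.length.toDouble }

    val statistic = featureCounts.toSeq.flatMap { case (f, fc) =>
      labelCounts.toSeq.map { case (l, lc) =>
        val expected = fc * lc / n
        val obs = observed.getOrElse((f, l), 0.0)
        (obs - expected) * (obs - expected) / expected
      }
    }.sum

    val dof = (featureCounts.size - 1) * (labelCounts.size - 1)
    (statistic, dof)
  }
}
```

Passing the first feature of the sample data together with its labels, Seq((0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (3.5, 0.0), (3.5, 0.0), (3.5, 1.0)), yields a statistic of 0.75 with 2 degrees of freedom when worked through by hand.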

3. Summarizer

Summarizer provides summary statistics for the vector columns of a DataFrame. The available metrics are the column-wise maximum, minimum, mean, variance, and number of nonzeros, as well as the total count.

The following example shows how to use Summarizer to calculate the mean and variance of a vector column of the input DataFrame, with and without a weight column.

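import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
// Assumes a SparkSession named spark is in scope (as in spark-shell)
import spark.implicits._
import Summarizer._

// Load the data as (features, weight) pairs
val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)

// Convert the data to a DataFrame
val df = data.toDF("features", "weight")

// Mean and variance using the weight column
val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
  .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()

println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")

// Mean and variance without weights
val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
  .as[(Vector, Vector)].first()

println(s"without weight: mean = ${meanVal2}, variance = ${varianceVal2}")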
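
As a rough illustration of what a weight column changes, the plain-Scala sketch below computes a column-wise weighted mean and variance by hand. The names are hypothetical. The variance uses one common unbiased estimator, with denominator Σw − Σw² / Σw, which reduces to n − 1 when all weights are 1. Whether Summarizer uses exactly this estimator is an assumption, so treat the sketch as an aid to intuition rather than a specification.

```scala
object WeightedSummarySketch {
  // Column-wise weighted mean of (features, weight) rows
  def weightedMean(rows: Seq[(Array[Double], Double)]): Array[Double] = {
    require(rows.nonEmpty, "at least one row is required")
    val wSum = rows.map(_._2).sum
    val dim = rows.head._1.length
    Array.tabulate(dim)(j => rows.map { case (x, w) => w * x(j) }.sum / wSum)
  }

  // Column-wise weighted variance with denominator wSum - wSqSum / wSum
  // (assumed estimator; equals the usual n - 1 form for unit weights)
  def weightedVariance(rows: Seq[(Array[Double], Double)]): Array[Double] = {
    val wSum = rows.map(_._2).sum
    val wSqSum = rows.map { case (_, w) => w * w }.sum
    val mean = weightedMean(rows)
    val denom = wSum - wSqSum / wSum
    Array.tabulate(mean.length) { j =>
      rows.map { case (x, w) => w * (x(j) - mean(j)) * (x(j) - mean(j)) }.sum / denom
    }
  }
}
```

With the two rows from the example, the second row has weight 2.0, so the weighted mean is pulled toward (4.0, 6.0, 7.0). Setting both weights to 1.0 gives the plain sample mean and variance.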

4. Example - feature selection with the chi-square test

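import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
    // Prepare the environment
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
    // Import data as (id, features, label) rows
    val data = Seq(
      (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )
    val df = spark.createDataFrame(data).toDF("id", "features", "label")
    // Select features with the chi-square test, keeping the single top feature
    val chisquare = new ChiSqSelector().setFeaturesCol("features").setLabelCol("label").setNumTopFeatures(1)
    // The original listing ends here; fitting and transforming as below is the
    // usual way to apply the selector and is an assumption about the omitted lines
    chisquare.fit(df).transform(df).show()
    spark.stop()
}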
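
The idea behind keeping a fixed number of top features can be sketched in plain Scala: rank the features by the p-value of their chi-square test and keep the k smallest. The names below are hypothetical, and the p-values in the usage note are made-up illustration values, not output from Spark. Ordering by ascending p-value is an assumption about how the selector ranks features.

```scala
object TopFeaturesSketch {
  // Indices of the k features with the smallest p-values,
  // returned in ascending index order; ties go to the lower index
  def topK(pValues: Seq[Double], k: Int): Seq[Int] =
    pValues.zipWithIndex.sortBy(_._1).take(k).map(_._2).sorted

  // Keep only the selected positions of a feature vector
  def project(features: Array[Double], selected: Seq[Int]): Array[Double] =
    selected.map(i => features(i)).toArray
}
```

For instance, with the invented p-values Seq(0.6, 0.4, 0.3, 0.1), topK(pValues, 1) returns Seq(3), and projecting Array(0.0, 0.0, 18.0, 1.0) onto that selection keeps only the last component.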



Posted on Mon, 16 Mar 2020 05:32:04 -0400 by nuklehed