# 0. Spark MLlib Basic Statistics

• Correlation
• Hypothesis testing
• Summarizer

# 1. Correlation

Calculating the correlation between two series of data is a common operation in statistics. spark.ml provides the flexibility to calculate pairwise correlations among many series at once. The supported correlation methods are currently Pearson and Spearman.

Correlation computes the correlation matrix for the input dataset of vectors using the specified method. The output is a DataFrame containing the correlation matrix of the vector column.

```
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("correlation").getOrCreate()
import spark.implicits._

val data = Seq(
  // sparse vector
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  // dense vector
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  // dense vector
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  // sparse vector
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

// convert to a single-column DataFrame
val df = data.map(Tuple1.apply).toDF("features")

// calculate the correlation matrix - defaults to the Pearson coefficient
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

// calculate the correlation matrix - set to the Spearman coefficient
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")
```
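To make concrete what `Correlation.corr` computes for each pair of vector columns, here is a minimal plain-Scala sketch of the Pearson coefficient (independent of Spark; the function name `pearson` is ours, not a Spark API). It assumes two equal-length series, each with nonzero variance.

```scala
// Pearson correlation of two series: covariance divided by the
// product of the standard deviations. Assumes equal-length inputs
// with nonzero variance.
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty, "series must be nonempty and the same length")
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}
```

A perfectly linear relationship gives +1.0 (or -1.0 if decreasing); the Spearman method applies the same formula to the ranks of the values rather than the values themselves.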

# 2. Hypothesis testing

Hypothesis testing is a powerful statistical tool for determining whether a result is statistically significant or could have occurred by chance. spark.ml currently supports Pearson's chi-square (χ²) test for independence.

ChiSquareTest performs Pearson's independence test of every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, for which the chi-square statistic is computed. All label and feature values must be categorical.

```
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("chi-square").getOrCreate()
import spark.implicits._

// prepare the data: (label, features)
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)

// convert to a DataFrame
val df = data.toDF("label", "features")
// run the hypothesis test
val chi = ChiSquareTest.test(df, "features", "label").head

// print the test results
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics ${chi.getAs[Vector](2)}")
```
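Under the hood, the chi-square statistic for one feature comes from its contingency table of (feature value, label) counts: for every cell, compare the observed count with the count expected under independence. A minimal plain-Scala sketch (the function name `chiSquareStatistic` is ours, not a Spark API):

```scala
// Pearson chi-square statistic over a list of (feature value, label)
// pairs: tabulate observed counts per cell, compute the expected count
// from the marginals, then sum (observed - expected)^2 / expected.
def chiSquareStatistic(pairs: Seq[(Double, Double)]): Double = {
  val n = pairs.size.toDouble
  // observed counts per (feature value, label) cell
  val observed = pairs.groupBy(identity).map { case (cell, xs) => cell -> xs.size.toDouble }
  // marginal totals per feature value and per label
  val featTotals  = pairs.groupBy(_._1).map { case (f, xs) => f -> xs.size.toDouble }
  val labelTotals = pairs.groupBy(_._2).map { case (l, xs) => l -> xs.size.toDouble }
  (for {
    (f, fTotal) <- featTotals.toSeq
    (l, lTotal) <- labelTotals.toSeq
  } yield {
    val obs = observed.getOrElse((f, l), 0.0)
    val expCount = fTotal * lTotal / n // expected count under independence
    (obs - expCount) * (obs - expCount) / expCount
  }).sum
}
```

A feature that perfectly determines the label yields a large statistic (and a small p-value), while a feature independent of the label yields a statistic near zero.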

# 3. Summarizer

Summarizer provides vector-column summary statistics for a DataFrame. The available metrics include the column-wise maximum, minimum, mean, variance, number of nonzeros, and total count.

The following example shows how to use Summarizer to compute the mean and variance of a vector column of an input DataFrame, with and without a weight column.

```
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("summarizer").getOrCreate()
import spark.implicits._
import Summarizer._

val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)

// convert to a DataFrame
val df = data.toDF("features", "weight")

// weighted summary
val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
  .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()

println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")

// unweighted summary
val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
  .as[(Vector, Vector)].first()

println(s"without weight: mean = ${meanVal2}, variance = ${varianceVal2}")
```
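To see what the weight column does, here is a plain-Scala sketch of the weighted per-dimension mean (the function name `weightedMean` is ours, not a Spark API; the weighted variance that Summarizer reports involves additional normalization, so only the mean is sketched here). Each vector contributes in proportion to its weight.

```scala
// Weighted per-dimension mean of a list of (vector, weight) rows:
// sum w * v per dimension, then divide by the total weight.
def weightedMean(rows: Seq[(Array[Double], Double)]): Array[Double] = {
  require(rows.nonEmpty, "need at least one row")
  val dim    = rows.head._1.length
  val totalW = rows.map(_._2).sum
  val sums   = Array.fill(dim)(0.0)
  for ((vec, w) <- rows; i <- 0 until dim) sums(i) += vec(i) * w
  sums.map(_ / totalW)
}
```

With the data above, the second row counts twice as much as the first, so the weighted mean is ((2 + 2·4)/3, (3 + 2·6)/3, (5 + 2·7)/3) = (10/3, 5, 19/3), matching the "with weight" output.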

# 4. Example: chi-square feature selection

```
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  // prepare the environment
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
  // prepare the data: (id, features, label)
  val data = Seq(
    (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
    (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
    (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
  )
  val df = spark.createDataFrame(data).toDF("id", "features", "label")
  // select the single most predictive feature by the chi-square test
  val chisquare = new ChiSqSelector().setFeaturesCol("features").setLabelCol("label").setNumTopFeatures(1)
  chisquare.fit(df).transform(df).show(false)
}
```