Details of Spark's five JOIN strategies

JOIN operations are among the most common data processing operations. Spark, as a unified big data processing engine, supports a rich set of JOIN scenarios. This article introduces the five JOIN strategies provided by Spark and covers the following topics:

  • Factors affecting JOIN operation
  • Five strategies implemented by JOIN in Spark
  • How does Spark choose the JOIN strategy

Factors affecting JOIN operation

The size of the dataset
The size of the datasets participating in the join directly affects both which join mechanism Spark selects and how efficiently the join executes.
JOIN conditions
JOIN conditions involve logical comparisons between fields. Based on the condition, joins fall into two categories: equi-joins and non-equi joins. An equi-join involves one or more equality conditions that must all be satisfied, each applied between attributes of the two input datasets. When any other comparison operator is used (that is, the operator is not =), the join is a non-equi join.
Type of JOIN
After the JOIN condition is applied to the records of the input datasets, the JOIN type determines which records appear in the result. The main JOIN types are:

  • Inner join: only records matching the connection conditions are output from the input dataset.
  • Outer join: further divided into left outer join, right outer join and full outer join.
  • Semi join: the right table is only used to filter the data of the left table and does not appear in the result set.
  • Cross join: every row of the left table is combined with every row of the right table; the result is also called the Cartesian product (a DataFrame sketch of these join types follows this list).
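For concreteness, these join types map onto the type string passed to the DataFrame API. The following is a minimal, self-contained sketch; the table and column names are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
import spark.implicits._

val orders    = Seq((1, "o1"), (2, "o2"), (3, "o3")).toDF("customer_id", "order_id")
val customers = Seq((1, "alice"), (2, "bob")).toDF("customer_id", "name")

orders.join(customers, Seq("customer_id"), "inner").show()      // inner join
orders.join(customers, Seq("customer_id"), "left_outer").show() // right_outer / full_outer also available
orders.join(customers, Seq("customer_id"), "left_semi").show()  // semi join: filters orders, returns no customer columns
orders.crossJoin(customers).show()                              // cross join / Cartesian product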

Five strategies implemented by JOIN in Spark

Spark provides five JOIN strategies to perform specific JOIN operations. The five JOIN strategies are as follows:

  • Shuffle Hash Join
  • Broadcast Hash Join
  • Sort Merge Join
  • Cartesian Join
  • Broadcast Nested Loop Join

Shuffle Hash Join

Brief introduction
When the tables to be joined are both large, Shuffle Hash Join can be selected. In this approach the tables are repartitioned by the join key, which guarantees that rows with the same join key are sent to the same partition.

Shuffle Hash Join consists of two basic steps:

  • First, the two tables participating in the join are repartitioned by the join key. This step involves a Shuffle; its purpose is to send rows with the same join key to the same partition so that the join can be performed within each partition.
  • Second, for each pair of shuffled partitions, the partition data of the smaller table is built into a hash table, which is then probed with the partition records of the larger table by join key.

Conditions and characteristics

  • Only equi-joins are supported; the join keys do not need to be sortable
  • All join types except full outer join are supported
  • Building a hash table for the smaller side is a memory-intensive operation; if that side is too large, it may cause OOM
  • Requires the parameter spark.sql.join.preferSortMergeJoin to be set to false (the default is true); a short usage sketch follows this list
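As a rough illustration, the sketch below shows two usual ways to steer the planner toward Shuffle Hash Join: the configuration described above, or the SHUFFLE_HASH hint available since Spark 3.0. It assumes an active SparkSession named spark and two hypothetical DataFrames largeDf and otherDf that share an id column.

// Option 1: configuration -- turn off the sort-merge preference and the broadcast threshold
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // keep broadcast hash join from winning
val joined = largeDf.join(otherDf, Seq("id"))

// Option 2 (Spark 3.0+): an explicit hint on either side
val hinted = largeDf.join(otherDf.hint("SHUFFLE_HASH"), Seq("id"))
hinted.explain() // the physical plan should contain ShuffledHashJoin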

Broadcast Hash Join

Brief introduction
Broadcast Hash Join is also called a map-side JOIN. When one table is small, it is usually the strategy of choice because it avoids the overhead of a Shuffle and thus improves performance. For example, when joining a fact table with a dimension table, the dimension table is usually small enough to be broadcast, which avoids shuffling the data (Shuffle operations in Spark are very expensive) and so improves the efficiency of the JOIN. Before the broadcast, Spark first collects the data from the Executors to the Driver, and the Driver then broadcasts it to the Executors; if the data to be broadcast is large, this can cause OOM on the Driver side.

Broadcast Hash Join mainly includes two stages:

  • Broadcast stage: the small table is broadcast to and cached on every Executor
  • Hash Join stage: the hash join is executed inside each Executor

Conditions and characteristics

  • Only equi-joins are supported; the join keys do not need to be sortable
  • All join types except full outer join are supported
  • Broadcast Hash Join is more efficient than the other JOIN mechanisms, but it is network-intensive (the broadcast data is transmitted redundantly) and the data must also be cached on the Driver side, so a broadcast table that is too large will cause OOM
  • The broadcast table must be smaller than spark.sql.autoBroadcastJoinThreshold, which defaults to 10MB (10485760 bytes)
  • The size of the broadcast table cannot exceed 8GB. The Spark 2.4 source in BroadcastExchangeExec.scala checks this:
longMetric("dataSize") += dataSize
 if (dataSize >= (8L << 30)) {
    throw new SparkException(s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB")
}
  • The driving (base) table cannot be broadcast; for example, in a left outer join only the right table can be broadcast. The hint can be given explicitly, e.g. fact_table.join(broadcast(dimension_table)), or it can be omitted: when the conditions are met, Spark switches to this JOIN mode automatically (see the sketch below).
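A minimal sketch of the explicit hint, using hypothetical fact/dimension DataFrames factDf and dimDf that share a dim_id column:

import org.apache.spark.sql.functions.broadcast

val result = factDf.join(broadcast(dimDf), Seq("dim_id"), "left_outer")
result.explain() // the physical plan should contain BroadcastHashJoin

// The equivalent SQL hint:
// SELECT /*+ BROADCAST(d) */ * FROM fact f LEFT JOIN dim d ON f.dim_id = d.dim_id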

Sort Merge Join

Brief introduction
Sort Merge Join is Spark's default JOIN mechanism, controlled by the parameter spark.sql.join.preferSortMergeJoin, whose default value true means Sort Merge Join is preferred. It is generally used when joining two large tables. Sort Merge Join can reduce data transfer within the cluster: unlike a hash join, it does not load one side entirely into memory, but it does require the data to be sorted by the join keys before the merge.

Sort Merge Join mainly includes three stages:

  • Shuffle phase: the two large tables are repartitioned by the join key
  • Sort phase: the data within each partition is sorted
  • Merge phase: the sorted partitions from the two tables are joined by traversing the elements and connecting the rows with the same join key value

Conditions and characteristics

  • Only equi-joins are supported
  • All join types are supported
  • The join keys must be sortable
  • The parameter spark.sql.join.preferSortMergeJoin (default true) must be set to true; a short usage sketch follows this list
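A minimal sketch, assuming two hypothetical large DataFrames ordersDf and paymentsDf sharing an order_id column, that forces this strategy with the MERGE hint available since Spark 3.0:

val merged = ordersDf.join(paymentsDf.hint("MERGE"), Seq("order_id"))
merged.explain() // the physical plan should contain SortMergeJoin, preceded by Exchange and Sort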

Cartesian Join

Brief introduction
If the two tables participating in a Spark join do not specify a join key (no ON condition), a Cartesian product join is produced. The number of rows in the result is the product of the row counts of the two tables.

Conditions

  • Only inner joins are supported
  • Both equi-joins and non-equi joins are supported
  • Requires the parameter spark.sql.crossJoin.enabled=true (see the sketch below)
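A small sketch, assuming two hypothetical DataFrames dfA and dfB and an active SparkSession spark:

val explicitCross = dfA.crossJoin(dfB)      // explicit cross join, no join key required

spark.conf.set("spark.sql.crossJoin.enabled", "true") // needed on Spark 2.x for implicit Cartesian products
val implicitCross = dfA.join(dfB)           // a join without a key also degenerates into a Cartesian product
implicitCross.explain()                     // the physical plan should contain CartesianProduct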

Broadcast Nested Loop Join

Brief introduction
Broadcast Nested Loop Join is selected as a last resort, when no other suitable join mechanism can be chosen. The overall priority is: Broadcast Hash Join > Sort Merge Join > Shuffle Hash Join > Cartesian Join > Broadcast Nested Loop Join.
Between Cartesian Join and Broadcast Nested Loop Join: for an inner or non-equi join, the Broadcast Nested Loop strategy is preferred when one of the tables can be broadcast; when neither table can be broadcast and the join is an inner join, Cartesian Join is selected.

Conditions and characteristics

  • Both equi-joins and non-equi joins are supported
  • All JOIN types are supported. The main optimizations concern which side is broadcast:
  • For a right outer join the left table is broadcast; for a left outer join the right table is broadcast
  • For an inner join, either the left or the right table may be broadcast (see the sketch after this list)
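As an illustration, a non-equi join with a broadcastable side typically ends up as a Broadcast Nested Loop Join. The sketch assumes hypothetical DataFrames eventsDf (with a ts column) and a small rangesDf (with start and end columns):

val tagged = eventsDf.join(
  rangesDf,
  eventsDf("ts") >= rangesDf("start") && eventsDf("ts") < rangesDf("end"))
tagged.explain() // the physical plan should contain BroadcastNestedLoopJoin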

How does Spark choose the JOIN strategy

Equi-join
If there are join hints, they are followed in the order below (the DataFrame form of these hints is sketched after the list):

  1. Broadcast hint: if the join type supports it, select broadcast hash join
  2. Sort merge hint: if the join keys are sortable, select sort merge join
  3. Shuffle hash hint: if the join type supports it, select shuffle hash join
  4. Shuffle replicate NL hint: if it is an inner join, select the Cartesian product
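For reference, the four hints correspond to the following DataFrame-level hint names in Spark 3.0+ (t1 and t2 are hypothetical DataFrames sharing a column k):

t1.join(t2.hint("BROADCAST"), Seq("k"))            // broadcast hash join
t1.join(t2.hint("MERGE"), Seq("k"))                // sort merge join (alias: SHUFFLE_MERGE)
t1.join(t2.hint("SHUFFLE_HASH"), Seq("k"))         // shuffle hash join
t1.join(t2.hint("SHUFFLE_REPLICATE_NL"), Seq("k")) // Cartesian product (inner joins only)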

If there are no join hints, check the following rules one by one

  1. If the join type supports it and one of the tables can be broadcast (smaller than spark.sql.autoBroadcastJoinThreshold, 10MB by default), select broadcast hash join
  2. If the parameter spark.sql.join.preferSortMergeJoin is set to false and one table is small enough to build a hash map from, select shuffle hash join
  3. If the join keys are sortable, select sort merge join
  4. If it is an inner join, select cartesian join
  5. If OOM may occur or no other strategy applies, finally select broadcast nested loop join (a quick check of the chosen plan is sketched below)
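A quick way to see which of these rules fired is to inspect the physical plan; the sketch below uses hypothetical DataFrames dfLarge and dfSmall sharing a column k:

dfLarge.join(dfSmall, Seq("k")).explain()
// If dfSmall is below spark.sql.autoBroadcastJoinThreshold the plan shows BroadcastHashJoin;
// otherwise, with the default preferSortMergeJoin=true, it shows SortMergeJoin.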

Non-equi join

If there are join hints, they are followed in the order below:

  1. Broadcast hint: select broadcast nested loop join
  2. Shuffle replicate NL hint: if it is an inner join, select the Cartesian product

If there are no join hints, check the following rules one by one

  1. If one of the tables is small enough to be broadcast, select broadcast nested loop join
  2. If it is an inner join, select cartesian product join
  3. If OOM may occur or no other strategy applies, finally select broadcast nested loop join (both the hinted and unhinted cases are sketched below)
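A short sketch of both cases on a non-equi join (t1 and t2 are hypothetical DataFrames with columns a and b):

// With a broadcast hint: broadcast nested loop join
t1.join(t2.hint("BROADCAST"), t1("a") < t2("b")).explain()

// Without hints: broadcast nested loop join if one side is small enough to broadcast,
// otherwise a Cartesian product for an inner join
t1.join(t2, t1("a") < t2("b")).explain()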

Source code fragment of the join strategy selection (Spark's JoinSelection):

 object JoinSelection extends Strategy
    with PredicateHelper
    with JoinSelectionHelper {
    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      case j @ ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, nonEquiCond, left, right, hint) =>
        def createBroadcastHashJoin(onlyLookingAtHint: Boolean) = {
          getBroadcastBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
            buildSide =>
              Seq(joins.BroadcastHashJoinExec(
                leftKeys,
                rightKeys,
                joinType,
                buildSide,
                nonEquiCond,
                planLater(left),
                planLater(right)))
          }
        }

        def createShuffleHashJoin(onlyLookingAtHint: Boolean) = {
          getShuffleHashJoinBuildSide(left, right, joinType, hint, onlyLookingAtHint, conf).map {
            buildSide =>
              Seq(joins.ShuffledHashJoinExec(
                leftKeys,
                rightKeys,
                joinType,
                buildSide,
                nonEquiCond,
                planLater(left),
                planLater(right)))
          }
        }

        def createSortMergeJoin() = {
          if (RowOrdering.isOrderable(leftKeys)) {
            Some(Seq(joins.SortMergeJoinExec(
              leftKeys, rightKeys, joinType, nonEquiCond, planLater(left), planLater(right))))
          } else {
            None
          }
        }

        def createCartesianProduct() = {
          if (joinType.isInstanceOf[InnerLike]) {
            Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), j.condition)))
          } else {
            None
          }
        }

        def createJoinWithoutHint() = {
          createBroadcastHashJoin(false)
            .orElse {
              if (!conf.preferSortMergeJoin) {
                createShuffleHashJoin(false)
              } else {
                None
              }
            }
            .orElse(createSortMergeJoin())
            .orElse(createCartesianProduct())
            .getOrElse {
              val buildSide = getSmallerSide(left, right)
              Seq(joins.BroadcastNestedLoopJoinExec(
                planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
            }
        }

        createBroadcastHashJoin(true)
          .orElse { if (hintToSortMergeJoin(hint)) createSortMergeJoin() else None }
          .orElse(createShuffleHashJoin(true))
          .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
          .getOrElse(createJoinWithoutHint())

    
      // If it is not an equi-join, pick the build side and then the strategy.
      case logical.Join(left, right, joinType, condition, hint) =>
        val desiredBuildSide = if (canBuildLeft(joinType) && canBuildRight(joinType)) {
          getSmallerSide(left, right)
        } else {
          // BroadcastNestedLoopJoinExec prefers to broadcast the left side for a right join
          // and the right side for a left join.
          if (canBuildLeft(joinType)) BuildLeft else BuildRight
        }

        def createBroadcastNLJoin(buildLeft: Boolean, buildRight: Boolean) = {
          val maybeBuildSide = if (buildLeft && buildRight) {
            Some(desiredBuildSide)
          } else if (buildLeft) {
            Some(BuildLeft)
          } else if (buildRight) {
            Some(BuildRight)
          } else {
            None
          }

          maybeBuildSide.map { buildSide =>
            Seq(joins.BroadcastNestedLoopJoinExec(
              planLater(left), planLater(right), buildSide, joinType, condition))
          }
        }

        def createCartesianProduct() = {
          if (joinType.isInstanceOf[InnerLike]) {
            Some(Seq(joins.CartesianProductExec(planLater(left), planLater(right), condition)))
          } else {
            None
          }
        }

        def createJoinWithoutHint() = {
          createBroadcastNLJoin(canBroadcastBySize(left, conf), canBroadcastBySize(right, conf))
            .orElse(createCartesianProduct())
            .getOrElse {
              Seq(joins.BroadcastNestedLoopJoinExec(
                planLater(left), planLater(right), desiredBuildSide, joinType, condition))
            }
        }

        createBroadcastNLJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
          .orElse { if (hintToShuffleReplicateNL(hint)) createCartesianProduct() else None }
          .getOrElse(createJoinWithoutHint())
      case _ => Nil
    }
  }

Summary
This article introduced the five JOIN strategies provided by Spark, analyzing the most important ones in more detail. It first reviewed the factors that affect a JOIN operation, then described the five strategies, explaining what each one means and when it is triggered, and finally presented the source code fragment that drives the strategy selection. I hope this article is helpful.
