Spark: Correct use of checkpoint in Spark and its difference from cache

1. Spark performance tuning: use of checkpoint



  Checkpointing means establishing checkpoints, similar to snapshots. In a Spark job the computation DAG can be very long, and the whole DAG has to be computed to get the result. If intermediate data is lost partway through this long computation, Spark recomputes it from the beginning by following the RDD dependencies, which is very expensive.

  We can keep intermediate results in memory or on disk with cache or persist, but this does not guarantee that the data will never be lost. If the memory holding the data fails or the disk breaks, Spark again recomputes from scratch along the RDD lineage. This is why checkpoint exists: it takes the important intermediate data in the DAG and stores it in a highly available place (usually HDFS).


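Set up checkpoint

To use checkpoint, you need a SparkContext with a checkpoint directory set on it. The session is created as in the following code, and the directory is then set according to the environment:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .set("spark.sql.autoBroadcastJoinThreshold", "1048576") // broadcast join threshold: 1 MB
  //.set("spark.sql.autoBroadcastJoinThreshold", "104857600") // broadcast join threshold: 100 MB
  .set("spark.sql.shuffle.partitions", "3")

// The first argument "ide" marks a run started from the IDE.
// The body of this branch is missing in the original; a local master is an assumption.
if (args.length > 0 && args(0).equals("ide")) {
  sparkConf.setMaster("local[*]")
}

val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val sparkContext = spark.sparkContext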

Setup code for different environments

Sometimes you need to debug locally, in which case the checkpoint directory is set to a local Windows or Linux directory. On a cluster it is usually an HDFS path.
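The environment switch can be sketched as below. This is a minimal sketch, not the original author's code: the object name, the helper `checkpointDirFor`, and both paths are hypothetical, and the "ide" argument is reused from the setup code above as the marker for a local run. On Windows the local path would look like `file:///C:/tmp/spark-checkpoint` instead.

```scala
import org.apache.spark.sql.SparkSession

object CheckpointDirSetup {
  // Hypothetical paths, chosen only for illustration; replace them with your own.
  val LocalDir = "file:///tmp/spark-checkpoint"
  val ClusterDir = "hdfs:///tmp/spark-checkpoint"

  // "ide" as the first argument means a local debugging run.
  def isLocalRun(args: Array[String]): Boolean =
    args.nonEmpty && args(0) == "ide"

  // Local directory when debugging, HDFS directory otherwise.
  def checkpointDirFor(args: Array[String]): String =
    if (isLocalRun(args)) LocalDir else ClusterDir

  def main(args: Array[String]): Unit = {
    val builder = SparkSession.builder().appName("checkpoint-dir-setup")
    if (isLocalRun(args)) builder.master("local[*]")
    val spark = builder.getOrCreate()

    // Must be called before checkpoint() is called on any RDD.
    spark.sparkContext.setCheckpointDir(checkpointDirFor(args))
    println(spark.sparkContext.getCheckpointDir)

    spark.stop()
  }
}
```

Spark creates a uniquely named subdirectory under the directory you pass in, so several applications can share the same parent path.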








Use checkpoint

To checkpoint an RDD, call checkpoint() on that RDD. The call only marks the RDD for checkpointing; the data is written to the checkpoint directory when the first action runs a job on it.
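A minimal sketch of such a call, assuming a local run and a hypothetical checkpoint path. The object and method names are made up for illustration; `checkpoint()`, `isCheckpointed` and `getCheckpointFile` are the standard RDD methods.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object BasicCheckpoint {
  // Returns (isCheckpointed before the action, count, isCheckpointed after the action).
  def run(sc: SparkContext): (Boolean, Long, Boolean) = {
    val squares = sc.parallelize(1 to 100).map(x => x * x)

    squares.checkpoint()                // only marks the RDD; nothing is written yet
    val before = squares.isCheckpointed // false: no job has run on this RDD

    val n = squares.count()             // the action runs the job, then the checkpoint is written
    val after = squares.isCheckpointed  // true: the checkpoint files now exist

    println(squares.getCheckpointFile)  // location of the checkpoint files, if any
    (before, n, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("basic-checkpoint").getOrCreate()
    spark.sparkContext.setCheckpointDir("file:///tmp/spark-checkpoint") // hypothetical local path
    println(run(spark.sparkContext))
    spark.stop()
  }
}
```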



Note:

When using checkpoint, it is recommended to call rdd.cache() first. checkpoint() is not an action: it is lazy and only marks the RDD.

When the first action runs, the RDD is computed once for that job, and Spark then launches a separate job that computes it again in order to write the checkpoint files, so the lineage is evaluated twice. For this reason we generally cache first and then checkpoint: the checkpoint job then reads the data from the cache and writes it to HDFS instead of recomputing it.
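The effect can be observed with an accumulator that counts how often the map function runs. This is a sketch under stated assumptions (local mode, a hypothetical checkpoint path, data small enough to stay in the cache); the object and method names are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object CacheThenCheckpoint {
  // Counts how many times the map function runs while an RDD of 100 elements is checkpointed.
  def mapCalls(sc: SparkContext, cacheFirst: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val rdd = sc.parallelize(1 to 100, 4).map { x => calls.add(1); x * 2 }

    if (cacheFirst) rdd.cache() // cache before checkpoint
    rdd.checkpoint()
    rdd.count()                 // runs the job and, right after it, the checkpoint job

    // Once the checkpoint exists, reads come from it and the cache can be released.
    if (cacheFirst) rdd.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-then-checkpoint").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("file:///tmp/spark-checkpoint") // hypothetical local path

    println(mapCalls(sc, cacheFirst = false)) // 200: every element is computed twice
    println(mapCalls(sc, cacheFirst = true))  // 100: the checkpoint job reads the cache
    spark.stop()
  }
}
```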





Difference between Checkpoint and cache

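Checkpoint is different from cache. Checkpoint removes the RDD's dependencies on its parent RDDs (the lineage is truncated) and writes the data to reliable storage, whereas cache only stores the data temporarily in memory or on local disk and keeps the full lineage so that lost partitions can be recomputed.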
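One way to see the difference is to print the lineage with toDebugString after each approach. The sketch below assumes a local run and a hypothetical checkpoint path; the object and method names are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LineageComparison {
  // Returns (lineage of a cached RDD, lineage of a checkpointed RDD).
  def lineages(sc: SparkContext): (String, String) = {
    val cached = sc.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0).cache()
    cached.count() // materializes the cache; the parent RDDs stay in the lineage

    val checkpointed = sc.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0)
    checkpointed.checkpoint()
    checkpointed.count() // writes the checkpoint; the parents are replaced by a checkpoint RDD

    (cached.toDebugString, checkpointed.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lineage-comparison").getOrCreate()
    spark.sparkContext.setCheckpointDir("file:///tmp/spark-checkpoint") // hypothetical local path

    val (cachedLineage, checkpointedLineage) = lineages(spark.sparkContext)
    println("cached:\n" + cachedLineage)
    println("checkpointed:\n" + checkpointedLineage)
    spark.stop()
  }
}
```

The cached RDD still lists its source RDD in the output, while the checkpointed RDD ends at a checkpoint RDD that reads from the checkpoint files.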


Checkpoint implementation of RDD

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }


Checkpoint implementation of DataFrame

  /**
   * Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate
   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
   * plan may grow exponentially. It will be saved to files inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir`.
   *
   * @group basic
   * @since 2.1.0
   */
  def checkpoint(): Dataset[T] = checkpoint(eager = true)
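Unlike the RDD method, the Dataset method returns a new Dataset, so the result has to be assigned and used afterwards. A minimal sketch of the iterative case mentioned in the doc comment, assuming a local run and a hypothetical checkpoint path (the object name, loop, and column are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object DatasetCheckpoint {
  // Applies 20 small transformations and truncates the logical plan every 5 iterations.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    var df = spark.range(0, 1000).toDF("id")
    for (i <- 1 to 20) {
      df = df.withColumn("id", $"id" + 1)
      // checkpoint() is eager by default: the data is written now and
      // the returned DataFrame starts from the checkpointed data.
      if (i % 5 == 0) df = df.checkpoint()
    }
    df.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-checkpoint").getOrCreate()
    spark.sparkContext.setCheckpointDir("file:///tmp/spark-checkpoint") // hypothetical local path
    println(run(spark))
    spark.stop()
  }
}
```

Passing `eager = false` to `checkpoint` defers the write until the first action on the returned Dataset, which matches the lazy behavior of the RDD method.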



Tags: Spark SQL Windows Linux

Posted on Sun, 14 Jun 2020 00:48:31 -0400 by brokeDUstudent