Getting started with Delta Lake with Jupyter Notebook

The examples in this article come from the official Delta Lake tutorial. The official tutorial is based on the commercial Databricks Community Edition; although every feature it uses is also available in the open-source Delta Lake release, registering for and using Databricks Community Edition can be a hurdle depending on your network environment. This article therefore rebuilds the tutorial on top of the open-source Jupyter Notebook.

Prepare an environment with Spark and Jupyter

This article builds the development environment on Linux. The tools involved, such as conda, Jupyter, and PySpark, are also available on Windows and macOS, so in principle the tutorial can be completed on those systems as well.

Assuming Anaconda or Miniconda is already installed, we can use conda to set up the development environment, which makes installing PySpark and Jupyter very convenient:

conda create --name spark
conda activate spark

conda install pyspark 
conda install -c conda-forge jupyterlab

Environment variable settings

After setting a few environment variables (adjust the python3.7 segment of SPARK_HOME to match the Python version in your conda environment), the pyspark command will launch a Jupyter notebook server:

export SPARK_HOME=$HOME/miniconda3/envs/spark/lib/python3.7/site-packages/pyspark

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Start the server (the Delta Lake package is specified as a command-line parameter, and Spark will download the dependency automatically). Note that the Scala suffix of the artifact has to match the Scala build of your Spark installation: delta-core_2.11:0.5.0 pairs with Spark 2.4.x, while a Spark 3.x installation would need a delta-core_2.12 artifact instead:

pyspark --packages io.delta:delta-core_2.11:0.5.0

All of the following code runs inside the notebook.
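
As a quick sanity check, we can confirm that the launcher created the usual SparkSession named spark for the notebook (a minimal sketch; nothing later in the tutorial depends on it):

# Optional sanity check: the pyspark launcher should have exposed a
# SparkSession called `spark`; printing its version confirms it is running.
print(spark.version)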

Download the required parquet file

%%bash
rm -fr /tmp/delta_demo

mkdir -p /tmp/delta_demo/loans/ 

wget -O /tmp/delta_demo/loans/SAISEU19-loan-risks.snappy.parquet https://pages.databricks.com/rs/094-YMS-629/images/SAISEU19-loan-risks.snappy.parquet 
ls -al /tmp/delta_demo/loans/

Batch and stream processing with Delta Lake

Now we come to the main topic: Delta Lake's ability to handle batch and streaming workloads on the same table.

First, we create a Delta Lake table with a batch job. The data comes from the parquet file downloaded above; converting a parquet table into a Delta Lake table is straightforward:

import os
import shutil
from pyspark.sql.functions import * 

delta_path = "/tmp/delta_demo/loans_delta"

# Delete the delta table if it already exists
if os.path.exists(delta_path):
    print("Deleting path " + delta_path)
    shutil.rmtree(delta_path)

# Create a new delta table with the parquet file
spark.read.format("parquet").load("/tmp/delta_demo/loans") \
  .write.format("delta").save(delta_path)
print("Created a Delta table at " + delta_path)

Let's check the table to see whether the data looks correct:

# Create a view on the table called loans_delta
spark.read.format("delta").load(delta_path).createOrReplaceTempView("loans_delta")
print("Defined view 'loans_delta'")

spark.sql("select count(*) from loans_delta").show()

Defined view 'loans_delta'
+--------+
|count(1)|
+--------+
|   14705|
+--------+

Next, we will stream data into this Delta Lake table with Structured Streaming and show Delta Lake's schema enforcement capability (this article omits the part of the original tutorial that streams into a plain parquet table, which illustrates the shortcomings of parquet files, such as the inability to enforce a schema).

import random
from pyspark.sql.functions import *
from pyspark.sql.types import *

def random_checkpoint_dir(): 
    return "/tmp/delta_demo/chkpt/%s" % str(random.randint(0, 10000))

# User-defined function to generate random state

states = ["CA", "TX", "NY", "IA"]

@udf(returnType=StringType())
def random_state():
    return str(random.choice(states))

# Generate a stream of randomly generated loan data and append it to the delta table
def generate_and_append_data_stream_fixed(table_format, table_path):
    
    stream_data = spark.readStream.format("rate").option("rowsPerSecond", 50).load() \
        .withColumn("loan_id", 10000 + col("value")) \
        .withColumn("funded_amnt", (rand() * 5000 + 5000).cast("integer")) \
        .withColumn("paid_amnt", col("funded_amnt") - (rand() * 2000)) \
        .withColumn("addr_state", random_state()) \
        .select("loan_id", "funded_amnt", "paid_amnt", "addr_state")   # *********** FIXED THE SCHEMA OF THE GENERATED DATA *************

    query = stream_data.writeStream \
        .format(table_format) \
        .option("checkpointLocation", random_checkpoint_dir()) \
        .trigger(processingTime="10 seconds") \
        .start(table_path)
    
    return query

Start two streaming jobs:

stream_query_1 = generate_and_append_data_stream_fixed(table_format = "delta", table_path = delta_path)
stream_query_2 = generate_and_append_data_stream_fixed(table_format = "delta", table_path = delta_path)

Thanks to Delta Lake's optimistic concurrency control, multiple streams can write to the same table at the same time while data integrity is preserved.
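
To confirm that both writers are up, we can list the active streaming queries on the session (a small sketch; the query ids will differ on every run):

# List the active streaming queries; with the two writers above we expect two entries.
for q in spark.streams.active:
    print(q.id, q.isActive)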

Querying the row count of the table with a batch query, we can see that new data has been appended:

spark.sql("select count(*) from loans_delta").show()
+--------+
|count(1)|
+--------+
|   17605|
+--------+

Now we stop all the streaming writes before moving on to other Delta Lake features.

# Function to stop all streaming queries 
def stop_all_streams():
    # Stop all the streams
    print("Stopping all streams")
    for s in spark.streams.active:
        s.stop()
    print("Stopped all streams")
    print("Deleting checkpoints")  
    shutil.rmtree("/tmp/delta_demo/chkpt/", True)
    print("Deleted checkpoints")

stop_all_streams()

Schema evolution

Delta Lake supports schema evolution, which means we can add or change table fields. The next batch write appends some new rows that carry an additional "closed" field compared with the existing data. Setting the write option mergeSchema to true lets the Delta Lake table's schema evolve to include it:

cols = ['loan_id', 'funded_amnt', 'paid_amnt', 'addr_state', 'closed']

items = [
  (1111111, 1000, 1000.0, 'TX', True), 
  (2222222, 2000, 0.0, 'CA', False)
]

loan_updates = spark.createDataFrame(items, cols) \
  .withColumn("funded_amnt", col("funded_amnt").cast("int"))
  
loan_updates.write.format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .save(delta_path)

Looking at the table after the insert, the closed column has been added; for the pre-existing rows its value defaults to null.

spark.read.format("delta").load(delta_path).show()
+-------+-----------+---------+----------+------+
|loan_id|funded_amnt|paid_amnt|addr_state|closed|
+-------+-----------+---------+----------+------+
|      0|       1000|   182.22|        CA|  null|
|      1|       1000|   361.19|        WA|  null|
|      2|       1000|   176.26|        TX|  null|
|      3|       1000|   1000.0|        OK|  null|
|      4|       1000|   249.98|        PA|  null|
|      5|       1000|    408.6|        CA|  null|
|      6|       1000|   1000.0|        MD|  null|
|      7|       1000|   168.81|        OH|  null|
|      8|       1000|   193.64|        TX|  null|
|      9|       1000|   218.83|        CT|  null|
|     10|       1000|   322.37|        NJ|  null|
|     11|       1000|   400.61|        NY|  null|
|     12|       1000|   1000.0|        FL|  null|
|     13|       1000|   165.88|        NJ|  null|
|     14|       1000|    190.6|        TX|  null|
|     15|       1000|   1000.0|        OH|  null|
|     16|       1000|   213.72|        MI|  null|
|     17|       1000|   188.89|        MI|  null|
|     18|       1000|   237.41|        CA|  null|
|     19|       1000|   203.85|        CA|  null|
+-------+-----------+---------+----------+------+
only showing top 20 rows

The newly inserted rows have the closed field populated (here filtered to closed = true):

spark.read.format("delta").load(delta_path).filter(col("closed") == True).show()
+-------+-----------+---------+----------+------+
|loan_id|funded_amnt|paid_amnt|addr_state|closed|
+-------+-----------+---------+----------+------+
|1111111|       1000|   1000.0|        TX|  true|
+-------+-----------+---------+----------+------+
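
For contrast, schema enforcement still applies: appending a DataFrame that carries a column the table does not know about, without the mergeSchema option, is rejected. A minimal sketch (the column name unexpected_col is invented purely for illustration):

# Schema enforcement sketch: an unknown column and no mergeSchema option
# should cause Delta Lake to reject the append with an AnalysisException.
bad_df = spark.createDataFrame([(9999999, "oops")], ["loan_id", "unexpected_col"])
try:
    bad_df.write.format("delta").mode("append").save(delta_path)
except Exception as e:
    print("Write rejected by schema enforcement:", type(e).__name__)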

Deleting data from a Delta Lake table

Besides ordinary inserts, Delta Lake also supports update and delete operations that modify existing table content. Deletion is shown below: we want to remove the records whose loans have been fully paid off. The following commands show the process simply and clearly.

First, let's see how many records match the condition:

spark.sql("SELECT COUNT(*) FROM loans_delta WHERE funded_amnt = paid_amnt").show()

+--------+
|count(1)|
+--------+
|    5134|
+--------+

Then, we execute a delete command:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, delta_path)
deltaTable.delete("funded_amnt = paid_amnt")

Finally, querying again after the deletion, we find that all matching records are gone:

spark.sql("SELECT COUNT(*) FROM loans_delta WHERE funded_amnt = paid_amnt").show()
+--------+
|count(1)|
+--------+
|       0|
+--------+
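
Updates go through the same DeltaTable API as the delete above. A minimal sketch, not executed in this demo (the condition and the assignment are only illustrative):

# Update sketch: add 100 to paid_amnt for every loan in California.
deltaTable.update(
    condition = "addr_state = 'CA'",
    set = {"paid_amnt": "paid_amnt + 100"}
)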

Version history and time travel

Delta Lake also keeps a full version history and supports time travel. The history() method shows the modifications made to the table so far, including the DELETE operation we just ran:

deltaTable.history().show()

+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+
|version|          timestamp|userId|userName|       operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+
|     10|2020-02-22 22:14:06|  null|    null|          DELETE|[predicate -> ["(...|null|    null|     null|          9|          null|        false|
|      9|2020-02-22 22:13:57|  null|    null|           WRITE|[mode -> Append, ...|null|    null|     null|          8|          null|         true|
|      8|2020-02-22 22:13:52|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          6|          null|         true|
|      7|2020-02-22 22:13:50|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          6|          null|         true|
|      6|2020-02-22 22:13:42|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          4|          null|         true|
|      5|2020-02-22 22:13:40|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          4|          null|         true|
|      4|2020-02-22 22:13:32|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          2|          null|         true|
|      3|2020-02-22 22:13:30|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          2|          null|         true|
|      2|2020-02-22 22:13:22|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          1|          null|         true|
|      1|2020-02-22 22:13:20|  null|    null|STREAMING UPDATE|[outputMode -> Ap...|null|    null|     null|          0|          null|         true|
|      0|2020-02-22 22:13:18|  null|    null|           WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|          null|         true|
+-------+-------------------+------+--------+----------------+--------------------+----+--------+---------+-----------+--------------+-------------+

If we want to see the state of the table just before the deletion, we can easily read the previous snapshot version and query it again (the deleted records reappear):

previousVersion = deltaTable.history(1).select("version").collect()[0][0] - 1

spark.read.format("delta") \
  .option("versionAsOf", previousVersion) \
  .load(delta_path) \
  .createOrReplaceTempView("loans_delta_pre_delete")

spark.sql("SELECT COUNT(*) FROM loans_delta_pre_delete WHERE funded_amnt = paid_amnt").show()

+--------+
|count(1)|
+--------+
|    5134|
+--------+
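
Besides versionAsOf, a snapshot can also be selected by time with the timestampAsOf option. A quick sketch (the timestamp is taken from the history output above and will differ in your run):

# Time-based time travel sketch: count the rows in the snapshot as of a timestamp.
spark.read.format("delta") \
  .option("timestampAsOf", "2020-02-22 22:13:57") \
  .load(delta_path) \
  .count()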

Conclusion

This article walks through the official Delta Lake tutorial using Jupyter Notebook. The complete notebook file can be downloaded at the end of this article.
