1. Introduction
Most modern data lakes are built on some kind of distributed file system (DFS), such as HDFS, or on cloud-based storage, such as AWS S3. One of the basic principles they follow is the "write once, read many" access model for files. This works well for processing massive amounts of data, from hundreds of gigabytes up to terabytes.
However, when building an analytical data lake, it is not uncommon to need to update data. Depending on the scenario, these updates may arrive hourly, or even daily or weekly. You may also need to run analyses on the latest view, a historical view containing all updates, or just the latest incremental view.
Often this results in using separate systems for stream and batch processing, the former handling the incremental data and the latter the historical data.
When working with data stored on HDFS, a common workflow for maintaining incremental updates is the ingest-reconcile-compact-purge strategy.
This is where a framework like Apache Hudi comes in. It manages this workflow for us under the hood, which keeps our core application code much cleaner. Hudi supports queries on the latest data view as well as incremental queries over the changes made since a given point in time.
This article introduces the core concepts of Hudi and how to work with tables in copy-on-write mode.
The source code for this article is available on GitHub.
2. Outline
- Prerequisites and Framework version
- Hudi core concepts
- Initial settings and dependencies
- Using the CoW table
2.1 Prerequisites and framework versions
This article will be easier to follow if you already know how to write Spark jobs in Scala and how to read and write parquet files.
The framework versions used are as follows:
- JDK: openjdk 1.8.0_242
- Scala: 2.12.8
- Spark: 2.4.4
- Hudi Spark bundle: 0.5.2-incubating
Note: at the time of writing, AWS EMR ships with Hudi v0.5.0-incubating. That version has a bug that can cause an upsert operation to hang or take a very long time to complete. You can check the related issue to learn more; it has been fixed in the current version of Hudi (0.5.2-incubating and later). If you plan to run the code on AWS EMR, consider overriding the default integrated version with the latest one.
2.2 Hudi core concepts
Let's start with some core concepts that need to be understood.
1. Table types
Hudi supports two types of tables:
- Copy on Write (CoW): when a CoW table is written to, the ingest-reconcile-compact-purge cycle runs as part of the write. After every write the data in a CoW table is always the latest record, which makes this mode the preferred choice when the newest data needs to be readable as quickly as possible. Data in a CoW table is stored only in columnar file format (parquet). Since every write performs the compaction and rewrite inline, this mode produces the fewest files.
- Merge on Read (MoR): the MoR table is optimized for fast writes. A write creates delta files, which are later compacted to produce the latest data at read time. The compaction can run synchronously or asynchronously. Data is stored in a combination of columnar (parquet) and row-based (avro) file formats.
Here is the trade-off between the two table types, as described in the Hudi documentation:

| Trade-off | CoW | MoR |
|---|---|---|
| Data latency | Higher | Lower |
| Update cost (I/O) | Higher (rewrites the entire parquet file) | Lower (appends to a delta log file) |
| Parquet file size | Smaller (high update (I/O) cost) | Larger (low update cost) |
| Write amplification | Higher | Lower (depends on the compaction strategy) |
2. Query types
Hudi supports two main query types: snapshot queries and incremental queries. In addition to these, MoR tables also support read-optimized queries. A short sketch showing how the query type is selected at read time follows the list below.
- Snapshot queries: for a CoW table, a snapshot query returns the latest view of the data; for a MoR table it returns a near-real-time view. For MoR tables the snapshot query merges the base and delta files on the fly, so some read latency can be expected. With CoW, the write takes care of the merge, so reads are fast and only the base files need to be read.
- Incremental queries: incremental queries let you view the data written after a specific commit time by providing a begin instant time, or the data written between two points in time by providing both a begin and an end instant time.
- Read-optimized queries: for MoR tables, a read-optimized query returns a view containing only the data in the base files, ignoring the delta files.
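The query type is chosen at read time through an option on the Hudi data source. Below is a minimal sketch, assuming the DataSourceReadOptions constants for the snapshot and read-optimized values are available in your Hudi version (the incremental constant is the one used later in this post):

```scala
import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.sql.{DataFrame, SparkSession}

// Minimal sketch: selecting a query type when reading a Hudi table.
// Snapshot is the default, so a plain spark.read.format("hudi") already performs a snapshot query.
// Incremental queries additionally require a begin instant time (shown later in this post).
def readHudi(spark: SparkSession, path: String, queryType: String): DataFrame =
  spark.read
    .format("hudi")
    // e.g. DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL,
    //      QUERY_TYPE_INCREMENTAL_OPT_VAL or QUERY_TYPE_READ_OPTIMIZED_OPT_VAL
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, queryType)
    .load(path)
```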
3. Key properties when writing in Hudi format
- hoodie.datasource.write.table.type: defines the table type; the default is COPY_ON_WRITE. For a MoR table, set this value to MERGE_ON_READ.
- hoodie.table.name: a required field; every table should have a unique name.
- hoodie.datasource.write.recordkey.field: think of this as the primary key of the table. Its value is the name of the DataFrame column that acts as the primary key.
- hoodie.datasource.write.precombine.field: when updating data, if two records have the same primary key, the value in this column decides which record wins. Choosing a column such as a timestamp ensures that the record with the latest timestamp is picked.
- hoodie.datasource.write.operation: defines the type of write operation. Possible values are upsert, insert, bulk_insert and delete; the default is upsert. (A short sketch that puts these properties together on a single write follows this list.)
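To make the list concrete, here is a minimal sketch of a CoW write that sets all five properties using their raw string keys; the function and parameter names (writeCow, recordKeyColumn, precombineColumn, targetPath) are placeholders, and the equivalent DataSourceWriteOptions constants are used in the actual upsert code later in this post:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Minimal sketch: the five write properties described above, set with their raw string keys.
def writeCow(df: DataFrame, tableName: String, recordKeyColumn: String,
             precombineColumn: String, targetPath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")         // or MERGE_ON_READ
    .option("hoodie.table.name", tableName)                                // required, unique per table
    .option("hoodie.datasource.write.recordkey.field", recordKeyColumn)    // acts as the primary key
    .option("hoodie.datasource.write.precombine.field", precombineColumn)  // tie-breaker for duplicate keys
    .option("hoodie.datasource.write.operation", "upsert")                 // upsert | insert | bulk_insert | delete
    .mode(SaveMode.Append)
    .save(targetPath)
}
```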
2.3 Initial settings and dependencies
1. Dependency description
In order to use Hudi in Spark jobs, you need the Spark SQL, Hudi Spark bundle and Spark Avro dependencies. In addition, you need to configure Spark to use the KryoSerializer.
The relevant parts of pom.xml look like this:
```xml
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <encoding>UTF-8</encoding>
  <scala.version>2.12.8</scala.version>
  <scala.compat.version>2.12</scala.compat.version>
  <spec2.version>4.2.0</spec2.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>2.4.4</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-spark-bundle_${scala.compat.version}</artifactId>
    <version>0.5.2-incubating</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_${scala.compat.version}</artifactId>
    <version>2.4.4</version>
  </dependency>
</dependencies>
```
2. Set Schema
We use the following Album class to represent the schema of the table.
```scala
case class Album(albumId: Long, title: String, tracks: Array[String], updateDate: Long)
```
3. Generate test data
Create some data for the upsert operation.
- INITIAL_ALBUM_DATA contains three rows, two of which share the key 801 (with different update dates).
- UPSERT_ALBUM_DATA contains one updated record and two new records.
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The formatter is not shown in the original snippet; the test dates are ISO formatted,
// so ISO_LOCAL_DATE works here.
private val formatter = DateTimeFormatter.ISO_LOCAL_DATE

def dateToLong(dateString: String): Long =
  LocalDate.parse(dateString, formatter).toEpochDay

private val INITIAL_ALBUM_DATA = Seq(
  Album(800, "6 String Theory", Array("Lay it down", "Am I Wrong", "68"), dateToLong("2019-12-01")),
  Album(801, "Hail to the Thief", Array("2+2=5", "Backdrifts"), dateToLong("2019-12-01")),
  Album(801, "Hail to the Thief", Array("2+2=5", "Backdrifts", "Go to sleep"), dateToLong("2019-12-03"))
)

private val UPSERT_ALBUM_DATA = Seq(
  Album(800, "6 String Theory - Special", Array("Jumpin' the blues", "Bluesnote", "Birth of blues"), dateToLong("2020-01-03")),
  Album(802, "Best Of Jazz Blues", Array("Jumpin' the blues", "Bluesnote", "Birth of blues"), dateToLong("2020-01-04")),
  Album(803, "Birth of Cool", Array("Move", "Jeru", "Moon Dreams"), dateToLong("2020-02-03"))
)
```
4. Initialize the SparkSession
Finally, initialize the Spark session. An important point to note here is the use of the KryoSerializer.
```scala
val spark: SparkSession = SparkSession.builder()
  .appName("hudi-datalake")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false") // Uses Hive SerDe, this is mandatory for MoR tables
  .getOrCreate()
```
2.4 Using the CoW table
This section walks through working with records in a CoW table: writing, reading, updating and deleting them.
1. Base path and upsert method
First define a basePath that the upsert method will write the table data to. The method writes the DataFrame in the org.apache.hudi ("hudi") format; make sure all the Hudi properties discussed above are set.
```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.{DataFrame, SaveMode}

val basePath = "/tmp/store"

private def upsert(albumDf: DataFrame, tableName: String, key: String, combineKey: String) = {
  albumDf.write
    .format("hudi")
    .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    // Ignore this property for now; the default is too high when experimenting on your local machine.
    // Set it to a lower value to improve performance.
    // I'll probably cover Hudi tuning in a separate post.
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .mode(SaveMode.Append)
    .save(s"$basePath/$tableName/")
}
```
2. Initial upsert
Insert INITIAL_ALBUM_DATA; this should create two records. For key 801, the record with updateDate 2019-12-03 wins thanks to the precombine field.
```scala
import spark.implicits._ // needed for .toDF() on the Seq of case classes

val tableName = "Album"

upsert(INITIAL_ALBUM_DATA.toDF(), tableName, "albumId", "updateDate")
spark.read.format("hudi").load(s"$basePath/$tableName/*").show()
```
Reading a CoW table is as simple as a regular spark.read with format("hudi").
```
+-------------------+--------------------+------------------+----------------------+--------------------+-------+-----------------+--------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|albumId|            title|              tracks|updateDate|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+-----------------+--------------------+----------+
|     20200412182343|  20200412182343_0_1|               801|               default|65841d0a-0083-447...|    801|Hail to the Thief|[2+2=5, Backdrift...|     18233|
|     20200412182343|  20200412182343_0_2|               800|               default|65841d0a-0083-447...|    800|  6 String Theory|[Lay it down, Am ...|     18231|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+-----------------+--------------------+----------+
```
Another way to verify the write is to look at the Workload profile log output, which looks roughly like this:
Workload profile :WorkloadProfile {globalStat=WorkloadStat {numInserts=2, numUpdates=0}, partitionStat={default=WorkloadStat {numInserts=2, numUpdates=0}}}
3. Update records
```scala
upsert(UPSERT_ALBUM_DATA.toDF(), tableName, "albumId", "updateDate")
```
View the log output of the Workload profile and verify that it meets expectations
Workload profile :WorkloadProfile {globalStat=WorkloadStat {numInserts=2, numUpdates=1}, partitionStat={default=WorkloadStat {numInserts=2, numUpdates=1}}}
The query output is as follows
```scala
spark.read.format("hudi").load(s"$basePath/$tableName/*").show()
```

```
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|albumId|               title|              tracks|updateDate|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
|     20200412183510|  20200412183510_0_1|               801|               default|65841d0a-0083-447...|    801|   Hail to the Thief|[2+2=5, Backdrift...|     18233|
|     20200412184040|  20200412184040_0_1|               800|               default|65841d0a-0083-447...|    800|6 String Theory -...|[Jumpin' the blue...|     18264|
|     20200412184040|  20200412184040_0_2|               802|               default|65841d0a-0083-447...|    802|  Best Of Jazz Blues|[Jumpin' the blue...|     18265|
|     20200412184040|  20200412184040_0_3|               803|               default|65841d0a-0083-447...|    803|       Birth of Cool|[Move, Jeru, Moon...|     18295|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
```
4. Querying records
The way we have been viewing the data above is called a "snapshot query"; it is the default. Hudi also supports "incremental queries".
4.1 Incremental queries
To perform an incremental query, we set the hoodie.datasource.query.type property to incremental and specify the hoodie.datasource.read.begin.instanttime property. The query then reads all records written after the specified instant time. For this example, we use 20200412183510 as the instant time.
```scala
spark.read
  .format("hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20200412183510")
  .load(s"$basePath/$tableName")
  .show()
```
This returns all records committed after instant time 20200412183510. A sketch of a point-in-time query that also bounds the end of the range follows the output below.
```
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|albumId|               title|              tracks|updateDate|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
|     20200412184040|  20200412184040_0_1|               800|               default|65841d0a-0083-447...|    800|6 String Theory -...|[Jumpin' the blue...|     18264|
|     20200412184040|  20200412184040_0_2|               802|               default|65841d0a-0083-447...|    802|  Best Of Jazz Blues|[Jumpin' the blue...|     18265|
|     20200412184040|  20200412184040_0_3|               803|               default|65841d0a-0083-447...|    803|       Birth of Cool|[Move, Jeru, Moon...|     18295|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+--------------------+--------------------+----------+
```
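Incremental queries can also be bounded at both ends. Here is a sketch of such a point-in-time query, reusing the spark, basePath and tableName values defined above and assuming DataSourceReadOptions.END_INSTANTTIME_OPT_KEY (hoodie.datasource.read.end.instanttime) is available in your Hudi version:

```scala
// Sketch: point-in-time query, returning only records committed after the begin instant
// and up to the end instant. The instants below are the commit times from the runs above.
spark.read
  .format("hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20200412182343")
  .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, "20200412183510")
  .load(s"$basePath/$tableName")
  .show()
```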
5. Deleting records
The last operation we will look at is deletes. Deleting is similar to upserting: it requires a DataFrame of the records to be deleted. As the following example code shows, you don't need the whole row, just the primary keys.
```scala
import spark.implicits._

val deleteKeys = Seq(
  Album(803, "", null, 0L),
  Album(802, "", null, 0L)
)

val df = deleteKeys.toDF()

df.write.format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "albumId")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  // Set the option "hoodie.datasource.write.operation" to "delete"
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.DELETE_OPERATION_OPT_VAL)
  .mode(SaveMode.Append) // Only Append mode is supported for delete
  .save(s"$basePath/$tableName/")

spark.read.format("hudi").load(s"$basePath/$tableName/*").show()
```
That's it for this part. In a later post, we will look at how to work with MERGE_ON_READ tables.