How to Query Pulsar Streams Using Apache Flink

In the previous blog, we introduced Apache Pulsar and its differences from other message systems, and explained how to integrate Pulsar and Flink to work together to provide a seamless developer experience for large-scale elastic data processing.

 

This article will introduce the integration and latest research and development progress of Apache Pulsar and Apache Flink, and explain in detail how to use the built-in schema of Pulsar to query the Pulsar flow in real time using Apache Flink.

 

 

About Apache Pulsar

Apache Pulsar is a flexible publish/subscribe messaging system backed by durable log storage. Pulsar's architectural advantages include multi-tenancy, a unified messaging model, structured event streams, and a cloud-native architecture. These advantages make Pulsar a good fit for a wide variety of use cases, from billing, payments, and trading services to unifying the different messaging architectures within an organization.

 

Existing Pulsar & Flink integration

(Apache Flink 1.6+)

In the existing integration of Pulsar and Flink, Pulsar is used as a message queue in a Flink application. Flink developers can select a specific Pulsar source, connect it to the desired Pulsar cluster and topic, and use Pulsar as both the stream source and stream sink of Flink:

 

```java
// create and configure Pulsar consumer
PulsarSourceBuilder<String> builder = PulsarSourceBuilder
    .builder(new SimpleStringSchema())
    .serviceUrl(serviceUrl)
    .topic(inputTopic)
    .subscriptionName(subscription);
SourceFunction<String> src = builder.build();

// ingest DataStream with Pulsar consumer
DataStream<String> words = env.addSource(src);
```

 

The Pulsar stream can then be connected to Flink's processing logic.

```java
// perform computation on DataStream (here a simple WordCount)
DataStream<WordWithCount> wc = words
    .flatMap((FlatMapFunction<String, WordWithCount>) (word, collector) -> {
        collector.collect(new WordWithCount(word, 1));
    })
    .returns(WordWithCount.class)
    .keyBy("word")
    .timeWindow(Time.seconds(5))
    .reduce((ReduceFunction<WordWithCount>) (c1, c2) ->
        new WordWithCount(c1.word, c1.count + c2.count));
```

 

The results can then be written back to Pulsar through a sink.

```java
// emit result via Pulsar producer
wc.addSink(new FlinkPulsarProducer<>(
    serviceUrl,
    outputTopic,
    new AuthenticationDisabled(),
    wordWithCount -> wordWithCount.toString().getBytes(UTF_8),
    wordWithCount -> wordWithCount.word));
```

 

This was an important first step for the integration, but the existing design does not take full advantage of Pulsar's capabilities.

 

The Pulsar integration in Flink 1.6.0 has some shortcomings: Pulsar is used neither as persistent storage, nor is its schema integrated with Flink, so developers have to manually provide a schema description when registering an application's schema.

 

 

Pulsar and Flink 1.9 integration

Using Pulsar as a Flink catalog

 

Flink 1.9.0's latest integration with Pulsar solves the problems mentioned earlier. Alibaba Blink's contributions to the Flink repository not only strengthen the processing architecture but also add new features, making the integration of Flink and Pulsar more powerful and effective.

Flink 1.9.0:
https://flink.apache.org/downloads.html#apache-flink-191


The new connector implementation introduces Pulsar schema integration and adds support for the Table API, while providing exactly-once semantics when reading from Pulsar and at-least-once semantics when writing to Pulsar.

 

Moreover, through schema integration, Pulsar can be registered as a Flink catalog, and only a few commands are needed to run Flink queries on Pulsar streams. Next, we will describe the new integration in detail and give an example of how to query a Pulsar stream using Flink SQL.

 

 

Flink <> Pulsar schema integration

 

Before we expand the integration details and specific usage, let's take a look at how Pulsar schema works.

 

Apache Pulsar has built-in schema support, so no external schema management is required. A Pulsar topic is associated with a data schema, so producers and consumers can send data using predefined schema information, while the broker verifies schemas and manages schema versioning and schema evolution through compatibility checks.
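To make the versioning and compatibility-checking idea concrete, here is a toy sketch (not Pulsar's actual implementation; the class and rule below are illustrative assumptions) of a broker-side registry that assigns each accepted schema a version and rejects a schema that drops fields present in the latest version:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch only: a per-topic schema registry that versions schemas and
// applies a simple backward-compatibility rule, namely "a new schema may
// only add fields, never remove existing ones". Real Pulsar brokers apply
// configurable compatibility strategies, not this exact rule.
class ToySchemaRegistry {
    private final List<List<String>> versions = new ArrayList<>(); // version -> field names

    // Returns the assigned version, or -1 if the schema is rejected.
    int register(List<String> fields) {
        if (!versions.isEmpty()) {
            List<String> latest = versions.get(versions.size() - 1);
            if (!fields.containsAll(latest)) {
                return -1; // incompatible: drops a field the latest version has
            }
        }
        versions.add(fields);
        return versions.size() - 1;
    }

    List<String> get(int version) {
        return versions.get(version);
    }
}
```

The point of the sketch is only the shape of the protocol: registration either yields a new version number or a rejection, and every historical version stays retrievable.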

 

The following are examples of using a Pulsar schema on the producer side and the consumer side, respectively. On the producer side, you can specify a schema, and Pulsar can send a POJO class without the application performing serialization/deserialization itself.

 

Similarly, on the consumer side, you can also specify a data schema. After receiving the data, Pulsar automatically verifies the schema information, obtains the schema of the given version, and then deserializes the data into the POJO structure. Pulsar stores schema information in the topic's metadata.

 

```java
// Create producer with Struct schema and send messages
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
    .topic(topic)
    .create();
producer.newMessage()
    .value(User.builder()
        .userName("pulsar-user")
        .userId(1L)
        .build())
    .send();

// Create consumer with Struct schema and receive messages
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
    .topic(topic)
    .subscriptionName(subscription)
    .subscribe();
consumer.receive();
```

 

Suppose an application specifies a schema for its producer and/or consumer. When connecting to the broker, the producer (or consumer) transmits this schema information so that the broker can register the schema, verify it, and check schema compatibility before accepting or rejecting it, as shown in the following figure:

 

 

Pulsar can not only process and store schema information, but also handle schema evolution when necessary: it manages schema evolution in the broker and tracks all versions of a schema through the necessary compatibility checks.

 

In addition, when a message is published on the producer side, Pulsar stamps the schema version into the message's metadata; when a consumer receives a message and deserializes the metadata, Pulsar checks the schema version associated with the message and fetches the corresponding schema information from the broker.
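The mechanism above can be sketched in a few lines. This is a hypothetical illustration, not Pulsar's API: the types, the comma-separated "payload", and the local schema cache are all assumptions made only to show how a version stamped on a message drives deserialization.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch only: each message carries the schema version it was produced
// with; on receive, the consumer resolves that version to a field list and
// uses it to deserialize the comma-separated payload into a record.
class VersionedMessage {
    final int schemaVersion;   // stamped into the message metadata by the producer
    final String payload;      // comma-separated field values (stand-in for real encoding)
    VersionedMessage(int v, String p) { schemaVersion = v; payload = p; }
}

class VersionAwareConsumer {
    // version -> ordered field names; a real consumer fetches an unknown
    // version from the broker instead of relying only on a local cache
    private final Map<Integer, String[]> schemas = new HashMap<>();

    void cacheSchema(int version, String... fieldNames) {
        schemas.put(version, fieldNames);
    }

    Map<String, String> receive(VersionedMessage msg) {
        String[] fields = schemas.get(msg.schemaVersion);
        String[] values = msg.payload.split(",");
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            record.put(fields[i], values[i]);
        }
        return record;
    }
}
```

Because the version travels with each message, old messages written under an earlier schema remain readable after the schema evolves.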

 

Therefore, when Pulsar is integrated with a Flink application, it uses the pre-existing schema information and maps each message carrying schema information to a row in Flink's type system.

 

When Flink users do not interact with the schema directly, or use a primitive schema (for example, a topic storing String or Long values), Pulsar converts the message into a Flink row with a single field called `value`; for structured schema types (for example, JSON and Avro), Pulsar extracts the individual fields from the schema information and maps each field into Flink's type system.
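The two mapping paths can be sketched as follows. This is a toy illustration of the rule described above, not the connector's code: the `RowMapper` class, the use of a `Map` to stand in for a Flink row, and the reduced metadata set are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch only: a primitive-schema message becomes a row with a single
// `value` field, while a structured message contributes one row field per
// schema field; message metadata is appended as `__`-prefixed fields in
// both cases (only two of the metadata fields are modeled here).
class RowMapper {
    static Map<String, Object> primitiveToRow(Object value, String topic, long publishTime) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("value", value);               // whole payload in one field
        addMetadata(row, topic, publishTime);
        return row;
    }

    static Map<String, Object> structuredToRow(Map<String, Object> fields, String topic, long publishTime) {
        Map<String, Object> row = new LinkedHashMap<>(fields); // one row field per schema field
        addMetadata(row, topic, publishTime);
        return row;
    }

    private static void addMetadata(Map<String, Object> row, String topic, long publishTime) {
        row.put("__topic", topic);
        row.put("__publishTime", publishTime);
    }
}
```

Either way, the result is a flat row whose data fields come from the schema and whose `__`-prefixed fields come from message metadata, matching the two schema printouts shown below.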

 

Finally, all metadata related to a message (for example, the message key, topic, publish time, and event time) is converted into metadata fields in the Flink row. Here are two examples, using a primitive schema and a structured schema respectively, to show how data in a Pulsar topic is converted into Flink's type system.

 

Primitive schema:

```
root
 |-- value: DOUBLE
 |-- __key: BYTES
 |-- __topic: STRING
 |-- __messageId: BYTES
 |-- __publishTime: TIMESTAMP(3)
 |-- __eventTime: TIMESTAMP(3)
```

 

Structured schema (Avro schema):

```java
@Data
@AllArgsConstructor
@NoArgsConstructor
public static class Foo {
    public int i;
    public float f;
    public Bar bar;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
public static class Bar {
    public boolean b;
    public String s;
}

Schema s = Schema.AVRO(Foo.class);
```

```
root
 |-- i: INT
 |-- f: FLOAT
 |-- bar: ROW<`b` BOOLEAN, `s` STRING>
 |-- __key: BYTES
 |-- __topic: STRING
 |-- __messageId: BYTES
 |-- __publishTime: TIMESTAMP(3)
 |-- __eventTime: TIMESTAMP(3)
```

 

Once all schema information is mapped into Flink's type system, you can build a Pulsar source, sink, or catalog in Flink according to the specified schema information, as shown below:

 

 

Flink & Pulsar: reading data from Pulsar

 

1. Create a Pulsar source for stream query

```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment

val props = new Properties()
props.setProperty("service.url", "pulsar://...")
props.setProperty("admin.url", "http://...")
props.setProperty("partitionDiscoveryIntervalMillis", "5000")
props.setProperty("startingOffsets", "earliest")
props.setProperty("topic", "test-source-topic")

val source = new FlinkPulsarSource(props)
// you don't need to provide type information to addSource since FlinkPulsarSource is ResultTypeQueryable
val dataStream = env.addSource(source)(null)

// chain operations on dataStream of Row and sink the output

env.execute()
```

 

2. Register topics in Pulsar as streaming tables

```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val props = new Properties()
props.setProperty("service.url", serviceUrl)
props.setProperty("admin.url", adminUrl)
props.setProperty("flushOnCheckpoint", "true")
props.setProperty("failOnWrite", "true")
props.setProperty("topic", "test-sink-topic")

tEnv
  .connect(new Pulsar().properties(props))
  .inAppendMode()
  .registerTableSource("sink-table")

val sql = "INSERT INTO sink-table ....."
tEnv.sqlUpdate(sql)
env.execute()
```

 

 

Flink & Pulsar: writing data to Pulsar

 

1. Create a Pulsar sink for stream query

```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = .....

val prop = new Properties()
prop.setProperty("service.url", serviceUrl)
prop.setProperty("admin.url", adminUrl)
prop.setProperty("flushOnCheckpoint", "true")
prop.setProperty("failOnWrite", "true")
prop.setProperty("topic", "test-sink-topic")

stream.addSink(new FlinkPulsarSink(prop, DummyTopicKeyExtractor))
env.execute()
```

 

2. Write streaming table to Pulsar

```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val props = new Properties()
props.setProperty("service.url", serviceUrl)
props.setProperty("admin.url", adminUrl)
props.setProperty("flushOnCheckpoint", "true")
props.setProperty("failOnWrite", "true")
props.setProperty("topic", "test-sink-topic")

tEnv
  .connect(new Pulsar().properties(props))
  .inAppendMode()
  .registerTableSource("sink-table")

val sql = "INSERT INTO sink-table ....."
tEnv.sqlUpdate(sql)
env.execute()
```

 

In the above examples, Flink developers do not need to worry about schema registration or serialization/deserialization, and can register a Pulsar cluster as a Flink source, sink, or streaming table.


When these three elements exist at the same time, Pulsar can be registered as a catalog in Flink, which greatly simplifies data processing and querying, for example, writing programs that query data from Pulsar, or using the Table API and SQL to query Pulsar data streams.
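With Pulsar registered as a Flink catalog, topics appear as tables. As an illustrative sketch only (the catalog name `pulsarcatalog`, the `public/default` namespace, and the topic name below are assumptions, not taken from this article), a Flink SQL session against such a catalog might look like:

```sql
-- assumed: a catalog named `pulsarcatalog` is configured in the SQL client
USE CATALOG pulsarcatalog;

-- assumed: a Pulsar namespace acts as the database
USE `public/default`;

-- assumed: each topic acts as a table, with metadata exposed as __-prefixed columns
SELECT word, `__publishTime` FROM `test-source-topic`;
```

No CREATE TABLE or schema declaration is needed here, because the table schema is derived from the Pulsar topic's own schema as described earlier.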

 

 

Future plans

 

The goal of the Pulsar and Flink integration is to simplify how developers use the two frameworks to build a unified data processing stack. Compared with the classic Lambda architecture (where an online high-speed layer and an offline batch layer are combined to run data computation), the combination of Flink and Pulsar provides a truly unified data processing stack.

 

Flink serves as the unified computing engine, handling both online (streaming) and offline (batch) workloads, while Pulsar serves as the unified storage layer of that stack, simplifying developers' work.

 

There is still a lot of work to be done to improve the integration, for example, contributing the Pulsar connector to the Flink community so that it can take advantage of the new source API (FLIP-27), and supporting Pulsar's `Key_Shared` subscription type, which allows sources to scale their parallelism effectively.

For details, please refer to:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Discussion-Flink-Pulsar-Connector-td22019.html

 

Other directions for improvement include end-to-end exactly-once guarantees (currently exactly-once is provided only by the Pulsar source, not the Pulsar sink) and using Pulsar/BookKeeper as a Flink state backend.

 

You can also watch the video below, from Flink Forward Europe 2019, for a detailed introduction to the integration of Flink and Pulsar, and subscribe to the Flink development mailing list for the latest news about Flink and Pulsar contributions and integration.

 

Video link: https://youtu.be/3sBXXfgl5vs

Mailing list:
https://flink.apache.org/community.html#mailing-lists



Posted on Sat, 30 May 2020 02:54:34 -0400 by pck76