This article is translated from the StreamNative blog. Original author: Ioannis Polyzos, solution engineer at StreamNative. Original link: https://streamnative.io/blog/...
Background
Building a modern data infrastructure has long been a challenge for enterprises. Today's enterprises must manage large amounts of heterogeneous data that is generated and delivered around the clock. However, there is no "one size fits all" solution, because requirements for data volume and velocity vary from business to business. Instead, enterprises need to move data between different systems in order to store, process, and serve it.
Looking back at how infrastructure has been built, enterprises have used many different tools to move data, such as Apache Kafka for streaming workloads and RabbitMQ for messaging workloads. The arrival of Apache Pulsar now simplifies this picture.
Apache Pulsar is a cloud-native, distributed messaging and streaming platform. Pulsar is designed for modern data needs and supports flexible messaging semantics, tiered storage, multi-tenancy, and geo-replication (cross-region data replication). Since graduating as a top-level project of the Apache Software Foundation in 2018, Pulsar has seen rapid development, community growth, an expanding ecosystem, and adoption by users worldwide. With Pulsar as the backbone of their data infrastructure, companies can move data quickly and at scale. In this blog post, we introduce how to use Pulsar IO to easily import and export data between Pulsar and external systems.
1. Introduction to Pulsar IO
Pulsar IO is a complete toolkit for creating, deploying, and managing Pulsar connectors that integrate with external systems such as key/value stores, distributed file systems, search indexes, databases, data warehouses, and other messaging systems. Because Pulsar IO is built on Pulsar's serverless computing layer (called Pulsar Functions), writing a Pulsar IO connector is as simple as writing a Pulsar Function.
With Pulsar IO, users can easily move data into and out of Pulsar using existing Pulsar connectors or by writing their own custom connectors. Pulsar IO offers the following advantages:
- Diverse connectors: the Pulsar ecosystem already provides many Pulsar IO connectors for external systems such as Apache Kafka, Cassandra, and Aerospike. Using these connectors helps shorten time to production because all the components needed for the integration are already in place; developers only need to supply configuration (such as connection URLs and credentials) to run a connector.
- Managed runtime: Pulsar IO provides a managed runtime that is responsible for execution, scheduling, scaling, and fault tolerance, so developers can focus on configuration and business logic.
- Multiple interfaces: the interfaces provided by Pulsar IO reduce the boilerplate code required for producing and consuming applications.
- High scalability: when more instances are needed to handle incoming traffic, users can scale horizontally simply by changing a single configuration value; with the Kubernetes runtime, they can scale elastically according to traffic requirements.
- Schema support: Pulsar IO helps users take full advantage of schemas by specifying a schema type on the data model. Pulsar IO supports JSON, Avro, Protobuf, and other schema types (a minimal example follows this list).
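To make the schema point concrete, here is a minimal sketch (not from the original post) that uses the standard Pulsar Java client to attach a JSON schema to a simple data model; the `SensorReading` POJO, topic name, and service URL are hypothetical. A connector declares a schema type on its data model in the same spirit, so producers, consumers, and connectors share a typed view of the data.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class SchemaExample {

    // Hypothetical data model; Pulsar derives the JSON schema from this POJO.
    public static class SensorReading {
        public String sensorId;
        public double temperature;

        public SensorReading() {}

        public SensorReading(String sensorId, double temperature) {
            this.sensorId = sensorId;
            this.temperature = temperature;
        }
    }

    public static void main(String[] args) throws Exception {
        // Service URL and topic name are assumptions for this example.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // JSON schema derived from the POJO; Schema.AVRO works the same way.
        Producer<SensorReading> producer = client
                .newProducer(Schema.JSON(SensorReading.class))
                .topic("sensor-readings")
                .create();

        producer.send(new SensorReading("sensor-1", 21.5));

        producer.close();
        client.close();
    }
}
```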
2. Pulsar IO runtime
Since Pulsar IO is built on Pulsar Functions, Pulsar IO and Pulsar Functions share the same runtime options. When deploying a Pulsar IO connector, users can choose from the following:
- Thread: runs as threads in the same JVM as the worker. (Typically used for testing and local runs; not recommended for production deployments.)
- Process: runs in separate processes; users can scale horizontally across multiple nodes by running multiple workers.
- Kubernetes: runs as Pods in a Kubernetes cluster, with the worker coordinating with Kubernetes. This runtime lets users take full advantage of what a cloud-native environment such as Kubernetes provides, for example easy horizontal scaling.
3. Pulsar IO interface
As mentioned earlier, Pulsar IO reduces the boilerplate code required for producing and consuming applications. It does this by providing basic interfaces that abstract away the boilerplate and let us focus on business logic.
Pulsar IO provides the basic Source and Sink interfaces. A Source connector lets users bring data into Pulsar from an external system, while a Sink connector moves data out of Pulsar into an external system, such as a database.
There is also a special type of Source connector called Push Source. A Push Source connector makes it easy to integrate data that needs to be pushed; for example, a Push Source can wrap a change data capture (CDC) source system that automatically pushes each change to Pulsar as soon as it is received.
Source interface
```java
public interface Source<T> extends AutoCloseable {

    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sourceContext environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext sourceContext) throws Exception;

    /**
     * Reads the next message from source.
     * If source does not have any new messages, this call should block.
     *
     * @return next message from source. The return result should never be null
     * @throws Exception
     */
    Record<T> read() throws Exception;
}
```
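As an illustration only (not part of the original post), the following is a minimal sketch of a custom Source that emits an incrementing counter once per second. The class name is hypothetical, and the package names reflect recent Pulsar releases.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.Source;
import org.apache.pulsar.io.core.SourceContext;

/**
 * Minimal example Source that emits an incrementing counter once per second.
 */
public class CounterSource implements Source<String> {

    private long counter;

    @Override
    public void open(Map<String, Object> config, SourceContext sourceContext) {
        // Read connection settings or credentials from `config` here.
        counter = 0;
    }

    @Override
    public Record<String> read() throws Exception {
        // Block until the next value is available, as the contract requires.
        TimeUnit.SECONDS.sleep(1);
        String value = Long.toString(counter++);
        // getValue() is the only abstract method on Record, so a lambda suffices.
        return () -> value;
    }

    @Override
    public void close() {
        // Release external resources (connections, clients) here.
    }
}
```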
Push Source interface
```java
public interface BatchSource<T> extends AutoCloseable {

    /**
     * Open connector with configuration.
     *
     * @param config config that's supplied for source
     * @param context environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext context) throws Exception;

    /**
     * Discovery phase of a connector. This phase will only be run on one instance,
     * i.e. instance 0, of the connector.
     * Implementations use the taskEater consumer to output serialized representation
     * of tasks as they are discovered.
     *
     * @param taskEater function to notify the framework about the new task received.
     * @throws Exception during discover
     */
    void discover(Consumer<byte[]> taskEater) throws Exception;

    /**
     * Called when a new task appears for this connector instance.
     *
     * @param task the serialized representation of the task
     */
    void prepare(byte[] task) throws Exception;

    /**
     * Read data and return a record.
     * Return null if no more records are present for this task.
     *
     * @return a record
     */
    Record<T> readNext() throws Exception;
}
```
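Purely as a sketch (not from the original post), a connector implementing the BatchSource interface shown above might discover a fixed set of tasks, have one task prepared per instance at a time, and return records until the task is exhausted. The class name, task names, and record contents are hypothetical; package names reflect recent Pulsar releases.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;

import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.BatchSource;
import org.apache.pulsar.io.core.SourceContext;

/**
 * Minimal example BatchSource: each "task" is a name discovered once,
 * and readNext() emits a single record per prepared task.
 */
public class NamesBatchSource implements BatchSource<String> {

    private Iterator<String> currentTask;

    @Override
    public void open(Map<String, Object> config, SourceContext context) {
        // Read configuration (paths, credentials, batch settings) here.
    }

    @Override
    public void discover(Consumer<byte[]> taskEater) {
        // Runs on instance 0 only; emit each task in serialized form.
        for (String name : Arrays.asList("alpha", "beta", "gamma")) {
            taskEater.accept(name.getBytes(StandardCharsets.UTF_8));
        }
    }

    @Override
    public void prepare(byte[] task) {
        // Deserialize the task assigned to this instance.
        String name = new String(task, StandardCharsets.UTF_8);
        currentTask = Arrays.asList("hello " + name).iterator();
    }

    @Override
    public Record<String> readNext() {
        // Return null once the current task has no more records.
        if (currentTask == null || !currentTask.hasNext()) {
            return null;
        }
        String value = currentTask.next();
        return () -> value;
    }

    @Override
    public void close() {
        // Release any resources held for the current task.
    }
}
```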
Sink interface
```java
public interface Sink<T> extends AutoCloseable {

    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sinkContext environment where the sink connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SinkContext sinkContext) throws Exception;

    /**
     * Write a message to Sink.
     *
     * @param record record to write to sink
     * @throws Exception
     */
    void write(Record<T> record) throws Exception;
}
```
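For illustration only (not from the original post), here is a minimal sketch of a custom Sink that simply prints each record to standard output and acknowledges it; a real connector would write to an external system such as a database or search index. The class name is hypothetical, and package names reflect recent Pulsar releases.

```java
import java.util.Map;

import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.Sink;
import org.apache.pulsar.io.core.SinkContext;

/**
 * Minimal example Sink that "writes" each record to standard output.
 */
public class LoggingSink implements Sink<String> {

    @Override
    public void open(Map<String, Object> config, SinkContext sinkContext) {
        // Establish the connection to the external system here.
    }

    @Override
    public void write(Record<String> record) {
        System.out.println(record.getValue());
        // Acknowledge the record so Pulsar does not redeliver it;
        // call record.fail() instead if the write cannot be completed.
        record.ack();
    }

    @Override
    public void close() {
        // Close the connection to the external system.
    }
}
```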
4. Summary
Apache Pulsar can serve as the backbone of a modern data infrastructure, enabling enterprises to move data quickly and at scale. Pulsar IO is a connector framework that gives developers all the tools they need to create, deploy, and manage Pulsar connectors that integrate with different systems. Pulsar IO abstracts away all the boilerplate code so that developers can focus on application logic.
5. Extended reading
If you are interested in learning more and building your own connectors, check out the following resources:
- Browse all Pulsar IO connectors in the Pulsar ecosystem
- Build and deploy a Source connector
- Write a custom Sink connector for Pulsar IO
- Monitor and troubleshoot connectors
About the translator
Song Bo works at Beijing Baiguan Technology Co., Ltd. as a senior development engineer, focusing on microservices, cloud computing, and big data.