Blog recommendation | Effectively managing data security: Pulsar Schema management

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a cloud-native, distributed messaging and streaming platform for the next generation of cloud infrastructure, integrating messaging, storage, and lightweight function computing. Pulsar uses a compute-storage separation architecture and supports multi-tenancy, persistent storage, and cross-datacenter geo-replication, offering strongly consistent, high-throughput, low-latency, and highly scalable stream data storage.
GitHub address: http://github.com/apache/pulsar/

Before looking at Schema management, make sure that basic message production and consumption in Pulsar already works. First of all, what is a Schema?

In a database, the schema describes the organization and structure of the data. If we compare Pulsar to a relational database, a Topic stores raw bytes, much like the bytes in the database's files on disk, while the Schema plays the role of turning those bytes into concretely typed tables: it is the metadata of the data. So why do we need Schema management in a message queue? Let's walk through how Pulsar Schema is used with this question in mind.

Problem background

At present the overall availability of our message queue is stable, but the safety of the data exchanged between upstream and downstream services has not been effectively guaranteed. For example:

type TestCodeGenMsg struct {
-    Orderid     int64     `json:"orderid"`
+    Orderid     string    `json:"orderid"`
     Uid         int64     `json:"uid"`
     Flowid      string    `json:"flowid"`
}

This "incompatible" format will destroy most downstream services because they expect numeric types but now get a string. It is impossible for us to know in advance how much damage will be caused. For example, it is easy to blame "poor communication" or "lack of appropriate processes".

In a microservice architecture the API is treated as a first-class citizen: it is a contract with strong constraints, so any protocol change can be noticed quickly and in advance. Event consumption through a message queue, by contrast, often cannot be tested or reacted to as quickly, and when the data model changes at scale, especially when the events end up written to a database, the fallout can be just as bad as a broken API. Gwen Shapira wrote a good article introducing data contracts and schema management. What we want is to manage schema changes based on simple compatibility policies, let data evolve safely, and decouple teams so that they can develop independently and quickly. This is why we need Schema management.

Desired goal

Manage the schema according to a compatibility policy so that the data can evolve safely. For example, the following change should be allowed:

type TestCodeGenMsg struct {
    Orderid     int64     `json:"orderid"`
    Uid         int64     `json:"uid"`
    Flowid      string    `json:"flowid"`
+   Username    string    `json:"username"`
}

The following shall not pass:

//Verification failed
type TestCodeGenMsg struct {
-    Orderid     int64     `json:"orderid"`
+    Orderid     string    `json:"orderid"`
     Uid         int64     `json:"uid"`
     Flowid      string    `json:"flowid"`
}

How do we use it

The main difference between the messaging model and an API is that events and their schemas are stored for a long time. Once every application calling an API has been upgraded from v1 to v2, you can safely assume that the services using v1 are gone; this takes time, but it is usually measured in weeks rather than years. That is not the case for events, whose old versions a message queue may keep forever. Several questions then need answering: who should we upgrade first, consumers or producers? Can new consumers handle old events that are still stored in Pulsar? Do we need to wait before upgrading consumers? Can old consumers handle events written by new producers?

Pulsar Schema defines a set of compatibility rules that determine which changes we can make to a schema without breaking consumers, and how upgrades should be handled for each kind of schema change. How do we set this up? First, confirm on the broker whether the current namespace allows automatic schema evolution, and which compatibility policy it uses. The available compatibility strategies are:

  • ALWAYS_COMPATIBLE: skip the compatibility check and accept every change;
  • ALWAYS_INCOMPATIBLE: reject every schema change;
  • BACKWARD: consumers using the new schema can read data written with the last schema;
  • BACKWARD_TRANSITIVE: consumers using the new schema can read data written with all previous schemas;
  • FORWARD: consumers using the last schema can read data written with the new schema;
  • FORWARD_TRANSITIVE: consumers using any previous schema can read data written with the new schema;
  • FULL: the new schema is both backward and forward compatible with the last schema;
  • FULL_TRANSITIVE: the new schema is backward and forward compatible with all previous schemas.

We can check and change these settings through the CLI:

// Query whether the current namespace allows schema auto evolution
./pulsar-admin namespaces get-is-allow-auto-update-schema tenant/namespace
 
// Enable it if it is not allowed yet
./pulsar-admin namespaces set-is-allow-auto-update-schema --enable tenant/namespace
 
// Query the schema compatibility strategy of the current namespace
./pulsar-admin namespaces get-schema-compatibility-strategy tenant/namespace
 
// Set the strategy that fits your needs, e.g. FORWARD_TRANSITIVE
./pulsar-admin namespaces set-schema-compatibility-strategy -c FORWARD_TRANSITIVE tenant/namespace

Producer

Next, let's hook up the producer. First look at the following example:

package main
import (
    "context"
    "fmt"
    "time"

    "github.com/apache/pulsar-client-go/pulsar"
)
type TestSchema struct {
    Age   int    `json:"age"`
    Name  string `json:"name"`
    Addr  string `json:"addr"`
}
const AvroSchemaDef = `{"type":"record","name":"test","namespace":"CodeGenTest","fields":[{"name":"age","type":"int"},{"name":"name","type":"string"},{"name":"addr","type":"string"}]}`
var client *pulsar.Client
func main() {
     // Create client
    cp := pulsar.ClientOptions{
        URL:              "pulsar://xxx.xxx.xxx.xxx:6650",
        OperationTimeout: 30 * time.Second,
    }
    
    var err error
    client, err = pulsar.NewClient(cp)
    if err != nil {
        fmt.Println("NewClient error:", err.Error())
        return
    }
    defer client.Close()
    
    if err := Produce(); err != nil{
        fmt.Println("Produce error:", err.Error())
        return
    }
    
    if err := Consume(context.Background()); err != nil {
        fmt.Println("Consume error:", err.Error())
        return
    }
}

func Produce() error {
    
    // Create schema
    properties := make(map[string]string)
    pas := pulsar.NewAvroSchema(AvroSchemaDef, properties)
    po := pulsar.ProducerOptions{
        Topic:       "persistent://test/schema/topic",
        Name:        "test_group",
        SendTimeout: 30 * time.Second,
        Schema:      pas,
    }
    
    // Create producer
    producer, err := client.CreateProducer(po)
    if err != nil {
        fmt.Println("CreateProducer error:", err.Error())
        return err
    }
    defer producer.Close()
    
    // Write message
    t := TestSchema{
            Age: 10,
            Name: "test",
            Addr: "test_addr",
    }
    
    id, err := producer.Send(context.Background(), &pulsar.ProducerMessage{
            Key:       t.Name, // the message key must be a string
            Value:     t,
            EventTime: time.Now(),
        })
    if err != nil {
            fmt.Println("Send error:", err.Error())
             return err
    }
    
    fmt.Println("msgId:", id)
}

The demo above builds a producer with a schema. Looking through the ProducerOptions struct we find a Schema field, so we know we need to pass in a schema object, which is created like this:

properties := make(map[string]string)
jas := pulsar.NewAvroSchema(jsonAvroSchemaDef, properties)

Besides the Avro schema, there are many other schema types, such as JSON and Protobuf; pick whichever fits your needs (a JSON example is sketched after the Avro discussion below). If you want to read more on this topic, Martin Kleppmann wrote a good blog post comparing schema evolution across different data formats. Now let's look at what actually constrains the data structure. The schema definition constant looks like this:

  const jsonAvroSchemaDef = `{"type":"record","name":"test","namespace":"CodeGenTest","fields":[{"name":"age","type":"int"},{"name":"name","type":"string"},{"name":"addr","type":"string"}]}`

Expanded:

{
    "type":"record",
    "name":"test",
    "namespace":"Test",
    "fields":[
        {
            "name":"age",
            "type":"int"
        },
        {
            "name":"name",
            "type":["null","string"] // an optional field
        },
        {
            "name":"addr",
            "type":"string",
            "default":"beijing" // a field with a default value
        }
    ]
}

This is an Avro schema (all of the validated schema types are written this way). The fields array lists each required field name and its type, and the schema name and namespace must be set before the compatibility policy can be applied. Avro supports the primitive types null, boolean, int, long, float, double, bytes, and string, plus complex types such as record, enum, array, map, union, and fixed; see the Avro specification for the full syntax.
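Besides Avro, the Go client can also build a JSON schema from the same Avro-style definition, in which case the payload is sent as plain JSON. Below is a minimal sketch (not from the original post) that reuses the client, the TestSchema struct, and the AvroSchemaDef constant from the producer example above; the compatibility checks work the same way, only the wire encoding changes.

// Sketch: produce with a JSON schema instead of an Avro schema.
func ProduceJSON() error {
    properties := make(map[string]string)
    jss := pulsar.NewJSONSchema(AvroSchemaDef, properties)

    producer, err := client.CreateProducer(pulsar.ProducerOptions{
        Topic:  "persistent://test/schema/topic",
        Schema: jss,
    })
    if err != nil {
        return err
    }
    defer producer.Close()

    // The Value is serialized by the schema's Encode method (encoding/json here).
    _, err = producer.Send(context.Background(), &pulsar.ProducerMessage{
        Value: TestSchema{Age: 10, Name: "test", Addr: "test_addr"},
    })
    return err
}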

Consumer

First look at the code:

func Consume(ctx context.Context) error {
    properties := make(map[string]string)
    cas := pulsar.NewAvroSchema(AvroSchemaDef, properties)
    consumer, err := client.Subscribe(pulsar.ConsumerOptions{
        Topic:            "persistent://test/schema/topic",
        SubscriptionName: "test",
        Type:             pulsar.Failover,
        Schema:           cas,
    })
    if err != nil {
        return err
    }
    defer consumer.Close()
 
    for {
        msg, err := consumer.Receive(ctx)
        if err != nil {
            return err
         }
 
        t := TestSchema{}
        if err := msg.GetSchemaValue(&t); err != nil {
            continue
        }
 
        consumer.Ack(msg)
        fmt.Println("msgId:", msg.ID(), " Payload:", string(msg.Payload()), " t:", t)
    }
}

As you can see, when a schema is used the consumer ultimately has to deserialize the message with the GetSchemaValue() method to really get the safety guarantee. That is the overall shape of production and consumption with schemas. Next comes the concept of schema evolution: how schemas work and what their workflow looks like.

In Kafka's ecosystem, Confluent developed a Schema Registry server that runs independently of the brokers. Its workflow is:

  • Before sending data to Kafka, the producer registers the schema with the Schema Registry, then serializes the data and sends it to Kafka;
  • The Schema Registry assigns a globally unique ID to each registered schema; the assigned IDs are guaranteed to increase monotonically, but not necessarily contiguously;
  • Before deserializing data consumed from Kafka, the consumer first checks whether the schema is already in local memory; only if it is not does it fetch the schema from the Schema Registry.

Pulsar differs in that:

  • Pulsar manages schema evolution itself and stores the schema information on bookies (in BookKeeper);
  • Schema information is not carried in Pulsar's message protocol;
  • Consumers need to supply their own schema.

Although the principle is similar to Kafka's, Pulsar does not separate the schema server from the broker, and schema information is stored on bookies, which solves the high-availability problem of the schema server. The compatibility check for schema evolution happens on the broker side (this is not where serialization and deserialization happen).
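To make this broker-side check concrete, here is a hypothetical sketch (not from the original post): with FORWARD_TRANSITIVE enforced on the namespace, trying to register an Avro definition whose age field changed from int to string is rejected when the producer is created, before any message is sent.

// Hypothetical example: the broker rejects an incompatible schema at
// producer-creation time; the exact error message depends on the broker version.
const incompatibleAvroDef = `{"type":"record","name":"test","namespace":"CodeGenTest","fields":[{"name":"age","type":"string"},{"name":"name","type":"string"},{"name":"addr","type":"string"}]}`

func TryIncompatibleProducer() {
    pas := pulsar.NewAvroSchema(incompatibleAvroDef, nil)
    _, err := client.CreateProducer(pulsar.ProducerOptions{
        Topic:  "persistent://test/schema/topic",
        Schema: pas,
    })
    if err != nil {
        fmt.Println("CreateProducer rejected:", err.Error())
    }
}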

So what does the client do? As described above, the real guarantee of schema safety is the type checking performed by the corresponding Encode and Decode calls. Looking at the source code, the schema passed in is validated while the producer and consumer are being created, and the message structure itself stays independent of it.

On the consumer side, GetSchemaValue() is what invokes the Decode() method we just mentioned.

A message type only needs an implementation of the Schema interface:

type Schema interface {
    Encode(v interface{}) ([]byte, error)
    Decode(data []byte, v interface{}) error
    Validate(message []byte) error
    GetSchemaInfo() *SchemaInfo
}

For the concrete implementations, refer to the Pulsar Go client documentation; the client ships implementations for several serialized data types.
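To make the interface concrete, here is a toy implementation built on encoding/json. It is an illustration only, assuming the SchemaInfo fields (Name, Schema, Type) exposed by the Go client; in real code you would simply use the built-in pulsar.NewJSONSchema or pulsar.NewAvroSchema constructors.

// Toy Schema implementation for illustration only (requires "encoding/json").
type MyJSONSchema struct {
    info pulsar.SchemaInfo
}

func NewMyJSONSchema(schemaDef string) *MyJSONSchema {
    return &MyJSONSchema{info: pulsar.SchemaInfo{
        Name:   "my-json",
        Schema: schemaDef,   // the Avro-style JSON definition registered with the broker
        Type:   pulsar.JSON, // tells the broker how to interpret the definition
    }}
}

func (s *MyJSONSchema) Encode(v interface{}) ([]byte, error) {
    return json.Marshal(v)
}

func (s *MyJSONSchema) Decode(data []byte, v interface{}) error {
    return json.Unmarshal(data, v)
}

func (s *MyJSONSchema) Validate(message []byte) error {
    // Only checks that the payload is well-formed JSON.
    var m map[string]interface{}
    return json.Unmarshal(message, &m)
}

func (s *MyJSONSchema) GetSchemaInfo() *pulsar.SchemaInfo {
    return &s.info
}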

Supplement

As part of a Pulsar topic's metadata, the schema can also be used by Pulsar SQL. The Pulsar SQL storage layer implements the Presto connector interface, and the schema is surfaced to the SQL layer as the metadata of the Presto payload, which makes it much easier to inspect messages and run analyses on the data. This, on top of everything above, is another reason why we need schema management. Thanks for reading.

About the author

My name is Hou Shengxin, also known as Dayun. I currently work on the infrastructure team at PalFish (伴鱼), where I am responsible for maintaining and developing the message queue. I am also a (self-described rookie) member of the Rust Daily group, and I enjoy digging into storage, service governance, and related areas. When I first encountered Pulsar I was drawn to its storage-compute separation architecture; the smooth producer and consumer experience and the high throughput made me curious about how the project is implemented. I hope to contribute to Pulsar's features in the future.
