Introduction to Flagger and analysis of its principles

Brief introduction

Flagger fully automates (unattended) the release process for applications running on Kubernetes (k8s). It reduces the human attention a release requires, and during the release it automatically detects risks (based on RT, success rate, or custom metrics) and rolls back.

Main characteristics

Overall structure

The components in the figure above are:
• primary service: the stable version of the service, that is, the version already released online
• canary service: the new version of the service that is about to be released
• Ingress: the service gateway
• Flagger: based on the Flagger spec (described below), Flagger adjusts the traffic policies of primary and canary through the ingress/service mesh to achieve A/B testing, blue/green, or canary releases. While the traffic is being adjusted, it collects metrics (RT, success rate, etc.) from Prometheus to decide whether to roll back the release or keep shifting traffic. Manual intervention, approval, and notifications can be configured for this process

Implementation principle

Note: the explanation of the principles below is based mainly on the following official example:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # service mesh provider (optional)
  # can be: kubernetes, istio, linkerd, appmesh, nginx, contour, gloo, supergloo
  provider: istio
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service name (defaults to targetRef.name)
    name: podinfo
    # ClusterIP port number
    port: 9898
    # container port name or number (optional)
    targetPort: 9898
    # port name can be http or grpc (default http)
    portName: http
    # add all the other container ports
    # to the ClusterIP services (default false)
    portDiscovery: true
    # HTTP match conditions (optional)
    match:
      - uri:
          prefix: /
    # HTTP rewrite (optional)
    rewrite:
      uri: /
    # request timeout (optional)
    timeout: 5s
  # promote the canary without analysing it (default false)
  skipAnalysis: false
  # define the canary analysis timing and KPIs
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # validation (optional)
    metrics:
      - name: request-success-rate
        # builtin Prometheus check
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # builtin Prometheus check
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s
      - name: "database connections"
        # custom Prometheus check
        templateRef:
          name: db-connections
        thresholdRange:
          min: 2
          max: 100
        interval: 1m
    # testing (optional)
    webhooks:
      - name: "conformance test"
        type: pre-rollout
        url: http://flagger-helmtester.test/
        timeout: 5m
        metadata:
          type: "helmv3"
          cmd: "test run podinfo -n test"
      - name: "load test"
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
    # alerting (optional)
    alerts:
      - name: "dev team Slack"
        severity: error
        providerRef:
          name: dev-slack
          namespace: flagger
      - name: "qa team Discord"
        severity: warn
        providerRef:
          name: qa-discord
      - name: "on-call MS Teams"
        severity: info
        providerRef:
          name: on-call-msteams

The main fields of the configuration above are summarized here; each is described in more detail later:
• targetRef: the workload for the new version being deployed (either a Deployment or a DaemonSet)
• progressDeadlineSeconds: the timeout for the canary and primary deployments. If a deployment has not completed within this time, no traffic adjustment is made.
• autoscalerRef: the native k8s HPA (horizontal pod autoscaler)
• service: corresponds to the k8s Service concept. When the provider is Istio, it corresponds to a VirtualService (which can adjust traffic proportions, routing policies, etc.)
• skipAnalysis: whether to skip the metrics analysis. If true, the canary replaces the primary in one step
• analysis:
  • contains the configuration for adjusting the traffic policy between primary and canary
  • metrics: the metric sources, for example avg RT, success rate, and custom metrics (a Prometheus PromQL query can be configured directly)
  • webhooks: can be used for manual approval gates, load tests, etc
  • alerts: progress details, alert notifications, etc

Overall process

Notes:
• the whole process from Start to End in the figure above is executed on a timer
• Canary and canary in the figure above do not mean the same thing. Canary (capitalized) generally refers to the Canary (kind) object or the canary deployment strategy, while canary (lowercase) refers to the objects derived from targetRef (deployment, service).
• VirtualService (when the provider is Istio): the key to implementing A/B testing, blue/green, and canary releases. For details, refer to Istio's introduction to VirtualService
• A/B testing, Blue/Green, and Canary are described in detail below
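
The timer-driven flow described above can be pictured as a small state machine that runs once per interval. The following is only a rough sketch in Python, not Flagger's actual implementation (Flagger itself is written in Go). The names CanaryState and tick are hypothetical, the defaults are taken from the official example above, and the "more than threshold failures" rule follows the wording used later in this article.

```python
from dataclasses import dataclass


@dataclass
class CanaryState:
    weight: int = 0             # percent of traffic currently sent to the canary
    failed_checks: int = 0      # failed metric/webhook checks so far
    phase: str = "Progressing"  # hypothetical phase names for this sketch


def tick(state, checks_passed, step_weight=5, max_weight=50, threshold=10):
    """Advance the rollout by one timer tick and return the updated state."""
    if state.phase != "Progressing":
        return state  # finished rollouts are left untouched
    if not checks_passed:
        state.failed_checks += 1
        # Assumption from this article's wording: roll back once the
        # number of failures exceeds the threshold.
        if state.failed_checks > threshold:
            state.weight = 0
            state.phase = "Failed"
        return state
    if state.weight >= max_weight:
        state.phase = "Promoting"  # all steps passed; primary can be replaced
        return state
    state.weight = min(state.weight + step_weight, max_weight)
    return state
```

Calling tick repeatedly with checks_passed=True raises the canary weight in 5% steps up to 50% and then moves to the promotion phase; repeated failures eventually reset the weight to 0.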

Deployment strategy

A/B testing

analysis:
  # schedule interval (default 60s)
  interval: 1m
  # total number of iterations
  iterations: 10
  # max number of failed iterations before rollback
  threshold: 2
  # canary match condition
  match:
    - headers:
        x-canary:
          regex: ".*insider.*"
    - headers:
        cookie:
          regex: "^(.*?;)?(canary=always)(;.*)?$"

Taking the configuration above as an example:
• multiple HTTP routes are set when the VirtualService is created
• by default, traffic goes to the primary service
• traffic is routed to the canary service when an HTTP header or cookie matches the regular expression
• the whole process is executed 10 times at 1-minute intervals. At most 2 failed metric checks are allowed; if there are more than 2, a rollback is performed
• after the analysis ends normally, the "confirm-promotion" webhook is executed to confirm whether to replace the primary with the canary
  • if yes, the primary is replaced with the canary's spec (deployment spec, configmap, and related information)
  • if not, Flagger continues to wait
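
To make the routing rule concrete, here is a minimal sketch of how the two match conditions from the A/B testing example select a backend. It only imitates the decision in Python; in a real cluster the matching is done by the mesh (the VirtualService), not by application code. The function pick_backend is a hypothetical name, and whole-string matching is an assumption about how the regular expressions are applied.

```python
import re

# Match conditions copied from the analysis.match example above; each entry
# maps a header name to a regular expression.
MATCH_CONDITIONS = [
    {"x-canary": r".*insider.*"},
    {"cookie": r"^(.*?;)?(canary=always)(;.*)?$"},
]


def pick_backend(headers):
    """Return "canary" if any match condition is satisfied, else "primary"."""
    for condition in MATCH_CONDITIONS:
        if all(
            name in headers and re.fullmatch(pattern, headers[name])
            for name, pattern in condition.items()
        ):
            return "canary"
    return "primary"  # default route


print(pick_backend({"x-canary": "insider"}))                  # canary
print(pick_backend({"cookie": "session=abc;canary=always"}))  # canary
print(pick_backend({"cookie": "canary=never"}))               # primary
```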

Blue/Green

analysis:
  # schedule interval (default 60s)
  interval: 1m
  # total number of iterations
  iterations: 10
  # max number of failed iterations before rollback
  threshold: 2
  webhooks:
    - name: "load test"
      type: rollout
      url: http://flagger-loadtester.test/
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"

Taking the configuration above as an example:
• the whole process is executed 10 times at 1-minute intervals. At most 2 failed metric checks are allowed; if there are more than 2, a rollback is performed
• during this period, the canary service is load tested
• after the analysis ends normally, the "confirm-promotion" webhook is executed to confirm whether to replace the primary with the canary
  • if yes, the primary is replaced with the canary's spec (deployment spec, configmap, and related information)
  • if not, Flagger continues to wait

If mirror = true is configured (this feature is only supported when provider=istio), Istio's traffic mirroring is used to copy each request to both primary and canary, and the primary's response is used as the return value. Pay special attention to whether the service is idempotent in this case, because every mirrored request is processed twice

Canary

analysis:
  # schedule interval (default 60s)
  interval: 1m
  # max number of failed metric checks before rollback
  threshold: 2
  # max traffic percentage routed to canary
  # percentage (0-100)
  maxWeight: 50
  # canary increment step
  # percentage (0-100)
  stepWeight: 2
# deploy straight to production without
# the metrics and webhook checks
skipAnalysis: false

Taking the configuration above as an example:
• the whole process is executed 25 times (maxWeight / stepWeight = 50 / 2) at 1-minute intervals. At most 2 failed metric checks are allowed; if there are more than 2, a rollback is performed
• at every step, primary's traffic is reduced by stepWeight% and canary's is increased by stepWeight%, until canary reaches maxWeight
• the "confirm-promotion" webhook is executed to confirm whether to replace the primary with the canary
  • if yes, the primary is replaced with the canary's spec (deployment spec, configmap)
  • if not, Flagger continues to wait
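
The step count above follows directly from the two weights. As a small worked example (a sketch only; canary_weights is a hypothetical helper, not part of Flagger), the traffic percentage after each step can be listed like this:

```python
def canary_weights(step_weight, max_weight):
    """Return the canary traffic percentage after each analysis step."""
    if step_weight <= 0 or max_weight <= 0:
        raise ValueError("step_weight and max_weight must be positive")
    weights = []
    weight = 0
    while weight < max_weight:
        # Never overshoot max_weight, even if it is not a multiple of the step.
        weight = min(weight + step_weight, max_weight)
        weights.append(weight)
    return weights


schedule = canary_weights(step_weight=2, max_weight=50)
print(len(schedule))               # 25
print(schedule[:3], schedule[-1])  # [2, 4, 6] 50
```

With interval: 1m, 25 steps mean the traffic shift takes at least about 25 minutes when every check passes.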

Other

Webhooks

Webhooks: extension points defined throughout the release process:
• confirm-rollout: executed before the canary receives traffic. It can be used for scenarios such as manual approval of a release or requiring automated tests to pass
  • if the webhook does not return success (for example, an HTTP 200 status code), the release waits
• pre-rollout: executed before traffic is first shifted to the canary. If the number of failures exceeds the threshold, a rollback is performed
• rollout: executed before the metrics analysis in each release cycle (for example, at every stepWeight). If the number of failures exceeds the threshold, a rollback is performed
• confirm-promotion: executed before the primary is changed to the canary's configuration
  • if it fails, the release waits. While waiting, Flagger continues to run metric checks, and the release is eventually rolled back if they fail
• post-rollout: executed after a rollback or after the release finishes. If it fails, only an event log is recorded.
• rollback: while the Canary is in the Progressing or Waiting state, this provides the ability to trigger a rollback manually
• event: k8s events are generated at each stage of the life cycle. If an event webhook is configured, the event information is also sent to it whenever a k8s event is emitted
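
As an illustration of a confirm-rollout style gate, the sketch below shows only the decision a gate service might make, without the HTTP server around it. It assumes that the webhook request body is JSON containing the canary's name and namespace (an assumption here; check the Flagger webhook documentation for the exact payload) and that a 200 response lets the release continue. APPROVED and gate_status are hypothetical names.

```python
import json

# Hypothetical in-memory approval list: (namespace, name) pairs that an
# operator has approved for release.
APPROVED = {("test", "podinfo")}


def gate_status(body):
    """Return the HTTP status code the gate would answer with."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 400  # body is not valid JSON
    if not isinstance(payload, dict):
        return 400
    key = (payload.get("namespace"), payload.get("name"))
    # 200 lets the release continue; any other status keeps it waiting.
    return 200 if key in APPROVED else 403
```

A real gate would wrap this function in an HTTP handler and persist approvals somewhere durable; the in-memory set is only for illustration.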

Metrics

Metrics: used to decide whether the traffic verification of an A/B testing, blue/green, or canary release has failed. If the number of failures exceeds the threshold, the release is rolled back
• default metrics

analysis:
  metrics:
    - name: request-success-rate
      interval: 1m
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500

request-success-rate: in the example above, the success rate must not fall below 99%
request-duration: the P99 request duration must not exceed 500 ms
request-success-rate and request-duration are the metrics that Flagger provides by default

Different providers implement these metrics differently; for example, the application itself can expose Prometheus metrics
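
The thresholdRange check itself is simple to state in code. The following is a minimal sketch of the comparison, assuming that a missing min or max means "no bound on that side"; within_threshold is a hypothetical helper and the sample values are made up for illustration.

```python
def within_threshold(value, min_value=None, max_value=None):
    """Return True if value lies inside the optional [min, max] range."""
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


# request-success-rate with thresholdRange.min = 99
print(within_threshold(99.5, min_value=99))   # True
# request-duration with thresholdRange.max = 500 (milliseconds)
print(within_threshold(650, max_value=500))   # False
```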

• custom metrics

Create a MetricTemplate for a custom metric, for example a business metric such as the order payment failure rate

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://promethues.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100

Reference the MetricTemplate

analysis:
  metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
        namespace: istio-system
      thresholdRange:
        max: 5
      interval: 1m

In the example above, the percentage of the canary's requests that return 404 must not exceed 5%
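
The query in the MetricTemplate contains the placeholders {{ namespace }}, {{ target }}, and {{ interval }}, which are filled in before the query is sent to Prometheus. The sketch below imitates that substitution with a regular expression; it is an illustration rather than Flagger's own templating code, render_query is a hypothetical helper, and the shortened query and variable values are examples only.

```python
import re


def render_query(template, variables):
    """Replace each {{ name }} placeholder with its value from variables."""
    def substitute(match):
        # An unknown placeholder raises KeyError instead of passing silently.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    'sum(rate(istio_requests_total{'
    'destination_workload_namespace="{{ namespace }}",'
    'destination_workload="{{ target }}"}[{{ interval }}]))'
)
print(render_query(
    template, {"namespace": "test", "target": "podinfo", "interval": "1m"}
))
```

The rendered query evaluates to a single number, which is then compared against thresholdRange (max: 5 in the example above).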

Alerts

Alerts: used to send notifications during a release
1. Define an AlertProvider (for example, Slack or DingTalk)

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: on-call
  namespace: flagger
spec:
  type: slack
  channel: on-call-alerts
  username: flagger
  # webhook address (ignored if secretRef is specified)
  address: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
  # secret containing the webhook address (optional)
  secretRef:
    name: on-call-url
---
apiVersion: v1
kind: Secret
metadata:
  name: on-call-url
  namespace: flagger
data:
  address: <encoded-url>

2. Use the alert

analysis:
  alerts:
    - name: "on-call Slack"
      severity: error
      providerRef:
        name: on-call
        namespace: flagger

• severity: the level of the notification, similar to a log level; one of info, warn, error
Throughout the deployment process, alerts are sent at different stages, for example when a release succeeds or when a webhook fails.
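
Since severity is described as similar to a log level, one way to read it is as a minimum level: an alert configured with warn would receive warn and error messages but not info. The sketch below encodes that reading; it is an assumption based on the log-level analogy, not a statement of Flagger's exact behavior, and should_notify is a hypothetical helper.

```python
# Ordering assumed from the log-level analogy: info < warn < error.
SEVERITY_ORDER = {"info": 0, "warn": 1, "error": 2}


def should_notify(alert_severity, event_severity):
    """Return True if the event is at least as severe as the alert's level."""
    return SEVERITY_ORDER[event_severity] >= SEVERITY_ORDER[alert_severity]


print(should_notify("error", "warn"))  # False
print(should_notify("info", "error"))  # True
```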

Summary

Flagger provides a good abstraction of the automated application release process, along with rich extension mechanisms (webhooks, alerts, metrics, etc.)
These features are attractive. Can they be used directly within the group?
The answer is no. Flagger requires applications to be built on k8s, including its service discovery mechanism. It also requires an ingress or service mesh to be deployed (both can adjust traffic policies). Taking HSF as an example, its service discovery is based on configserver, and its services are interface-oriented rather than application-oriented.
So Flagger probably cannot be used without some modification

• in addition, Flagger has some room for improvement (in my personal opinion):
The scaling of canary instances while traffic is being shifted relies on HPA (if configured), and the delay of HPA scaling can affect the business
Improvement: dynamically adjust the number of canary instances as stepWeight changes. This applies only to canary releases
For blue/green and A/B testing, capacity can be prepared in advance through a webhook
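
The proposed improvement can be illustrated with a small calculation. This is a sketch of the idea only, not an existing Flagger feature: it assumes the primary's replica count is sized for 100% of the traffic and that every pod has the same capacity. canary_replicas is a hypothetical helper.

```python
def canary_replicas(primary_replicas, canary_weight):
    """Replicas the canary needs to carry canary_weight percent of traffic."""
    if primary_replicas < 1:
        raise ValueError("primary_replicas must be at least 1")
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    # Integer ceiling of primary_replicas * canary_weight / 100,
    # with at least one replica kept so the canary can always serve traffic.
    needed = (primary_replicas * canary_weight + 99) // 100
    return max(1, needed)


for weight in (5, 25, 50):
    print(weight, canary_replicas(10, weight))
# 5 1
# 25 3
# 50 5
```

Scaling the canary ahead of each traffic step in this way would avoid waiting for HPA to react after the traffic has already shifted.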

• Flagger is planning a feature for comparing primary and canary traffic, which appears to do the same thing as the group's Doom. Something to look forward to