alertmanager source code analysis


alertmanager source code analysis (I)


Monitoring and alerting is usually treated as one system that covers data collection, storage, display, rule evaluation, alarm message processing, and so on. Alertmanager (hereinafter referred to as am) is the component that manages alarm messages. It provides routing, silencing, inhibition, deduplication, and related functions. In short, the components responsible for rule evaluation can send every alarm to am without filtering anything themselves, and am processes the messages so that the notifications it sends out are as high quality as possible.

Let's start with an overview diagram. It is based on the architecture diagram in the original open source repository. That diagram differs from the actual source code in many places, so this version is more detailed and more accurate than the original.

This article starts with the first part: writing alarms.

The path from writing an alarm to finally processing it can be modeled as a producer-consumer pattern. The producer is the API that receives alarms, the consumer is the dispatcher in the diagram, and provider.Alerts in the middle acts as the buffer.

The following is the write logic, whose main job is to determine the alarm status. am derives the status from alert.StartsAt and alert.EndsAt, and much of the later logic depends on it, so the start and end times have to be settled at this point.

func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*types.Alert) {
    now := time.Now()
    
    api.mtx.RLock()
    resolveTimeout := time.Duration(api.config.Global.ResolveTimeout)
    api.mtx.RUnlock()

    // Determine the start and end time of each alert.
    // The status is derived from these times: if the end time is not after the current time, the alert is Resolved
    for _, alert := range alerts {

        // Stamp each newly received alert with the receive time, so that when two alerts have identical labels you can tell which one arrived last
        alert.UpdatedAt = now

        // Ensure StartsAt is set.
        if alert.StartsAt.IsZero() {
            if alert.EndsAt.IsZero() {
                alert.StartsAt = now
            } else {
                alert.StartsAt = alert.EndsAt
            }
        }
        // If there is no end time, derive one from resolveTimeout and mark the alert as timeout-based
        if alert.EndsAt.IsZero() {
            alert.Timeout = true
            alert.EndsAt = now.Add(resolveTimeout)
        }
        if alert.EndsAt.After(time.Now()) {
            api.m.Firing().Inc()
        } else {
            api.m.Resolved().Inc()
        }
    }

    // Make a best effort to insert all alerts that are valid.
    var (
        validAlerts    = make([]*types.Alert, 0, len(alerts))
        validationErrs = &types.MultiError{}
    )

    // Validate each alert: drop labels with empty values, then check the start and end times, that at least one label exists, that label names and values are well formed, etc.
    for _, a := range alerts {
        removeEmptyLabels(a.Labels)

        if err := a.Validate(); err != nil {
            validationErrs.Add(err)
            api.m.Invalid().Inc()
            continue
        }
        validAlerts = append(validAlerts, a)
    }
    // Write the valid alerts to AlertsProvider. This is the producer side
    if err := api.alerts.Put(validAlerts...); err != nil {
        api.respondError(w, apiError{
            typ: errorInternal,
            err: err,
        }, nil)
        return
    }
}

provider.Alerts is an interface:

// All methods should be goroutine-safe
type Alerts interface {
    Subscribe() AlertIterator
    GetPending() AlertIterator
    Get(model.Fingerprint) (*types.Alert, error)
    Put(...*types.Alert) error
}

The source code provides an in-memory implementation, so every received alarm is first written to this structure, and other modules obtain their alarms from it. From here on, this in-memory implementation is referred to as AlertsProvider.

// Alerts is the management structure that appears as alerts in the architecture diagram
type Alerts struct {
    cancel context.CancelFunc

    mtx       sync.Mutex
    alerts    *store.Alerts           // Backing store: map[fingerprint]*Alert
    listeners map[int]listeningAlerts // All registered listeners, keyed by ID
    next      int                     // ID to assign to the next listener

    callback AlertStoreCallback
    logger log.Logger
}

First, look at the Put method of AlertsProvider to see how an alarm is written:

func (a *Alerts) Put(alerts ...*types.Alert) error {
    for _, alert := range alerts {
        // Compute a unique ID from the label names and label values in the alert's label set
        fp := alert.Fingerprint()

        existing := false

        // An alert with the same fingerprint, and therefore the same label set, already exists
        if old, err := a.alerts.Get(fp); err == nil {
            existing = true

            // If the time ranges of the old and new alerts overlap, merge them. Merge decides which fields of the newer alert win
            if (alert.EndsAt.After(old.StartsAt) && alert.EndsAt.Before(old.EndsAt)) ||
                (alert.StartsAt.After(old.StartsAt) && alert.StartsAt.Before(old.EndsAt)) {
                alert = old.Merge(alert)
            }
        }
        // This Set method writes the current alert into map[fp]*Alert using the fp created above
        if err := a.alerts.Set(alert); err != nil {
            level.Error(a.logger).Log("msg", "error on set alert", "err", err)
            continue
        }
        
        // Other modules in the program register a listener with AlertsProvider by calling Subscribe
        // AlertsProvider broadcasts to all listeners every time an alert is successfully stored
        // Holding the lock during the broadcast keeps the set of listeners stable, so every registered listener is offered the alert
        a.mtx.Lock()
        for _, l := range a.listeners {
            select {
            case l.alerts <- alert:
            case <-l.done:
            }
        }
        a.mtx.Unlock()
    }

    return nil
}

Other parts of the program (Dispatcher, Inhibitor) listen for newly written alarms by calling Subscribe:

func (a *Alerts) Subscribe() provider.AlertIterator {
    // goroutine-safe
    a.mtx.Lock()
    defer a.mtx.Unlock()

    var (
        done   = make(chan struct{})
        alerts = a.alerts.List()                                               // Get all stored alerts
        ch     = make(chan *types.Alert, max(len(alerts), alertChannelLength)) // Buffered chan with room for at least every existing alert
    )
    // Write the alerts that already exist at call time into the buffered chan
    // In practice, other components subscribe to Alerts during program startup,
    // so it is unlikely that alerts have already arrived by then, and if any have, there will not be many
    for _, a := range alerts {
        ch <- a
    }

    // Create a new listener for AlertsProvider. It consists of the buffered chan and a shutdown signal chan
    // The caller reads alerts from the buffered chan and closes the shutdown signal chan to tell the provider it has stopped listening
    // next is the ID given to the new listener and is incremented after each subscription
    a.listeners[a.next] = listeningAlerts{alerts: ch, done: done}
    a.next++
    
    // Here, buffered chan and shutdown signal chan are repackaged into alertIterator and returned to the caller
    return provider.NewAlertIterator(ch, done, nil)
}

The caller therefore works with the methods that alertIterator provides:

type alertIterator struct {
    ch   <-chan *types.Alert
    done chan struct{}
    err  error
}
func (ai alertIterator) Next() <-chan *types.Alert { return ai.ch }
func (ai alertIterator) Err() error { return ai.err }
func (ai alertIterator) Close()     { close(ai.done) }
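
To make the calling convention concrete, here is a minimal consumer sketch. The iterator type below is a hypothetical simplification of alertIterator that carries strings instead of *types.Alert and omits Err; the consume function and its stop channel are likewise made up for this example.

```go
package main

import "fmt"

// iterator is a hypothetical, simplified version of alertIterator.
type iterator struct {
	ch   <-chan string
	done chan struct{}
}

func (it iterator) Next() <-chan string { return it.ch }
func (it iterator) Close()              { close(it.done) }

// consume reads from the iterator until its channel is closed or stop fires,
// and closes the iterator on the way out so the producer knows it has left.
func consume(it iterator, stop <-chan struct{}) []string {
	defer it.Close()

	var got []string
	for {
		select {
		case v, ok := <-it.Next():
			if !ok {
				// The channel was closed: the iterator is exhausted.
				return got
			}
			got = append(got, v)
		case <-stop:
			return got
		}
	}
}

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "b"
	ch <- "c"
	close(ch)

	it := iterator{ch: ch, done: make(chan struct{})}
	fmt.Println(consume(it, make(chan struct{}))) // [a b c]

	// After consume returns, done is closed, which signals the producer side.
	select {
	case <-it.done:
		fmt.Println("iterator closed")
	default:
		fmt.Println("iterator still open")
	}
}
```

Next returns the channel itself rather than a single value, which is what lets the caller combine it with other cases in one select.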

alertIterator resembles the iterator protocol in that the caller consumes it inside a for loop. First, let's look at how the Dispatcher uses it:

// First, main.go instantiates the Dispatcher and starts it
go disp.Run()

// Next, the exported Run method of the Dispatcher obtains an alertIterator from AlertsProvider by calling Subscribe
func (d *Dispatcher) Run() {
    d.done = make(chan struct{})

    d.mtx.Lock()
    d.aggrGroupsPerRoute = map[*Route]map[model.Fingerprint]*aggrGroup{}
    d.aggrGroupsNum = 0
    d.metrics.aggrGroups.Set(0)
    d.ctx, d.cancel = context.WithCancel(context.Background())
    d.mtx.Unlock()

    d.run(d.alerts.Subscribe())
    close(d.done)
}

// Finally, the unexported run method of the Dispatcher.
// It is a for-select loop that watches, at the same time, for alerts newly received by AlertsProvider, the gc ticker, and the exit signal
func (d *Dispatcher) run(it provider.AlertIterator) {
    cleanup := time.NewTicker(30 * time.Second)
    defer cleanup.Stop()

    defer it.Close()

    for {
        select {
        // The Next() method of alertIterator returns a chan, namely the buffered chan created by AlertsProvider in Subscribe above
        case alert, ok := <-it.Next():
            // From here on is what the Dispatcher does when it receives a new alert
            if !ok {
                // Iterator exhausted for some reason.
                if err := it.Err(); err != nil {
                    level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
                }
                return
            }

            level.Debug(d.logger).Log("msg", "Received alert", "alert", alert)

            // Log errors but keep trying.
            if err := it.Err(); err != nil {
                level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
                continue
            }
            
            // Find the routes in the Dispatcher that match this alert. More than one route may match
            // Process the alert once for each matching route
            now := time.Now()
            for _, r := range d.route.Match(alert.Labels) {
                d.processAlert(alert, r)
            }
            d.metrics.processingDuration.Observe(time.Since(now).Seconds())

        case <-cleanup.C:
            // The Dispatcher runs a periodic gc pass that stops and removes empty aggregation groups
            d.mtx.Lock()

            for _, groups := range d.aggrGroupsPerRoute {
                for _, ag := range groups {
                    if ag.empty() {
                        ag.stop()
                        delete(groups, ag.fingerprint())
                        d.aggrGroupsNum--
                        d.metrics.aggrGroups.Dec()
                    }
                }
            }

            d.mtx.Unlock()

        case <-d.ctx.Done():
            return
        }
    }
}

Taken together, the broadcast in Put, the Subscribe method, the design of alertIterator, and the listening loop of the Dispatcher form a publish-subscribe pattern. On subscribing, the subscriber receives a buffered chan that already contains the messages the publisher stored before the subscription, and the publisher records that chan in listeners. The subscriber then listens on the chan it was given. Each time the publisher stores a new message, it sends that message to every buffered chan in listeners, so each subscriber receives it.

Now that the alarm has been written to AlertsProvider, other modules can receive the latest alarms through their subscriptions. Next, go to Dispatcher.processAlert to see how an alarm is handled.
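
The pattern can be reduced to a few dozen lines. The sketch below is a simplified model, not the Alertmanager implementation: Broker, listener, Publish, and the minBuf parameter are hypothetical names, messages are plain strings, and a single mutex guards both the store and the listeners, which the real provider does not do.

```go
package main

import (
	"fmt"
	"sync"
)

// listener pairs a subscriber's buffered channel with its shutdown signal.
type listener struct {
	msgs chan string
	done chan struct{}
}

// Broker is a hypothetical, simplified publisher modeled on the memory provider.
type Broker struct {
	mtx       sync.Mutex
	stored    []string
	listeners map[int]listener
	next      int
}

func NewBroker() *Broker { return &Broker{listeners: map[int]listener{}} }

// Subscribe replays everything stored so far into a new buffered channel and
// registers that channel for future broadcasts.
func (b *Broker) Subscribe(minBuf int) (<-chan string, chan struct{}) {
	b.mtx.Lock()
	defer b.mtx.Unlock()

	// The buffer must hold at least the replayed messages, or the replay would block.
	size := len(b.stored)
	if size < minBuf {
		size = minBuf
	}
	ch := make(chan string, size)
	for _, m := range b.stored {
		ch <- m
	}

	done := make(chan struct{})
	b.listeners[b.next] = listener{msgs: ch, done: done}
	b.next++
	return ch, done
}

// Publish stores the message and offers it to every listener. A send waits
// while a listener's buffer is full, unless that listener has closed done.
func (b *Broker) Publish(m string) {
	b.mtx.Lock()
	defer b.mtx.Unlock()

	b.stored = append(b.stored, m)
	for _, l := range b.listeners {
		select {
		case l.msgs <- m:
		case <-l.done:
		}
	}
}

func main() {
	b := NewBroker()
	b.Publish("alert-1") // published before anyone subscribes

	ch, done := b.Subscribe(4)
	b.Publish("alert-2") // published after the subscription

	fmt.Println(<-ch, <-ch) // alert-1 alert-2

	close(done)          // the subscriber stops listening
	b.Publish("alert-3") // does not block, whether or not the buffer has room
}
```

The done case in the select is what keeps a publisher from blocking forever on a subscriber that has gone away with a full buffer.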


Tags: Go source code analysis

Posted on Fri, 29 Oct 2021 23:28:59 -0400 by saviiour