Alertmanager source code analysis (I)

Monitoring and alerting are usually treated as one system that covers data collection, storage, visualization, rule evaluation, alert processing, and so on. Alertmanager (hereinafter "am") is the alert management component. It provides routing, silencing, inhibition, deduplication, and related functions. In short, the components that evaluate rules can send alerts to am without any further thought, and am processes them so that the notifications it finally sends are as high-quality as possible.
Let's start with the overview diagram. It is based on the architecture diagram in the upstream open source repository. That diagram differs from the actual source code in many places, so this version is more detailed and more accurate than the original.
This article covers the first part: how alerts are written.
The path from writing an alert to finally processing it can be abstracted as a producer-consumer model. The producer is the API that receives alerts, the consumer is the dispatcher in the diagram, and provider.Alerts sits in the middle as the buffer.
The following is the write logic, which mainly determines the alert status. In am the status is derived from alert.StartsAt and alert.EndsAt, and much of the later logic depends on these two fields, so the start and end times have to be settled at this point.
func (api *API) insertAlerts(w http.ResponseWriter, r *http.Request, alerts ...*types.Alert) {
	now := time.Now()

	api.mtx.RLock()
	resolveTimeout := time.Duration(api.config.Global.ResolveTimeout)
	api.mtx.RUnlock()

	// Determine the start and end time of each alert.
	// The alert status is defined by these times: if the end time is before the current time, the alert is resolved.
	for _, alert := range alerts {
		// Stamp the newly received alert with the receive time, so that if two alerts
		// have identical labels we can tell which one was received most recently.
		alert.UpdatedAt = now

		// Ensure StartsAt is set.
		if alert.StartsAt.IsZero() {
			if alert.EndsAt.IsZero() {
				alert.StartsAt = now
			} else {
				alert.StartsAt = alert.EndsAt
			}
		}
		// If there is no end time, compute one from resolveTimeout.
		if alert.EndsAt.IsZero() {
			alert.Timeout = true
			alert.EndsAt = now.Add(resolveTimeout)
		}
		if alert.EndsAt.After(time.Now()) {
			api.m.Firing().Inc()
		} else {
			api.m.Resolved().Inc()
		}
	}

	// Make a best effort to insert all alerts that are valid.
	var (
		validAlerts    = make([]*types.Alert, 0, len(alerts))
		validationErrs = &types.MultiError{}
	)
	// Validate each alert: remove labels with empty values, check the start and end times,
	// require at least one label, check the naming rules for label keys and values, etc.
	for _, a := range alerts {
		removeEmptyLabels(a.Labels)

		if err := a.Validate(); err != nil {
			validationErrs.Add(err)
			api.m.Invalid().Inc()
			continue
		}
		validAlerts = append(validAlerts, a)
	}
	// Write to the alerts provider; this is the producer side.
	if err := api.alerts.Put(validAlerts...); err != nil {
		api.respondError(w, apiError{
			typ: errorInternal,
			err: err,
		}, nil)
		return
	}
}
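The timestamp rules above can be tried in isolation. The following is a minimal sketch, not the Alertmanager code itself: the alert struct, normalize, status, and classify are hypothetical names introduced for illustration, and the 5-minute resolveTimeout is an assumed value.

```go
package main

import (
	"fmt"
	"time"
)

// alert is a hypothetical, stripped-down stand-in for types.Alert.
type alert struct {
	StartsAt time.Time
	EndsAt   time.Time
	Timeout  bool // true when EndsAt was derived from resolveTimeout
}

// normalize fills in missing timestamps following the rules in insertAlerts.
func normalize(a *alert, now time.Time, resolveTimeout time.Duration) {
	if a.StartsAt.IsZero() {
		if a.EndsAt.IsZero() {
			a.StartsAt = now
		} else {
			a.StartsAt = a.EndsAt
		}
	}
	if a.EndsAt.IsZero() {
		a.Timeout = true
		a.EndsAt = now.Add(resolveTimeout)
	}
}

// status reports "firing" while EndsAt is still in the future, else "resolved".
func status(a *alert, now time.Time) string {
	if a.EndsAt.After(now) {
		return "firing"
	}
	return "resolved"
}

// classify is a convenience wrapper for experiments. endOffsetSec is the
// reported end time relative to now, in seconds; 0 means "no end time sent".
func classify(endOffsetSec int) (state string, timedOut bool) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	a := &alert{}
	if endOffsetSec != 0 {
		a.EndsAt = now.Add(time.Duration(endOffsetSec) * time.Second)
	}
	normalize(a, now, 5*time.Minute) // assumed resolveTimeout
	return status(a, now), a.Timeout
}

func main() {
	fmt.Println(classify(0))   // no end time sent
	fmt.Println(classify(-60)) // ended one minute ago
	fmt.Println(classify(60))  // ends one minute from now
}
```

An alert without an end time is treated as firing and marked with Timeout, which means it will count as resolved on its own unless the sender keeps re-sending it before the computed EndsAt passes.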
provider.Alerts is an interface:
// All methods should be goroutine-safe
type Alerts interface {
	Subscribe() AlertIterator
	GetPending() AlertIterator
	Get(model.Fingerprint) (*types.Alert, error)
	Put(...*types.Alert) error
}
The source code ships a memory-based implementation, so every received alert is first written to this structure, and the other components fetch their alerts from it. From here on, this memory-based implementation is called AlertsProvider.
// Alerts is the management structure; it is the "alerts" component in the architecture diagram
type Alerts struct {
	cancel context.CancelFunc

	mtx       sync.Mutex
	alerts    *store.Alerts           // Store: map[fingerprint]*Alert
	listeners map[int]listeningAlerts // All listeners
	next      int                     // Listener count
	callback  AlertStoreCallback

	logger log.Logger
}
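The store is keyed by fingerprint, so two alerts with the same label set occupy the same slot. The sketch below illustrates that idea only. The fingerprint function here is a simplified stand-in built on FNV-1a, not the library's actual implementation, and the store type is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint is a simplified stand-in for alert.Fingerprint(): it hashes the
// label names and values in sorted name order, so map iteration order does
// not affect the result.
func fingerprint(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator, so {"ab": "c"} differs from {"a": "bc"}
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// store mimics the map[fingerprint]*Alert idea with one entry per label set.
// The string value is a placeholder for the alert itself.
type store map[uint64]string

func main() {
	s := store{}
	s[fingerprint(map[string]string{"alertname": "HighCPU", "instance": "a"})] = "first"
	s[fingerprint(map[string]string{"instance": "a", "alertname": "HighCPU"})] = "second"
	s[fingerprint(map[string]string{"alertname": "HighCPU", "instance": "b"})] = "third"

	// The first two writes share a label set, so they share a slot.
	fmt.Println(len(s))
}
```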
First, look at the Put method of AlertsProvider to see how an alert is written:
func (a *Alerts) Put(alerts ...*types.Alert) error {
	for _, alert := range alerts {
		// Build a unique ID from the label names and label values in the label set
		fp := alert.Fingerprint()

		existing := false

		// If an alert with the same fingerprint already exists, its label set is identical
		if old, err := a.alerts.Get(fp); err == nil {
			existing = true

			// If the old and new alert intervals overlap, merge them;
			// the merge keeps the newer alert's contents according to certain rules
			if (alert.EndsAt.After(old.StartsAt) && alert.EndsAt.Before(old.EndsAt)) ||
				(alert.StartsAt.After(old.StartsAt) && alert.StartsAt.Before(old.EndsAt)) {
				alert = old.Merge(alert)
			}
		}

		// Set writes the current alert into map[fp]*Alert under the fingerprint computed above
		if err := a.alerts.Set(alert); err != nil {
			level.Error(a.logger).Log("msg", "error on set alert", "err", err)
			continue
		}

		// Other modules register a listener with AlertsProvider by calling Subscribe.
		// AlertsProvider broadcasts to all listeners every time an alert is stored successfully.
		// Holding the lock ensures that all listeners receive a consistent broadcast.
		a.mtx.Lock()
		for _, l := range a.listeners {
			select {
			case l.alerts <- alert:
			case <-l.done:
			}
		}
		a.mtx.Unlock()
	}

	return nil
}
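The overlap condition in Put is easy to misread, so here is a small sketch that evaluates it on concrete intervals. The window type and the at and overlaps helpers are hypothetical names; only the boolean expression is taken from the excerpt above.

```go
package main

import (
	"fmt"
	"time"
)

// window is a hypothetical pair of timestamps standing in for an alert's
// StartsAt and EndsAt fields.
type window struct {
	StartsAt, EndsAt time.Time
}

// at returns a fixed reference time plus the given number of minutes.
func at(minutes int) time.Time {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(minutes) * time.Minute)
}

// overlaps mirrors the condition in Put: the incoming alert's end or start
// must fall strictly inside the old alert's interval.
func overlaps(old, incoming window) bool {
	endInside := incoming.EndsAt.After(old.StartsAt) && incoming.EndsAt.Before(old.EndsAt)
	startInside := incoming.StartsAt.After(old.StartsAt) && incoming.StartsAt.Before(old.EndsAt)
	return endInside || startInside
}

func main() {
	old := window{StartsAt: at(0), EndsAt: at(10)}

	fmt.Println(overlaps(old, window{StartsAt: at(5), EndsAt: at(15)}))  // starts inside the old interval
	fmt.Println(overlaps(old, window{StartsAt: at(-5), EndsAt: at(5)}))  // ends inside the old interval
	fmt.Println(overlaps(old, window{StartsAt: at(20), EndsAt: at(30)})) // entirely after the old interval
	fmt.Println(overlaps(old, window{StartsAt: at(-5), EndsAt: at(20)})) // fully encloses the old interval
}
```

Because the comparisons are strict, an incoming interval that fully encloses the old one, or is identical to it, does not satisfy the condition as written, so in that case the incoming alert replaces the stored one without a merge.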
Other parts of the program (Dispatcher, Inhibitor) listen for newly written alerts by calling Subscribe:
func (a *Alerts) Subscribe() provider.AlertIterator {
	// goroutine-safe
	a.mtx.Lock()
	defer a.mtx.Unlock()

	var (
		done   = make(chan struct{})
		alerts = a.alerts.List() // Get all alerts
		// Create a buffered chan whose capacity is either more than enough or exactly enough
		ch = make(chan *types.Alert, max(len(alerts), alertChannelLength))
	)

	// Write the alerts that already exist at call time into the buffered chan.
	// In practice, the other components subscribe during program startup, so it is
	// unlikely that alerts have already arrived, and if some have, there will not be many.
	for _, a := range alerts {
		ch <- a
	}

	// Create a new listener for AlertsProvider. It consists of the buffered chan and a shutdown-signal chan.
	// The caller reads alerts from the buffered chan and closes the shutdown-signal chan to announce that it has stopped listening.
	// next serves as the counter: there are currently next listeners.
	a.listeners[a.next] = listeningAlerts{alerts: ch, done: done}
	a.next++

	// The buffered chan and the shutdown-signal chan are wrapped in an alertIterator and returned to the caller
	return provider.NewAlertIterator(ch, done, nil)
}
The caller therefore works with the methods that alertIterator provides:
type alertIterator struct {
	ch   <-chan *types.Alert
	done chan struct{}
	err  error
}

func (ai alertIterator) Next() <-chan *types.Alert {
	return ai.ch
}

func (ai alertIterator) Err() error { return ai.err }
func (ai alertIterator) Close()     { close(ai.done) }
alertIterator follows an iterator-like protocol, which lets the caller consume alerts in a for loop. Let's first look at how the Dispatcher uses it:
// First, main.go instantiates the Dispatcher and starts it
go disp.Run()

// Second, the exported Run of the Dispatcher obtains an alertIterator from the AlertsProvider by calling Subscribe
func (d *Dispatcher) Run() {
	d.done = make(chan struct{})

	d.mtx.Lock()
	d.aggrGroupsPerRoute = map[*Route]map[model.Fingerprint]*aggrGroup{}
	d.aggrGroupsNum = 0
	d.metrics.aggrGroups.Set(0)
	d.ctx, d.cancel = context.WithCancel(context.Background())
	d.mtx.Unlock()

	d.run(d.alerts.Subscribe())
	close(d.done)
}

// Finally, the unexported run of the Dispatcher.
// It is a for-select loop that watches, at the same time, for alerts newly received
// by the AlertsProvider, for GC ticks, and for the exit signal.
func (d *Dispatcher) run(it provider.AlertIterator) {
	cleanup := time.NewTicker(30 * time.Second)
	defer cleanup.Stop()

	defer it.Close()

	for {
		select {
		// The Next() method of alertIterator returns a chan, namely the buffered chan created in the AlertsProvider above
		case alert, ok := <-it.Next():
			// What follows is how the Dispatcher handles a newly received alert
			if !ok {
				// Iterator exhausted for some reason.
				if err := it.Err(); err != nil {
					level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
				}
				return
			}

			level.Debug(d.logger).Log("msg", "Received alert", "alert", alert)

			// Log errors but keep trying.
			if err := it.Err(); err != nil {
				level.Error(d.logger).Log("msg", "Error on alert update", "err", err)
				continue
			}

			// Find which of the Dispatcher's routes match this alert. There may be multiple matching routes.
			// Use each matching route to handle this alert.
			now := time.Now()
			for _, r := range d.route.Match(alert.Labels) {
				d.processAlert(alert, r)
			}
			d.metrics.processingDuration.Observe(time.Since(now).Seconds())

		case <-cleanup.C:
			// The Dispatcher runs a GC pass to clean up aggregation groups that are no longer in use
			d.mtx.Lock()

			for _, groups := range d.aggrGroupsPerRoute {
				for _, ag := range groups {
					if ag.empty() {
						ag.stop()
						delete(groups, ag.fingerprint())
						d.aggrGroupsNum--
						d.metrics.aggrGroups.Dec()
					}
				}
			}

			d.mtx.Unlock()

		case <-d.ctx.Done():
			return
		}
	}
}
Taken together, the broadcast in Put, the Subscribe method, the design of alertIterator, and the Dispatcher's loop form a publish-subscribe pattern. When subscribing, the subscriber obtains a buffered chan that already contains the alerts the publisher held before the subscription, and the publisher records that chan in listeners. The subscriber then listens on the chan it was given. Every time the publisher stores a new alert, it broadcasts the alert to every buffered chan in listeners, so each subscriber receives it.
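The pattern can be reduced to a few lines. The following is a minimal sketch under simplifying assumptions: broker, bufSize, and drain are hypothetical names, messages are plain strings, the broker keeps a log of every message instead of a fingerprint-keyed map, and the done channel used for unsubscribing is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// bufSize is a hypothetical minimum capacity, playing the role of alertChannelLength.
const bufSize = 8

// broker is a hypothetical, minimal publisher in the spirit of AlertsProvider.
type broker struct {
	mtx       sync.Mutex
	stored    []string            // everything published so far
	listeners map[int]chan string // one buffered channel per subscriber
	next      int                 // listener count
}

func newBroker() *broker {
	return &broker{listeners: map[int]chan string{}}
}

// Subscribe returns a buffered channel pre-filled with the stored messages
// and registers it so that later publications are delivered as well.
func (b *broker) Subscribe() <-chan string {
	b.mtx.Lock()
	defer b.mtx.Unlock()

	capacity := bufSize
	if len(b.stored) > capacity {
		capacity = len(b.stored)
	}
	ch := make(chan string, capacity)
	for _, m := range b.stored {
		ch <- m
	}
	b.listeners[b.next] = ch
	b.next++
	return ch
}

// Publish stores the message and broadcasts it to every registered listener.
// It blocks if a listener's buffer is full.
func (b *broker) Publish(msg string) {
	b.mtx.Lock()
	defer b.mtx.Unlock()

	b.stored = append(b.stored, msg)
	for _, ch := range b.listeners {
		ch <- msg
	}
}

// drain reads everything currently buffered in ch without blocking.
func drain(ch <-chan string) []string {
	var out []string
	for {
		select {
		case m := <-ch:
			out = append(out, m)
		default:
			return out
		}
	}
}

func main() {
	b := newBroker()
	b.Publish("alert-1")   // nobody is listening yet, so it is only stored
	early := b.Subscribe() // replayed: alert-1
	b.Publish("alert-2")
	late := b.Subscribe() // replayed: alert-1 and alert-2
	b.Publish("alert-3")

	fmt.Println(drain(early))
	fmt.Println(drain(late))
}
```

Both subscribers end up with the same three messages even though they subscribed at different times, which is the property the replay-on-subscribe step is there to provide.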
Now that the alert has been written to the AlertsProvider, other modules can pick up the latest alerts through their subscriptions. Next, we move on to Dispatcher.processAlert to see how the alert is handled.