Alerts
Alerts are an important part of our NMS. We highly recommend that operators run their networks with alerts turned on at all times; without alerts, it is very difficult to detect and debug network issues in a timely fashion. In this guide we will discuss:
- Viewing alerts
- Alert receiver configuration
- Alert rule configuration
- Predefined alerts
- Custom alert rules
- REST API for alerts
- Troubleshooting
Viewing alerts
Top Level Network Dashboard
The alert dashboard displays the currently firing alerts in a table, tabbed by severity. The columns additionally display the labels passed along with each alert.
Alarm component’s Alert Tab
Alerts can also be viewed from the Alert tab of the Alarm component.
Alert Receivers
An alert receiver pushes alert notifications in real time so that the operator is notified as soon as an alert fires. The following example details the steps involved in creating a Slack alert receiver and configuring it in the NMS.
Example: Adding Slack Channel as Alert Receiver
Generate Slack Webhook URL
- Create an app: go to https://api.slack.com/apps?new_app=1, click “Create New App”, then enter the app name and select the Slack workspace.
- Click on “Incoming Webhooks” and toggle “Active Incoming Webhooks” to On.
- Scroll down and create a new webhook by clicking “Add New Webhook to Workspace”, then select the Slack channel.
- Copy the “Webhook URL” once it is generated. You can optionally test the webhook directly before configuring it in the NMS, as shown below.
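A minimal sanity check, assuming curl is available on your machine: post a test message straight to the webhook URL you just copied (the URL below is a placeholder) and confirm it shows up in the Slack channel.

```bash
# Post a test message directly to the Slack webhook (placeholder URL shown;
# replace it with the Webhook URL you copied above).
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test notification from Magma NMS alert setup"}' \
  https://hooks.slack.com/services/XXX/YYY/ZZZ
```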
Create a new Alert Receiver in NMS
Testing the newly added alert receiver
Add a dummy alert to verify that the alert receiver is working. A dummy alert expression can be constructed with the advanced PromQL expression vector(1), as shown below.
Look for the notification on the Slack channel.
Configuring Alert rules
Alert rule configuration lets us define rules for triggering alerts.
Predefined Alerts
The Magma NMS ships with a default set of alert rules. These default rules are not configured automatically; if an operator chooses to use them, they can do so by clicking the “Sync Predefined Alerts” button in the Alert rule configuration tab, as shown below.
Currently, syncing the predefined alerts configures the following rules:
- CPU usage on the gateway has exceeded 75% in the last 5 minutes
- Unsuccessful S1 setup in the last 5 minutes
- S6A authorization failures in the last 5 minutes
- Exceptions while bootstrapping a gateway in the last 5 minutes
- Services restarted unexpectedly in the last 5 minutes
- A UE attach resulted in a failure in the last 5 minutes
- Any service restarts in the last 5 minutes
Note: to be notified of these alerts, the operator will additionally have to specify a receiver in each of them, as shown below.
Custom Alert Rules
Operators can create custom alert rules by creating an expression based on metrics.
Metrics Overview
Magma gateways collect various metrics at regular intervals and push them to the Orchestrator, which stores them in a Prometheus instance. Prometheus, together with Alertmanager, lets us query the various metrics on the system and set alerts based on them.
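Before writing an alert expression, it can help to check which metric series your network is actually reporting. A rough sketch using the standard Prometheus HTTP API is shown below; the service name used for port-forwarding is an assumption, so verify it with kubectl get svc -n orc8r.

```bash
# Port-forward the orc8r Prometheus instance (service name is an assumption;
# check the actual name with `kubectl get svc -n orc8r`).
kubectl --namespace orc8r port-forward svc/orc8r-prometheus 9090:9090 &

# List every metric name Prometheus currently knows about.
curl -s 'http://localhost:9090/api/v1/label/__name__/values'

# Evaluate a PromQL expression, e.g. connected eNodeBs in one network.
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=enb_connected{networkID="<your network ID>"}'
```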
We currently support the following metrics on our Access Gateways.
Custom Alert Configuration
An alert configuration consists of:
- Name/description of the alert
- Alert definition
- Alert receiver to be notified when the alert is fired (optional)
- Additional labels to provide more information about the alert (optional)
The alert definition consists of a metric expression (a Prometheus PromQL expression) and a duration attribute, which specifies how long the expression must remain true before the alert fires. For example, the High Disk Usage rule below pairs the expression disk_percent > 80 with a 15 minute duration, so the alert fires only after disk usage has stayed above 80% for 15 consecutive minutes.
We can create a custom alert either from a simple expression or from an advanced expression. In the case of a simple expression, we choose a metric from the dropdown and construct an expression based on it, as shown below.
In the case of an advanced expression (see the PromQL cheatsheet), which might involve applying different functions to a metric, we can type the PromQL expression directly into the textbox.
The following examples show how we can create custom alerts on the above-mentioned metrics.
eNB Down Alert
This alert will fire if the eNB RF TX is down on any of the gateways in your network.
Expression:
sum by(gatewayID) (enodeb_rf_tx_enabled{networkID="<your network ID>"} < 1)
Duration: 10 minutes
No Connected eNB Alert
This alert will fire if the connected eNB count falls to ‘0’ for any of the gateways in your network.
Simple: Select “enb_connected” metric from the dropdown and construct the if statement as “if enb_connected < 1”
Advanced Expression:
enb_connected{networkID="<your network ID>"} < 1
Duration: 5 minutes
Free Memory < 10% Alert
This alert will fire if the free memory of any of the gateways in your network is less than 10%.
Advanced Expression:
((1 - avg_over_time(mem_available{networkID="<your network ID>"}[5m]) / avg_over_time(mem_total{networkID="<your network ID>"}[5m])) * 100) > 90
Duration: 15 minutes
High Disk Usage Alert
This alert will fire if the disk usage of any of the gateways in your network is more than 80%.
Simple: Select the “disk_percent” metric from the dropdown and construct the if statement as “if disk_percent > 80”.
Advanced Expression:
(disk_percent{networkID="<your network ID>"}) > 80
Duration: 15 minutes
Attach Success Rate Alert
This alert will fire if the attach success rate of any of the gateways in your network is less than 50% over a 3h window.
Expression:
(sum by(gatewayID) (increase(ue_attach{action="attach_accept_sent",networkID="<your network ID>"}[3h]))) * 100 / (sum by(gatewayID) (increase(ue_attach{action=~"attach_accept_sent|attach_reject_sent|attach_abort",networkID="<your network ID>"}[3h]))) < 50
Duration: 15 minutes
Brief explanation: the ue_attach metric is tagged with action and networkID labels. The action label can be “attach_accept_sent”, “attach_reject_sent”, or “attach_abort”. Here we compute the increase in the ue_attach counter for successful attaches as a percentage of all ue_attach actions (including rejected and aborted attaches), and trigger an alert if that success rate drops below 50%.
Dip in User Plane Throughput Alert
This alert will fire if, for any of the gateways in your network, the user plane throughput dips by over 70% compared day-over-day.
Expression:
(sum by(gatewayID) (((rate(pdcp_user_plane_bytes_dl{networkID="<your network ID>"}[1h])) - (rate(pdcp_user_plane_bytes_dl{networkID="<your network ID>"}[1h] offset 1d))) / (rate(pdcp_user_plane_bytes_dl{networkID="<your network ID>"}[1h] offset 1d))) < -0.7)
Duration: 15 minutes
Dip in Connected UEs Alert
This alert will fire if, for any of the gateways in your network, the number of connected UEs dips by over 50% compared day-over-day.
Expression:
(ue_connected{networkID="<your network ID>"} - ue_connected{networkID="<your network ID>"} offset 1d) / (ue_connected{networkID="<your network ID>"}) < -0.5
Duration: 15 minutes
REST API for alerts
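Alert rules and receivers can also be managed through the Orchestrator REST API instead of the NMS UI. The sketch below shows the general shape of such calls; the endpoint paths, hostname, and certificate file names are assumptions about a typical orc8r deployment, so confirm them against your controller’s Swagger/apidocs before use.

```bash
# NOTE: paths below are assumptions for a typical orc8r v1 API;
# verify them in your Orchestrator's Swagger UI before use.
export API="https://api.yourdomain.com/magma/v1"
export NETWORK="<your network ID>"

# List the alert rules configured for the network.
curl -s --cert admin_operator.pem --key admin_operator.key.pem \
  "$API/networks/$NETWORK/prometheus/alert_config"

# List the alert receivers configured for the network.
curl -s --cert admin_operator.pem --key admin_operator.key.pem \
  "$API/networks/$NETWORK/prometheus/alert_receiver"
```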
Troubleshooting
If you are having issues with alerts, logs from the following services will give more information for debugging further.
kubectl --namespace orc8r logs -l app.kubernetes.io/component=alertmanager
kubectl --namespace orc8r logs -l app.kubernetes.io/component=alertmanager-configurer
kubectl --namespace orc8r logs -l app.kubernetes.io/component=prometheus-configurer -c prometheus-configurer
kubectl --namespace orc8r logs -l app.kubernetes.io/component=prometheus -c prometheus
kubectl --namespace orc8r logs -l app.kubernetes.io/component=metricsd
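If the logs look healthy but alerts still do not show up, you can also query Alertmanager directly for the alerts it currently holds. A rough sketch is shown below; the service name used for port-forwarding is an assumption, so check it with kubectl get svc -n orc8r.

```bash
# Port-forward Alertmanager (service name is an assumption; verify with
# `kubectl get svc -n orc8r`).
kubectl --namespace orc8r port-forward svc/orc8r-alertmanager 9093:9093 &

# List the alerts Alertmanager currently has, with their labels and status.
curl -s 'http://localhost:9093/api/v2/alerts'
```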