Perform Access Gateway health check
Performing a timely AGW health check is essential to confirm that the AGW is in a good operating state, has no failures or errors, and to proactively resolve any emerging issues. It is recommended to perform this check (in addition to regular checks) before and after any change on the node, and to compare the results, to make sure the activity had no undesired effects.
Affected components: AGW, eNodeB, Orchestrator
Connectivity
Log in to the AGW by running the command below in a terminal:
ssh magma@<IP of AGW>
Check the Magma interfaces. Make sure eth0 and eth1 are UP.
ip addr
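As a quick illustrative filter (a small sketch assuming the default interface names eth0 and eth1), you can show only the link state of each interface:
# Brief link-state view of each Magma interface (illustrative)
ip -br link show eth0
ip -br link show eth1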
eNodeB Connection
Check that the S1 and SGi interfaces can ping the eNodeB(s) and the internet, respectively.
ping google.com -I eth0
ping <enodeB IP> -I eth1
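For a bounded test, you can limit the number of probes (an illustrative variation of the commands above; the eNodeB IP remains a placeholder):
# Send 4 probes over each interface and then stop
ping -c 4 -I eth0 google.com
ping -c 4 -I eth1 <enodeB IP>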
For a managed eNB, check the status of the eNodeB(s) attached to the gateway using the CLI (skip this step for an unmanaged eNB):
sudo enodebd_cli.py get_all_status
An eNodeB in a good state looks similar to the below:
magma@magma:~$ enodebd_cli.py get_all_status
--- eNodeB Serial: 120200004917CNJ0028 ---
IP Address..................10.0.2.243
eNodeB connected....................ON
eNodeB Configured...................ON
Opstate Enabled.....................ON
RF TX on............................ON
RF TX desired.......................ON
GPS Connected.......................ON
PTP Connected......................OFF
MME Connected.......................ON
GPS Longitude..............-106.347936
GPS Latitude.................35.608135
FSM State...............Completed provisioning eNB. Awaiting new Inform.
Check the eNodeB at the SCTP level by capturing traffic with tcpdump. There should be heartbeat messaging between the eNB and AGW IPs.
sudo tcpdump -i any sctp
An SCTP association in a good state looks similar to the below:
magma@magma:~$ sudo tcpdump -i any sctp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
06:59:06.045369 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB REQ]
06:59:06.045521 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB ACK]
06:59:07.534188 IP 10.0.2.242.36412 > 10.0.2.243.36412: sctp (1) [HB REQ]
06:59:07.544183 IP 10.0.2.243.36412 > 10.0.2.242.36412: sctp (1) [HB ACK]
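If you want to review the heartbeats offline, a small optional sketch is to capture a handful of SCTP packets to a file and read them back (the packet count and file path are illustrative examples):
# Capture 20 SCTP packets and save them for offline review
sudo tcpdump -i any -c 20 -w /tmp/s1_sctp.pcap sctp
# Read the saved capture
sudo tcpdump -r /tmp/s1_sctp.pcap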
Magma Services
Check all gateway services and their status by running the commands below.
sudo service magma@* status
sudo service sctpd status
service openvswitch-switch status
Make sure that all services are in the “active (running)” state and there are no errors in any service.
Confirm that the “active (running)” duration aligns with how long the AGW has been up; a shorter duration points to an unexpected service restart. In the example below, the service has been running for 3 min and 35 s.
Verify that memory usage does not reach the limit assigned to the service. In the example below, 113.9M is used out of a 512.0M limit.
magma@mme.service - Magma OAI MME service
   Loaded: loaded (/etc/systemd/system/magma@mme.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-09-07 13:18:51 UTC; 3 min and 35s ago
  Process: 7732 ExecStartPre=/usr/bin/env python3 /usr/local/bin/config_stateless_agw.py reset_sctpd_for_stateful (code=exited, status=0/SUCCESS)
  Process: 7617 ExecStartPre=/usr/bin/env python3 /usr/local/bin/generate_oai_config.py (code=exited, status=0/SUCCESS)
 Main PID: 7854 (mme)
    Tasks: 28 (limit: 4915)
   Memory: 113.9M (limit: 512.0M)
   CGroup: /system.slice/system-magma.slice/magma@mme.service
           └─7854 /usr/local/bin/mme -c /var/opt/magma/tmp/mme.conf -s /var/opt/magma/tmp/spgw.conf
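As a supplementary pass (an illustrative sketch using standard systemd tooling rather than a Magma-specific command), you can list every Magma unit and surface any failed ones in one go:
# List every magma@ unit with its load/active/sub state
sudo systemctl list-units "magma@*" --all
# Show only units in the failed state, if any
sudo systemctl --failed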
Check the status of the OVS module with the command below.
sudo ovs-vsctl show
Make sure the “is_connected” states are “true” and there are no port errors. OVS in a good state looks similar to the below:
magma-dev:~$ sudo ovs-vsctl show
e2bf2cb0-7bbe-48ef-a489-3341731685e1
    Manager "ptcp:6640"
    Bridge "uplink_br0"
        Port "uplink_br0"
            Interface "uplink_br0"
                type: internal
        Port patch-agw
            Interface patch-agw
                type: patch
                options: {peer=patch-up}
        Port "dhcp0"
            Interface "dhcp0"
                type: internal
    Bridge "gtp_br0"
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
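To look for port errors explicitly, a small optional sketch using standard OVS tooling (the bridge name gtp_br0 is taken from the output above) is:
# Show the error column for every OVS interface; healthy interfaces report no error
sudo ovs-vsctl --columns=name,error list Interface
# Per-port packet and error counters on the GTP bridge
sudo ovs-ofctl dump-ports gtp_br0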
For further debugging steps, you can follow the AGW Datapath debugging guide.
Orchestrator Interface
Verify connectivity with the Orchestrator by running the command below:
checkin_cli.py
An AGW connection with the Orc8r in a good state looks similar to the below:
magma@magma:~$ checkin_cli.py
1. -- Testing TCP connection to controller.magma.test.io:443 --
2. -- Testing Certificate --
3. -- Testing SSL --
4. -- Creating direct cloud checkin --
5. -- Creating proxy cloud checkin --
Success!
Verify in the syslogs that the AGW is periodically checking in to the Orc8r. Around every minute you should see this message:
Checkin Successful! Successfully sent state
Syslogs can be found in /var/log/syslog.
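A quick way to confirm the periodic check-in from the command line (the log path and message text come from the step above):
# Show the most recent successful check-ins recorded in syslog
grep "Checkin Successful" /var/log/syslog | tail -n 5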
Verify that the AGW has successfully checked in to the NMS (shown as “Good” health).
For further debugging steps, you can follow the AGW Unable to check in to Orc8r guide.
Subscribers
Check the attached subscribers using the command below.
sudo mobility_cli.py get_subscriber_table
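For a rough count of attached subscribers, an illustrative one-liner is to count the rows of the table above (the exact table format, including header lines, may vary between releases):
# Count lines in the subscriber table output (includes any header lines)
sudo mobility_cli.py get_subscriber_table | wc -l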
If users are unable to attach to the network, you can follow the User unable to attach guide.
Check that subscribers are not dropping packets due to Magma. Follow the AGW Datapath debugging guide.
Performance
Check CPU utilization with the command below.
top
If it is high, identify which process is using the most CPU from the output of the same command; all processes are listed there. Check memory utilization from the same command as well. You can also verify memory usage per process using the command:
ps -o pid,user,%mem,command ax | sort -b -k3 -r
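For overall headroom beyond per-process usage, standard Linux tools (not Magma-specific) can complement top:
# System-wide memory and disk usage
free -m
df -h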
Metrics
Log in to the NMS UI. From the left-hand menu, select “Metrics”. Check the various metrics that are available and look for any sudden spike or degradation that may indicate issues with the system.
- Number of Connected eNBs (Grafana -> Dashboards -> Networks)
- Number of Connected UE (Grafana -> Dashboards -> Networks)
- Number of Registered UE (Grafana -> Dashboards -> Networks)
- Attach/ Reg attempts (Grafana -> Dashboards -> Networks)
- Attach Success Rate (Grafana -> Dashboards -> Networks)
- S6a Authentication Success Rate (Grafana -> Dashboards -> Networks)
- Service Request Success Rate (Grafana -> Dashboards -> Networks)
- Session Create Success Rate (Grafana -> Dashboards -> Networks)
- Upload/Download Throughput (Grafana -> Dashboards -> Gateway)
Note: The number of sites (eNodeBs) down, the number of users affected, and the outage duration are key indicators of service impact.
Optional Features
Make sure you test any other feature that is applicable to your network:
- X2 Handover
- S1-Flex
- Inbound Roaming
- External DHCP
- UE Bridge Mode