AGW datapath debugging
The following is a step-by-step guide for debugging datapath connectivity issues for a UE.
The AGW datapath is based on OVS, but multiple components handle the packet in the uplink and downlink directions, and any one of them can cause connectivity issues.
Major components of the datapath
- S1 (GTP) tunnel
- OVS datapath
- NAT/Non-NAT forwarding plane
To root-cause packet drop issues, you need to check whether any of these components is dropping packets. The following steps guide you through the debugging process.
Datapath debugging when 100% of packets are dropped
Debugging datapath issues is much easier when you have traffic running. This is especially important in the case of LTE, to avoid the UE getting into the inactive state. The inactive state changes the state of the datapath flows for a UE, so it is hard to debug issues when such state changes happen. It is recommended to have ping or another traffic-generating utility running on the UE or on the server (on the SGi side of the network) while debugging the issue.
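For example, something as simple as a continuous ping works (a sketch; the target below is a placeholder for any reachable host on the SGi side):

```
# Run on the UE toward an SGi-side host (or from the SGi-side server toward
# the UE/NAT address) and keep it running for the whole debug session.
ping -i 0.2 $SERVER_IP
```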
Check that the Magma services are up and running:

```
service magma@* status
```

For datapath health, mme, sessiond, and pipelined are the important services to look at. Check syslog for ERRORs from these services. If all looks good, continue to the next step.
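One way to scan syslog for such errors (a sketch; the log path assumes the default rsyslog setup on the AGW):

```
# Show the most recent error lines from the datapath-related services.
sudo grep -i "error" /var/log/syslog | grep -E "mme|sessiond|pipelined" | tail -n 50
```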
Check the OVS service:

```
service openvswitch-switch status
```
Check the OVS bridge status. The GTP ports may vary depending on the number of connected eNB sessions, but

```
ovs-vsctl show
```

should not show any port with errors. If you see a GTP-related error, run `/usr/local/bin/ovs-kmod-upgrade.sh`. After running this command you need to reattach the UEs.

```
ovs-vsctl show
...-...-...-...-.....
    Manager "ptcp:6640"
    Bridge gtp_br0
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        Controller "tcp:127.0.0.1:6654"
            is_connected: true
        fail_mode: secure
        Port mtr0
            Interface mtr0
                type: internal
        Port g_563160a
            Interface g_563160a
                type: gtpu
                options: {key=flow, remote_ip="w.z.y.z"}
        Port ipfix0
            Interface ipfix0
                type: internal
        Port patch-up
            Interface patch-up
                type: patch
                options: {peer=patch-agw}
        Port gtp0
            Interface gtp0
                type: gtpu
                options: {key=flow, remote_ip=flow}
        Port g_963160a
            Interface g_963160a
                type: gtpu
                options: {key=flow, remote_ip="a.b.c.d"}
        Port li_port
            Interface li_port
                type: internal
        Port gtp_br0
            Interface gtp_br0
                type: internal
        Port proxy_port
            Interface proxy_port
            ...
    Bridge uplink_br0
        Port uplink_br0
            Interface uplink_br0
                type: internal
        Port dhcp0
            Interface dhcp0
                type: internal
        Port patch-agw
            Interface patch-agw
                type: patch
                options: {peer=patch-up}
    ovs_version: "2.14.3"
```
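To spot a broken port without scanning the whole `ovs-vsctl show` output, you can also list just the error column (standard OVS command, included here as a convenience):

```
# Any interface with a non-empty "error" field needs attention
# (e.g. a missing GTP kernel module).
sudo ovs-vsctl --columns=name,error list Interface
```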
Check if the UE is actually connected to the datapath using:

```
mobility_cli.py get_subscriber_table
```

If the IMSI is missing from this table, the UE is not attached to the AGW and you need to debug the control plane; inspect the MME logs for control plane issues. If the UE is connected, continue to the next step.

From here onwards you are going to debug the OVS datapath, so you need to select a UE and identify which traffic direction is broken. You can do so by:
- Generating uplink traffic on the UE.
- Capturing packets on the `gtpu_sys_2152` device: `tcpdump -eni gtpu_sys_2152 host $UE_IP`. If you do not see any packets, it means that packets are not reaching the GTP tunnel; check S1 connectivity to debug further.
- NATed datapath: Capture packets on gtp_br0: `tcpdump -eni gtp_br0 host $UE_IP`. If you don't see any packets, try debugging with the `dp_probe_cli.py` utility. This utility shows which OVS table is dropping the packet.
- NATed datapath: You also need to check if the packet is egressing on the SGi port. You can do so by running tcpdump on the SGi port: `tcpdump -eni $SGi_dev dst $SERVER_IP`. If the packet is missing on the SGi port, you have a routing issue; check the routing table and iptables rules on the AGW.
- Non-NAT datapath: You also need to check if the packet is egressing on the SGi port. You can do so by running tcpdump on the SGi port: `tcpdump -eni $SGi_dev dst $SERVER_IP`. If you don't see any packets, try debugging with the `dp_probe_cli.py` utility. This utility shows which OVS table is dropping the packet.
- In case uplink packets are reaching the SGi port, you need to debug issues in the downlink direction. Check if you are receiving packets from the server by capturing the return traffic: `tcpdump -eni $SGi_dev src $SERVER_IP`. If you do not see these packets, you need to debug the SGi network configuration (see the capture sketch after this list).
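If it is unclear which direction is failing on the SGi side, a short bidirectional capture can help; this is only a sketch and assumes `$SGi_dev` and `$SERVER_IP` are set and traffic is running during the capture:

```
# Capture ~30 seconds of traffic to/from the server on the SGi interface,
# then count packets per direction to see whether the loss is asymmetric.
sudo timeout 30 tcpdump -ni "$SGi_dev" -w /tmp/sgi.pcap host "$SERVER_IP"
echo "uplink (to server):     $(sudo tcpdump -nr /tmp/sgi.pcap "dst host $SERVER_IP" | wc -l)"
echo "downlink (from server): $(sudo tcpdump -nr /tmp/sgi.pcap "src host $SERVER_IP" | wc -l)"
```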
Check the traffic stats from the UE in OVS:

```
dp_probe_cli.py --imsi 1234 -D UL stats
```

The stats show whether packets from the UE are reaching OVS; the counters should be non-zero. For downlink traffic, check the stats for the DL direction.

If all looks good so far, you need to trace the packet in the OVS pipeline. The following command shows the datapath actions that OVS would apply to incoming packets. If it shows 'drop', OVS is dropping the packet. For tracing packets in the UL direction:
- If there is an action to forward the traffic to the egress port, check connectivity between the SGi interface and the destination host.
- For Non-NAT (bridged mode) you might need a VLAN action for handling multi-APN.
```
$ dp_probe_cli.py -i 414200000000029 -d UL -I 114.114.114.114 -P 80 -p tcp
IMSI: 414200000000029, IP: 192.168.128.12
Running: sudo ovs-appctl ofproto/trace gtp_br0 tcp,in_port=3,tun_id=0x1,ip_dst=114.114.114.114,ip_src=192.168.128.12,tcp_src=3372,tcp_dst=80
Datapath Actions: set(eth(src=02:00:00:00:00:01,dst=5e:5b:d1:8a:1a:42)),set(skb_mark(0x5)),1
Uplink rules: allowlist_sid-IMSI414200000000029-APNNAME1
```
For DL traffic, check if the action shows a tunnel set action.
```
$ dp_probe_cli.py -i 414200000000029 -d DL -I 114.114.114.114 -P 80 -p tcp
IMSI: 414200000000029, IP: 192.168.128.12
Running: sudo ovs-appctl ofproto/trace gtp_br0 tcp,in_port=local,ip_dst=192.168.128.12,ip_src=114.114.114.114,tcp_src=80,tcp_dst=3372
Datapath Actions: set(tunnel(tun_id=0xc400003f,dst=10.0.2.208,ttl=64,tp_dst=2152,flags(df|key))),pop_eth,set(skb_mark(0x4)),2
```
For DL traffic, if you see a datapath action, check whether the dst IP address in the tunnel() action is the right eNB for the UE.
- Check the routing table for this IP address: `ip route get $dst_ip`
- Check if the eNB is reachable from the AGW; there could be firewall rules dropping the packets (see the sketch below).
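A minimal reachability check from the AGW, using the eNB IP from the example trace above (note that ICMP can be blocked even when GTP-U on UDP 2152 is allowed, so also look at firewall drop counters):

```
# Verify the route and basic reachability toward the eNB, then look for
# firewall rules that are counting dropped packets.
ip route get 10.0.2.208
ping -c 3 10.0.2.208
sudo iptables -L -n -v | grep -iE "drop|reject"
```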
If the probe command shows a drop, you need to check which table is dropping the packet. Manually run the OVS trace command from the output above, shown on the line starting with `Running`. For the DL example above:

```
sudo ovs-appctl ofproto/trace gtp_br0 tcp,in_port=local,ip_dst=192.168.128.12,ip_src=114.114.114.114,tcp_src=80,tcp_dst=3372
```
The trace command shows which table is dropping the packet. To map the numerical table number to an AGW pipeline table, use pipelined_cli.py:
```
root@magma:~# pipelined_cli.py debug table_assignment
App                      Main Table          Scratch Tables
----------------------------------------------------------------------
mme                      0                   []
ingress                  1                   []
arpd                     2                   []
access_control           3                   [21]
proxy                    4                   []
middle                   10                  []
gy                       11                  [22, 23]
enforcement              12                  [24]
enforcement_stats        13                  []
egress                   20                  []
```
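Once you know which table is dropping the packet, you can inspect that table's flows and their packet counters directly (standard OVS command; depending on the OpenFlow version configured on the bridge you may need to add an `-O OpenFlow13` or `-O OpenFlow14` option):

```
# Dump only the flows of the suspect table (12 = enforcement in the
# mapping above) and check match conditions and n_packets counters.
sudo ovs-ofctl dump-flows gtp_br0 table=12
```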
If the enforcement or gy table is dropping the packet, it means there is either no rule for the traffic or there is a blocking rule that drops the traffic.
- You can check rules in the datapath using the dp_probe command: `dp_probe_cli.py -i 414200000000029 --direction UL list_rules`
- To validate rules pushed from orc8r, you can use state_cli: `state_cli.py parse "policydb:rules"`. This command dumps all rules; you need to check which rules are applicable to the UE (see the filtering sketch below).
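To narrow that dump down to a single subscriber, you can filter on the IMSI seen in the rule names from the dp_probe output (the IMSI and grep pattern below are only an illustration):

```
# Show only the policy rules whose name references the subscriber of interest.
state_cli.py parse "policydb:rules" | grep -i -A 5 "IMSI414200000000029"
```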
Packet drops in access_control mean there is a static config in pipelined that does not allow this connection.
The AGW should not be dropping packets in any other table. If it does, file a bug report with the trace output in a GitHub issue.
If this document does not help you debug the issue, please post the output of all the steps above in a new GitHub issue.
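As a convenience, you could collect the key outputs referenced in this guide into a single file to attach to the issue (a sketch; extend the command list as needed for your case):

```
# Gather the main datapath state used in the steps above into one file.
{
  service magma@* status
  ovs-vsctl show
  mobility_cli.py get_subscriber_table
  pipelined_cli.py debug table_assignment
} > /tmp/agw_datapath_debug.txt 2>&1
```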
Intermittent packet drops
Intermittent packet loss is harder to debug than the previous case. Here the services and flow tables are configured correctly, but some packets are still dropped. The following are the usual suspects:
The TC queue is dropping packets due to rate limiting. The command

```
pipelined_cli.py debug qos
```

shows stats for all dropped packets. Run the test case and observe whether you see any dropped packets:

```
root@agw:~# pipelined_cli.py debug qos
/usr/local/lib/python3.5/dist-packages/scapy/config.py:411: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
  import cryptography
Root stats for: eth0
qdisc htb 1: root refcnt 2 r2q 10 default 0 direct_packets_stat 5487 ver 3.17 direct_qlen 1000
 Sent 1082274 bytes 7036 pkt (dropped 846, overlimits 4244 requeues 0)
 backlog 0b 0p requeues 0
Root stats for: eth1
qdisc htb 1: root refcnt 2 r2q 10 default 0 direct_packets_stat 41140 ver 3.17 direct_qlen 1000
 Sent 3603343 bytes 41337 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
```
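For a per-class view of the same counters, the standard `tc` statistics can be inspected directly (a sketch; `eth0` is the interface from the example output above):

```
# Show qdisc- and class-level statistics, including drops and overlimits.
tc -s qdisc show dev eth0
tc -s class show dev eth0
```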
NAT could be dropping packets. This can happen when no ports are available in the NAT table due to a large number of open connections. The AGW uses the default setting for the maximum number of tracked connections:

```
sysctl net.netfilter.nf_conntrack_max
```

and the default range of source ports:

```
sysctl net.ipv4.ip_local_port_range
```

If you see a high number of simultaneous connections, you need to tune these parameters.
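A minimal way to check whether the conntrack table is the bottleneck, and to raise the limits if it is (the values below are examples, not tuned recommendations):

```
# Compare the current number of tracked connections with the configured maximum.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# If the count is close to the maximum, raise it and widen the source-port range.
sudo sysctl -w net.netfilter.nf_conntrack_max=262144
sudo sysctl -w net.ipv4.ip_local_port_range="10240 65535"
```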
If none of this works, file a detailed bug report on GitHub.