Network: Debug TCP SYN Packet Drop

Posted by Bin Du on Sunday, November 26, 2023

TCP Connection Failure

TCP connection failures are a common problem in networking. If your service runs on a cloud provider such as Azure or AWS, a connection from your client to your service passes through multiple layers and hops, and many things can cause it to fail. Your client may have a misconfigured firewall rule or proxy that blocks the connection. Your cloud provider may have faulty network devices or misconfigured network settings that drop the connection. Your service may be overloaded and unable to accept new connections. These are just a few examples. In this post, I will share a case study of debugging one particular issue that caused TCP SYN packets to be dropped.

TCP 3-Way Handshake (SYN, SYN-ACK, ACK)

Before we dive into the case, let’s review the TCP 3-way handshake, the process that establishes a TCP connection. The client sends a SYN packet to the server, the server responds with a SYN-ACK packet, and the client sends an ACK packet back to the server. After that, the TCP connection is established and the client can send data to the server. There are many good articles that explain the TCP 3-way handshake. Here is one of them: TCP 3-Way Handshake Process.
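To see the handshake from an application's point of view, here is a minimal Python sketch that runs both ends on the loopback interface. The function name `loopback_handshake` is made up for illustration. The point it shows is that the kernel performs the SYN, SYN-ACK, ACK exchange inside `connect()`, and `accept()` only hands over a connection that is already established.

```python
import socket

def loopback_handshake():
    """Connect a client to a listener on 127.0.0.1; return (client address, peer address seen by server)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5)
    # connect() makes the kernel send the SYN; it returns after the SYN-ACK
    # arrives and the final ACK has been sent.
    client.connect(("127.0.0.1", port))

    # The handshake is already complete here; accept() just dequeues the connection.
    conn, peer = server.accept()
    result = (client.getsockname(), peer)

    conn.close()
    client.close()
    server.close()
    return result
```

If the SYN-ACK never comes back, as in the case below, `connect()` is the call that blocks and eventually fails on the client.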

How Do We Approach Network Issues?

When we debug network issues, we usually start from the client side. We first check the client’s firewall rules, proxy, and network settings such as DNS resolution. If everything looks normal, we capture the network traffic on the client side. On Linux, we can use tcpdump. On Windows, we can use Packet Monitor. If conditions allow, we also capture the network traffic on the server side. Wireshark is a popular tool for analyzing the resulting traces. By comparing the traffic seen on the client and server sides, we can narrow down the root cause or at least get hints for further investigation.
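Before reaching for a packet capture, a quick client-side probe can already separate a few failure modes. The sketch below is one possible way to do that in Python; the function name `probe_tcp` and the result labels are my own choices, not part of any tool mentioned above.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection; return 'connected', 'refused', 'timeout', 'dns-failure' or 'error'."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"          # name resolution failed before any packet was sent
    family, socktype, proto, _, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            return "connected"        # full 3-way handshake completed
        except ConnectionRefusedError:
            return "refused"          # an RST came back: host reachable, nothing listening
        except socket.timeout:
            return "timeout"          # no reply at all: SYN or SYN-ACK lost somewhere
        except OSError:
            return "error"            # e.g. network or host unreachable
```

A "timeout" result is the interesting one for this post: it says that either the SYN or the SYN-ACK was lost, but not where, which is why captures on both sides are still needed.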

Case Study: TCP SYN Packet Drop

Recently I worked on a case in which a client was unable to connect to a service. After capturing the network traffic on both the client and server sides, we could see that the server received the TCP SYN packet from the client but never responded with a SYN-ACK packet. The client kept retransmitting the SYN without getting a response, and it finally gave up and closed the connection.
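How long a client keeps retransmitting depends on its TCP stack. As an illustration only, the sketch below assumes a Linux client with the documented defaults (an initial retransmission timeout of 1 second that doubles each time, and `net.ipv4.tcp_syn_retries` set to 6); the client in this case may have behaved differently. The function name `syn_retry_schedule` is hypothetical.

```python
def syn_retry_schedule(initial_rto=1.0, syn_retries=6):
    """Return (retransmit_times, give_up_time), in seconds after the first SYN was sent.

    Models plain exponential backoff: each timeout is twice the previous one.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(syn_retries):
        elapsed += rto          # timer expires, SYN is retransmitted
        times.append(elapsed)
        rto *= 2                # back off before the next attempt
    give_up = elapsed + rto     # the last timer expires with no retries left
    return times, give_up
```

Under those assumptions the retransmissions go out 1, 3, 7, 15, 31, and 63 seconds after the first SYN, and the connection attempt fails after about 127 seconds, which is the spacing to look for in the client-side trace.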

Server Dropped TCP SYN Packet

To confirm the packet drop, we also ran netstat -s | grep -i LISTEN on the server machine and found that the "SYNs to LISTEN sockets dropped" counter was increasing.

netstat -s | grep -i LISTEN
    825858 SYNs to LISTEN sockets dropped
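A single counter value says little; what matters is whether it grows while the client is retrying. The sketch below computes that delta from two snapshots of `/proc/net/netstat`, which on Linux exposes the same counters in a header-line/value-line format. To my understanding, `ListenDrops` is the counter that netstat prints as "SYNs to LISTEN sockets dropped" and `ListenOverflows` counts accept-queue overflows; treat that mapping, and the function names, as assumptions of this sketch.

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat (pairs of 'Prefix: names' and 'Prefix: values' lines) into nested dicts."""
    lines = text.strip().splitlines()
    stats = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        stats[prefix] = dict(zip(names.split(), map(int, numbers.split())))
    return stats

def listen_drop_delta(before_text, after_text):
    """Return how much ListenOverflows and ListenDrops grew between two snapshots."""
    before = parse_proc_netstat(before_text)["TcpExt"]
    after = parse_proc_netstat(after_text)["TcpExt"]
    return {name: after[name] - before[name]
            for name in ("ListenOverflows", "ListenDrops")}
```

On a Linux server, the two snapshots would come from reading `/proc/net/netstat` a few seconds apart. If `ListenDrops` grows while `ListenOverflows` stays flat, a full accept queue is probably not the reason for the drops.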

Why Are Linux Kernel Protocol Stacks Dropping SYN Packets discusses two main scenarios in which SYN packets may be dropped. SYN packet handling in the wild is another good read for understanding how the Linux kernel handles SYN packets. It also analyzes SYN flood attacks.

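However, we ran ss -plnt sport = :443 and confirmed that the listen queue was healthy and not full: for a listening socket, Recv-Q is the current accept-queue length (0 here) and Send-Q is the configured backlog (511 here). So there had to be another, subtler issue in our case, and we needed to dig deeper.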

ss -plnt sport = :443
State          Recv-Q         Send-Q                   Local Address:Port                   Peer Address:Port         Process
LISTEN         0              511                                  *:443                               *:*             users:(("...",pid=3514528,fd=7))
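When many listeners have to be checked, the same reading of the ss output can be scripted. This is a small sketch that assumes the column layout shown above (State, Recv-Q, Send-Q, Local Address:Port, ...) and that, for LISTEN rows, Recv-Q is the current accept-queue length and Send-Q is the backlog limit. The function name `listen_queue_usage` is hypothetical.

```python
def listen_queue_usage(ss_output):
    """Return [(local_address, queued, backlog, looks_full)] for each LISTEN row of `ss -lnt` output."""
    rows = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a listening socket.
        if len(fields) >= 4 and fields[0] == "LISTEN":
            queued, backlog = int(fields[1]), int(fields[2])
            rows.append((fields[3], queued, backlog, queued >= backlog))
    return rows
```

Applied to the output above, it reports `*:443` with 0 of 511 slots used, which matches the conclusion that the queue was not the problem.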

After reading the Linux tcp_conn_request function, we noticed that if the kernel fails to resolve a route to the destination (the client), it drops the SYN.

dst = af_ops->route_req(sk, skb, &fl, req);
if (!dst)
    goto drop_and_free;

To confirm the hypothesis, we ran ip -6 route show dev eth0 and found that the default gateway route was missing. This explained why the server could receive the SYN packet from the client but could not send the SYN-ACK packet back.

ip -6 route show dev eth0
2001:4899::/64 proto ra metric 100 pref medium
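To make the consequence concrete, the sketch below checks whether a given client address is covered by any prefix in `ip -6 route show` output. It is a deliberate simplification of the kernel's route lookup (no longest-prefix match, no policy routing), and the function names as well as the client address `2001:db8::1` and gateway `fe80::1` used with it are illustrative assumptions, not values from this case.

```python
import ipaddress

def parse_prefixes(ip_route_output):
    """Extract destination prefixes from `ip route show` output; 'default' is treated as ::/0."""
    prefixes = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        target = "::/0" if fields[0] == "default" else fields[0]
        try:
            prefixes.append(ipaddress.ip_network(target, strict=False))
        except ValueError:
            continue  # lines starting with a route type (e.g. "unreachable") are skipped in this sketch
    return prefixes

def has_route_to(destination, ip_route_output):
    """Return True if any listed prefix contains the destination address."""
    addr = ipaddress.ip_address(destination)
    return any(addr in net for net in parse_prefixes(ip_route_output))
```

With only the `2001:4899::/64` route shown above, a client inside that /64 is reachable, but `has_route_to("2001:db8::1", ...)` is false, so there is no path for the SYN-ACK. Once a line such as `default via fe80::1 ...` is present, every IPv6 destination is covered.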

After adding the default gateway route, the server could respond with the SYN-ACK packet and the client could establish the TCP connection successfully.

Closing: Why Was the Default Gateway Route Missing, and How Can It Be Prevented?

We don’t know exactly what removed the default gateway route in this case. We do know that it is usually triggered by a network interface reset or upgrade, which can happen in a dynamic cloud environment. In Azure, we continually apply the lessons from case studies like this one and invest in our infrastructure to improve system monitoring and auto-remediation. I hope this article helps you debug similar network issues and improve your system’s reliability.