1. Establish a TCP connection

1.1. The server creates a socket and calls the listen system call.
1.2. The client calls the connect (blocking) system call.
1.3. The client OS creates a pending connection and sends a SYN packet to the destination(server) OS. The client is expecting to receive a SYN-ACK packet in return, meaning that the server OS accepted the connection.

1.3.1. If the server OS could be reached, and for example no one is listening on that port, then the server OS will return an RST,ACK packet, and the connect() system call in the client, will return -1 ECONNREFUSED (Connection refused).
1.3.2. If the SYN packet couldn’t reach the server OS (it was dropped along the way), or maybe that IP(of the server we are connecting to) is not allocated/found in the network, then, the client OS will receive no packet (at all) in return. Thus, it will retry to send the SYN packet N times, before failing. That number of times is configurable in linux as: $ sysctl net.ipv4.tcp_syn_retries , or in FreeBSD as net.link.ether.inet.maxtries.In my case it is 6. So, that means that if there was no reply to the first SYN packet, the OS will send 6 more SYN packets with a delay(time interval) between them. In the Linux kernel, the interval starts at 1 and is doubled every time, but in BSD it might be constant, I don’t knkow for sure. So, in Linux for example, if there is no reply to the first SYN packet, 1 second later, the first (1/6) SYN retry is sent. If no reply is received to it, 2 seconds later the second (2/6) SYN retry is sent, and so on. So it takes Linux(the client) ~127 seconds to give up on the connection and declare that -1 ETIMEDOUT (Connection timed out). However on FreeBSD it takes ~75 seconds (because of net.inet.tcp.keepinit).

1.4. Server OS receives the SYN packet, sends back to the client a SYN-ACK packet and adds a new pending connection to the backlog. If the backlog is full(max size given by server at listen() time), then client might be refused.(See more about the backlog parameter of listen). What usually happens when the backlog is full, is that the SYN-ACK is sent to the user, the user replies with an ACK(which the server ignores) and the client thinks that the connection is established. However, when the client starts sending data on the new connection. The server does not know of this connection(as it was not accepted from backlog) and will end the communication by sending a RST(Reset) packet back to the client.
1.5. Client OS receives the SYN-ACK packet, marks the connection(created at step 3.) as established, and sends back an ACK packet to the server. When the connection becomes established the client process will return from the connect() method.
1.6. Server OS receives the ACK packet, marks the connection(creates at step 4.) as established.

2. Receiving data

See: YT Linux Net Tuning, RH Tuning guide, RH Packet Reception, Receive Tuning

2.1. Hardware (NIC) ring-buffer

Frames are read from the wire and stored in the network card, inside a ring-buffer (a circular data structure where the new data overwrites the old data). To see how many frames were lost in the NIC(ring-buffer) you can use ethtool -S eth0. And for every frame read from the wire into the buffer, the NIC generates a hardware interrupt (HW IRQ). To increase the ring buffer size you can use: ethtool --set-ring eth0 rx 4096 tx 4096. There is a buffer for each opperation read/write. You can also show the current buffer size and the maximum size by using: ethtool --show-ring eth0. You can also increase the device weight to drain the buffer faster: see more.

2.2. Kernel

The HW IRQ makes the kernel schedule a software interrupt (SW IRQ) that will actually pull the data from that ring-buffer and into the kernel. The SW IRQ can be seen as a process ksoftirqd/X, and it runs continuously to draw the traffic off the NIC, to avoid data loss caused by the ring-buffer. Once the data/frame is in the kernel it it goes through the protocol stack(Ethernet, IP, TCP/UDP, …) and the data payload is put in the socket buffer.

2.3. Socket buffer

The socket buffer is the buffer that holds the data until the application calls recv() to get the data. The size of this buffer (in bytes) can be set/get from the application by using get/set sockopt(). If no such option is set the system provides a default value(also in bytes): sysctl net.core.rmem_default and a maximum value: sysctl net.core.rmem_max. You can also see the number of bytes that are in this buffer, that the application did not consume by using ss or netstat. Note that the buffer size we ask for (using the OS rmem_default or setsockopt()), is not the actual size allocated by the kernel. The kernel doubles this value *, and uses the extra space for administrative purposes and internal structures. However, the getsockopt() call returns the actual size, that is twice the size you asked for, but we should stick to half of it, to the size we asked for.

In kernel terms:

  1. sk_rmem_alloc - Is the number of bytes that are in the buffer, that are not consumed/received by the application.
  2. sk_rcvbuf - The total memory (value) returned by the getsockopt() function.

See more, How TCP sockets work


3. Errors explained:

3.1. Could not read from socket. Error code is: 104. Error message is: “Connection reset by peer”. (got this message using boost asio) This happens when you connect, the server can’t store the connection in the backlog(full) and then you think you’re connected(because the server sends the SYN-ACK even if the backlog was full) and you start writing to the socket. When the server receives the data on an unknown connection, it will “close” the connection by sending a RST packet. And this is the error that you get when you try to read the response from the socket. These errors will increase the ListenOverflowstimes the listen queue of a socket overflowed” in netstat -s or /proc/net/netstat (column -t /proc/net/netstat | less -NS). See more: ListenDrop

3.2. X UDP packet receive errors - These drops happen in the kernel, either because the packet was corrupted, or because the receive socket buffer was full. Receive drops can be seen per socket in cat /proc/net/udp (last column)(based on the inode you can match the socket with the output of ss -panuem and see the exact destionation, file descriptor, pid). To distinguish between corrupted packets and the receive buffer errors you can look at “X receive buffer errors” or you can see the value in column -t /proc/net/snmp | less -NS at Udp -> RcvbufErrors (InErrors-RcvbufErrors=number of corrupted packets that were dropepd).

3.3. Broken pipe - A process receives a SIGPIPE when it attempts to write to a pipe (named or not) or socket of type SOCK_STREAM that has no reader left. This usually happens when you try to write to a closed socket/connection.

3.4. Connection timed out - A SYN packet was sent but we got no SYN-ACK in return. This could be that the SYN packet was dropped somewhere along the way, or, the SYN-ACK was dropped somewhere, while returning. To find out how far your SYN packet gets, you can use this traceroute command: traceroute -n -T -p 25 smtp.google.com.

4. See also:


4.1. FreeBSD TCP params - https://www.freebsd.org/cgi/man.cgi?query=tcp&sektion=4&manpath=FreeBSD+5.0-RELEASE
4.2. FreeBSD network tuning - https://edersoncorbari.github.io/_posts/2017-05-27-freebsd-performance-tuning/
4.3. FreeBSD TCP keepalive explained - https://www.gnugk.org/keepalive.html
4.4. Linux SYN-flood DDoS- https://sudonull.com/post/152060-SYN-flood-type-detection-and-dynamic-protection-against-DDoS - and the FreeBSD equivalent - https://web.archive.org/web/20111031164050/http://synflood-defender.net/blog/synflood-defender_freebsd_kernel_parameters
4.5. SYN and SYN-ACK in wireshark - https://www.globalknowledge.com/us-en/resources/resource-library/articles/when-is-a-tcp-syn-not-a-syn/#gref
4.6. When TCP sockets refuse to die - https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/