Bug 1413

Summary: Unbound server stops responding to all requests
Product: unbound
Component: server
Version: 1.6.4
Hardware: x86_64
OS: Linux
Status: ASSIGNED
Severity: critical
Priority: P5
Reporter: Phil <philporada>
Assignee: unbound team <unbound-team>
CC: cathya, wouter
Attachments: Compressed tarball of metrics we retrieve from Unbound

Description Phil 2017-08-21 18:11:05 CEST
Created attachment 435 [details]
Compressed tarball of metrics we retrieve from Unbound

First off, thank you very much for Unbound.

What I've noticed from our monitoring and alerting is that TCP connections reported from `./unbound-control stats` will drop to 0 over a period of time until the unbound process is restarted. With a log verbosity of 1, I do not see anything strange in the logs. As soon as unbound is restarted, it will begin processing requests.

I've been testing Unbound 1.6.4 in the lab on CentOS 7 x86_64. It was compiled with the following flags. It was also built without support for python2, python3, and munin as we do not require those features.

    --with-libevent \
    --with-pthreads \
    --with-ssl \
    --disable-rpath \
    --disable-static \
    --enable-subnet \
    --enable-ipsecmod \
    --with-conf-file=%{_sysconfdir}/%{name}/unbound.conf \
    --with-pidfile=%{_localstatedir}/run/%{name}/%{name}.pid \
    --enable-sha2 \
    --disable-gost \
    --enable-ecdsa \
    --with-rootkey-file=%{_sharedstatedir}/unbound/root.key

There are two ways I know of to cause this on Unbound 1.6.4, only one of which is more than luck. The first method is to throw ~500 queries per second at the server; it handles this wonderfully at ~50% CPU and ~30% RAM utilization. The second method is a random occurrence, with nowhere near as much traffic as method one.

Please let me know if you need any more information.
Comment 1 Wouter Wijngaards 2017-08-22 15:18:28 CEST
Hi Phil,

Is this just the stats, or does a dig +tcp fail to connect as well?  The TCP channels are stored and reused inside the code; I guess if there were a leak in there, this sort of failure could be the result. That failure would make dig +tcp fail to connect (time out).

Best regards, Wouter
Comment 2 Wouter Wijngaards 2017-08-22 15:36:50 CEST
Hi Phil,

Something I am not sure of, because I cannot see it from the graphs you provided: are there simply no statistics in that period of time (e.g. the statistics collector failed to get statistics), or did it record 0s?  This is sometimes graphed differently, e.g. no line drawn vs. a line drawn at 0.

That would point to an issue with the statistics collection.  And I would guess that the statistics collector cannot get a TCP connection, because of full buffers or no free port numbers.  Perhaps enable the kernel's port reuse feature on the client, net.ipv4.tcp_tw_reuse on Linux (port numbers cannot be reused while in TIME_WAIT, so at 500/sec or so you might exhaust all the kernel's ports and thus block new TCP connections from forming).
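
A rough back-of-envelope check of the port-exhaustion theory above (a sketch; the port range and TIME_WAIT duration are common Linux defaults, not values measured on this system):

```shell
# Ephemeral port exhaustion estimate, assuming the usual Linux defaults:
#   net.ipv4.ip_local_port_range = 32768 60999, TIME_WAIT ~ 60 seconds
ports=$((60999 - 32768 + 1))   # ~28232 usable client-side ports
rate=500                       # new TCP connections per second
tw=60                          # seconds a closed socket lingers in TIME_WAIT
echo "ports needed to sustain rate: $((rate * tw))"
echo "seconds until exhaustion:     $((ports / rate))"
```

Sustaining 500 connections/sec needs about 30000 simultaneous TIME_WAIT slots, but only ~28232 ports are available, so the client would run dry in under a minute.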

Best regards, Wouter
Comment 3 Phil 2017-08-25 05:17:18 CEST
Queries with +tcp continue to work from the actual unbound server box when this happens. The stats monitor is not the issue here, though I understand what you mean about it. It's merely to give an idea of what I experience.

On the client, I enabled net.ipv4.tcp_tw_reuse to no avail.
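
For reference, a sketch of how the flag can be toggled on a Linux client (sysctl -w does not persist across reboots; add the line to /etc/sysctl.conf for that):

```shell
# enable reuse of TIME_WAIT sockets for outgoing connections (needs root)
sysctl -w net.ipv4.tcp_tw_reuse=1 || echo "need root to change the setting"
# read back the current value (works without root)
cat /proc/sys/net/ipv4/tcp_tw_reuse
```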
Comment 4 Wouter Wijngaards 2017-08-25 09:05:14 CEST
Hi Phil,

So the problem seems to be on the client.  netstat (with '-s') or other commands can perhaps tell you what is happening (it also reports errors).  Or does the client software report errors when it cannot create a TCP query?
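
If net-tools is not installed on the client, the counters that netstat -s parses can also be read straight from /proc (a sketch; the exact column set varies by kernel version):

```shell
# raw kernel TCP counters: first line is the header, second the values;
# AttemptFails and EstabResets are the interesting columns here
grep '^Tcp:' /proc/net/snmp
```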

Another option, by the way, is that there could be a firewall between client and server, and this firewall runs out of capacity (e.g. due to connection tracking).

Best regards, Wouter
Comment 5 Phil 2017-08-28 22:22:20 CEST
On the client(s) I've noticed that `netstat -tupan | grep -i time_wait | wc -l` can grow to ~8000. I haven't seen any errors reported from netstat, but the client software does start to see timeouts in functions that rely on the DNS query returning. Yes, the client software communicates with unbound over TCP. I have not seen any timeout-related errors in the physical firewall separating the client and server. This particular firewall is tuned to accept well in excess of 8000 connections.

During the times that this happens, unbound stays running but doesn't accept any connections from outside the server. Other network connections to the box work just fine. An unbound process restart will allow unbound to accept connections for a little while longer before communications stop again.
Comment 6 Wouter Wijngaards 2017-08-29 09:17:29 CEST
Hi Phil,

To be sure: you said that queries with +tcp keep working, and also that unbound stops answering.  So, if you try to connect with TCP from another machine (not the client making whole loads of requests), does that work?

What is your config for:
incoming-num-tcp: 10

Best regards, Wouter
Comment 7 Phil 2017-08-29 17:26:53 CEST

Sorry for the confusion. +tcp queries from the unbound server to itself work, like
dig @localhost +tcp example.com

+tcp queries from other clients stop working when this problem occurs.

We have tried the following configurations:
incoming-num-tcp: 2000     => still fails
incoming-num-tcp: 10000    => still fails
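
For completeness, the setting lives in the server clause of unbound.conf; a minimal fragment matching the second experiment above:

```
server:
    # number of incoming TCP buffers to allocate per thread
    incoming-num-tcp: 10000
```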
Comment 8 Phil 2017-08-31 21:09:29 CEST

When using Unbound 1.4.20 from the CentOS 7 centos-base yum repository, we did not encounter this issue when placing the unbound server under significant load. Only upon upgrading to 1.6.4 did this start to occur.

Thanks for the assistance thus far!
Comment 9 Wouter Wijngaards 2017-09-01 10:17:29 CEST
Hi Phil,

CentOS, you say.  So: kernel versions and new unbound features.  TCP Fast Open.  This feature was introduced recently.

./configure --disable-tfo-server --disable-tfo-client (together with your other options).  That should disable TCP Fast Open.  But these are not enabled by default, so perhaps it is not this feature.

Could you test unbound 1.6.6 (from the repository)?  It would be helpful if we can find which version the problems happen at, i.e. find out which version the problem was introduced in.  And if you can, perhaps narrow it down to a specific commit from svn, or use git bisect on a git mirror of the repository.

What could possibly also give information (if it doesn't stop the issue from appearing by slowing things down) is enabling verbosity level 4.  And then give the piece of the log from just as it stops responding to queries (and before that).  I don't really need to see the 1000s of queries before it that worked, but the last one that succeeded and then the one that fails.  Perhaps the verbose logs say something useful here.
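
The log level can be raised without a restart via the control channel, or persistently in the config file; a minimal fragment:

```
# at runtime:
#   unbound-control verbosity 4
# or persistently, in the server clause of unbound.conf:
server:
    verbosity: 4
```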

Best regards, Wouter