Bug 4100 - unbound-1.7.0 regression when ttl expires
Product: unbound
Classification: Unclassified
Component: server
Hardware: x86_64 Linux
Importance: P5 critical
Assigned To: unbound team
Reported: 2018-06-01 11:32 CEST by Tuomo Soini
Modified: 2018-06-04 16:47 CEST (History)
CC: 2 users

Log showing the servfails after ttl expire. (1.73 KB, text/plain)
2018-06-01 11:32 CEST, Tuomo Soini

Description Tuomo Soini 2018-06-01 11:32:10 CEST
Created attachment 503
Log showing the servfails after ttl expire.

Config with samba AD.

domain-insecure: ad.foobar.fi
local-zone: "ad.foobar.fi" transparent
        name: "ad.foobar.fi"
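For reference, a complete configuration of this shape would look roughly like the sketch below; the stub-addr value is a placeholder, since the actual AD DNS server address is not shown in the report:

```conf
server:
    # Skip DNSSEC validation for the internal AD zone
    domain-insecure: "ad.foobar.fi"
    local-zone: "ad.foobar.fi" transparent

stub-zone:
    name: "ad.foobar.fi"
    # Placeholder address for the samba AD DNS server
    stub-addr: 192.0.2.53
```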

dig derek-01.ad.foobar.fi

;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40496
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

derek-01.ad.foobar.fi. 427 IN	A

But when the TTL reaches 0, subsequent queries (for some time) return SERVFAIL.

This is a regression in 1.7.0 and 1.7.1 - it works with 1.6.8.
Comment 1 Tuomo Soini 2018-06-01 11:38:27 CEST
In normal config I have:

prefetch: yes

I tried disabling prefetch and still got the same problem.
Comment 2 Tuomo Soini 2018-06-01 11:47:03 CEST
Analyzing older logs, it looks like unbound responds with SERVFAIL for about 30 seconds after the TTL expires.
Comment 3 Wouter Wijngaards 2018-06-01 11:55:15 CEST
Hi Tuomo,

So it cannot look up the query at the destination.  What happens when it tries to look up the data?  Can you give logs for that, e.g. verbosity 4 logs for the lookup of a name?

That it takes 30 seconds seems to be related to timeouts on the upstream server side, e.g. packets dropped or servers down.  There are pending fixes that fix up the serve-expired behaviour to not return SERVFAIL, and this might be related.  In that case, try the latest code repository version.
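The serve-expired behaviour mentioned here is a config option; as a minimal sketch, it is enabled in unbound.conf like this (see unbound.conf(5)):

```conf
server:
    # Answer with expired cached records (TTL 0 in the reply)
    # instead of failing while the upstream lookup is retried.
    serve-expired: yes
```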

Best regards, Wouter
Comment 4 Tuomo Soini 2018-06-01 12:14:28 CEST
There is no problem querying the upstream server.
Comment 5 Tuomo Soini 2018-06-01 12:22:03 CEST
Setting up with verbosity: 4. Will get back in half an hour or so when I have the log.
Comment 6 Tuomo Soini 2018-06-01 12:27:16 CEST
(In reply to Tuomo Soini from comment #4)
> There is no problem in querying upstream server.

What I mean is: there is no issue querying with 1.6.8.  And this same issue is visible against a different upstream system with different upstream DNS server software.  I don't see any evidence that unbound queries the upstream at all before the 30 seconds have passed.
Comment 7 Wouter Wijngaards 2018-06-01 12:37:01 CEST
Hi Tuomo,

With a SERVFAIL in cache, the SERVFAIL itself has a TTL.  This is much shorter than 30 seconds, but it is still there.  With serve-expired enabled, I believe there was a bug where it would overwrite the cache with a SERVFAIL, but only if there is also reason for an ordinary SERVFAIL (e.g. server down).  This is why I was guessing trouble at the upstream.

But the logs may show more.  Does unbound somehow make different queries now, and they aren't answered?  For example, because of aggressive negative NSEC or qname minimisation or something like that.  Or is the upstream sending answers that are handled differently because of fixes in this version?  (I mean, a fix for someone else that somehow negatively impacts this situation.)

The 30 second wait could actually still be a server being down.  Unbound has timers for ratelimiting queries to down servers, and it could be waiting for such a timer.  The server has to be down a lot for that, which is weird when you say it isn't (for someone else at one point, it turned out that dig +dnssec didn't get an answer while ordinary dig worked).

Best regards, Wouter
Comment 8 Tuomo Soini 2018-06-01 12:45:14 CEST
I checked before with unbound-control dump_cache: there is no entry in the cache while unbound is giving SERVFAIL.
Comment 9 Tuomo Soini 2018-06-01 13:27:41 CEST
qname-minimisation and aggressive-nsec are both disabled, because I had problems with those before.
Comment 10 Tuomo Soini 2018-06-01 14:38:22 CEST
If I enable verbosity: 4 the problem doesn't happen.  And I can't run with verbosity: 4 in production because the logging overhead kills unbound's performance so badly that DNS answers are delayed too much and queries from clients start to fail.

Verified again and there is no data in cache during problem.
Comment 11 Tuomo Soini 2018-06-01 14:39:52 CEST
I'd again like to point out that this is a regression from 1.6.8, which works.
Comment 12 Wouter Wijngaards 2018-06-01 15:08:22 CEST
1.7.1 has this fix:
        - Fix #3736: Fix 0 TTL domains stuck on SERVFAIL unless manually
          flushed with serve-expired on.

But otherwise I don't see obvious reasons for this in changes between 1.6.8 and 1.7.0 ...

You could dump_infra and check the IPs of the upstream nameservers.  The ping should be low; if the rtt is something like 30 seconds, that could be a 30 second timer going off.  But this assumes timeouts, and you say there aren't any; it's just hard to explain what happens (without logs) without assuming timeouts on the upstream server.

You can enable and disable verbosity with unbound-control verbosity xxx.  So you could enable logging briefly and see if the problem happens.

Can you make the problem happen with e.g. unbound-host (with a config file that directs traffic to that stub)?  That could be an easy way to reproduce and get lots of logs.

Unbound can log to a file with logfile: "unbound.log"; that could be faster than syslog.
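As a sketch, the diagnostic steps suggested here look like the following, assuming remote control is already enabled (control-enable: yes) so that unbound-control works:

```sh
# Raise verbosity at runtime without restarting unbound,
# reproduce the problem, then lower it again:
unbound-control verbosity 4
# ... reproduce ...
unbound-control verbosity 1

# Inspect the infra cache; the rtt values for the upstream
# nameserver addresses should be low:
unbound-control dump_infra
```

For file logging instead of syslog, set logfile: "unbound.log" together with use-syslog: no in the server: section of unbound.conf.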

Not sure about the cache having no entry when there is nothing.  I guess it means that it is trying to resolve the query.  This somehow doesn't work: your entry does not get prefetched or refreshed, and there are wait times of up to 30 seconds.  It really sounds like timeouts, i.e. unbound is waiting on timeouts.  That would be packets not returning.

Best regards, Wouter
Comment 13 Tuomo Soini 2018-06-01 15:10:38 CEST
I'm running 1.7.1.
Comment 14 Tuomo Soini 2018-06-01 15:15:21 CEST
(In reply to Wouter Wijngaards from comment #12)

> Not sure about that no cache entry when there is nothing.  I guess it means
> that it is trying to resolve the query.  This somehow doesn't work; your
> entry does not get prefetched or refreshed; and there are wait times of up
> to 30 seconds.  It really sounds like timeouts; eg. unbound is waiting on
> timeouts.  That would be packets not returning.

There wouldn't be an issue if unbound didn't answer at all: in the case of no answer, clients would ask the next server after a timeout.  But it answers with SERVFAIL, and SERVFAIL causes the infrastructure to break.
Comment 15 Tuomo Soini 2018-06-01 15:22:53 CEST
During the SERVFAIL period, there is no entry for the upstream server(s) in dump_infra.
Comment 16 Wouter Wijngaards 2018-06-01 15:23:43 CEST
Hi Tuomo,

What could be happening, and this is a guess, is that your number of file descriptors is too small, and queries have to wait for a file descriptor.  If the wait time runs up to 30 seconds, more queries must be coming in than it has file descriptors to answer them.

This is caused by having lots of threads and lots of queries per thread, but compiling without libevent, with only 1024 total file descriptors available (on some OSes even fewer).  You can check with unbound -h; apart from the help text it prints whether unbound is linked with libevent.  You don't want the internal 'mini-event', because it uses select; you want e.g. libevent and epoll.

Then also check the config file and warnings near startup of unbound (along the lines of 'warning: not enough file descriptors, reducing ...').

In the config file, make sure that outgoing-range: 4096 is about that big and not something very small, like '40', or reduced at startup time because of a low rlimit or ulimit.

If this is the problem (and I think it could be, because you say the server is too busy to enable logging, i.e. there is lots of traffic), the solution is to compile with libevent (install libevent-devel and rerun configure --with-libevent) and then make sure outgoing-range is large enough.
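As a rough sketch, the capacity-related settings discussed here live in the server: section of unbound.conf (the numbers are illustrative, not recommendations):

```conf
server:
    num-threads: 8
    # Ports for outgoing queries, per thread; each needs a file
    # descriptor.  If the rlimit on open files is too low, unbound
    # reduces this at startup and logs a warning.
    outgoing-range: 4096
    # Queries each thread can service simultaneously.
    num-queries-per-thread: 1024
```

unbound -h prints the linked event library; the internal 'mini-event' uses select and so limits the number of usable descriptors, while a libevent build with epoll can service a large outgoing-range.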

Best regards, Wouter
Comment 17 Tuomo Soini 2018-06-01 15:45:37 CEST
Another domain, another backend, but the same system:

unbound-control flush ivo-01.intra.foobar.fi.

dig ivo-01.intra.foobar.fi. @
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40477

dig +dnssec ivo-01.intra.foobar.fi. @

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1183
ivo-01.intra.foobar.fi. 1200 IN	A

The backend here is bind-ldap (FreeIPA).

I have never had this problem with these backends on 1.6.8, and if I downgrade to 1.6.8 everything works.
Comment 18 Tuomo Soini 2018-06-01 15:55:21 CEST
Version 1.7.1
linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips  26 Jan 2017
linked modules: dns64 python ipsecmod respip validator iterator

num-threads: 8

starting unbound manually shows:

Jun 01 16:50:27 unbound[7831:0] debug: increased limit(open files) from 1024 to 33118

Hmh. Let me verify how many open files systemd allows for the service.
Comment 19 Wouter Wijngaards 2018-06-01 15:59:54 CEST
Hi Tuomo,

This change happened in the meantime:
Fix #2882: Unbound behaviour changes (wrong) when domain-insecure is set for stub zone. It no longer searches for DNSSEC information.

Although this is different, I don't think it should be a problem, but it is definitely a change.  I would still like verbose logs, because they would show very explicitly what differs between 1.6.8 and the version you have now.

Perhaps you can recognize the issue or see similar elements?  I'm also searching for what is happening.

Best regards, Wouter
Comment 20 Tuomo Soini 2018-06-01 16:20:56 CEST
This issue is not limited to stub zones.  After I enabled full logging on other resolvers, I found out this happens for normal DNS traffic too, such as rbldns queries: unbound responds with SERVFAIL while dig against the nsd serving the data works.
Comment 21 Wouter Wijngaards 2018-06-01 16:27:09 CEST
Hi Tuomo,

So, I would guess that unbound is in trouble somehow.  Your file descriptor increase looks okay.  Is num-queries-per-thread okay (the default would be okay)?  Not too low or something like that?  That would drop queries because there is no space.  Or have you enabled rate limiting?

Without logs it is hard to guess what happens, but because it is so busy I thought it was out of capacity somehow.

Best regards, Wouter
Comment 22 Tuomo Soini 2018-06-01 16:33:34 CEST
No such changes to the defaults.  This is basically the Fedora/CentOS config with the minor changes I have already described.
Comment 23 Wouter Wijngaards 2018-06-04 14:49:53 CEST
Got debug logs from Tuomo and have committed a fix to the code repository.  The stub became useless, and in 1.7.0/1.7.1 there are fixes for stub leakage: these prevented the stub from leaking, but the query was stopped, so it did not resolve.  The code fix is for the stub to re-prime.
Comment 24 Wouter Wijngaards 2018-06-04 16:47:20 CEST
The updated fix resolved the issue, also for nonprime stubs.  Unbound now falls back to the configured delegation point when the stub turns up useless, but still does not leak stub queries to the internet.  The fix is in the code repository.