Bugzilla – Bug 4100
unbound-1.7.0 regression when ttl expires
Last modified: 2018-06-04 16:47:20 CEST
Created attachment 503 [details]
Log showing the servfails after ttl expire.

Config with samba AD:

domain-insecure: ad.foobar.fi
local-zone: "ad.foobar.fi" transparent
stub-zone:
    name: "ad.foobar.fi"
    stub-addr: 172.27.32.221
    stub-addr: 172.27.32.222

dig derek-01.ad.foobar.fi

;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40496
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
derek-01.ad.foobar.fi.  427     IN      A       172.27.32.221

But when the ttl reaches 0, the next queries (for some time) give SERVFAIL.

This is a regression in 1.7.0 and 1.7.1 - it works with 1.6.8.
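For reference, a minimal sketch of how the failure window can be observed, assuming unbound listens on 127.0.0.1 and the record has roughly the TTL shown above (the sleep value is illustrative):

# first query populates the cache
dig +short derek-01.ad.foobar.fi @127.0.0.1
# wait slightly longer than the cached TTL (427 seconds above)
sleep 430
# re-query just after expiry; for about 30 seconds the status is SERVFAIL
dig derek-01.ad.foobar.fi @127.0.0.1 | grep status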
In my normal config I have prefetch: yes. I tried disabling prefetch and still got the same problem.
Analyzing older logs - it looks like it is about 30 seconds after the ttl expires that unbound responds with SERVFAIL.
Hi Tuomo,

So it cannot look up the query at the destination. What happens when it tries to look up the data? Can you give logs for that, e.g. verbosity 4 logs for the lookup of a name?

That it takes 30 seconds seems to be related to timeouts on the upstream server side, e.g. packets dropped or servers down.

There are pending fixes that fix up the serve-expired behaviour to not return SERVFAIL, and this might be related to that. In that case, try with the latest code repository version.

Best regards, Wouter
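A sketch of how the requested logging could be enabled, assuming an otherwise stock unbound.conf (the logfile path is just an example):

server:
    verbosity: 4
    # optionally log to a file instead of syslog, which is usually faster
    use-syslog: no
    logfile: "/var/log/unbound.log"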
There is no problem in querying the upstream server.
Setting up with verbosity: 4. Will get back in half an hour or so when I have the log.
(In reply to Tuomo Soini from comment #4)
> There is no problem in querying the upstream server.

What I mean by this: there is no issue when querying with 1.6.8. And this same issue is visible against a different upstream system running different upstream dns server software. I don't see any evidence that unbound queries upstream at all before 30 seconds have passed.
Hi Tuomo,

With a servfail in the cache, the servfail itself has a TTL. This is much shorter than 30 seconds, but it is still there. With serve-expired enabled, I believe there was a bug where it would overwrite the cache with a SERVFAIL, but only if there is also a reason for an ordinary SERVFAIL (e.g. server down). This is why I was guessing trouble at the upstream. But the logs may show more.

Does unbound somehow make different queries now that aren't answered? For example, because of aggressive negative nsec or qname minimisation or something like that. Or is the upstream sending answers that are handled differently because of fixes in this version? (I mean, a fix for someone else that somehow negatively impacts this situation.)

The 30 second wait could actually still be a server being down. Unbound has timers for ratelimiting the queries to down servers, and it could be waiting for that timer. The server has to be down a lot for that, which is weird when you say it isn't (for someone else at one point, it turned out that dig +dnssec didn't get an answer but ordinary dig worked).

Best regards, Wouter
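If serve-expired is in play here, it is controlled by the option below, and a suspect cached SERVFAIL can be flushed from unbound-control; a sketch, only relevant if these features are actually in use on this server:

server:
    serve-expired: yes    # answer with expired data instead of waiting on the upstream

# clear a cached entry or all negative/SERVFAIL entries without a restart
unbound-control flush derek-01.ad.foobar.fi
unbound-control flush_negative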
I checked before with unbound-control dump_cache that there is no entry in the cache while unbound is giving SERVFAIL.
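A sketch of how such a cache check can be done, assuming unbound-control is set up for this server:

# dump the whole cache and look for the name
unbound-control dump_cache | grep -i derek-01.ad.foobar.fi
# print the nameservers/delegation unbound would use for the name
unbound-control lookup derek-01.ad.foobar.fi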
qname-minimization and aggressive-nsec are both disabled because I had problems with those before.
If I enable verbosity: 4 the problem doesn't happen. And I can't run with verbosity: 4 anyway, because it hurts unbound performance so badly that dns fails to answer client queries; there is too much logging going on and dns answers are delayed too much. I verified again that there is no data in the cache while the problem occurs.
I'd again like to point out that this is a regression from 1.6.8, which works.
1.7.1 has this fix:

- Fix #3736: Fix 0 TTL domains stuck on SERVFAIL unless manually flushed with serve-expired on.

But otherwise I don't see obvious reasons for this in the changes between 1.6.8 and 1.7.0 ...

You could dump_infra and check the IPs for the upstream nameservers. The ping should be low, and if the rtt is something like 30 seconds, that could be a 30 second timer going off. But this assumes timeouts, and you say there aren't any; it's hard to explain what happens (without logs) without assuming timeouts on the upstream server.

You can enable and disable verbosity with unbound-control verbosity xxx. So you could enable logging briefly and look if the problem happens ...

Can you make the problem happen with e.g. unbound-host (with a config file that directs traffic to that stub)? That could be an easy way to reproduce and get lots of logs. Unbound can log to a logfile with logfile: "unbound.log"; that could be faster than syslog...

Not sure about that no cache entry when there is nothing. I guess it means that it is trying to resolve the query. This somehow doesn't work; your entry does not get prefetched or refreshed; and there are wait times of up to 30 seconds. It really sounds like timeouts; e.g. unbound is waiting on timeouts. That would be packets not returning.

Best regards, Wouter
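A sketch of how these suggestions could be combined without leaving high verbosity on, assuming unbound-control works and the config lives at /etc/unbound/unbound.conf (paths are illustrative):

# raise verbosity only around one reproduction attempt
unbound-control verbosity 4
dig derek-01.ad.foobar.fi @127.0.0.1
unbound-control verbosity 1

# inspect the known upstream servers and their rtt
unbound-control dump_infra

# reproduce with unbound-host and debug output, using the same config
unbound-host -C /etc/unbound/unbound.conf -d -d -t A derek-01.ad.foobar.fi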
I'm running 1.7.1.
(In reply to Wouter Wijngaards from comment #12)
> Not sure about that no cache entry when there is nothing. I guess it means
> that it is trying to resolve the query. This somehow doesn't work; your
> entry does not get prefetched or refreshed; and there are wait times of up
> to 30 seconds. It really sounds like timeouts; e.g. unbound is waiting on
> timeouts. That would be packets not returning.

There wouldn't be an issue if unbound didn't answer at all. But it answers with SERVFAIL, which breaks clients. In case of no answer, clients would ask the next server after a timeout. But SERVFAIL causes the infrastructure to break.
During a SERVFAIL, there is no entry for the upstream server(s) in dump_infra.
Hi Tuomo,

What could be happening, and this is a guess, is that your number of file descriptors is too small, and queries have to wait for a file descriptor. The wait time runs up to 30 seconds; more queries must be coming in than it has file descriptors to answer them. This is caused by having lots of threads and lots of queries per thread, but compiling without libevent, with only 1024 total file descriptors available (on some OSes even less).

You can check with unbound -h; apart from the help text it prints whether unbound is linked with libevent. You don't want the internal 'mini-event', because it uses select. You want e.g. libevent and epoll. Then also check the config file and warnings near startup of unbound (along the lines of 'warning not enough file descriptors reducing ...'). In the config file make sure that outgoing-range: 4096 is about that big and not something very small, like '40', or reduced at startup time because of a low rlimit or ulimit.

I think this could be the problem, because you say the server is too busy to enable logging, i.e. there is lots of traffic. The solution is then to compile with libevent (install libevent-devel and rerun configure --with-libevent) and make sure outgoing-range is large enough.

Best regards, Wouter
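A sketch of the checks and settings described above; the values are examples rather than recommendations for this particular server:

# look for libevent/epoll (rather than the internal mini-event) in the build info
unbound -h

# rebuild against libevent if only the internal mini-event is shown
./configure --with-libevent && make && make install

# in unbound.conf, keep the number of outgoing sockets per thread large
server:
    outgoing-range: 4096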
Another domain, another backend, but the same system:

unbound-control flush ivo-01.intra.foobar.fi.

dig ivo-01.intra.foobar.fi. @127.0.0.1
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40477

dig +dnssec ivo-01.intra.foobar.fi. @172.27.32.10
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1183
ivo-01.intra.foobar.fi. 1200    IN      A       172.27.32.10

The backend in this case is bind-ldap (freeipa). I have never had this problem with any backend with 1.6.8, and if I downgrade to 1.6.8 everything works.
Version 1.7.1

linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips 26 Jan 2017
linked modules: dns64 python ipsecmod respip validator iterator

num-threads: 8

Starting unbound manually shows:

Jun 01 16:50:27 unbound[7831:0] debug: increased limit(open files) from 1024 to 33118

Hmh. Let me verify how many open files systemd allows for the service.
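The systemd limit can be checked, and raised with a drop-in, roughly like this; a sketch, assuming the service is named unbound.service and the limit value is only an example:

systemctl show unbound.service | grep LimitNOFILE

# /etc/systemd/system/unbound.service.d/limits.conf
[Service]
LimitNOFILE=65536

systemctl daemon-reload
systemctl restart unbound.service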
Hi Tuomo,

This change happened in the meantime:

Fix #2882: Unbound behaviour changes (wrong) when domain-insecure is set for stub zone. It no longer searches for DNSSEC information.
https://nlnetlabs.nl/bugs-script/show_bug.cgi?id=2882

Although this is different behaviour, I don't think it should be a problem. But it is definitely different.

I would still like verbose logs, because that would explain very explicitly the difference between 1.6.8 and the current version you have. Perhaps you can recognize the issue or see similar elements? I'm also searching for what is happening.

Best regards, Wouter
This issue is not limited to stub-zones. After I enabled full logging on other resolvers, I found out this happens for normal dns traffic too, like rbldns queries. Unbound responds with SERVFAIL, while dig against the nsd serving the data works.
Hi Tuomo,

So, I would guess that unbound is in trouble somehow. Your file descriptor increase looks okay. Is num-queries-per-thread okay (the default would be okay)? Not too low or something like that? That would drop queries because there is no space. Or have you enabled rate limiting?

Without logs it is hard to guess what happens, but because it is so busy I thought it was out of capacity somehow.

Best regards, Wouter
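The relevant values can be read back from the running daemon to rule this out; a sketch, assuming unbound-control access:

unbound-control get_option num-queries-per-thread
unbound-control get_option outgoing-range
unbound-control get_option ratelimit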
No such changes to the defaults. This is basically the fedora/centos config with the minor changes I have already described.
Got debug logs from Tuomo and have committed a fix to the code repository. The stub became useless, and in 1.7.0/1.7.1 there are fixes for stub leakage. These prevented the stub from leaking, but it did not resolve because the query was stopped. The code fix is for the stub to re-prime.
The updated fix resolves the issue, also for nonprime stubs. It falls back to the configured delegation point when the stub turns out to be useless, but still does not leak stub queries to the internet. The fix is in the code repository.