Bug 4125 - While reload unbound works without having loaded the stubs.list
Product: unbound
Classification: Unclassified
Component: server
Hardware: x86_64 Linux
Importance: P5 critical
Assigned To: unbound team
Reported: 2018-07-10 15:21 CEST by joanna
Modified: 2018-10-25 10:51 CEST
Description joanna 2018-07-10 15:21:15 CEST
We have a split-horizon DNS.
All internal zones are configured in the stubs.list of unbound, so unbound queries the internal servers and not "the internet".
We frequently have to update the stubs.list because a new internal zone is added.
So we copy a new stubs.list into place.
After that we reload unbound:
unbound-control reload

Logfile says:
Fri Jul  6 17:34:01 CEST 2018

It seems that unbound deletes the old data and the caches but has not yet read or processed the stubs.list at this point.
Instead of asking the internal auth DNS, unbound asks its default DNS servers (the root servers) on the internet.

New Stubs.list 
-rw-r--r-- 1 user group 232551 Jul  6 17:34 stubs.list
111.11.10.in-addr.arpa was added
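
With a stubs.list of this size it can help to confirm which zones a new copy actually adds before reloading. The following is a minimal sketch, assuming the file consists of ordinary unbound.conf `stub-zone:` clauses with `name:` entries (the file contents are not shown in this report, so that layout is an assumption). The function names and the 192.0.2.53 address are made up for illustration.

```python
def stub_zone_names(conf_text):
    """Return the set of zone names declared in stub-zone: clauses.

    Assumes plain unbound.conf syntax: a bare "stub-zone:" line opens a
    clause and "name: <zone>" lines inside it name the zone.
    """
    names = set()
    in_stub = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not value:
            # A bare "something:" line opens a new clause.
            in_stub = (key == "stub-zone")
        elif in_stub and key == "name":
            # Normalise: no quotes, no trailing dot, lower case ("." stays ".").
            zone = value.strip('"').rstrip(".").lower() or "."
            names.add(zone)
    return names


def added_zones(old_text, new_text):
    """Zones present in the new stubs.list but not in the old one."""
    return sorted(stub_zone_names(new_text) - stub_zone_names(old_text))


if __name__ == "__main__":
    old = 'stub-zone:\n    name: "example.com"\n    stub-addr: 192.0.2.53\n'
    new = old + 'stub-zone:\n    name: "111.11.10.in-addr.arpa."\n    stub-addr: 192.0.2.53\n'
    print(added_zones(old, new))
```

Comparing the parsed names rather than diffing the raw files keeps the check independent of ordering and whitespace changes in the list.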

Start of problem  ;; WHEN: Fri Jul 06 17:34:18 CEST 2018
End of problem    ;; WHEN: Fri Jul 06 18:34:14 CEST 2018

The first question to the "internet" (ns2.example.com).

; <<>> DiG 9.9.9-P1 <<>> @ dealsdc1.example.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 23225
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

; EDNS: version: 0, flags:; udp: 4096
;dealsdc1.example.com.  IN      A

example.com.    3597    IN      SOA     ns2.example.com. hostmaster. 2017111600 10800 3600 2419200 3600

;; Query time: 0 msec
;; WHEN: Fri Jul 06 17:34:18 CEST 2018
;; MSG SIZE  rcvd: 140

Ten seconds later unbound knows the internal DNS server but still has the NXDOMAIN cached.

; <<>> DiG 9.9.9-P1 <<>> @ dealsdc1.example.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20224
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

; EDNS: version: 0, flags:; udp: 4096
;dealsdc1.example.com.  IN      A

example.com.    3599    IN      SOA     dealsdc1.example.com. hostmaster. 4615182 900 600 86400 3600

;; Query time: 0 msec
;; WHEN: Fri Jul 06 17:34:28 CEST 2018
;; MSG SIZE  rcvd: 103

The problem is possibly here.

At lines 705/706 it deletes the workers,
but before that it has already deleted local_zones and views.

686         /* before stopping main worker, handle signals ourselves, so we
687            don't die on multiple reload signals for example. */
688         signal_handling_record();
689         log_thread_set(NULL);
690         /* clean up caches because
691          * a) RRset IDs will be recycled after a reload, causing collisions
692          * b) validation config can change, thus rrset, msg, keycache clear */
693         slabhash_clear(&daemon->env->rrset_cache->table);
694         slabhash_clear(daemon->env->msg_cache);
695         local_zones_delete(daemon->local_zones);
696         daemon->local_zones = NULL;
697         respip_set_delete(daemon->respip_set);
698         daemon->respip_set = NULL;
699         views_delete(daemon->views);
700         daemon->views = NULL;
701         if(daemon->env->auth_zones)
702                 auth_zones_cleanup(daemon->env->auth_zones);
703         /* key cache is cleared by module desetup during next daemon_fork() */
704         daemon_remote_clear(daemon->rc);
705         for(i=0; i<daemon->num; i++)
706                 worker_delete(daemon->workers[i]);
707         free(daemon->workers);
708         daemon->workers = NULL;
709         daemon->num = 0;
710         alloc_clear_special(&daemon->superalloc);
Comment 1 Wouter Wijngaards 2018-07-10 15:27:55 CEST
Hi Joanna,

Before daemon_cleanup() is called there, the other workers are already stopped.  So this is the only thread running.

When unbound rereads the config file, it reads the stubs straight away.  The stub statements from the config file take effect straight away on start.  At least I assume this stubs.list is a file included in unbound.conf for processing.  If the contents of that file are applied some other way (e.g. via unbound-control somehow?), then perhaps flush the cache for that zone after applying the stub for the zone.  Or create a config include file.
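
For the case where a stub is applied at runtime rather than through the config file, the "apply, then flush" order could be sketched roughly as below. This is only an illustration: it builds `unbound-control stub_add` and `unbound-control flush_zone` invocations for one zone, the helper names are hypothetical, and the 192.0.2.53 address is a placeholder. Nothing is executed unless `run_all()` is called on a host where unbound-control is configured.

```python
import subprocess


def apply_stub_commands(zone, addrs, control="unbound-control"):
    """Build the commands to add a stub at runtime and then flush that zone.

    The stub is added first so that lookups after the flush already use
    the stub servers.
    """
    if not addrs:
        raise ValueError("at least one stub address is required")
    return [
        [control, "stub_add", zone, *addrs],
        [control, "flush_zone", zone],
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in apply_stub_commands("111.11.10.in-addr.arpa", ["192.0.2.53"]):
        print(" ".join(cmd))
```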

Best regards, Wouter
Comment 2 joanna 2018-07-10 15:53:22 CEST
Hi Wouter,

so my guess was wrong, but the problem exists.
There seems to be a moment when unbound is working without knowing the whole configuration.

stubs.list is included in unbound.conf
include: "/DNS/unbound/stubs.list"

Best wishes
Comment 3 Wouter Wijngaards 2018-07-10 15:58:17 CEST
Hi Joanna,

That code is in the daemon_fork() routine.  It has the calls to daemon_start_others() and daemon_stop_others(), which start and stop the other threads.  There does not seem to be a moment in between where the stub config does not apply.

However, since 1.6.8 there have been fixes where unreachability caused queries to the internet instead of internal queries with the stub.  These fixes are available in 1.7.3, perhaps an upgrade can solve the issue?  The fixes are several, with notes similar to 'leak from stub to internet fixed' and so on.

Best regards, Wouter
Comment 4 joanna 2018-07-24 12:50:09 CEST
Hi Wouter,

finally it seems I found the problem.

First: There is no problem with the stubs.list.
The problem is the delegation of a subdomain and appears after every reload.

The windows domain controller dealsdc1.example.com delegates some subdomains (e.g. proxy.example.com) to itself.
My auth DNS server only gets example.com via zone transfer, not the subdomains.

If some client asks for server.proxy.example.com unbound tries to reach the NS dealsdc1.example.com which is authoritative for proxy.example.com.
After several retries with no answer unbound asks the root server for the A RR of dealsdc1.example.com.
After that, unbound finally reaches my external auth DNS server, which answers with NXDOMAIN.

unbound	-> DNS 85 Standard query 0x2df5  A dealsdc1.example.com

After the NXDOMAIN you can see with
$ unbound-control dump_infra
that unbound now knows both the internal auths and my external auths for example.com.
So unbound now asks any of them and gets an NXDOMAIN now and then, depending on which auth it asks.

It would be nice if you could tell me where this behavior is described and
if there is some change in the behavior with a new version of unbound.
Comment 5 Wouter Wijngaards 2018-10-25 10:38:55 CEST
Hi Joanna,

The problem you describe fits exactly with the fixes described for 1.6.8.  The bug entry lists version 1.6.8, is that the case?  If the bug really still happens, there are more instances of the bug that need a fix.  Could you try with a newer version (e.g. 1.8.1)?  That would be nice, but I do not actually think there is a code change that should impact the outcome.  If it still happens with the fixed code, I'd like to figure out where the failure is happening; you did not set stub-first to yes or something?  Because that would make unbound do this on purpose (i.e. not a bug, but configured to do it); if so, set it to no.

If you use DNSSEC, you also have to give a domain-insecure for the added domains for which you add stubs, otherwise the validator starts looking for the chain of trust, eg. by sending queries for the internal domain to the internet to connect a chain of trust to it.
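
As an illustration of the two settings mentioned above, here is a small sketch that generates a `stub-zone:` clause with `stub-first: no` and the matching `domain-insecure:` lines for a list of internal zones. The helper names and the 192.0.2.53 address are hypothetical, and it assumes the generated text goes into files included from unbound.conf, with the `domain-insecure` lines under a `server:` clause.

```python
def stub_clause(zone, addrs, stub_first=False):
    """Render one stub-zone: clause; stub-first defaults to "no"."""
    lines = ["stub-zone:", f'    name: "{zone}"']
    lines += [f"    stub-addr: {addr}" for addr in addrs]
    lines.append(f"    stub-first: {'yes' if stub_first else 'no'}")
    return "\n".join(lines) + "\n"


def domain_insecure_lines(zones):
    """Render a server: clause marking each stubbed zone as domain-insecure."""
    body = "".join(f'    domain-insecure: "{zone}"\n' for zone in zones)
    return "server:\n" + body


if __name__ == "__main__":
    zones = ["example.com", "111.11.10.in-addr.arpa"]
    print(domain_insecure_lines(zones))
    for zone in zones:
        print(stub_clause(zone, ["192.0.2.53"]))
```

Generating both pieces from the same zone list keeps the stub entries and the domain-insecure entries from drifting apart when a zone is added.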

Best regards, Wouter
Comment 6 Wouter Wijngaards 2018-10-25 10:51:42 CEST
Hi Joanna,

Looking through the logs, I see 1.6.9 has the fix added, not 1.6.8, and that could specifically be the problem you describe with queries sent to the internet root server instead of local servers.
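
To check whether a running resolver is at or past 1.6.9, one option is to compare the version it reports numerically. The sketch below assumes the output of `unbound-control status` contains a line of the form `version: X.Y.Z`; the function names are made up for illustration, and passing the check only means the release is new enough, not that this particular problem is gone.

```python
import re

# First release that contains the fix, according to the comment above.
FIXED_IN = (1, 6, 9)


def unbound_version(status_text):
    """Extract (major, minor, patch) from text containing 'version: X.Y.Z'."""
    match = re.search(r"^version:\s*(\d+)\.(\d+)\.(\d+)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'version: X.Y.Z' line found")
    return tuple(int(part) for part in match.groups())


def includes_fix(status_text):
    """True if the reported version is 1.6.9 or newer."""
    return unbound_version(status_text) >= FIXED_IN


if __name__ == "__main__":
    print(includes_fix("version: 1.6.8\nverbosity: 1\n"))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 1.10.0 as older than 1.6.9.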

Best regards, Wouter