Bug 2882 - Unbound behaviour changes (wrong) when domain-insecure is set for a stub zone with multiple stub-addr(s)
Unbound behaviour changes (wrong) when domain-insecure is set for a stub zone...
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.5.8
x86_64 Linux
: P5 minor
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-11-14 09:30 CET by Richard Arends
Modified: 2017-11-14 11:18 CET (History)
2 users (show)

See Also:


Attachments
unbound log on level 4 (46.17 KB, text/plain)
2017-11-14 10:47 CET, Richard Arends
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Arends 2017-11-14 09:30:46 CET
Hello,

We run Unbound version 1.5.8-1ubuntu1 (the default package on a Ubuntu 16.04 system) and have unbound in front of several zones that are hosted on a PowerDNS cluster. Every zone is configured like this:

domain-insecure: "example.com"

stub-zone:
  name: "example.com"
  stub-addr: 192.168.1.207
  stub-addr: 192.168.1.211
  stub-addr: 192.168.1.213

When all PowerDNS nodes are up, every works like expected. But when 1 of our PowerDNS servers went down, our users noticed long delays in the DNS queries. 

After debugging we noticed that when 'domain-insecure is set for a zone, Unbound always contacts every stub-addr, even when it allready received an answer. 

When domain-insecure is removed for a zone, we see the expected behaviour, only 1 node is queried, if that node is down, Unbound continues to the next node. Besides that, in this sitaution, Ubound also balances over the stub-addr(s)

This is the output from my debug session with tcpdump

OOD
$ sudo /usr/sbin/unbound-control flush_zone example.com && dig @127.0.0.1 t2890.example.com


tcpdump output shows that node .211 answers the query

08:54:45.907005 IP 192.168.1.173.37540 > 192.168.1.211.53: 38565% [1au] A? t2890.example.com. (55)
08:54:45.910153 IP 192.168.1.211.53 > 192.168.1.173.37540: 38565*- 1/0/1 A 192.168.1.64 (71)

$ sudo /usr/sbin/unbound-control flush_zone example.com && dig @127.0.0.1 t2890.example.com

tcpdump output shows that node .207 denied the connection and unbound shifts (after trying once again) to the
next node (.211 is this case).

08:54:49.814655 IP 192.168.1.173.33537 > 192.168.1.207.53: 47904% [1au] A? t2890.example.com. (55)
08:54:49.815090 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
08:54:50.191234 IP 192.168.1.173.33525 > 192.168.1.207.53: 39245% [1au] A? t2890.example.com. (55)
08:54:50.191810 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
08:54:50.567922 IP 192.168.1.173.33732 > 192.168.1.211.53: 518% [1au] A? t2890.example.com. (55)
08:54:50.568326 IP 192.168.1.211.53 > 192.168.1.173.33732: 518*- 1/0/1 A 192.168.1.64 (71)




WRONG

domain-insecure: "example.com" is set in the config

$ sudo /usr/sbin/unbound-control flush_zone example.com && dig @127.0.0.1 t2890.example.com

tcpdump output show that node .211 is queried and answers, but unbound continues to node .207 who denies
the connection. After 4 tries, unbound continues to node .213 who answers the query, thus now unbound should
have the same information 2 times. But even then it continues to query node .207 again 4 times and after that
it queries .211 again and gives me the answer i was waiting for.

09:02:23.362310 IP 192.168.1.173.33672 > 192.168.1.211.53: 34346% [1au] A? t2890.example.com. (55)
09:02:23.365660 IP 192.168.1.211.53 > 192.168.1.173.33672: 34346*- 1/0/1 A 192.168.1.64 (71)
09:02:23.365800 IP 192.168.1.173.44955 > 192.168.1.207.53: 51983% [1au] A? t2890.example.com. (55)
09:02:23.366313 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:23.742291 IP 192.168.1.173.55618 > 192.168.1.207.53: 52209% [1au] A? t2890.example.com. (55)
09:02:23.742750 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:24.118838 IP 192.168.1.173.46272 > 192.168.1.207.53: 32490% [1au] A? t2890.example.com. (55)
09:02:24.119347 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:24.871809 IP 192.168.1.173.51673 > 192.168.1.207.53: 17079% [1au] A? t2890.example.com. (55)
09:02:24.872129 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:25.624835 IP 192.168.1.173.36865 > 192.168.1.213.53: 29928% [1au] A? t2890.example.com. (55)
09:02:25.628089 IP 192.168.1.213.53 > 192.168.1.173.36865: 29928*- 1/0/1 A 192.168.1.64 (71)
09:02:25.628213 IP 192.168.1.173.39785 > 192.168.1.207.53: 972% [1au] A? t2890.example.com. (55)
09:02:25.628556 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:27.133996 IP 192.168.1.173.38741 > 192.168.1.207.53: 58941% [1au] A? t2890.example.com. (55)
09:02:27.134436 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:28.638439 IP 192.168.1.173.51034 > 192.168.1.207.53: 13991% [1au] A? t2890.example.com. (55)
09:02:28.638816 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:31.649662 IP 192.168.1.173.41072 > 192.168.1.207.53: 23120% [1au] A? t2890.example.com. (55)
09:02:31.650082 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 91
09:02:34.659535 IP 192.168.1.173.43957 > 192.168.1.207.53: 49885 A? t2890.example.com. (44)
09:02:34.659879 IP 192.168.1.207 > 192.168.1.173: ICMP 192.168.1.207 udp port 53 unreachable, length 80
09:02:35.660884 IP 192.168.1.173.60780 > 192.168.1.211.53: 61590% [1au] A? t2890.example.com. (55)
09:02:35.661470 IP 192.168.1.211.53 > 192.168.1.173.60780: 61590*- 1/0/1 A 192.168.1.64 (71)

This sequence is repeated every time for a query that is not present in the cache.

Regards,
Richard Arends.
Comment 1 Wouter Wijngaards 2017-11-14 09:44:53 CET
Hi Richard,

What is happening is that only one server actually works.  That server is down, and now no server works any more.  This causes unbound to try all of them to find a working server.

With domain-insecure you change unbound's way to query.  If you upgrade (to 1.6.x) unbound will also set EDNS options differently.  That will cause unbound to get a different answer.

Another thing that may happen is that you have a stub-zone but the powerdns servers are serving recursively (it is powerdns-recursor).  And then unbound retries to find the actual authority server.

So, likely, the powerdns servers are serving wrong content or not replying to EDNS (firewall in front of them?) or they are replying without DNSSEC information but unbound expects to have DNSSEC information (eg, there is a DS record).  And then unbound starts searching for the DNSSEC information.

What is actually happening, you can get unbound to print logs by increasing verbosity.  Send queries with 'dig' to the servers in question, do those answers look appropriate (and it seems that only the first one works).  I am pretty sure the first one only works, because you say 'normally only the first one gets queries', but unbound's default behaviour is to randomly distribute over the working servers.  So something is wrong with the other servers and that is why unbound does not send queries there.

Best regards, Wouter
Comment 2 Richard Arends 2017-11-14 10:11:02 CET
(In reply to Wouter Wijngaards from comment #1)

Hi Wouter,

Thank you for your fast reply!

> What is happening is that only one server actually works.  That server is
> down, and now no server works any more.  This causes unbound to try all of
> them to find a working server.

No that's not the case. The PowerDNS servers all look at the same MySQL database cluster, are configured exactly the same and are answering queries for zones where they are authoritative for.


> With domain-insecure you change unbound's way to query.  If you upgrade (to
> 1.6.x) unbound will also set EDNS options differently.  That will cause
> unbound to get a different answer.
> 
> Another thing that may happen is that you have a stub-zone but the powerdns
> servers are serving recursively (it is powerdns-recursor).  And then unbound
> retries to find the actual authority server.


The servers are authoritative only, that's why we use stub zones (No Recurse bit is set) instead of a forward zone where recursion is allowed.

> 
> So, likely, the powerdns servers are serving wrong content or not replying
> to EDNS (firewall in front of them?) or they are replying without DNSSEC
> information but unbound expects to have DNSSEC information (eg, there is a
> DS record).  And then unbound starts searching for the DNSSEC information.
> 
> What is actually happening, you can get unbound to print logs by increasing
> verbosity.  Send queries with 'dig' to the servers in question, do those
> answers look appropriate (and it seems that only the first one works).  I am
> pretty sure the first one only works, because you say 'normally only the
> first one gets queries', but unbound's default behaviour is to randomly
> distribute over the working servers.  So something is wrong with the other
> servers and that is why unbound does not send queries there.

When i remove domain-insecure for a zone, Unbound starts to query like i expect. It tries one server and when an answer is received, it's done. The next time it queries the other server in the stub-addr 'pool', so it's looping as it should.

Is there a command you want to see at a certain loglevel?

With regards,
Richard.
Comment 3 Wouter Wijngaards 2017-11-14 10:19:12 CET
Hi Richard,

Okay, so that part looks good.

Do you have caps-for-id set and powerdns refuses to do that correctly?

The retries and finally settling on the original source, looks like unbound is either a) searching for a better server, or b) validation failure.

b is weird because you turned off dnssec with domain-insecure.

So it must be A.  Unbound scores servers in various ways, but I didn't think that would react to domain-insecure.  But domain-insecure makes unbound (after some version) ask the questions differently.  Perhaps there is some sort of EDNS compliance issue, and with domain-insecure, unbound receives an answer with bits in the flags or EDNS contents missing that should not be happening, and then starts searching for a better server.

Although the queries and responses look identical in what you show.

It is strange how domain-insecure seems to change things.

If you set unbound's verbosity to level 4 and show me the logs (of the failing case) then it likely prints out what is going on, eg. why it is rejecting the answer it receives.

Best regards, Wouter
Comment 4 Richard Arends 2017-11-14 10:47:15 CET
Created attachment 472 [details]
unbound log on level 4
Comment 5 Richard Arends 2017-11-14 10:48:44 CET
(In reply to Wouter Wijngaards from comment #3)

Hi Wouter,

> b is weird because you turned off dnssec with domain-insecure.

Indeed...

> It is strange how domain-insecure seems to change things.

Very indeed... :)

> If you set unbound's verbosity to level 4 and show me the logs (of the
> failing case) then it likely prints out what is going on, eg. why it is
> rejecting the answer it receives.

I attached the log to this bug.

With regards,
Richard Arends.
Comment 6 Richard Arends 2017-11-14 10:50:49 CET
(In reply to Wouter Wijngaards from comment #3)

Hi Wouter,


Forgot to answer this question:

> Do you have caps-for-id set and powerdns refuses to do that correctly?

No, we do not have 'use-caps-for-id' set.

With regards,
Richard Arends.
Comment 7 Wouter Wijngaards 2017-11-14 11:03:16 CET
Hi Richard,

Thanks for the debug logs they printed out what was happening in the search logic.  Unbound was searching for DNSSEC information, this patch fixes this.  The patch is also in the code base.

Thanks for the report!
Wouter

Index: iterator/iter_utils.c
===================================================================
--- iterator/iter_utils.c	(revision 4403)
+++ iterator/iter_utils.c	(working copy)
@@ -656,6 +656,11 @@
 	/* a trust anchor exists with this name, RRSIGs expected */
 	if((a=anchor_find(env->anchors, dp->name, dp->namelabs, dp->namelen,
 		dclass))) {
+		if(a->numDS == 0 && a->numDNSKEY == 0) {
+			/* insecure trust point */
+			lock_basic_unlock(&a->lock);
+			return 0;
+		}
 		lock_basic_unlock(&a->lock);
 		return 1;
 	}
Comment 8 Richard Arends 2017-11-14 11:18:29 CET
Hi Wouter,

Thanks, we will upgrade to the latest version very soon.

With regards,
Richard.