Bug 4188 - IPv6 forwarders without ipv6 result in SERVFAIL
IPv6 forwarders without ipv6 result in SERVFAIL
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.7.3
x86_64 Linux
: P5 enhancement
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-09-27 03:47 CEST by Ian Wienand
Modified: 2018-09-27 12:59 CEST (History)
2 users (show)

See Also:


Attachments
Debugging logging around failed query (93.99 KB, application/gzip)
2018-09-27 03:55 CEST, Ian Wienand
Details
query sample with qname patch applied (847.36 KB, text/x-log)
2018-09-27 11:56 CEST, Ian Wienand
Details
qname.patch (926 bytes, patch)
2018-09-27 11:58 CEST, Ian Wienand
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ian Wienand 2018-09-27 03:47:40 CEST
Hello,

Our config management sets up our forwaders as

---
forward-zone:
  name: "."
  forward-addr: 2620:0:ccc::2
  forward-addr: 2001:4860:4860::8888
  forward-addr: 208.67.222.222
  forward-addr: 8.8.8.8
---

After updating some of our CI hosts to Fedora 28, with unbound 1.7.3, we're noticing that we start out with very intermittent DNS.

e.g. for the ping below I have attached everything relating to what i can see for this hostname from the debug logs.

---
[root@ianw-test unbound]# ping www.abc.net.au
ping: www.abc.net.au: Name or service not known
---

Ultimately we end up with

---
Sep 27 01:14:58 unbound[1028:2] info: 127.0.0.1 www.abc.net.au. A IN
Sep 27 01:14:58 unbound[1028:2] info: 127.0.0.1 www.abc.net.au. A IN SERVFAIL 0.000000 1 32
Sep 27 01:14:58 unbound[1028:2] info: 127.0.0.1 www.abc.net.au. AAAA IN
Sep 27 01:14:58 unbound[1028:2] info: 127.0.0.1 www.abc.net.au. AAAA IN SERVFAIL 0.000000 1 32
---

I can see it try to send the request over ipv6

---
Sep 27 00:21:56 unbound[1028:0] info: auth_zone . query www.abc.net.au. A, domain NULL notexact notexist, ce NULL, rrset NULL
Sep 27 00:21:56 unbound[1028:0] debug: attempt to get extra 3 targets
Sep 27 00:21:56 unbound[1028:0] debug: servselect ip6 2620:0:ccc::2 port 53 (len 28)
Sep 27 00:21:56 unbound[1028:0] debug:    rtt=376
Sep 27 00:21:56 unbound[1028:0] debug: servselect ip6 2001:4860:4860::8888 port 53 (len 28)
Sep 27 00:21:56 unbound[1028:0] debug:    rtt=376
Sep 27 00:21:56 unbound[1028:0] debug: servselect ip4 208.67.222.222 port 53 (len 16)
Sep 27 00:21:56 unbound[1028:0] debug:    rtt=38
Sep 27 00:21:56 unbound[1028:0] debug: servselect ip4 8.8.8.8 port 53 (len 16)
Sep 27 00:21:56 unbound[1028:0] debug:    rtt=41
Sep 27 00:21:56 unbound[1028:0] debug: selrtt 38
Sep 27 00:21:56 unbound[1028:0] info: sending query: www.abc.net.au. A IN
Sep 27 00:21:56 unbound[1028:0] debug: sending to target: <.> 2001:4860:4860::8888#53
Sep 27 00:21:56 unbound[1028:0] debug: dnssec status: not expected
Sep 27 00:21:56 unbound[1028:0] debug: EDNS lookup known=0 vs=0
Sep 27 00:21:56 unbound[1028:0] debug: serviced query UDP timeout=376 msec
Sep 27 00:21:56 unbound[1028:0] debug: inserted new pending reply id=8054
Sep 27 00:21:56 unbound[1028:0] debug: opened UDP if=0 port=57760
Sep 27 00:21:56 unbound[1028:0] debug: comm point start listening 42
Sep 27 00:21:56 unbound[1028:0] notice: sendto failed: Network is unreachable
Sep 27 00:21:56 unbound[1028:0] notice: remote address is ip6 2001:4860:4860::8888 port 53 (len 28)
Sep 27 00:21:56 unbound[1028:0] debug: close of port 57760
Sep 27 00:21:56 unbound[1028:0] debug: close fd 42
Sep 27 00:21:56 unbound[1028:0] info: error sending query to auth server ip6 2001:4860:4860::8888 port 53 (len 28)
---

but then we see it passed on the the ipv4 forwarder

---
Sep 27 00:21:56 unbound[1028:0] info: sending query: www.abc.net.au. A IN
Sep 27 00:21:56 unbound[1028:0] debug: sending to target: <.> 208.67.222.222#53
Sep 27 00:21:56 unbound[1028:0] debug: dnssec status: not expected
Sep 27 00:21:56 unbound[1028:0] debug: EDNS lookup known=1 vs=0
Sep 27 00:21:56 unbound[1028:0] debug: serviced query UDP timeout=50 msec
Sep 27 00:21:56 unbound[1028:0] debug: inserted new pending reply id=c65d
Sep 27 00:21:56 unbound[1028:0] debug: opened UDP if=0 port=60714
Sep 27 00:21:56 unbound[1028:0] debug: comm point start listening 42
Sep 27 00:21:56 unbound[1028:0] debug: mesh_run: iterator module exit state is module_wait_reply
---

and then I start getting lost :)

Now it is our fault that we're putting ipv6 forwarders in the list when our host doesn't have ipv6 connectivity (some providers we use do, some don't, which is how we end up with things like this).  However this is the first we're seeing of this problem across our Centos, Ubuntu and a few other suse and other hosts.  It does seem from the stats that unbound knows the ipv6 interfaces are not good choices to use, so it's unclear why they're causing failures when it can fall back to ipv4.

Thanks for any insights
Comment 1 Ian Wienand 2018-09-27 03:55:14 CEST
Created attachment 526 [details]
Debugging logging around failed query

This the full trace of everything I can see around the failed "ping www.abc.net.au" mentioned in the bug
Comment 2 Ian Wienand 2018-09-27 04:22:52 CEST
OK, after some fiddling I went back to a Fedora 27 image with 1.7.2, and also shoe-horned in some 1.6.6 rpms and it happens there too.  So just ruling out that it is some exclusive issue.  We may have just not been noticing this as much.
Comment 3 Wouter Wijngaards 2018-09-27 10:18:14 CEST
Hi Ian,

I believe it is caused by the qname minimisation making resends.  The IPv6 addresses cause failures during that, and the counter for max resends is hit.  The resends from qname minimisation should not happen because you are forwarding.  But the logic resets qname minimisation in some causes, forgetting it was not doing qname minimisation because of the forwarding, and then ends up resending queries.  The patch here should fix this, but I cannot reproduce easily, because IPv6 works for me.

Does this patch fix the problem?  It should not resend the query because the IPv6 failed and that re-enabled qname minimisation, making it reject the IPv4 response.

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 4915)
+++ iterator/iterator.c	(working copy)
@@ -2174,7 +2174,8 @@
 		return 0;
 	}
 
-	if(iq->minimisation_state == INIT_MINIMISE_STATE) {
+	if(iq->minimisation_state == INIT_MINIMISE_STATE
+		&& !(iq->chase_flags & BIT_RD)) {
 		/* (Re)set qinfo_out to (new) delegation point, except when
 		 * qinfo_out is already a subdomain of dp. This happens when
 		 * increasing by more than one label at once (QNAMEs with more
@@ -2715,7 +2716,8 @@
 			sock_list_insert(&qstate->reply_origin, 
 				&qstate->reply->addr, qstate->reply->addrlen, 
 				qstate->region);
-		if(iq->minimisation_state != DONOT_MINIMISE_STATE) {
+		if(iq->minimisation_state != DONOT_MINIMISE_STATE
+			&& !(iq->chase_flags & BIT_RD)) {
 			if(FLAGS_GET_RCODE(iq->response->rep->flags) != 
 				LDNS_RCODE_NOERROR) {
 				if(qstate->env->cfg->qname_minimisation_strict)
Comment 4 Ian Wienand 2018-09-27 11:55:19 CEST
> Does this patch fix the problem?

I think it does!  I applied this to the Fedora packages (attached) and previously, when I booted I could very reliably get the error status querying a host.  With the patch applied I do not seem to get any errors.

I've attached the log from a boot with the patch applied to see if it matches up with what you think it should
Comment 5 Ian Wienand 2018-09-27 11:56:14 CEST
Created attachment 527 [details]
query sample with qname patch applied
Comment 6 Ian Wienand 2018-09-27 11:58:08 CEST
Created attachment 528 [details]
qname.patch

The patch I applied to the Fedora 1.7.3-6 unbound package
Comment 7 Wouter Wijngaards 2018-09-27 12:59:31 CEST
Hi Ian,

Yes that log shows that the bug has been fixed by the patch!  Thanks for the bugreport and the details.

The patch is also in the code repository, for future unbound releases.

Best regards, Wouter