Bug 3512 - unbound incorrectly reports SERVFAIL for CAA query when there is a CNAME loop
unbound incorrectly reports SERVFAIL for CAA query when there is a CNAME loop
Status: REOPENED
Product: unbound
Classification: Unclassified
Component: server
1.6.8
x86_64 Linux
: P5 normal
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-02-21 13:01 CET by Arkadiusz Miskiewicz
Modified: 2018-03-22 18:36 CET (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Arkadiusz Miskiewicz 2018-02-21 13:01:39 CET
Example:

poczta.igielka.pl is CNAME to itself, so a loop.

Now querying unbound (on 127.0.0.1), type257 is CAA record.

Result is SERVFAIL while there should be no SERVFAIL for this.

$  dig poczta.igielka.pl type257 @127.0.0.1

; <<>> DiG 9.12.0 <<>> poczta.igielka.pl type257 @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27556
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;poczta.igielka.pl.             IN      CAA

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Feb 21 12:59:37 CET 2018
;; MSG SIZE  rcvd: 46

and for comparision querying bind 9.12.0 resolver, NOERROR
as result:

$  dig poczta.igielka.pl type257 @192.168.1.254

; <<>> DiG 9.12.0 <<>> poczta.igielka.pl type257 @192.168.1.254
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35941
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 2a2ee5ab9ae9786b7800c39e5a8d5f3045c78bbad2f40c52 (good)
;; QUESTION SECTION:
;poczta.igielka.pl.             IN      CAA

;; ANSWER SECTION:
poczta.igielka.pl.      641     IN      CNAME   poczta.igielka.pl.

;; Query time: 2 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Wed Feb 21 12:59:44 CET 2018
;; MSG SIZE  rcvd: 88
Comment 1 Arkadiusz Miskiewicz 2018-02-21 13:02:35 CET
BTW: this affects letsencrypt infrastructure as they use unbound as resolver.
Comment 2 Wouter Wijngaards 2018-02-21 13:30:07 CET
Hi Arkadiusz,

CNAME loops are SERVFAILs.  Nice of BIND to return the partial loop contents, but it is not resolvable.  For unbound, this is not specific for type CAA, it also answers like that for other types.  (Apart from a CNAME query, as you then query for the CNAME itself and it does not follow the indirection).

It is definitely a misconfiguration.  The set up of the domain name is broken.  So, somehow this is going to cause trouble of some sort.  Unbound makes SERVFAIL out of it, because it follows the CNAME, a number of times and then the maximum number of CNAME redirections is exceeded and it gives up.

So it really can be servfail, but that servfail is a problem somehow?  And what I was previously trying to get at is, someone is somehow using that CNAME loop?

Best regards, Wouter
Comment 3 Arkadiusz Miskiewicz 2018-02-21 13:50:59 CET
It's obvious misconfiguration but was seeing that bind handles it somehow and thought that this is unbound bug.

The trouble is that letsencrypt infrastructure (which uses unbound)
sees SERVFAIL and reports:

"DNS problem: SERVFAIL looking up CAA for poczta.igielka.pl"

which aborts letsencrypt certificate generation for poczta.igielka.pl (where poczta.igielka.pl is one of ~10 hostnames in single certificate).

If it didn't fail at this point then everything would succeed
as querying acme txt record works fine:

$ host -t txt _acme-challenge.poczta.igielka.pl 127.0.0.1
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

_acme-challenge.poczta.igielka.pl descriptive text "DSmPcJ6lOxL3iTpzm5i8gZiljaU4E_Zdb3ApnbUzfhY"

so entire letsencrypt validation would pass just fine, too.
Comment 4 Wouter Wijngaards 2018-02-21 14:15:11 CET
Hi Arkadiusz,

The acme-challenge record looks up fine with unbound.  So, what happens is that the absence of the CAA is considered proven with the NOERROR answer with the CNAME loop.  I understand what you are saying and I also think that differences between BIND and Unbound are interesting; such corner cases can be important (usually combined with other trouble).

So, I fixed it, by also returning the partial CNAME loop contents.  That should give a similar answer to what BIND is giving you.

Thank you for reporting it!

Best regards, Wouter
Comment 5 Wouter Wijngaards 2018-02-21 14:15:41 CET
Hi,

This is the patch, in case you need it.

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c  (revision 4543)
+++ iterator/iterator.c  (working copy)
@@ -1157,6 +1157,10 @@
         if(iq->query_restart_count > MAX_RESTART_COUNT) {
                 verbose(VERB_QUERY, "request has exceeded the maximum number"
                         " of query restarts with %d", iq->query_restart_count);
+                if(iq->response) {
+                        iq->state = FINISHED_STATE;
+                        return 1;
+                }
                 return error_response(qstate, id, LDNS_RCODE_SERVFAIL);
         }
 
@@ -1246,6 +1250,10 @@
                         iq->qchase.qname_len = slen;
                         /* This *is* a query restart, even if it is a cheap 
                          * one. */
+                        msg->rep->an_numrrsets = 0;
+                        msg->rep->ns_numrrsets = 0;
+                        msg->rep->ar_numrrsets = 0;
+                        msg->rep->rrset_count = 0;
                         iq->dp = NULL;
                         iq->refetch_glue = 0;
                         iq->query_restart_count++;
@@ -2739,6 +2747,10 @@
                 if (qstate->env->cfg->qname_minimisation)
                         iq->minimisation_state = INIT_MINIMISE_STATE;
                 /* Clear the query state, since this is a query restart. */
+                iq->response->rep->an_numrrsets = 0;
+                iq->response->rep->ns_numrrsets = 0;
+                iq->response->rep->ar_numrrsets = 0;
+                iq->response->rep->rrset_count = 0;
                 iq->deleg_msg = NULL;
                 iq->dp = NULL;
                 iq->dsns_point = NULL;
Comment 6 Arkadiusz Miskiewicz 2018-02-21 14:39:03 CET
Works here, thanks.

$ dig poczta.igielka.pl type257 @127.0.0.1

; <<>> DiG 9.12.0 <<>> poczta.igielka.pl type257 @127.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4806
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;poczta.igielka.pl.             IN      CAA

;; ANSWER SECTION:
poczta.igielka.pl.      3575    IN      CNAME   poczta.igielka.pl.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Feb 21 14:38:31 CET 2018
;; MSG SIZE  rcvd: 60
Comment 7 Wouter Wijngaards 2018-02-21 15:05:42 CET
Hi Arkadiusz,

The validator needs to be fixed for CNAME loops, also apply this patch.
(or use the code repository version, or next release for unbound).

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c  (revision 4546)
+++ iterator/iterator.c  (working copy)
@@ -1257,6 +1257,7 @@
                         msg->rep->ns_numrrsets = 0;
                         msg->rep->ar_numrrsets = 0;
                         msg->rep->rrset_count = 0;
+                        iq->response = msg;
                         iq->dp = NULL;
                         iq->refetch_glue = 0;
                         iq->query_restart_count++;
Index: validator/validator.c
===================================================================
--- validator/validator.c        (revision 4543)
+++ validator/validator.c        (working copy)
@@ -1529,6 +1529,22 @@
                 if(verbosity >= VERB_ALGO)
                         log_dns_msg("chased extract", &vq->qchase, 
                                 vq->chase_reply);
+                /* we skipped cnames, and now the reply is empty, is this
+                 * a CNAME loop? */
+                if(vq->rrset_skip > 0 && vq->chase_reply->rrset_count == 0) {
+                        if(reply_find_rrset_section_an(vq->orig_msg->rep,
+                                lookup_name, lookup_len, LDNS_RR_TYPE_CNAME,
+                                vq->qchase.qclass)) {
+                                if(anchor) {
+                                        lock_basic_unlock(&anchor->lock);
+                                }
+                                verbose(VERB_ALGO, "validator: encountered "
+                                        "CNAME loop - terminating");
+                                vq->chase_reply->security = vq->orig_msg->rep->security;
+                                vq->state = VAL_FINISHED_STATE;
+                                return 1;
+                        }
+                }
         }
 
         vq->key_entry = key_cache_obtain(ve->kcache, lookup_name, lookup_len,
Comment 8 Roland Shoemaker 2018-02-27 21:05:06 CET
(In reply to Wouter Wijngaards from comment #4)
> Hi Arkadiusz,
> 
> The acme-challenge record looks up fine with unbound.  So, what happens is
> that the absence of the CAA is considered proven with the NOERROR answer
> with the CNAME loop.  I understand what you are saying and I also think that
> differences between BIND and Unbound are interesting; such corner cases can
> be important (usually combined with other trouble).
> 
> So, I fixed it, by also returning the partial CNAME loop contents.  That
> should give a similar answer to what BIND is giving you.
> 
> Thank you for reporting it!
> 
> Best regards, Wouter

Hey Wouter,

I see you've already closed this issue out but I'd just like to add a counter argument. I believe that fixing this is the incorrect solution, based on my reading of the relevant DNS RFCs I would argue that the bug here is in fact in the behaviour BIND exhibits.

RFC 1034 says that "CNAME chains should be followed and CNAME loops signalled as an error" and "Alias loops and aliases which point to non-existent names should be caught and an error condition passed back to the client" and RFC 1035 says that NOERROR should signal "No error condition". Given this it follows that the resolver should not use NOERROR when it hits a CNAME loop. From my reading the previously used RCODE, SERVFAIL (defined in 1035 as "The name server was unable to process this query due to a problem with the name server"), was perfectly correct and it is BIND that was behaving incorrectly.

Beyond standards considerations I don't see the use in returning the offending CNAME, or in a more complicated case the whole chain of CNAMEs, to the client with a NOERROR RCODE. The RCODE implies to the client that everything went fine and without incorporating loop detection logic into the client itself the CNAMEs are of no use other than to create further confusion (and possibly extra queries).

I would strongly argue that this change should be backed out and the previous behaviour reinstated.

Thanks,
Roland
Comment 9 Wouter Wijngaards 2018-02-28 10:08:54 CET
Hi Roland,

I think you are correct, that is what the RFC says.  As such, I guess those changes have to get reverted.

BIND behaves differently though.  Perhaps because they have good reason to?  And then it is best for Unbound to do the same thing, creating DNS servers that do the same thing for this case.

Best regards, Wouter
Comment 10 Roland Shoemaker 2018-02-28 21:31:07 CET
(In reply to Wouter Wijngaards from comment #9)
> Hi Roland,
> 
> I think you are correct, that is what the RFC says.  As such, I guess those
> changes have to get reverted.
> 
> BIND behaves differently though.  Perhaps because they have good reason to? 
> And then it is best for Unbound to do the same thing, creating DNS servers
> that do the same thing for this case.
> 
> Best regards, Wouter

I think reverting these changes is the best path forward. I'm writing up a bug on the BIND tracker now and will pass on a link to it once I'm done so we can understand why they think this behaviour is correct.

In general I think having ecosystem homogeneity for resolvers isn't the best thing in this kind of situation. Attempting to have consistent behaviour is a good thing, but only when there is a good reason to do so. While BIND is, to some degree, the main resolver reference it is by no means a golden implementation, which is why we have things like RFC 1912.

Vriendelijke groeten, Roland.
Comment 11 Wouter Wijngaards 2018-03-06 09:33:47 CET
Hi,

Because we have different views expressed here, and also other implementations have different outputs, we are not going to release this fix in the upcoming release.  We may still fix this, to have similar output as other dns servers (but which one?).

So, it is not that we don't have code, but we don't have a clear answer what to do.  Thus the RFC quoted seems to be the most direct answer.

Best regards, Wouter
Comment 12 Roland Shoemaker 2018-03-20 16:49:46 CET
(In reply to Wouter Wijngaards from comment #11)
> Hi,
> 
> Because we have different views expressed here, and also other
> implementations have different outputs, we are not going to release this fix
> in the upcoming release.  We may still fix this, to have similar output as
> other dns servers (but which one?).
> 
> So, it is not that we don't have code, but we don't have a clear answer what
> to do.  Thus the RFC quoted seems to be the most direct answer.
> 
> Best regards, Wouter

Sorry it took me so long to get back to this. I've narrowed down the issue in BIND, it actually only shows this behavior in single record loops (i.e. loop.com CNAME loop.com) not with multi record loops.

I've opened an issue on BIND to see if there is a reason they have to do this or if it is indeed a bug: https://gitlab.isc.org/isc-projects/bind9/issues/174
Comment 13 Roland Shoemaker 2018-03-22 18:36:06 CET
(In reply to Roland Shoemaker from comment #12)
> (In reply to Wouter Wijngaards from comment #11)
> > Hi,
> > 
> > Because we have different views expressed here, and also other
> > implementations have different outputs, we are not going to release this fix
> > in the upcoming release.  We may still fix this, to have similar output as
> > other dns servers (but which one?).
> > 
> > So, it is not that we don't have code, but we don't have a clear answer what
> > to do.  Thus the RFC quoted seems to be the most direct answer.
> > 
> > Best regards, Wouter
> 
> Sorry it took me so long to get back to this. I've narrowed down the issue
> in BIND, it actually only shows this behavior in single record loops (i.e.
> loop.com CNAME loop.com) not with multi record loops.
> 
> I've opened an issue on BIND to see if there is a reason they have to do
> this or if it is indeed a bug:
> https://gitlab.isc.org/isc-projects/bind9/issues/174

This has been marked a bug in BIND, so assuming this change has been backed out of Unbound then nothing further needs to be done.