Bugzilla – Bug 4193
SERVFAIL is cached even if a valid cache entry exists
Last modified: 2018-12-10 09:30:40 CET
Created attachment 530 [details]
full debug log (verbosity:10)

Our unbound server (C) is configured as a cache server to a backend recursive unbound (R), which does dns64.

- so-reuseport: no is set (on C)
- a query to R times out, and C retries the query
- R may reply multiple times (to the original query and to the retries)
- after finishing one query successfully, C replies NOERROR to the client and caches the answer
- when another query times out and 'some other unknown conditions' are met, C replies SERVFAIL to the client and caches that answer, overwriting the previous NOERROR entry even before it expires

-- incoming request from dig
Oct 10 22:04:44 unbound[15686:1] info: ::1 www.amazon.co.jp. AAAA IN
Oct 10 22:04:44 unbound[15686:1] debug: udp request from ip6 ::1 port 40165 (len 28)
-- query to the backend recursive server
Oct 10 22:04:44 unbound[15686:1] info: sending query: www.amazon.co.jp. AAAA IN
Oct 10 22:04:44 unbound[15686:1] debug: sending to target: <.> XX.YYY.Z.WWW#53
Oct 10 22:04:44 unbound[15686:1] debug: dnssec status: not expected
Oct 10 22:04:44 unbound[15686:1] debug: EDNS lookup known=1 vs=0
Oct 10 22:04:44 unbound[15686:1] debug: serviced query UDP timeout=228 msec
Oct 10 22:04:44 unbound[15686:1] debug: inserted new pending reply id=a308
Oct 10 22:04:44 unbound[15686:1] debug: opened UDP if=0 port=58003
-- times out
Oct 10 22:04:44 unbound[15686:1] debug: query response was timeout
Oct 10 22:04:44 unbound[15686:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Oct 10 22:04:44 unbound[15686:1] info: processQueryTargets: www.amazon.co.jp. AAAA IN
Oct 10 22:04:44 unbound[15686:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 1
Oct 10 22:04:44 unbound[15686:1] info: DelegationPoint<.>: 0 names (0 missing), 1 addrs (1 result, 0 avail) parentNS
Oct 10 22:04:44 unbound[15686:1] debug: ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 10 22:04:44 unbound[15686:1] debug: attempt to get extra 3 targets
Oct 10 22:04:44 unbound[15686:1] debug: servselect ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 10 22:04:44 unbound[15686:1] debug: rtt=456
Oct 10 22:04:44 unbound[15686:1] debug: selrtt 456
----- several retries... -----
Oct 10 22:04:58 unbound[15686:0] info: sending query: www.amazon.co.jp. AAAA IN
Oct 10 22:04:58 unbound[15686:0] debug: sending to target: <.> XX.YYY.Z.WWW#53
Oct 10 22:04:58 unbound[15686:0] debug: dnssec status: not expected
Oct 10 22:04:58 unbound[15686:0] debug: EDNS lookup known=1 vs=0
Oct 10 22:04:58 unbound[15686:0] debug: serviced query UDP timeout=3648 msec
Oct 10 22:04:58 unbound[15686:0] debug: inserted new pending reply id=0997
-- answer comes
Oct 10 22:05:02 unbound[15686:0] debug: Incoming reply id = 0997
Oct 10 22:05:02 unbound[15686:0] debug: Incoming reply addr = ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 10 22:05:02 unbound[15686:0] debug: lookup size is 1 entries
Oct 10 22:05:02 unbound[15686:0] debug: received udp reply.
-- resolve CNAME chain...
Oct 10 22:05:02 unbound[15686:0] info: sending query: dtioykqj1u8de.cloudfront.net. AAAA IN
Oct 10 22:05:02 unbound[15686:0] debug: sending to target: <.> XX.YYY.Z.WWW#53
Oct 10 22:05:02 unbound[15686:0] debug: dnssec status: not expected
Oct 10 22:05:02 unbound[15686:0] debug: EDNS lookup known=1 vs=0
Oct 10 22:05:02 unbound[15686:0] debug: serviced query UDP timeout=3701 msec
Oct 10 22:05:02 unbound[15686:0] debug: inserted new pending reply id=dc65
Oct 10 22:05:02 unbound[15686:0] debug: opened UDP if=0 port=38624
-- receives AAAA from the backend
Oct 10 22:05:02 unbound[15686:0] debug: answer cb
Oct 10 22:05:02 unbound[15686:0] debug: Incoming reply id = dc65
Oct 10 22:05:02 unbound[15686:0] debug: Incoming reply addr = ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 10 22:05:02 unbound[15686:0] debug: lookup size is 1 entries
-- query finishes
Oct 10 22:05:02 unbound[15686:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
Oct 10 22:05:02 unbound[15686:0] info: finishing processing for www.amazon.co.jp. AAAA IN
Oct 10 22:05:02 unbound[15686:0] debug: prepending 2 rrsets
Oct 10 22:05:02 unbound[15686:0] debug: mesh_run: iterator module exit state is module_finished
Oct 10 22:05:02 unbound[15686:0] debug: query took 2.299876 sec
Oct 10 22:05:02 unbound[15686:0] debug: query took 7.300139 sec
-- some other pending queries exist...??
Oct 10 22:05:02 unbound[15686:1] debug: timeout udp
Oct 10 22:05:02 unbound[15686:1] debug: svcd callbacks start
Oct 10 22:05:02 unbound[15686:1] debug: worker svcd callback for qstate 0x7f9ef864d7c0
Oct 10 22:05:02 unbound[15686:1] debug: mesh_run: start
Oct 10 22:05:02 unbound[15686:1] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Oct 10 22:05:02 unbound[15686:1] info: iterator operate: query www.amazon.co.jp. AAAA IN
Oct 10 22:05:02 unbound[15686:1] debug: process_response: new external response event
Oct 10 22:05:02 unbound[15686:1] debug: iter_handle processing q with state QUERY RESPONSE STATE
Oct 10 22:05:02 unbound[15686:1] debug: query response was timeout
Oct 10 22:05:02 unbound[15686:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Oct 10 22:05:02 unbound[15686:1] info: processQueryTargets: www.amazon.co.jp. AAAA IN
Oct 10 22:05:02 unbound[15686:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 5
Oct 10 22:05:02 unbound[15686:1] info: DelegationPoint<.>: 0 names (0 missing), 1 addrs (0 result, 0 avail) parentNS
Oct 10 22:05:02 unbound[15686:1] debug: ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 10 22:05:02 unbound[15686:1] debug: No more query targets, attempting last resort
Oct 10 22:05:02 unbound[15686:1] debug: configured stub or forward servers failed
-- returning SERVFAIL
Oct 10 22:05:02 unbound[15686:1] debug: store error response in message cache
Oct 10 22:05:02 unbound[15686:1] debug: return error response SERVFAIL
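For reference, the setup described above corresponds roughly to the following unbound.conf on C. This is a sketch reconstructed from the report, not the actual configuration; XX.YYY.Z.WWW is the redacted backend address from the log, and verbosity is raised only for the attached debug output:

server:
	verbosity: 10        # matches the attached debug log
	num-threads: 2       # two worker threads, as in the log (thread ids 0 and 1)
	so-reuseport: no
	prefetch: no         # default; confirmed later in this thread

forward-zone:
	name: "."
	forward-addr: XX.YYY.Z.WWW   # placeholder for the backend recursive server R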
dig output corresponding to the log is:

$ dig @::1 www.amazon.co.jp. AAAA

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6 <<>> @::1 www.amazon.co.jp. AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3500
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.amazon.co.jp.              IN      AAAA

;; ANSWER SECTION:
www.amazon.co.jp.       1793    IN      CNAME   www.cdn.amazon.co.jp.
www.cdn.amazon.co.jp.   60      IN      CNAME   dtioykqj1u8de.cloudfront.net.
dtioykqj1u8de.cloudfront.net. 60 IN     AAAA    64:ff9b::d21:d0f0

;; Query time: 2300 msec
;; SERVER: ::1#53(::1)
;; WHEN: Wed Oct 10 22:05:02 2018
;; MSG SIZE  rcvd: 126

several seconds later:

$ dig @::1 www.amazon.co.jp. AAAA

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6 <<>> @::1 www.amazon.co.jp. AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 38874
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.amazon.co.jp.              IN      AAAA

;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Wed Oct 10 22:05:07 2018
;; MSG SIZE  rcvd: 34
Hi Minoura,

I think the issue is a bug in qname minimisation that has already been fixed. It is also triggered by the timeouts and the forwarder configuration (that code path in the source has the flaw, but other places do too). These bugs have been fixed, I think in 1.8.1, and specifically also for the case with a CNAME in the response. You list 1.8.1 in the settings; are you actually running 1.8.1? If so, this bug needs more attention; if not, please upgrade to 1.8.1.

Best regards, Wouter
(In reply to Wouter Wijngaards from comment #2)
> elsewhere). These bugs have been fixed, I think, and specifically also with
> a CNAME in the response, I thought in 1.8.1. You list 1.8.1 in the
> settings, are you using 1.8.1? If so, this bug needs more attention, if
> not, please upgrade to 1.8.1.

Thanks for the reply. The attached debug log is from 1.8.0, but we confirmed the same behavior with 1.8.1, with so-reuseport: no and num-threads: 2. Thanks.
We'll try with 1.8.1 and compare the behavior from the debug log.
Still reproduces. Look at the timestamps and cache TTLs.

$ dig @::1 www.amazon.co.jp. AAAA

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6 <<>> @::1 www.amazon.co.jp. AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38678
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.amazon.co.jp.              IN      AAAA

;; ANSWER SECTION:
www.amazon.co.jp.       1793    IN      CNAME   www.cdn.amazon.co.jp.
www.cdn.amazon.co.jp.   54      IN      CNAME   dtioykqj1u8de.cloudfront.net.
dtioykqj1u8de.cloudfront.net. 60 IN     AAAA    64:ff9b::d21:d0f0

;; Query time: 1594 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Oct 29 23:15:54 2018
;; MSG SIZE  rcvd: 126

$ !!

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.68.rc1.el6 <<>> @::1 www.amazon.co.jp. AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 21292
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.amazon.co.jp.              IN      AAAA

;; Query time: 0 msec
;; SERVER: ::1#53(::1)
;; WHEN: Mon Oct 29 23:15:59 2018
;; MSG SIZE  rcvd: 34

$

Attached is the debug log; important parts (I think) are:

-- got the final answer from the backend server
Oct 29 23:15:54 unbound[20017:0] debug: Incoming reply id = 9f7a
Oct 29 23:15:54 unbound[20017:0] debug: Incoming reply addr = ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 29 23:15:54 unbound[20017:0] debug: lookup size is 1 entries
Oct 29 23:15:54 unbound[20017:0] debug: received udp reply.
Oct 29 23:15:54 unbound[20017:0] debug: udp message[85:0] 9F7A818000010001000000010D6474696F796B716A31753864650A636C6F756466726F6E74036E657400001C0001C00C001C00010000003C00100064FF9B00000000000000000D21D0F00000291000000080000000
Oct 29 23:15:54 unbound[20017:0] debug: outnet handle udp reply
Oct 29 23:15:54 unbound[20017:0] debug: measured roundtrip at 5 msec
Oct 29 23:15:54 unbound[20017:0] debug: svcd callbacks start
Oct 29 23:15:54 unbound[20017:0] debug: worker svcd callback for qstate 0x1572b30
Oct 29 23:15:54 unbound[20017:0] debug: mesh_run: start
Oct 29 23:15:54 unbound[20017:0] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_reply
Oct 29 23:15:54 unbound[20017:0] info: iterator operate: query www.amazon.co.jp. AAAA IN
Oct 29 23:15:54 unbound[20017:0] info: iterator operate: chased to dtioykqj1u8de.cloudfront.net. AAAA IN
Oct 29 23:15:54 unbound[20017:0] debug: process_response: new external response event
Oct 29 23:15:54 unbound[20017:0] info: scrub for . NS IN
Oct 29 23:15:54 unbound[20017:0] info: response for www.amazon.co.jp. AAAA IN
Oct 29 23:15:54 unbound[20017:0] info: reply from <.> XX.YYY.Z.WWW#53
Oct 29 23:15:54 unbound[20017:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
dtioykqj1u8de.cloudfront.net.   IN      AAAA
;; ANSWER SECTION:
dtioykqj1u8de.cloudfront.net.   60      IN      AAAA    64:ff9b::d21:d0f0
;; AUTHORITY SECTION:
;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 74

Oct 29 23:15:54 unbound[20017:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
Oct 29 23:15:54 unbound[20017:0] info: query response was ANSWER
Oct 29 23:15:54 unbound[20017:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
Oct 29 23:15:54 unbound[20017:0] info: finishing processing for www.amazon.co.jp. AAAA IN
Oct 29 23:15:54 unbound[20017:0] debug: prepending 2 rrsets
Oct 29 23:15:54 unbound[20017:0] debug: mesh_run: iterator module exit state is module_finished
Oct 29 23:15:54 unbound[20017:0] debug: query took 1.593732 sec
Oct 29 23:15:54 unbound[20017:0] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
Oct 29 23:15:54 unbound[20017:0] info: average recursion processing time 1.593732 sec
Oct 29 23:15:54 unbound[20017:0] info: histogram of recursion processing times
Oct 29 23:15:54 unbound[20017:0] info: [25%]=0 median[50%]=0 [75%]=0
Oct 29 23:15:54 unbound[20017:0] info: lower(secs) upper(secs) recursions
Oct 29 23:15:54 unbound[20017:0] info:    1.000000    2.000000 1
Oct 29 23:15:54 unbound[20017:0] debug: cache memory msg=66648 rrset=66864 infra=8105 val=0
Oct 29 23:15:54 unbound[20017:0] debug: svcd callbacks end
Oct 29 23:15:54 unbound[20017:0] debug: close of port 11086
Oct 29 23:15:54 unbound[20017:0] debug: close fd 20
-- on another thread a timeout occurs and SERVFAIL is cached
Oct 29 23:15:54 unbound[20017:1] debug: timeout udp
Oct 29 23:15:54 unbound[20017:1] debug: try edns1xx0 <www.amazon.co.jp.> XX.YYY.Z.WWW#53
Oct 29 23:15:54 unbound[20017:1] debug: EDNS lookup known=1 vs=0
Oct 29 23:15:54 unbound[20017:1] debug: serviced query UDP timeout=141 msec
Oct 29 23:15:54 unbound[20017:1] debug: inserted new pending reply id=8a88
Oct 29 23:15:54 unbound[20017:1] debug: opened UDP if=0 port=34898
Oct 29 23:15:54 unbound[20017:1] debug: comm point start listening 19
Oct 29 23:15:54 unbound[20017:1] debug: close of port 23868
Oct 29 23:15:54 unbound[20017:1] debug: close fd 21
Oct 29 23:15:54 unbound[20017:1] debug: timeout udp
Oct 29 23:15:54 unbound[20017:1] debug: svcd callbacks start
Oct 29 23:15:54 unbound[20017:1] debug: worker svcd callback for qstate 0x7fdf3864d7c0
Oct 29 23:15:54 unbound[20017:1] debug: mesh_run: start
Oct 29 23:15:54 unbound[20017:1] debug: iterator[module 0] operate: extstate:module_wait_reply event:module_event_noreply
Oct 29 23:15:54 unbound[20017:1] info: iterator operate: query www.amazon.co.jp. AAAA IN
Oct 29 23:15:54 unbound[20017:1] debug: process_response: new external response event
Oct 29 23:15:54 unbound[20017:1] debug: iter_handle processing q with state QUERY RESPONSE STATE
Oct 29 23:15:54 unbound[20017:1] debug: query response was timeout
Oct 29 23:15:54 unbound[20017:1] debug: iter_handle processing q with state QUERY TARGETS STATE
Oct 29 23:15:54 unbound[20017:1] info: processQueryTargets: www.amazon.co.jp. AAAA IN
Oct 29 23:15:54 unbound[20017:1] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 5
Oct 29 23:15:54 unbound[20017:1] info: DelegationPoint<.>: 0 names (0 missing), 1 addrs (0 result, 0 avail) parentNS
Oct 29 23:15:54 unbound[20017:1] debug: ip4 XX.YYY.Z.WWW port 53 (len 16)
Oct 29 23:15:54 unbound[20017:1] debug: No more query targets, attempting last resort
Oct 29 23:15:54 unbound[20017:1] debug: configured stub or forward servers failed
-- returning SERVFAIL
Oct 29 23:15:54 unbound[20017:1] debug: store error response in message cache
Oct 29 23:15:54 unbound[20017:1] debug: return error response SERVFAIL
Oct 29 23:15:54 unbound[20017:1] debug: mesh_run: iterator module exit state is module_finished
Oct 29 23:15:54 unbound[20017:1] debug: query took 9.077455 sec
Oct 29 23:15:54 unbound[20017:1] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
Oct 29 23:15:54 unbound[20017:1] info: average recursion processing time 9.077455 sec
Oct 29 23:15:54 unbound[20017:1] info: histogram of recursion processing times
Oct 29 23:15:54 unbound[20017:1] info: [25%]=0 median[50%]=0 [75%]=0
Oct 29 23:15:54 unbound[20017:1] info: lower(secs) upper(secs) recursions
Oct 29 23:15:54 unbound[20017:1] info:    8.000000   16.000000 1
Oct 29 23:15:54 unbound[20017:1] debug: cache memory msg=66576 rrset=66864 infra=8105 val=0
Oct 29 23:15:54 unbound[20017:1] debug: svcd callbacks end
Oct 29 23:15:54 unbound[20017:1] debug: close of port 34898
Oct 29 23:15:54 unbound[20017:1] debug: close fd 19
Created attachment 532 [details]
debug log from unbound-1.8.1
I do not understand what protects a valid cache entry from being overwritten by the result of a failed query.
Hi Minoura,

I think this may be because you have prefetch enabled (do you?). If so, it could be the patch below. That would stop prefetch failures from wiping out valid existing data.

Best regards, Wouter

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 4974)
+++ iterator/iterator.c	(working copy)
@@ -292,7 +292,7 @@
 	if(!qstate->no_cache_store) {
 		/* store in cache */
 		struct reply_info err;
-		if(qstate->prefetch_leeway > NORR_TTL) {
+		if(qstate->prefetch_leeway > 0) {
 			verbose(VERB_ALGO, "error response for prefetch in cache");
 			/* attempt to adjust the cache entry prefetch */
 			if(dns_cache_prefetch_adjust(qstate->env, &qstate->qinfo,
Thank you for the comment. We do not enable prefetch (the default is no, isn't it?) on either side (C or R).
Hi Minoura,

Yes, the default is off for prefetch. Thanks for telling me the config setting. I think this may be the fix you need. It simply checks whether a good cache entry already exists before putting the servfail into the cache.

Best regards, Wouter

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 4981)
+++ iterator/iterator.c	(working copy)
@@ -292,7 +292,7 @@
 	if(!qstate->no_cache_store) {
 		/* store in cache */
 		struct reply_info err;
-		if(qstate->prefetch_leeway > 0) {
+		if(qstate->prefetch_leeway > NORR_TTL) {
 			verbose(VERB_ALGO, "error response for prefetch in cache");
 			/* attempt to adjust the cache entry prefetch */
 			if(dns_cache_prefetch_adjust(qstate->env, &qstate->qinfo,
@@ -327,6 +327,29 @@
 			/* serving expired contents, but nothing is cached
 			 * at all, so the servfail cache entry is useful
 			 * (stops waste of time on this servfail NORR_TTL) */
+		} else {
+			/* don't overwrite existing (non-expired) data in
+			 * cache with a servfail */
+			struct msgreply_entry* msg;
+			if((msg=msg_cache_lookup(qstate->env,
+				qstate->qinfo.qname, qstate->qinfo.qname_len,
+				qstate->qinfo.qtype, qstate->qinfo.qclass,
+				qstate->query_flags, *qstate->env->now, 0))
+				!= NULL) {
+				struct reply_info* rep = (struct reply_info*)
+					msg->entry.data;
+				if(FLAGS_GET_RCODE(rep->flags) ==
+					LDNS_RCODE_NOERROR ||
+					FLAGS_GET_RCODE(rep->flags) ==
+					LDNS_RCODE_NXDOMAIN) {
+					/* we have a good entry,
+					 * don't overwrite */
+					lock_rw_unlock(&msg->entry.lock);
+					return error_response(qstate, id, rcode);
+				}
+				lock_rw_unlock(&msg->entry.lock);
+			}
+		}
 		memset(&err, 0, sizeof(err));
 		err.flags = (uint16_t)(BIT_QR | BIT_RA);
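(Editorial note: the guard the patch adds can be read as: before storing an error, look up the message cache with the current time; if a non-expired NOERROR or NXDOMAIN entry is found, reply with the error but leave the cache untouched. Below is a minimal self-contained toy model of that logic. It is not unbound's API; the single-slot cache and the names cache_lookup/store_error are invented purely for illustration.)

#include <stdio.h>
#include <time.h>

enum rcode { RC_NOERROR, RC_NXDOMAIN, RC_SERVFAIL };

struct cache_entry {
	enum rcode rcode;
	time_t expiry; /* absolute expiry time */
	int valid;     /* is an entry present? */
};

static struct cache_entry cache; /* single-slot toy message cache */

/* return the entry if present and not yet expired, else NULL */
static struct cache_entry* cache_lookup(time_t now)
{
	if(cache.valid && cache.expiry > now)
		return &cache;
	return NULL;
}

/* like the patched error path: reply with the error, but only cache it
 * when no good (NOERROR/NXDOMAIN) entry would be overwritten */
static void store_error(enum rcode rc, time_t now, time_t norr_ttl)
{
	struct cache_entry* e = cache_lookup(now);
	if(e && (e->rcode == RC_NOERROR || e->rcode == RC_NXDOMAIN)) {
		printf("good entry in cache; reply with error, keep cache\n");
		return; /* analogous to: return error_response(...) */
	}
	cache.rcode = rc;
	cache.expiry = now + norr_ttl;
	cache.valid = 1;
	printf("stored error entry for %ld seconds\n", (long)norr_ttl);
}

int main(void)
{
	time_t now = time(NULL);
	/* a successful lookup cached a NOERROR answer with TTL 60 */
	cache.rcode = RC_NOERROR;
	cache.expiry = now + 60;
	cache.valid = 1;
	store_error(RC_SERVFAIL, now, 5);       /* kept: NOERROR survives */
	store_error(RC_SERVFAIL, now + 120, 5); /* expired: error is cached */
	return 0;
}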
Thank you for the patch! I'll try it.
The problem seems gone. Thank you!!
Hi Minoura,

Thank you for testing. I'll close the ticket (but feel free to comment on it for other issues, or to open another one).

Best regards, Wouter