Bug 3736 - 0 TTL domains stuck on SERVFAIL unless manually flushed with serve-expired on
0 TTL domains stuck on SERVFAIL unless manually flushed with serve-expired on
Status: VERIFIED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.7.0
x86_64 Linux
: P5 major
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-03-12 17:52 CET by Saksham Manchanda
Modified: 2018-03-13 17:04 CET (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Saksham Manchanda 2018-03-12 17:52:46 CET
Config: serve-expired: yes

1) Query comes in for a foo.bar

2) Recursive lookup is performed. TTL is 0. Therefore, foo.bar is not cached

====== Upstream server goes down ======

3) Query comes in for foo.bar.

4) Recursive lookup returns SERVFAIL. This is cached.

====== Upstream server is back up ======

5) Query comes in for foo.bar.

6) serve_expired the SERVFAIL. Trigger a prefetch.

7) Upstream returns 0 TTL, therefore the cached SERVFAIL is not overwritten. 

Client always gets a SERVFAIL unless it is manually flushed from cache.
Comment 1 Wouter Wijngaards 2018-03-13 09:19:32 CET
Hi Saksham,

Yes, I saw your email.  The fix is much more complicated than you suggest, and I am in the middle of the 1.7.0 release.  Thanks for the report, it looks like a subtle problem with caching.

Best regards, Wouter
Comment 2 Wouter Wijngaards 2018-03-13 13:51:22 CET
Hi Saksham,

This patch probably solves the problem,
Best regards, Wouter

Index: services/cache/dns.c
===================================================================
--- services/cache/dns.c	(revision 4579)
+++ services/cache/dns.c	(working copy)
@@ -108,6 +108,48 @@
         }
 }
 
+/** delete message from message cache */
+static void
+msg_cache_remove(struct module_env* env, uint8_t* qname, size_t qnamelen, 
+	uint16_t qtype, uint16_t qclass, uint16_t flags)
+{
+	struct query_info k;
+	hashvalue_type h;
+
+	k.qname = qname;
+	k.qname_len = qnamelen;
+	k.qtype = qtype;
+	k.qclass = qclass;
+	k.local_alias = NULL;
+	h = query_info_hash(&k, flags);
+	slabhash_remove(env->msg_cache, h, &k);
+}
+
+/** remove servfail msg cache entry */
+static void
+msg_del_servfail(struct module_env* env, struct query_info* qinfo,
+	uint32_t flags)
+{
+	struct msgreply_entry* e;
+	/* see if the entry is servfail, and then remove it, so that
+	 * lookups move from the cacheresponse stage to the recursionresponse
+	 * stage */
+	e = msg_cache_lookup(env, qinfo->qname, qinfo->qname_len,
+		qinfo->qtype, qinfo->qclass, flags, *env->now, 0);
+	if(!e) return;
+	/* we don't check for the ttl here, also expired servfail entries
+	 * are removed.  If the user uses serve-expired, they would still be
+	 * used to answer from cache */
+	if(FLAGS_GET_RCODE(((struct reply_info*)e->entry.data)->flags)
+		!= LDNS_RCODE_SERVFAIL) {
+		lock_rw_unlock(&e->entry.lock);
+		return;
+	}
+	lock_rw_unlock(&e->entry.lock);
+	msg_cache_remove(env, qinfo->qname, qinfo->qname_len, qinfo->qtype,
+		qinfo->qclass, flags);
+}
+
 void 
 dns_cache_store_msg(struct module_env* env, struct query_info* qinfo,
 	hashvalue_type hash, struct reply_info* rep, time_t leeway, int pside,
@@ -132,6 +174,12 @@
 		 * which could be useful for delegation information */
 		verbose(VERB_ALGO, "TTL 0: dropped msg from cache");
 		free(rep);
+		/* if the message is SERVFAIL in cache, remove that SERVFAIL,
+		 * so that the TTL 0 response can be returned for future
+		 * responses (i.e. don't get answered by the servfail from
+		 * cache, but instead go to recursion to get this TTL0
+		 * response). */
+		msg_del_servfail(env, qinfo, flags);
 		return;
 	}
Comment 3 Wouter Wijngaards 2018-03-13 14:14:20 CET
Hi Saksham,

The issue is solved, with the change below.  Thanks for the report!
What it does is that it removes SERVFAIL responses from cache when storing a 0TTL response.  This will also cause 0TTL responses to be available when serve-expired is not in use when they are blocked by a SERVFAIL response in cache, although that normally would timeout in a couple seconds, with the fix the 0TTL response is available immediately after the 0TTL response has the cachestore routine called.

Best regards, Wouter

Index: services/cache/dns.c
===================================================================
--- services/cache/dns.c	(revision 4582)
+++ services/cache/dns.c	(working copy)
@@ -135,7 +135,7 @@
 	 * lookups move from the cacheresponse stage to the recursionresponse
 	 * stage */
 	e = msg_cache_lookup(env, qinfo->qname, qinfo->qname_len,
-		qinfo->qtype, qinfo->qclass, flags, *env->now, 0);
+		qinfo->qtype, qinfo->qclass, flags, 0, 0);
 	if(!e) return;
 	/* we don't check for the ttl here, also expired servfail entries
 	 * are removed.  If the user uses serve-expired, they would still be
Comment 4 Saksham Manchanda 2018-03-13 17:04:41 CET
Thanks for the quick turnaround! Verified fix.