Bug 4216

Summary: unbound stops sending ECS to auth after timeout when prefetch is set
Product: unbound
Reporter: Manu Bretelle <chantr4>
Component: server
Assignee: unbound team <unbound-team>
Status: REOPENED
Severity: enhancement
CC: aluck, cathya, leoleovic.h, ralph, wouter
Priority: P5
Version: 1.8.3
Hardware: Other
OS: All

Description Manu Bretelle 2019-01-09 03:32:03 CET
I have been seeing some cache corruption bugs which I am not able to reproduce at will. One thing I noticed is that when the issue happens, the cache hit rate goes through the roof. Fasten your seat belt.

Using unbound 1.8.0 (synced from master on Sept 11, 2018).

Formula to compute the global cache hit rate:
DS1 = total.num.cachehits
DS2 = total.num.cachemiss
hit rate = 100 * (DS1 / (DS1 + DS2))

Formula to compute the ECS hit rate:
DS1 = num.query.subnet_cache
DS2 = num.query.subnet
ECS hit rate = 100 * (DS1 / DS2)
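
For convenience, both rates can be computed in one go from the counters above; this is just a sketch assuming the usual key=value output of `unbound-control stats_noreset` (the subnet counters may require extended-statistics: yes):
```
unbound-control stats_noreset | awk -F= '
  /^total\.num\.cachehits=/    { hits = $2 }
  /^total\.num\.cachemiss=/    { miss = $2 }
  /^num\.query\.subnet=/       { ecs  = $2 }
  /^num\.query\.subnet_cache=/ { ecsh = $2 }
  END {
    if (hits + miss > 0) printf "global hit rate: %.1f%%\n", 100 * hits / (hits + miss)
    if (ecs > 0)         printf "ECS hit rate:    %.1f%%\n", 100 * ecsh / ecs
  }'
```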

Setup:
assume 2 authorities: example.com and sub.example.com
unbound is configured to send ECS to those authorities based on IP addresses (e.g. send-client-subnet: ipv4/32 and ipv6/128)

The example.com authority is configured as a stub-zone:

stub-zone:
  name: "example.com."
  stub-addr: ipv4
  stub-addr: ipv6


ECS is used to return a different CNAME based on the client IP, e.g.:


RFC1918 client:      bar.example.com CNAME internal.sub.example.com
non-RFC1918 client:  bar.example.com CNAME external.sub.example.com

A client queries foo.example.com, which CNAMEs to bar.example.com, which, depending on the client, CNAMEs to {internal,external}.sub.example.com.
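
To make the split concrete, the expected behaviour can be checked with two queries along these lines (resolver address is illustrative):
```
# RFC1918 client subnet -> internal target
dig @192.0.2.53 bar.example.com +subnet=10.0.0.0/24
#   bar.example.com.  CNAME  internal.sub.example.com.

# non-RFC1918 client subnet -> external target
dig @192.0.2.53 bar.example.com +subnet=198.51.100.0/24
#   bar.example.com.  CNAME  external.sub.example.com.
```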
 
Usually this works well, but I have seen cases where, all of a sudden, the unbound cache hit rate goes through the roof (and stays at ~100% for hours, well past TTL expiry, so this does not seem to be a sporadic poisoned entry) and the same answer seems to be returned to all clients.

Another thing that I spotted is that "dump_cache" contains 
;rrset 3588 1 0 8 2
foo.example.com.	3588	IN	CNAME	internal.sub.example.com.

This is *only* supposed to be returned when using ECS (e.g. without ECS, it is always foo.example.com CNAME bar.example.com).
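
A quick way to check for that stray record (assuming unbound-control access to the instance):
```
unbound-control dump_cache | grep 'internal.sub.example.com'
```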


While the issue happened on 1.8.0, I have seen those cache hit rate symptoms (global cache, not ECS cache at the time) for a long time, at least a year or more, but the issue may only have been exacerbated once ECS was added to the mix and answers for the "wrong" audience were returned, giving the issue an actual impact.

I unfortunately don't have any logs, and enabling logging would just not be an option, and the issue only reproduces out of the blue as far as I can tell.
Comment 1 Manu Bretelle 2019-01-09 17:47:20 CET
One thing to add, but again just a symptom: when the issue happens, querying unbound with +subnet=198.51.100.0/24 returns a result with a scope of /0, while the longest scope expected under normal operation is /24. E.g. we had something like:
foo.example.com CNAMEs to bar.example.com with scope 198.51.100.0/24/0
bar.example.com CNAMEs to external.sub.example.com with scope 198.51.100.0/24/5
and finally external.sub.example.com has an A record with a scope of 198.51.100.0/24/24, hence the full answer that gets returned under normal operation is:

;; CLIENT-SUBNET: 198.51.100.0/24/24
foo.example.com CNAME bar.example.com
bar.example.com CNAME external.sub.example.com
external.sub.example.com A 192.0.2.1

but when the issue happens, it returns something like:

;; CLIENT-SUBNET: 198.51.100.0/24/0
foo.example.com CNAME bar.example.com
bar.example.com CNAME external.sub.example.com
external.sub.example.com A 192.0.2.1
Comment 2 Wouter Wijngaards 2019-01-10 12:03:26 CET
Hi Manu,

So, in the update you say that it returns /0 results, and those end up in the global cache.  It sounds like that is the intended mode of operation? For the ECS cache, that is.  Not for what the upstream server is trying to do.

The first set of information, so it ends up caching foo CNAME internal. But that is never sent, only foo CNAME bar and bar CNAME internal or external? That is weird, unbound does not create such CNAMEs itself, so it must have received that foo CNAME internal somewhere.

Another way to get this may be that the server is unavailable, and older Unbound has had issues where it would query above the stub, e.g. ignore the stub, because it is unavailable, and use further-upstream config to contact servers. And that would have contacted other servers. But those issues are, I think, fixed in recent releases.

Could there be another, third query: third CNAME foo, and then foo CNAME .. continues resolution? And that somehow gets different, I mean the wrong, answers? Unbound should peel off that third CNAME foo from the front and then continue recursing in the iterator. But that does look like a different code path for the ECS framework, perhaps?

Best regards, Wouter
Comment 3 Manu Bretelle 2019-01-10 23:45:00 CET
Hi Wouter,

thanks for looking into this. Replying to your questions inline:


> So, in the update you say that it returns /0 results, and those end up in the global cache.  It sounds like that is the intended mode of operation? For the ECS cache, that is.  Not for what the upstream server is trying to do.

So unbound only returns a scope of /0 when the issue manifests itself. When this happens, the auth is still returning the right scopes, e.g. (also adding TTLs; I should have added these before):
```
foo.example.com cname to bar.example.com with scope 198.51.100.0/24/0 and TTL 3600
bar.example.com cname to external.sub.example.com with scope of 198.51.100.0/24/5 and TTL 3600
and finally external.sub.example.com has an A record with a scope of 198.51.100.0/24/24 and TTL 60
```

I am guessing it ends up in the local cache given the /0. In any case, if the auth server had sent it spuriously, I would expect that eventually, as the TTL expires, it would "fix" itself, but the issue persists way beyond TTL expiration.


> The first set of information, so it ends up caching foo CNAME internal. But that is never sent, only foo CNAME bar and bar CNAME internal or external? That is weird, unbound does not create such CNAMEs itself, so it must have received that foo CNAME internal somewhere.

bar CNAME internal is only sent for specific subnets when using ECS. If no ECS is set, it would only ever return `external`. Given that I could see it in `dump_cache` output, I am wondering if this means that it ended up in the non-ECS cache. In other words, is dump_cache supposed to show anything from the ECS cache?

> Another way to get this may be that the server is unavailable, and older Unbound has had issues where it would query above the stub, e.g. ignore the stub, because it is unavailable, and use further-upstream config to contact servers. And that would have contacted other servers. But those issues are, I think, fixed in recent releases.

In this case the stub-addr entries are the same ones one would get from asking the parent; it is merely a way to bypass full recursion.
Was this fixed before 1.8.0? The issue was reproducing on 1.8.0.

> Could there be another, third query: third CNAME foo, and then foo CNAME .. continues resolution? And that somehow gets different, I mean the wrong, answers? Unbound should peel off that third CNAME foo from the front and then continue recursing in the iterator. But that does look like a different code path for the ECS framework, perhaps?

you mean:
```
third CNAME foo
foo CNAME bar
bar CNAME {external,internal}
{external,internal} A i.p.add.rr
```

Not that I know of.
Comment 4 Manu Bretelle 2019-01-19 01:30:14 CET
Hi Wouter, the cache hit rate seems to be a red herring. We have been able to reproduce the issue and at this stage, I don't believe it is any cache corruption, or at least there are other issues that need to be clarified/fixed first.

Going to close this for now. Sorry for the bother.
Comment 5 Manu Bretelle 2019-01-23 18:51:19 CET
OK, I dug more into the problem and I am able to repro the issue.

I can also reproduce on master with Ralph's latest diff, which removes the handling of EDNS-lame servers:
https://github.com/NLnetLabs/unbound/commit/f30fe713958b236fbba84be92953d242ec3677cb

So here is the thing:

unbound is configured to send ECS to the set of servers handling zone `example.com` and its delegations. All of them, meaning that every single request from unbound to those name servers will contain ECS, as we use `send-client-subnet` for each single address handling that zone.

If unbound queries the name server without ECS, it is returned internal.sub.example.com, i.e. the auth will return internal.sub.example.com when ECS is not present.

What happens: under load, the name servers may become unavailable, or, say, some packets get dropped.

Unbound logs will show:
[1548262396] unbound[4137483:1a] error: SERVFAIL <foo.example.com. A IN>: all the configured stub or forward servers failed, at zone example.com.

On the client performing the load test I can see that until 1548262396 responses are as expected.
Then, there is a pause of 5s and at 1548262401 responses show the issue.
The 5s seems to conveniently align with https://github.com/NLnetLabs/unbound/blob/master/services/outside_network.c#L1972


My understanding is that once unbound returns the SERVFAIL, it gets into a state where it stops adding EDNS? Or at least a state where "OPT" is still added and the DO bit is still set, but ECS is not added anymore.
When this happens, unbound ends up stuck in "non-ECS" mode, hence it gets the internal.sub.example.com. answer.
This explains why:
- internal.sub.example.com. makes it into the global cache. It should never make it there, given that we should be sending ECS all the time.
- the hit rate is so high (given that unbound then seems to fetch from the global cache, or at least fall back to it?)
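
For what it is worth, the flip is easy to watch from the client side with a loop like the one below (address, port, and name are illustrative):
```
# poll unbound with an ECS subnet and print the scope it echoes back
while sleep 1; do
  dig @::1 -p 1153 foo.example.com +subnet=198.51.100.0/24 | grep 'CLIENT-SUBNET'
done
# normal operation prints "; CLIENT-SUBNET: 198.51.100.0/24/24";
# once the issue triggers it flips to ".../24/0" and stays there past the TTL
```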


The dump_infra before the issue shows:

$./unbound-control -c edge-forwarder.conf dump_infra
::1@8863 sub.example.com. ttl 890 ping 34 var 6 rtt 58 rto 58 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
127.0.0.1@8853 example.com. ttl 890 ping 0 var 94 rtt 376 rto 376 tA 0 tAAAA 0 tother 0 ednsknown 0 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
::1@8853 example.com. ttl 890 ping 73 var 4 rtt 89 rto 89 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
127.0.0.1@8863 sub.example.com. ttl 890 ping 0 var 94 rtt 376 rto 376 tA 0 tAAAA 0 tother 0 ednsknown 0 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0

During the issue:
$ ./unbound-control -c edge-forwarder.conf dump_infra
::1@8863 sub.example.com. ttl 871 ping 338 var 223 rtt 1230 rto 1230 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
127.0.0.1@8853 example.com. ttl 871 ping 0 var 94 rtt 376 rto 376 tA 0 tAAAA 0 tother 0 ednsknown 0 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
::1@8853 example.com. ttl 871 ping 359 var 331 rtt 1683 rto 1683 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
127.0.0.1@8863 sub.example.com. ttl 871 ping 0 var 94 rtt 376 rto 376 tA 0 tAAAA 0 tother 0 ednsknown 0 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0

Note that for some reason ednsknown always seems to be set to 1 for the IPv6 servers (whether or not ECS is sent).


Hope that helps shed some light. I will be happy to provide more details if you need any.

I suppose the question I have in mind is whether or not unbound should just send ECS to those nameservers, given that the config tells it to do so. Can it avoid the lame-detection bit? Or could some option be provided to allow forcing ECS even though the unbound infra host says it should not?
Comment 6 Wouter Wijngaards 2019-01-24 10:57:28 CET
Hi Manu,

The DNS Flag Day changes may fix this problem for you then, as they remove the EDNS drop due to timeouts? That is interesting to know. The release is upcoming on the flag day, and the changes are already in svn trunk.

Best regards, Wouter
Comment 7 Manu Bretelle 2019-01-24 19:10:12 CET
Hi Wouter,

> The DNS Flag day changes may fix this problen for you then, as it removes the EDNS drop due to timeouts?

Nope, it does not help. I can reproduce even with https://github.com/NLnetLabs/unbound/commit/f30fe713958b236fbba84be92953d242ec3677cb
Comment 8 Ralph Dolmans 2019-01-30 14:59:32 CET
Hi Manu,

The ECS option should not be removed from the query after a timeout, at least not in the most recent version (after the flag day changes). I am not able to reproduce that behavior after killing a nameserver. Are you sure the option is not present in the query? Does the query have an OPT record without an ECS option?

Unbound will change the EDNS buffer size after a timeout to 1232 or 1472; any chance this changes the behavior on your auths?
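
A quick way to answer that from the auth side could be to repeat the same ECS query directly against the auth with the two fallback sizes and compare the answers (illustrative auth address/port):
```
dig @127.0.0.1 -p 8853 bar.example.com +subnet=10.0.0.0/24 +bufsize=4096
dig @127.0.0.1 -p 8853 bar.example.com +subnet=10.0.0.0/24 +bufsize=1232
```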

While looking into this I did find a corner case in which a timeout can result in marking the delegation as ednsknown; trunk now contains a fix for this. This might have happened to you on your IPv6 delegations, but that still does not explain outgoing queries without the ECS option.

-- Ralph
Comment 9 Manu Bretelle 2019-01-31 01:53:35 CET
Thanks Ralph!

I did give master a try; still no luck.

I have simplified the repro. So, this is what I use to perform the repro:
https://gist.github.com/chantra/d7b49d5b38b07c20d9fc7772b624e794

I don't use the subdelegation anymore; I just have
foo.example.com CNAME bar.example.com
bar.example.com CNAME extcode.example.com if ECS is not set or if ECS is set and subnet is not in RFC1918
otherwise
bar.example.com CNAME intcode.example.com


The first thing that stands out is that even if I do a simple
```
 dig foo.example.com -p 1153 @::1  +subnet=10.0.0.0/24
```

The servers are marked as edns lame?

```
127.0.0.1@8853 example.com. ttl 897 ping 0 var 54 rtt 216 rto 216 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
::1@8853 example.com. ttl 897 ping 0 var 71 rtt 284 rto 284 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
```

I have uploaded a pcap to https://www.packettotal.com/app/analysis?id=ab657e5aed3af196db710bf60da850e2 (note that the server is running on port 8853; the pcap was too big to attach to this bug report).

One of the interesting parts is that, when using:
```
tshark -r /tmp/dns_example.com.pcap -V -Y 'dns.flags.response == 0 ' -T fields -e "frame.time" -e "dns.opt.data" | less
```

```
Jan 30, 2019 16:35:14.618349000 PST     00:01:20:00:21:b6:3f:23
Jan 30, 2019 16:35:14.619272000 PST     00:01:20:00:b7:5e:fe:5b
Jan 30, 2019 16:35:14.619630000 PST     00:01:20:00:b7:5e:fe:5b
Jan 30, 2019 16:35:14.620376000 PST     00:01:20:00:4f:2b:1c:04
Jan 30, 2019 16:35:14.620439000 PST     00:01:20:00:22:f1:38:3f
Jan 30, 2019 16:35:14.621797000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:14.621881000 PST     00:01:20:00:1c:fe:8c:a1
Jan 30, 2019 16:35:14.622982000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:14.623875000 PST     00:01:20:00:74:99:a8:ed
Jan 30, 2019 16:35:19.000153000 PST
Jan 30, 2019 16:35:19.000200000 PST
Jan 30, 2019 16:35:19.000307000 PST
Jan 30, 2019 16:35:19.000347000 PST
Jan 30, 2019 16:35:19.000629000 PST
Jan 30, 2019 16:35:19.000655000 PST
Jan 30, 2019 16:35:19.000780000 PST
Jan 30, 2019 16:35:19.000808000 PST
Jan 30, 2019 16:35:19.000831000 PST
Jan 30, 2019 16:35:19.000901000 PST
Jan 30, 2019 16:35:19.000955000 PST
Jan 30, 2019 16:35:19.001006000 PST
Jan 30, 2019 16:35:19.001097000 PST
Jan 30, 2019 16:35:19.001142000 PST
Jan 30, 2019 16:35:19.001194000 PST
Jan 30, 2019 16:35:19.001203000 PST
Jan 30, 2019 16:35:19.001237000 PST
Jan 30, 2019 16:35:19.001302000 PST
Jan 30, 2019 16:35:19.001350000 PST
Jan 30, 2019 16:35:19.001370000 PST
Jan 30, 2019 16:35:19.001464000 PST
Jan 30, 2019 16:35:19.001498000 PST
```

See the 5-second pause in the trace, and then all of a sudden there is no ECS data anymore.
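
For readability: the dns.opt.data column appears to be the raw ECS option payload (RFC 7871 wire format), so the first sample decodes roughly as:
```
00:01        FAMILY = 1 (IPv4)
20           SOURCE PREFIX-LENGTH = 0x20 = 32
00           SCOPE PREFIX-LENGTH = 0
21:b6:3f:23  client address (/32)
```
After the pause the column is empty: OPT is still sent, but it no longer carries an ECS option.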

Note that unbound is configured with:
```
send-client-subnet: 127.0.0.1
send-client-subnet: ::1
```

Complete config is available at https://gist.github.com/chantra/3cd7d629b16fdd113c3b83239059e74e
Comment 10 Manu Bretelle 2019-01-31 01:59:18 CET
I just submitted the previous comment but in the meantime I was able to isolate the issue to prefetching....


If I comment out "prefetch: yes" in https://gist.github.com/chantra/3cd7d629b16fdd113c3b83239059e74e

Then, the issue does not reproduce!
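
For reference, the relevant part of that config boils down to something like this (illustrative excerpt, not the full gist):
```
server:
    prefetch: yes              # commenting this out makes the issue disappear
    send-client-subnet: 127.0.0.1
    send-client-subnet: ::1
```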
Comment 11 Ralph Dolmans 2019-01-31 12:07:30 CET
Thanks for the detailed report Manu, much appreciated!

No ECS on prefetch queries makes sense: these queries do not have a reply_list on the mesh state, since no replies are needed. The ECS module uses the reply_list to get the address to put in the options.

So, what needs to be done is making the prefetch mesh states unique (or, preferably, unique per subnet), and somehow passing the client information to the module. Let me see what I can do here.
Comment 12 Manu Bretelle 2019-01-31 17:24:30 CET
changed the title of this bug to be more accurate than cache corruption :D
Comment 13 Manu Bretelle 2019-04-08 03:37:32 CEST
Capturing some ideas that were exchanged IRL.

When the issue does not trigger, prefetching works just fine, which would suggest that prefetch queries are performed with the right context (ECS source prefix).
Once the issue triggers, the hit rate becomes really high, which would suggest that the answer gets cached with a scope of /0, making the answer valid for every single request that comes after it.