Bug 4216 - cache corruption issue?
Status: RESOLVED INVALID
Product: unbound
Classification: Unclassified
Component: server
Version: 1.8.3
Hardware: Other
OS: All
Importance: P5 enhancement
Assigned To: unbound team
Reported: 2019-01-09 03:32 CET by Manu Bretelle
Modified: 2019-01-19 01:30 CET
CC: 2 users

Description Manu Bretelle 2019-01-09 03:32:03 CET
I have been seeing some cache corruption bugs which I am not able to reproduce at will. One thing I noticed is that when the issue happens, the cache hit rate goes through the roof. Fasten your seat belt.

using unbound 1.8.0 (synced from master on Sept 11 2018).

formula to compute global cache hit rate:
DS1 = total.num.cachehits
DS2 = total.num.cachemiss
hit rate = 100 * (DS1/(DS1 + DS2))

formula to compute ECS hit rate:
DS1 = num.query.subnet_cache
DS2 = num.query.subnet
ECS hit rate = 100 * (DS1 / DS2)
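
As a side note, here is a minimal sketch of how these two rates could be computed from the counters named above, as reported by `unbound-control stats`; the parsing and the way the stats are fetched are assumptions for illustration, only the counter names come from the formulas above:

```python
# Hedged sketch: compute the hit rates described above from the output of
# `unbound-control stats` (lines of the form "name=value"). Fetching the
# stats via a subprocess call is an assumption made for illustration.
import subprocess

def parse_stats(text):
    """Parse "name=value" lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            try:
                stats[name] = float(value)
            except ValueError:
                pass  # skip any non-numeric fields
    return stats

def hit_rates(stats):
    """Return (global hit rate, ECS hit rate) in percent, per the formulas above."""
    hits = stats.get("total.num.cachehits", 0.0)
    misses = stats.get("total.num.cachemiss", 0.0)
    global_rate = 100.0 * hits / (hits + misses) if (hits + misses) else 0.0

    ecs_hits = stats.get("num.query.subnet_cache", 0.0)
    ecs_total = stats.get("num.query.subnet", 0.0)
    ecs_rate = 100.0 * ecs_hits / ecs_total if ecs_total else 0.0
    return global_rate, ecs_rate

if __name__ == "__main__":
    out = subprocess.run(["unbound-control", "stats"],
                         capture_output=True, text=True, check=True).stdout
    print("global=%.1f%% ecs=%.1f%%" % hit_rates(parse_stats(out)))
```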

set up:
assume 2 authorities: example.com and sub.example.com
unbound is configured to send ECS to those authorities based on their IP addresses (e.g. send-client-subnet: ipv4/32 and ipv6/128)
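
For illustration only, a minimal sketch of what the ECS side of such a configuration might look like in unbound.conf; the module-config line and the placeholder authority addresses are assumptions standing in for the actual setup:

```
server:
    # the subnetcache module must be listed in module-config for ECS support
    module-config: "subnetcache validator iterator"
    # send ECS towards these authorities; placeholder /32 and /128 addresses
    # standing in for the real example.com authority addresses
    send-client-subnet: 192.0.2.53
    send-client-subnet: 2001:db8::53
```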

The example.com authority is configured as a stub-zone:

stub-zone:
  name: "example.com."
  stub-addr: ipv4
  stub-addr: ipv6


ECS is used to return a different CNAME based on the client IP, e.g.:


RFC1918 clients:     bar.example.com CNAME internal.sub.example.com
non-RFC1918 clients: bar.example.com CNAME external.sub.example.com

A client queries foo.example.com, which CNAMEs to bar.example.com, which, depending on the client, CNAMEs to {internal,external}.sub.example.com.
 
Usually this works well, but I have seen cases where, all of a sudden, the unbound cache hit rate goes through the roof (and sustains at ~100% for hours, well past the TTL, so this does not seem to be a sporadic poisoned entry) and the same answer seems to be returned to all clients.

Another thing I spotted is that the "dump_cache" output contains
;rrset 3588 1 0 8 2
foo.example.com.	3588	IN	CNAME	internal.sub.example.com.

This is *only* supposed to be returned when using ECS (i.e. without ECS, it is always foo.example.com CNAME bar.example.com).


While the issue happened on 1.8.0, I have seen these cache hit rate symptoms (in the global cache, not the ECS cache at the time) for a long time, at least a year or more. The issue may only have been exacerbated once ECS was added to the mix and answers for the "wrong" audience were returned, giving the issue an actual impact.

Unfortunately I don't have any logs, and enabling logging would just not be an option; as far as I can tell, the issue only reproduces out of the blue.
Comment 1 Manu Bretelle 2019-01-09 17:47:20 CET
One thing to add, but again just a symptom: querying unbound with +subnet=198.51.100.0/24 while the issue was happening was returning a result with a scope of /0, while the longest scope expected under normal operation was /24, e.g. we had something like:
foo.example.com CNAME to bar.example.com with scope 198.51.100.0/24/0
bar.example.com CNAME to external.sub.example.com with scope 198.51.100.0/24/5
and finally external.sub.example.com has an A record with a scope of 198.51.100.0/24/24, hence the full answer that gets returned under normal operation is:

;; CLIENT-SUBNET: 198.51.100.0/24/24
foo.example.com CNAME bar.example.com
bar.example.com CNAME external.sub.example.com
external.sub.example.com A 192.0.2.1

but when the issue happens, it returns something like:

;; CLIENT-SUBNET: 198.51.100.0/24/0
foo.example.com CNAME bar.example.com
bar.example.com CNAME external.sub.example.com
external.sub.example.com A 192.0.2.1
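
For reference, the answers above are the sort of output one would get from a query along these lines against the resolver (the listening address is a placeholder assumption):

```
dig @127.0.0.1 foo.example.com A +subnet=198.51.100.0/24
```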
Comment 2 Wouter Wijngaards 2019-01-10 12:03:26 CET
Hi Manu,

So, in the update you say that it returns /0 results, and those end up in the global cache.  It sounds like that is the intended mode of operation? For the ECS cache, that is.  Not for what the upstream server is trying to do.

The first set of information, so it ends up caching foo CNAME internal. But that is never sent, only foo CNAME bar and bar CNAME internal or external? That is weird, unbound does not create such CNAMEs itself, so it must have received that foo CNAME internal from somewhere.

Another way to get this may be that the server is unavailable; older Unbound has had issues where it would query above the stub, e.g. ignore the stub because it was unavailable and use the further-upstream configuration to contact servers. And that would have contacted other servers. But those issues are, I think, fixed in recent releases.

Could there be another, third query: third CNAME foo, and then foo CNAME .. continues resolution? And that somehow gets different, I mean the wrong, answers? Unbound should peel off that third CNAME foo from the front and then continue recursing in the iterator. But that does look like a different code path for the ECS framework, perhaps?

Best regards, Wouter
Comment 3 Manu Bretelle 2019-01-10 23:45:00 CET
Hi Wouter,

thanks for looking into this. Replying to your questions inline:


> So, in the update you say that it returns /0 results, and those end up in the global cache.  It sounds like that is the intended mode of operation? For the ECS cache, that is.  Not for what the upstream server is trying to do.

So unbound only returns a scope of /0 when the issue manifests. When this is happening, the auth is still returning the right scopes, e.g. (also adding TTLs, which I should have added before):
```
foo.example.com CNAME to bar.example.com with scope 198.51.100.0/24/0 and TTL 3600
bar.example.com CNAME to external.sub.example.com with scope 198.51.100.0/24/5 and TTL 3600
and finally external.sub.example.com has an A record with a scope of 198.51.100.0/24/24 and TTL 60
```

I am guessing it ends up in the local cache given the /0. In any case, if the auth server had sent it spuriously, I would expect that eventually, as the TTL expires, it would "fix" itself, but the issue persists way beyond TTL expiration.


> The first set of information, so it ends up caching foo CNAME internal. But that is never sent, only foo CNAME bar and bar CNAME internal or external? That is weird, unbound does not create such CNAMEs itself, so it must have received that foo CNAME internal from somewhere.

bar CNAME internal is only sent for specific subnets when using ECS. If no ECS is set, it would only ever return `external`. Given that I could see it in the `dump_cache` output, I am wondering if this means that it ended up in the non-ECS cache. In other words, is dump_cache supposed to show anything from the ECS cache?

> Another way to get this may be that the server is unavailable; older Unbound has had issues where it would query above the stub, e.g. ignore the stub because it was unavailable and use the further-upstream configuration to contact servers. And that would have contacted other servers. But those issues are, I think, fixed in recent releases.

In this case the stub-addr entries are the same as one would get from asking the parent; it is merely a way to bypass full recursion.
Was this fixed before 1.8.0? The issue was reproducing on 1.8.0.

> Could there be another, third query: third CNAME foo, and then foo CNAME .. continues resolution? And that somehow gets different, I mean the wrong, answers? Unbound should peel off that third CNAME foo from the front and then continue recursing in the iterator. But that does look like a different code path for the ECS framework, perhaps?

you mean:
```
third CNAME foo
foo CNAME bar
bar CNAME {external,internal}
{external,internal} A i.p.add.rr
```

Not that I know of.
Comment 4 Manu Bretelle 2019-01-19 01:30:14 CET
Hi Wouter, the cache hit rate seems to be a red herring. We have been able to reproduce the issue and at this stage I don't believe it is any cache corruption, or at least there are other issues that need to be clarified/fixed first.

Going to close this for now. Sorry for the bother.