Bug 4220 - memory leak in unbound
Status: ASSIGNED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.9.1
Hardware: x86_64 Linux
Importance: P5 major
Assigned To: unbound team
Reported: 2019-01-15 15:52 CET by Andrei Lukovenko
Modified: 2019-07-11 16:30 CEST
CC: 4 users



Attachments
* config for repro (real IPs are replaced) (2.21 KB, application/octet-stream), 2019-01-15 15:52 CET, Andrei Lukovenko
* chart (RSS vs rrset cache count) (331.09 KB, image/png), 2019-01-15 15:56 CET, Andrei Lukovenko
* Unbound leak without ECS on domain (no leak) (353.68 KB, image/png), 2019-01-24 11:49 CET, leoleovich
* Unbound leak with ECS on domain (leak) (675.68 KB, image/png), 2019-01-24 12:00 CET, leoleovich
* Unbound leak in prod (621.02 KB, image/png), 2019-02-01 14:48 CET, leoleovich
* Unbound SIGHUP only (527.67 KB, image/png), 2019-03-27 11:50 CET, leoleovich
* Unbound SIGHUP+flush infra (570.96 KB, image/png), 2019-03-27 11:51 CET, leoleovich
* Lower infra cache experiment (699.04 KB, image/png), 2019-03-29 18:08 CET, leoleovich

Description Andrei Lukovenko 2019-01-15 15:52:09 CET
Created attachment 546 [details]
config for repro (real IPs are replaced)

We have been observing what appears to be a memory leak in unbound for a while, and recently we were finally able to reproduce it. It looks like the leak is caused by caches not being fully cleaned up during a config reload on SIGHUP.

The version we are using is compiled from source for x86_64:
Version 1.8.1
linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips  26 Jan 2017
linked modules: dns64 python subnetcache respip validator iterator


Steps to reproduce:
1) start unbound with minimal config (see attached)
2) start an authoritative server that returns data for any query under 'example.com.'; in our particular case:

$ dig +short $RANDOM.$RANDOM.$$.example.com AAAA
example.com.
2401:...:12a
3) send about 10,000,000 requests with names templated like $RANDOM.$RANDOM.$$.example.com and observe the cache filling up and stabilizing (both the RSS of the unbound process and the key count from 'unbound-control stats -c /etc/unbound/unbound.conf | fgrep rrset.cache.count' go up and then stay stable)
4) wait about a minute
5) send SIGHUP or execute 'unbound-control -c /etc/unbound/unbound.conf reload'
6) at this point the caches are almost entirely empty, and RSS goes down a bit
7) repeat steps 3-6

Expected result: RSS stays at the level seen in step 6, regardless of how many times we repeat steps 3-6

Observed result: RSS goes up by about 500 MiB each time steps 3-6 are repeated

We repeated steps 3-6 many times and observed unbounded memory growth that eventually ends with the OOM killer. (A rough sketch of the load/reload loop is shown below.)
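For reference, the load/reload cycle can be scripted roughly like this (a sketch only; the resolver address and query loop stand in for our real load generator):

```
#!/bin/bash
# Sketch of steps 3-6. Assumes unbound listens on 127.0.0.1 and uses the
# attached config; the dig loop stands in for the real load test tool.
CONF=/etc/unbound/unbound.conf
while true
do
    # step 3: fill the caches with unique names
    for i in $(seq 1 10000000); do
        dig @127.0.0.1 +short $RANDOM.$RANDOM.$$.example.com AAAA >/dev/null
    done
    unbound-control -c $CONF stats_noreset | fgrep rrset.cache.count
    sleep 60                               # step 4
    unbound-control -c $CONF reload        # step 5 (or send SIGHUP)
done
```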
Comment 1 Andrei Lukovenko 2019-01-15 15:56:52 CET
Created attachment 547 [details]
chart (RSS vs rrset cache count)

Blue line - RSS of unbound cgroup
Orange line - rrset cache count

Notice that the cache count drops on each reload (18:25, 18:47, 18:55, 19:15, 19:24) and then goes up again as the cache fills with synthetic load. The overall trend of RSS, however, is upwards.
Comment 2 Wouter Wijngaards 2019-01-15 15:59:28 CET
Hi Andrei,

The config reload does not actually clear the cache memory, although it may apply changes to the cache that follow from the new config. The cache is supposed to grow until it reaches the configured maximum. Does the cache exceed the configured maximum?

If so, don't configure more cache memory than you have memory in the machine.

This is because unbound's cache eviction policy does not actually delete the elements but only does bookkeeping, until the cache hits the configured maximum memory.

Having twice the configured value available as RAM is a good rule of thumb, because memory allocators need that sort of overhead. By the configured value I mean the sum of the configured cache sizes.

Best regards, Wouter
Comment 3 Wouter Wijngaards 2019-01-15 16:02:26 CET
Hi Andrei,

(I had just looked at the config and had not seen the graph yet, as our last comments crossed.) This is what you configured for cache sizes:
  msg-cache-size: 668632514
  neg-cache-size: 334316257
  rrset-cache-size: 1337265029
Unbound reports the cache sizes in use in the unbound-control stats (or stats_noreset) output. The configured sizes are about 668M, 334M and 1337M, for a total of roughly 2.3G of memory; with x2 for allocator overhead that grows to about 4.6G of memory. Is that what you are seeing?

Best regards, Wouter
Comment 4 Andrei Lukovenko 2019-01-15 16:05:47 CET
Hi Wouter,

Thanks for the quick response. With this config we actually observe unbound getting up to 15 GiB (at which point the OOM happens).
Comment 5 Wouter Wijngaards 2019-01-15 16:09:19 CET
Hi Andrei,

So what are the other memory values from the stats command? Apart from msg.cache.count and rrset.cache.count there are also mem.cache.rrset and mem.cache.message. Are those values increasing? If they stay flat while RSS grows, then it is not the actual contents of the cache that is increasing your RSS.

But I can already tell from the rrset cache count staying stable that unbound is caching the same amount here. So somehow the unbound process is using 500M of extra memory while storing the same amount of data?

valgrind is good for finding memory leaks, and I would appreciate its output (from a short run perhaps, because it is slow) if it reports memory leaks. On some systems a fixed couple of kB is leaked on exit (by libraries used by unbound), but if you run for some time that 500 MB could somehow become apparent. I ask because if the bookkeeping behind the stats output does not show the memory, then it must be somewhere else.
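For example (illustrative commands; adjust the config path as needed):

```
# memory accounting as reported by unbound itself
unbound-control -c /etc/unbound/unbound.conf stats_noreset | grep '^mem\.'

# short run under valgrind; -d keeps unbound in the foreground so the
# leak report is printed when the process exits
valgrind --leak-check=full unbound -d -c /etc/unbound/unbound.conf
```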

Best regards, Wouter
Comment 6 Wouter Wijngaards 2019-01-15 16:10:45 CET
Hi Andrei,

I just noticed from the config that you have EDNS client subnet enabled. Without the subnetcache module, i.e. without EDNS client subnet, does the problem disappear? If so, the subnetcache must somehow be the problem.
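Concretely, that test would be a config change along these lines (illustrative; keep whatever other modules you normally use in the list):

```
# subnetcache removed from the module list
module-config: "validator iterator"
# and no zones configured to receive ECS
# client-subnet-zone: example.com
```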

Best regards, Wouter
Comment 7 Andrei Lukovenko 2019-01-15 16:13:50 CET
In our chart you can see it growing up to 8 GiB, which is way beyond the configured value (even taking into account a x2.5-x3 allocator overhead).

To answer your question: msg.cache.count, rrset.cache.count, mem.cache.rrset and mem.cache.message all behave similarly and do not show 'infinite growth'. I suspect the leaked memory is not accounted for in these metrics.

I will try to run valgrind and update you with results.
Comment 8 Andrei Lukovenko 2019-01-15 18:39:19 CET
Hi Wouter,

Switching off EDNS client subnet does help (it removes the leak or significantly diminishes it), so you are probably correct about subnetcache.

I also tried valgrind, but with a short run there were no significant leaks, and with a long run under stress - all requests were timing out (because, as you mentioned, valgrind is slow).

On a side note: how hard would it be to also expose metrics (key count, memory consumption) for the neg-cache and the 'specialized cache' for EDNS (as the documentation calls it)?

Let me know if you want me to run some other tests.
Comment 9 Andrei Lukovenko 2019-01-15 18:40:25 CET
One more thing: my colleague was able to reproduce the same issue on 1.8.3 as well.
Comment 10 Ralph Dolmans 2019-01-22 15:19:10 CET
Hi Andrei,

I am trying to reproduce the problem and find its cause, but I have not had much luck so far.

To help me narrow it down: does the authoritative server that answers for all example.com names answer with an ECS option? And if so, do the options differ from each other, maybe due to random source addresses in the queries?

-- Ralph
Comment 11 Andrei Lukovenko 2019-01-23 12:25:02 CET
Hi Ralph,

That is a fair question. We are running more tests and will get back to you soon; at the moment there are too many moving parts, which makes it hard to reproduce. We still see these leaks in our production system, so the issue is real; we are now trying to get down to a minimal configuration that will reliably reproduce the leak.
Comment 12 leoleovich 2019-01-24 11:49:21 CET
Created attachment 555 [details]
Unbound leak without ECS on domain (no leak)

12 hours of intense testing on the following configuration:

# ECS module enabled:
module-config: "subnetcache validator iterator"
# ECS is not enabled on our testing domain
#client-subnet-zone: example.com
# ECS is enabled on some other domain, but we do not query it
client-subnet-zone: some_other_domain_we_do_not_touch.com

What you see on this graph:
* Caches fill up and stay stable (msg and rrset correlate)
* Memory holds steady
* The mem.mod.subnet counter is next to nothing.

This configuration appears to work stably, with no memory leak observed.
Comment 13 leoleovich 2019-01-24 12:00:59 CET
Created attachment 556 [details]
Unbound leak with ECS on domain (leak)

24 hours of intense testing on the following configuration:

# ECS module enabled:
module-config: "subnetcache validator iterator"
# ECS is enabled on our testing domain
client-subnet-zone: example.com

What you see on this graph:
* msg and rrset caches are nearly empty (~20 records)
* Memory consumption is much smaller (because the main caches are not used). However, memory consumption grows from 2.4 GiB at the beginning to 3.1 GiB at the end (a ~700 MiB difference)
* Mem.mod.subnet shows a constant value of ~670 MiB

The growth continues steadily (I'll keep the test running to see how it behaves over a longer period of time).
Comment 14 leoleovich 2019-01-24 12:12:38 CET
Hello unbound team,
I've submitted 2 different graphs for 2 different test cases.
In both tests unbound queries an authoritative server (tinydns) and gets the following response:
```
16463.example.com. 86400 IN A 10.0.0.1
```
It is defined on the tinydns side in the following format (no maps, nothing, just a plain A record):
```
+*.example.com,10.0.0.1,86400
```
The load test client does not know anything about ECS. It sends normal lookup queries with random prefixes. Unbound is configured in 2 different ways to represent the test cases:
1) The ECS module is enabled, but unbound neither forwards nor injects ECS info for the test domain.

2) The ECS module is enabled. Unbound adds ECS info if it is missing (it is always missing, as the load test client does not support ECS).

What we see is a memory leak in the 2nd case (see the attached graphs and the comments on them).
In production we have both cases (a full main cache and ECS) and many more domains configured for ECS injection. This leads to high memory consumption and constant growth.

Thank you very much.
Comment 15 leoleovich 2019-01-24 12:18:12 CET
Hello unbound team,
I've submitted 2 different graphs for 2 different test cases.
In both tests unbound queries an authoritative server (tinydns) and gets the following response:
```
16463.example.com. 86400 IN A 10.0.0.1
```
It is defined on the tinydns side in the following format (no maps, nothing, just a plain A record):
```
+*.example.com,10.0.0.1,86400
```
The load test client does not know anything about ECS. It sends normal lookup queries with random prefixes for 30 minutes and then sends a SIGHUP (this helps trigger the leak):
```
cat hit_unbound.sh
#!/bin/bash
while true
do
	<loadtest tool. Query $RANDOM.example.com for 30 minutes>
	unbound-control reload
	sleep 25 # To avoid servfails
done
```

Unbound is configured in 2 different ways to represent test cases:
1) The ECS module is enabled, but unbound neither forwards nor injects ECS info for the test domain.

2) The ECS module is enabled. Unbound adds ECS info if it is missing (it is always missing, as the load test client does not support ECS).

What we see is a memory leak in the 2nd case (see the attached graphs and the comments on them).
In production we have both cases (a full main cache and ECS) and many more domains configured for ECS injection. This leads to high memory consumption and constant growth.

Thank you very much.
Comment 16 leoleovich 2019-01-28 12:04:01 CET
It seems that the repro we have "leaks" for 1 day and then stops. We will keep working on a proper repro, but we certainly see the leak in production.
Comment 17 leoleovich 2019-02-01 14:48:56 CET
Created attachment 557 [details]
Unbound leak in prod

This is what we see on a roughly weekly cycle in production (a random server; the same pattern is everywhere). It takes ~9 days from an unbound restart for memory to fill up and the process to get OOM-killed.
What you can see on this graph is:
Green - SIGHUP events
Orange - rrset cache count (you can see it goes down on every SIGHUP)
Blue - RSS
You can see that when there is no SIGHUP for a while, RSS stays flat together with the cache. Then a SIGHUP comes (green) and memory bumps up a bit (or even stays the same); then, as a new portion of data fills the cache, RSS grows. Eventually this leads to OOM.

We are struggling to reproduce it reliably, but the problem is definitely there even in 1.8.3.
Comment 18 leoleovich 2019-02-02 17:49:10 CET
The same behaviour is seen on hosts that do not perform internet recursion (only local stub zones).
Comment 19 leoleovich 2019-03-27 11:50:28 CET
Created attachment 570 [details]
Unbound SIGHUP only
Comment 20 leoleovich 2019-03-27 11:51:06 CET
Created attachment 571 [details]
Unbound SIGHUP+flush infra
Comment 21 leoleovich 2019-03-27 12:01:31 CET
Hello Unbound team,
From what we see, the leak is somewhere in the infra cache subsystem.
It is fair to say that we have an unusually large config file (close to 100k lines) which mostly consists of stub-zones:
  stub-zone:
    name: "internal1.domain."
    stub-addr: AuthIP1
    stub-addr: AuthIP2

  stub-zone:
    name: "internal2.domain."
    stub-addr: AuthIP3
    stub-addr: AuthIP4

  stub-zone:
    name: "internal3.domain."
    stub-addr: AuthIP1
    stub-addr: AuthIP2

As soon as we add "unbound-control flush_infra all" before sending the reload signal, the leak seems to be gone (see the 2 attachments).

If you take a look at the "Unbound SIGHUP only" graph you can see that the infrastructure cache is not flushed during the reload (its counters stay at their maximum) and less memory is freed, whereas on "Unbound SIGHUP+flush infra" you can see the infrastructure cache being flushed and memory being freed.
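For completeness, the reload wrapper now does roughly this (sketch):

```
#!/bin/bash
# flush the infra cache first, then reload; with this in place the
# memory growth described above no longer shows up
unbound-control flush_infra all
unbound-control reload
```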
Comment 22 Wouter Wijngaards 2019-03-27 14:40:55 CET
Hi leoleovich,

The memory in use by the infra cache is the entry count times 310 bytes. That should be the actual memory size in use. And the count stabilizes at something fairly close to the 1 million you configured.

Infra cache items are all the same size, except that the zone name can be longer or shorter: 296 bytes of record data plus an average of 14 bytes for the name gives the 310 bytes above.

If 'flush_infra all' removes the items, they must be in the hosts slabhash of the infra cache. The counter for the size of the cache in bytes uses the same memory-in-use counter code as the other caches, e.g. RRset and message. So I wonder what could make the infra cache size accounting go wrong.

Best regards, Wouter
Comment 23 leoleovich 2019-03-27 14:51:00 CET
Updated the unbound version in the bug description, as we see this on unbound 1.9.1 too.
Comment 24 Wouter Wijngaards 2019-03-27 15:13:08 CET
Hi leoleovich,

Perhaps a silly question, but I would like to be sure. Thanks for updating the version number, by the way. So, if the infra cache memory should be about 310 million bytes, roughly 310 MB, that could grow with x2 allocator overhead to about 620 MB. I wonder if that is simply what is happening? 1 million is a large number; perhaps the cache is still filling up and then getting its x2 overhead in the memory allocator?

I cannot really tell; it depends on the numbers behind the graph lines. E.g. what is the infra cache count when full, and is the total memory growth once it is full actually more than that would explain?

If so, run with a much lower infra cache count. That may show the problem earlier, or even stabilize the infra cache memory use (or the part of memory use caused by the infra cache).
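For example (value purely illustrative, much lower than the current setting):

```
# unbound.conf
infra-cache-numhosts: 10000
```

The fill level can then be watched via the infra.cache.count line in the unbound-control stats output.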

Earlier versions of unbound stored infra entries per IP, but newer versions have entries per IP and zone.

Because the memory is freed when the infra cache is deleted, it is not really a leak, but somehow too much memory has accumulated; that is not impossible. However, I also do not see bugs in the code from this angle.

Best regards, Wouter
Comment 25 leoleovich 2019-03-27 15:41:09 CET
Hi Wouter,
I tried lowering the infra cache by 10x and growing it by 10x, with similar results. If you have a look at the "Unbound SIGHUP only" graph you will see that the infra cache fills to 1M entries within 14 hours and stays like that for the rest of the lifetime (see the tooltip on the right) until the OOM comes.
Then reloads come (green spikes on the same graph) and the rrset cache gets flushed; the infra cache (according to the graphs) stays the same.

We have a lot more memory available, and this growth continues indefinitely until the OOM comes. But again, the growth happens after "reload" is called.
And if we flush the infra cache together with the reload, the issue seems to be gone (see the second graph, Unbound SIGHUP+flush infra).

We only see symptoms here, so it is hard to say why exactly this is happening. I have a feeling the reload logic marks the infra cache as free but neither flushes the counters nor frees the memory, or something in that area.
Comment 26 Wouter Wijngaards 2019-03-27 16:05:18 CET
Hi Leoleovic,

You are right that the reload does not delete the infra cache.  And the flush deletes the cache.

So the reload uses more memory, but it should not. The memory is in the infra cache, which keeps the same number of items, all of the same size.

If you set the number of slabs to 1, then the infra cache has only one array for its hash table. The array itself also grows, and it could grow to 1 million x 8 bytes = 8 MB if the hash is one-sided. That is not nearly enough to explain the amount you are talking about.

So the fact that the reload does not delete the cache is just how the code is written; I wanted to reuse the infra entries after a reload.

But that should not increase the amount of memory used by the infra cache; nothing is really done to the infra entries during a reload.

Best regards, Wouter
Comment 27 leoleovich 2019-03-29 18:08:23 CET
Created attachment 572 [details]
Lower infra cache experiment

In this experiment I lowered infra-cache-numhosts 100x (from 10M to 10k) between Feb 3 and Feb 10.
As you can see, the memory-leak phase slowed down, but it was still "infinite growth".
Comment 28 leoleovich 2019-03-29 18:12:06 CET
Hi Wouter,
I found graphs from my old experiment, where we lowered the value of infra-cache-numhosts. We were using only reload back then (no infra flush).
From my point of view it slowed down the leak, which is a good indicator that the leak is somewhere in the reload+infra counters codepath.
Thanks a lot
Comment 29 leoleovich 2019-04-29 13:29:57 CEST
I am still experimenting with different combinations, and it seems like `unbound-control flush_infra all` is the only remediation that helps.
I was thinking that maybe it is not "flush_infra" itself, but the fact that we have a longer pause, or that we involve the flush logic in general...

I tried reload +

"flush_bogus",
"flush_negative",
"flush_requestlist",
"flush_stats",

It did not help (OOM came back).

With reload + flush_infra it works fine and we have "flat" graphs and not a single OOM.
Before we had up to 25% of the fleet OOMing every day.
Comment 30 leoleovich 2019-07-10 13:28:56 CEST
Hi team,
Do you intend to proceed with this? Our mitigation of flushing the infra cache helps, but it is a workaround that masks a possibly critical bug.
I am more than happy to provide any extra details you need to reproduce this behaviour.
Thank you very much.
Comment 31 Ralph Dolmans 2019-07-11 16:30:02 CEST
Hi leoleovich,

We definitely intend to proceed with this issue. Multiple developers are looking into it right now, but we still cannot completely explain the observed behaviour. I also still fail to understand the relation between the infra and ECS caches.

One request: would you be able to add the mem.mod.subnet statistic to your last two plots?

Thanks,
-- Ralph