Bug 4149 - DNS over TLS generates memory leak with OpenSSL 1.0.2p
DNS over TLS generates memory leak with OpenSSL 1.0.2p
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.7.3
All Linux
: P5 normal
Assigned To: unbound team
: 4157 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-09-06 03:35 CEST by Eric Luehrsen
Modified: 2018-09-28 05:26 CEST (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Luehrsen 2018-09-06 03:35:40 CEST
The issue is in the discovery phase right now, but looking for insight. OpenWrt users are reporting system crashes due to running out of RAM when they enable TLS over DNS in Unbound. Default builds are with OpenSSL 1.0.2p. OpenWrt is attempting to move to OpenSSL 1.1.0, but it is still held up due to the multitude of packages to port. It is not clear if this is an API misuse, OpenSSL bug, some undocumented requirement (missing cleanup API call), or something else.

See: https://github.com/openwrt/packages/issues/6925
See: https://github.com/openwrt/packages/issues/6957

---example forward---
forward-zone:
  name: .
  forward-addr: 1.0.0.1@853
  forward-first: no
  forward-tls-upstream: yes

---example stats---
root@OpenWrt:~# unbound-control stats_noreset
thread0.num.queries=158407
thread0.num.queries_ip_ratelimited=0
thread0.num.cachehits=15381
thread0.num.cachemiss=143026
thread0.num.prefetch=0
thread0.num.zero_ttl=0
thread0.num.recursivereplies=143025
thread0.requestlist.avg=0.536707
thread0.requestlist.max=11
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
thread0.requestlist.current.all=1
thread0.requestlist.current.user=1
thread0.recursion.time.avg=0.024748
thread0.recursion.time.median=0.0290863
thread0.tcpusage=0
total.num.queries=158407
total.num.queries_ip_ratelimited=0
total.num.cachehits=15381
total.num.cachemiss=143026
total.num.prefetch=0
total.num.zero_ttl=0
total.num.recursivereplies=143025
total.requestlist.avg=0.536707
total.requestlist.max=11
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=1
total.requestlist.current.user=1
total.recursion.time.avg=0.024748
total.recursion.time.median=0.0290863
total.tcpusage=0
time.now=1536183628.523561
time.up=1289732.424684
time.elapsed=1289732.424684
mem.cache.rrset=1710898
mem.cache.message=1272729
mem.mod.iterator=16504
mem.mod.validator=0
mem.mod.respip=0
histogram.000000.000000.to.000000.000001=3748
histogram.000000.000001.to.000000.000002=0
histogram.000000.000002.to.000000.000004=0
histogram.000000.000004.to.000000.000008=0
histogram.000000.000008.to.000000.000016=0
histogram.000000.000016.to.000000.000032=0
histogram.000000.000032.to.000000.000064=1
histogram.000000.000064.to.000000.000128=150
histogram.000000.000128.to.000000.000256=155
histogram.000000.000256.to.000000.000512=259
histogram.000000.000512.to.000000.001024=400
histogram.000000.001024.to.000000.002048=595
histogram.000000.002048.to.000000.004096=1501
histogram.000000.004096.to.000000.008192=9296
histogram.000000.008192.to.000000.016384=1960
histogram.000000.016384.to.000000.032768=68939
histogram.000000.032768.to.000000.065536=34762
histogram.000000.065536.to.000000.131072=14711
histogram.000000.131072.to.000000.262144=3451
histogram.000000.262144.to.000000.524288=1787
histogram.000000.524288.to.000001.000000=903
histogram.000001.000000.to.000002.000000=270
histogram.000002.000000.to.000004.000000=98
histogram.000004.000000.to.000008.000000=20
histogram.000008.000000.to.000016.000000=15
histogram.000016.000000.to.000032.000000=4
histogram.000032.000000.to.000064.000000=0
histogram.000064.000000.to.000128.000000=0
histogram.000128.000000.to.000256.000000=0
histogram.000256.000000.to.000512.000000=0
histogram.000512.000000.to.001024.000000=0
histogram.001024.000000.to.002048.000000=0
histogram.002048.000000.to.004096.000000=0
histogram.004096.000000.to.008192.000000=0
histogram.008192.000000.to.016384.000000=0
histogram.016384.000000.to.032768.000000=0
histogram.032768.000000.to.065536.000000=0
histogram.065536.000000.to.131072.000000=0
histogram.131072.000000.to.262144.000000=0
histogram.262144.000000.to.524288.000000=0
num.query.type.A=132061
num.query.type.PTR=4393
num.query.type.MX=1
num.query.type.TXT=1774
num.query.type.AAAA=20040
num.query.type.SRV=132
num.query.type.DS=3
num.query.type.DNSKEY=3
num.query.class.IN=158407
num.query.opcode.QUERY=158407
num.query.tcp=0
num.query.tcpout=157574
num.query.ipv6=0
num.query.flags.QR=0
num.query.flags.AA=0
num.query.flags.TC=0
num.query.flags.RD=158407
num.query.flags.RA=0
num.query.flags.Z=0
num.query.flags.AD=0
num.query.flags.CD=6
num.query.edns.present=1741
num.query.edns.DO=6
num.answer.rcode.NOERROR=154408
num.answer.rcode.FORMERR=0
num.answer.rcode.SERVFAIL=938
num.answer.rcode.NXDOMAIN=3060
num.answer.rcode.NOTIMPL=0
num.answer.rcode.REFUSED=0
num.answer.rcode.nodata=17092
num.query.ratelimited=0
num.answer.secure=0
num.answer.bogus=0
num.rrset.bogus=0
num.query.aggressive.NOERROR=0
num.query.aggressive.NXDOMAIN=0
unwanted.queries=0
unwanted.replies=0
msg.cache.count=6741
rrset.cache.count=7397
infra.cache.count=1
key.cache.count=0
num.query.authzone.up=0
num.query.authzone.down=0
Comment 1 Eric Luehrsen 2018-09-06 03:44:09 CEST
I missed copying the error log in initial report. "lib(7):func(100):reason(65)" can be decoded to buffer resizing in "openssl/crypto/buffer/buffer.c: int BUF_MEM_grow(BUF_MEM *str, size_t len);"

---OpenSSL Error---
Sat Sep  1 17:47:29 2018 daemon.notice unbound: [24512:0] notice: ssl handshake failed 1.0.0.1 port 853
Sat Sep  1 17:47:29 2018 daemon.err unbound: [24512:0] error: ssl handshake failed crypto error:07064041:lib(7):func(100):reason(65)
Comment 2 Wouter Wijngaards 2018-09-06 09:44:35 CEST
Hi Eric,

With 1.1.0 I have no memory leaks. With 1.0.2p, I can see that, even although I call the same cleanup routines (eg. SSL_CTX_free and so on) at the exit of unbound, memory is not released (according to valgrind) at exit.  It is reporting memory not freed on a context but I am sure I am calling the free routine.

Openssl, however, uses reference counting.  Somewhere a reference count could be wrong, and that could cause a memory leak.  Not sure why that would not be present in 1.1.0; because Unbound calls the same routines (apart from SSL_set1_host that does not exist in 1.0.2p, but I did not think that increased a reference count).

The issue does not exist for you in 1.1.0?

Best regards, Wouter
Comment 3 Wouter Wijngaards 2018-09-06 10:05:47 CEST
Hi Eric,

Okay, so what I can see with ssl reference count debug enabled, and valgrind.  Is that when I free the connect_sslctx twice, it indeed tries to free the ssl ctx twice.  But in the valgrind printout some elements of what was created for that context are not free on exit.  I think that could be a reference count thing, but on those elements.  So, the SSL_CTX_free call works, and free's most of the stuff, but not some stuff; not sure where the reference counts are wrong.

Best regards, Wouter
Comment 4 Wouter Wijngaards 2018-09-06 10:41:43 CEST
Hi Eric,

So I see unfreed memory at exit, that is allocated in the routines load_gost_id, in the routine that loads the pems from the cert file, and unfreed memory in the error string thread array.  They do not grow if I connect multiple times to the upstream TLS.

This is with 1.0.2p

Because it does not grow, it is not a leak that grows.  It should not drive openWRT users out of memory.  Some cleanup that is missing or reference counts that are wrong, but they do not hamper unbound; those allocations are made only once.

If openWRT users are running out of memory, it must be lost somewhere.  I am not sure where.  A test with 1.0.2p in debug mode and valgrind and TLS lookups did not reveal the information.  Do you have any other information where the memory could be lost?

You have a statistics printout, does this mean you think the memory lost is counted in those statistics?  Unbound doesn't count TLS from openssl in that (I believe), so if so, that must be some sort of growing cache?  Cache sizes are configurable, but due to malloc overheads, expect usage to grow to about double the size you configure in unbound.conf.

Best regards, Wouter
Comment 5 Eric Luehrsen 2018-09-07 02:18:54 CEST
Hi Wouter,

I do not think this is basic cache behavior. Unbound is well behaved on default cache settings, with and without DNSSEC, with and without basic Forwarding (no TLS). Basic operation uses 14MB within an hour and stays short of 20MB after many weeks. If TLS is off, then it appears quite stable. If TLS forwarding is enabled, then Unbound grows quickly. I see 40+ MB in a few days. OpenWrt most often finds itself on hardware with a total of 128MB or 256MB.

The example statistics are just for reference from one user report. The cache memory is only a few MB total. Even with malloc() the gap in useful memory and consumed memory doesn't make sense.

Thanks,
Eric
Comment 6 Wouter Wijngaards 2018-09-07 16:11:21 CEST
Hi,

If 1.1.0 does not have the bug; unbound supports a compilation mode where you give --with-ssl=directory and there is an include/ and lib/ for openssl.  Then unbound can compile, also statically compile, against openssl.  Different from the rest of the system.

This is what I use for testing different openssl versions.  And I think may also be what FreeBSD uses, but their different openssl versions in system base and ports does give the occasional link problem.

But compiling unbound with 1.1.0 can then be used as a workaround; that is, if 1.1.0 does not have the same issues.  For Unbound it does not make a difference, it makes (roughly) the same calls to openssl, but there are small differences, eg. sethostname for hostnameverification.

Best regards, Wouter
Comment 7 Eric Luehrsen 2018-09-11 06:08:52 CEST
Due to complexity in building the first try was to go back to OpenSSL 1.0.2o for a few days. This still repeats the error.
Comment 8 Eric Luehrsen 2018-09-12 02:34:45 CEST
OpenSSL 1.1.0i repeats the error so it looks like Unbound unless it is an API used as documented but documentation doesn't meet intent.
Comment 9 Eric Luehrsen 2018-09-13 05:45:45 CEST
Hi Wouter,
There was an ambitious run on OpenWrt linked with libressl 2.7.4 also. The leaks persist. OpenWrt has only two patches for Unbound 1.7.3 (1) Unbound.conf defaults to "low memory" suggestion in documentation and (2) the TLS name validation log warning for OpenSSL 1.0.2 (now part of base in Unbound 1.8.0). It appears that it is a leak in Unbound itself.
I know its frustrating, so thanks for looking into this.
Eric
(see updates https://github.com/openwrt/packages/issues/6957)
Comment 10 Wouter Wijngaards 2018-09-17 11:55:31 CEST
Hi Eric,

Because the openssl version does not matter, perhaps it is something else.  Could you be out of memory due to a growing large requestlist.  This is the number of average request list size and current request list size that gets reported in the statistics.  Every request uses a couple of K of memory, so it can add up to a large amount of memory.

One way to stop that is to lower the total max amount of requests you allow in unbound. The long list is likely with older elements that are not responding, but unbound is still trying them.  Unbound will replace such old elements with a new incoming query if the requestlist gets full.

But this sounds unlikely again, because with TLS people usually have a CDN upstream and they respond quickly.  Is the requestlist size perhaps the cause of that memory use?

Best regards, Wouter
Comment 11 Wouter Wijngaards 2018-09-17 13:29:52 CEST
Hi Eric,

Here a patch that should stop certain query states from "leaking", this is where they stop in a state with an empty list.  That grows the query state list.  It does not grow forever, but perhaps a bunch of Mb, and could be your problem.

After growing forever, I think it would get overwritten by new queries.  So not the end of the world, but for OpenWRT maybe not so nice.  If that happens to be the problem you are facing.  In any case, because I just found the bug, here is the patch for it.  I hope it can help out.

Best regards, Wouter

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c (revision 4911)
+++ iterator/iterator.c (working copy)
@@ -2752,6 +2752,12 @@
                                                verbose(VERB_ALGO,
                                                "could not validate NXDOMAIN "
                                                "response");
+                                       outbound_list_clear(&iq->outlist);
+                                       iq->num_current_queries = 0;
+                                       fptr_ok(fptr_whitelist_modenv_detach_subs(
+                                               qstate->env->detach_subs));
+                                       (*qstate->env->detach_subs)(qstate);
+                                       iq->num_target_queries = 0;
                                }
                        }
                        return next_state(iq, QUERYTARGETS_STATE);
Index: services/outside_network.c
===================================================================
--- services/outside_network.c  (revision 4911)
+++ services/outside_network.c  (working copy)
@@ -1979,7 +1979,7 @@
                        return 0;
                }
                if(rto >= RTT_MAX_TIMEOUT) {
-                       fallback_tcp = 1;
+                       /* fallback_tcp = 1; */
                        /* UDP does not work, fallback to TCP below */
                } else {
                        serviced_callbacks(sq, NETEVENT_TIMEOUT, c, rep);
Comment 12 Eric Luehrsen 2018-09-19 03:02:09 CEST
I was hopeful, but this patch did not reduce memory consumption. Normal recursion has response times in a tight pack around about 100 ms. TLS forwarding (1.1.1.1, 9.9.9.9) is a flatter a-normal distribution with many responses upto 1 s; probably it is handshaking time. This made it seem likely on target.
Comment 13 Wouter Wijngaards 2018-09-21 13:14:13 CEST
*** Bug 4157 has been marked as a duplicate of this bug. ***
Comment 14 Wouter Wijngaards 2018-09-21 13:18:38 CEST
Hi, 

So another report was marked as a duplicate of this, and it looks like it is exactly that.  I asked if he could run with valgrind, maybe you can, or another memory checker.  When I run with valgrind I have no memory leaks visible.

I ran with --disable-gost, because that removes from missing cleanup on exit.  I also ran with openssl 1.0.2p and 1.1.1.  With 1.1.1 everything is clean also after a number of queries.  1.0.2p has missing cleanups on exit, but the amount of memory does not grow after a number of queries.

So I cannot find it?  The guess on the requestlist was not true, but could have been related.  How to track down this bug.  The memory is not lost in the caches, cache counts are normal.  The memory is not lost in requestlist size (the other report has a printout that shows that it is not, and you tested with the fix and found it did not).  The memory must be used by something, but not something that I have when I run it and not cache or requestlist?

Best regards, Wouter
Comment 15 Eric Luehrsen 2018-09-21 14:09:46 CEST
Hi Wouter,
Yeah. I have struggled to walk the code to find it. It probably isn't a pure leak. A valid pointer is probably still pointing at allocated memory. That pointer would be in a large list or queue. Logically, the process forgets it was there and use another pointer location in the list. Something like each public key exchange risks allocating a new valid entry. Now we just need to hope to find such a pattern. Logic errors are a pain.
Have a good weekend
Eric
Comment 16 Eric Luehrsen 2018-09-25 05:43:21 CEST
For elimination, I have tried using (1.1.1.1, 9.9.9.9) forwarding with TCP only but not TLS. No abnormal growth in memory consumption appears even with forced long list of look ups, or just leaving it running. If I enable TLS, then its easy and quick to double Unbound overall size on the system. In other news this continues with Unbound 1.8.0. Abbreviated example of TCP only.

server:
...
  tcp-upstream: yes

forward-zone:
  name: .
  forward-addr: 1.1.1.1
  forward-addr: 1.0.0.1
  forward-addr: 9.9.9.9
  forward-first: no
  forward-tls-upstream: no
Comment 17 Wouter Wijngaards 2018-09-25 11:08:59 CEST
Hi Eric,

Here is the code with extra cleanup for the SSL state in case there is a TCP timeout event.  Does this mitigate or solve the issue?

Best regards, Wouter

Index: services/outside_network.c
===================================================================
--- services/outside_network.c	(revision 4914)
+++ services/outside_network.c	(working copy)
@@ -377,6 +379,8 @@
                         if(!SSL_set1_host(pend->c->ssl, w->tls_auth_name)) {
                                 log_err("SSL_set1_host failed");
 				pend->c->fd = s;
+				SSL_free(pend->c->ssl);
+				pend->c->ssl = NULL;
 				comm_point_close(pend->c);
 				return 0;
 			}
@@ -1264,6 +1268,13 @@
 	} else {
 		/* it was in use */
 		struct pending_tcp* pend=(struct pending_tcp*)w->next_waiting;
+		if(pend->c->ssl) {
+#ifdef HAVE_SSL
+			SSL_shutdown(pend->c->ssl);
+			SSL_free(pend->c->ssl);
+			pend->c->ssl = NULL;
+#endif
+		}
 		comm_point_close(pend->c);
 		pend->query = NULL;
 		pend->next_free = outnet->tcp_free;
Comment 18 Eric Luehrsen 2018-09-28 05:26:21 CEST
Hi Wouter,
I have pulled both commits as patch files from Unbound trunk and tested. The leak is sealed. Thank you.
Eric