Bug 4227 - Unbound 1.9 crashes intermittently
Unbound 1.9 crashes intermittently
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.9.0
x86_64 Linux
: P5 normal
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2019-02-25 16:25 CET by Isaac McDonald
Modified: 2020-11-03 12:38 CET (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Isaac McDonald 2019-02-25 16:25:58 CET
Hello,

Unbound 1.9 is crashing intermittently on two separate CentOS 7 servers. These servers were previously running Unbound 1.8.0 without issue. The error is:  vmap.c:367: Assertion nread >= 0 failed in evmap_io_del

If I can provide additional debugging information please let me know.

The Unbound configuration is as follows:


server:
        verbosity: 1
        statistics-interval: 0
        statistics-cumulative: yes
        extended-statistics: yes
        ratelimit: 1000
        ratelimit-size: 4m
        ratelimit-slabs: 8
        ratelimit-factor: 10
        num-threads: 8
        interface: 216.187.165.2
        interface-automatic: no
        msg-cache-size: 200m
        msg-cache-slabs: 8
        rrset-cache-size: 400m
        rrset-cache-slabs: 8
        infra-cache-slabs: 8
        access-control: 216.187.128.0/18 allow_snoop
        access-control: 204.130.255.0/24 allow_snoop
        access-control: 64.122.164.0/24 allow_snoop
        access-control: 192.168.0.0/16 allow_snoop
        access-control: 172.16.0.0/12 allow_snoop
        access-control: 10.0.0.0/8 allow_snoop
        chroot: ""
        username: "unbound"
        directory: "/etc/unbound"
        logfile: "/var/log/unbound/unbound.log"
        log-time-ascii: yes
        pidfile: "/var/run/unbound/unbound.pid"
        hide-version: yes
        harden-glue: yes
        harden-dnssec-stripped: yes
        harden-below-nxdomain: yes
        harden-referral-path: yes
        use-caps-for-id: no
        unwanted-reply-threshold: 10000000
        prefetch: yes
        prefetch-key: yes
        rrset-roundrobin: yes
        minimal-responses: yes
        serve-expired: yes
        trusted-keys-file: /etc/unbound/keys.d/*.key
        auto-trust-anchor-file: "/var/lib/unbound/root.key"
        val-clean-additional: yes
        val-permissive-mode: no
        val-log-level: 1
        key-cache-size: 85m
        key-cache-slabs: 8
        include: /etc/unbound/local.d/*.conf
        interface: 216.187.165.2@853
        tls-service-key: "/etc/pki/tls/private/example.com.key"
        tls-service-pem: "/etc/pki/tls/certs/example.com.crt"
        tls-port: 853
        aggressive-nsec: yes
        qname-minimisation: yes
        so-reuseport: yes

remote-control:
        control-enable: yes
        control-interface: 172.31.142.251
        control-port: 8953
        server-key-file: "/etc/unbound/unbound_server.key"
        server-cert-file: "/etc/unbound/unbound_server.pem"
        control-key-file: "/etc/unbound/unbound_control.key"
        control-cert-file: "/etc/unbound/unbound_control.pem"

include: /etc/unbound/conf.d/*.conf
Comment 1 Wouter Wijngaards 2019-02-25 16:50:22 CET
Hi Imcdona,

I have a fix in the code repository.  But, it is based on the assertion you list not on replicating the bug itself, so you can update from the code repository and then see if that fixes the bug?

I paired the event delete and add for libevent.  The new out of order processing in 1.9.0 had this imbalanced.

Best regards, Wouter

Index: services/listen_dnsport.c
===================================================================
--- services/listen_dnsport.c	(revision 5121)
+++ services/listen_dnsport.c	(working copy)
@@ -1636,10 +1636,12 @@
 	
 	if(wr) {
 		req->cp->tcp_is_reading = 0;
+		comm_point_stop_listening(req->cp);
 		comm_point_start_listening(req->cp, -1,
 			req->cp->tcp_timeout_msec);
 	} else if(rd) {
 		req->cp->tcp_is_reading = 1;
+		comm_point_stop_listening(req->cp);
 		comm_point_start_listening(req->cp, -1,
 			req->cp->tcp_timeout_msec);
 		/* and also read it (from SSL stack buffers), so
@@ -1647,6 +1649,7 @@
 		 * the TLS frame is sitting in the buffers. */
 		req->read_again = 1;
 	} else {
+		comm_point_stop_listening(req->cp);
 		comm_point_start_listening(req->cp, -1,
 			req->cp->tcp_timeout_msec);
 		comm_point_listen_for_rw(req->cp, 0, 0);
@@ -1759,6 +1762,7 @@
 		 * clear to write to */
 	send_it:
 		c->tcp_is_reading = 0;
+		comm_point_stop_listening(c);
 		comm_point_start_listening(c, -1, c->tcp_timeout_msec);
 		return;
 	}
Index: util/netevent.c
===================================================================
--- util/netevent.c	(revision 5121)
+++ util/netevent.c	(working copy)
@@ -989,10 +989,10 @@
 		c->tcp_is_reading = 1;
 	c->tcp_byte_count = 0;
 	/* switch from listening(write) to listening(read) */
-	comm_point_stop_listening(c);
 	if(c->tcp_req_info) {
 		tcp_req_info_handle_writedone(c->tcp_req_info);
 	} else {
+		comm_point_stop_listening(c);
 		comm_point_start_listening(c, -1, -1);
 	}
 }
@@ -1006,11 +1006,11 @@
 	if(c->tcp_do_toggle_rw)
 		c->tcp_is_reading = 0;
 	c->tcp_byte_count = 0;
-	if(c->type == comm_tcp)
-		comm_point_stop_listening(c);
 	if(c->tcp_req_info) {
 		tcp_req_info_handle_readdone(c->tcp_req_info);
 	} else {
+		if(c->type == comm_tcp)
+			comm_point_stop_listening(c);
 		fptr_ok(fptr_whitelist_comm_point(c->callback));
 		if( (*c->callback)(c, c->cb_arg, NETEVENT_NOERROR, &c->repinfo) ) {
 			comm_point_start_listening(c, -1, c->tcp_timeout_msec);
Comment 2 Isaac McDonald 2019-02-25 19:02:48 CET
Unbound was originally installed on these servers via a third party YUM repo. Compiling from the latest source isn't going to be easy. I'd like to avoid having to do that if at all possible.
Comment 3 Isaac McDonald 2019-02-25 19:40:21 CET
I spoke too soon. The maintainers of the third part repo are going to apply the patch to their latest build. I'll get back to you after I've had an opportunity to test.
Comment 4 Isaac McDonald 2019-02-28 16:08:41 CET
Unbound has been running solid since 2/25 without crashing. Your patch appears to have fixed the issue.

Thanks
Comment 5 Wouter Wijngaards 2019-02-28 16:12:15 CET
Hi Isaac,

Very good!  Thanks for the report.  I intend to move towards a production release with the fixes in there.  This makes the fixes also available for the other users.

Best regards, Wouter
Comment 6 Daniel Kahn Gillmor 2020-10-28 02:09:03 CET
Just noting that i think this has come up in debian as https://bugs.debian.org/973052 -- Wouter, if there are any other related patches to avoid this instability besides what you have in Comment 1, it'd be good to know so we can include it in a subsequent debian stable point release.
Comment 7 Wouter Wijngaards 2020-10-28 09:38:03 CET
Hi Daniel,

Yes there have been fixes for this kind of error in the mean time.   They were to reinitialize structures so that it did not mess with the internal state of the even handler.  Just upgrade, the latest release should have those fixes.

Best regards, Wouter
Comment 8 Benno Overeinder 2020-10-28 10:54:37 CET
(In reply to Wouter Wijngaards from comment #7)
> Yes there have been fixes for this kind of error in the mean time.   They
> were to reinitialize structures so that it did not mess with the internal
> state of the even handler.  Just upgrade, the latest release should have
> those fixes.

We will you send a combined patch of commits:

https://github.com/NLnetLabs/unbound/commit/348cbab016f824a336b65d0091310fe5cd58e762
https://github.com/NLnetLabs/unbound/commit/2b47ca080eb91e209fb86cd1dc90a6aff32e2a1f
5 April 2019: Wouter
       - Fix to reinit event structure for accepted TCP (and TLS) sockets.
8 April 2019: Wouter
       - Fix to use event_assign with libevent for thread-safety.

And these four for spoolbuf fixes that would be nice to have (not included in the patch): https://github.com/NLnetLabs/unbound/commit/0b77c9d6763686264d44dfd926c8cb4f2f03a43a
https://github.com/NLnetLabs/unbound/commit/6067ce6d2b82ce2e80055e578fdfd7ba3e67c523
https://github.com/NLnetLabs/unbound/commit/af6c5dea43fc63452d49b2339e607365b6652987
https://github.com/NLnetLabs/unbound/commit/a08fe8ca609b651c8d8c8379780aad508d492421

However, we recommend upgrading to Unbound 1.9.2 as this also fixes other related bugs.

Best regards,

-- Benno
Comment 9 Robert Edmonds 2020-10-28 19:25:02 CET
Hi, Benno:

We do make available newer Unbound releases for Debian stable users in the "backports" repository (currently Unbound 1.12.0) on an opt-in basis. However, the Debian release management policies and procedures make it difficult to import new upstream package versions into the Debian stable distribution. It's much easier for the Debian reviewers who need to sign off on stable updates if they only need to review targeted patches.

Based on the list of commits you provided, I prepared a rollup of the code changes in those commits targeted at 1.9.0. I also needed to apply commit cae8361dcd2809c8e266d259370c9ab8660c2c0e in order to avoid needing to manually resolve the textual conflict caused by applying 0b77c9d6763686264d44dfd926c8cb4f2f03a43a. Everything applied cleanly with small or no fuzz.
Comment 10 Benno Overeinder 2020-10-28 22:23:39 CET
Hi Robert,

Excellent, indeed forgot the back ports.  I always run Debian tests on my (non-production/my own production) machines.

Also thank you for applying the specific commits (plus the extra you mentioned) to Unbound 1.9.0.

Best,

-- Benno
Comment 11 Daniel Kahn Gillmor 2020-10-30 17:36:15 CET
While the patches identified do apply to 1.9.0, it looks like the resulting code just crashes: https://bugs.debian.org/962459 -- so i'm still not sure what we should be doing for debian's stable release yet.
Comment 12 Benno Overeinder 2020-11-03 12:38:57 CET
(In reply to Daniel Kahn Gillmor from comment #11)
> While the patches identified do apply to 1.9.0, it looks like the resulting
> code just crashes: https://bugs.debian.org/962459 -- so i'm still not sure
> what we should be doing for debian's stable release yet.

I'm sorry to hear that.  The issue was not resolved with a single bug fix, but rather made up of some code changes up to 1.9.1.  It is difficult to isolate the different patches to fix the issue without applying other code changes.

Have you also tried to include intermediate commits, as Robert pointed out in his email dated 28/10/20?


-- Benno