Bugzilla – Bug 4227
Unbound 1.9 crashes intermittently
Last modified: 2020-11-03 12:38:57 CET
Hello, Unbound 1.9 is crashing intermittently on two separate CentOS 7 servers. These servers were previously running Unbound 1.8.0 without issue. The error is: vmap.c:367: Assertion nread >= 0 failed in evmap_io_del If I can provide additional debugging information please let me know. The Unbound configuration is as follows: server: verbosity: 1 statistics-interval: 0 statistics-cumulative: yes extended-statistics: yes ratelimit: 1000 ratelimit-size: 4m ratelimit-slabs: 8 ratelimit-factor: 10 num-threads: 8 interface: 216.187.165.2 interface-automatic: no msg-cache-size: 200m msg-cache-slabs: 8 rrset-cache-size: 400m rrset-cache-slabs: 8 infra-cache-slabs: 8 access-control: 216.187.128.0/18 allow_snoop access-control: 204.130.255.0/24 allow_snoop access-control: 64.122.164.0/24 allow_snoop access-control: 192.168.0.0/16 allow_snoop access-control: 172.16.0.0/12 allow_snoop access-control: 10.0.0.0/8 allow_snoop chroot: "" username: "unbound" directory: "/etc/unbound" logfile: "/var/log/unbound/unbound.log" log-time-ascii: yes pidfile: "/var/run/unbound/unbound.pid" hide-version: yes harden-glue: yes harden-dnssec-stripped: yes harden-below-nxdomain: yes harden-referral-path: yes use-caps-for-id: no unwanted-reply-threshold: 10000000 prefetch: yes prefetch-key: yes rrset-roundrobin: yes minimal-responses: yes serve-expired: yes trusted-keys-file: /etc/unbound/keys.d/*.key auto-trust-anchor-file: "/var/lib/unbound/root.key" val-clean-additional: yes val-permissive-mode: no val-log-level: 1 key-cache-size: 85m key-cache-slabs: 8 include: /etc/unbound/local.d/*.conf interface: 216.187.165.2@853 tls-service-key: "/etc/pki/tls/private/example.com.key" tls-service-pem: "/etc/pki/tls/certs/example.com.crt" tls-port: 853 aggressive-nsec: yes qname-minimisation: yes so-reuseport: yes remote-control: control-enable: yes control-interface: 172.31.142.251 control-port: 8953 server-key-file: "/etc/unbound/unbound_server.key" server-cert-file: "/etc/unbound/unbound_server.pem" control-key-file: "/etc/unbound/unbound_control.key" control-cert-file: "/etc/unbound/unbound_control.pem" include: /etc/unbound/conf.d/*.conf
Hi Imcdona, I have a fix in the code repository. But, it is based on the assertion you list not on replicating the bug itself, so you can update from the code repository and then see if that fixes the bug? I paired the event delete and add for libevent. The new out of order processing in 1.9.0 had this imbalanced. Best regards, Wouter Index: services/listen_dnsport.c =================================================================== --- services/listen_dnsport.c (revision 5121) +++ services/listen_dnsport.c (working copy) @@ -1636,10 +1636,12 @@ if(wr) { req->cp->tcp_is_reading = 0; + comm_point_stop_listening(req->cp); comm_point_start_listening(req->cp, -1, req->cp->tcp_timeout_msec); } else if(rd) { req->cp->tcp_is_reading = 1; + comm_point_stop_listening(req->cp); comm_point_start_listening(req->cp, -1, req->cp->tcp_timeout_msec); /* and also read it (from SSL stack buffers), so @@ -1647,6 +1649,7 @@ * the TLS frame is sitting in the buffers. */ req->read_again = 1; } else { + comm_point_stop_listening(req->cp); comm_point_start_listening(req->cp, -1, req->cp->tcp_timeout_msec); comm_point_listen_for_rw(req->cp, 0, 0); @@ -1759,6 +1762,7 @@ * clear to write to */ send_it: c->tcp_is_reading = 0; + comm_point_stop_listening(c); comm_point_start_listening(c, -1, c->tcp_timeout_msec); return; } Index: util/netevent.c =================================================================== --- util/netevent.c (revision 5121) +++ util/netevent.c (working copy) @@ -989,10 +989,10 @@ c->tcp_is_reading = 1; c->tcp_byte_count = 0; /* switch from listening(write) to listening(read) */ - comm_point_stop_listening(c); if(c->tcp_req_info) { tcp_req_info_handle_writedone(c->tcp_req_info); } else { + comm_point_stop_listening(c); comm_point_start_listening(c, -1, -1); } } @@ -1006,11 +1006,11 @@ if(c->tcp_do_toggle_rw) c->tcp_is_reading = 0; c->tcp_byte_count = 0; - if(c->type == comm_tcp) - comm_point_stop_listening(c); if(c->tcp_req_info) { tcp_req_info_handle_readdone(c->tcp_req_info); } else { + if(c->type == comm_tcp) + comm_point_stop_listening(c); fptr_ok(fptr_whitelist_comm_point(c->callback)); if( (*c->callback)(c, c->cb_arg, NETEVENT_NOERROR, &c->repinfo) ) { comm_point_start_listening(c, -1, c->tcp_timeout_msec);
Unbound was originally installed on these servers via a third party YUM repo. Compiling from the latest source isn't going to be easy. I'd like to avoid having to do that if at all possible.
I spoke too soon. The maintainers of the third part repo are going to apply the patch to their latest build. I'll get back to you after I've had an opportunity to test.
Unbound has been running solid since 2/25 without crashing. Your patch appears to have fixed the issue. Thanks
Hi Isaac, Very good! Thanks for the report. I intend to move towards a production release with the fixes in there. This makes the fixes also available for the other users. Best regards, Wouter
Just noting that i think this has come up in debian as https://bugs.debian.org/973052 -- Wouter, if there are any other related patches to avoid this instability besides what you have in Comment 1, it'd be good to know so we can include it in a subsequent debian stable point release.
Hi Daniel, Yes there have been fixes for this kind of error in the mean time. They were to reinitialize structures so that it did not mess with the internal state of the even handler. Just upgrade, the latest release should have those fixes. Best regards, Wouter
(In reply to Wouter Wijngaards from comment #7) > Yes there have been fixes for this kind of error in the mean time. They > were to reinitialize structures so that it did not mess with the internal > state of the even handler. Just upgrade, the latest release should have > those fixes. We will you send a combined patch of commits: https://github.com/NLnetLabs/unbound/commit/348cbab016f824a336b65d0091310fe5cd58e762 https://github.com/NLnetLabs/unbound/commit/2b47ca080eb91e209fb86cd1dc90a6aff32e2a1f 5 April 2019: Wouter - Fix to reinit event structure for accepted TCP (and TLS) sockets. 8 April 2019: Wouter - Fix to use event_assign with libevent for thread-safety. And these four for spoolbuf fixes that would be nice to have (not included in the patch): https://github.com/NLnetLabs/unbound/commit/0b77c9d6763686264d44dfd926c8cb4f2f03a43a https://github.com/NLnetLabs/unbound/commit/6067ce6d2b82ce2e80055e578fdfd7ba3e67c523 https://github.com/NLnetLabs/unbound/commit/af6c5dea43fc63452d49b2339e607365b6652987 https://github.com/NLnetLabs/unbound/commit/a08fe8ca609b651c8d8c8379780aad508d492421 However, we recommend upgrading to Unbound 1.9.2 as this also fixes other related bugs. Best regards, -- Benno
Hi, Benno: We do make available newer Unbound releases for Debian stable users in the "backports" repository (currently Unbound 1.12.0) on an opt-in basis. However, the Debian release management policies and procedures make it difficult to import new upstream package versions into the Debian stable distribution. It's much easier for the Debian reviewers who need to sign off on stable updates if they only need to review targeted patches. Based on the list of commits you provided, I prepared a rollup of the code changes in those commits targeted at 1.9.0. I also needed to apply commit cae8361dcd2809c8e266d259370c9ab8660c2c0e in order to avoid needing to manually resolve the textual conflict caused by applying 0b77c9d6763686264d44dfd926c8cb4f2f03a43a. Everything applied cleanly with small or no fuzz.
Hi Robert, Excellent, indeed forgot the back ports. I always run Debian tests on my (non-production/my own production) machines. Also thank you for applying the specific commits (plus the extra you mentioned) to Unbound 1.9.0. Best, -- Benno
While the patches identified do apply to 1.9.0, it looks like the resulting code just crashes: https://bugs.debian.org/962459 -- so i'm still not sure what we should be doing for debian's stable release yet.
(In reply to Daniel Kahn Gillmor from comment #11) > While the patches identified do apply to 1.9.0, it looks like the resulting > code just crashes: https://bugs.debian.org/962459 -- so i'm still not sure > what we should be doing for debian's stable release yet. I'm sorry to hear that. The issue was not resolved with a single bug fix, but rather made up of some code changes up to 1.9.1. It is difficult to isolate the different patches to fix the issue without applying other code changes. Have you also tried to include intermediate commits, as Robert pointed out in his email dated 28/10/20? -- Benno