Bug 1414 - Unbound 1.6.4 and 1.6.5 segmentation faults
Unbound 1.6.4 and 1.6.5 segmentation faults
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.6.4
x86_64 Linux
: P5 major
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-08-25 05:36 CEST by Phil
Modified: 2017-08-30 17:09 CEST (History)
2 users (show)

See Also:


Attachments
Build script, config, and straces (198.97 KB, application/x-compressed-tar)
2017-08-25 05:36 CEST, Phil
Details
Build script, config, and straces (198.90 KB, application/x-compressed-tar)
2017-08-25 05:39 CEST, Phil
Details
valgrind + gdb output (3.79 KB, application/x-compressed-tar)
2017-08-28 22:13 CEST, Phil
Details
valgrind + gdb patched output (2.89 KB, application/x-compressed-tar)
2017-08-29 17:18 CEST, Phil
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Phil 2017-08-25 05:36:54 CEST
Created attachment 436 [details]
Build script, config, and straces

Servers:
    
    ```
    Centos 7: unbound 1.6.4 seg faults
    Centos 7: unbound 1.6.5 seg faults
    Fedora 26: unbound 1.6.5 seg faults
    Fedora 26: unbound 1.6.4 was not tested
    ```

Running the following command against my Centos7 unbound server and Fedora26 unbound server, I am able to cause unbound 1.6.4 and 1.6.5 to segmentation fault. 

    ```
    cat domains.txt | parallel dig @${SERVER} +tcp +additional
    ```

I've attached straces that show the time around the seg faults, the unbound config, and the compilation flags. I can reproduce this issue quickly and reliably when I have some clients with access-control set to allow and everything else as blocked as follows.
    
    ```
    access-control: x.x.x.x/32 allow
    # access-control: y.y.y.y/32 allow
    ```

Let me know if you need any more information.
Comment 1 Phil 2017-08-25 05:39:53 CEST
Created attachment 437 [details]
Build script, config, and straces
Comment 2 Wouter Wijngaards 2017-08-25 08:14:35 CEST
Hi Phil,

Can you give me valgrind output, or perhaps a gdb stacktrace of when the failure occurs?  I don't see immediate humdingers in the traces you sent or in the config.

yum install valgrind and then valgrind unbound -d ... (with -d unbound stays attached to the console and thus also to valgrind, which makes it slower but catches the segfault).

Or gdb unbound and then use 'bt' to print a stacktrace from gdb.

Best regards, Wouter
Comment 3 Phil 2017-08-28 22:11:17 CEST
Wouter,

I ran the following commands on a Centos 7 node.

```
gdb ./unbound --batch --quiet -ex "run -c /etc/unbound/unbound.conf" -ex "thread apply all bt full" -ex "quit"

valgrind --show-reachable=yes --error-limit=no --log-file=/home/phil/unbound-${VERSION}-valgrind.log ./unbound -d -c /etc/unbound/unbound.conf 
```

You'll find the results of Centos 7 + Unbound 1.6.4 and 1.6.5 attached.
Comment 4 Phil 2017-08-28 22:13:07 CEST
Created attachment 442 [details]
valgrind + gdb output
Comment 5 Wouter Wijngaards 2017-08-29 09:29:09 CEST
Hi Phil,

Here is the patch for it, I think this should work:

Index: daemon/worker.c
===================================================================
--- daemon/worker.c	(revision 4315)
+++ daemon/worker.c	(working copy)
@@ -1111,6 +1111,7 @@
 	if(!query_info_parse(&qinfo, c->buffer)) {
 		verbose(VERB_ALGO, "worker parse request: formerror.");
 		log_addr(VERB_CLIENT,"from",&repinfo->addr, repinfo->addrlen);
+		memset(&qinfo, 0, sizeof(qinfo)); /* zero qinfo.qname */
 		if(worker_err_ratelimit(worker, LDNS_RCODE_FORMERR) == -1) {
 			comm_point_drop_reply(repinfo);
 			return 0;
Index: util/data/msgreply.c
===================================================================
--- util/data/msgreply.c	(revision 4315)
+++ util/data/msgreply.c	(working copy)
@@ -840,7 +840,9 @@
 	{
 		log_info("%s - - - %s - - - ", clientip_buf, rcode_buf);
 	} else {
-		dname_str(qinf->qname, qname_buf);
+		if(qinf->qname)
+			dname_str(qinf->qname, qname_buf);
+		else	snprintf(qname_buf, sizeof(qname_buf), "null");
 		pktlen = sldns_buffer_limit(rmsg);
 		sldns_wire2str_type_buf(qinf->qtype, type_buf, sizeof(type_buf));
 		sldns_wire2str_class_buf(qinf->qclass, class_buf, sizeof(class_buf));


Best regards, Wouter
Comment 6 Phil 2017-08-29 17:18:56 CEST
Created attachment 443 [details]
valgrind + gdb patched output

Wouter,

Thanks for the patch set. I've applied it to the latest unbound 1.6.5 download with 

```
patch -R < patchset.txt
```

Unfortunately the segmentation faults still happen. I've attached new valgrind and gdb output.
Comment 7 Wouter Wijngaards 2017-08-30 09:04:05 CEST
Hi Phil,

This should fix it?

Best regards, Wouter

Index: daemon/worker.c
===================================================================
--- daemon/worker.c	(revision 4316)
+++ daemon/worker.c	(revision 4317)
@@ -1009,6 +1009,7 @@
 	struct query_info* lookup_qinfo = &qinfo;
 	struct query_info qinfo_tmp; /* placeholdoer for lookup_qinfo */
 	struct respip_client_info* cinfo = NULL, cinfo_tmp;
+	memset(&qinfo, 0, sizeof(qinfo));
 
 	if(error != NETEVENT_NOERROR) {
 		/* some bad tcp query DNS formats give these error calls */
Comment 8 Phil 2017-08-30 16:10:57 CEST
Wouter,

Your second patch, when combined with the first patch, seems to have fixed the segmentation faults! I've been hitting my test box from 6 test nodes running the following command. Each node has a unique set of 250k domains

```
cat domain_list | parallel dig @${UNBOUND_IP} +tcp +additional`
```

As of this writing, each test node has run through it's list of domains and unbound has remained stable. Previously, unbound would segfault within moments of starting the queries. 

Thanks for the help and nice work! Much appreciated over here. :)
Comment 9 Wouter Wijngaards 2017-08-30 17:09:02 CEST
Hi Phil,

Nice to hear that!  And thank you for the detailed reports!
The fixes are committed for future release.

Best regards, Wouter