Bug 1394 - mix of serve-expired and response-ip could cause a crash
mix of serve-expired and response-ip could cause a crash
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
unspecified
Other All
: P5 enhancement
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-08-02 20:28 CEST by JINMEI Tatuya
Modified: 2017-08-03 09:53 CEST (History)
2 users (show)

See Also:


Attachments
proposed patch (1.86 KB, application/octet-stream)
2017-08-02 20:28 CEST, JINMEI Tatuya
Details

Note You need to log in before you can comment on or make changes to this bug.
Description JINMEI Tatuya 2017-08-02 20:28:12 CEST
Created attachment 430 [details]
proposed patch

If I configure unbound as follows:

server:
	[some other trivial stuff]
	module-config: "respip iterator"
	serve-expired: yes
	response-ip: 192.0.2.100 redirect
	response-ip-data: 192.0.2.100 "CNAME cname.example.com."

set up an authoritative server for the example.com zone as follows:

respcname 1 IN A 192.0.2.100
cname 3600 IN A 192.0.2.1

and sends the following sequence of queries:

% dig respcname.example.com
% sleep 1
% dig respcname.example.com
% sleep 1
% dig respcname.example.com

then unbound crashed.

When unbound receives the second (and third) queries, respcname.example.com
has expired.  But since serve-expired is set to yes unbound uses it anyway.
And since the original A RDATA matches the response-ip rule, unbound replaces
it with the faked CNAME.  Because of the mixture of these, unbound first
triggers refetching (using the prefetch logic) the answer from the
authoritative server, and then looks up the cache again to complete
the CNAME chain.

Here, there are two defects:

- the cache entry lock is unlocked twice, once before calling
  reply_and_prefetch():

		lock_rw_unlock(&e->lock);
		reply_and_prefetch(worker, lookup_qinfo,
			sldns_buffer_read_u16_at(c->buffer, 2),
			repinfo, leeway);

  and when starting to chase the CNAME:

	/* We've found a partial reply ending with an
	 * alias.  [...] */
	lock_rw_unlock(&e->lock);

- reply_and_prefetch() eventually calls mesh_run() via mesh_new_prefetch(),
  and mesh_run() clears the scratchpad for the corresponding module env:

		regional_free_all(mstate->s.env->scratch);

  But this scratchpad is actually shared with the worker (worker->scratchpad),
  and the worker needs the data in the scratchpad to be valid until it
  completes the CNAME chasing.  So, depending on what kind of data is
  stored in which portion of the pad, it could cause a use-after-free
  situation with disruption such as a crash.

What happens due to the first detect would depend on the underlying rwlock
implementation, but in my local environment unbound simply crashes at
the second call to unlock.

What happens due to the second defect would depend on what kind of data
is stored in which part of the scratchpad at the time of
regional_free_all(), but in my local setup the third query causes
overridden data that is necessary to answer the query and unbound crashed.

I'm attaching a patch (for svn rev 4288) to fix these two issues.
Fixing the first one should be straightforward.  The second defect can
be fixed in various ways, but I think the safest way is to separate
the scratchpad for the worker and its module env. This will require
one more pad per worker (thread), but would be a more proactive
defense.

The patched code passed both 'make test' and 'make longtest'.
Comment 1 JINMEI Tatuya 2017-08-02 20:29:28 CEST
I forgot to mention this: while I only tested and confirmed these defects
by enabling serve-expired, I suspect the same disruption can happen due
to normal prefetch with response-ip based CNAME.
Comment 2 Wouter Wijngaards 2017-08-03 09:23:22 CEST
Hi Jinmei,

Thank you for the patch!  I have committed it.

Have you seen bug1393: problem with CNAME
https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=1393
someone has strange CNAME chain dropping off
Although right now I think it may be more qname-minimisation and not response-IP-CNAME that is interfering for him.  But if that bug report rings a bell for you, I'd be interested, because I cannot reproduce it.

Best regards, Wouter
Comment 3 Wouter Wijngaards 2017-08-03 09:53:43 CEST
Hi Jinmei,

Never mind bug1393: the user confirmed that qname-minimisation is involved (not response IP CNAME processing).  Sorry to bother you.

Best regards, Wouter