Bugzilla – Full Text Bug Listing
|Summary:||mix of serve-expired and response-ip could cause a crash|
|Product:||unbound||Reporter:||JINMEI Tatuya <jtatuya>|
|Component:||server||Assignee:||unbound team <unbound-team>|
Description JINMEI Tatuya 2017-08-02 20:28:12 CEST
Created attachment 430 [details] proposed patch If I configure unbound as follows: server: [some other trivial stuff] module-config: "respip iterator" serve-expired: yes response-ip: 192.0.2.100 redirect response-ip-data: 192.0.2.100 "CNAME cname.example.com." set up an authoritative server for the example.com zone as follows: respcname 1 IN A 192.0.2.100 cname 3600 IN A 192.0.2.1 and sends the following sequence of queries: % dig respcname.example.com % sleep 1 % dig respcname.example.com % sleep 1 % dig respcname.example.com then unbound crashed. When unbound receives the second (and third) queries, respcname.example.com has expired. But since serve-expired is set to yes unbound uses it anyway. And since the original A RDATA matches the response-ip rule, unbound replaces it with the faked CNAME. Because of the mixture of these, unbound first triggers refetching (using the prefetch logic) the answer from the authoritative server, and then looks up the cache again to complete the CNAME chain. Here, there are two defects: - the cache entry lock is unlocked twice, once before calling reply_and_prefetch(): lock_rw_unlock(&e->lock); reply_and_prefetch(worker, lookup_qinfo, sldns_buffer_read_u16_at(c->buffer, 2), repinfo, leeway); and when starting to chase the CNAME: /* We've found a partial reply ending with an * alias. [...] */ lock_rw_unlock(&e->lock); - reply_and_prefetch() eventually calls mesh_run() via mesh_new_prefetch(), and mesh_run() clears the scratchpad for the corresponding module env: regional_free_all(mstate->s.env->scratch); But this scratchpad is actually shared with the worker (worker->scratchpad), and the worker needs the data in the scratchpad to be valid until it completes the CNAME chasing. So, depending on what kind of data is stored in which portion of the pad, it could cause a use-after-free situation with disruption such as a crash. What happens due to the first detect would depend on the underlying rwlock implementation, but in my local environment unbound simply crashes at the second call to unlock. What happens due to the second defect would depend on what kind of data is stored in which part of the scratchpad at the time of regional_free_all(), but in my local setup the third query causes overridden data that is necessary to answer the query and unbound crashed. I'm attaching a patch (for svn rev 4288) to fix these two issues. Fixing the first one should be straightforward. The second defect can be fixed in various ways, but I think the safest way is to separate the scratchpad for the worker and its module env. This will require one more pad per worker (thread), but would be a more proactive defense. The patched code passed both 'make test' and 'make longtest'.
Comment 1 JINMEI Tatuya 2017-08-02 20:29:28 CEST
I forgot to mention this: while I only tested and confirmed these defects by enabling serve-expired, I suspect the same disruption can happen due to normal prefetch with response-ip based CNAME.
Comment 2 Wouter Wijngaards 2017-08-03 09:23:22 CEST
Hi Jinmei, Thank you for the patch! I have committed it. Have you seen bug1393: problem with CNAME https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=1393 someone has strange CNAME chain dropping off Although right now I think it may be more qname-minimisation and not response-IP-CNAME that is interfering for him. But if that bug report rings a bell for you, I'd be interested, because I cannot reproduce it. Best regards, Wouter