Bugzilla – Bug 1277
[ecs] SERVFAIL under load
Last modified: 2017-08-04 20:08:00 CEST
Version used: unbound 21 April 2017 (based on doc/changelog). I don't have good repro steps yet, and I have been tinkering with settings to narrow down the issue empirically. Under load (a few tens of thousands of QPS, though this may depend on hardware and on settings such as the number of threads), I started observing SERVFAIL being returned sporadically. I could narrow it down to the ecs (subnet) module, and it seems to trigger when getting a lot of replies with scopes; I am guessing this is bubbling up from the cache lookup/update path. I will try to gather some relevant logs, but high debug verbosity won't cut it here as the whole process comes to a stall. I managed to get some logs with verbosity 2, but I don't see anything relevant in there except something along these lines:

```
[1497034846] unbound[645:7] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
[1497034846] unbound[645:7] info: validator operate: query a.domain.com. A IN
[1497034846] unbound[645:7] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1497034846] unbound[645:7] info: subnet operate: query a.domain.com. A IN
[1497034846] unbound[645:7] debug: cache memory msg=331792 rrset=521739 infra=46251708 val=81030 subnet=299383
[1497034846] unbound[645:7] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
[1497034846] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
[1497034846] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1497034846] unbound[645:7] info: resolving a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: return error response SERVFAIL
```

I will also try to do more troubleshooting, but suggestions/pointers to dig further are welcome here.
I suppose this may be a more accurate paste:

```
[1497034844] unbound[645:7] info: sending query: a.domain.com. A IN
[1497034844] unbound[645:7] debug: sending to target: <.> 192.58.128.30#53
[1497034844] unbound[645:7] debug: cache memory msg=265073 rrset=271756 infra=46110226 val=66344 subnet=272608
[1497034844] unbound[645:7] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
[1497034844] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
[1497034844] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1497034844] unbound[645:7] info: resolving a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: return error response SERVFAIL
[1497034844] unbound[645:7] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
[1497034844] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1497034844] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: cache memory msg=265073 rrset=271756 infra=46110226 val=66344 subnet=272608
```

which seems to indicate that this SERVFAIL happens in the iterator.
Hi Manu,

I am looking into this but I don't know yet what is causing this issue. Do I understand correctly that you get a proper answer from the authoritative nameserver for the query that results in a SERVFAIL answer? Does it only occur after receiving answers that include ECS records? Any chance you can increase the logging verbosity?

Thanks,
-- Ralph
Thanks Ralph,

I am trying to get a repro going on my side, and also hacking on dnsperf to be able to read EDNS info from a file (it currently supports the -E option, but that is static) so as to force that code path to be exercised (see the sketch below). My attempts at reproducing the issue so far have not been fruitful, but the setup was slightly different; for instance, the servers I am sending ECS to were configured in a stub-zone rather than reached via the traditional recursive path. In the end, it may not be ECS related per se; ECS just causes more recursion to happen, and that may be where it breaks. The hosts don't show any obvious resource contention, and the auth servers I am sending ECS to are within the same rack, so latency is minimal.

> Do I understand correctly that you get a proper answer from the authoritative nameserver for the query that results in a SERVFAIL answer? Does it only occur after receiving answers that include ECS records?

This part I did not figure out; this has only reproduced under load and I did not get a full trace yet. I will try to get one and wrap my head around it to find this out.

> Any chance you can increase the logging verbosity?

When doing so, the system comes to a stall. I did run a perf top and did not see anything bad (or obviously bad). Remember when I talked to you about having a kind of key/value store for stats... that would make it easy to instrument the code and spot where the SERVFAIL is being triggered.
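For reference, this is roughly what such a dnsperf hack would have to emit per query: a minimal sketch of the RFC 7871 option encoding (illustrative only, not dnsperf code; `ecs_encode` is a made-up helper name):

```c
#include <stdint.h>
#include <string.h>

/* Encode an EDNS Client Subnet option (RFC 7871 wire format) into buf
 * and return the number of bytes written; buf must hold at least 12
 * bytes. Varying this payload per query is what defeats the cache and
 * forces the ECS code path under load. */
static size_t ecs_encode(uint8_t* buf, const uint8_t addr[4],
	uint8_t source_prefix)
{
	size_t addr_bytes = ((size_t)source_prefix + 7) / 8; /* truncated address */
	size_t p = 0;
	buf[p++] = 0; buf[p++] = 8;                         /* OPTION-CODE 8 = ECS */
	buf[p++] = 0; buf[p++] = (uint8_t)(4 + addr_bytes); /* OPTION-LENGTH */
	buf[p++] = 0; buf[p++] = 1;                         /* FAMILY 1 = IPv4 */
	buf[p++] = source_prefix;                           /* SOURCE PREFIX-LENGTH */
	buf[p++] = 0;                                       /* SCOPE must be 0 in queries */
	memcpy(buf + p, addr, addr_bytes);
	return p + addr_bytes;
}
```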
So, it seems in some cases the query makes it out and back, and sometimes SERVFAIL is returned before any query goes out; the time between the two is pretty low. Timing between client query, recursion and reply:

```
2017-06-14 17:49:10.913262 client -> unbound  Standard query 0x78d8 AAAA
2017-06-14 17:49:10.913332 unbound -> auth    Standard query 0xdd5e AAAA + ECS/source netmask/scope 0
2017-06-14 17:49:10.918947 auth -> unbound    Standard query response 0xdd5e AAAA + ECS/source netmask/scope 0 (no ECS for this domain)
2017-06-14 17:49:10.919007 unbound -> client  Standard query response 0x78d8 Server failure AAAA
```

Potentially the same behaviour occurs when there are ECS responses with a scope != 0. Also, in some cases, the SERVFAIL is replied directly, without even a query downstream. Timing with direct SERVFAIL reply:

```
2017-06-14 17:49:10.825039 client -> unbound  Standard query 0x3e8d A
2017-06-14 17:49:10.825117 unbound -> client  Standard query response 0x3e8d Server failure A
```
Also, FYI, I tested against master:

```
Fix query for refetch_glue of stub leaking to internet.
git-svn-id: http://unbound.nlnetlabs.nl/svn/trunk@4227 be551aaa-1e26-0410-a405-d3ace91eadb9
```

And it still happens.
ok.... I managed to repro and should be able to isolate it to a specific config setting....
so... TL;DR: this is `ratelimiting` kicking in, even though it is supposedly disabled for the impacted domains and their subdomains. ECS triggers it because once we get to a certain QPS, chances are we get more cache misses. I originally had suspicions about ratelimiting, but in my experiment disabling it, the issue was still reproducing. Now I have come to realize that `unbound-control reload` was not updating the ratelimiting settings... so while I thought ratelimiting was disabled, it was still active. A sketch of the configuration in question is below. Moving forward, I will:

1) confirm whether or not I can reproduce this without ECS, i.e. is the behaviour different with or without ECS?
2) compile a list of steps to reproduce this easily.
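For concreteness, a minimal sketch of the configuration shape involved (option names are the documented unbound.conf(5) ones; domain.com stands in for the impacted domain):

```
server:
    # global ratelimit in queries per second, per zone; 0 disables it
    ratelimit: 1000
    # per-domain overrides meant to exempt the impacted domain and its
    # subdomains from the global limit (the value that actually means
    # "unlimited" here is what the following comments sort out); note
    # these options were not being re-read on reload/SIGHUP at the time
    ratelimit-for-domain: domain.com 0
    ratelimit-below-domain: domain.com 0
```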
ok.... so apparently the -1 trick never worked, and I just misread the code when I updated the documentation.... I am pretty sure I had tested it, but in hindsight I think I got fooled into thinking that SIGHUP would reload those settings, when ratelimit-{below,for}-domain don't get updated upon SIGHUP :s. The issue is that -1 is the default value that gets assigned to `lim`, and when `lim` == -1, we end up falling back to the global rate limit:

```
int infra_find_ratelimit(struct infra_cache* infra, uint8_t* name,
	size_t namelen)
{
	int labs = dname_count_labels(name);
	struct domain_limit_data* d = (struct domain_limit_data*)
		name_tree_lookup(&infra->domain_limits, name, namelen,
		labs, LDNS_RR_CLASS_IN);
	if(!d) return infra_dp_ratelimit;

	if(d->node.labs == labs && d->lim != -1)
		return d->lim; /* exact match */

	/* find 'below match' */
	if(d->node.labs == labs)
		d = (struct domain_limit_data*)d->node.parent;
	while(d) {
		if(d->below != -1)
			return d->below;
		d = (struct domain_limit_data*)d->node.parent;
	}
	return infra_dp_ratelimit;
}
```

Attached is a patch that should allow setting ratelimit globally and overriding it per ratelimit-{for,below}-domain by setting the value to 0. Doc updated. I will also look into adding counters to the `stats` unbound-control call in a separate diff, if it does not end up being complicated to access the stats structure.
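To make the intended semantics concrete, a minimal sketch of how an enforcement point would consume the 0-means-unlimited convention (illustrative only, not the attached patch; `would_be_ratelimited` and `current_qps` are made-up names):

```c
/* Sketch of the intended semantics: a per-domain value of 0 now means
 * "no ratelimit", matching the meaning of 0 for the global ratelimit
 * option, so enforcement must skip the check when the looked-up limit
 * is 0. */
int would_be_ratelimited(struct infra_cache* infra, uint8_t* name,
	size_t namelen, int current_qps)
{
	int lim = infra_find_ratelimit(infra, name, namelen);
	if(lim == 0)
		return 0; /* ratelimiting disabled for this domain */
	return current_qps > lim;
}
```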
Created attachment 408 [details]
set unlimited ratelimit per domain
(In reply to Manu Bretelle from comment #9)
> Created attachment 408 [details]
> set unlimited ratelimit per domain

Given that `lim` is an unsigned int, it may be better to check that it is >= 0 in `infra_ratelimit_cfg_insert` and/or `infra_ratelimit_inc`.
Hi Manu,

Thank you for the updates. I'm glad you found the cause of this issue. Using the value 0 to disable the domain ratelimit makes sense; the same value is used for the global ratelimit. I just committed that code. I also added a condition to check limit!=0 in infra_ratelimit_exceeded().

As far as I can tell, lim is always a signed int and only used in expressions with other signed ints. Where do you see it as unsigned?

Regards,
-- Ralph
Thanks Ralph!

(In reply to Ralph Dolmans from comment #11)
> As far as I can tell, lim is always a signed int and only used in
> expressions with other signed ints. Where do you see it as unsigned?

Yes, it is always used as a signed int, but it is stored as an unsigned int, and the default is -1, basically meaning `unconfigured`: https://github.com/NLnetLabs/unbound/blob/master/services/cache/infra.c#L185
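For anyone following along, a self-contained illustration of the signed/unsigned wrinkle being discussed (generic C, not unbound code):

```c
#include <stdio.h>

/* Storing -1 in an unsigned int wraps to UINT_MAX, yet a comparison
 * like `lim != -1` still works because the -1 is converted to unsigned
 * before comparing; a `lim >= 0` check, however, is vacuously true once
 * the value is unsigned, so any >= 0 guard has to happen on the signed
 * config value before it is stored. */
int main(void)
{
	unsigned int lim = -1;    /* wraps to UINT_MAX */
	printf("%u\n", lim);      /* 4294967295 with 32-bit int */
	printf("%d\n", lim == -1); /* 1: the -1 sentinel still matches */
	printf("%d\n", lim >= 0); /* 1: always true for unsigned */
	return 0;
}
```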
Created attachment 410 [details]
ratelimiting stats counter

Ok, I managed to get the ratelimiting stats exposed.... but I am not sure including "daemon/worker.h" in "iterator/iterator.h" is super legit. That being said, the build is green: https://travis-ci.org/chantra/unbound/builds/243854962

Patch attached and viewable at https://github.com/chantra/unbound/commit/bfa0035f18efc4989fa01ba67c2870cf2b659f35
Comment on attachment 410 [details]
ratelimiting stats counter

Example output:

```
../unbound-git/unbound-control -c ../ecs_servfail_repro/unbound.conf stats_noreset | grep s_ratelimited
thread0.num.queries_ratelimited=117
thread1.num.queries_ratelimited=78
thread2.num.queries_ratelimited=102
thread3.num.queries_ratelimited=95
thread4.num.queries_ratelimited=119
thread5.num.queries_ratelimited=101
thread6.num.queries_ratelimited=113
thread7.num.queries_ratelimited=123
thread8.num.queries_ratelimited=108
thread9.num.queries_ratelimited=114
thread10.num.queries_ratelimited=103
thread11.num.queries_ratelimited=81
thread12.num.queries_ratelimited=125
thread13.num.queries_ratelimited=122
thread14.num.queries_ratelimited=106
thread15.num.queries_ratelimited=97
thread16.num.queries_ratelimited=114
thread17.num.queries_ratelimited=123
thread18.num.queries_ratelimited=130
thread19.num.queries_ratelimited=106
total.num.queries_ratelimited=2177
```
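The shape of the approach, for anyone who does not want to read the diff (a simplified sketch with made-up names, not the actual patch): each worker thread owns a counter, the iterator bumps it when a query is dropped due to ratelimiting, and the stats call sums the per-thread values into the total line.

```c
struct ratelimit_stats {
	long long num_queries_ratelimited; /* per-thread: no locking needed */
};

/* called from the iterator when the ratelimit check rejects a query */
void stats_note_ratelimited(struct ratelimit_stats* s)
{
	s->num_queries_ratelimited++;
}

/* called when unbound-control stats aggregates the per-thread lines
 * into total.num.queries_ratelimited */
long long stats_total_ratelimited(struct ratelimit_stats* per_thread,
	int nthreads)
{
	long long total = 0;
	int i;
	for(i = 0; i < nthreads; i++)
		total += per_thread[i].num_queries_ratelimited;
	return total;
}
```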
@ralph How do you feel about the counter? I am running it successfully, but I am not super happy with the intermingling of the iterator and daemon code, as I believe there may be cases (direct use of the library?) where this would not work. Having a more versatile way to bump counters would be great :) I will open a separate bug to start the discussion around it.
Hi Ralph, checking back on this. A counter that increments as ratelimiting kicks in would be useful for troubleshooting these types of issues. The proposed patch is not great though; would you have a better way to get this working? Tks
Hi Manu, I have added a query ratelimit counter. See svn r4292. Regards, -- Ralph
Thanks Ralph!