Bug 1277 - [ecs] SERVFAIL under load
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.6.2
Hardware: Other All
Importance: P5 enhancement
Assigned To: unbound team
Reported: 2017-06-09 21:52 CEST by Manu Bretelle
Modified: 2017-08-04 20:08 CEST

Attachments
set unlimited ratelimit per domain (2.00 KB, patch)
2017-06-16 04:04 CEST, Manu Bretelle
Details | Diff
ratelimiting stats counter (4.97 KB, patch)
2017-06-17 00:55 CEST, Manu Bretelle
Details | Diff

Description Manu Bretelle 2017-06-09 21:52:59 CEST
Version used: unbound 21 April 2017 (based on doc/changelog)

So I don't have good repro steps yet, and I have been tinkering with settings to narrow down the issue empirically.

Under load (a few tens of thousands of QPS, though this may depend on hardware and settings such as the number of threads), I started observing SERVFAIL being returned sporadically.

I could narrow it down to the ECS module, and it seems to trigger when getting a lot of replies with scopes...

I am guessing this is bubbling up from the cache lookup/update path.

I will try to gather some relevant logs, but debug verbosity won't cut it here, as the whole process comes to a stall. I managed to get some logs with verbosity 2, but I don't see anything relevant in there except something along the lines of:

```
[1497034846] unbound[645:7] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
[1497034846] unbound[645:7] info: validator operate: query a.domain.com. A IN
[1497034846] unbound[645:7] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1497034846] unbound[645:7] info: subnet operate: query a.domain.com. A IN
[1497034846] unbound[645:7] debug: cache memory msg=331792 rrset=521739 infra=46251708 val=81030 subnet=299383
[1497034846] unbound[645:7] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
[1497034846] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
[1497034846] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1497034846] unbound[645:7] info: resolving a.domain.com. AAAA IN
[1497034846] unbound[645:7] debug: return error response SERVFAIL
```

I will also try to do more troubleshooting, but suggestions/pointers to dig further are welcome here.
Comment 1 Manu Bretelle 2017-06-10 05:14:07 CEST
I suppose this may be a more accurate paste:

```
[1497034844] unbound[645:7] info: sending query: a.domain.com. A IN
[1497034844] unbound[645:7] debug: sending to target: <.> 192.58.128.30#53
[1497034844] unbound[645:7] debug: cache memory msg=265073 rrset=271756 infra=46110226 val=66344 subnet=272608
[1497034844] unbound[645:7] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
[1497034844] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
[1497034844] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1497034844] unbound[645:7] info: resolving a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: return error response SERVFAIL
[1497034844] unbound[645:7] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
[1497034844] unbound[645:7] info: validator operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1497034844] unbound[645:7] info: subnet operate: query a.domain.com. AAAA IN
[1497034844] unbound[645:7] debug: cache memory msg=265073 rrset=271756 infra=46110226 val=66344 subnet=272608
```

which seems to indicate that this SERVFAIL happens in the iterator.
Comment 2 Ralph Dolmans 2017-06-14 15:10:52 CEST
Hi Manu,

I am looking into this but I don't know yet what is causing this issue.

Do I understand correctly that you get a proper answer from the authoritative nameserver for the query resulting in a SERVFAIL answer? Does it only occur after receiving answers that include ECS records?

Any chance you can increase the logging verbosity?

Thanks,
-- Ralph
Comment 3 Manu Bretelle 2017-06-15 02:44:51 CEST
Thanks Ralph,

I am trying to get a repro going on my side, and am also hacking on dnsperf to be able to read EDNS info from a file (it currently supports a -E option, but that is static) so as to force executing that code path. My attempts at reproducing the issue so far were not fruitful, but the setup was slightly different; for instance, the servers I am sending ECS to were configured in a stub-zone rather than reached via the traditional recursive path.

In the end, it may not be ECS related per se; ECS just causes more recursion to happen, and this is where it may break. The hosts don't show any obvious resource contention, and the auth servers I am sending ECS to are within the same rack, so latency is minimal.


> Do I understand correctly that you get a proper answer from the authoritative nameserver for the query resulting in a SERVFAIL answer? Does is only occur after receiving answers that include ECS records?

This part I have not figured out yet; it has only reproduced under load, and I have not gotten a full trace. I will try to get one and wrap my head around it to find this out.


> Any chance you can increase the logging verbosity?

When doing so, the system comes to a stall. I ran perf top and did not see anything bad (or at least not obviously bad).

Remember when I talked to you about having a kind of key/value store for stats... That would make it easy to instrument the code and spot where the SERVFAIL is being triggered.
Comment 4 Manu Bretelle 2017-06-15 06:02:54 CEST
So, it seems that in some cases the query makes it out and back, while in others the SERVFAIL is returned before any query goes out; the time between the two events is very short...


Timing between client query, recursion and reply:

2017-06-14 17:49:10.913262  client -> unbound Standard query 0x78d8 AAAA
2017-06-14 17:49:10.913332  unbound -> auth Standard query 0xdd5e AAAA + ECS/source netmask/scope 0
2017-06-14 17:49:10.918947  auth -> unbound Standard query response 0xdd5e AAAA + ECS/source netmask/scope 0 (no ECS for this domain)
2017-06-14 17:49:10.919007  unbound -> client Standard query response 0x78d8 Server failure AAAA


The same behaviour can potentially occur when there are ECS responses with a scope != 0.


Also, in some cases the SERVFAIL is returned directly, without even sending a query out:

Timing with direct servfail reply:
2017-06-14 17:49:10.825039 client -> unbound Standard query 0x3e8d A
2017-06-14 17:49:10.825117 unbound -> client Standard query response 0x3e8d Server failure A
Comment 5 Manu Bretelle 2017-06-15 20:45:44 CEST
Also, FYI, I tested against master:
- Fix query for refetch_glue of stub leaking to internet.
git-svn-id: http://unbound.nlnetlabs.nl/svn/trunk@4227 be551aaa-1e26-0410-a405-d3ace91eadb9

And it still happens.
Comment 6 Manu Bretelle 2017-06-15 20:54:28 CEST
Ok... I managed to repro, and should be able to isolate it to a specific config setting...
Comment 7 Manu Bretelle 2017-06-15 23:18:55 CEST
So... TL;DR: this is `ratelimiting` kicking in, even though it was supposedly disabled for the impacted domains and their subdomains.

ECS triggers it because, once we get to a certain QPS, chances are we get more cache misses.

I originally had suspicions about ratelimiting, but in my experiment disabling it, the issue was still reproducing. I have now come to realize that `unbound-control reload` was not updating the ratelimiting settings... so basically, while I thought ratelimiting was disabled, it was still active.


Moving forward, I will:
1) confirm whether or not I can reproduce this without ECS, i.e. is the behaviour different with or without ECS?
2) compile a list of steps to reproduce this easily.
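
For reference, these are the settings in play, as a minimal unbound.conf sketch (option names are from unbound.conf(5); the domain and values are illustrative only). Note that, as found above, the per-domain options were not picked up on SIGHUP/reload at the time, so a full restart is needed to be sure a change applies:

```
server:
    # global limit on queries sent to any one zone's nameservers (qps);
    # 0 disables ratelimiting globally
    ratelimit: 1000
    # per-domain overrides; after the fix from this bug, a value of 0
    # here means "unlimited" for that domain (and its subdomains,
    # for the below-domain variant)
    ratelimit-for-domain: example.com 0
    ratelimit-below-domain: example.com 0
```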
Comment 8 Manu Bretelle 2017-06-16 04:03:49 CEST
Ok... so apparently the -1 trick never worked, and I just misread the code when I updated the documentation. I am pretty sure I had tested it, but in hindsight I think I was fooled into believing that SIGHUP would reload those settings, when in fact ratelimit-{below,for}-domain do not get updated upon SIGHUP :s.

The issue is that -1 is the default value assigned to `lim`, and when `lim` == -1 we end up falling back to the global rate limit.



```
int infra_find_ratelimit(struct infra_cache* infra, uint8_t* name,
    size_t namelen)
{
    int labs = dname_count_labels(name);
    struct domain_limit_data* d = (struct domain_limit_data*)
        name_tree_lookup(&infra->domain_limits, name, namelen, labs,
        LDNS_RR_CLASS_IN);
    if(!d) return infra_dp_ratelimit;

    if(d->node.labs == labs && d->lim != -1)
        return d->lim; /* exact match */

    /* find 'below match' */
    if(d->node.labs == labs)
        d = (struct domain_limit_data*)d->node.parent;
    while(d) {
        if(d->below != -1)
            return d->below;
        d = (struct domain_limit_data*)d->node.parent;
    }
    return infra_dp_ratelimit;
}
```

Attached is a patch that should allow setting the ratelimit globally and overriding it per ratelimit-{for,below}-domain by setting the per-domain value to 0.
Doc updated.

I will also look into adding counters to the `stats` unbound-control call in a separate diff if it does not end up being complicated to access the stat structure.
Comment 9 Manu Bretelle 2017-06-16 04:04:38 CEST
Created attachment 408 [details]
set unlimited ratelimit per domain
Comment 10 Manu Bretelle 2017-06-16 04:09:54 CEST
(In reply to Manu Bretelle from comment #9)
> Created attachment 408 [details]
> set unlimited ratelimit per domain

Given that `lim` is an unsigned int, it may be better to check that it is >= 0 in `infra_ratelimit_cfg_insert` and/or `infra_ratelimit_inc`
Comment 11 Ralph Dolmans 2017-06-16 15:26:45 CEST
Hi Manu,

Thank you for the updates. I'm glad you found the cause of this issue.

Using the value 0 to disable the domain ratelimit makes sense; the same value is used for the global ratelimit. I just committed that code.

I also added a condition to check if limit!=0 in infra_ratelimit_exceeded().

As far as I can tell, lim is always a signed int and only used in expressions with other signed ints. Where do you see it as unsigned?

Regards,
-- Ralph
Comment 12 Manu Bretelle 2017-06-16 19:09:10 CEST
Thanks Ralph!

(In reply to Ralph Dolmans from comment #11)
> 
> As far as I can tell, lim is always a signed int and only used in
> expressions with other signed ints. Where do you see it as unsigned?
> 


Yes, it is always used as a signed int, but it is stored as an unsigned int, and the default is -1, basically meaning `unconfigured`:

https://github.com/NLnetLabs/unbound/blob/master/services/cache/infra.c#L185
Comment 13 Manu Bretelle 2017-06-17 00:55:31 CEST
Created attachment 410 [details]
ratelimiting stats counter

Ok, I managed to get ratelimiting stats to be exposed.... but I am not sure including "daemon/worker.h" in "iterator/iterator.h" is super legit.

That being said, the build is green: https://travis-ci.org/chantra/unbound/builds/243854962

Patch attached and viewable at https://github.com/chantra/unbound/commit/bfa0035f18efc4989fa01ba67c2870cf2b659f35
Comment 14 Manu Bretelle 2017-06-17 00:56:45 CEST
Comment on attachment 410 [details]
ratelimiting stats counter

example output:
```
../unbound-git/unbound-control -c ../ecs_servfail_repro/unbound.conf stats_noreset  | grep s_ratelimited
thread0.num.queries_ratelimited=117
thread1.num.queries_ratelimited=78
thread2.num.queries_ratelimited=102
thread3.num.queries_ratelimited=95
thread4.num.queries_ratelimited=119
thread5.num.queries_ratelimited=101
thread6.num.queries_ratelimited=113
thread7.num.queries_ratelimited=123
thread8.num.queries_ratelimited=108
thread9.num.queries_ratelimited=114
thread10.num.queries_ratelimited=103
thread11.num.queries_ratelimited=81
thread12.num.queries_ratelimited=125
thread13.num.queries_ratelimited=122
thread14.num.queries_ratelimited=106
thread15.num.queries_ratelimited=97
thread16.num.queries_ratelimited=114
thread17.num.queries_ratelimited=123
thread18.num.queries_ratelimited=130
thread19.num.queries_ratelimited=106
total.num.queries_ratelimited=2177
```
Comment 15 Manu Bretelle 2017-06-21 18:43:26 CEST
@ralph

How do you feel about the counter? I am running it successfully but I am not super happy with the intermingling of the iterator and daemon code as I believe there may be cases (direct use of the library?) where this would not work.

Having a more versatile way to bump counters would be great :) I will open a separate bug to start the discussion around it.
Comment 16 Manu Bretelle 2017-08-02 17:18:45 CEST
Hi Ralph,

Checking back on this. A counter that increments as ratelimiting kicks in would be useful to troubleshoot these types of issues. The proposed patch is not great though; would you have a better way to get this working?

Tks
Comment 17 Ralph Dolmans 2017-08-04 14:12:17 CEST
Hi Manu,

I have added a query ratelimit counter. See svn r4292.

Regards,
-- Ralph
Comment 18 Manu Bretelle 2017-08-04 20:08:00 CEST
Thanks Ralph!