Bug 4126 - RTT_band too low on VSAT links with 600+ms latency
RTT_band too low on VSAT links with 600+ms latency
Status: VERIFIED FIXED
Product: unbound
Classification: Unclassified
Component: server
unspecified
Other All
: P5 enhancement
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-07-12 15:08 CEST by Nick Hibma
Modified: 2018-10-29 16:32 CET (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Nick Hibma 2018-07-12 15:08:56 CEST
On VSAT links the RTT typically is 600 - 650 ms. On every flush_infra/restart the first n requests (with n the number of forwarders) will take approximately 2.5s here to complete, while direct requests to 8.8.8.8 take the expected 700ms to complete. In the sequence below you can see that after 400ms a new request is sent out and the previous response ignored.

The VSAT dish is mounted on a river ship so connection loss is fairly frequent and becomes prohibitive because of this issue.

Perhaps the RTT_band and related variables should be keyed off a configuration variable indicating a 'typical RTT'? Or otherwise a late response should be accepted anyway?

Any help here would be appreciated. Currently we do a number of requests when the link comes up to get over this initial delay (which triggers a malfunction in our hotspot server that goes into Reduced Functionality Mode because of lack of DNS). I'd rather fix the actual problem.

# tcpdump -i em0 -n port 53 &

# echo nameserver 127.0.0.1 > /etc/resolv.conf
# ping www.bla.nl
12:33:21.026737 IP 10.252.128.186.52178 > 8.8.8.8.53: 1435+ [1au] A? www.bla.nl. (39)
12:33:21.404353 IP 10.252.128.186.36336 > 8.8.8.8.53: 33720+ [1au] A? www.bla.nl. (39)
12:33:21.676057 IP 8.8.8.8.53 > 10.252.128.186.52178: 1435 4/2/8 CNAME bla.nl., RRSIG, A 87.250.139.11, RRSIG (1053)
12:33:21.781398 IP 10.252.128.186.30735 > 8.8.8.8.53: 32530+ [1au] A? www.bla.nl. (39)
12:33:22.147490 IP 8.8.8.8.53 > 10.252.128.186.36336: 33720 4/2/8 CNAME bla.nl., RRSIG, A 87.250.139.11, RRSIG (1053)
12:33:22.410205 IP 8.8.8.8.53 > 10.252.128.186.30735: 32530 4/2/8 CNAME bla.nl., RRSIG, A 87.250.139.11, RRSIG (1053)
12:33:22.410405 IP 10.252.128.186.31530 > 8.8.4.4.53: 30449+ [1au] A? bla.nl. (35)
12:33:22.789369 IP 10.252.128.186.61150 > 8.8.4.4.53: 22013+ [1au] A? bla.nl. (35)
12:33:23.037678 IP 8.8.4.4.53 > 10.252.128.186.31530: 30449$ 2/0/1 A 87.250.139.11, RRSIG (217)
12:33:23.167351 IP 10.252.128.186.18946 > 8.8.4.4.53: 39171+ [1au] A? bla.nl. (35)
12:33:23.416726 IP 8.8.4.4.53 > 10.252.128.186.61150: 22013$ 2/0/1 A 87.250.139.11, RRSIG (217)
12:33:23.801800 IP 8.8.4.4.53 > 10.252.128.186.18946: 39171$ 2/0/1 A 87.250.139.11, RRSIG (217)
PING bla.nl (87.250.139.11): 56 data bytes
64 bytes from 87.250.139.11: icmp_seq=0 ttl=56 time=606.974 ms
^C
--- bla.nl ping statistics ---
2 packets transmitted, 1 packets received, 50.0% packet loss
round-trip min/avg/max/stddev = 606.974/606.974/606.974/0.000 ms



# echo nameserver 8.8.8.8 > /etc/resolv.conf
# ping www.bla.nl
12:33:34.570092 IP 10.252.128.186.23771 > 8.8.8.8.53: 47260+ A? www.bla.nl. (28)
12:33:35.439236 IP 8.8.8.8.53 > 10.252.128.186.23771: 47260 2/2/4 CNAME bla.nl., A 87.250.139.11 (200)
PING bla.nl (87.250.139.11): 56 data bytes
64 bytes from 87.250.139.11: icmp_seq=0 ttl=56 time=597.169 ms
Comment 1 Wouter Wijngaards 2018-07-12 15:28:47 CEST
Hi Nick,

This why there are forwarders, to put the forwarder on the end of that slow link.  And then send to the forwarder.  That makes multiple requests, and then sends the answer over the slow link.  And use caches.  The setup is have a cache with forwarder where you are now.  Then point the forwarder to another DNS machine, that does full resolution, at the other end of the slow link.  You are now using a public resolver for that, and I can expect that this works as well as people who advocate this solution for slow links say.

Unbound, then, has timeouts.  There is no actual 'expect this rtt' config.  You could edit code
iterator/iterator.h:86
#define UNKNOWN_SERVER_NICENESS 376
and util/rtt.c:64
        rtt->rttvar = 94;

Change the 376 value to 752
Change the 94 value to 188

Then you have an unbound that expected your type of RTT.  The timout is 750 ms, a little above the value you ping.

Unbound caches the timeouts in the infra cache.  Make sure it is large enough, this can be configured in unbound.conf.

I am not sure if that will really help.

To avoid lookup latency, if you had a large group of people, not one ADSL modem, I would enable prefetch: yes and serve-expired: yes

Best regards, Wouter
Comment 2 Nick Hibma 2018-07-12 15:42:48 CEST
Wouter, thanks for the fast and extensive responsive!

I don't quite understand what you are trying to say though. We are using unbound as a forwarder, as it allows us to increase TTL and to cache results. We forward to a public resolver from the ship. We are -not- doing full resolution on the ship of course. That would be silly.

Our config:

server:
        directory: "/var/unbound"
        pidfile: "/var/run/local_unbound.pid"
        verbosity: 0
        interface: 0.0.0.0
        access-control: 127.0.0.0/8 allow
        access-control: 192.168.0.0/16 allow
        access-control: 10.0.0.0/8 allow
        access-control: 172.16.0.0/12 allow
        infra-host-ttl: 1800
        cache-min-ttl: 300
        target-fetch-policy: "-1 -1 -1 -1 -1"
        prefetch: yes
        serve-expired: yes

forward-zone:
        name: "."
        forward-addr: 8.8.8.8
        forward-addr: 8.8.4.4
        forward-addr: 192.168.99.15@5353

remote-control:
        control-enable: yes
        control-interface: 127.0.0.1


Thanks for the pointers to prefetch and serve-expired. That first one I knew about, the second one I didn't.

Currently I do a few pings to prime the infra cache, but I guess I should try the increasing of the values you suggest. RTT_band does not need to be adapted?

Thanks.
Comment 3 Wouter Wijngaards 2018-07-12 15:51:38 CEST
Hi Nick,

Yes I think it is fine.  RTT_Band would only need to be increased if ping get really a lot higher (10000 msec or so), because variability then also gets higher.  But you cite 600-650 and the variability is not really higher, it is just slower.  So the same RTT band is going to be fine.

The edit should make unbound get up from start faster, I hope.  As an option I think it is difficult to comprehend for users; although useful for the first N requests...

aggressive-nsec and harden-below-nxdomain can speed up nxdomain responses.
prefetch-key can speed up DNSSEC validation.  These options are a bit similar to those two others I gave.

Also increasing the cache size, rrset-cache-size and msg-cache-size, gives more cache hits.  Set rrset to about 2x the msg amount.  Some people use Gbs for it, but then cache hits can get really good (eg. 90% and more), but that may depend on the population size and kind.

Best regards, Wouter
Comment 4 Nick Hibma 2018-07-12 15:59:30 CEST
Changing the values does not improve the situation. unbound is still ignoring responses that arrive after 700ms:

# ls -l /usr/sbin/unbound
-rwxr-xr-x  1 root  wheel  114372 Jul 12 13:57 /usr/sbin/unbound
# service local_unbound restart
Stopping local_unbound.
Waiting for PIDS: 67178.
Starting local_unbound.
Waiting for nameserver to start... good
# ping asd-123
13:57:36.563140 IP 10.252.128.186.50972 > 8.8.8.8.53: 31417+ [1au] A? asd-123. (36)
13:57:36.957801 IP 10.252.128.186.58388 > 8.8.8.8.53: 13697+ [1au] A? asd-123. (36)
13:57:37.309206 IP 8.8.8.8.53 > 10.252.128.186.50972: 31417 NXDomain 0/6/1 (1023)
13:57:37.335817 IP 10.252.128.186.5970 > 8.8.4.4.53: 26977+ [1au] A? asd-123. (36)
13:57:37.660921 IP 8.8.8.8.53 > 10.252.128.186.58388: 13697 NXDomain 0/6/1 (1023)
13:57:37.727777 IP 10.252.128.186.50032 > 8.8.4.4.53: 10167+ [1au] A? asd-123. (36)
13:57:38.104771 IP 10.252.128.186.14664 > 8.8.8.8.53: 52743+ [1au] A? asd-123. (36)
13:57:38.400195 IP 8.8.4.4.53 > 10.252.128.186.50032: 10167 NXDomain$ 0/6/1 (1023)
13:57:38.532558 IP 8.8.4.4.53 > 10.252.128.186.5970: 26977 NXDomain$ 0/6/1 (1023)
13:57:38.682959 IP 8.8.8.8.53 > 10.252.128.186.14664: 52743 NXDomain 0/6/1 (1023)
ping: cannot resolve asd-123: Unknown host
Comment 5 Wouter Wijngaards 2018-07-12 16:03:50 CEST
Hi Nick,

Try even bigger values, like
1128 for the 376
282 for the 94

That increases the timeouts further.

Best regards, Wouter
Comment 6 Nick Hibma 2018-07-12 16:16:05 CEST
It still takes 2+s to get an answer back, but the pattern of messages looks slightly different in some cases.

I think the problem is more related to the fact that answers are ignored at first, because they are too late. I don't mind the sending out of additional requests.
Comment 7 Wouter Wijngaards 2018-07-12 16:17:16 CEST
Hi,

Made a fix in the code repository.  You only have to edit the UNKNOWN_SERVER_NICENESS value, in iterator/iterator.h:86 to the desired new timeout value (in msec).  Instead of having to edit in multiple places.

Best regards, Wouter
Comment 8 Wouter Wijngaards 2018-07-12 16:18:03 CEST
Hi Nick,

If you set unbound with verbose logging, like 4, does it say the timeout that it picks for the new connection?  It should pick the value that you edited?

Best regards, Wouter
Comment 9 Wouter Wijngaards 2018-07-12 16:19:26 CEST
Hi,

For me that works, and it says:
serviced query UDP timeout=752 msec
when I enter the 752 value in the UNKNOWN_SERVER_NICENESS.

Best regards, Wouter
Comment 10 Wouter Wijngaards 2018-07-12 16:22:08 CEST
Hi,

And for returning packets it prints this line with the RTT for them:
measured roundtrip at 2 msec

(2 msec for me here at nlnetlabs.nl for a query to nlnetlabs.nl).

Those lines from the verbose logs could be interesting, eg. to see if it is working, and what RTTs you need to set.  752 or 1128 or so.

Best regards, Wouter
Comment 11 Nick Hibma 2018-07-12 23:22:16 CEST
Please note: This is the stock FreeBSD version. The files you are referring to do not get included in that version. The code depends on the .h files, but the md5 of the binary unbound between builds does not change.
Comment 12 Nick Hibma 2018-07-13 12:09:57 CEST
Ok, finally figured out how to rebuild things so that it worked (libunbound).

The suggested change works as advertised. The first lookup now succeeds and is used. No repeated requests.

# sudo service local_unbound restart
Stopping local_unbound.
Waiting for PIDS: 40661.
Starting local_unbound.
Waiting for nameserver to start... good
# unbound-control dump_infra
# ping -c 1 nu.nl
10:07:42.185776 IP 10.252.128.190.55627 > 8.8.4.4.53: 28946+ [1au] A? nu.nl. (34)
10:07:42.808511 IP 8.8.4.4.53 > 10.252.128.190.55627: 28946 4/0/1 A 52.85.201.184, A 52.85.201.51, A 52.85.201.140, A 52.85.201.98 (98)
PING nu.nl (52.85.201.184): 56 data bytes
64 bytes from 52.85.201.184: icmp_seq=0 ttl=242 time=649.057 ms
...
# unbound-control dump_infra
8.8.4.4 . ttl 1740 ping 77 var 367 rtt 1545 rto 1545 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
Comment 13 Wouter Wijngaards 2018-10-25 11:24:19 CEST
Hi Nick,

Thanks for the test!  I added the option unknown-server-time-limit: 376 to unbound.conf.  It sets the UNKNOWN_SERVER_NICENESS value in the code but using unbound.conf.  Set that option to 752 or 1128 and that should be the same as the solution you already tested but without having to patch the code.

Best regards, Wouter
Comment 14 Nick Hibma 2018-10-29 16:32:25 CET
That resolves my bug report. Thanks for implementing!

Nick Hibma
-- Open Source: We stand on the shoulders of giants.