Bug 4141

Summary: More randomness to rrset-roundrobin
Product: unbound Reporter: Damir Bikmukhametov <boco>
Component: serverAssignee: unbound team <unbound-team>
Status: RESOLVED FIXED    
Severity: enhancement CC: boco, cathya, edmonds, wouter
Priority: P5    
Version: unspecified   
Hardware: All   
OS: All   
Attachments: Traffic dump of a buggy router
Proposed patch

Description Damir Bikmukhametov 2018-08-03 10:32:51 CEST
Created attachment 519 [details]
Traffic dump of a buggy router

There are a lot of SOHO routers (f.e. TP-Link TL-WR1043ND) that somehow "generate" request ID instead of making it random. This way multiple sequential requests for the same host name will have the same request IDs and consequently the same rrset order in the reply. This breaks DNS load-balancing, especially when almost all of your customers use similar routers. Example dump attached.

Suggestion is to add some "randomness" to the starting RR selection algo:

===
+++ util/data/msgencode.c       2018-08-02 09:10:08.327422000 +0000
@@ -674,7 +674,7 @@ reply_info_encode(struct query_info* qin
        }
        /* roundrobin offset. using query id for random number.  With ntohs
         * for different roundrobins for sequential id client senders. */
-       rr_offset = RRSET_ROUNDROBIN?ntohs(id):0;
+       rr_offset = RRSET_ROUNDROBIN?(ntohs(id) + (uint16_t)time(NULL)):0;

        /* "prepend" any local alias records in the answer section if this
         * response is supposed to be authoritative.  Currently it should
===

Maybe there is a better way to go. It's just a POC.
Comment 1 Damir Bikmukhametov 2018-08-03 10:34:46 CEST
Created attachment 520 [details]
Proposed patch
Comment 2 Robert Edmonds 2018-08-04 05:50:37 CEST
time() is going to hit the vDSO, or perform a system call on older systems. Why not just use the already existing 'timenow' argument to reply_info_encode()?
Comment 3 Damir Bikmukhametov 2018-08-04 13:51:51 CEST
> time() is going to hit the vDSO, or perform a system call on older systems.

Yes, I know. That's why suggested patch is just a proof of concept (although, working).

> Why not just use the already existing 'timenow' argument to reply_info_encode()?

Since 'timenow' is always 0. Grep the sources, reply_info_answer_encode() is always called with '0' as 'timenow' parameter. And I've failed to find another source of randomness inside reply_info_encode().
Comment 4 Wouter Wijngaards 2018-08-06 11:03:03 CEST
Hi Damir,

The timenow is not zero for answer_from_cache.  And that would the main return for lots of queries happening.  So it would work for cache responses with timenow.  Is that good enough?

Best regards, Wouter
Comment 5 Damir Bikmukhametov 2018-08-06 11:19:52 CEST
Ah, I've overlooked this, sorry.

Yes, for replies from cache 'timenow' is pretty good, but here at Ufanet.RU we are balancing local-zone ('pptp.ufanet.ru') and that IS the problem.

Well, replacing time(NULL) with 'timenow' could work if we will serve our zone from some authoritative loopback-only server and forward requests to it from unbound. Am I right?
Comment 6 Damir Bikmukhametov 2018-08-06 20:28:18 CEST
> replacing time(NULL) with 'timenow' could work if we will serve our zone from
> some authoritative loopback-only server and forward requests to it from unbound.

I'm replying to myself: this works OK except for 'ANY' requests. Anyway, it's better than nothing. :-)
Comment 7 Wouter Wijngaards 2018-10-25 10:26:50 CEST
Hi Damir,

If the timenow is 0, I can use the time call, if it is not 0, I can use the timenow value and provide better roundrobin, this is what that codeline looks like.

        rr_offset = RRSET_ROUNDROBIN?ntohs(id)+(timenow?timenow:time(NULL)):0;

Thanks for the report, the fix is in the code repository.

Best regards, Wouter