Bugzilla – Bug 4141
More randomness to rrset-roundrobin
Last modified: 2018-10-25 10:26:50 CEST
Created attachment 519 [details]
Traffic dump of a buggy router
There are a lot of SOHO routers (f.e. TP-Link TL-WR1043ND) that somehow "generate" request ID instead of making it random. This way multiple sequential requests for the same host name will have the same request IDs and consequently the same rrset order in the reply. This breaks DNS load-balancing, especially when almost all of your customers use similar routers. Example dump attached.
Suggestion is to add some "randomness" to the starting RR selection algo:
+++ util/data/msgencode.c 2018-08-02 09:10:08.327422000 +0000
@@ -674,7 +674,7 @@ reply_info_encode(struct query_info* qin
/* roundrobin offset. using query id for random number. With ntohs
* for different roundrobins for sequential id client senders. */
- rr_offset = RRSET_ROUNDROBIN?ntohs(id):0;
+ rr_offset = RRSET_ROUNDROBIN?(ntohs(id) + (uint16_t)time(NULL)):0;
/* "prepend" any local alias records in the answer section if this
* response is supposed to be authoritative. Currently it should
Maybe there is a better way to go. It's just a POC.
Created attachment 520 [details]
time() is going to hit the vDSO, or perform a system call on older systems. Why not just use the already existing 'timenow' argument to reply_info_encode()?
> time() is going to hit the vDSO, or perform a system call on older systems.
Yes, I know. That's why suggested patch is just a proof of concept (although, working).
> Why not just use the already existing 'timenow' argument to reply_info_encode()?
Since 'timenow' is always 0. Grep the sources, reply_info_answer_encode() is always called with '0' as 'timenow' parameter. And I've failed to find another source of randomness inside reply_info_encode().
The timenow is not zero for answer_from_cache. And that would the main return for lots of queries happening. So it would work for cache responses with timenow. Is that good enough?
Best regards, Wouter
Ah, I've overlooked this, sorry.
Yes, for replies from cache 'timenow' is pretty good, but here at Ufanet.RU we are balancing local-zone ('pptp.ufanet.ru') and that IS the problem.
Well, replacing time(NULL) with 'timenow' could work if we will serve our zone from some authoritative loopback-only server and forward requests to it from unbound. Am I right?
> replacing time(NULL) with 'timenow' could work if we will serve our zone from
> some authoritative loopback-only server and forward requests to it from unbound.
I'm replying to myself: this works OK except for 'ANY' requests. Anyway, it's better than nothing. :-)
If the timenow is 0, I can use the time call, if it is not 0, I can use the timenow value and provide better roundrobin, this is what that codeline looks like.
rr_offset = RRSET_ROUNDROBIN?ntohs(id)+(timenow?timenow:time(NULL)):0;
Thanks for the report, the fix is in the code repository.
Best regards, Wouter