Bug 4219 - Secondaries not updated after serial change
Secondaries not updated after serial change
Status: NEW
Product: unbound
Classification: Unclassified
Component: server
1.8.3
x86_64 Linux
: P5 major
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2019-01-15 14:57 CET by w
Modified: 2019-01-22 16:45 CET (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description w 2019-01-15 14:57:07 CET
I have two unbound servers acting as secondaries and updating themselves from a master.

The initial update (when there is not zone file) works (the zone file is created locally) and Unbound correctly responds to requests.

When the serial in the SOA of the master changes (increases), the zone is not reloaded. The refresh entry is set to 60 and the zone is still not updated after an hour or so.


---------------------------------

On the master:

root@srv ~# dig @127.0.0.1 swtk.info soa

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> @127.0.0.1 swtk.info soa
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5177
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;swtk.info.                     IN      SOA

;; ANSWER SECTION:
swtk.info.              600     IN      SOA     . 'dns.swtk.info'. 1547559343 60 180 1209600 600

;; AUTHORITY SECTION:
swtk.info.              600     IN      NS      .
swtk.info.              600     IN      NS      192.168.0.13.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 15 14:44:26 CET 2019
;; MSG SIZE  rcvd: 136

-------------------------------

On the secondary:

root@rpi1 /u/l/e/unbound# dig @127.0.0.1 salt.swtk.info

; <<>> DiG 9.11.4-4-Raspbian <<>> @127.0.0.1 salt.swtk.info
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31291
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;salt.swtk.info.                        IN      A

;; AUTHORITY SECTION:
swtk.info.              600     IN      SOA     . 'dns.swtk.info'. 1547556684 60 180 1209600 600

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 15 14:46:20 CET 2019
;; MSG SIZE  rcvd: 93

-------------------------------

Please note that the SOA of the master (1547559343) > secondary (1547556684). Again, thi si sa case after >> 60 seconds of waiting
Comment 1 Wouter Wijngaards 2019-01-15 15:01:27 CET
Hi,

What is the auth-zone configuration?  Specifically for the master statement in there?

Can you give verbosity 4 logs of the attempts to fetch the new zone?  Perhaps it will say something.

One guess is you have master: "some name" and that name cannot be resolved now that the zone has timed out?  master: IP-address would probably work then.  That is just a guess, it could also be caused by a bug, but the logs would then help find out what is going wrong.

Best regards, Wouter
Comment 2 w 2019-01-15 15:29:52 CET
Sorry, I should have posted that in my initial question - thanks for following up

The configuration:

auth-zone:
        name: "swtk.info"
        master: 10.100.10.254
        allow-notify: 10.100.10.254
        zonefile: swtk.info.db

I increased the verbosity, removed the zone on the secondary and restarted the secondary.
It loaded the zone (fine so far)
I increased the zone on the master and the logs claim that this was detected (the log is enormous and somehow sensitive, so I postthe elements which I believe are relevant - please let me know if this is not enough):

Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: auth zone swtk.info.: soa probe serial is 1547561411
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: close fd 25
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: comm point start listening 24
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: answer cb
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: Incoming reply id = 57e3
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: Incoming reply addr = ip4 37.209.196.2 port 53 (len 16)
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: lookup size is 20 entries

The serial on the master is indeed 1547561411
The log claims that the zone transfer started
But the zone stays the same, with the older serial (the command is ran on the secondary, dig shows the same thing):

root@rpi1 /u/l/e/unbound# grep SOA swtk.info.db
swtk.info.      600     IN      SOA     . 'dns.swtk.info'. 1547559343 60 180 1209600 600

The serial probe and transfer attempt is done every few minutes:

Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:14:04 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:17:30 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:21:03 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:24:47 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer

I would like to highlight that the initial transfer is correct and the initial zone is loaded OK.
Comment 3 w 2019-01-15 15:34:57 CET
One more thing: unbound runs as root and the ownership of the zone file is root as well.
Comment 4 Wouter Wijngaards 2019-01-15 16:19:09 CET
Hi,

After that there should be log message 'xfr stopped' perhaps, or other log messages with the name of the zone in it?  Like 'xfr from <zone>: reason'?

Not a lot of log output for auth zone here, I could increase that to get more (now you are already swimming in log output I understand, but not the ones we need right now).

Best regards, Wouter
Comment 5 Wouter Wijngaards 2019-01-15 16:21:45 CET
Hi,

This is what it prints when the update worked "auth zone %s updated to serial %u".  Is that in the logs, you say the first time it worked?

I get the impression that it makes a channel to the master, perhaps with an IXFR, because the first transfer will have been AXFR (nothing there yet), but then the IXFR is different, server not answereing (is OK for server to answer NOTIMPL or failure, it should fall back, but perhaps something wrong here?) or unbound fails to do IXFR (but tested to work for me)?

Best regards, Wouter
Comment 6 w 2019-01-15 17:24:30 CET
OK, so the initial transfer goes like this (I filtered the interesting information):

Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info. timeout in 0 seconds
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info.: soa probe serial is 1547559343
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: xfr 10.100.10.254: contains SOA serial 1547559343
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: xfr 10.100.10.254: last AXFR packet
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info. updated to serial 1547559343
Jan 15 15:08:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info. timeout in 60 seconds

This looks great: the serial changed (from undefined as the zone does not exist), the transfer when though and there is a timeout set for 60 seconds


Then there is

Jan 15 15:09:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info.: soa probe serial is 1547559343
Jan 15 15:09:38 rpi1 unbound: [14558:0] debug: auth_zone master reports unchanged soa serial
Jan 15 15:09:38 rpi1 unbound: [14558:0] debug: auth_zone unchanged, new lease, wait
Jan 15 15:09:38 rpi1 unbound: [14558:0] debug: auth zone swtk.info. timeout in 60 seconds

which is still OK as the zone did not change.

Then the serial is increased and ...

Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: auth zone swtk.info.: soa probe serial is 1547561411
Jan 15 15:10:39 rpi1 unbound: [14558:0] debug: auth_zone updated, start transfer
Jan 15 15:10:51 rpi1 unbound: [14558:0] debug: xfr stopped, connection lost to 10.100.10.254
Jan 15 15:11:01 rpi1 unbound: [14558:0] debug: tcp took too long, dropped
Jan 15 15:11:01 rpi1 unbound: [14558:0] debug: close fd 9
Jan 15 15:11:01 rpi1 unbound: [14558:0] debug: xfr stopped, connection lost to 10.100.10.254
Jan 15 15:11:01 rpi1 unbound: [14558:0] debug: auth zone swtk.info. timeout in 183 seconds

I had a look between 15:10:39 and 15:10:51 and in the few hundred lines there there does not seem to be anything related.

I captured the traffic between the secondary and the master and there are SOA requests, and then IXFR ones

17:11:17.605846 IP (tos 0x0, ttl 64, id 39188, offset 0, flags [none], proto UDP (17), length 55)
    10.100.10.30.62471 > 10.100.10.254.53: [udp sum ok] 37124 SOA? swtk.info. (27)
17:11:17.609309 IP (tos 0x0, ttl 64, id 10112, offset 0, flags [DF], proto UDP (17), length 153)
    10.100.10.254.53 > 10.100.10.30.62471: [udp sum ok] 37124* q: SOA? swtk.info. 1/2/0 swtk.info. SOA . 'dns.swtk.info'. 1547561419 60 180 1209600 600 ns: swtk.info. NS ., swtk.info. NS 192.168.0.13. (125)
17:11:17.610149 IP (tos 0x0, ttl 64, id 51189, offset 0, flags [DF], proto TCP (6), length 60)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [S], cksum 0x9704 (correct), seq 1970491251, win 29200, options [mss 1460,sackOK,TS val 7833734 ecr 0,nop,wscale 7], length 0
17:11:17.613019 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.100.10.254.53 > 10.100.10.30.50926: Flags [S.], cksum 0x0c4c (correct), seq 624813862, ack 1970491252, win 28960, options [mss 1460,sackOK,TS val 1820070583 ecr 7833734,nop,wscale 7], length 0
17:11:17.613326 IP (tos 0x0, ttl 64, id 51190, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [.], cksum 0xab52 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 7833735 ecr 1820070583], length 0
17:11:17.613705 IP (tos 0x0, ttl 64, id 51191, offset 0, flags [DF], proto TCP (6), length 115)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [P.], cksum 0x0474 (correct), seq 1:64, ack 1, win 229, options [nop,nop,TS val 7833735 ecr 1820070583], length 6356463 [1n] IXFR? swtk.info. ns: swtk.info. SOA . . 1547559343 0 0 0 0 (61)
17:11:17.617758 IP (tos 0x0, ttl 64, id 43183, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.254.53 > 10.100.10.30.50926: Flags [.], cksum 0xab11 (correct), seq 1, ack 64, win 227, options [nop,nop,TS val 1820070587 ecr 7833735], length 0
17:11:17.618647 IP (tos 0x0, ttl 64, id 43184, offset 0, flags [DF], proto TCP (6), length 81)
    10.100.10.254.53 > 10.100.10.30.50926: Flags [P.], cksum 0x9af5 (correct), seq 1:30, ack 64, win 227, options [nop,nop,TS val 1820070587 ecr 7833735], length 2956463 q: IXFR? swtk.info. 0/0/0 (27)
17:11:17.618951 IP (tos 0x0, ttl 64, id 51192, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [.], cksum 0xaaf2 (correct), seq 64, ack 30, win 229, options [nop,nop,TS val 7833735 ecr 1820070587], length 0
17:11:27.620607 IP (tos 0x0, ttl 64, id 51193, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [F.], cksum 0xa708 (correct), seq 64, ack 30, win 229, options [nop,nop,TS val 7834736 ecr 1820070587], length 0
17:11:27.621014 IP (tos 0x0, ttl 64, id 34362, offset 0, flags [DF], proto TCP (6), length 60)
    10.100.10.30.50928 > 10.100.10.254.53: Flags [S], cksum 0xd140 (correct), seq 2087979594, win 29200, options [mss 1460,sackOK,TS val 7834736 ecr 0,nop,wscale 7], length 0
17:11:27.623890 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.100.10.254.53 > 10.100.10.30.50928: Flags [S.], cksum 0x5b35 (correct), seq 3317762778, ack 2087979595, win 28960, options [mss 1460,sackOK,TS val 1820080594 ecr 7834736,nop,wscale 7], length 0
17:11:27.624184 IP (tos 0x0, ttl 64, id 34363, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.30.50928 > 10.100.10.254.53: Flags [.], cksum 0xfa3c (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 7834736 ecr 1820080594], length 0
17:11:27.624558 IP (tos 0x0, ttl 64, id 34364, offset 0, flags [DF], proto TCP (6), length 115)
    10.100.10.30.50928 > 10.100.10.254.53: Flags [P.], cksum 0x3259 (correct), seq 1:64, ack 1, win 229, options [nop,nop,TS val 7834736 ecr 1820080594], length 6364916 [1n] IXFR? swtk.info. ns: swtk.info. SOA . . 1547559343 0 0 0 0 (61)
17:11:27.625061 IP (tos 0x0, ttl 64, id 43185, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.254.53 > 10.100.10.30.50926: Flags [F.], cksum 0x7ff2 (correct), seq 30, ack 65, win 227, options [nop,nop,TS val 1820080594 ecr 7834736], length 0
17:11:27.625370 IP (tos 0x0, ttl 64, id 51194, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.30.50926 > 10.100.10.254.53: Flags [.], cksum 0x7ff0 (correct), seq 65, ack 31, win 229, options [nop,nop,TS val 7834736 ecr 1820080594], length 0
17:11:27.629156 IP (tos 0x0, ttl 64, id 21396, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.10.254.53 > 10.100.10.30.50928: Flags [.], cksum 0xf9fa (correct), seq 1, ack 64, win 227, options [nop,nop,TS val 1820080599 ecr 7834736], length 0
17:11:27.630383 IP (tos 0x0, ttl 64, id 21397, offset 0, flags [DF], proto TCP (6), length 81)
    10.100.10.254.53 > 10.100.10.30.50928: Flags [P.], cksum 0xc8d8 (correct), seq 1:30, ack 64, win 227, options [nop,nop,TS val 1820080600 ecr 7834736], length 2964916 q: IXFR? swtk.info. 0/0/0 (27)


Indeed, maybe the IXFR does not go through for some reason? This is dnsmasq on the other side so I will have a look at that too.
As a workaround - would it be possible for unbound to always make a AWFR request (and not an IXFR one)? (assuming that thi si sthe problem)
Comment 7 w 2019-01-15 17:50:58 CET
I had a look on dnsmasq (the logging is not that verbose) and when the zone is first transferred I have

dnsmasq: query[SOA] swtk.info from 10.100.10.30
dnsmasq: auth swtk.info is <SOA>
dnsmasq: query[AXFR] swtk.info from 10.100.10.30
dnsmasq: auth swtk.info is <AXFR>

then checks of SOA

dnsmasq: query[SOA] swtk.info from 10.100.10.30
dnsmasq: auth swtk.info is <SOA>

(so far so good, they are all handled)

and when I change the serial

dnsmasq: query[SOA] swtk.info from 10.100.10.30
dnsmasq: auth swtk.info is <SOA>
dnsmasq: query[IXFR] swtk.info from 10.100.10.30

So the IXFR request is not handled. I had a look at the code and indeed, it is not.

My only hope in that case would be for unbound to fallback to AXFR, or only do AXFR (I did not find anything in the logs).
Comment 8 James 2019-01-18 05:41:36 CET
The IXFR request is not handled.
https://www.just4dummies.com/how-to-delete-instagram-account
Comment 9 Wouter Wijngaards 2019-01-22 16:36:11 CET
Hi,

Does this patch solve the problem for you?  After 3 attempts with a timeout it'll fallback to AXFR and transfer the zone?

Index: services/authzone.c
===================================================================
--- services/authzone.c	(revision 5048)
+++ services/authzone.c	(working copy)
@@ -88,6 +88,9 @@
 #define AUTH_HTTPS_PORT 443
 /* max depth for nested $INCLUDEs */
 #define MAX_INCLUDE_DEPTH 10
+/** number of timeouts before we fallback from IXFR to AXFR,
+ * because some versions of servers (eg. dnsmasq) drop IXFR packets. */
+#define NUM_TIMEOUTS_FALLBACK_IXFR 3
 
 /** pick up nextprobe task to start waiting to perform transfer actions */
 static void xfr_set_timeout(struct auth_xfer* xfr, struct module_env* env,
@@ -5636,6 +5639,20 @@
 		 * and continue task_transfer*/
 		verbose(VERB_ALGO, "xfr stopped, connection lost to %s",
 			xfr->task_transfer->master->host);
+
+		/* see if IXFR caused the failure, if so, try AXFR */
+		if(xfr->task_transfer->on_ixfr) {
+			xfr->task_transfer->ixfr_possible_timeout_count++;
+			if(xfr->task_transfer->ixfr_possible_timeout_count >=
+				NUM_TIMEOUTS_FALLBACK_IXFR) {
+				verbose(VERB_ALGO, "xfr to %s, fallback "
+					"from IXFR to AXFR (because of timeouts)",
+					xfr->task_transfer->master->host);
+				xfr->task_transfer->ixfr_fail = 1;
+				gonextonfail = 0;
+			}
+		}
+
 	failed:
 		/* delete transferred data from list */
 		auth_chunks_delete(xfr->task_transfer);
@@ -5645,6 +5662,7 @@
 		xfr_transfer_nexttarget_or_end(xfr, env);
 		return 0;
 	}
+	xfr->task_transfer->ixfr_possible_timeout_count = 0;
 
 	/* handle returned packet */
 	/* if it fails, cleanup and end this transfer */
Index: services/authzone.h
===================================================================
--- services/authzone.h	(revision 5048)
+++ services/authzone.h	(working copy)
@@ -378,6 +378,8 @@
 	 * data or add of duplicate data).  Flag is cleared once the retry
 	 * with axfr is done. */
 	int ixfr_fail;
+	/** we saw an ixfr-indicating timeout, count of them */
+	int ixfr_possible_timeout_count;
 	/** we are doing IXFR right now */
 	int on_ixfr;
 	/** did we detect the current AXFR/IXFR serial number yet, 0 not yet,


Best regards, Wouter
Comment 10 Wouter Wijngaards 2019-01-22 16:45:46 CET
Hi,

And additionally this patch (although the previous will probably solve your problem on its own, but it is a bug that needs fixing).  This fix makes it try the same master again when it finds out it needs to use AXFR.

Index: services/authzone.c
===================================================================
--- services/authzone.c	(revision 5052)
+++ services/authzone.c	(working copy)
@@ -5658,7 +5658,8 @@
 		auth_chunks_delete(xfr->task_transfer);
 		comm_point_delete(xfr->task_transfer->cp);
 		xfr->task_transfer->cp = NULL;
-		xfr_transfer_nextmaster(xfr);
+		if(gonextonfail)
+			xfr_transfer_nextmaster(xfr);
 		xfr_transfer_nexttarget_or_end(xfr, env);
 		return 0;
 	}

Or you could use the version of unbound from svn that also contains these fixes.

Best regards, Wouter