View Issue Details

ID: 0000962
Project: SymmetricDS
Category: Bug
View Status: public
Last Update: 2019-04-26 17:43
Reporter: rotten
Assigned To:
Priority: high
Status: new
Resolution: open
Product Version: 3.2.0
Summary: 0000962: FK constraint error when syncing configuration change
Description: Because of the new foreign key constraints, changes to tables in the config channel need to be inserted in the remote nodes in the proper order.

They do not appear to be. Simple updates on the registration server cause a backlog of config update failures due to FK constraint issues on the remote node.
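To illustrate what "proper order" means here, the following is a minimal sketch that sorts a set of changed config tables so that referenced (parent) tables come before the tables that reference them. The foreign key map is an assumption made up for illustration and is not copied from the 3.2.0 schema; `FK_PARENTS` and `fk_safe_order` are hypothetical names, not part of SymmetricDS.

```python
from graphlib import TopologicalSorter

# Hypothetical FK map: each table lists the tables it references.
# These dependencies are assumed for illustration only; check the
# actual 3.2.0 schema before relying on them.
FK_PARENTS = {
    "node_group": set(),
    "node_group_link": {"node_group"},
    "router": {"node_group_link"},
    "trigger": set(),
    "trigger_router": {"trigger", "router"},
}


def fk_safe_order(changed_tables, fk_parents=FK_PARENTS):
    """Return changed_tables sorted so that parent tables come first."""
    # static_order() yields every table after all of the tables it references.
    full_order = list(TopologicalSorter(fk_parents).static_order())
    rank = {table: position for position, table in enumerate(full_order)}
    return sorted(changed_tables, key=rank.__getitem__)
```

With this map, `fk_safe_order(["trigger_router", "router", "node_group_link"])` places `node_group_link` before `router` and `router` before `trigger_router`. Deletes would need the reverse order.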

Further complicating this, when it happens the outgoing config batches get stuck in 'LD' status and never raise an error flag.


Steps To Reproduce: Set up a simple replication pair, and try toggling the node_group_link around.

Bring up a third node, and try adding triggers and trigger_routers and routes for it while replication is still running. Then try updating node_group_link.

They all get jammed up.
Additional Information: I'm testing this with an Oracle 10g registration server, a PostgreSQL 9.1 node, and a MySQL 5.5 node using the production 3.2.0 release.
Tags: configuration, data sync

Activities

rotten

2012-12-18 19:30

reporter   ~0000168

Oh, and all of the foreign keys are present on the registration server, so the config updates are being installed in the correct order there.

chenson

2012-12-18 20:48

administrator   ~0000169

Were you updating configuration on multiple nodes? Configuration is meant to be managed from a central node.

rotten

2012-12-19 13:49

reporter   ~0000171

I am updating the configuration from the registration server (central node). I've pushed changes directly to the other nodes only in experiments to try to clear the FK constraint complaints.

When I originally posted this I thought I fully understood what was happening (simply a change ordering problem). Now I am less sure that is exactly the root cause.

What I am observing is that some config channel changes seem to get stuck in 'LD' status in sym_outgoing_batch. Sent Count for those changes keeps climbing, seemingly indefinitely.
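A rough sketch of how the stuck batches described above could be listed follows. It runs against an in-memory SQLite stand-in with made-up rows, not a real SymmetricDS database. The column names (`batch_id`, `node_id`, `channel_id`, `status`, `error_flag`, `sent_count`), the channel id `'config'`, and the resend threshold are assumptions inferred from the wording of this report; the real `sym_outgoing_batch` table has more columns and may differ.

```python
import sqlite3

# Assumed column names and channel id, inferred from the report's wording.
STUCK_QUERY = """
    SELECT batch_id, node_id, sent_count
    FROM sym_outgoing_batch
    WHERE channel_id = 'config'
      AND status = 'LD'
      AND error_flag = 0
      AND sent_count >= ?
    ORDER BY batch_id
"""


def find_stuck_batches(conn, min_sent_count=10):
    """Return (batch_id, node_id, sent_count) rows that look stuck in LD."""
    return conn.execute(STUCK_QUERY, (min_sent_count,)).fetchall()


def demo_connection():
    """Build an in-memory stand-in table with made-up rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sym_outgoing_batch ("
        "batch_id INTEGER, node_id TEXT, channel_id TEXT, "
        "status TEXT, error_flag INTEGER, sent_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sym_outgoing_batch VALUES (?, ?, ?, ?, ?, ?)",
        [
            (101, "pg-node", "config", "LD", 0, 250),     # many resends, no error flag
            (102, "pg-node", "config", "OK", 0, 1),       # delivered normally
            (103, "mysql-node", "config", "LD", 0, 2),    # too few resends to judge
            (104, "mysql-node", "default", "LD", 0, 90),  # not on the config channel
        ],
    )
    return conn
```

Calling `find_stuck_batches(demo_connection())` returns only the row for batch 101 in this made-up data. The threshold of 10 resends is arbitrary and would need tuning against the push/pull interval in use.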

[ Note: all of the JVMs are on the same server, communicating with plain HTTP. I don't believe network issues would cause the repeated send retries. ]

At the same time, other config channel changes (including heartbeats) are showing up and passing through without any issues, though not all of them; sometimes they get stuck too. So the later config changes, which can only succeed if the earlier ones were applied rather than stuck in LD, sometimes throw the FK exceptions.

Restarting the nodes and the registration server does not seem to help drain the stuck config channel outgoing batches.

This morning I have this suspicion, which I'll test in a few minutes:

1) I'm currently doing push/pull once per second.
2) The "synchronizing triggers" activity, which happens after some config changes, takes 30 seconds to a minute to complete.
3) When a config change comes through that triggers a synchronizing triggers action on the remote node, the subsequent config changes get stuck.
4) Later, after the synchronizing triggers action is done on the remote node, new config channel changes usually go through ok, but the ones that got stuck, stay stuck.

I'll back off to 1 minute push/pull times, to give the synchronizing triggers action time to complete, and then try a bunch of config changes and see if any still get stuck.
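The arithmetic behind this suspicion can be sketched as follows: count how many push/pull cycles start while a "synchronizing triggers" run is still in progress. This is only a back-of-the-envelope model using the durations quoted above; `sends_during_trigger_sync` is a hypothetical helper, and nothing here reflects how SymmetricDS actually schedules its jobs.

```python
def sends_during_trigger_sync(sync_seconds, push_pull_interval_seconds):
    """Count push/pull cycles that start while trigger sync is still running."""
    if push_pull_interval_seconds <= 0:
        raise ValueError("interval must be positive")
    # Whole intervals that fit inside the trigger sync window.
    return sync_seconds // push_pull_interval_seconds
```

Under this model, a 30 to 60 second trigger sync overlaps 30 to 60 cycles at a 1 second interval, but at most one cycle at a 60 second interval.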

rotten

2012-12-19 20:11

reporter   ~0000172

Slowing down the push/pull from 1 second to 60 seconds seems to have helped prevent config threads from getting stuck in LD status.

Next I'll try setting the push/pull thread counts to '1' to see if single threaded works more reliably than the default of '10' (at a 1 second refresh rate).

rotten

2012-12-20 14:28

reporter   ~0000176

Single vs multiple threaded did not seem to have the same impact as slowing the push/pull interval from 1 second to 60 seconds. I think the problem is related to the repeated 'synchronizing triggers' activities, but I'm still not certain.

One thing that confused and really slowed down my tests is this SymmetricDS caveat I'm adding to my operational notes:

"""
Changing node_group_link while the cluster is running is dangerous. Almost always it will break replication. Once configured, do NOT change node_group_link settings unless you absolutely have to, and when you do, manually change it on all nodes while replication is down. Remove it from sym_data before starting the cluster back up.
"""

Issue History

Date Modified Username Field Change
2012-12-18 19:29 rotten New Issue
2012-12-18 19:30 rotten Note Added: 0000168
2012-12-18 20:48 chenson Note Added: 0000169
2012-12-19 13:49 rotten Note Added: 0000171
2012-12-19 20:11 rotten Note Added: 0000172
2012-12-20 14:28 rotten Note Added: 0000176
2014-02-13 15:15 elong Priority normal => high
2019-04-22 14:42 elong Tag Attached: configuration
2019-04-22 14:42 elong Tag Attached: data sync
2019-04-26 17:43 elong Summary config replication trips over itself => FK constraint error when syncing configuration change