View Issue Details

IDProjectCategoryView StatusLast Update
0002747SymmetricDSBugpublic2017-02-01 01:06
Reporteriberras Assigned Tochenson  
Prioritynormal 
Status closedResolutionunable to reproduce 
Product Version3.8.2 
Summary0002747: ConcurrentManager not working correctly
DescriptionSince 3.8.1 or 3.8.2 im not sure I keep getting this error on the server side:

2016-08-23 23:21:52,521 WARN [serverdevelopment] [ConcurrentConnectionManager] [qtp855942730-16] Node '002' Channel 'default' requested a /sync/serverdevelopment/pull connection, but was rejected because it already has one

and on the client side I get:

INFO RouterService - Routed 1 data events in 4 ms
INFO PullService - server:000:000 at http://server.development.com:31415/sync/serverdevelopment was busy
INFO DefaultOfflineClientListener - Node '000' was too busy to accept the connection
INFO PullService - There was a failure while pulling data from server:000:000 on channel thread default. 0 rows and 0 batches were processed

And it gets stuck forever. This happens when I try to make an initial load. In normal pull/push this doesnt happen.
Steps To ReproduceIt doesn't happen all the time. I restart all symmetric deleting all the tables, and sometimes it happens, and sometimes it doesn't.
TagsNo tags attached.

Activities

admin

2016-09-30 14:28

administrator   ~0000870

Re-opening. Was accidentally closed when performing the release.

Zulla

2016-12-20 09:40

reporter   ~0000931

Is there a working workaround?

chenson

2016-12-20 13:30

administrator   ~0000932

Is this reproducible with later versions of 3.8?

iberras

2016-12-20 16:57

reporter   ~0000933

The only work around that i found was to delete all symmetric tables and start again. This is still happening in 3.8. There was another issue http://www.symmetricds.org/issues/view.php?id=2869 that is connected to this. It is very hard to replicate. The timing has to be just right. But when this happens, there is no turning back and everything gets stuck, and the only thing that solves this is to delete all the tables. Obviously this is not very convenient. Ask anything you need and i will send the log files and any other information.

chenson

2016-12-20 17:52

administrator   ~0000934

log files and jvm thread dumps when the system gets in this state would be helpful.

iberras

2016-12-20 18:20

reporter   ~0000935

Last edited: 2017-01-25 13:14

View 3 revisions

OK, tried to replicate the error with 3.8.11. Steps to reproduce what i did. Start an initial load that creates all the schema in the client. Stop the process in the middle. Restart the client and start over. I have to try several times, i dont know exactly at what moment this has to be done for the error to occurs, if when the tables are being created or in the final part when all the data is arriving. In all the test i did, i could not find a pattern. In the 5th try i started to see in the log the error. I think something changed in the las version because now i restarted the client and the initial load could go through after the error. But this doesnt happen all the time. Nothing is constant. Here is the log file. Symmetric didnt create the dump o the thread this time(it always does when this error occurs). If we need more evidence i will try to send the stack trace.

In the log file you will se a part that says "ESTOY AGREGANDO EL NODO a la WHITELIST". This is a listener i added to add the node to the whitelist so it doesnt get the busy error, but clearly it doesnt work.

-- removing content to shorten ticket ... see attached log-and-stack.txt

Zulla

2016-12-25 15:34

reporter   ~0000936

Happened to me only once.
Server ver 3.8.7
Clients ver 3.8.9

Seems like a serverside bug,
restarting the service on the server fixed the issue on the client.

iberras

2016-12-26 15:57

reporter   ~0000937

Sometimes it was fixed with a restart. But sometimes a hard restart( deleting all symmetric tables) was needed. I hope they can find the bug

chenson

2016-12-26 17:02

administrator   ~0000938

This is only during an initial load, correct?

iberras

2016-12-26 17:37

reporter   ~0000939

That is correct.

chenson

2016-12-26 19:16

administrator   ~0000940

Do you have more of the log file after the timeoutexception? I would have expected the sync to continue after that exception.

In your example I think the client timed out and tried to pull again. The pull was rejected because the server was busy extracting. When the server finishes extracting it gets the timeout exception. At that point the connection should be freed up for the client to re-pull.

iberras

2016-12-26 19:22

reporter   ~0000941

The sync does not continue. The "busy" log keeps going and going. That stack trace does not appear normally in this error, it was just a coincidence, i think. I will try to replicate again the error and send the log.

chenson

2016-12-26 19:29

administrator   ~0000942

Last edited: 2016-12-26 19:30

View 2 revisions

What would really be a big help along with the log is a thread dump. Can you use jstack -l {pid} to get that as well?

iberras

2016-12-26 19:58

reporter   ~0000943

Last edited: 2017-01-25 13:13

View 2 revisions

Ok. Could replicate the error easily. You have to stop the initial load in the last part, after the tables have loaded and the data is coming. If you stop in that exact moment you will see in the log fail de EOF. After that the system does not recover. If i start another initial load, the same error keeps coming. Im sending the full log file and the attempt of the stack trace. I have it is useful.

removing content to shorten ticket ...

Hope you will find the bug.

chenson

2016-12-28 00:12

administrator  

chenson

2016-12-28 00:13

administrator   ~0000945

I attached a patch jar that can be put in the patches directory. If you can recreate the issue again and *attach* another log file that would be great. It should contain information about the reserving and releasing of connections. It should give us a clue as to what is going on.

chenson

2016-12-28 21:03

administrator   ~0000946

I spent a couple of hours today trying to reproduce this, and so far have been unsuccessful.

chenson

2016-12-28 21:05

administrator   ~0000947

Also, are there any errors in wrapper.log?

iberras

2016-12-29 00:36

reporter   ~0000948

Im out, and i'll be back on the 15th of january. I have lame internet and i cannot replicate the error here. If i get a stable connection, i will try to replicate the error eith the patch you sent me. The wrapper does not show any errors.

iberras

2017-01-23 19:16

reporter  

symmetric.log (118,301 bytes)

iberras

2017-01-23 19:18

reporter   ~0000959

Hello. I replicated the error with the new version 14. I always get this error when the initial load is interrupted in the last stage where all the data is beign loaded, after all the schema was created.

I attached the log file with the patch. If you find that you need more information, please let me know!

chenson

2017-01-25 13:14

administrator  

log-and-stack.txt (366,010 bytes)

chenson

2017-01-25 13:15

administrator   ~0000961

Still can't explain this.

I see thread 15 reserve a connection start to extract and then (for some reason wait 3 minutes) and then eventually release the connection. The release was at the tail end of the log file. Did things continue on after that?

2017-01-23 15:49:02,706 INFO [serverdevelopment] [ConcurrentConnectionManager] [qtp1713411479-15] Reserving connection for 002-default
2017-01-23 15:50:36,312 ERROR [serverdevelopment] [DataExtractorService] [qtp1713411479-15] Failed to extract batch 002-5072 StackTraceKey.init [RuntimeException:764789863]
2017-01-23 15:53:36,318 INFO [serverdevelopment] [ConcurrentConnectionManager] [qtp1713411479-15] Releasing connection for 002-default

iberras

2017-01-25 13:56

reporter   ~0000962

You can see that in the end of there is a loop:

serverdevelopment/pull connection, but was rejected because it already has one
2017-01-23 15:50:28,326 INFO [serverdevelopment] [ConcurrentConnectionManager] [qtp1713411479-19] Reserving connection for 002
2017-01-23 15:50:28,352 INFO [serverdevelopment] [ConcurrentConnectionManager] [qtp1713411479-18] Reserving connection for 002-default
2017-01-23 15:50:28,471 INFO [serverdevelopment] [DataLoaderService] [qtp1713411479-18] 1 data and 1 batches loaded during push request from c-serverdevelopment:002:002.
2017-01-23 15:50:28,473 INFO [serverdevelopment] [ConcurrentConnectionManager] [qtp1713411479-18] Releasing connection for 002-default
2017-01-23 15:50:29,018 INFO [serverdevelopment] [RouterService] [serverdevelopment-job-20] Routed 1 data events in 294 ms

This keeps repeating without an end. The last part is because I stoped the execution, those three minutes. It continues and continues.

iberras

2017-01-26 14:18

reporter   ~0000963

Chenson, I've made a discovery! It took me a while to figure this out, but if I leave symmetric in that state "busy" for a period of 2-3 minutes, the system achieves to resolve the problem and starts the initial load again. I don't know why it takes so long to resolve the issue, but it does. For now I believe this is resolved. Thank you for all the time you invested in this. If you want the log file where everything works, let me know.

Issue History

Date Modified Username Field Change
2016-08-24 02:28 iberras New Issue
2016-09-30 01:01 admin Status new => closed
2016-09-30 14:28 admin Status closed => new
2016-09-30 14:28 admin Note Added: 0000870
2016-12-20 09:40 Zulla Note Added: 0000931
2016-12-20 13:30 chenson Note Added: 0000932
2016-12-20 16:57 iberras Note Added: 0000933
2016-12-20 17:52 chenson Note Added: 0000934
2016-12-20 18:20 iberras Note Added: 0000935
2016-12-25 15:34 Zulla Note Added: 0000936
2016-12-26 15:57 iberras Note Added: 0000937
2016-12-26 17:02 chenson Note Added: 0000938
2016-12-26 17:37 iberras Note Added: 0000939
2016-12-26 19:16 chenson Note Added: 0000940
2016-12-26 19:22 iberras Note Added: 0000941
2016-12-26 19:29 chenson Note Added: 0000942
2016-12-26 19:30 chenson Note Edited: 0000942 View Revisions
2016-12-26 19:58 iberras Note Added: 0000943
2016-12-28 00:12 chenson File Added: log-concurrency-patch.jar
2016-12-28 00:13 chenson Note Added: 0000945
2016-12-28 21:03 chenson Note Added: 0000946
2016-12-28 21:05 chenson Note Added: 0000947
2016-12-29 00:36 iberras Note Added: 0000948
2017-01-23 19:16 iberras File Added: symmetric.log
2017-01-23 19:18 iberras Note Added: 0000959
2017-01-25 13:12 chenson Note Edited: 0000935 View Revisions
2017-01-25 13:13 chenson Note Edited: 0000943 View Revisions
2017-01-25 13:14 chenson Note Edited: 0000935 View Revisions
2017-01-25 13:14 chenson File Added: log-and-stack.txt
2017-01-25 13:15 chenson Note Added: 0000961
2017-01-25 13:56 iberras Note Added: 0000962
2017-01-26 14:18 iberras Note Added: 0000963
2017-02-01 01:06 chenson Status new => closed
2017-02-01 01:06 chenson Assigned To => chenson
2017-02-01 01:06 chenson Resolution open => unable to reproduce