Table of Contents
Chapter 3 introduced numerous concepts and the analysis and design needed to create an implementation of SymmetricDS. This chapter re-visits each analysis step and documents how to turn a SymmetricDS design into reality through configuration of the various SymmetricDS tables. In addition, several advanced configuration options, not presented previously, will also be covered.
To get a SymmetricDS node running, it needs to be given an identity and it needs to know how to connect to the database it will be synchronizing. The preferred way to configure a SymmetricDS engine is to create a properties file in the engines directory. The SymmetricDS server will create an engine for each properties file found in the engines directory. When started up, SymmetricDS reads the synchronization configuration and state from the database. If the configuration tables are missing, they are created automatically (auto creation can be disabled). Basic configuration is described by inserting into the following tables (the complete data model is defined in Appendix A, Data Model).
NODE_GROUP - specifies the tiers that exist in a SymmetricDS network
NODE_GROUP_LINK - links two node groups together for synchronization
CHANNEL - grouping and priority of synchronizations
TRIGGER - specifies tables, channels, and conditions for which changes in the database should be captured
ROUTER - specifies the routers defined for synchronization, along with other routing details
TRIGGER_ROUTER - provides mappings of routers and triggers
During start up, triggers are verified against the database, and database triggers are installed on tables that require data changes to be captured. The Route, Pull and Push Jobs begin running to synchronize changes with other nodes.
Each node requires properties that allow it to connect to a database
and register with a parent node. Properties are configured in a file named
xxxxx.properties
that is placed in the engines directory of
the SymmetricDS install. The file is usually named according to the
engine.name, but it is not a requirement.
To give a node its identity, the following properties are required.
Any other properties found in conf/symmetric.properties
can
be overridden for a specific engine in an engine's properties file. If the
properties are changed in conf/symmetric.properties
they will
take effect across all engines deployed to the server. Note that you can
use the variable $(hostName)
to represent the host name
of the machine when defining these properties (for example,
external.id=$(hostName) ).
This is an arbitrary name that is used to access a specific engine using an HTTP URL. Each node configured in the engines directory must have a unique engine name. The engine name is also used for the domain name of registered JMX beans.
The node group that this node is a member of. Synchronization is specified between node groups, which means you only need to specify it once for multiple nodes in the same group.
The external id for this node has meaning to the user and provides integration into the system where it is deployed. For example, it might be a retail store number or a region number. The external id can be used in expressions for conditional and subset data synchronization. Behind the scenes, each node has a unique sequence number for tracking synchronization events. That makes it possible to assign the same external id to multiple nodes, if desired.
The URL where this node can be contacted for synchronization.
At startup and during each heartbeat, the node updates its entry in
the database with this URL. The sync url is of the format:
http://{hostname}:{port}/{webcontext}/sync/{engine.name}
.
The {webcontext} is blank for a standalone deployment. It will typically be the name of the war file for an application server deployment.
The {engine.name} can be left blank if there is only one engine deployed in a SymmetricDS server.
When a new node is first started, it is has no information about synchronizing. It contacts the registration server in order to join the network and receive its configuration. The configuration for all nodes is stored on the registration server, and the URL must be specified in the following property:
The URL where this node can connect for registration to receive its configuration. The registration server is part of SymmetricDS and is enabled as part of the deployment. This is typically equal to the value of the sync.url of the registration server.
Note that a registration server node is
defined as one whose registration.url
is either (a)
blank, or (b) identical to its sync.url
.
For a deployment where the database connection pool should be created using a JDBC driver, set the following properties:
The class name of the JDBC driver.
The JDBC URL used to connect to the database.
The database username, which is used to login, create, and update SymmetricDS tables.
The password for the database user.
A node, a single instance of SymmetricDS, is defined in the NODE table. Two other tables play a direct role in defining a node, as well The first is NODE_IDENTITY. The only row in this table is inserted in the database when the node first registers with a parent node. In the case of a root node, the row is entered by the user. The row is used by a node instance to determine its node identity.
The following SQL statements set up a top-level registration server as a node identified as "00000" in the "corp" node group.
insert into SYM_NODE (node_id, node_group_id, external_id, sync_enabled) values ('00000', 'corp', '00000', 1); insert into SYM_NODE_IDENTITY values ('00000');
The second table, NODE_SECURITY has rows created for each child node that registers with the node, assuming auto-registration is enabled. If auto registration is not enabled, you must create a row in NODE and NODE_SECURITY for the node to be able to register. You can also, with this table, manually cause a node to re-register or do a re-initial load by setting the corresponding columns in the table itself. Registration is discussed in more detail in Section 4.7, “Opening Registration”.
Node Groups are straightforward to configure and are defined in the NODE_GROUP table. The following SQL statements would create node groups for "corp" and "store" based on our retail store example.
insert into SYM_NODE_GROUP (node_group_id, description) values ('store', 'A retail store node'); insert into SYM_NODE_GROUP (node_group_id, description) values ('corp', 'A corporate node');
Similarly, Node Group links are established using a data event action of 'P' for Push and 'W' for Pull ("wait"). The following SQL statements links the "corp" and "store" node groups for synchronization. It configures the "store" nodes to push their data changes to the "corp" nodes, and the "corp" nodes to send changes to "store" nodes by waiting for a pull.
insert into SYM_NODE_GROUP_LINK (source_node_group, target_node_group, data_event_action) values ('store', 'corp', 'P'); insert into SYM_NODE_GROUP_LINK (source_node_group, target_node_group, data_event_action) values ('corp', 'store', 'W');
A node group link can be configured to use the same node group as the source and the target. This configuration allows a node group to sync with every other node in its group.
By categorizing data into channels and assigning them to TRIGGERs, the user gains more control and visibility into the flow of data. In addition, SymmetricDS allows for synchronization to be enabled, suspended, or scheduled by channels as well. The frequency of synchronization and order that data gets synchronized is also controlled at the channel level.
The following SQL statements setup channels for a retail store. An "item" channel includes data for items and their prices, while a "sale_transaction" channel includes data for ringing sales at a register.
insert into SYM_CHANNEL (channel_id, processing_order, max_batch_size, max_batch_to_send, extract_period_millis, batch_algorithm, enabled, description) values ('item', 10, 1000, 10, 0, 'default', 1, 'Item and pricing data'); insert into SYM_CHANNEL (channel_id, processing_order, max_batch_size, max_batch_to_send, extract_period_millis, batch_algorithm, enabled, description) values ('sale_transaction', 1, 1000, 10, 60000, 'transactional', 1, 'retail sale transactions from register');
Batching is the grouping of data, by channel, to be transferred and committed at the client together. There are three different out-of-the-box batching algorithms which may be configured in the batch_algorithm column on channel.
All changes that happen in a transaction are guaranteed to be batched together. Multiple transactions will be batched and committed together until there is no more data to be sent or the max_batch_size is reached.
Batches will map directly to database transactions. If there are many small database transactions, then there will be many batches. The max_batch_size column has no effect.
Multiple transactions will be batched and committed together until there is no more data to be sent or the max_batch_size is reached. The batch will be cut off at the max_batch_size regardless of whether it is in the middle of a transaction.
If a channel contains only tables that will be
synchronized in one direction and and data is routed to all the nodes in
the target node groups, then batching on the channel can be optimized to
share batches across nodes. This is an important feature when data needs
to be routed to thousands of nodes. When this mode is detected, you will
see batches created in OUTGOING_BATCH with the common_flag
set to
1.
There are also several size-related parameters that can be set by channel. They include:
Specifies the maximum number of data events to process within a batch for this channel.
Specifies the maximum number of batches to send for a given channel during a 'synchronization' between two nodes. A 'synchronization' is equivalent to a push or a pull. For example, if there are 12 batches ready to be sent for a channel and max_batch_to_send is equal to 10, then only the first 10 batches will be sent even though 12 batches are ready.
Specifices the maximum number of data rows to route for a channel at a time.
Based on your particular synchronization requirements, you can also specify whether old, new, and primary key data should be read and included during routing for a given channel. These are controlled by the columns use_old_data_to_route, use_row_data_to_route, and use_pk_data_to_route, respectively. By default, they are all 1 (true).
Finally, if data on a particular channel contains big lobs, you can set the column contains_big_lob to 1 (true) to provide SymmetricDS the hint that the channel contains big lobs. Some databases have shortcuts that SymmetricDS can take advantage of if it knows that the lob columns in DATA aren't going to contain large lobs. The definition of how large a 'big' lob is varies from database to database.
SymmetricDS captures synchronization data using database triggers.
SymmetricDS' Triggers are defined in the TRIGGER table. Each record is used by SymmetricDS when
generating database triggers. Database triggers are only generated when
a trigger is associated with a ROUTER whose source_node_group_id
matches the node group id of the current node.
The source_table_name
may contain the asterisk
('*') wildcard character so that one TRIGGER table entry can define synchronization for many
tables. System tables and any tables that start with the SymmetricDS
table prefix will be excluded. A list of wildcard tokens can also be
supplied. If there are multiple tokens, they should be delimited with a
comma. A wildcard token can also start with a bang ('!') to indicate an
exclusive match. Tokens are always evalulated from left to right. When a
table match is made, the table is either added to or removed from the
list of tables. If another trigger already exists for a table, then that
table is not included in the wildcard match (the explictly defined
trigger entry take precendence).
When determining whether a data change has occurred or not, by
defalt the triggers will record a change even if the data was updated to
the same value(s) they were originally. For example, a data change will
be captured if an update of one column in a row updated the value to the
same value it already was. There is a global property,
trigger.update.capture.changed.data.only.enabled
(false by default), that allows you to override this behavior. When set
to true, SymmetricDS will only capture a change if the data has truly
changed (i.e., when the new column data is not equal to the old column
data).
trigger.update.capture.changed.data.only.enabled
is
currently only supported in the MySQL, DB2 and Oracle
dialects.The following SQL statement defines a trigger that will capture data for a table named "item" whenever data is inserted, updated, or deleted. The trigger is assigned to a channel also called 'item'.
insert into SYM_TRIGGER (trigger_id,source_table_name,channel_id,last_update_time,create_time) values ('item', 'item', 'item', current_timestamp, current_timestamp);
Note that many databases allow for multiple triggers of the same type to be defined. Each database defines the order in which the triggers fire differently. If you have additional triggers beyond those SymmetricDS installs on your table, please consult your database documentation to determine if there will be issues with the ordering of the triggers.
Two lobs-related settings are also available on TRIGGER:
Specifies whether to capture lob data as the trigger is firing or to stream lob columns from the source tables using callbacks during extraction. A value of 1 indicates to stream from the source via callback; a value of 0, lob data is captured by the trigger.
Provides a hint as to whether this trigger will capture big lobs data. If set to 1 every effort will be made during data capture in trigger and during data selection for initial load to use lob facilities to extract and store data in the database.
Occasionally, you may find that you need to capture and save away a piece of data present in another table when a trigger is firing.
This data is typically needed for
the purposes of determining where to 'route' the data to once routing takes place. Each trigger definition contains an optional
external_select
field which can be used to specify the data to be captured.
Once captured, this data is available during routing in DATA's external_data
field.
For these cases, place a SQL select statement which returns the data item you need for routing in external_select
.
An example of the use of external select can be found in Section 4.6.2.6, “Utilizing External Select when Routing”.
Routers provided in the base implementation currently include:
The mapping between the set of triggers and set of routers is many-to-many. This means that one trigger can capture changes and route to multiple locations. It also means that one router can be defined an associated with many different triggers.
The simplest router is a router that sends all the data that is captured by its associated triggers to all the nodes that belong to the target node group defined in the router. A router is defined as a row in the ROUTER table. It is then linked to triggers in the TRIGGER_ROUTER table.
The following SQL statement defines a router that will send data from the 'corp' group to the 'store' group.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, create_time, last_update_time) values ('corp-2-store','corp', 'store', current_timestamp, current_timestamp);
The following SQL statement maps the 'corp-2-store' router to the item trigger.
insert into SYM_TRIGGER_ROUTER (trigger_id, router_id, initial_load_order, create_time, last_update_time) values ('item', 'corp-2-store', 1, current_timestamp, current_timestamp);
Sometimes requirements may exist that require data to be routed
based on the current value or the old value of a column in the table
that is being routed. Column routers are configured by setting the
router_type
column on the ROUTER table to
column
and setting the
router_expression
column to an equality expression
that represents the expected value of the column.
The first part of the expression is always the column name. The column name should always be defined in upper case. The upper case column name prefixed by OLD_ can be used for a comparison being done with the old column data value.
The second part of the expression can be a constant value, a token that represents another column, or a token that represents some other SymmetricDS concept. Token values always begin with a colon (:).
Consider a table that needs to be routed to all nodes in the target group only when a status column is set to 'READY TO SEND.' The following SQL statement will insert a column router to accomplish that.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-ok','corp', 'store', 'column', 'STATUS=READY TO SEND', current_timestamp, current_timestamp);
Consider a table that needs to be routed to all nodes in the target group only when a status column changes values. The following SQL statement will insert a column router to accomplish that. Note the use of OLD_STATUS, where the OLD_ prefix gives access to the old column value.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-status','corp', 'store', 'column', 'STATUS!=:OLD_STATUS', current_timestamp, current_timestamp);
Consider a table that needs to be routed to only nodes in the target group whose STORE_ID column matches the external id of a node. The following SQL statement will insert a column router to accomplish that.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-id','corp', 'store', 'column', 'STORE_ID=:EXTERNAL_ID', current_timestamp, current_timestamp);
Attributes on a NODE that can be referenced with tokens include:
Captured EXTERNAL_DATA is also available for routing as a virtual column.
Consider a table that needs to be routed to a redirect node defined by its external id in the REGISTRATION_REDIRECT table. The following SQL statement will insert a column router to accomplish that.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-redirect','corp', 'store', 'column', 'STORE_ID=:REDIRECT_NODE', current_timestamp, current_timestamp);
More than one column may be configured in a router_expression. When more than one column is configured, all matches are added to the list of nodes to route to. The following is an example where the STORE_ID column may contain the STORE_ID to route to or the constant of ALL which indicates that all nodes should receive the update.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-multiple-matches','corp', 'store', 'column', 'STORE_ID=ALL or STORE_ID=:EXTERNAL_ID', current_timestamp, current_timestamp);
The NULL keyword may be used to check if a column is null. If the column is null, then data will be routed to all nodes who qualify for the update. This following is an example where the STORE_ID column is used to route to a set of nodes who have a STORE_ID equal to their EXTERNAL_ID, or to all nodes if the STORE_ID is null.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-multiple-matches','corp', 'store', 'column', 'STORE_ID=NULL or STORE_ID=:EXTERNAL_ID', current_timestamp, current_timestamp);
A lookup table may contain the id of the node where data needs
to be routed. This could be an existing table or an ancillary table
that is added specifically for the purpose of routing data. Lookup
table routers are configured by setting the
router_type
column on the ROUTER table to
lookuptable
and setting a list of configuration
parameters in the router_expression
column.
Each of the following configuration parameters are required.
This is the name of the lookup table.
This is the name of the column on the table that is being routed. It will be used as a key into the lookup table.
This is the name of the column that is the key on the lookup table.
This is the name of the column that contains the external_id of the node to route to on the lookup table.
Note that the lookup table will be read into memory and cached for the duration of a routing pass for a single channel.
Consider a table that needs to be routed to a specific store, but the data in the changing table only contains brand information. In this case, the STORE table may be used as a lookup table.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-ok','corp', 'store', 'lookuptable', 'LOOKUP_TABLE=STORE KEY_COLUMN=BRAND_ID LOOKUP_KEY_COLUMN=BRAND_ID EXTERNAL_ID_COLUMN=STORE_ID', current_timestamp, current_timestamp);
Sometimes routing decisions need to be made based on data that
is not in the current row being synchronized. A 'subselect' router can be used
in these cases. A 'subselect' is configured with a router_expression
that is a
SQL select statement which returns a result set of the node ids that
need routed to. Column tokens can be used in the SQL expression and
will be replaced with row column data. The overhead of using this
router type is high because the 'subselect' statement runs for each
row that is routed. It should not be used for tables that have a lot
of rows that are updated. It also has the disadvantage that if the
data being relied on to determine the node id has been deleted before
routing takes place, then no results would be returned and
routing would not happen.
The router_expression
you specify is appended to the
following SQL statement in order to select the node ids:
select c.node_id from sym_node c where c.node_group_id=:NODE_GROUP_ID and c.sync_enabled=1 and ...
As you can see, you have access to information about the node currently under consideration for routing
through the 'c' alias, for example c.external_id
.
There are two node-related tokens you can use in your expression:
Column names representing data for the row in question are prefixed with a colon as well, for example:
:EMPLOYEE_ID
, or :OLD_EMPLOYEE_ID
. Here, the OLD_ prefix indicates the value before
the change in cases where the old data has been captured.
For an example, consider the case where an Order table and a OrderLineItem table need to be routed to a specific store. The Order table has a column named order_id and STORE_ID. A store node has an external_id that is equal to the STORE_ID on the Order table. OrderLineItem, however, only has a foreign key to its Order of order_id. To route OrderLineItems to the same nodes that the Order will be routed to, we need to reference the master Order record.
There are two possible ways to solve this in
SymmetricDS. One is to configure a 'subselect' router_type on the
ROUTER table, shown below (The other possible
approach is to use an external_select
to capture the data via a trigger for use in
a column match router, demonstrated in Section 4.6.2.6, “Utilizing External Select when Routing”).
Our solution utilizing subselect compares the external id of the current node with the store id from the Order table where the order id matches the order id of the current row being routed:
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store','corp', 'store', 'subselect', 'c.external_id in (select STORE_ID from order where order_id=:ORDER_ID)', current_timestamp, current_timestamp);
As a final note, please note in this example that the parent row in Order must still exist at the moment of routing for the child rows (OrderLineItem) to route, since the select statement is run when routing is occurring, not when the change data is first captured.
When more flexibility is needed in the logic to choose the nodes to route to, then the a scripted router may be used. The currently available scripting language is Bean Shell. Bean Shell is a Java-like scripting language. Documentation for the Bean Shell scripting language can be found at http://www.beanshell.org.
The router_type for a Bean Shell scripted router is 'bsh'. The router_expression is a valid Bean Shell script that:
targetNodes
collection
which is bound to the script Also bound to the script evaluation is a list of
nodes
. The list of nodes
is a list of
eligible org.jumpmind.symmetric.model.Node
objects. The
current data column values and the old data column values are bound to
the script evaluation as Java object representations of the column
data. The columns are bound using the uppercase names of the columns.
Old values are bound to uppercase representations that are prefixed
with 'OLD_'.
If you need access to any of the SymmetricDS services, then the
instance of org.jumpmind.symmetric.ISymmetricEngine
is
accessible via the bound engine
variable.
In the following example, the node_id is a combination of STORE_ID and WORKSTATION_NUMBER, both of which are columns on the table that is being routed.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-bsh','corp', 'store', 'bsh', 'targetNodes.add(STORE_ID + "-" + WORKSTATION_NUMBER);', current_timestamp, current_timestamp);
The same could also be accomplished by simply returning the node id. The last line of a bsh script is always the return value.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-bsh','corp', 'store', 'bsh', 'STORE_ID + "-" + WORKSTATION_NUMBER', current_timestamp, current_timestamp);
The following example will synchronize to all nodes if the FLAG column has changed, otherwise no nodes will be synchronized. Note that here we make use of OLD_, which provides access to the old column value.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-flag-changed','corp', 'store', 'bsh', 'FLAG != null && !FLAG.equals(OLD_FLAG)', current_timestamp, current_timestamp);
The next example shows a script that iterates over each eligible node and checks to see if the trimmed value of the column named STATION equals the external_id.
insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-trimmed-station','corp', 'store', 'bsh', 'for (org.jumpmind.symmetric.model.Node node : nodes) { if (STATION != null && node.getExternalId().equals(STATION.trim())) { targetNodes.add(node.getNodeId()); } }', current_timestamp, current_timestamp);
There may be times when you wish to route based on a piece of data that exists in
a table other than the one being routed. The approach, first discussed in
Section 4.6.2.4, “Subselect Router”,
is to utlize an external_select
to save away data in external_data
, which can then
be referenced during routing.
Reconsider subselect's Order / OrderLineItem example (found in Section 4.6.2.4, “Subselect Router”), where routing for the line item is accomplished by linking to the "header" Order row. As an alternate way of solving the problem, we will now use External Select combined with a column match router.
In this version of the solution, the STORE_ID is captured from the Order table in the EXTERNAL_DATA column when the trigger fires. The router is configured to route based on the captured EXTERNAL_DATA to all nodes whose external id matches the captured external data.
insert into SYM_TRIGGER (trigger_id,source_table_name,channel_id,external_select, last_update_time,create_time) values ('orderlineitem', 'orderlineitem', 'orderlineitem','select STORE_ID from order where order_id=$(curTriggerValue).$(curColumnPrefix)order_id', current_timestamp, current_timestamp); insert into SYM_ROUTER (router_id, source_node_group_id, target_node_group_id, router_type, router_expression, create_time, last_update_time) values ('corp-2-store-ext','corp', 'store', 'column', 'EXTERNAL_DATA=:EXTERNAL_ID', current_timestamp, current_timestamp);
Note the syntax $(curTriggerValue).$(curColumnPrefix). This translates into "OLD_" or "NEW_" based on the DML type being run. In the case of Insert or Update, it's NEW_. For Delete, it's OLD_ (since there is no new data). In this way, you can access the DML-appropriate value for your select statement.
The advantage of this approach over the 'subselect' approach is that it guards against the (somewhat unlikely) possibility that the master Order table row might have been deleted before routing has taken place. This external select solution also is a bit more efficient than the 'subselect' approach, although the triggers produced do run the extra external_select SQL inline with application database updates.
Two important controls can be configured for a specific Trigger / Router combination: Initial Loads and Ping Back. The parameters for these can be found in the Trigger / Router mapping table, TRIGGER_ROUTER.
An initial load is the process of seeding tables at a target node with data from its parent node. When a node connects and data is extracted, after it is registered and if an initial load was requested, each table that is configured to synchronize to the target node group will be given a reload event in the order defined by the end user. A SQL statement is run against each table to get the data load that will be streamed to the target node. The selected data is filtered through the configured router for the table being loaded. If the data set is going to be large, then SQL criteria can optionally be provided to pair down the data that is selected out of the database.
An initial load can not occur until after a node is registered.
An initial load is requested by setting the
initial_load_enabled
column on NODE_SECURITY to
1 on the row for the target node in the parent
node's database. You can configure SymmetricDS to automatically perform an initial load
when a node registers by setting the parameter auto.reload
to true.
Regardless of how the initial load is initiated, the next time the source node routes data, reload
batches will be inserted. At the same time reload batches are
inserted, all previously pending batches for the node are marked as
successfully sent.
Note that if the parent node that a node is registering with
is not a registration server node (as can
happen with a registration redirect or certain non-tree structure
node configurations) the parent node's NODE_SECURITY entry must exist
at the parent node and have a non-null value for column
initial_load_time
. Nodes can't be registered to
non-registration-server nodes without this value being set one way
or another (i.e., manually, or as a result of an initial load
occuring at the parent node).
SymmetricDS recognizes that an initial load has completed when
the initial_load_time
column on the target node is
set to a non-null value.
An initial load is accomplished by inserting reload batches in a
defined order according to the initial_load_order
column on TRIGGER_ROUTER. If
the initial_load_order
column contains a negative
value the associated table will NOT be loaded. If
the initial_load_order
column contains the same
value for multiple tables, SymmetricDS will attempt to order the
tables according to foreign key constraints. If there are cyclical
constraints, then foreign keys might need to be turned off or the
initial load will need to be manually configured based on knowledge of
how the data is structured.
Initial load data is always queried from the source database table. All data is passed through the configured router to filter out data that might not be targeted at a node.
There are several parameters that can be used to specify what, if anything, should be done to the table on the target database just prior to loading the data. Note that the parameters below specify the desired behavior for all tables in the initial load, not just one.
initial.load.delete.first / initial.load.delete.first.sql
By default, an initial load will not delete existing rows from a target table
before loading the data. If a delete is desired, the parameter
initial.load.delete.first
can be set to true. If true,
the command found in initial.load.delete.first.sql
will be run on each table prior to loading the data.
Thd default value for initial.load.delete.first.sql
is delete from %s
,
but could be changed if needed.
Note that additional reload batches are created, in the correct order, to achieve the delete.
initial.load.create.first
By default, an initial load will not create the table on the target if it doesn't aleady exist.
If the desired behavior is to create the table on the target if it is not present,
set the parameter intial.load.create.first
to true. SymmetricDS will
attempt to create the table and indexes on the target database before doing the initial load. (Additional batches
are created to represent the table schema).
An efficient way to select a subset of data from a table for an
initial load is to provide an initial_load_select
clause on TRIGGER_ROUTER.
This clause, if present, is applied as a where
clause to the SQL used to select the data to be loaded. The clause may
use "t" as an alias for the table being loaded, if needed. The
$(externalId)
token can be used for subsetting the
data in the where clause.
In cases where routing is done using a feature like Section 4.6.2.4, “Subselect Router”, an
initial_load_select
clause matching the subselect's
criteria would be a more efficient approach. Some routers will check
to see if the initial_load_select
clause is
provided, and they will not execute assuming that
the more optimal path is using the
initial_load_select
statement.
One example of the use of an initial load select would be if you
wished to only load data created more recently than the start of year
2011. Say, for example, the column created_time
contains the creation date. Your
initial_load_select
would read
created_time > ts {'2011-01-01 00:00:00.0000'}
(using whatever timestamp format works for your database). This then
gets applied as a where
clause when selecting data
from the table.
When providing an initial_load_select
be
sure to test out the criteria against production data in a query
browser. Do an explain plan to make sure you are properly using
indexes.
The default behavior for initial loads is to load data from the registration server or parent node, to a client node.
Occasionally, there may be need to do a one-time intial load of data in the opposite or "reverse" direction, namely from a client
node to the registration node. To achieve this, set the parameter auto.reload.reverse
to be true, but only for the specific
node group representing the client nodes. This will cause a one time reverse load of data, for tables configured with non-negative initial load orders, to be
batched at the point when registration of the client node is occurring. These batches are then sent to the parent or registration node.
This capability might be needed, for example, if there is data already present in the client that doesn't exist in the parent but needs to.
Occasionally the decision of what data to load initially results
in additional triggers. These triggers, known as Dead
Triggers, are configured such that they do not capture any
data changes. A "dead" Trigger is one that does not capture data
changes. In other words, the sync_on_insert
,
sync_on_update
, and
sync_on_delete
properties for the Trigger are all
set to false. However, since the Trigger is specified, it
will be included in the initial load of data for
target Nodes.
Why might you need a Dead Trigger? A dead Trigger might be used to load a read-only lookup table, for example. It could also be used to load a table that needs populated with example or default data. Another use is a recovery load of data for tables that have a single direction of synchronization. For example, a retail store records sales transaction that synchronize in one direction by trickling back to the central office. If the retail store needs to recover all the sales transactions from the central office, they can be sent are part of an initial load from the central office by setting up dead Triggers that "sync" in that direction.
The following SQL statement sets up a non-syncing dead Trigger
that sends the sale_transaction
table to the
"store" Node Group from the "corp" Node Group during an initial load.
insert into sym_trigger (TRIGGER_ID,SOURCE_CATALOG_NAME, SOURCE_SCHEMA_NAME,SOURCE_TABLE_NAME,CHANNEL_ID, SYNC_ON_UPDATE,SYNC_ON_INSERT,SYNC_ON_DELETE, SYNC_ON_INCOMING_BATCH,NAME_FOR_UPDATE_TRIGGER, NAME_FOR_INSERT_TRIGGER,NAME_FOR_DELETE_TRIGGER, SYNC_ON_UPDATE_CONDITION,SYNC_ON_INSERT_CONDITION, SYNC_ON_DELETE_CONDITION,EXTERNAL_SELECT, TX_ID_EXPRESSION,EXCLUDED_COLUMN_NAMES, CREATE_TIME,LAST_UPDATE_BY,LAST_UPDATE_TIME) values ('SALE_TRANSACTION_DEAD',null,null, 'SALE_TRANSACTION','transaction', 0,0,0,0,null,null,null,null,null,null,null,null,null, current_timestamp,'demo',current_timestamp); insert into sym_router (ROUTER_ID,TARGET_CATALOG_NAME,TARGET_SCHEMA_NAME, TARGET_TABLE_NAME,SOURCE_NODE_GROUP_ID,TARGET_NODE_GROUP_ID,ROUTER_TYPE, ROUTER_EXPRESSION,SYNC_ON_UPDATE,SYNC_ON_INSERT,SYNC_ON_DELETE, CREATE_TIME,LAST_UPDATE_BY,LAST_UPDATE_TIME) values ('CORP_2_STORE',null,null,null, 'corp','store',null,null,1,1,1, current_timestamp,'demo',current_timestamp); insert into sym_trigger_router (TRIGGER_ID,ROUTER_ID,INITIAL_LOAD_ORDER, INITIAL_LOAD_SELECT,CREATE_TIME,LAST_UPDATE_BY,LAST_UPDATE_TIME) values ('SALE_TRANSACTION_DEAD','CORP_2_REGION',100,null, current_timestamp,'demo',current_timestamp);
As discussed in Section 3.6.3.2, “Circular References and "Ping Back"”
SymmetricDS, by default, avoids circular data changes. When a trigger
fires as a result of SymmetricDS itself (such as the case when sync on
incoming batch is set), it records the originating source node of the
data change in source_node_id
. During routing, if
routing results in sending the data back to the originating source
node, the data is not routed by default. If instead you wish to route
the data back to the originating node, you can set the
ping_back_enabled
column for the needed particular
trigger / router combination. This will cause the router to "ping" the
data back to the originating node when it usually would not.
Node registration is the act of setting up a new NODE and NODE_SECURITY so that when the new
node is brought online it is allowed to join the system. Nodes are only
allowed to register if rows exist for the node and the
registration_enabled
flag is set to 1. If the
auto.registration
SymmetricDS property is set to true,
then when a node attempts to register, if registration has not already
occurred, the node will automatically be registered.
SymmetricDS allows you to have multiple nodes with the same
external_id
. Out of the box, openRegistration will open
a new registration if a registration already exists for a node with the
same external_id. A new registration means a new node with a new
node_id
and the same external_id
will be created. If you want to re-register the same node you can use the
reOpenRegistration()
JMX method which takes a
node_id
as an argument.
New as of SymmetricDS 2.4, SymmetricDS is now able to transform synchronized data by way of configuration (previously, for most cases a custom data loader would need to have been written). This transformation can take place on a source node or on a target node, as the data is being loaded or extracted. With this new feature you can, for example:
Copy a column from a source table to two (or more) target table columns,
Merge columns from two or more source tables into a single row in a target table,
Insert constants in columns in target tables based on source data synchronizations,
Insert multiple rows of data into a single target table based on one change in a source table,
Apply a Bean Shell script to achieve a custom transform when loading into the target database.
These transformations can take place either on the target or on the source, and as data is either being extracted or loaded. In either case, the transformation is initiated due to existence of a source synchronization trigger. The source trigger creates the synchronization data, while the transformation configuration decides what to do with the sychronization data as it is either being extracted from the source or loaded into the target. You have the flexibility of defining different transformation behavior depending on whether the source change that triggered the synchronization was an Insert, Update, or Delete. In the case of Delete, you even have options on what exactly to do on the target side, be it a delete of a row, setting columns to specific values, or absolutely nothing at all.
A few key concepts are important to keep in mind to understand how SymmetricDS performs transformations. The first concept is that of the "source operation" or "source DML type", which is the type of operation that occurred to generate the synchronization data in the first place (i.e., an insert, a delete, or an update). Your transformations can be configured to act differently based on the source DML type, if desired. When transforming, by default the DML action taken on the target matches that of the action taken on the row in the source (although this behavior can be altered through configuration if needed). If the source DML type is an Insert, for example, the resulting transformation DML(s) will be Insert(s).
Another important concept is the way in which transforms are applied. Each source operation may map to one or more transforms and result in one or more operations on the target tables. Each of these target operations are performed as independent operations in sequence and must be "complete" from a SQL perspective. In other words, you must define columns for the transformation that are sufficient to fill in any primary key or other required data in the target table if the source operation was an Insert, for example.
Finally, please note that the tranformation engine relies on a source trigger / router existing to supply the source data for the transformation. The transform configuration will never be used if the source table and target node group does not have a defined trigger / router combination for that source table and target node group.
SymmetricDS stores its transformation configuration in two configuration tables, TRANSFORM_TABLE and TRANSFORM_COLUMN. Defining a transformation involves configuration in both tables, with the first table defining which source and destination tables are involved, and the second defining the columns involved in the transformation and the behavior of the data for those columns. We will explain the various options available in both tables and the various pre-defined transformation types.
To define a transformation, you will first define the source table and target table that applies to a particular transformation. The source and target tables, along with a unique identifier (the transform_id column) are defined in TRANSFORM_TABLE. In addition, you will specify the source_node_group_id and target_node_group_id to which the transform will apply, along with whether the transform should occur on the Extract step or the Load step (transform_point). All of these values are required.
Three additional configuration settings are also defined at the source-target table level: the order of the transformations, the behavior when deleting, and whether an update should always be attempted first. More specifically,
For each transformation defined in TRANSFORM_TABLE, the columns to be transformed (and how they are transformed) are defined in TRANSFORM_COLUMN. This column-level table typically has several rows for each transformation id, each of which defines the source column name, the target column name, as well as the following details:
There are several pre-defined transform types available in
SymmetricDS. Additional ones can be defined by creating and configuring
an extension point which implements the IColumnTransform
interface. The pre-defined transform types include the following (the
transform_type entry is shown in parentheses):
Ssystem_date
is the
current system date, system_timestamp
is the current
system date and time, source_node_id
is the node id of
the source, target_node_id
is the node id of the
target, and null
is a null value.n
, the beginning index), or a pair
of comma-separated integers (n,m
- the beginning and
ending index). The transform behaves as the Java substring function
would using the specified values in transform_expression.COLUMN_NAME
is a variable
for a source column in the row, where the variable name is the
column name in uppercase; currentValue
is the value of
the current source column; oldValue
is the old value of
the source column for an updated row; sqlTemplate
is a
org.jumpmind.db.sql.ISqlTemplate
object for querying or
updating the database; channelId
is a reference to the
channel on which the transformation is happening;
sourceNode
is a
org.jumpmind.symmetric.model.Node
object that
represents the node from where the data came;
targetNode
is a
org.jumpmind.symmetric.model.Node
object that
represents the node where the data is being loaded.New as of SymmetricDS 3.1, SymmetricDS is now capable of taking actions upon the load of certain data via configurable load filters. This new configurable option is in additon to the already existing option of writing a class that implements IDatabaseWriterFilter. A configurable load filter watches for specific data that is being loaded and then takes action based on the load of that data.
Specifying which data to action is done by specifying a souce and target node group (data extracted from this node group, and loaded into that node group), and a target catalog, schema and table name. You can decide to take action on rows that are inserted, updated and/or deleted, and can also further delineate which rows of the target table to take action on by specifying additional criteria in the bean shell script that is executed in response to the loaded data. As an example, old and new values for the row of data being loaded are available in the bean shell script, so you can action rows with a certain column value in old or new data.
The action taken is based on a bean shell script that you can provide as part of the configuration. Actions can be taken at different points in the load process including before write, after write, at batch complete, at batch commit and/or at batch rollback.
SymmetricDS stores its load filter configuration in a single table called LOAD_FILTER. The load filter table allows you to specify the following:
As part of the bean shell load filters, SymmetricDS provides certain variables for use in the bean shell script. Those variables include:
The following is an example of a load filter that watches a table named TABLE_TO_WATCH being loaded from the Server Node Group to the Client Node Group for inserts or updates, and performs an initial load on a table named "TABLE_TO_RELOAD" for KEY_FIELD on the reload table equal to a column named KEY_FIELD on the TABLE_TO_WATCH table.
insert into sym_load_filter (LOAD_FILTER_ID, LOAD_FILTER_TYPE, SOURCE_NODE_GROUP_ID, TARGET_NODE_GROUP_ID, TARGET_CATALOG_NAME, TARGET_SCHEMA_NAME, TARGET_TABLE_NAME, FILTER_ON_UPDATE, FILTER_ON_INSERT, FILTER_ON_DELETE, BEFORE_WRITE_SCRIPT, AFTER_WRITE_SCRIPT, BATCH_COMPLETE_SCRIPT, BATCH_COMMIT_SCRIPT, BATCH_ROLLBACK_SCRIPT, HANDLE_ERROR_SCRIPT, CREATE_TIME, LAST_UPDATE_BY, LAST_UPDATE_TIME, LOAD_FILTER_ORDER, FAIL_ON_ERROR) values ('TABLE_TO_RELOAD','BSH','Client','Server',NULL,NULL, 'TABLE_TO_WATCH',1,1,0,null, 'engine.getDataService().reloadTable(context.getBatch().getSourceNodeId(), table.getCatalog(), table.getSchema(), "TABLE_TO_RELOAD","KEY_FIELD=''" + KEY_FIELD + "''");' ,null,null,null,null,sysdate,'userid',sysdate,1,1);
Conflict detection and resolution is new as of SymmetricDS 3.0. Conflict detection is the act of determining if an insert, update or delete is in "conflict" due to the target data row not being consistent with the data at the source prior to the insert/update/delete. Conflict resolution is the act of figuring out what to do when a conflict is detected.
Conflict detection and resolution strategies are configured in the CONFLICT table. They are configured at minimum for a specific NODE_GROUP_LINK . The configuration can also be specific to a CHANNEL and/or table.
Conflict detection is configured in the
detect_type
and
detect_expression
columns of
CONFLICT
. The value for
detect_expression
depends on the
detect_type
. Conflicts are detected while data is being loaded into a target system.
Indicates that only the primary key is used to detect a conflict. If a row exists with the same primary key, then no conflict is detected during an update or a delete. Updates and deletes rows are resolved using only the primary key columns. If a row already exists during an insert then a conflict has been detected.
Indicates that all of the old data values are used to detect a conflict. Old data is the data values of the row on the source system prior to the change. If a row exists with the same old values on the target system as they were on the source system, then no conflict is detected during an update or a delete. If a row already exists during an insert then a conflict has been detected.
Note that some platforms do not support comparisons of binary columns. Conflicts in binary column values will not be detected on the following platforms: DB2, DERBY, ORACLE, and SQLSERVER.
Indicates that the primary key plus any data that has changed on the source system will be used to detect a conflict. If a row exists with the same old values on the target system as they were on the source system for the columns that have changed on the source system, then no conflict is detected during an update or a delete. If a row already exists during an insert then a conflict has been detected.
Note that some platforms do not support comparisons of binary columns. Conflicts in binary column values will not be detected on the following platforms: DB2, DERBY, ORACLE, and SQLSERVER.
Indicates that the primary key plus a timestamp column (as configured in
detect_expression
) will indicate whether a conflict has occurred. If the target timestamp column is not equal to the
old source timestamp column, then a conflict has been detected. If a row already exists during an
insert then a conflict has been detected.
Indicates that the primary key plus a version column (as configured in
detect_expression
) will indicate whether a conflict has occurred. If the target version column is not equal to the old
source version column, then a conflict has been detected. If a row already exists during an insert
then a conflict has been detected.
Conflict resolution is configured in the
resolve_type
column.
Indicates that when a conflict is detected the system should automatically apply the changes anyways.
If the source operation was an insert, then an update will be attempted. If the source operation was
an update and the row does not exist, then an insert will be attempted. If the source operation was a
delete and the row does not exist, then the delete will be ignored. The
resolve_changes_only
flag controls whether all columns will be updated or only columns that have changed will be updated
during a fallback operation.
Indicates that when a conflict is detected the system shoudl automatically ignore the incoming
change. The
resolve_row_only
column controls whether the entire batch should be ignore or just the row in conflict.
Indicates that when a conflict is detected the batch will remain in error until manual intervention
occurs. The row in error is inserted into the
INCOMING_ERROR
table. In order to resolve, the
resolve_data
column can be manually filled out which will be used on the next load attempt instead of the original
source data. The
resolve_ignore
flag can also be used to indicate that the row should be ignored on the next load attempt.
Indicates that when a conflict is detected by USE_TIMESTAMP or USE_VERSION that the either the source or the target will win based on the which side has the newer timestamp or higher version number.