Wednesday, November 19, 2008

Deleting absent, updating existing and inserting new rows in a table

Synchronizing a table based on a flat files is one of the more common tasks when dealing with databases. It should be a straight forward task to delete removed records, update changed records and insert new records but I often encounter horrible ways of achieving this goal.

I recently came across an implementation using a DTS package with a data-driven task in SQL Server 2000. For each line in the file it would launch several queries to detect if it was a new, updated or unchanged record (all glued together with VBScript) and then launch the appropriate query to update the database.

Searching a bit on the internet I came across this simple solution. The best way of doing this fast and reliably is by using the JOIN and LEFT JOIN clauses in your UPDATE, INSERT and DELETE commands to determine the status of the record. I would like to show my implementation of slightly more complicated requirements.

Preparation

First start by bulk inserting your data file into a staging table that has no constraints or indexes. This is the fastest way to get your data into the database. For the synchronisation queries (see below), you might want to consider using transactions if there is a risk of data corruption. You could also gain some speed by disabling the indexes while running the insert query.

In the following examples I will try to synchronize the accounts table using the tmp_accounts table as the staging table.

Deleting removed rows

DELETE accounts 
FROM accounts acc LEFT JOIN tmp_accounts tmp
ON  acc.category = tmp.category
AND acc.code     = tmp.code
WHERE acc.code IS NULL
   OR acc.category IS NULL

Fields code and category together uniquely define an account.
Using the LEFT JOIN with the FROM clause will ensure that all records from the accounts table are selected, even if they don't exist in tmp_accounts. The records that don't exist in tmp_accounts will have a NULL value for every field. So all records that exist in the accounts table but have NULL values in their tmp_accounts fields, need to be deleted.

Updating changed rows

UPDATE accounts SET
  description = tmp.description,
  type        = tmp.type,
  section     = tmp.section,
  audit_date  = tmp.audit_date,
  audit_user  = tmp.audit_user
FROM accounts     acc
JOIN tmp_accounts tmp
ON  acc.category = tmp.category
AND acc.code     = tmp.code
WHERE acc.description <> tmp.description
   OR acc.type        <> tmp.type
   OR acc.section     <> tmp.section

By using the JOIN clause, we are sure to work only on records that exist in both the accounts table and the tmp_accounts table. Accounts need to be update in case their description, type or section has changed.

Audit_user and audit_date are automatically filled in the tmp_accounts table using a DEFAULT and are always kept up-to-date in the accounts table.

Inserting new rows

INSERT INTO accounts
( category,
  code,
  description, 
  type,
  section,
  title_only, 
  division,
  override,
  allow_manual_entry, 
  audit_date, 
  audit_user
)
SELECT
  tmp.category,
  tmp.code, 
  tmp.description,
  tmp.type,
  tmp.section,
  tmp.title_only, 
  tmp.division,
  tmp.override,
  tmp.allow_manual_entry, 
  tmp.audit_date,
  tmp.audit_user
FROM tmp_accounts  tmp
LEFT JOIN accounts acc
ON  tmp.category = acc.category
AND tmp.code     = acc.code
WHERE acc.code IS NULL
   OR acc.category IS NULL

Using a LEFT JOIN makes sure we are working with all records in the tmp_accounts table, even if they don't exist in the accounts table. Records that don't exist in the accounts table will have NULL values for all of their fields. So all records that exist in tmp_accounts and have NULL values for the accounts fields, have to be inserted.

Audit_user and audit_date are automatically filled in the tmp_accounts table using a DEFAULT and are always kept up-to-date in the accounts table

No comments: