MacLochlainns Weblog

Michael McLaughlin's Technical Blog

Site Admin

MySQL Merge gone Awry

with 13 comments

Sometimes it gets tiresome when people take shots at Oracle, MySQL, SQL Server 2008, or PostgreSQL. When I went to the MySQL for Database Administrators, the instructor mentioned a number of times how many Oracle people he’d get in his class. It was said almost as if all the Oracle customers were migrating to MySQL, which I don’t think is the case. If I’m wrong just post a comment.

After writing SQL since 1985 (IBM SQL/DS), I’ve come to appreciate a number of the features in the Oracle database. This probably makes sense because I worked there for over eight years, and probably would still be there if I hadn’t left to teach at University. I’m constantly amazed as I explore and look for points of commonality across SQL dialects and PL/SQL dialects. At least, I think it’s fair to call T-SQL stored programs and MySQL stored programs PL/SQL dialects.

This blog post is about the MERGE statement and it’s close MySQL cousin, the ON DUPLICATE KEY UPDATE statement (that only works with single row INSERT statements). I’ve recently added a post demonstrating the REPLACE INTO command syntax. You may also be interested in a newer blog post about importing external data inside a MySQL procedure with cursor loops. Upfront, I have to vote for Oracle’s MERGE statement because it’s more complete as an implementation. By the way, T-SQL supports the same MERGE syntax. They’re so exact it wasn’t worth taking the space to show the syntax for the SQL Server 2008 Express product.

Oracle’s Merge Statement

The downside of Oracle’s MERGE statement is that it takes a LOT of typing. The upside from my perspective is that it enforces that you must use the primary key column. The same is not true with the MySQL syntax.

Here’s a quick example that you can cut and paste into your environment for Oracle Database 11g or remove the Oracle specific FROM dual and it’ll run in SQL Server. You can also see how to leverage joins and imports with the MERGE statement in this more recent blog post.

-- Conditionally drop the table and sequence.
BEGIN
  FOR i IN (SELECT NULL
            FROM   user_tables
            WHERE  TABLE_NAME = 'SYSTEM_USER') LOOP
    EXECUTE IMMEDIATE 'DROP TABLE system_user CASCADE CONSTRAINTS';
  END LOOP;
  FOR i IN (SELECT NULL
            FROM   user_sequences
            WHERE  sequence_name = 'SYSTEM_USER_S1') LOOP
    EXECUTE IMMEDIATE 'DROP SEQUENCE system_user_s1';
  END LOOP;
END;
/
 
-- Create the table.
CREATE TABLE system_user
( system_user_id        NUMBER        CONSTRAINT pk_su   PRIMARY KEY
, system_user_name      VARCHAR2(20)  CONSTRAINT nn_su_1 NOT NULL
, system_user_group_id  NUMBER        CONSTRAINT nn_su_2 NOT NULL
, system_user_type      NUMBER        CONSTRAINT nn_su_3 NOT NULL
, first_name            VARCHAR2(20)
, middle_name           VARCHAR2(10)
, last_name             VARCHAR2(20)
, created_by            NUMBER        CONSTRAINT nn_su_4 NOT NULL
, creation_date         DATE          CONSTRAINT nn_su_5 NOT NULL
, last_updated_by       NUMBER        CONSTRAINT nn_su_6 NOT NULL
, last_update_date      DATE          CONSTRAINT nn_su_7 NOT NULL);
 
-- Create the sequence with a default start value of 1.
CREATE SEQUENCE system_user_s1;
 
-- Insert new row.
INSERT INTO system_user
VALUES
( system_user_s1.NEXTVAL
, 'SYSADMIN'
, 1
, 1
, NULL
, NULL
, NULL
, 1
, SYSDATE - 1
, 1
, SYSDATE - 1);
 
-- Insert new or merge into existing row.
MERGE INTO system_user target
USING (SELECT   1 AS system_user_id
       ,       'SYSADMIN' AS system_user_name
       ,        1 AS system_user_group_id
       ,        1 AS system_user_type
       ,       'Samuel' AS first_name
       ,        'the' AS middle_name
       ,       'Lamanite' AS last_name
       ,        1 AS created_by
       ,        SYSDATE AS creation_date
       ,        1 AS last_updated_by
       ,        SYSDATE AS last_update_date
       FROM     dual) SOURCE
ON (target.system_user_id = SOURCE.system_user_id)
WHEN MATCHED THEN
  UPDATE SET first_name = 'Samuel'
  ,          middle_name = 'the'
  ,          last_name = 'Lamanite'
  ,          last_updated_by = 1
  ,          last_update_date = SYSDATE
WHEN NOT MATCHED THEN
  INSERT
  ( target.system_user_id
  , target.system_user_name
  , target.system_user_group_id
  , target.system_user_type
  , target.first_name
  , target.middle_name
  , target.last_name
  , target.created_by
  , target.creation_date
  , target.last_updated_by
  , target.last_update_date )  
  VALUES
  ( SOURCE.system_user_id
  , SOURCE.system_user_name
  , SOURCE.system_user_group_id
  , SOURCE.system_user_type
  , SOURCE.first_name
  , SOURCE.middle_name
  , SOURCE.last_name
  , SOURCE.created_by
  , SOURCE.creation_date
  , SOURCE.last_updated_by
  , SOURCE.last_update_date );

MySQL On Duplicate Key Update Statement

The downside of MySQL’s ON DUPLICATE KEY UPDATE statement is that it takes it lets you use an override signature on the INSERT that then creates a new row when it shouldn’t. I logged a bug when I discovered that behavior earlier today but they don’t quite agree that it’s a bug. The upside is that the typing is MUCH shorter provided you remember to use the auto increment key column in the SELECT clause.

Here’s a quick example of what not to do! The next one shows you what you should do. You can see how to tie together MySQL subroutines with an INSERT ... ON DUPLICATE KEY statement in this more recent blog entry.

-- Conditionally drop the table.
DROP TABLE IF EXISTS system_user;
 
-- Create the table.
CREATE TABLE system_user
( system_user_id              INT UNSIGNED  PRIMARY KEY AUTO_INCREMENT
, system_user_name            CHAR(20)      NOT NULL
, system_user_group_id        INT           NOT NULL
, system_user_type            INT           NOT NULL
, first_name                  CHAR(20)
, middle_name                 CHAR(10)
, last_name                   CHAR(20)
, created_by                  INT           NOT NULL
, creation_date               DATE          NOT NULL
, last_updated_by             INT           NOT NULL
, last_update_date            DATE          NOT NULL);
 
-- Insert new row.
INSERT INTO system_user
( system_user_name
, system_user_group_id
, system_user_type
, created_by
, creation_date
, last_updated_by
, last_update_date )
VALUES
('SYSADMIN'
, 1
, 1
, 1
, DATE_SUB(NOW(),INTERVAL 1 DAY)
, 1
, DATE_SUB(NOW(),INTERVAL 1 DAY));
 
-- Insert new or merge into existing row.
INSERT INTO system_user
( system_user_name
, system_user_group_id
, system_user_type
, first_name
, middle_name
, last_name
, created_by
, creation_date
, last_updated_by
, last_update_date )
VALUES
('SYSADMIN'
, 1
, 1
,'Samuel'
,'the'
,'Lamanite'
, 1
, NOW()
, 1
, NOW())
ON DUPLICATE KEY 
UPDATE first_name = 'Samuel'
,      middle_name = 'the'
,      last_name = 'Lamanite'
,      last_updated_by = 1
,      last_update_date = UTC_DATE();

This script ends up inserting two rows when only one should be present. Why did that happen? Great question! You can fix this by adding a unique key defined for the columns that make up the natural key for the SYSTEM_USER table. As noted by a comment below, the unique key must only include columns that are NOT NULL constrained. Here’s the results without such a unique key:

mysql> SELECT * FROM system_user\G
*************************** 1. ROW ***************************
      system_user_id: 1
    system_user_name: SYSADMIN
system_user_group_id: 1
    system_user_type: 1
          first_name: NULL
         middle_name: NULL
           last_name: NULL
          created_by: 1
       creation_date: 2009-05-24
     last_updated_by: 1
    last_update_date: 2009-05-24
*************************** 2. ROW ***************************
      system_user_id: 2
    system_user_name: SYSADMIN
system_user_group_id: 1
    system_user_type: 1
          first_name: Samuel
         middle_name: NULL
           last_name: Lamanite
          created_by: 1
       creation_date: 2009-05-25
     last_updated_by: 1
    last_update_date: 2009-05-25
2 ROWS IN SET (0.02 sec)

The correct way to do this in MySQL is shown in the next example.

-- Conditionally drop table.
DROP TABLE IF EXISTS system_user;
 
-- Create table.
CREATE TABLE system_user
( system_user_id              INT UNSIGNED  PRIMARY KEY AUTO_INCREMENT
, system_user_name            CHAR(20)      NOT NULL
, system_user_group_id        INT           NOT NULL
, system_user_type            INT           NOT NULL
, first_name                  CHAR(20)
, middle_name                 CHAR(10)
, last_name                   CHAR(20)
, created_by                  INT           NOT NULL
, creation_date               DATE          NOT NULL
, last_updated_by             INT           NOT NULL
, last_update_date            DATE          NOT NULL);
 
-- Insert new row.
INSERT INTO system_user
( system_user_name
, system_user_group_id
, system_user_type
, created_by
, creation_date
, last_updated_by
, last_update_date )
VALUES
('SYSADMIN'
, 1
, 1
, 1
, DATE_SUB(UTC_DATE(),INTERVAL 1 DAY)
, 1
, DATE_SUB(UTC_DATE(),INTERVAL 1 DAY));
 
-- Insert new or merge into existing row.
INSERT INTO system_user
VALUES
( 1
,'SYSADMIN'
, 1
, 1
,'Samuel'
,'the'
,'Lamanite'
, 1
, NOW()
, 1
, NOW())
ON DUPLICATE KEY
UPDATE first_name = 'Samuel'
,      middle_name = 'the'
,      last_name = 'Lamanite'
,      last_updated_by = 1
,      last_update_date = NOW();

This ensures that the auto increment column values are matched. It returns what you’d expect, a single row inserted into or updated in the table.

mysql> SELECT * FROM system_user\G
*************************** 1. ROW ***************************
      system_user_id: 1
    system_user_name: SYSADMIN
system_user_group_id: 1
    system_user_type: 1
          first_name: Samuel
         middle_name: NULL
           last_name: Lamanite
          created_by: 1
       creation_date: 2009-05-24
     last_updated_by: 1
    last_update_date: 2009-05-25
1 ROW IN SET (0.00 sec)

I hope this saves somebody from an insertion anomaly.

Written by maclochlainn

May 25th, 2009 at 6:03 pm

Posted in MySQL,Oracle,sql

13 Responses to 'MySQL Merge gone Awry'

Subscribe to comments with RSS or TrackBack to 'MySQL Merge gone Awry'.

  1. I appreciate the nice Oracle/MySQL comparison on those, as it helps understanding the similarity. I only wish every technical post were so readable… Thanks!

    MC

    11 Jun 09 at 8:28 am

  2. By far, the best explanation of a MySQL merge/upsert. Thank you!! -

    Mark Lentz

    14 Aug 09 at 4:11 pm

  3. I’m confused by this example, and specifically why the MySQL behavior is surprising. With the first (“wrong”) MySQL ON DUPLICATE… code segment, why would the second record be considered a replacement for the first? The override signature contains no fields participating in a UNIQUE constraint, so it seems to me like the statement is saying the second insertion is intended to create a new record for ‘SYSADMIN’ — there’s no “duplicate” condition to fail. If `system_user_name` were declared UNIQUE, then I would expect the UPDATE to occur rather than the INSERT. But in that scenario, I would want the option to omit the AUTO_INCREMENT column so that if the user didn’t already exist, it would be created with a new primary key ID.

    If the AUTO_INCREMENT field were required and omitting it resulted in an error, it would make the “ON DUPLICATE KEY UPDATE” clause much more limited in applicability. For example, if I have a table schema like this:

    CREATE TABLE `found_words`
    ( `id` INTEGER AUTO_INCREMENT PRIMARY KEY NOT NULL,
      `word` VARCHAR(255) UNIQUE NOT NULL,
      `frequency` INTEGER NOT NULL DEFAULT 0 );

    And after using the table for some time, I want to add the word “hello”:

    INSERT INTO `found_words` ( `word`, `frequency` )
    VALUES ( 'hello', 1 )
    ON DUPLICATE KEY UPDATE
        `frequency` = `frequency` + VALUES(`frequency`) ;

    If “hello” already exists, I wish to perform the update. I don’t know nor wish to know its `id` number. It’s only there to serve as a surrogate key anyway. If “hello” doesn’t exist, I wish to perform the insert. I do not wish to provide an `id` number, so MySQL will assign it the next available one. But if `id` were required to be in the signature, I would be unable to use this clause and would have to resort to a separate INSERT, followed by a test for success and an UPDATE on failure.

    I guess what I’m confused on is, in the example, why should the second ‘SYSADMIN’ record be considered a replacement for the first? Why, in terms of the database definition given, isn’t it a legitimate new second record? Or, to go further, if there were already two ‘SYSADMIN’ records present (possibly from two separate “INSERT” commands) and a *third* INSERT … ON DUPLICATE KEY UPDATE clause were to be issued, how would the database engine know which of the two records should be updated (assuming the intended behavior should be UPDATE and not INSERT)?

    Adam Peterson

    7 Dec 09 at 9:26 pm

  4. It can get lost in the text sometimes, can’t it? ;-) In the blog text, I wrote: “I logged a bug when I discovered that behavior earlier today but they don’t quite agree that it’s a bug.”

    So, you and MySQL development agree that it isn’t a bug. :-) I disagree because the phrase ON DUPLICATE KEY implies a key validation. Therefore, I believe the statement shouldn’t parse when the key isn’t provided, and yes that would make this syntax more restrictive.

    The way it behaves makes it subject to data insertion anomalies. While I appreciate your observation that a surrogate key serves no purpose internally in the table when a unique constraint governs the natural key, how often is that truth maintained in database designs? The surrogate key’s primary role is to externally identify a row to other tables that hold a foreign key column that points to it. Surrogate keys let us change the internal natural key when we evolve a table without having to re-write all joins to the table. Unfortunately, sometimes data doesn’t get organized into unique rows (tuples). In that event, some folks add the surrogate key as a qualifier for unique rows.

    In the example, the table would be defined like this with a UNIQUE constraint. You notice that it’s a multiple column unique constraint because you may wish more than one SYSADMIN in your user access control list.

    CREATE TABLE system_user
    ( system_user_id              INT UNSIGNED  PRIMARY KEY AUTO_INCREMENT
    , system_user_name            CHAR(20)      NOT NULL
    , system_user_group_id        INT           NOT NULL
    , system_user_type            INT           NOT NULL
    , first_name                  CHAR(20)
    , middle_name                 CHAR(10)
    , last_name                   CHAR(20)
    , created_by                  INT           NOT NULL
    , creation_date               DATE          NOT NULL
    , last_updated_by             INT           NOT NULL
    , last_update_date            DATE          NOT NULL
    , CONSTRAINT UNIQUE (system_user_name,first_name,middle_name,last_name));

    Let’s make a couple assumptions. First, that there’s only one type of INSERT statement in our form, the second one with an ON DUPLICATE KEY UPDATE clause. Two, that the INSERT statement is fired by a SUBMIT button in a web form. The user clicks submit button but notices that they forgot to enter the first, middle, and last name for the SYSADMIN. The user immediately enters the three missing fields and clicks the submit button again. A unique constraint is not violated with the second insertion because it traverses a set of columns, and two records are inserted with different SYSTEM_USER_ID values. Now, complicate matters. Assume the form contains a set of child elements that may be application responsibilities. The second submission duplicates them and links them to the second inserted user. Voilá, you now have garbage in your database.

    The worst hasn’t happened yet! When another user attempts to create a new SYSADMIN, and by coincidence leaves off the first, middle, and last names. The UNIQUE constraint catches it and updates the existing record with null values, rather than inserting a new record. Let’s assume this is a transaction or association table that only holds the foreign keys of the SYSTEM_USER and APPLICATION_RESPONSIBILITY tables. Based on that assumption, any related rows in the USER_APPLICATION_RESPONSIBILITY table are: (1) updated with a new time stamp provided there’s a UNIQUE constraint on the paired set of foreign keys; (2) left untouched when existing records aren’t in the set of selected application responsibilities for the new SYSADMIN; and (3) inserted when they don’t previously exist.

    The nightmare now lives in the database, violated business rules; and mapping rows of foreign surrogate keys that hold incorrect relationships between the SYSTEM_USER and APPLICATION_RESPONSIBILITY tables. Quite a little problem left to the application development or DBA team to cleanup. All of this could be avoided if the behavior of the ON DUPLICATE KEY UPDATE were restricted, or a better architecture chosen.

    Good design must foremost protect the data, the first type of protection is from insertion anomalies. MySQL’s implementation of the ON DUPLICATE KEY UPDATE doesn’t prevent a potentially risky behavior, while the Oracle MERGE statement does prevent it. I hope that helps you understand my perspective. Also, I think any design that relies on the behavior provided in your FOUND_WORDS examples is generally at least suboptimal, or risky, if not simply a bad solution.

    maclochlainn

    7 Dec 09 at 10:46 pm

  5. Well, I can see logic in issuing an error (or at least a warning) when an “ON DUPLICATE…” clause is applied to an INSERT that doesn’t have any fields participating in a UNIQUE constraint. In that case, the “ON DUPLICATE” is vestigal, and the fact that the user added it where it is impossible to fire suggests that the programmer has made a logic error.

    However, it also still makes sense to me that I would want to be able to omit the AUTO_INCREMENT field and allow the DB engine to manage it when it is only other uniqueness constraint(s) that I’m worried about. I’m not very familiar with Oracle’s SQL, but it appears to have this handy “.NEXTVAL” syntax for specifying that you aren’t necessarily referring to an already existing record. If it didn’t have that, then I don’t see a natural way for the MERGE syntax to express add-or-update semantics where the surrogate key is not the pertinent constraint. (And since MySQL doesn’t have “.NEXTVAL”, making it illegal to omit this field would be restrictive in ways that it would not be for Oracle.)

    I am curious if you could elaborate on the inefficiencies in the “found_words” approach. What would be a better approach? I’m primarily a programmer and not a DBA, so I’d be interested in insights about what makes it suboptimal.

    Adam Peterson

    8 Dec 09 at 1:24 pm

  6. The appropriate clause for SQLite is ‘ON CONFLICT’. The URL for the support documentation is found at

    http://www.sqlite.org/lang_conflict.html

    Options when a conflict occurs are:

    1 – ROLLBACK (ends current transaction)
    2 – ABORT (ends current transaction without a rollback)
    3 – FAIL (ends current transaction. Is the same as not using the ON CONFLICT clause)
    4 – IGNORE (current transaction NOT ended. Data insert/update NOT done)
    5 – REPLACE (current transaction NOT ended. Data insert/update IS done)

    Lee Barney

    16 Feb 10 at 5:43 pm

  7. Adam,

    I finally had time to write something that might help you. You can check this post on using the INSERT ... ON DUPLICATE KEY effectively to import data.

    As to Oracle’s sequences versus MySQL’s sequences, I’d say that they both have various advantages and disadvantages. You can check here where I wrote about comparative automatic numbering in Oracle, MySQL, and SQL Server. They all use sequences. Oracle defines sequences as independent objects, MySQL and SQL Server define them as dependent objects, but in all cases they more or less work the same way. The sequence_name.NEXTVAL has no direct equivalent in MySQL, other than as: (1) a NULL value in the position of the AUTO_INCREMENT column in a default INSERT statement; or (2) the omission of the column in the overriding call signature to an INSERT statement.

    As to a recommendation for your implementation, I’m not sure that a good one can be provided without copies of your UML Use Cases and Activity Diagrams. The quip (unintended most likely) about programmer not a DBA is funny. I simply believe that you can’t build database centric applications without being cognizant of both skills. Folks who market themselves as this or that tend to lean one way or another on solutions, rather than look for holistic solutions leveraging the design strengths of all elements in the solution architecture.

    My original complaint about your description of your code was that it appeared to rely too much on implicit behaviors. Unfortunately, your description of the solution in the comment left too much room to the imagination to know whether or not it is a good solution. I personally believe good programming practices avoid implicit behaviors where more efficient explicit behaviors exist. Explicit behaviors also lend themselves to greater sustainability and lower support costs. Whether or not this applies in your solution is a matter of conjecture until I see the design architecture.

    maclochlainn

    24 Feb 10 at 3:31 pm

  8. [...] following demonstrates how to perform a left join from and exernal source. [...]

  9. Is this so difficult to implement in MySQL instead of –

    ON DUPLICATE KEY

    implement

    ON DUPLICATE ()

    ?? I think that would solve most of the problems. Isn’t it?

    Ankur

    13 Jul 11 at 6:38 pm

  10. I didn’t find it difficult, here’s a full working copy. It’s the second example there.

    Thanks for the comment.

    maclochlainn

    13 Jul 11 at 10:35 pm

  11. Sorry, your link is not click’able’. Could you please provide the link again. Thanks.

    Ankur

    14 Jul 11 at 12:04 pm

  12. Oops, don’t know what happened but it’s fixed now.

    maclochlainn

    15 Jul 11 at 9:21 pm

  13. Good post. Your MySQL examples are good, but I think calling out the diffs between the correct and incorrect examples would help. In the second (correct) one you are specifying the value of the PK in the last insert stmt. That little detail makes all the difference.

    Also, if you were to have a unique index on the table (other than the PK), perhaps on system_user_name, the first (incorrect) MySQL example would work properly.

    One gotcha that I have noticed in working with upserts and unique indexes on MySQL: if one of the columns in a multi-column unique index contains a null value the on duplicate key update syntax will not work. Don’t ask me how I learned this or how long it took me to figure out what went wrong with the program I was writing. Solution: make SURE all of your columns in a unique index are not nullable.

    cminor9

    9 Aug 11 at 9:40 am

Leave a Reply