MySQL Merge gone Awry
Sometimes it gets tiresome when people take shots at Oracle, MySQL, SQL Server 2008, or PostgreSQL. When I went to the MySQL for Database Administrators, the instructor mentioned a number of times how many Oracle people he’d get in his class. It was said almost as if all the Oracle customers were migrating to MySQL, which I don’t think is the case. If I’m wrong just post a comment.
After writing SQL since 1985 (IBM SQL/DS), I’ve come to appreciate a number of the features in the Oracle database. This probably makes sense because I worked there for over eight years, and probably would still be there if I hadn’t left to teach at University. I’m constantly amazed as I explore and look for points of commonality across SQL dialects and PL/SQL dialects. At least, I think it’s fair to call T-SQL stored programs and MySQL stored programs PL/SQL dialects.
This blog post is about the MERGE statement and it’s close MySQL cousin, the ON DUPLICATE KEY UPDATE statement (that only works with single row INSERT statements). I’ve recently added a post demonstrating the REPLACE INTO command syntax. You may also be interested in a newer blog post about importing external data inside a MySQL procedure with cursor loops. Upfront, I have to vote for Oracle’s MERGE statement because it’s more complete as an implementation. By the way, T-SQL supports the same MERGE syntax. They’re so exact it wasn’t worth taking the space to show the syntax for the SQL Server 2008 Express product.
Oracle’s Merge Statement
The downside of Oracle’s MERGE statement is that it takes a LOT of typing. The upside from my perspective is that it enforces that you must use the primary key column. The same is not true with the MySQL syntax.
Here’s a quick example that you can cut and paste into your environment for Oracle Database 11g or remove the Oracle specific FROM dual and it’ll run in SQL Server. You can also see how to leverage joins and imports with the MERGE statement in this more recent blog post.
-- Conditionally drop the table and sequence. BEGIN FOR i IN (SELECT NULL FROM user_tables WHERE TABLE_NAME = 'SYSTEM_USER') LOOP EXECUTE IMMEDIATE 'DROP TABLE system_user CASCADE CONSTRAINTS'; END LOOP; FOR i IN (SELECT NULL FROM user_sequences WHERE sequence_name = 'SYSTEM_USER_S1') LOOP EXECUTE IMMEDIATE 'DROP SEQUENCE system_user_s1'; END LOOP; END; / -- Create the table. CREATE TABLE system_user ( system_user_id NUMBER CONSTRAINT pk_su PRIMARY KEY , system_user_name VARCHAR2(20) CONSTRAINT nn_su_1 NOT NULL , system_user_group_id NUMBER CONSTRAINT nn_su_2 NOT NULL , system_user_type NUMBER CONSTRAINT nn_su_3 NOT NULL , first_name VARCHAR2(20) , middle_name VARCHAR2(10) , last_name VARCHAR2(20) , created_by NUMBER CONSTRAINT nn_su_4 NOT NULL , creation_date DATE CONSTRAINT nn_su_5 NOT NULL , last_updated_by NUMBER CONSTRAINT nn_su_6 NOT NULL , last_update_date DATE CONSTRAINT nn_su_7 NOT NULL); -- Create the sequence with a default start value of 1. CREATE SEQUENCE system_user_s1; -- Insert new row. INSERT INTO system_user VALUES ( system_user_s1.NEXTVAL , 'SYSADMIN' , 1 , 1 , NULL , NULL , NULL , 1 , SYSDATE - 1 , 1 , SYSDATE - 1); -- Insert new or merge into existing row. MERGE INTO system_user target USING (SELECT 1 AS system_user_id , 'SYSADMIN' AS system_user_name , 1 AS system_user_group_id , 1 AS system_user_type , 'Samuel' AS first_name , 'the' AS middle_name , 'Lamanite' AS last_name , 1 AS created_by , SYSDATE AS creation_date , 1 AS last_updated_by , SYSDATE AS last_update_date FROM dual) SOURCE ON (target.system_user_id = SOURCE.system_user_id) WHEN MATCHED THEN UPDATE SET first_name = 'Samuel' , middle_name = 'the' , last_name = 'Lamanite' , last_updated_by = 1 , last_update_date = SYSDATE WHEN NOT MATCHED THEN INSERT ( target.system_user_id , target.system_user_name , target.system_user_group_id , target.system_user_type , target.first_name , target.middle_name , target.last_name , target.created_by , target.creation_date , target.last_updated_by , target.last_update_date ) VALUES ( SOURCE.system_user_id , SOURCE.system_user_name , SOURCE.system_user_group_id , SOURCE.system_user_type , SOURCE.first_name , SOURCE.middle_name , SOURCE.last_name , SOURCE.created_by , SOURCE.creation_date , SOURCE.last_updated_by , SOURCE.last_update_date ); |
MySQL On Duplicate Key Update Statement
The downside of MySQL’s ON DUPLICATE KEY UPDATE statement is that it takes it lets you use an override signature on the INSERT that then creates a new row when it shouldn’t. I logged a bug when I discovered that behavior earlier today but they don’t quite agree that it’s a bug. The upside is that the typing is MUCH shorter provided you remember to use the auto increment key column in the SELECT clause.
Here’s a quick example of what not to do! The next one shows you what you should do. You can see how to tie together MySQL subroutines with an INSERT ... ON DUPLICATE KEY statement in this more recent blog entry.
-- Conditionally drop the table. DROP TABLE IF EXISTS system_user; -- Create the table. CREATE TABLE system_user ( system_user_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT , system_user_name CHAR(20) NOT NULL , system_user_group_id INT NOT NULL , system_user_type INT NOT NULL , first_name CHAR(20) , middle_name CHAR(10) , last_name CHAR(20) , created_by INT NOT NULL , creation_date DATE NOT NULL , last_updated_by INT NOT NULL , last_update_date DATE NOT NULL); -- Insert new row. INSERT INTO system_user ( system_user_name , system_user_group_id , system_user_type , created_by , creation_date , last_updated_by , last_update_date ) VALUES ('SYSADMIN' , 1 , 1 , 1 , DATE_SUB(NOW(),INTERVAL 1 DAY) , 1 , DATE_SUB(NOW(),INTERVAL 1 DAY)); -- Insert new or merge into existing row. INSERT INTO system_user ( system_user_name , system_user_group_id , system_user_type , first_name , middle_name , last_name , created_by , creation_date , last_updated_by , last_update_date ) VALUES ('SYSADMIN' , 1 , 1 ,'Samuel' ,'the' ,'Lamanite' , 1 , NOW() , 1 , NOW()) ON DUPLICATE KEY UPDATE first_name = 'Samuel' , middle_name = 'the' , last_name = 'Lamanite' , last_updated_by = 1 , last_update_date = UTC_DATE(); |
This script ends up inserting two rows when only one should be present. Why did that happen? Great question! You can fix this by adding a unique key defined for the columns that make up the natural key for the SYSTEM_USER table. As noted by a comment below, the unique key must only include columns that are NOT NULL constrained. Here’s the results without such a unique key:
mysql> SELECT * FROM system_user\G *************************** 1. ROW *************************** system_user_id: 1 system_user_name: SYSADMIN system_user_group_id: 1 system_user_type: 1 first_name: NULL middle_name: NULL last_name: NULL created_by: 1 creation_date: 2009-05-24 last_updated_by: 1 last_update_date: 2009-05-24 *************************** 2. ROW *************************** system_user_id: 2 system_user_name: SYSADMIN system_user_group_id: 1 system_user_type: 1 first_name: Samuel middle_name: NULL last_name: Lamanite created_by: 1 creation_date: 2009-05-25 last_updated_by: 1 last_update_date: 2009-05-25 2 ROWS IN SET (0.02 sec) |
The correct way to do this in MySQL is shown in the next example.
-- Conditionally drop table. DROP TABLE IF EXISTS system_user; -- Create table. CREATE TABLE system_user ( system_user_id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT , system_user_name CHAR(20) NOT NULL , system_user_group_id INT NOT NULL , system_user_type INT NOT NULL , first_name CHAR(20) , middle_name CHAR(10) , last_name CHAR(20) , created_by INT NOT NULL , creation_date DATE NOT NULL , last_updated_by INT NOT NULL , last_update_date DATE NOT NULL); -- Insert new row. INSERT INTO system_user ( system_user_name , system_user_group_id , system_user_type , created_by , creation_date , last_updated_by , last_update_date ) VALUES ('SYSADMIN' , 1 , 1 , 1 , DATE_SUB(UTC_DATE(),INTERVAL 1 DAY) , 1 , DATE_SUB(UTC_DATE(),INTERVAL 1 DAY)); -- Insert new or merge into existing row. INSERT INTO system_user VALUES ( 1 ,'SYSADMIN' , 1 , 1 ,'Samuel' ,'the' ,'Lamanite' , 1 , NOW() , 1 , NOW()) ON DUPLICATE KEY UPDATE first_name = 'Samuel' , middle_name = 'the' , last_name = 'Lamanite' , last_updated_by = 1 , last_update_date = NOW(); |
This ensures that the auto increment column values are matched. It returns what you’d expect, a single row inserted into or updated in the table.
mysql> SELECT * FROM system_user\G *************************** 1. ROW *************************** system_user_id: 1 system_user_name: SYSADMIN system_user_group_id: 1 system_user_type: 1 first_name: Samuel middle_name: NULL last_name: Lamanite created_by: 1 creation_date: 2009-05-24 last_updated_by: 1 last_update_date: 2009-05-25 1 ROW IN SET (0.00 sec) |
I hope this saves somebody from an insertion anomaly.
I appreciate the nice Oracle/MySQL comparison on those, as it helps understanding the similarity. I only wish every technical post were so readable… Thanks!
MC
11 Jun 09 at 8:28 am
By far, the best explanation of a MySQL merge/upsert. Thank you!! -
Mark Lentz
14 Aug 09 at 4:11 pm
I’m confused by this example, and specifically why the MySQL behavior is surprising. With the first (“wrong”) MySQL ON DUPLICATE… code segment, why would the second record be considered a replacement for the first? The override signature contains no fields participating in a UNIQUE constraint, so it seems to me like the statement is saying the second insertion is intended to create a new record for ‘SYSADMIN’ — there’s no “duplicate” condition to fail. If `system_user_name` were declared UNIQUE, then I would expect the UPDATE to occur rather than the INSERT. But in that scenario, I would want the option to omit the AUTO_INCREMENT column so that if the user didn’t already exist, it would be created with a new primary key ID.
If the AUTO_INCREMENT field were required and omitting it resulted in an error, it would make the “ON DUPLICATE KEY UPDATE” clause much more limited in applicability. For example, if I have a table schema like this:
And after using the table for some time, I want to add the word “hello”:
If “hello” already exists, I wish to perform the update. I don’t know nor wish to know its `id` number. It’s only there to serve as a surrogate key anyway. If “hello” doesn’t exist, I wish to perform the insert. I do not wish to provide an `id` number, so MySQL will assign it the next available one. But if `id` were required to be in the signature, I would be unable to use this clause and would have to resort to a separate INSERT, followed by a test for success and an UPDATE on failure.
I guess what I’m confused on is, in the example, why should the second ‘SYSADMIN’ record be considered a replacement for the first? Why, in terms of the database definition given, isn’t it a legitimate new second record? Or, to go further, if there were already two ‘SYSADMIN’ records present (possibly from two separate “INSERT” commands) and a *third* INSERT … ON DUPLICATE KEY UPDATE clause were to be issued, how would the database engine know which of the two records should be updated (assuming the intended behavior should be UPDATE and not INSERT)?
Adam Peterson
7 Dec 09 at 9:26 pm
It can get lost in the text sometimes, can’t it?
In the blog text, I wrote: “I logged a bug when I discovered that behavior earlier today but they don’t quite agree that it’s a bug.”
So, you and MySQL development agree that it isn’t a bug.
I disagree because the phrase
ON DUPLICATE KEYimplies a key validation. Therefore, I believe the statement shouldn’t parse when the key isn’t provided, and yes that would make this syntax more restrictive.The way it behaves makes it subject to data insertion anomalies. While I appreciate your observation that a surrogate key serves no purpose internally in the table when a unique constraint governs the natural key, how often is that truth maintained in database designs? The surrogate key’s primary role is to externally identify a row to other tables that hold a foreign key column that points to it. Surrogate keys let us change the internal natural key when we evolve a table without having to re-write all joins to the table. Unfortunately, sometimes data doesn’t get organized into unique rows (tuples). In that event, some folks add the surrogate key as a qualifier for unique rows.
In the example, the table would be defined like this with a
UNIQUEconstraint. You notice that it’s a multiple column unique constraint because you may wish more than one SYSADMIN in your user access control list.Let’s make a couple assumptions. First, that there’s only one type of
INSERTstatement in our form, the second one with anON DUPLICATE KEY UPDATEclause. Two, that theINSERTstatement is fired by aSUBMITbutton in a web form. The user clicks submit button but notices that they forgot to enter the first, middle, and last name for the SYSADMIN. The user immediately enters the three missing fields and clicks the submit button again. A unique constraint is not violated with the second insertion because it traverses a set of columns, and two records are inserted with different SYSTEM_USER_ID values. Now, complicate matters. Assume the form contains a set of child elements that may be application responsibilities. The second submission duplicates them and links them to the second inserted user. Voilá, you now have garbage in your database.The worst hasn’t happened yet! When another user attempts to create a new SYSADMIN, and by coincidence leaves off the first, middle, and last names. The
UNIQUEconstraint catches it and updates the existing record with null values, rather than inserting a new record. Let’s assume this is a transaction or association table that only holds the foreign keys of the SYSTEM_USER and APPLICATION_RESPONSIBILITY tables. Based on that assumption, any related rows in the USER_APPLICATION_RESPONSIBILITY table are: (1) updated with a new time stamp provided there’s aUNIQUEconstraint on the paired set of foreign keys; (2) left untouched when existing records aren’t in the set of selected application responsibilities for the new SYSADMIN; and (3) inserted when they don’t previously exist.The nightmare now lives in the database, violated business rules; and mapping rows of foreign surrogate keys that hold incorrect relationships between the SYSTEM_USER and APPLICATION_RESPONSIBILITY tables. Quite a little problem left to the application development or DBA team to cleanup. All of this could be avoided if the behavior of the
ON DUPLICATE KEY UPDATEwere restricted, or a better architecture chosen.Good design must foremost protect the data, the first type of protection is from insertion anomalies. MySQL’s implementation of the
ON DUPLICATE KEY UPDATEdoesn’t prevent a potentially risky behavior, while the OracleMERGEstatement does prevent it. I hope that helps you understand my perspective. Also, I think any design that relies on the behavior provided in yourFOUND_WORDSexamples is generally at least suboptimal, or risky, if not simply a bad solution.maclochlainn
7 Dec 09 at 10:46 pm
Well, I can see logic in issuing an error (or at least a warning) when an “ON DUPLICATE…” clause is applied to an INSERT that doesn’t have any fields participating in a UNIQUE constraint. In that case, the “ON DUPLICATE” is vestigal, and the fact that the user added it where it is impossible to fire suggests that the programmer has made a logic error.
However, it also still makes sense to me that I would want to be able to omit the AUTO_INCREMENT field and allow the DB engine to manage it when it is only other uniqueness constraint(s) that I’m worried about. I’m not very familiar with Oracle’s SQL, but it appears to have this handy “.NEXTVAL” syntax for specifying that you aren’t necessarily referring to an already existing record. If it didn’t have that, then I don’t see a natural way for the MERGE syntax to express add-or-update semantics where the surrogate key is not the pertinent constraint. (And since MySQL doesn’t have “.NEXTVAL”, making it illegal to omit this field would be restrictive in ways that it would not be for Oracle.)
I am curious if you could elaborate on the inefficiencies in the “found_words” approach. What would be a better approach? I’m primarily a programmer and not a DBA, so I’d be interested in insights about what makes it suboptimal.
Adam Peterson
8 Dec 09 at 1:24 pm
The appropriate clause for SQLite is ‘ON CONFLICT’. The URL for the support documentation is found at
http://www.sqlite.org/lang_conflict.html
Options when a conflict occurs are:
1 – ROLLBACK (ends current transaction)
2 – ABORT (ends current transaction without a rollback)
3 – FAIL (ends current transaction. Is the same as not using the ON CONFLICT clause)
4 – IGNORE (current transaction NOT ended. Data insert/update NOT done)
5 – REPLACE (current transaction NOT ended. Data insert/update IS done)
Lee Barney
16 Feb 10 at 5:43 pm
Adam,
I finally had time to write something that might help you. You can check this post on using the
INSERT ... ON DUPLICATE KEYeffectively to import data.As to Oracle’s sequences versus MySQL’s sequences, I’d say that they both have various advantages and disadvantages. You can check here where I wrote about comparative automatic numbering in Oracle, MySQL, and SQL Server. They all use sequences. Oracle defines sequences as independent objects, MySQL and SQL Server define them as dependent objects, but in all cases they more or less work the same way. The
sequence_name.NEXTVALhas no direct equivalent in MySQL, other than as: (1) aNULLvalue in the position of theAUTO_INCREMENTcolumn in a defaultINSERTstatement; or (2) the omission of the column in the overriding call signature to anINSERTstatement.As to a recommendation for your implementation, I’m not sure that a good one can be provided without copies of your UML Use Cases and Activity Diagrams. The quip (unintended most likely) about programmer not a DBA is funny. I simply believe that you can’t build database centric applications without being cognizant of both skills. Folks who market themselves as this or that tend to lean one way or another on solutions, rather than look for holistic solutions leveraging the design strengths of all elements in the solution architecture.
My original complaint about your description of your code was that it appeared to rely too much on implicit behaviors. Unfortunately, your description of the solution in the comment left too much room to the imagination to know whether or not it is a good solution. I personally believe good programming practices avoid implicit behaviors where more efficient explicit behaviors exist. Explicit behaviors also lend themselves to greater sustainability and lower support costs. Whether or not this applies in your solution is a matter of conjecture until I see the design architecture.
maclochlainn
24 Feb 10 at 3:31 pm
[...] following demonstrates how to perform a left join from and exernal source. [...]
MySQL REPLACE INTO is equivalent to Oracle MERGE
16 May 10 at 1:19 pm
Is this so difficult to implement in MySQL instead of -
ON DUPLICATE KEY
implement
ON DUPLICATE ()
?? I think that would solve most of the problems. Isn’t it?
Ankur
13 Jul 11 at 6:38 pm
I didn’t find it difficult, here’s a full working copy. It’s the second example there.
Thanks for the comment.
maclochlainn
13 Jul 11 at 10:35 pm
Sorry, your link is not click’able’. Could you please provide the link again. Thanks.
Ankur
14 Jul 11 at 12:04 pm
Oops, don’t know what happened but it’s fixed now.
maclochlainn
15 Jul 11 at 9:21 pm
Good post. Your MySQL examples are good, but I think calling out the diffs between the correct and incorrect examples would help. In the second (correct) one you are specifying the value of the PK in the last insert stmt. That little detail makes all the difference.
Also, if you were to have a unique index on the table (other than the PK), perhaps on system_user_name, the first (incorrect) MySQL example would work properly.
One gotcha that I have noticed in working with upserts and unique indexes on MySQL: if one of the columns in a multi-column unique index contains a null value the on duplicate key update syntax will not work. Don’t ask me how I learned this or how long it took me to figure out what went wrong with the program I was writing. Solution: make SURE all of your columns in a unique index are not nullable.
cminor9
9 Aug 11 at 9:40 am