Friday, January 30, 2009

Removing Duplicate Rows The Lazy Way

There is an easy way to remove duplicate rows in a table albeit not so space-efficient (but hey, space is cheap). Removing duplicate rows, now matter how crazy it is, is a very common task in database maintenance and data warehousing. I've once worked on an SSIS project before which involved one big and dirty database. Unique key constraints were non-existent making the tables very susceptible to duplicates. I was faced with similar situation today when verifying the bugs claimed by the developers on the business objects. It turned out to be a data-related issue. There are duplicates rows because the database developers failed to create unique indices on some of the business tables. Hunting for those rows can be done using complicated queries involving DISTINCT and temp tables but they're too tedious. I always opt for the lazy approach instead. I call it such because it does not require that much thinking. Here are the steps:

1) Create a duplicate table to contain the pristine data. This can be done easily in SQL Server Management Studio through the Script Table as context menu item as shown below:

image

2) Run the script but make sure you rename the table.

3) Create a unique index with the option to ignore duplicate key. The column that defines this index is the one duplicated in the dirty table. Unfortunately, there is no click-click way to creating an index with such an option. You need to execute this command:

CREATE UNIQUE INDEX IX_NoDuplicates (Code) WITH IGNORE_DUP_KEY

where the name of the index is IX_NoDuplicates and the column defining it is Code.

4) The last step is to execute the INSERT-SELECT statement to copy the rows

INSERT NoDuplicates (Code) SELECT Code FROM WithDuplicates

Voila! No more duplicates!

This particular option is not created in SQL Server solely for this purpose. There are instances when reporting a duplicate error does not contribute any value to the business. It merely lengthen development time. I'm talking, for example, of a simple reference table like EmploymentType. Does it really matter to the company if an HR personnel happened to enter value "Contractual" the second time?

2 comments:

  1. Some of my former students taking up computer studies were able to open your blog which is linked to mine. They were interested about the site where you published your article. Some also would like to get in touch with you. I told them you're quite busy but I continued that if you have time you may be able to answer some of their questions. I just requested them to identify me as their source of link. Keep posting some practical solutions to programming problems. Share funny insights too similar to "received" that was wrongly spelled.

    ReplyDelete
  2. In that case, I'll post stuffs targeting beginners every now and then. They can also send their questions to gio_bani@yahoo.com. I might not be able to reply promptly though but I'll always try.

    ReplyDelete