In the past few months I've noticed an increased number of posts questioning practices around deleting and "virtually" deleting data from databases. Since some of the concerns around this practice have to do with the impact of soft deletes on scalability of a database-based application, I thought it would be a good topic for my ongoing series on building scalable databases.
Soft deleting an item from a database means that the row or entity is marked as deleted but not physically removed from the database. Instead it is hidden from normal users of the system but may be accessible by database or system administrators.
For example, let's consider this sample database of XBox 360 games I own
Now consider what happens if I decide that I'm done with Call of Duty 4: Modern Warfare now that I own Call of Duty: Modern Warfare 2. The expected thing to do would then be to remove the entry from my database using a query such as
DELETE FROM games WHERE name='Call of Duty 4: Modern Warfare';
This is what is considered a "hard" delete.
But then what happens if my friends decide to use my list of games to decide which games to get me for Christmas? A friend might not realize I'd previously owned the game and might get it for me again. Thus it might be preferable if instead of deleting items from the database they were removed from consideration as games I currently own but still could be retrieved in special situations. To address this scenario I'd add an IsDeleted column as shown below
Then for typical uses an application would interact with the following view of the underlying table
CREATE VIEW current_games AS SELECT Name, Category, ESRB, GameSpotScore, Company FROM games WHERE IsDeleted=False;
but when my friends ask me for a list of all of the games I have, I can provide the full list of all the games I've ever owned from the original games table if needed. Now that we understand how one would use soft deletes we can discuss the arguments against this practice.
games
Ayende Rahien makes a cogent argument against soft deletes in his post Avoid Soft Deletes where he writes
One of the annoyances that we have to deal when building enterprise applications is the requirement that no data shall be lost. The usual response to that is to introduce a WasDeleted or an IsActive column in the database and implement deletes as an update that would set that flag. Simple, easy to understand, quick to implement and explain. It is also, quite often, wrong. The problem is that deletion of a row or an entity is rarely a simple event. It effect not only the data in the model, but also the shape of the model. That is why we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have a parent Order. And that is just the simplest of issues. ... Let us say that we want to delete an order. What should we do? That is a business decision, actually. But it is one that is enforced by the DB itself, keeping the data integrity. When we are dealing with soft deletes, it is easy to get into situations where we have, for all intents and purposes, corrupt data, because Customer’s LastOrder (which is just a tiny optimization that no one thought about) now points to a soft deleted order.
One of the annoyances that we have to deal when building enterprise applications is the requirement that no data shall be lost. The usual response to that is to introduce a WasDeleted or an IsActive column in the database and implement deletes as an update that would set that flag.
Simple, easy to understand, quick to implement and explain.
It is also, quite often, wrong.
The problem is that deletion of a row or an entity is rarely a simple event. It effect not only the data in the model, but also the shape of the model. That is why we have foreign keys, to ensure that we don’t end up with Order Lines that don’t have a parent Order. And that is just the simplest of issues. ... Let us say that we want to delete an order. What should we do? That is a business decision, actually. But it is one that is enforced by the DB itself, keeping the data integrity.
When we are dealing with soft deletes, it is easy to get into situations where we have, for all intents and purposes, corrupt data, because Customer’s LastOrder (which is just a tiny optimization that no one thought about) now points to a soft deleted order.
Ayende is right that adding an IsDeleted flag mean that you can no longer take advantage of database triggers for use when cleaning up database state when a deletion occurs. This sort of cleanup now has to moved up into the application layer.
There is another set of arguments against soft deletes in Richard Dingwall's post entitled The Trouble with Soft Delete where he points out the following problems
Complexity To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude them. It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows. This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast. And it is background noise that detracts away from the real intention of the query. Performance At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance. However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and does a good job at optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself. Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisations day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of that, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical; rarely used and kept only to maintain foreign key integrity and for reporting purposes. Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.
To prevent mixing active and inactive data in results, all queries must be made aware of the soft delete columns so they can explicitly exclude them. It’s like a tax; a mandatory WHERE clause to ensure you don’t return any deleted rows.
This extra WHERE clause is similar to checking return codes in programming languages that don’t throw exceptions (like C). It’s very simple to do, but if you forget to do it in even one place, bugs can creep in very fast. And it is background noise that detracts away from the real intention of the query.
At first glance you might think evaluating soft delete columns in every query would have a noticeable impact on performance. However, I’ve found that most RDBMSs are actually pretty good at recognizing soft delete columns (probably because they are so commonly used) and does a good job at optimizing queries that use them. In practice, filtering inactive rows doesn’t cost too much in itself.
Instead, the performance hit comes simply from the volume of data that builds up when you don’t bother clearing old rows. For example, we have a table in a system at work that records an organisations day-to-day tasks: pending, planned, and completed. It has around five million rows in total, but of that, only a very small percentage (2%) are still active and interesting to the application. The rest are all historical; rarely used and kept only to maintain foreign key integrity and for reporting purposes.
Interestingly, the biggest problem we have with this table is not slow read performance but writes. Due to its high use, we index the table heavily to improve query performance. But with the number of rows in the table, it takes so long to update these indexes that the application frequently times out waiting for DML commands to finish.
These arguments seem less valid than Ayende's especially when the alternatives proposed are evaluated. Let's look at the aforementioned problems and the proposed alternatives in turn.
Richard Dingwall argues that soft deletes add unnecessary complexity to the system since all queries have to be aware of the IsDeleted column(s) in the database. As I mentioned in my initial description of soft deletes this definitely does not have to be the case. The database administrator can create views which the core application logic interacts with (i.e. the current_games table in my example) so that only a small subset of system procedures need to actually know that the soft deleted columns even still exist in the database.
current_games
A database becoming so large that data manipulation becomes slow due to having to update indexes is a valid problem. However Richard Dingwall's suggested alternative excerpted below seems to trade one problem for a worse one
The memento pattern Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require. It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or rollback later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time:
Soft delete only supports undoing deletes, but the memento pattern provides a standard means of handling all undo scenarios your application might require.
It works by taking a snapshot of an item just before a change is made, and putting it aside in a separate store, in case a user wants to restore or rollback later. For example, in a job board application, you might have two tables: one transactional for live jobs, and an undo log that stores snapshots of jobs at previous points in time:
The problem I have with this solution is that if your database is already grinding to a halt simply because you track which items are active/inactive in your database, how much worse would the situation be if you now store every state transition in the database as well? Sounds like you're trading one performance problem for a much worse one.
The real problem seems to be that the database has gotten too big to be operated on in an efficient manner on a single machine. The best way to address this is to partition or shard the database. In fact, you could even choose to store all inactive records on one database server and all active records on another. Those interested in database sharding can take a look at a more detailed discussion on database sharding I wrote earlier this year.
Another alternative proposed by both Ayende Rahien and Richard Dingwall is to delete the data but use database triggers to write to an audit log in the cases where auditing is the primary use case for keeping soft deleted entries in the database. This works in the cases where the only reason for soft deleting entries is for auditing purposes. However there are many real world situations where this is not the case.
One use case for soft deleting is to provide an "undo" feature in an end user application. For example, consider a user synchronizes the contact list on their phone with one in the cloud (e.g. an iPhone or Windows Mobile/Windows Phone connecting to Exchange or an Android phone connecting to Gmail). Imagine that the user now deletes a contact from their phone because they do not have a phone number for the person only to find out that person has also been deleted from their address book in the cloud. At that point, an undo feature is desirable.
Other use cases could be the need to reactivate items that have been removed from the database but with their state intact. For example, when people return to Microsoft who used to work there in the past their seniority for certain perks takes into account their previous stints at the company. Similarly, you can imagine a company restocking an item that they had pulled from their shelves because they have become popular due to some new fad (e.g. Beatles memorabilia is back in style thanks to The Beatles™: Rock Band™).
The bottom line is that an audit log may be a useful replacement for soft deletes in some scenarios but it isn't the answer to every situation where soft deletes are typically used.
So far we haven't discussed how hard deletes should fit in a world of soft deletes. In some cases, soft deletes eventually lead to hard deletes. In the example of video games I've owned I might decide that if a soft deleted item is several years old or is a game from an outdated console then it might be OK to delete. So I'd create a janitor process that would scan the database periodically to seek out soft deleted entries to permanently delete. In other cases, some content may always be hard deleted since there are no situations where one might consider keeping them around for posterity. An example of the latter is comment or trackback spam on a blog post.
Udi Dahan wrote a rebuttal to Ayende Rahien's post where he question my assertion above that there are situations where one wants to hard delete data from the database in his post Don’t Delete – Just Don’t where he writes
Model the task, not the data Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it). As we broaden our perspective to more parts of the system, we see this same pattern repeating: Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late. Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled. Jobs aren’t deleted – they’re filled (or their requisition is revoked). In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered. Statuses In all the examples above, what we see is a replacement of the technical action ‘delete’ with a relevant business action. At the entity level, instead of having a (hidden) technical WasDeleted status, we see an explicit business status that users need to be aware of.
Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).
As we broaden our perspective to more parts of the system, we see this same pattern repeating:
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late. Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled. Jobs aren’t deleted – they’re filled (or their requisition is revoked).
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.
Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.
Jobs aren’t deleted – they’re filled (or their requisition is revoked).
In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.
In all the examples above, what we see is a replacement of the technical action ‘delete’ with a relevant business action. At the entity level, instead of having a (hidden) technical WasDeleted status, we see an explicit business status that users need to be aware of.
I tend to agree with Udi Dahan's recommendation. Instead of a technical flag like IsDeleted, we should model the business process. So my database table of games I owned should really be called games_I_have_owned with the IsDeleted column replaced with something more appropriate such as CurrentlyOwn. This is a much better model of the real-life situation than my initial table and the soft deleted entries are now clearly part of the business process as opposed to being part of some internal system book keeping system.
games_I_have_owned
IsDeleted
CurrentlyOwn
Advocating that items be never deleted is a tad extreme but I'd actually lean closer to that extreme than most. Unless the data is clearly worthless (e.g. comment spam) or the cost is truly prohibitive (e.g. you're storing large amounts of binary data) then I'd recommend keeping the information around instead of assuming the existence of a DELETE clause in your database is a requirement that you use it.
Now Playing: 50 Cent - Baby By Me (feat. Ne-Yo)