Dealing with deletes

Do you ever soft delete data? Do you always hard delete it? Why? Explore when and why you might want to hard or soft delete a piece of data in a transactional data store.

Most applications are CRUD applications that deal with state directly. That is, you create something and store it, then read it later, edit it, and/or delete it. This is a common pattern across web programming, where you might have a user system, an e-commerce system, and so on. If you think about it, there is nothing special about that pattern. While that is true, deletes are a special case. Why, you might ask? Deletes can be soft (marked as deleted) or hard (permanently removed). This article goes into the differences and the considerations you will need to think about with those deletes. It assumes that we’re dealing with a transactional data store that stores the current state of data and not the events of what happened over time.

Let’s say you have a user profile data store. A user signs up, then hours later they decide to remove their account for some reason. Here you have a decision to make: will you permanently delete that user’s data or soft delete it? Let’s entertain both scenarios.

Soft deleting a user means the user is still there, but we somehow mark it as deleted. What did we gain? The user is technically still in the system, so we can track that user for analytics. Hard deleting, on the other hand, means the record of that user is permanently deleted. That means we disrupt the user profile state: something that was there is not there anymore. We can’t even track that change because it is gone.

Now that we’ve entertained both scenarios, you might say hard deletes are easier to go about because there is nothing to worry about after a piece of state is gone, whereas with soft deletes you will always need to explicitly check whether a non-deleted user exists, so there is more to worry about. That answer is wrong. There are consequences either way, and how those consequences affect you will determine how to go about deletes.

Why consider a hard delete in the first place?
Think about the piece of data. Will this data be used for analytics? Does it give you business value now or in the future? Is it a metric you want to track? Do you really need to keep track of such data? If the answer is yes to any of those questions, then don’t hard delete. Deleting the data means it is gone forever, and it might mess up metrics down the line. This is something engineers often don’t think about, and it leads to data issues. A common one: why has the number of users decreased just now when it was higher a minute ago?

On the other hand, data is often valuable. You may not need all of its pieces, but the record of it is important for analytics. This is where you may consider soft deleting when a user removes their account. One common issue with soft deleting user data is permission: do you have the user’s permission to keep it after they deleted their account? Do governmental regulations permit you to keep a record of that data? The quick answer is hard deletion, because you usually don’t have permission to keep it. The better answer, in my opinion, is soft deletion. Let’s see why. If you care about the record of the data, then it needs to be there; however, you also don’t want to store sensitive information identifying a user because of some legal obligation. The common approach here is to redact the data from the fields that identify a user, like user name, email, gender and so on, then keep the record. Redacting simply means emptying all the sensitive fields.

Why not soft delete?

If you have some non-functional data (like temporary metadata for the next batch job) that you benefit from only at the instant it is used and rarely need afterwards, then you really don’t need to keep it and can hard delete it immediately or after a certain amount of time. Often you can hard delete even if you want to track all data with its current and past state.
This is where you may have an analytics-specific data warehouse and/or lake where the deleted data is dumped. You can copy over just what you care about, which probably means excluding user-identifiable information, and store it there. Once the record is deleted, you can have a new analytics version of the data with a deleted marker. Without getting off-topic here, if this interests you, you can find out more by googling ‘Slowly Changing Dimensions’, which is a very important topic in data analytics.

There are also programmability concerns with soft deletes. Some questions pop up: should you exclude soft-deleted records by default? Do you need to note why the data was marked deleted? Would a state field representing the data be better than just an explicit deletion marker?

After considering the benefits and drawbacks of how to store deleted data, it is really your decision to make based on the design you are going for. More often than not, data costs impact how we store data, though keep in mind the value of the data.