Taking Data Integrity Srsly

Data Validity Spot Checks, No-Delete Policies, Database Constraints, and Care with NULLs vs. FALSEs etc.

This article is part of my Confessions of an Unintentional CTO book, which is currently available to read for free online.

Having maintained Oxbridge Notes for seven years now (and counting), I’ve come to value data integrity as one of the supreme goods in any well oiled and happily functioning web application. While it’s true that various programming tools and technologies will come and go during your business’s lifespan, your data, by contrast, may well remain relevant far into the future. And because of its special longevity, it’s especially important for us to guarantee our data’s integrity integrity.

When I speak of data integrity, I’m referring to a certain tidiness and consistency within an application’s stored data. What’s especially important to me is the continuous validity of whatever data validity checks my application exacts upon incoming data. For example: Do all my users have validly formed email addresses? Are all my Order records associated with user records? Is any proposed product name unique amongst all the other product names in my system? Are all my users’ avatars stored within standard photo file formats such as PNG? Do the state attributes of all my Payment objects fall into one of the three permitted values—"cart", "payment", or "completed"?

Most web applications have data validity checks that run whenever a new record gets saved or an old one gets modified. Given this state of affairs, why do I nevertheless insist on making a fuss about data integrity? For the simple reason that invalid data still manages to seep its way into databases despite the usual validations.

The vectors of contamination are manifold. For illustrative purposes I’ll list the situations that caused my database to become tainted at some time or other:

  • Whenever someone (or some other program) directly modified my database through the database instead of through my Ruby on Rails application, they bypassed the validations written in my Ruby object-relational mapper and were thereby permitted to save inconsistent data.

  • Certain previous bugs caused data to become malformed. Whenever my post-bug cleanup operation was inadequate, clumps of aberrant data would linger around and haunt the database.

  • Data created many years ago—back when my programming skill was significantly less advanced—was subject to comparatively relaxed data validations. When I subsequently tightened these validations, I sometimes forgot to retrospectively apply the more stringent standards to pre-existing data. The result was an ugly sort of caste system within my database.

  • Underspecification of how data ought to be represented in the database (e.g. as NULLs or as empty strings) led to data assuming inconsistent forms, which in turn led to unexpected results returned from future database queries.

  • Database migrations over the years sometimes left out data transformations which would have been needed to massage pre-existing data into the right shape.

  • Rare system outages caused processes to hiccup and abort halfway through some action I had always assumed would be atomic, thereby leaving behind small, scattered bits of inconsistent data.

  • Multithreaded/multi-server applications introduced the possibility of race conditions which wreaked havoc on uniqueness validations. For example, imagine if two threads simultaneously receive the same proposed name for an attribute that’s supposed to be unique. Since no pre-existing record has reserved this name, the validator in each thread gives the green light and the application writes this name to the database. This results in you having two non-uniquely named records despite the presence of application-level database validations that were supposed to prevent this from ever happening.

Whatever its causes, inconsistent data is always a serious problem. It festers within your database, only to strike at inopportune times in the future by means of bugs occurring in otherwise solid and stable code. Perhaps the worst thing about having poor data integrity is that it introduces a particularly nasty type of instability—one that is difficult to detect through your usual automated tests, for these assume your data is always platonically formed according to the most recent ideals encapsulated within your data validations.

So, given all that, what can we do to improve our data integrity?

1. Data Validity Spot Checks

If I were to recommend one single course of action, it would be to schedule regular data validation checks over the entire database.

This involves recycling the data validation functionality used during normal application flows. The only difference is that instead of waiting for a user to modify a particular record, we instead loop through every record and check that it still meets our validation requirements.

This approach doesn’t remove data validity problems for us automatically, but it does alert us of issues and arrange them in a way that’s efficient for us to deal with. After all: It’s better to lump identical problems into a single big batch rather than disperse them across multiple staff members over long periods of time during which the company might forget the steps needed for a solution and thus waste time figuring them out again. Keep in mind that data integrity issues tend to occur in clusters, and are usually caused by some deeper root cause.

The first time you run one of these special data validation scripts, expect to spend an afternoon or two fixing the dodgy data. Once through this initial ordeal, subsequent runs will usually complete without a hitch.

The basic idea elucidated, I’d like to now introduce some subtlety: Validity is not necessarily a black-or-white affair—there are various shades of imperfection. In particular, some problems hardly warrant declaring a record invalid, yet they are not so insignificant as to justify sweeping said issue under the rug.

Let me give you an example: I sell Word documents on my website, and in order to provide comprehensive search functionality, I feed the text from these documents into a search engine index. At times this indexing fails, for example when there are network connectivity issues or when my text extraction library meets with an unusual Word file. When these problems occur, I don’t want to invalidate my otherwise intact DigitalDownload objects, but I do wish to preserve the error so that I can revisit it later. My solution was to mark these issues in the database with a flag. (Technical details for the extra geeky reader: I use a bit field to store a collection of boolean attributes in a single database integer column. This has the advantage that I do not need to pollute the database with additional columns should I wish to add additional flags.)

Knowing that all this failure information is available, I then built a service that scans the database for files known to not have been indexed completely and then attempts to index them once more. This service gets run at various dates in the future, with the dual hopes that fleeting network downtime should have passed and that the constantly improving text-extraction program that powers the system will have been improved to handle even more exotic Word files.

2. No-Delete Policies

Even though deletion is one of the fundamental database operations, its widespread availability at the web application level is misguided at best and foolhardy at worst. Now that’s quite the bold statement, so please allow me to present my argument.

By way of background information, I’d like to point out the exponentially decreasing cost of data storage. For many software businesses today, data storage costs are mere rounding errors — especially when compared to the high costs incurred by hiring competent software engineers.

The core of my argument against deleting is that we programmers often lack the foresight to envision how data will become necessary or important in the future. Because our fallibility in this respect doesn’t mesh with the irreversible permanence of deletion, the most reasonable architectural decision is for us to keep old data hanging around, just in case.

What are some specific situations in which data initially assumed to be defunct may later turn out to be necessary?

  • Business reports—whether they be existing reports or future ones not yet dreamed up—are often generated by means of complicated multi-table joins. The absence of just a single record can sever the link in this long chain of database joins and result in your report generation being complicated or even prevented altogether.

  • In some business models (e.g. two-sided platforms for digital downloads), you, as the business operator, have contractual obligations that are dependent on the continued presence of certain bits of data. Say, for example, a customer buys a digital download from a supplier on your platform. Two seconds after the purchase, the supplier deletes all his digital products, including the one the customer just bought. If you had allowed your suppliers to do "true" deletions, then the customer would be unable to download the products they had just bought and would therefore have a legal claim against you as the business owner.

  • The absence of old data inhibits your ability to answer certain customer service questions or requests. For example, if a customer gets audited five years down the line, they may contact you to demand you reprint all their old invoices. You won’t be able to do this if you’ve since deleted the underlying data (say because you no longer sell those products).

Having established my arguments for why absolute deletion is often a foolhardy course of action, I’ll now share implementation details for a no-delete policy: The simplest solution involves adding a deleted_at date column to every table with records that ought to stick around. From here on out, you will “delete” records by setting their deleted_at columns to the current time. (Symmetrically, you can undelete a record by setting its deleted_at column back to nil.) In order for this deleted status to be respected in the web application’s functionality, you will need to rewrite parts of the code to hide deleted records from appearing. For example, in the digital downloads example above, deleted files should no longer appear in the frontend as options for prospective customers to buy. But by way of contrast, these same 'deleted' records should still be accounted for in any all-time sales reports or vis-à-vis invoice regeneration.

This brings us to a more general point about implementing a no-delete policy: It’s not as simple as flipping a switch. At every place the old record might conceivably appear—be that alone or as part of a list (e.g. as one of the products in a category)—you’ll need to decide whether the 'deleted' record ought to be displayed or hidden. And be warned that your scope of care needs to extend beyond raw database finders—you will end up needing to consider the effects of 'deleting' on search indexes, automatically generated sitemaps, and more. Closely related to all this is the issue of uniqueness validations—should these checks be constrained by 'deleted' records or should they ignore them? The answer depends on the functionality in question.

Before you accuse me of sounding like an anti-deletion fanatic, I do allow that you should "hard delete" a record in special circumstances. Though as the documentation of Paranoia (a Rails library for no-delete functionality) warns, “aim this method away from face when using”.

3. Database Constraints

Database constraints provide a second, more fundamental layer of defence against inconsistent data when added to application-level validations. Inexperienced developers often argue that there’s no need to write database constraints if your web framework (e.g. Rails) already carries out the same checks at the application level. But after considering the above situations in which inconsistent data can wrest its way into the database despite application-level validations, you should now see the folly in these developers’ arguments.

Which exact database constraints are available to you depends on your choice of DB engine. For the sake of concrete demonstration, I will assume you are using the popular Postgres database and will therefore provide code examples in raw SQL. That said, I’m almost certain your chosen web application framework will have features or plugins to ease the creation of these constraints for you.

A. NOT NULL

This guarantees that a given column can never be empty in any record, and is roughly equivalent to a "presence" validator. Here’s how to add a NOT NULL constraint to the name column during initial table creation:

sql CREATE TABLE users ( … name varchar NOT NULL, … );

B. UNIQUE

This guarantees that a given column is unique within your table. In the example below you’ll notice there’s also a NOT NULL constraint. This is intended to demonstrate how various constraints can be added to a single column:

sql CREATE TABLE users ( … name varchar NOT NULL UNIQUE, );

If a UNIQUE constraint acts upon a group of columns as opposed to a single column, we write the constraint as follows:

sql CREATE TABLE users ( … first_name varchar NOT NULL, surname varchar NOT NULL, UNIQUE (first_name, surname) );

C. REFERENCES

This guarantees the values in a given column match the values in a second column in some other "foreign" table. This constraint is used for things like ensuring child records have existent parent records, or for preventing deletion of parent records that have dependent children records. Below I demonstrate how to add a REFERENCES constraint by altering an existing table. (The two previous examples showed how to add constraints to newly created tables.)

```sql

Relevant bits from the two tables in question

users ( id INT PRIMARY KEY NOT NULL, … );

todos ( … user_id INT NOT NULL, );

Adding the constraint

ALTER TABLE todos ADD CONSTRAINT todosuseridfk FOREIGN KEY (userid) REFERENCES users (id); ```

With the foreign key in place, any operation—delete or update—that causes a todo to point to a non-existent user will now fail.

D. CHECK

This guarantees that your data matches certain requirements, such as inclusion within a list of permitted string values or within certain numerical bounds or between certain calendar dates. It’s also possible to employ regular expression checks here, as shown with the email column below:

sql CREATE TABLE users ( … age INT CHECK (age > 17), activation_state varchar CHECK (activation_state IN ('normal','trial', 'expired_trial')), email varchar CHECK (email ~* '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$') );

4. Be Vigilant about the Distinction in SQL between NULL, FALSE, and the Empty String

SQL database engines (typically) view the empty string (‘’) as something distinct from NULL, the absence of data. The upshot of this is that an SQL query for records with email fields that equal the empty string will not return records with NULLs in this column. This is a problem if, either through inexperience or forgetfulness, you write code that represents missing email addresses as empty strings on some occasions and as NULLs on others—even though your SQL finders assume that absent emails are exclusively represented by one of these forms. Whenever you query your database with one of these finders, you’ll only see a subset of the available records, leading to frustration and confusion.

This is a really, really common problem with new web developers. And it doesn’t just occur with respect to columns with string data types. For example, you could easily run into the same issue with boolean columns if you represent falsity as FALSE in some cases and NULL in others.

What should one do about all this? Other than staying vigilant, the most important thing is to attach stable meanings to NULLs, empty strings, and so on, and then stick to these distinctions throughout your codebase (instead of varying it on a table-by-table basis). For example, imagine you had an addressline2 column in your customer_addresses table. While some users will have a second address line, many will have simpler addresses containing only a single line. You might decide that the empty string communicates “the user has told us that they have no second address line”, whereas NULL communicates “the data is missing either because we haven’t asked the customer for it yet or because there has been a system failure".

5. Use Database Transactions

This topic has been covered to death elsewhere online, so I’ll only give it a brief mention because I’m a completionist.

Database transactions ensure that a sequence of database changes either occurs in full or not at all. They are necessary whenever it’s possible that a premature ending mid-sequence could lead to an incorrect, inconsistent, or otherwise problematic state of affairs. Below I will provide some examples, though please note that I will be describing unrealistically naive algorithms for the sake of having convenient and concise explanations.

Banking: Say I intend to pay you £100 with my online bank. The first step in the bank’s algorithm is to withdraw £100 from my account, whereas the second step is to bump your balance by the same amount. Now imagine that as soon as my account is docked (but before yours is increased), the bank’s software crashes. When their system gets rebooted, my account balance will be £100 less, even though your account balance will have remained exactly the same. As such, the cash has effectively disappeared into thin air.

Ticket-selling software: One of the goals of ticket-selling software is to prevent selling more tickets than there are seats. In order to ensure this, the ticket-selling algorithm will keep track of the count of remaining tickets. Now let’s set up a crash scenario by imagining that I click “buy" on a ticket. The first step the algorithm takes is to decrement the remaining tickets counter and only after this is done does the ticket get assigned my name. Assuming the system crashes just before the assignment to me, the counter will remain stuck in this decremented position­—even though I never received or paid for a ticket! The overall result will be that the software undersells its venue instead of simply protecting against overselling, as it was supposed to.

These issues could have been prevented by the use of database transactions which roll back the initial withdrawal or counter should the later steps in their respective algorithms fail to occur. This prevents these ugly in-between states from ever occurring.


More Articles: Click here for full archive

Staying Organised When Advertising

How to structure campaigns, why to not delete data, when to schedule reviews


Silent Failures Must Be Given a Voice

The most dangerous bugs are those you remain chronically unaware of–as all too often occurs with failing scheduled scripts


A Lasting Marriage: Lessons Learned After 7 Years With Google Analytics

A deep guide to developing a loving relationship founded on clean data, great communication, and appropriate caution