When a user supplies a web application some data we intend to house in our database, we wouldn’t dream of saving it without first whipping the data into a tidy, cleanly formatted shape. For example:
- We strip trailing or leading whitespace from email addresses, avoiding the uniqueness validation failure of allowing both “jack.kinsella@example.com" and "jack.example@example.com__". Not only does trailing whitespace allow near-duplicate records, but it also provides a crappy user experience in that certain users will be unable to log in and will remain completely unaware that their accidental trailing whitespace transformed their login details into a subtle lookalike.
- We remove dashes and other non-numerical characters from phone numbers, again to ensure that our validations genuinely guarantee uniqueness.
- We lowercase permalink slugs on the grounds that the HTTP protocol requires URLs be case insensitive. If we stored differing content on the URLs “/anatomy-notes” and “/ANATOMY-NOTES”, the HTTP protocol would imply that one of these records would never be accessible through its URL, it being clobbered by the other case variation.
- We coerce empty strings into NULLs (or vice versa) on string columns. SQL database engines view the empty string (‘’) as something distinct from NULL. The upshot of this is that an SQL query for, say, email fields containing the empty string will not return records with NULLs in their email field. Likewise, an SQL query for email fields containing NULL will not return records with empty strings in that position. This is a problem if, either through inexperience, forgetfulness, or poor understanding of the default behaviour of your web framework, you push code that represents missing email addresses as empty strings on some occasions and NULLs on others, yet your SQL finders assume that missing emails are always in one of these forms. As such, your SQL finders will return only a subset of the available records, causing you frustration and despair… (* But see my extra note below)
For similar reasons, we coerce NULLs into FALSEs in boolean columns. If you represent falsity as both FALSE and NULL in your DB, then you’ve turned a boolean question that should have two possible answers (TRUE or FALSE) into one with three possible answers (TRUE, FALSE, or NULL). As such, expect your application logic to fail in myriad unexpected ways. (* But see extra note below)
(* EXTRA NOTE: In some cases you will want to explicitly encode significance into the distinction between empty strings and NULLs. For example, imagine you have a column in your customer_addresses table called “address_line_2”. This is an optional extra tidbit of address data; most users have simple addresses that require only a single address line, although a few require a second line. In this situation, you might decide the empty string communicates “the user has told us that they have no second address line”, whereas NULL communicates “this data is missing”, perhaps because we haven’t asked the customer for it yet or because our system failed and lost this data. If you’re going for this distinguishing approach, the sanity of other programmers demands that you document the meaning of your distinction and keep it consistent throughout your entire project (instead of varying it table by table).)
At the start of your web application’s life, you’ll have only a handful of database-backed records to take care of, so it’s fairly straightforward to ensure your data is consistently formatted. But as your database (and its schema) grows, and as new programmers slot into your team, the demands of consistency become haltingly clunky to police.
At this point, you might resort to creating documentation that instructs team members on how to format database records before saving them to the database. Although this approach will improve data consistency, it is hardly watertight. There’s no guarantee that your team members will remember to follow your instructions, of which there may be a great many, thereby leaving scope for error.
A more elegant solution to this problem is project-wide attribute normalisation. Here’s the basic idea: At a software layer close to your object relational mapper code, insert a module that automatically formats database attributes matching certain names (e.g. permalink, name, or email) or certain types (e.g. boolean, string) with formatters of your liking (e.g. lowercasing or stripping trailing whitespace). This will guarantee that any record with those attribute names or column types will have those attributes formatted in the correct way—every time, guaranteed.
Obviously, this demands that your team reuse the same attribute names across database tables, and there is again scope for human failure and forgetfulness. But the surface area of code to be policed in order to pull off this second solution is far less than that of the approach depending on following documentation; now, instead of monitoring all code that interacts with the database fields, we must only monitor a single file: the database schema. Furthermore, I submit that having recurring database field names is both natural and good practice. It’s natural because the same entities keep cropping up within a large web application: permalinks, emails, names, titles, prices, categories, states (from state_machines), etc. And it is good practice because reusing the same column names for the same conceptual ideas keeps the entire system consistent and more intuitive to newcomers.
For an example of an open source library that does attribute normalisation in the Rails world, check out attribute_normalizer by Michael Deering.