It's summertime: tune-up that database
Data Quality's Dirty Little Secret
With the summer doldrums officially upon us, this is a great time to give that marketing database a thorough cleaning and tune-up, so it's in top-notch condition by the time the busy holiday season is upon us. Much like the car we drive to work every day, our marketing database is largely taken for granted unless it starts doing something weird. But just as your car needs periodic oil changes, filters, spark plugs, and tire pressure adjustments; databases run better when given a little TLC. Here are some things you can do to rev up your database for the busy fall and holiday seasons.
Take out the trash
All databases contain garbage data, such as bad email addresses, invalid phone numbers, dates, or dollars that are wildly out of range. Here are a few tricks that can make finding bad data easier:
- Run frequency counts on fields that are supposed to be unique, like email addresses or phone numbers. To avoid being inundated with output, you can instruct the database to only return cases where the data is not unique. Often you'll see phony addresses such as “email@example.com” or phone numbers such as “999-999-9999” assigned to several different consumers.
- Ask the database to display the maximum and minimum value of each field, filtering out blanks, nulls, and zeros. Because special characters sort either before or after normal letters and digits, these often show up in your output. If you're feeling especially ambitious, there are advanced techniques for finding garbage characters embedded within strings. Also on occasion you'll see dates that are centuries old (or centuries in the future), or numbers that are obviously out of range.
As a general rule of thumb, I like to replace bad string data with blanks, bad numeric data with zeros, and bad dates with nulls. This helps prevent incorrect query output—also referred to as garbage in, garbage out.
Remove the junk
Most databases I've seen contain tables that nobody uses, and no one knows what they're for or who put them there (usually a fellow named “Not Me”). Much like the junk in my garage, these seem to sit around forever; cluttering up the database and forcing users to work around them when building queries. Unless you're 100% sure, before deleting a table you should rename it and see if anything breaks. Once a week or so has gone by without incident, copy the data onto a backup media just to be sure, and delete the table.
Adopt the orphans
Because marketing databases obtain most of their data from other systems, they can get out of sync when system or timing problems occur. In the database world, an orphan is a logical impossibility that happens on an astoundingly regular basis: orders without matching consumers, orders without products, products without orders, and so on. The best way to handle orphans is to search for their “parents” back in the originating system, and reload them as necessary, causing the orphans to be adopted.
Watch out for blankety-blanks
Character strings such as name and address data often get loaded with trailing spaces. This commonly happens when data files are imported into the database, especially files where each data element starts in a specific column (known as “fixed length” files). So, why is this a big deal? Because those blankety-blank blanks take up room just like a printed character, and the more space you waste in a database, the slower it gets. The problem is astonishingly commonplace, can eat up hundreds of megabytes of storage, and often is quite tricky to detect. That's because many databases ignore trailing blanks when performing comparisons; a city name with no trailing spaces is considered equal to the same city with ten trailing spaces, and since everything seems to be working, nobody knows there's a problem.
One way to spot this condition is to compare the display length of a string with its physical storage length. Unfortunately, there is not a consistent way to do this across database platforms. SQL Server provides a function called LEN to obtain a string length without counting blanks at the end, and another called DATALENGTH that does count them. With other systems, you can compare a string length against the same string length after trimming trailing blanks. So whenever LEN(string) is not equal to DATALENGTH(string) or LENGTH(string) is not equal to LENGTH(trimmed string), you've got a blankety-blank problem.
Right-size your data
In ancient times—before Bill Gates single-handedly slayed the dinosaurs and popularized the PC—adding or changing data elements within a database used to be a big deal. Accordingly, database architects approached design with a conservative mind-set; usually including extra, unused “filler” fields for future expansion and allowing extra space on fields that they thought might grow larger.
In today's rapidly changing Big Data world, this is no longer necessary or recommended. Because marketing databases quickly become enormous monstrosities, wasting even a few bytes on every transaction can mean terabytes of squandered storage. Just as with trailing spaces, the more data that needs to be sifted through, the longer queries take to run. Why use a 10 character field if you're only using two? Take advantage of this period of relative quiet by making sure that the field lengths you're using are appropriate for the data they contain, and get rid of fields that aren't used.
Archive, clean, merge, and purge
If you haven't done so for a while, consider removing or archiving old data (i.e. customers you no longer actively market to). After doing this, it might be time to run a full NCOA and merge to make sure you're targeting the right consumers with the right message, and almost as important, excluding the right consumers.
Compress the rest
Much like defragging your hard drive, most databases offer a command to reorganize their contents. This basically sorts and moves data around so that processing it becomes faster and more efficient. It also gets rid of wasted space by compressing data files. Depending on your database, the command for doing this might be SHRINK or REORGANIZE, although one system uses the command VACUUM—ensuring that nobody wants to do it. At any rate, reorganizing is easy to run and yields nice dividends in improved performance.
So once you've taken out the garbage, hauled the junk away, zapped the blankety-blank blanks, archived, cleaned, vacuumed, dusted, and mopped; your database will be in great shape for the busy fall and holiday seasons. As an added bonus, a well-maintained database looks great in a swimsuit, and helps keeps your mind off the summer heat!
Jeff Fowler is president and
founder of Decision Software