You're So Predictable: A Lesson in Managing Data
"Our lives are not our own. We are bound to others, past and present, and by each crime and every kindness, we birth our future." –from Cloud Atlas by David Mitchell
There's a certain pattern most people exhibit before they reach a certain future. This can be the case in life, but it's also the case when it comes to purchasing decisions. In the latter case there is often a consistent pattern, and as marketers, we are able to capture it and find others bound to that same pattern.
For example, people who join a gym are also likely to purchase gym shoes, supplements, maybe a yoga mat or some limited in-home equipment. People who purchase a kitchen mixer will likely buy baking pans. People who purchase a boat may be interested in buying fishing gear—and so on. We're able to draw these assumptions about a consumer based on the mass of consumers before them who did the same things.
But persona marketing can be a tricky science. Luckily, we have something that can help us determine which attributes to select: data. And businesses have lots of it. But what if we don't want to handcraft everything that goes into a model? Let's say we have hundreds of things we're trying to predict. And on top of that, let's say there's also a certain portion of the population with an unclear or unstructured past, such as a scattered purchase history. Maybe I joined a gym, bought a mixer, and I like to fish.
There will always be little drivers that steer us off the path, and depending on how much we steer, they can completely change the course of our future. When making a prediction, these little drivers are sometimes the most important attributes of a model; they might be the key detail that lets us completely customize a customer's experience. But because we don't completely understand these little drivers, we can't necessarily label them or organize them into usable structured databases. Enter unsupervised learning. It's the key to organizing and understanding the small drivers.
Unsupervised learning can be a powerful modeling method. It can find groupings in unlabeled, unstructured past and present data on its own. This can save time and energy on the human side, but at the cost of machine processing time. To stay accurate, an unsupervised model has to be recalculated every time the underlying data changes. If something needs to be predicted in an up-to-the-minute fashion, unsupervised learning may not be the best solution. It can be time consuming to re-collect, reorganize, and re-cleanse data, and that's all before retraining and recalculating the model.
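To make the idea concrete, here is a minimal sketch of unsupervised grouping. It assumes k-means clustering, which is only one possible technique and not one the article names. The customer numbers (gym visits per month, baking purchases per year) and the starting centroids are invented for illustration.

```python
from math import dist

def k_means(points, centroids, max_iter=100):
    """Group points around centroids with plain k-means."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the grouping has settled
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a group is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids

# Hypothetical customers: (gym visits per month, baking purchases per year).
customers = [(12, 0), (10, 1), (11, 0), (1, 8), (0, 9), (2, 7)]

# No labels are supplied; the two starting centroids are arbitrary guesses.
labels, centers = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)
print(centers)
```

Nothing in the input says who is a "gym goer" or a "baker"; the groups come out of the data alone. The trade-off described above also shows up here: when new customers arrive, the whole function has to be run again.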
Here are some things that can be done to make retraining a model easier. The assumptions they build into the model are known as inductive biases:
Label as much data as possible. This way, the machine that's learning the data doesn't have to start from scratch. For our gym goer, for example, we would include everything in the first round: What time does that person go to the gym? Do they eat first? Do they go with a friend?
Get rid of useless features. These won't always be the same, and there's no silver bullet, but it is important to reduce your features to a more manageable set. Start with the larger set and drill down from there.
Simplify your hypotheses. By the logic of Occam's razor, when multiple hypotheses fit, the simpler one is usually the better place to start. There are infinite ways to answer the question “What are these customers?” Scale it down. Scale it down even further than “Are they likely to buy these ten products in the next year?” Start with one question: “Will someone who goes to the gym be likely to buy a yoga mat?” If you start with a simple question, it's a lot easier to develop a model that predicts whether or not consumers will do that one thing. Once we establish that, we move on: “Of those who purchased a yoga mat, will their next purchase be gym shoes?” The idea is to take your mass of data and scale it to one simple question to get started.
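The last two steps in the list can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a real model: the customer records and field names are hypothetical, "useless" is narrowed to "identical for every customer," and the answer to each question is a simple observed rate that ignores purchase order.

```python
def drop_constant_features(rows):
    """Remove features that take the same value for every customer."""
    keys = rows[0].keys()
    useful = [k for k in keys if len({row[k] for row in rows}) > 1]
    return [{k: row[k] for k in useful} for row in rows]

def purchase_rate(rows, given, target):
    """Share of customers with `given` who also have `target`."""
    group = [row for row in rows if row[given]]
    if not group:
        return None  # no one matches, so there is no rate to report
    return sum(1 for row in group if row[target]) / len(group)

# Hypothetical labeled records; every field name here is made up.
customers = [
    {"gym_member": True, "bought_yoga_mat": True, "bought_gym_shoes": True, "has_account": True},
    {"gym_member": True, "bought_yoga_mat": True, "bought_gym_shoes": False, "has_account": True},
    {"gym_member": True, "bought_yoga_mat": False, "bought_gym_shoes": False, "has_account": True},
    {"gym_member": False, "bought_yoga_mat": False, "bought_gym_shoes": False, "has_account": True},
    {"gym_member": True, "bought_yoga_mat": True, "bought_gym_shoes": True, "has_account": True},
]

# "has_account" is True for everyone, so it cannot help separate customers.
pruned = drop_constant_features(customers)

# Question 1: will someone who goes to the gym buy a yoga mat?
mat_rate = purchase_rate(pruned, "gym_member", "bought_yoga_mat")

# Question 2: of those who bought a yoga mat, how many also bought gym shoes?
shoe_rate = purchase_rate(pruned, "bought_yoga_mat", "bought_gym_shoes")

print(sorted(pruned[0]))
print(mat_rate, shoe_rate)
```

Each question is answered on its own, and the second is asked only about the customers who passed the first, which mirrors the "one simple question, then move on" approach.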
Ultimately, it's a cross between supervised and unsupervised learning, known as semi-supervised learning, that might be the best approach. The greatest model may just need to be formulated from the known alongside the unknown. It should be based on a hypothesis that is neither simple nor complex, but rather, a hypothesis that clearly gets to the root of the problem and gives a possible solution that is executable. There are many “little drivers” of life that we could focus on, and we might even be able to formulate better hypotheses based on these findings. But as long as we get to where we need to go, the small things might be just that: small things.
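One way to picture "the known alongside the unknown" is self-training, sketched below. This is one semi-supervised technique among several and an assumption on our part, not a method the article specifies. The two labeled customers, the unlabeled ones, the `margin` threshold, and the nearest-centroid rule are all hypothetical, and the sketch needs at least two distinct labels to work.

```python
from math import dist

def centroid(points):
    """Mean position of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Grow a small labeled set by labeling only the confident cases.

    labeled: list of (point, label); unlabeled: list of points.
    Returns the enlarged labeled list and the points left unlabeled.
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        # Fit: one centroid per known label.
        groups = {}
        for point, label in labeled:
            groups.setdefault(label, []).append(point)
        centroids = {label: centroid(pts) for label, pts in groups.items()}

        newly_labeled, still_unlabeled = [], []
        for point in remaining:
            ranked = sorted(centroids, key=lambda lab: dist(point, centroids[lab]))
            best, runner_up = ranked[0], ranked[1]
            gap = dist(point, centroids[runner_up]) - dist(point, centroids[best])
            if gap >= margin:
                newly_labeled.append((point, best))  # clearly closer to one group
            else:
                still_unlabeled.append(point)  # too ambiguous to label
        if not newly_labeled:
            break  # nothing confident left to add
        labeled.extend(newly_labeled)
        remaining = still_unlabeled
    return labeled, remaining

# Hypothetical data: (gym visits per month, baking purchases per year).
known = [((12, 0), "gym"), ((1, 8), "baker")]
unknown = [(10, 1), (11, 0), (0, 9), (2, 7), (6, 4)]

labeled, remaining = self_train(known, unknown)
print(labeled)
print(remaining)
```

With these toy numbers, the customer at (6, 4) sits between both groups and is left unlabeled, much like the shopper who joined a gym, bought a mixer, and likes to fish.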
Hamen Lo McLaughlin is a statistical database analyst at Pluris Marketing where she focuses on marketing enablement, analytics, and optimization solutions.