Bigger, Stronger, Faster: Dr. Goodnight Talks Speed
During his keynote Q&A at the SAS Premiere Business Leadership Series in Orlando, company CEO and cofounder James Goodnight, Ph.D., voiced his displeasure with the Big Data buzzword.
“Most of our customers don't have Big Data,” he said. “It's not that big. The reason that term is being used today is because the analysts and media got tired of talking about cloud computing. There's a new buzzword that the analysts at Gartner and IDC come out with every year so they have something to consult about.”
Goodnight prefers the term “high-performance analytics”—though it should be noted that the top four Google search results of the phrase all lead to SAS websites.
After his keynote, Goodnight sat down with Direct Marketing News to discuss what exactly constitutes high-performance analytics, how it works, and how businesses can apply the technology.
What are high-performance analytics?
It's basically the speed with which we can do things. It's just high-performance like a sports car that goes really fast.
You use in-memory analytics to gain that speed?
In-memory analytics is how we gain the speed. A lot of things we do in statistics are iterative processes. In all of our non-linear models, for example, we have to estimate what the values of all the different parameters are. We run through the data once and then we can estimate the derivative of each one of those. Then we go back through the data again using these new estimates, and we have to keep doing that until these estimates converge.
Could you reframe that in more basic language?
(Pause) I thought that was pretty basic. (Pause)
Could you give me a real-world activity where this is applied?
These are real-world activities! We have a model where we're trying to estimate the different numerical parameter values that would give us the very best fit. Because we can't just take one computation and get the results, we have to do the same computations over and over and over again, each time getting a little better answer until finally, after 15 or 20 reads through the data and all those computations, we have a set of answers that are about the same as the last time we tried to do it. And that's what we call convergence. You've got a set of numbers that get closer and closer and closer together until they finally don't change between iterations.
And because of that, it requires the data to be read over and over again. If you've got 10 million records, you have to read those records over and over and over again as you're trying to come to the right answer. It takes a lot of time. The best way to avoid having to reread this is to read it one time and put it in-memory and keep it in-memory.
Then there's the number of computations involved. Today's chips from Intel only execute about three billion instructions per second, and there are some jobs that take trillions and trillions of operations before you get the right answer.
So to make those jobs run faster, we've developed a method where instead of trying to do this on a single CPU, we've spread the job out over hundreds, even thousands of CPUs. We spread the data out, we spread the computations out.
This allows us to do the job almost 1,000 times faster. It's a combination of using lots of memory, data in-memory, and using a lot of processors in parallel.
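For readers who want to see the loop Goodnight describes, here is a minimal sketch in Python. It is not SAS code. It assumes a hypothetical one-parameter model, y = exp(b * x), fitted with the textbook Gauss-Newton method on made-up data; the function name and the tolerance are illustrative choices. The data is loaded into memory once, and every pass rereads it from there until the estimate stops changing.

```python
import math

def fit_exponential_rate(xs, ys, b=0.0, tol=1e-10, max_passes=50):
    """Estimate b in y = exp(b * x) by Gauss-Newton iteration.

    Returns the estimate and the number of passes over the data.
    """
    for passes in range(1, max_passes + 1):
        num = 0.0
        den = 0.0
        for x, y in zip(xs, ys):        # one full read of the in-memory data
            pred = math.exp(b * x)
            grad = x * pred             # derivative of the prediction w.r.t. b
            num += grad * (y - pred)
            den += grad * grad
        step = num / den
        b += step                       # improved estimate for the next pass
        if abs(step) < tol:             # estimates no longer change: converged
            return b, passes
    raise RuntimeError("estimates did not converge")

# Hypothetical data, generated with a true rate of 0.7 and kept in memory.
xs = [0.1 * i for i in range(1, 21)]
ys = [math.exp(0.7 * x) for x in xs]

b_hat, passes = fit_exponential_rate(xs, ys)
print(f"estimate {b_hat:.6f} after {passes} passes over the data")
```

Each pass touches every record, which is why keeping the records in memory pays off when there are millions of them rather than 20.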
When did that idea originate?
Gosh, probably back in the early days of super-computers. I think they quit [atomic testing] 30 or 40 years ago because they could simulate that on computers. All the weather forecasting is done using weather simulations. It's been around.
But from a pure analytics standpoint, we were one of the early companies to realize that if we were going to do analytics on really large amounts of data, we need to go to this parallel computing idea.
That idea started with us about four and a half years ago. I was in Singapore and a banker said their risk computations, trying to see if value was at risk on their portfolio of stocks and bonds, were taking 18 hours. And that was on a single computer.
When I got back to Cary, [North Carolina,] I talked to our risk people and began to get an understanding of how it was and what it was they were doing with those computations. Then I sat down and roughed out an approximate number of computations they needed to do. It was 150 trillion operations on a computer. And if you can only do three billion and you've got to do 150 trillion, it's going to take a long time.
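The arithmetic behind that estimate is easy to check. The short Python sketch below uses only the two figures Goodnight quotes and makes two simplifying assumptions of its own: one operation per instruction, and perfect scaling when the work is split across 1,000 processors.

```python
operations = 150 * 10**12        # 150 trillion operations (from the interview)
ops_per_second = 3 * 10**9       # about three billion per second on one chip

seconds = operations / ops_per_second
hours = seconds / 3600

# Assumption: the work divides evenly over 1,000 processors with no overhead.
seconds_spread = seconds / 1000

print(f"single processor: {seconds:,.0f} seconds, about {hours:.1f} hours")
print(f"1,000 processors: about {seconds_spread:.0f} seconds")
```

On a single processor that comes to 50,000 seconds, or roughly 14 hours, the same order of magnitude as the 18 hours the banker reported.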
Clearly if we're going to solve this problem and get it any faster, we'll have to spread these computations out over many machines. We started looking at our statistical algorithms and saw we don't actually have to compute all the rows of a matrix over a single machine. We can spread the rows out over many machines.
So we began slowly putting together methods of how to do these computations in parallel when we'd always done it one step at a time.
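One way to picture spreading the rows of a matrix over many machines is the cross-product matrix that sits at the heart of many regression fits. Each machine can sum the contributions of its own rows, and the partial results are then added together. The Python sketch below is an illustration under stated assumptions, not SAS's implementation: threads stand in for machines, the data is made up, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_crossproduct(rows):
    """Contribution of one block of rows to the cross-product matrix X'X."""
    p = len(rows[0])
    acc = [[0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            for j in range(p):
                acc[i][j] += row[i] * row[j]
    return acc

def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def distributed_crossproduct(rows, n_workers):
    """Split the rows into blocks, process each block separately, then combine."""
    chunk = -(-len(rows) // n_workers)   # ceiling division
    blocks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_crossproduct, blocks))
    total = partials[0]
    for part in partials[1:]:
        total = add_matrices(total, part)
    return total

# Hypothetical design matrix: intercept, x, and x squared for ten records.
X = [[1, i, i * i] for i in range(10)]

single = partial_crossproduct(X)                  # all rows on one worker
spread = distributed_crossproduct(X, n_workers=4) # rows split four ways
print(single == spread)
```

Because the partial sums are simply added, the split result matches the single-machine result, and no worker ever needs to see the whole data set.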
Are most of your clients doing this sort of massive computing?
A number of our banking clients are beginning to look at Hadoop, a data storage method that can store very large amounts of data on home commodity hardware, so it becomes cheaper. They want to run analysis on all that data, so we pull that data from Hadoop directly into memory, and then fire up our high-performance analytics routines to analyze it.
So, yeah, I'd say there are at least 1,000 customers in the world that are doing this or are in the beginning of doing this.
Is there any interest in other verticals beyond financial services?
We're seeing some in the insurance field, especially some of the private healthcare insurers interested in detecting fraud in claims. We're also working with some of the tax authorities of different countries to try to help them better estimate which returns are likely to be incorrect.
How about with marketers?
A lot of the marketing stuff doesn't need high-performance, but one thing that really, really screams when you put it in high-performance is marketing optimization.
Say you're a telco with 25 million customers, and you're trying to optimize these 80 to 90 different offers that you can make to your customer, some from call centers, some may be email, some may be text messages or SMS. We can help you optimize exactly what messages should be sent to which individual customer.
And that is a linear programming problem that involves billions and billions of computations that can be done in just a matter of minutes. That one particular example was taking eight hours, which we now do in 90 seconds. It's really the speed at which you can get things done, especially in the modeling arena. It takes so long now to run some of these more complex models.
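A toy version of the offer problem shows what is being optimized. The Python sketch below is an illustration only: the three customers, the expected values, and the channel capacities are all invented, and it finds the best plan by brute-force enumeration rather than with a linear programming solver. Enumeration works for three customers; at 25 million customers and 80 to 90 offers it is hopeless, which is why the real problem is handed to an optimizer.

```python
from itertools import product

# Hypothetical expected value of sending each offer to each customer.
expected_value = {
    "ann": {"email": 4.0, "sms": 6.0, "call": 9.0},
    "bob": {"email": 5.0, "sms": 3.0, "call": 8.0},
    "cy":  {"email": 2.0, "sms": 7.0, "call": 6.0},
}

# Hypothetical capacity: how many customers each channel can reach.
capacity = {"email": 2, "sms": 1, "call": 1}

def best_assignment(expected_value, capacity):
    """Give each customer exactly one offer, respecting channel capacity,
    so that total expected value is as large as possible."""
    customers = list(expected_value)
    offers = list(capacity)
    best_total, best_plan = float("-inf"), None
    for plan in product(offers, repeat=len(customers)):
        if any(plan.count(o) > capacity[o] for o in offers):
            continue                      # this plan exceeds a channel's capacity
        total = sum(expected_value[c][o] for c, o in zip(customers, plan))
        if total > best_total:
            best_total, best_plan = total, dict(zip(customers, plan))
    return best_total, best_plan

total, plan = best_assignment(expected_value, capacity)
print(total, plan)
```

With these made-up numbers the best plan gives the call to ann, the email to bob, and the text message to cy, for a total expected value of 21.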
Why have those models become more complex? Is it the influx of information from digital channels?
We do collect a lot more data. With credit card fraud, for example, we collect 600 variables to do the predictions as to whether or not a card is being used fraudulently. You think of the sheer number of credit card transactions per day, and our goal is to collect every single bit of that data and use it in the models we've generated.
How do you build up those variables?
The credit people know what most of them are because they've been doing this for many years. But on a lot of problems, you don't have any idea. If you're trying to compute the probability of default on a home loan, you basically know this will be a logistic regression, so there will be a choice of models and you'll have to choose what variables go into it. Things like the current prime interest rate [and] the number of barrels of oil being imported or exported. All sorts of macro-economic variables. Unemployment rates, unemployment claims. All of these are variables that would have some effect on whether a mortgage payment can be made or not.
And then we throw a whole bunch more in there and see which ones are relevant. You know what some of the key ones are, but you're not sure about the others, so you experiment and go through a variable selection process.
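The experiment Goodnight describes can be sketched as forward selection around a logistic regression. The Python example below is one possible selection procedure among several, not a description of SAS's own. Everything in it is an assumption made for illustration: the data is synthetic, the variable names (unemployment, prime_rate, noise) are hypothetical, the model is fitted by plain gradient ascent, and the gain threshold is arbitrary.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mean_log_likelihood(rows, labels, columns, passes=300, rate=0.5):
    """Fit a logistic regression on the given columns by gradient ascent
    and return the mean log-likelihood of the fitted model."""
    design = [[1.0] + [row[c] for c in columns] for row in rows]  # intercept first
    weights = [0.0] * (len(columns) + 1)
    n = len(rows)
    for _ in range(passes):
        grad = [0.0] * len(weights)
        for feats, y in zip(design, labels):
            err = y - sigmoid(sum(w * f for w, f in zip(weights, feats)))
            for k, f in enumerate(feats):
                grad[k] += err * f
        weights = [w + rate * g / n for w, g in zip(weights, grad)]
    total = 0.0
    for feats, y in zip(design, labels):
        p = sigmoid(sum(w * f for w, f in zip(weights, feats)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / n

def forward_select(rows, labels, candidates, min_gain=0.01):
    """Repeatedly add the variable that improves the fit most,
    stopping when the best improvement falls below min_gain."""
    chosen = []
    best = mean_log_likelihood(rows, labels, chosen)
    while len(chosen) < len(candidates):
        remaining = [c for c in candidates if c not in chosen]
        score, col = max(
            (mean_log_likelihood(rows, labels, chosen + [c]), c) for c in remaining
        )
        if score - best < min_gain:
            break
        chosen.append(col)
        best = score
    return chosen

# Synthetic loans: default depends on two variables; "noise" is irrelevant.
random.seed(0)
names = ["unemployment", "prime_rate", "noise"]
rows = [{name: random.gauss(0.0, 1.0) for name in names} for _ in range(300)]
labels = [
    1 if random.random() < sigmoid(2.0 * r["unemployment"] + 1.5 * r["prime_rate"]) else 0
    for r in rows
]

chosen = forward_select(rows, labels, names)
print(chosen)
```

Every candidate variable requires refitting the model, and every fit takes many passes over the data, so the cost of this kind of experimentation grows quickly with the number of variables and records.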
Where should an enterprise start with high-performance analytics?
If you have a small business, you probably don't need high-performance analytics. Regular analytics will be fast enough for you. That's the case for many, many companies.
But it's the really, really long jobs that need to be run in a much smaller window of time. Obviously, if your risk computations are taking 18 hours, you're talking about not knowing what your risk is before the market is open the next day. You definitely need those computations done so you can plan a trading strategy for the next day.