How to Prep Data For Machine Learning: Analytics Corner

It’s easy for data analysts to lose sight of the purpose of an analysis. Staying on track means framing the purpose of the data against the machine learning model planned, then applying the steps of exploratory data analysis (EDA).

EDA supports model quality by encouraging an evaluation of the data types and values going into the model. Analysts use EDA to reduce data messiness by stating the purpose of the model, reviewing the input data, and adjusting those inputs so they run in the model seamlessly.

Messy data usually appears as raw entries in the data fields: inconsistent field types, unusual characters, and irregular category labels. Omitted field values can be a serious problem if they are systematic, because most machine learning models do not handle missing values well.
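As a minimal sketch of spotting omitted values, here is a Python/pandas example on a small hypothetical data-set (the column names and entries are invented for illustration). Note that strings such as "N/A" do not register as missing until the column is coerced to numeric:

```python
import pandas as pd
import numpy as np

# Hypothetical messy data-set: mixed types, odd labels, missing entries
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29],
    "segment": ["Gold", "gold ", "Silver", None],
    "spend": ["120.50", "89.00", "N/A", "240.10"],
})

# Count omitted values per field; many gaps in one column may be systematic
missing_per_column = df.isna().sum()
print(missing_per_column)

# "N/A" hides as a non-missing string until coerced to a numeric type
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")
print(df["spend"].isna().sum())
```

Checking counts per column, rather than a single overall total, is what reveals whether the gaps cluster in one field.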

Fortunately, the data science languages R and Python both offer functions to edit messy data. A function is a block of code that performs a single, related action. Functions in each language are reusable, so they can inspect and act on the data once it is imported into an object.

Both languages provide built-in functions to scan data when it is imported. But analysts will most likely need additional, more nuanced functions to visualize the edits better. Libraries, collections of functions built for repeated tasks, can provide those nuanced functions.

For R programming the built-in functions are simple, such as the one in the image below called head(). head() shows the first few rows of a data-set contained in the variable y. The number 10 is a value an analyst can pass to return a set number of rows. The return value here is a matrix of 10 rows and 13 columns, where each column represents a variable. Viewing the data entries this way helps you appreciate what your data-set contains.
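The Python equivalent uses the pandas head() method. This sketch builds a hypothetical 13-column data-set standing in for the article's variable y:

```python
import pandas as pd
import numpy as np

# Hypothetical 50-row, 13-column data-set standing in for the article's "y"
rng = np.random.default_rng(0)
y = pd.DataFrame(rng.integers(0, 100, size=(50, 13)),
                 columns=[f"var_{i}" for i in range(1, 14)])

# Like R's head(y, 10): return the first 10 rows so you can eyeball the entries
first_rows = y.head(10)
print(first_rows)
```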

Additional libraries can reveal more information, tailored to the kinds of data structures being created for an advanced model. For example, while the built-in dim() can show the dimensions of a matrix, the dplyr library can show the observations using the glimpse() function and indicate what data type appears in each variable.
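In Python, rough pandas counterparts to dim() and glimpse() are the shape attribute and the info() method; the data-set below is invented for illustration:

```python
import pandas as pd

# Hypothetical data-set with a mix of numeric and character columns
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "region": ["East", "West", "South"],
    "score": [0.82, 0.64, 0.91],
})

# Like R's dim(): the number of rows and columns
print(df.shape)

# Like dplyr's glimpse(): one line per variable with its data type
df.info()
```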

Ultimately, you use these libraries to determine the qualities of your dataset:

  • View the distribution within the data columns – is the data skewed? Are there outliers? What is the frequency of scoring values?
  • Identify the data types of the values – are they true numerics or just characters? Are they continuous or categorical, and how does your model recognize those values?
  • Decide whether program lines should cover how data is imported. Decisions such as which columns should be joined or separated may require library functions to automate these tasks when new datasets arrive.
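The first two checks above can be sketched in pandas as follows; the scoring column and tier labels are hypothetical:

```python
import pandas as pd

# Hypothetical scoring data with a skewed column and a categorical label
df = pd.DataFrame({
    "score": [1, 1, 2, 2, 2, 3, 9],   # the 9 is a likely outlier
    "tier": ["A", "B", "A", "A", "B", "A", "A"],
})

# Distribution: summary statistics and skewness of the numeric column
print(df["score"].describe())
print(df["score"].skew())            # positive value -> right-skewed

# Frequency of scoring values and category labels
print(df["score"].value_counts())
print(df["tier"].value_counts())

# Data types: true numeric vs. characters; mark categoricals explicitly
print(df.dtypes)
df["tier"] = df["tier"].astype("category")
```

Marking a column as categorical up front means downstream encoding steps treat its values as labels rather than free text.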

The data is considered clean and ready to analyze when the type and quality are consistent across the columns. It should be clear which columns will be used and what assumptions are being made about their role as independent variables.

The data should also make sense against the performance metrics used for the chosen machine learning model. Take verifying model accuracy with cross validation. Cross validation examines different subsets of the data for training and validating, so it is important to make validation choices that relate to the business objectives for the data as well as those for the model. An appreciation of the metrics in the input data can lead to a better explanation behind a machine learning initiative.
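A minimal cross-validation sketch, assuming cleaned numeric inputs and a binary outcome (the data here is synthetic and the model choice is illustrative), uses scikit-learn's cross_val_score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical cleaned inputs: two numeric features and a binary outcome
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# 5-fold cross validation: each fold trains on 4/5 of the rows
# and validates on the remaining 1/5
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

The scoring argument is where the business objective enters: accuracy is only one option, and a metric such as recall may match the objective better.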

Marketing managers who are not the analysts conducting the exploration may feel intimidated by the code syntax in R or Python. But outlining the layout of a data-set, be it mapping a table in Excel or a simple note on paper, can still be helpful. Managers can then review the desired results with the analysts who will conduct the exploration. For additional help in planning data for machine learning, check out my prior posts on data bias and on tidy data, a structured format popular with R programming.

No matter how this is done, applying exploratory data analysis is the best way to avoid getting lost in a machine learning initiative.
