Data Preprocessing Techniques in Machine Learning

  • Introduction
  • Why do we need Data Preprocessing?
  • Data preprocessing involves the following seven steps
  • Getting the Dataset
  • Importing Libraries
  • Importing the Datasets
  • How to handle missing data
  • How to encode categorical data
  • Split the data set – Training set and Test set
  • Feature Scaling

Introduction

Data preprocessing is the first step of creating a machine learning model, which prepares raw data for analysis.

A machine learning project rarely starts with clean, well-formatted data. Raw data must be cleaned and formatted before we can get the most use out of it, and that cleaning and formatting is exactly what the data preprocessing task does.

Why do we need Data Preprocessing?

Data collected from the real world may be noisy, incomplete, or in an unusable format. Data preprocessing is an essential task that eliminates unnecessary information in the data in order to prepare it for a machine learning model. This process also has the effect of improving accuracy and efficiency.

Data preprocessing also plays a very important role in data mining. Good data is essential for building any machine learning model that predicts a target value, so proper and effective preprocessing techniques are required to get the best results from the various machine learning algorithms.

Data preprocessing involves the following seven steps.

  1. Getting the Dataset

To create a machine learning model, the first thing we need is a dataset. The data serves as the input to our algorithm and determines whether it can generate the expected output. A dataset is a collection of data for a particular problem, stored in a proper format.

Different datasets are geared toward different purposes; for example, the dataset needed for a business problem is different from the one needed to study liver disease. Every dataset is different, and catering to the needs of each one can be a challenge. If you intend to use your data in code, it will most often come from a CSV file, although some datasets come as HTML or .xlsx files.

What is a CSV File?

CSV or “Comma-Separated Values” files are formatted as a list of values. There are several different variations in how this file is formatted. CSV files can be read and written by a variety of applications, including Microsoft Excel, Google Docs, Tableau, Neo4J, RStudio, and Python.

As the name suggests, each line in a comma-separated values file consists of records made up of fields separated by commas. Records are delineated by newlines in the file, though CSV files can be configured to contain an extra comma at the end of a record.
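For illustration, the first few rows of the demo dataset used in the rest of this article look roughly like this (the exact header names are an assumption):

  • Country,Age,Salary,Purchased  
  • India,38,68000,No  
  • France,43,45000,Yes  
  • Germany,30,54000,No  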

Here, we will use a demo dataset for data preprocessing. For practice, it can be downloaded from Super Data Science. For real-world problems, we can download datasets from online sources such as Kaggle.

  2. Importing Libraries

When performing data preprocessing with Python, we need to import some predefined libraries, since data preprocessing is one of the major tasks a data scientist performs.

There are three key libraries of importance:

Numpy: This library is used for mathematical operations in code. It is the fundamental package for scientific computation in Python, and it also supports creating large, multidimensional arrays. It can be imported as:

  • import numpy as nm  

Here, nm is the short alias for the NumPy package, and it will be used throughout the program.

Matplotlib: The second library is a plotting library for creating 2D graphs. To use it, we need to import its pyplot sub-library. Matplotlib can generate charts of any type in Python. It is imported like this:

  • import matplotlib.pyplot as mtp  

The rest of the code refers to this library as mtp.

Pandas: The last library is Pandas, which is one of the most renowned Python libraries. It is used for loading and managing datasets. The pandas library is an open-source package for data manipulation and analysis.

  • import pandas as pd

Here, pd is the short name for the library.

  3. Importing the Datasets

We need to import the datasets we collected for our machine learning project. To import a dataset, we first need to set the directory containing it as our working directory. Open the Spyder IDE and set the working directory as follows:

  • Save your Python file in the same directory as the dataset.
  • Go to the File Explorer option in the Spyder IDE and select that directory.
  • Press F5, or click the Run option.

We can set any directory that contains the required data as the working directory.

The folder now holds both the Python file and the dataset, and it is set as the current working directory.

The read_csv() function:

The pandas module includes the read_csv() function, which is used to import a CSV file and work with it. The file can be read locally from the working directory or fetched from a URL, such as one on quandl.com.

We can use the read_csv function as follows:

  • data_set= pd.read_csv('Dataset.csv')  

This line reads the CSV file and stores the dataset in the variable data_set, successfully importing the data into our program. We can also check the imported dataset by clicking on the Variable Explorer section and then double-clicking on data_set.

Indexing begins at 0. We can also change how the dataset is displayed by clicking on “Format”.
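Outside of Spyder, a quick way to inspect the imported data is with pandas itself (a minimal sketch, assuming Dataset.csv sits in the working directory):

  • print(data_set.shape)           # (number of rows, number of columns)  
  • print(data_set.head())          # first five rows of the dataset  
  • print(data_set.isnull().sum())  # count of missing values per column  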

Extracting dependent and independent variables

Machine learning is the science of gaining insight from structured data, and it is crucial to distinguish the features (independent variables) from the outcome (dependent variable). In our dataset, there are three independent variables: Country, Age of Employee, and Gross Salary. The dependent variable is Purchased.

Extracting independent variable

We use the iloc[] method of the pandas library to extract the independent variables. It selects the required rows and columns from the dataset.

  • x= data_set.iloc[:,:-1].values  

In the code above, the first colon (:) takes all rows and the second selects the columns. We have used :-1 because we do not want to take the last column, which contains the dependent variable. By doing this, we get the matrix of features.

Output on executing the aforementioned code:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As you can see, the output contains only the three independent-variable columns.

Extracting dependent variable

We will again use the pandas .iloc[] method to extract the dependent variable.

  • y= data_set.iloc[:,3].values  

We have taken all the rows and only the last column, which gives us the vector of the dependent variable.

Output on executing the aforementioned code:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

Note: if you are using Python for machine learning, extracting the matrix of features and the dependent variable vector in this way is mandatory. If you are using R instead of Python, this explicit extraction step is not necessary.

  4. How to handle missing data

The next step of data preprocessing is to address missing values in the dataset. If a dataset contains any missing values, it may create problems for our machine learning model, so it’s necessary to handle these values when we’re processing the data.

Here are some ways to handle missing data:

  1. The first way to deal with null values is to delete the row or column that contains them. However, this approach is not very efficient, and removing data may discard important information and lead to inaccurate output.
  2. The second way is to fill in each missing value with the mean of its column. This strategy is useful for numeric features such as age, salary, and year, and it is the approach we will use here.

To handle the missing values, we can use the scikit-learn library, which contains many tools for building machine learning models. In this case, we use the Imputer class from the sklearn.preprocessing library (a newer alternative, SimpleImputer, is shown further below).

Here is the code:

  • #handling missing data (replacing missing data with the mean value)  
  • from sklearn.preprocessing import Imputer  
  • imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)  
  • #Fitting the imputer object to the independent variables x  
  • imputer= imputer.fit(x[:, 1:3])  
  • #Replacing missing data with the calculated mean value  
  • x[:, 1:3]= imputer.transform(x[:, 1:3])  

Output:

array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)

The missing values have been replaced with the mean of the remaining values in their respective columns.
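Note that the Imputer class used above belongs to older versions of scikit-learn and has since been removed. A minimal sketch of the same step with the newer SimpleImputer API (scikit-learn 0.22+), assuming x is the same matrix of features:

  • import numpy as np  
  • from sklearn.impute import SimpleImputer  
  • # replace NaN entries with the column-wise mean of the Age and Salary columns  
  • imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  
  • x[:, 1:3] = imputer.fit_transform(x[:, 1:3])  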

  5. How to encode categorical data

Structured datasets often contain categorical variables.

Before they can be used in machine learning algorithms, categorical values must be encoded as numbers.

For Country variable:

In order to train the model, we first need to convert this categorical data into numbers. We will use LabelEncoder() from the preprocessing library.

  • #Categorical data  
  • #for Country Variable  
  • from sklearn.preprocessing import LabelEncoder  
  • label_encoder_x= LabelEncoder()  
  • x[:, 0]= label_encoder_x.fit_transform(x[:, 0])  

Output:

Out[15]:

array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)

Explanation:

We imported the LabelEncoder class of scikit-learn into our code, and it successfully encoded the country names into the digits 0, 1, and 2.

However, these digits imply an ordering among the three countries that does not really exist, which might lead the model to wrong conclusions. To solve this issue, we will use dummy variables.

Dummy Variables:

A dummy variable is a variable that takes only the values 0 and 1: a 1 marks the presence of a particular category and a 0 its absence, so each category gets its own column.
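With the three countries in our dataset (LabelEncoder assigns the codes alphabetically, so France = 0, Germany = 1, India = 2), the dummy encoding produces three columns:

  • France  → 1 0 0  
  • Germany → 0 1 0  
  • India   → 0 0 1  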

We will use the OneHotEncoder class of the preprocessing library, which performs the dummy encoding for the three categories.

  • #for Country Variable  
  • from sklearn.preprocessing import LabelEncoder, OneHotEncoder  
  • label_encoder_x= LabelEncoder()  
  • x[:, 0]= label_encoder_x.fit_transform(x[:, 0])  
  • #Encoding for dummy variables  
  • onehot_encoder= OneHotEncoder(categorical_features= [0])    
  • x= onehot_encoder.fit_transform(x).toarray()  

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01, 6.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01, 4.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01, 5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01, 6.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01, 6.52222222e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01, 5.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01, 5.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01, 7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01, 8.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01, 7.70000000e+04]])

As one can see, the country variable has been replaced by three dummy columns containing only 0s and 1s, while the remaining two columns hold the age and salary values.

This can be seen more clearly in the Variable Explorer section by double-clicking on x.
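In newer scikit-learn versions, the categorical_features argument of OneHotEncoder is no longer available. A minimal sketch of the same dummy encoding with the ColumnTransformer API, assuming x still holds the country names in its first column (in this form, the LabelEncoder step is not even needed, since OneHotEncoder handles strings directly):

  • from sklearn.compose import ColumnTransformer  
  • from sklearn.preprocessing import OneHotEncoder  
  • # one-hot encode column 0 (Country) and pass the remaining columns through unchanged  
  • ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')  
  • x = ct.fit_transform(x)  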

Purchased Variable:

  • labelencoder_y= LabelEncoder()  
  • y= labelencoder_y.fit_transform(y)  

For the second categorical variable, Purchased, we only use a LabelEncoder object. We do not need OneHotEncoder here because this variable has only two categories, “yes” and “no,” which are encoded directly into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be inspected in the Variable Explorer section.

  6. Split the data set – Training set and Test set

When performing data preprocessing for machine learning, it is important to split your dataset into a training set and a test set.

If we train and test a machine learning model on the exact same data, we cannot tell whether it has truly learned the underlying correlations or simply memorized the examples.

A model that performs very well on its training data can still perform poorly on new, unseen data; keeping a separate test set lets us measure how well the model generalizes.

Defining the two subsets:

Training set: The subset of the dataset used to train the machine learning model. Here, we already know the expected output.

Test set: The subset of the dataset used to evaluate the model, by having it predict outputs it did not see during training.

The following code splits the dataset:

  • from sklearn.model_selection import train_test_split  
  • x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)  

Explanation:

The first line of the code above imports the train_test_split function, which separates the dataset into training and test subsets.

In the second line, we declare the four variables that will hold our output:

  • x_train: features for the training data
  • x_test: features for the testing data
  • y_train: dependent variable for the training data
  • y_test: dependent variable for the testing data

In the train_test_split() function, we passed four arguments. The first two are the arrays of data (x and y). The test_size parameter specifies the proportion of the dataset used for testing; here 0.2 means 20% of the rows go to the test set and 80% to the training set, and values between 20% and 30% are common, depending on the dataset.

The last parameter, random_state, sets the seed of the random number generator so that you always get the same split across runs; 42 is the most commonly used value.

Output:

Executing this code generates four new variables, which are listed in the Variable Explorer section. The x and y arrays are each divided into training and test parts according to the 80/20 split.
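As a quick sanity check of the split, we can print the shapes of the four new arrays (with the 10-row dataset above, an 80/20 split gives 8 training rows and 2 test rows, and x has 5 columns after the dummy encoding):

  • print(x_train.shape, x_test.shape)   # expected: (8, 5) (2, 5)  
  • print(y_train.shape, y_test.shape)   # expected: (8,) (2,)  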

  7. Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique that puts the values of the independent variables on a common scale and range.

Scaling ensures that no single variable dominates the others simply because it is measured on a larger scale.

Some machine learning models rely on Euclidean distance, so age and salary must be brought onto comparable scales to produce the desired effect.

Euclidean distance:

If we compute the distance between any two data points using raw age and salary values, the salary differences will completely dominate the age differences. We therefore need feature scaling so that every feature has a balanced effect on the model.
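An illustrative sketch of this effect, using two hypothetical rows similar to the ones in the dataset above:

  • import numpy as np  
  • a = np.array([38.0, 68000.0])   # (age, salary) of one person  
  • b = np.array([43.0, 45000.0])   # (age, salary) of another  
  • # Euclidean distance = sqrt((38-43)^2 + (68000-45000)^2) ≈ 23000  
  • # the age difference of 5 is negligible next to the salary difference of 23000  
  • print(np.sqrt(np.sum((a - b) ** 2)))  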

There are two common methods of performing feature scaling in machine learning:

  • Standardisation
  • Normalisation
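For reference, standardisation rescales each feature to zero mean and unit variance, x' = (x − mean(x)) / std(x), while normalisation (min-max scaling) rescales each feature into the range 0 to 1, x' = (x − min(x)) / (max(x) − min(x)).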

In this dataset, we will use the standardisation method. For feature scaling, we import the StandardScaler class of sklearn.preprocessing as follows:

  • from sklearn.preprocessing import StandardScaler  

We create an object of the StandardScaler class for the independent variables (features) and then fit and transform the training set:

  • st_x= StandardScaler()  
  • x_train= st_x.fit_transform(x_train)  

For the test set, we apply only the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.

  • x_test= st_x.transform(x_test)  

Output:

Executing the lines of code above returns the scaled values of x_train and x_test, which can again be inspected in the Variable Explorer.

As the output shows, all input variables are now on a comparable scale, centred around 0 with roughly unit variance.

In this project, our dependent variable is binary, with only the two values 0 and 1, so it does not need scaling. In other projects, the dependent variable may cover a wider range of values (e.g., 0-100 or 1-5).

Combining them all

Now that we understand every single step, let’s combine them into one complete program.

  • # importing libraries  
  • import numpy as nm  
  • import matplotlib.pyplot as mtp  
  • import pandas as pd  
  • #importing datasets  
  • data_set= pd.read_csv('Dataset.csv')  
  • #Extracting Independent Variable  
  • x= data_set.iloc[:, :-1].values  
  • #Extracting Dependent variable  
  • y= data_set.iloc[:, 3].values  
  • #handling missing data(Replacing missing data with the mean value)  
  • from sklearn.preprocessing import Imputer  
  • imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)  
  • #Fitting the imputer object to the independent variables x  
  • imputer= imputer.fit(x[:, 1:3])  
  • #Replacing missing data with the calculated mean value  
  • x[:, 1:3]= imputer.transform(x[:, 1:3])  
  • #for Country Variable  
  • from sklearn.preprocessing import LabelEncoder, OneHotEncoder  
  • label_encoder_x= LabelEncoder()  
  • x[:, 0]= label_encoder_x.fit_transform(x[:, 0])  
  • #Encoding for dummy variables  
  • onehot_encoder= OneHotEncoder(categorical_features= [0])    
  • x= onehot_encoder.fit_transform(x).toarray()  
  • #encoding for purchased variable  
  • labelencoder_y= LabelEncoder()  
  • y= labelencoder_y.fit_transform(y)  
  • # Splitting the dataset into training and test set.  
  • from sklearn.model_selection import train_test_split  
  • x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)  
  • #Feature Scaling of datasets  
  • from sklearn.preprocessing import StandardScaler  
  • st_x= StandardScaler()  
  • x_train= st_x.fit_transform(x_train)  
  • x_test= st_x.transform(x_test)  

Depending on the dataset, we may not need every one of these preprocessing steps, so the code can be split up and adapted to stay flexible. So, in a nutshell, these are 7 simple data preprocessing techniques in Machine Learning.

Happy learning!

Reference:

Data Preprocessing Guide

Data processing technique for machine learning with Python 

Super data science blogs

Data sets in machine learning