If your dataset contains categorical variables and you want to know how to handle them in machine learning, this tutorial is for you. In this article, you will learn the most popular technique for encoding categorical variables, along with Python code. So give this article a few minutes and clear your doubts.
Without further ado, let’s get started.
How to Deal with Categorical Variables in Machine Learning with Python?
Before we dive into the techniques for handling categorical variables, let’s first understand what categorical variables are.
What are Categorical variables?
Categorical variables assign each observation to one of a fixed set of categories or labels. Their values are usually text rather than numbers, and most machine learning models work only with numbers, which is why we need to convert this textual data into numerical form.
For example, in this dataset, the “Country” variable has 3 categories: France, Spain, and Germany. A machine learning model cannot do arithmetic on these strings or compute correlations between them, and that’s why we need to convert them into numbers.
Some more examples of categorical variables are:
- A “weather” variable with the values: Sunny, Cloudy, and Rainy.
- A “color” variable with the values: Green, Yellow, and Red
There are various methods for converting these strings into numbers, but the most popular one is one-hot encoding. Now let’s understand one-hot encoding in detail.
One hot Encoding
One-hot encoding turns the Country column into 3 separate columns. Why only three columns?
Because the Country variable has 3 distinct categories in total: France, Spain, and Germany. If there were 5 different countries, we would turn this column into 5 columns.
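To make the "one column per category" rule concrete, here is a minimal sketch using pandas. The sample values are hypothetical and only mirror the Country example; `pd.get_dummies` is one convenient way to see the column count, not necessarily the method used later in this tutorial.

```python
import pandas as pd

# Hypothetical sample mirroring the article's "Country" column.
countries = pd.Series(
    ["France", "Spain", "Germany", "Spain", "Germany"], name="Country"
)

# get_dummies creates one indicator column per distinct category.
dummies = pd.get_dummies(countries)

n_categories = countries.nunique()  # number of distinct countries
n_columns = dummies.shape[1]        # number of generated columns
print(n_categories, n_columns)
```

Both numbers are 3 here, and they would both be 5 if the column held 5 different countries.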
One more important point is that one-hot encoding creates a binary vector for each country. That means every value of the categorical variable is represented using only 0s and 1s. Let me explain this in detail.
As mentioned, one-hot encoding turns the Country column into 3 different columns. After creating the three columns and filling them with 0s and 1s, the result looks like this:
Country | France | Spain | Germany |
France | 1 | 0 | 0 |
Spain | 0 | 1 | 0 |
Germany | 0 | 0 | 1 |
Spain | 0 | 1 | 0 |
Germany | 0 | 0 | 1 |
France | 1 | 0 | 0 |
Spain | 0 | 1 | 0 |
France | 1 | 0 | 0 |
Germany | 0 | 0 | 1 |
France | 1 | 0 | 0 |
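How are these values filled?
Each row gets a 1 in the column that matches its country and a 0 in the other two columns. For example, a row where Country is France becomes 1, 0, 0, while a row where Country is Spain becomes 0, 1, 0.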
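The filling rule can also be written out by hand. The following is a small sketch, assuming the ten Country values from the table above and the same column order (France, Spain, Germany); the variable names are only illustrative.

```python
import numpy as np

# The ten Country values from the table above.
countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

# Column order chosen to match the table.
categories = ["France", "Spain", "Germany"]

# Start with all zeros, then set a single 1 per row,
# in the column that matches that row's country.
one_hot = np.zeros((len(countries), len(categories)), dtype=int)
for row, country in enumerate(countries):
    one_hot[row, categories.index(country)] = 1

print(one_hot)
```

Every row of the result contains exactly one 1, which is where the name "one-hot" comes from.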
Now let’s see how to implement one-hot encoding in Python.
One hot Encoding in Python
For the implementation, I am going to use a small dataset so that it is easy to follow. In this dataset, we have two categorical variables, “Country” and “Purchased”, and we have to convert this textual data into numerical form.
So the first step is:
1. Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
NumPy is an open-source Python library used to perform various mathematical and scientific tasks. NumPy is used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transform, and matrices.
Matplotlib is a plotting library used for creating figures, adding plotting areas to a figure, drawing lines in a plotting area, decorating plots with labels, and so on.
Pandas is a tool used for data wrangling and analysis.
So in step 1, we imported all required libraries. Now the next step is-
2. Load the Dataset
dataset = pd.read_csv('Data.csv')
As you can see in the dataset, there are 3 independent variables and 1 dependent variable. That’s why we need to split the dataset into the independent variables, X, and the dependent variable, y. So the next step is:
3. Split Dataset into X and Y
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
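If you do not have Data.csv at hand, the loading and splitting steps can be tried on an inline stand-in. This is only a sketch: the article names the "Country" and "Purchased" columns, while the "Age" and "Salary" column names and all the values below are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for Data.csv; the Age and Salary column names
# and every value here are assumed, not taken from the real file.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, -1].values   # the last column only

print(X.shape, y.shape)
```

Using -1 for the last column gives the same result as index 3 on a four-column dataset, and it keeps working if more feature columns are added later.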
Now we have split the dataset into X and y. But as you can see in the dataset, there are some missing values too. I have already written an article on How to Handle Missing Values in Machine Learning, which you can check. So the next step is:
4. Handling Missing Values
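# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
# Replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])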
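To see what mean imputation does, here is a minimal sketch on a tiny array. It uses `SimpleImputer` from `sklearn.impute`, which is the class available in current scikit-learn releases; the numbers are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one gap in each.
values = np.array([[40.0, 70000.0],
                   [30.0, np.nan],
                   [np.nan, 50000.0]])

# Each NaN is replaced by the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

print(filled)
```

The missing entry in the first column becomes 35 (the mean of 40 and 30), and the missing entry in the second column becomes 60000 (the mean of 70000 and 50000).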
After handling missing values, it’s time to apply one hot encoding to the dataset.
5. Encoding categorical data
First we encode the independent variable “Country”-
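from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# The categorical_features argument was removed in scikit-learn 0.22,
# so column 0 is selected through ColumnTransformer instead
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = np.array(ct.fit_transform(X))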
Why did I use 0?
Because the Country variable is at column index 0.
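Here is a self-contained sketch of one-hot encoding column 0 in a way that runs on recent scikit-learn releases, using `ColumnTransformer` together with `OneHotEncoder`. The three rows and the transformer name "country" are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: one text column followed by two numeric columns.
X = np.array([["France", 44.0, 72000.0],
              ["Spain", 27.0, 48000.0],
              ["Germany", 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],  # encode column 0
    remainder="passthrough",  # keep the other columns unchanged
    sparse_threshold=0,       # always return a dense array
)
X_encoded = np.array(ct.fit_transform(X))

print(X_encoded.shape)
print(X_encoded)
```

The single text column is replaced by three indicator columns, so the array grows from 3 to 5 columns. Note that scikit-learn orders the categories alphabetically (France, Germany, Spain), so the column order differs from the table shown earlier.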
After running this code, the “Country” column is replaced by three columns of 0s and 1s, one for each country.
Now let’s encode the dependent variable “Purchased”-
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
The values of the dependent variable “Purchased” are now converted into 0 and 1, where 0 means No and 1 means Yes.
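To check which label was mapped to which number, you can inspect the fitted encoder. This is a small sketch with hypothetical “Purchased” values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical "Purchased" values.
purchased = ["No", "Yes", "No", "Yes", "Yes"]

encoder = LabelEncoder()
y = encoder.fit_transform(purchased)

# classes_ lists the labels in the order of their integer codes.
print(list(encoder.classes_))
print(y.tolist())

# inverse_transform maps the codes back to the original labels.
print(encoder.inverse_transform([0, 1]).tolist())
```

LabelEncoder sorts the labels before numbering them, which is why "No" gets 0 and "Yes" gets 1.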
So this is all about encoding the categorical variables. I hope you understood the concept easily. Now it’s time to wrap up.
Conclusion
In this article, I have discussed how to deal with categorical variables in machine learning. If you have any questions, feel free to ask me in the comment section. But if you found this article helpful, kindly share it with others.
All the Best!
Happy Learning!
Written By Aqsa Zafar
Founder of MLTUT, Machine Learning Ph.D. scholar at Dayananda Sagar University. Research on social media depression detection. Create tutorials on ML and data science for diverse applications. Passionate about sharing knowledge through website and social media.