How to Encode Categorical Data in Machine Learning

Some of the prominent methods for encoding categorical data, explained


Machine learning models work on numeric data: their inputs and outputs are numbers, and they understand little else. In the real world, however, data is not always numeric; datasets often contain categorical data as well. A data scientist spends a lot of time converting that data into numbers. This conversion is called encoding, and it is a crucial step in achieving the desired results.

Categorical data describes attributes of a business, and it can be illustrated with some basic informational details:

  1. The gender of a customer can be male, female, or others.
  2. A customer lives in Delhi, Hyderabad, Mumbai, or Indore.
  3. A customer belongs to the segment of undergraduates, graduates, or postgraduates.

Categorical data is finite and not in numeric form. Information about a customer cannot always be expressed in numbers; he or she simply belongs to one of the mentioned categories. Categorical data is of two types:

Ordinal: Data in this category has an ordered relationship between the category labels, like educational qualification or customer ratings. All these values have an inherent order.

Nominal: Data in this category does not have any ordered relationship between the category labels, like a customer's city or whether they have a premium plan. These values have no inherent order.
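The distinction can be made concrete with pandas, which lets us declare a column as ordered or unordered. This is a small illustrative sketch (the category values are taken from the examples above):

```python
import pandas as pd

# Nominal: city has no inherent order
city = pd.Categorical(["Delhi", "Mumbai", "Indore"], ordered=False)

# Ordinal: plan has an inherent order, economy < deluxe < premium
plan = pd.Categorical(
    ["economy", "premium", "deluxe"],
    categories=["economy", "deluxe", "premium"],
    ordered=True,
)

print(city.ordered)            # False
print(plan.ordered)            # True
print(plan.min(), plan.max())  # min/max only make sense for ordered data
```

Note that calling `min()` or `max()` on the unordered `city` column would raise an error, which is exactly the point: order is meaningless for nominal data.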

Data types in machine learning

How to deal with categorical data?

Categorical data can't be fed directly to machines for training and testing, so it has to be converted into a format the machine understands. Categorical data is therefore converted into a numeric type so that it can contribute to machine learning.

While doing this conversion, a data scientist should preserve the basic nature of the data, i.e. ordinal and nominal data should be converted using different methods. Here is a list of the methods most commonly used for encoding categorical data:

  1. One-hot encoding
  2. Dummy encoding
  3. Label encoding
  4. Ordinal encoding
  5. Binary encoding
  6. Mean encoding

Let us take some sample data to explain all the encoding styles. The sample data has three categorical columns, holding data about gender, city, and plan taken.

Sample dataset
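Since the sample dataset appears only as an image, here is a hypothetical reconstruction with the three columns described; the exact rows may differ from the original:

```python
import pandas as pd

# Hypothetical sample data: the original table is an image, so these
# rows are assumed, not copied from it. The columns match the article.
data = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "others"],
    "city":   ["Delhi", "Hyderabad", "Mumbai", "Indore", "Delhi"],
    "plan":   ["economy", "deluxe", "premium", "economy", "deluxe"],
})
print(data)
```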

Let us take the encoding methods one by one and see how they work.

One-Hot Encoding

It is one of the most common methods used for encoding nominal data. Here the encoder converts the categorical values into 0s and 1s. The column holding categorical values is expanded into as many new columns as there are distinct categorical values. Each value gets a 1 in its own named column and a 0 in all the others.

The newly created variables are also called dummy variables. If we have n distinct categorical values, we get n new columns, all zero except the one where the value exists.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# 'data' is the sample DataFrame with gender, city, and plan columns
ohc = ohe.fit_transform(data.city.values.reshape(-1, 1)).toarray()
ncity = pd.DataFrame(ohc, columns=["ncity_" + str(cat) for cat in ohe.categories_[0]])
new_data = pd.concat([data, ncity], axis=1)
new_data
One-hot encoding effect

Dummy Encoding

Dummy encoding is similar to one-hot encoding, with one minor difference in the number of columns. It works like one-hot encoding, i.e. it converts the data into binary values of 0 and 1, but if there are n categorical values, only n-1 new columns are created.

As with one-hot encoding, 1 marks presence and 0 marks absence, but one column is dropped; rows belonging to the dropped category are identified by zeros in all the remaining columns.

# drop_first=True drops one column, giving n-1 columns for n categories
new_city = pd.get_dummies(data.city, prefix="n", drop_first=True)
new_city
Dummy encoding sample

Label Encoding

This encoding works for both ordinal and nominal data, but only under certain conditions, which we will discuss soon. Here the labels start from 0 and go up as per the number of distinct categorical values in a column. Each category gets exactly one numeric label.

There is one issue with this encoding: the labels are not assigned according to the actual ordinal relationship but alphabetically, which is effectively arbitrary. As per our data, economy plan < deluxe plan < premium plan, so the encoding should preserve that order, e.g. 1, 2, 3 respectively. Now see what we actually get.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# labels are assigned alphabetically: deluxe -> 0, economy -> 1, premium -> 2
data['new_plan'] = le.fit_transform(data.plan)
data
Label encoding effect on data

Label encoding is used where nominal data has a large number of categorical values, because any other encoding would add a huge number of additional columns. It is likewise preferred where ordinal data has a large number of categorical values.

Ordinal Encoding

This is the encoding for ordinal data, and it overcomes the issue with the label encoding method. It resolves the question of priority while encoding, marking a higher number for a higher-grade value: premium gets the highest number, then deluxe, and finally economy gets 1.

# map each plan to its rank: economy < deluxe < premium
n_plan = {'economy': 1, 'deluxe': 2, 'premium': 3}
data['new_plan'] = data.plan.map(n_plan)
data
Ordinal encoded data

Binary Encoding

Binary encoding converts each categorical value into a binary representation of 0s and 1s. It looks similar to one-hot encoding, but there is one major difference between the two: in one-hot encoding, n values produce n new columns, while binary encoding needs only about ⌈log₂(n)⌉ columns, because each value's ordinal index is written out in binary. For example, the value indexed 3 generates the binary code 011, and the value indexed 2 generates 010.

import category_encoders as ce

enc = ce.BinaryEncoder(cols=['city'])
ncity = enc.fit_transform(data['city'])
data = pd.concat([data, ncity], axis=1)
data
Binary encoded data

Mean Encoding

Mean encoding, also referred to as target encoding, is found very commonly in Kaggle problems. Mean encoding resembles label encoding in operation, but here the encoding is not arbitrary; it is correlated with the target variable.

It is a Bayesian encoding technique: we calculate the mean of the target variable for each category and replace the categorical value with that mean.

import pandas as pd

data = pd.DataFrame({'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
                     'Marks': [33, 85, 72, 58, 41, 76, 48, 68]})
data
sample data
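The core calculation can be sketched with plain pandas. Note that category_encoders' TargetEncoder additionally smooths each category mean toward the global mean, so its numbers will differ slightly from this raw version:

```python
import pandas as pd

data = pd.DataFrame({
    "class": ["A", "B", "C", "B", "C", "A", "A", "A"],
    "Marks": [33, 85, 72, 58, 41, 76, 48, 68],
})

# Raw mean encoding: replace each class with the mean Marks of that class
data["class_mean"] = data.groupby("class")["Marks"].transform("mean")
print(data)
# class A mean = (33 + 76 + 48 + 68) / 4 = 56.25
```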

Now, after applying target encoding, the new result will be as under.

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['class'])
encoder.fit_transform(data['class'], data['Marks'])
target/mean encoded results

There are some more types of encoding used in other cases; I have tried to cover the major ones popularly used in various machine learning problems. Changing the type of encoding can change the final results of a machine learning model, and hence it is sometimes also used as a way of tweaking the model.

Technology enthusiast, Data Scientist, Entrepreneur, Digital Marketing expert.