Handle missing values in Categorical Features
An useful guide to a proper deal with missing categorical data, with use cases
In this post, it will be shown how to deal with categorical features with missing values with several examples compared to each other. It will be used the Classified Ads for Cars dataset to predict the price of ADs through a simple model of Linear Regression.
To show the various strategies and relevant pros / cons, we will focus on a particular categorical feature of this dataset, the maker, the name of the brand of cars (Toyota, Kia, Ford, Bmw, …).
Post Steps:
- Show Raw Data: let’s see how our dataset looks like.
- Deal with missing values in Categorical Features: we will deal missing values by comparing different techniques.
- 1 — Delete the entire column maker.
- 2 — Replace missing values with the most frequent values.
- 3 — Delete rows with null values.
- 4 — Predict values using a Classifier Algorithm (supervised or unsupervised).
- Conclusions!
Show Raw Data
Let’s start importing some libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
First of all let’s see how our dataset looks like
filename = "cars.csv"
dtypes = {
"maker": str, # brand name
"model": str,
"mileage": float, # km
"manufacture_year": float,
"engine_displacement": float,
"engine_power": float,
"body_type": str, # almost never present
"color_slug": str, # also almost never present
"stk_year": str,
"transmission": str, # automatic or manual
"door_count": str,
"seat_count": str,
"fuel_type": str, # gasoline or diesel
"date_created": str, # when the ad was scraped
"date_last_seen": str, # when the ad was last seen
"price_eur": float} # list price converted to EUR
df_cleaned = pd.read_csv(filename, dtype=dtypes)
print(f"Raw data has {df_cleaned.shape[0]} rows, and {df_cleaned.shape[1]} columns")
Raw data has 3552912 rows, and 16 columns
After cleaning all columns from missing data and not useful features (the whole procedure is shown on my github), with the exception of the maker,
we will find ourselves in this situation:
# Missing values
print(df_cleaned.isna().sum())maker 212897
mileage 0
manufacture_year 0
engine_displacement 0
engine_power 0
price_eur 0
fuel_type_diesel 0
fuel_type_gasoline 0
ad_duration 0
seat_str_large 0
seat_str_medium 0
seat_str_small 0
transmission_auto 0
transmission_man 0
dtype: int64
Correlation Matrix
corr = df_cleaned.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns,yticklabels=corr.columns, annot=True, )
Deal with missing values in Categorical Features
Now we just have to handle the maker feature, and we will do it in four different ways. Then we will create a simple model of Linear Regression for each ways to predict the price.
- 1st Model: Delete the entire column maker.
- 2nd Model: Replace missing values with the most frequent values.
- 3rd Model: Delete rows with null values.
- 4th Model: Predict the missing values with the RandomForestClassifier.
mse_list = []
r2_score_list = []
def remove_outliers(dataframe):
'''
return a dataframe without rows that are outliers in any column
'''
return dataframe\
.loc[:, lambda df: df.std() > 0.04]\
.loc[lambda df: (np.abs(stats.zscore(df)) < 3).all(axis=1)]
def plot_regression(Y_test, Y_pred):
'''
method that plot a linear regression line on a scatter plot
'''
x = Y_test
y = Y_pred
plt.xlabel("True label")
plt.ylabel("Predicted label")
plt.plot(x, y, 'o')
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b)
def train_and_score_regression(df):
df_new = remove_outliers(df)
# split the df
X = df_new.drop("price_eur", axis=1).values
Y = np.log1p(df_new["price_eur"].values)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,
test_size=0.1, random_state=0)
# train and test of the model
ll = LinearRegression()
ll.fit(X_train, Y_train)
Y_pred = ll.predict(X_test)
mse_list.append(mean_squared_error(Y_test, Y_pred))
r2_score_list.append(r2_score(Y_test, Y_pred))
# print the metrics
print("MSE: "+str(mean_squared_error(Y_test, Y_pred)))
print("R2 score: "+str(r2_score(Y_test, Y_pred)))
# plot the regression
plot_regression(Y_test, Y_pred)
Delete the entire column Maker
Our first basic approach will be to create a model without the column maker. This “no-deal” practice is necessary when the large amount of missing data threatens to invalidate the entire feature.
With this scenario we will probably have the worst scores on metrics, but we can use them to compare the other methods
df_no_maker = df_cleaned.copy()
# delete the entire column maker
df_no_maker = df_no_maker.drop("maker", axis="columns")
train_and_score_regression(df_no_maker)# MSE: 0.1384341569294906
# R2 score: 0.8401186412746953
This is our first attempt. Can we improve it?
Replace missing values with the most frequent data
Certainly more effective method is to assign the missing values with the most frequent data: the mode. But be careful that this could lead to an unbalanced dataset if the missing values were a considerable number.
df_replace_mode = df_cleaned.copy()
# replace missing values with the mode
replace_with = df_replace_mode["maker"].mode()
df_replace_mode["maker"].fillna(replace_with,inplace=True)
df_replace_mode = pd.get_dummies(df_replace_mode,columns=["maker"])
train_and_score_regression(df_replace_mode)# MSE: 0.10243504754361979
# R2 score: 0.8703379824571595
As expected, we got better scores on metrics than before without losing any row
Delete rows with null values
Another option is to delete rows that have null values. Absolutely not recommended if our dataset is very small, but easily feasible if there are missing only a few values or if we have a very large dataset
df_del_rows = df_cleaned.copy()
# deleteing row with null maker
df_del_rows = df_del_rows[df_del_rows['maker'].notna()]
df_del_rows = pd.get_dummies(df_del_rows,columns=["maker"])
train_and_score_regression(df_del_rows)# MSE: 0.10465841311348424
# R2 score: 0.8580395349902117
This time we got worse scoring metrics than the previous attempt. Certainly partly caused by the fact that this time we have decided to delete some rows, which made us lose information
Predict the missing values with the RandomForestClassifier
The most interesting approach is surely the predicting of missing values with a classification algorithm. This will give us the opportunity not to waste a good chunk of the dataset, and thus a large amount of information. If our predictions will be accurate enough, with this technique we should have the best scoring metrics.
Therefore:
- we have to split the dataset between rows with the column maker valorized, and rows with null values.
- the first dataframe will become the one on which we will create the classification model with the maker as target feature.
- use the model thus created to predict the missing values from the dataframe with null values
- merge the two dataframe into one
- training the linear regression model as before
Let’s split the dataset in two
df_with_maker = df_cleaned[df_cleaned['maker'].notna()]
print("N. rows with maker not null:", df_with_maker.shape[0])
df_no_maker = df_cleaned[df_cleaned['maker'].isna()]
print("N. rows with maker null:", df_no_maker.shape[0])# N. rows with maker not null: 1690186
# N. rows with maker null: 212897
Here we create the classification model and take a look at its metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
X = df_with_maker.drop("maker", axis=1).values
Y = df_with_maker["maker"].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
forest = RandomForestClassifier(n_estimators=45, max_depth=25, random_state=False,
max_features=0.6, min_samples_leaf=3, n_jobs=-1)
forest.fit(X_train, Y_train)
y_pred_train = forest.predict(X_train)
y_pred = forest.predict(X_test)
y_pred_proba = forest.predict_proba(X_test)
accuracy_train = accuracy_score(Y_train, y_pred_train)
accuracy_test = accuracy_score(Y_test, y_pred)
print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))
print("LOG LOSS: "+str(log_loss(Y_test, y_pred_proba)))
importances = forest.feature_importances_
indices = list(np.argsort(importances))[::-1]
plt.title("Feature importances")
plt.barh(range(len(indices)), importances[indices], color="g", align="center")
plt.yticks(range(len(indices)), df_with_maker.iloc[:, 1:].columns[indices])
# plt.ylim([-1, len(indices)])
plt.gca().invert_yaxis()# ACCURACY: TRAIN=0.9424 TEST=0.8962
# LOG LOSS: 0.3384233751455516
Unsurprisingly the model tells us that engine_displacement and engine_power are the key features able to define the maker
Now that we have the model, let’s fill the 2nd dataframe with the values of the prediction…
df_no_maker = df_no_maker.drop('maker', axis=1)
prediction = forest.predict(df_no_maker)
df_no_maker.insert(0, 'maker', prediction)
…and then, merge the two dataframe together
frames = [df_with_maker, df_no_maker]
df_final = pd.concat(frames)
Here’s our final dataframe
df_final.head()
Now that our final dataset is ready we can re-create the linear regression model and check if we have improved its metrics
df_final = pd.get_dummies(df_final,columns=["maker"])
train_and_score_regression(df_final)# MSE: 0.08867003817099878
# R2 score: 0.8825117873840629
Here’s we can see the best metrics values thanks to the rows saved through the aforementioned dataset merge
Conclusions!
Our considerations are driven by metrics. Deleting the feature from the dataset should be our last resort. Replacing null data with the most frequent ones or deleting the rows can be a convenient solution in case we have few missing data or a very large dataset. On the contrary, in case we have a lot of missing values or a small dataset, a prediction/clustering could save us valuable information that would otherwise be lost.
options_list = [
"Delete the entire column Maker",
"Replace null values with the mode",
"Delete rows with null values",
"Predict the missing values"
]
df_metrics = pd.DataFrame({
"": options_list,
"MSE": mse_list,
"R2_score": r2_score_list,
})
df_metrics.head()