Handle missing values in Categorical Features

An useful guide to a proper deal with missing categorical data, with use cases

Published in

Analytics Vidhya

7 min readNov 17, 2020

In this post, it will be shown how to deal with categorical features with missing values with several examples compared to each other. It will be used the Classified Ads for Cars dataset to predict the price of ADs through a simple model of Linear Regression.

To show the various strategies and relevant pros / cons, we will focus on a particular categorical feature of this dataset, the maker, the name of the brand of cars (Toyota, Kia, Ford, Bmw, …).

Post Steps:

Show Raw Data: let’s see how our dataset looks like.
Deal with missing values in Categorical Features: we will deal missing values by comparing different techniques.
1 — Delete the entire column maker.
2 — Replace missing values with the most frequent values.
3 — Delete rows with null values.
4 — Predict values using a Classifier Algorithm (supervised or unsupervised).
Conclusions!

Show Raw Data

Let’s start importing some libraries

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

First of all let’s see how our dataset looks like

filename = "cars.csv"

dtypes = {
    "maker": str, # brand name
    "model": str,
    "mileage": float, # km
    "manufacture_year": float,
    "engine_displacement": float,
    "engine_power": float,
    "body_type": str, # almost never present
    "color_slug": str, # also almost never present
    "stk_year": str,
    "transmission": str, # automatic or manual
    "door_count": str,
    "seat_count": str,
    "fuel_type": str, # gasoline or diesel
    "date_created": str, # when the ad was scraped
    "date_last_seen": str, # when the ad was last seen
    "price_eur": float} # list price converted to EUR

df_cleaned = pd.read_csv(filename, dtype=dtypes)
print(f"Raw data has {df_cleaned.shape[0]} rows, and   {df_cleaned.shape[1]} columns")

Raw data has 3552912 rows, and 16 columns

After cleaning all columns from missing data and not useful features (the whole procedure is shown on my github), with the exception of the maker,
we will find ourselves in this situation:

# Missing values
print(df_cleaned.isna().sum())maker                  212897
mileage                     0
manufacture_year            0
engine_displacement         0
engine_power                0
price_eur                   0
fuel_type_diesel            0
fuel_type_gasoline          0
ad_duration                 0
seat_str_large              0
seat_str_medium             0
seat_str_small              0
transmission_auto           0
transmission_man            0
dtype: int64

Correlation Matrix

corr = df_cleaned.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns,yticklabels=corr.columns, annot=True, )

Deal with missing values in Categorical Features

Now we just have to handle the maker feature, and we will do it in four different ways. Then we will create a simple model of Linear Regression for each ways to predict the price.

1st Model: Delete the entire column maker.
2nd Model: Replace missing values with the most frequent values.
3rd Model: Delete rows with null values.
4th Model: Predict the missing values with the RandomForestClassifier.

mse_list = []
r2_score_list = []

def remove_outliers(dataframe):
    '''
    return a dataframe without rows that are outliers in any column
    '''
    return dataframe\
    .loc[:, lambda df: df.std() > 0.04]\
    .loc[lambda df: (np.abs(stats.zscore(df)) < 3).all(axis=1)]

def plot_regression(Y_test, Y_pred):
    '''
    method that plot a linear regression line on a scatter plot
    '''
    x = Y_test
    y = Y_pred

    plt.xlabel("True label")
    plt.ylabel("Predicted label")
    plt.plot(x, y, 'o')

    m, b = np.polyfit(x, y, 1)

    plt.plot(x, m*x + b)

def train_and_score_regression(df):
    
    df_new = remove_outliers(df) 
    
    # split the df
    X = df_new.drop("price_eur", axis=1).values
    Y = np.log1p(df_new["price_eur"].values)

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,
    test_size=0.1, random_state=0)

    # train and test of the model
    ll = LinearRegression()
    ll.fit(X_train, Y_train)
    Y_pred = ll.predict(X_test)

    mse_list.append(mean_squared_error(Y_test, Y_pred))
    r2_score_list.append(r2_score(Y_test, Y_pred))
    
    # print the metrics
    print("MSE: "+str(mean_squared_error(Y_test, Y_pred)))
    print("R2 score: "+str(r2_score(Y_test, Y_pred)))
    
    # plot the regression 
    plot_regression(Y_test, Y_pred)

Delete the entire column Maker

Our first basic approach will be to create a model without the column maker. This “no-deal” practice is necessary when the large amount of missing data threatens to invalidate the entire feature.

With this scenario we will probably have the worst scores on metrics, but we can use them to compare the other methods

df_no_maker = df_cleaned.copy()

# delete the entire column maker
df_no_maker = df_no_maker.drop("maker", axis="columns")

train_and_score_regression(df_no_maker)# MSE: 0.1384341569294906
# R2 score: 0.8401186412746953

This is our first attempt. Can we improve it?

Replace missing values with the most frequent data

Certainly more effective method is to assign the missing values with the most frequent data: the mode. But be careful that this could lead to an unbalanced dataset if the missing values were a considerable number.

df_replace_mode = df_cleaned.copy()

# replace missing values with the mode
replace_with = df_replace_mode["maker"].mode() 
df_replace_mode["maker"].fillna(replace_with,inplace=True)

df_replace_mode = pd.get_dummies(df_replace_mode,columns=["maker"])

train_and_score_regression(df_replace_mode)# MSE: 0.10243504754361979
# R2 score: 0.8703379824571595

As expected, we got better scores on metrics than before without losing any row

Delete rows with null values

Another option is to delete rows that have null values. Absolutely not recommended if our dataset is very small, but easily feasible if there are missing only a few values or if we have a very large dataset

df_del_rows = df_cleaned.copy()

# deleteing row with null maker
df_del_rows = df_del_rows[df_del_rows['maker'].notna()]

df_del_rows = pd.get_dummies(df_del_rows,columns=["maker"])

train_and_score_regression(df_del_rows)# MSE: 0.10465841311348424
# R2 score: 0.8580395349902117

This time we got worse scoring metrics than the previous attempt. Certainly partly caused by the fact that this time we have decided to delete some rows, which made us lose information

Predict the missing values with the RandomForestClassifier

The most interesting approach is surely the predicting of missing values with a classification algorithm. This will give us the opportunity not to waste a good chunk of the dataset, and thus a large amount of information. If our predictions will be accurate enough, with this technique we should have the best scoring metrics.

Therefore:

we have to split the dataset between rows with the column maker valorized, and rows with null values.
the first dataframe will become the one on which we will create the classification model with the maker as target feature.
use the model thus created to predict the missing values from the dataframe with null values
merge the two dataframe into one
training the linear regression model as before

Let’s split the dataset in two

df_with_maker = df_cleaned[df_cleaned['maker'].notna()]
print("N. rows with maker not null:", df_with_maker.shape[0])

df_no_maker = df_cleaned[df_cleaned['maker'].isna()]
print("N. rows with maker null:", df_no_maker.shape[0])# N. rows with maker not null: 1690186
# N. rows with maker null: 212897

Here we create the classification model and take a look at its metrics

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

X = df_with_maker.drop("maker", axis=1).values
Y = df_with_maker["maker"].values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

forest = RandomForestClassifier(n_estimators=45, max_depth=25, random_state=False, 
                                max_features=0.6, min_samples_leaf=3, n_jobs=-1)

forest.fit(X_train, Y_train)

y_pred_train = forest.predict(X_train)
y_pred = forest.predict(X_test)

y_pred_proba = forest.predict_proba(X_test)

accuracy_train = accuracy_score(Y_train, y_pred_train)
accuracy_test = accuracy_score(Y_test, y_pred)

print("ACCURACY: TRAIN=%.4f TEST=%.4f" % (accuracy_train,accuracy_test))
print("LOG LOSS: "+str(log_loss(Y_test, y_pred_proba)))

importances = forest.feature_importances_
indices = list(np.argsort(importances))[::-1]

plt.title("Feature importances")
plt.barh(range(len(indices)), importances[indices], color="g", align="center")
plt.yticks(range(len(indices)), df_with_maker.iloc[:, 1:].columns[indices])
# plt.ylim([-1, len(indices)])
plt.gca().invert_yaxis()# ACCURACY: TRAIN=0.9424 TEST=0.8962
# LOG LOSS: 0.3384233751455516

Unsurprisingly the model tells us that engine_displacement and engine_power are the key features able to define the maker

Now that we have the model, let’s fill the 2nd dataframe with the values of the prediction…

df_no_maker = df_no_maker.drop('maker', axis=1)
prediction = forest.predict(df_no_maker)

df_no_maker.insert(0, 'maker', prediction)

…and then, merge the two dataframe together

frames = [df_with_maker, df_no_maker]
df_final = pd.concat(frames)

Here’s our final dataframe

df_final.head()

Now that our final dataset is ready we can re-create the linear regression model and check if we have improved its metrics

df_final = pd.get_dummies(df_final,columns=["maker"])

train_and_score_regression(df_final)# MSE: 0.08867003817099878
# R2 score: 0.8825117873840629

Here’s we can see the best metrics values thanks to the rows saved through the aforementioned dataset merge

Conclusions!

Our considerations are driven by metrics. Deleting the feature from the dataset should be our last resort. Replacing null data with the most frequent ones or deleting the rows can be a convenient solution in case we have few missing data or a very large dataset. On the contrary, in case we have a lot of missing values or a small dataset, a prediction/clustering could save us valuable information that would otherwise be lost.

options_list = [
    "Delete the entire column Maker", 
    "Replace null values with the mode", 
    "Delete rows with null values", 
    "Predict the missing values"
]

df_metrics = pd.DataFrame({
    "": options_list,
    "MSE": mse_list,
    "R2_score": r2_score_list,
})
df_metrics.head()