__author__ = "Amoli Rajgor"
__email__ = "amoli.rajgor@gmail.com"
__website__ = "amolir.github.io"
The dataset provided appears to be a series of account statements (99976) with some pre-processed features (43) that each represent a time frame. As no exact timestamp is provided for the statements, it is unlikely that the data points are indexed in time order. We have to predict the probability of each customer (account holder) defaulting on their next payment based on the information in their account.
In the snapshot of the dataset below, we see:
# Hide warnings
import warnings
warnings.filterwarnings('ignore')
# Data manipulation
import pandas as pd
data = pd.read_csv("data/dataset.csv", sep=";")
print(data.shape)
data.head()
The dataset contains 43 features of mixed data types (numeric, boolean, nominal and ordinal) representing information related to the Account, Invoices, Payment, Merchant and Customer. We will try to build an intuition for each feature to understand its importance in predicting the likelihood of default. Note that banking-related terms are abbreviated for brevity.
print(pd.Series(data.columns))
We will use some basic data visualization techniques to analyse patterns in the data. Before that let's import some libraries.
import seaborn as sns
import matplotlib.pyplot as plt
# To visualise missing values
import missingno as msno
Before we perform data analysis, let's split the data into train (70%) and test (30%) sets so that there is no data leakage when we validate our model. The test data should only be used to validate and tune the model we build; it should not influence the decisions we make during feature selection. We will stratify the split so that the class proportions in the samples are the same as those of the full dataset.
from sklearn.model_selection import train_test_split
# Create csv file for prediction set
data[data["default"].isna()].to_csv("data/predict.csv", sep=";", index=False)
# Drop prediction set from the data
data = data.dropna(subset=["default"])
X = data.drop(columns=['default']).copy()
y = data['default'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.3,
stratify=y,
random_state=8)
eda_data = pd.concat([X_train.copy(), y_train.copy()], axis=1)
Three things are very evident at first glance from the distribution of the features:
eda_data.hist(figsize = (20,20));
Let's visualise the missing values in the dataset and make some observations. These can also be verified by sorting the features in decreasing order of missing-value counts.
# Arrange the data points in the decreasing number of missing values
# eda_data.isna().sum().sort_values(ascending=False)
plt.figure(figsize=(20,10))
sns.heatmap(eda_data.isna(),
cmap="YlGnBu",
cbar_kws={'label': 'Missing Data'})
We see lots of NaN values in features like num_arch_written_off_12_24m, avg_payment_span_0_12m and num_active_div_by_paid_inv_0_12m across all the timelines. One possible explanation could be the presence of an outstanding payment for these account entries: if the status of an active or past payment is not available or certain yet, these features cannot have a valid, meaningful value. Consider the following cases arising from an outstanding payment on an active or past invoice:

- avg_payment_span can be NaN because this might be the first payment of the account holder and it is still outstanding, so the average number of days cannot be calculated.
- num_active_div_by_paid_inv can be NaN as none of the invoices are paid, so division by zero yields NaN.
- num_arch_written_off can be NaN because the status of an outstanding payment is not certain yet and the delay has not reached the limit beyond which it can be marked as written off.

This intuition is also supported by the following correlation matrix of missing values: the features num_arch_written_off, avg_payment_span and num_active_div_by_paid_inv have strong nullity correlation ranging from 0.8 to 1, suggesting that the presence/absence of NaN in one correlates with another.
Moreover, out of all the accounts (9419) whose last payment has unpaid status (has_paid = False), around 97% (9121) have missing values in all of num_arch_written_off, avg_payment_span and num_active_div_by_paid_inv. This bolsters the assumption that outstanding payments cause the missing values in these features.
print(f'\033[1mTotal number of accounts with unpaid status of last payment:\033[0m {eda_data.has_paid.value_counts().loc[False]}')
outstanding_payment = eda_data[eda_data[["num_arch_written_off_0_12m", "num_active_div_by_paid_inv_0_12m",
"avg_payment_span_0_12m", "avg_payment_span_0_3m",
"num_arch_written_off_12_24m"]].isna().all(1)].copy()
print(f'\033[1mTotal number of accounts with assumed outstanding payment:\033[0m {outstanding_payment.has_paid.value_counts().loc[False]}')
msno.heatmap(data, cmap='YlGnBu');
Two nominal categorical features, namely merchant_group (12 levels) and merchant_category (57 levels), have a high number of levels. Let's observe the distribution of defaults among them.
# Function to highlight top 4 values
def highlight_nlargest(row):
    # Get the 4 largest values of the row
    is_large = row.nlargest(4).values
    # Apply the highlight colour if the current value is among the 4 largest values
    return ['background-color: lightgreen' if val in is_large else '' for val in row]
mgroup_default_ct = pd.crosstab(index=eda_data['merchant_group'], columns=eda_data['default'], margins = True).sort_values(1.0, ascending=False)
mgroup_default_ct["groups_default_rate"] = (mgroup_default_ct[1.0] / mgroup_default_ct.loc["All", 1.0])*100
mgroup_default_ct["within_group_default_rate"] = (mgroup_default_ct[1.0] / mgroup_default_ct ["All"])*100
mgroup_default_ct = mgroup_default_ct.transpose()
mgroup_default_ct = mgroup_default_ct.drop("All", axis=1) # remove margin for training set
mgroup_default_ct.apply(pd.to_numeric).style.apply(highlight_nlargest, axis =1)
mcat_default_ct = pd.crosstab(index=eda_data['merchant_category'], columns=eda_data['default'], margins = True).sort_values(1.0, ascending=False)
mcat_default_ct["cat_default_rate"] = (mcat_default_ct[1.0] / mcat_default_ct.loc["All", 1.0])*100
mcat_default_ct["within_cat_default_rate"] = (mcat_default_ct[1.0] / mcat_default_ct ["All"])*100
# sorting by within category default rate
# mcat_default_ct = mcat_default_ct.sort_values("within_cat_default_rate", ascending=False).transpose()
# Retaining category default rate sorting
mcat_default_ct = mcat_default_ct.transpose()
mcat_default_ct = mcat_default_ct.drop("All", axis=1) # remove margin for training set
mcat_default_ct.apply(pd.to_numeric).style.apply(highlight_nlargest, axis =1)
figure, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 4))
sns.kdeplot(data=eda_data, x="age", hue="default", fill=True, common_norm=False, alpha=.5,
clip=(eda_data.num_unpaid_bills.min(),eda_data.num_unpaid_bills.max()),
palette="mako", linewidth=0, ax=ax1)
sns.kdeplot(data=eda_data, x="num_unpaid_bills", hue="default", fill=True, cut = 0,
common_norm=False, clip=(0,eda_data.num_unpaid_bills.max()), alpha=.5,
palette="mako", linewidth=0, bw_method=0.5, ax=ax2)
plt.suptitle('Density Plots For Default & Non-Default Independent Groups', fontsize=16)
plt.tight_layout()
plt.show()
**Observations:**
💡 Note on the Density Estimator Kernel: since we use Gaussian kernel density estimation to plot the density of num_unpaid_bills, the tails extend beyond our data range, because the estimator places a Gaussian over each data point and then sums and normalizes the densities. For a better representation, we clip the distribution to the range of num_unpaid_bills before plotting, cutting it off below the value 0.
# Create two independent groups
default_data = eda_data[eda_data['default'] == 1][["age", "num_unpaid_bills"]]
non_default_data = eda_data[eda_data['default'] == 0][["age", "num_unpaid_bills"]]
default_summary = pd.DataFrame({'Feature': ["age", "num_unpaid_bills"], 'Variance': default_data.var().values,
'Mean': default_data.mean().values, 'Median': default_data.median().values,
'Group':["Default", "Default"], 'Mode': default_data.mode().values[0],
'Sample Size': [default_data.shape[0], default_data.shape[0]]})
non_default_summary = pd.DataFrame({'Feature': ["age", "num_unpaid_bills"], 'Variance': non_default_data.var().values,
'Mean': non_default_data.mean().values, 'Median': non_default_data.median().values,
'Group':["Non-Default", "Non-Default"], 'Mode': non_default_data.mode().values[0],
'Sample Size': [non_default_data.shape[0], non_default_data.shape[0]]})
summary = pd.concat([default_summary, non_default_summary])
summary.set_index(['Group', "Sample Size", "Feature"], inplace=True)
summary
from scipy import stats
import numpy as np
# Test normality of the groups
print(f"p-value for default group: {stats.shapiro(non_default_data.age).pvalue:.2f}")
print(f"p-value for non-default group: {stats.shapiro(default_data.age).pvalue:.2f}")
# Calculate F test statistic
f = np.var(np.array(non_default_data.age), ddof=1)/np.var(np.array(default_data.age), ddof=1) #calculate F test statistic
# Define degrees of freedom
df_non_default = non_default_data.age.size-1
df_default = default_data.age.size-1
# Find p-value of F test statistic
p = 1-stats.f.cdf(f, df_non_default, df_default)
print(f"p-value for F test: {p:.2f}")
Welch’s T-Test
$H_0$: There is no difference in the mean age of the two population groups default and non-default.
$H_1$: There is statistically significant difference in the mean age of the two groups default and non-default.
t_test_pvalue = stats.ttest_ind(default_data["age"], non_default_data["age"], equal_var=False).pvalue
print(f"Welch's t-test p-value: {t_test_pvalue:.2f}")
As the p-value is less than 0.05, we conclude that there is a statistically significant difference in the means of the two groups. Thus our intuition that younger people are more likely to default on a payment might be true. As default is not independent of the feature age, we will include age in the final model.
Let's find the features with very low variance by applying a variance threshold (here, variance below 1). Values of these features remain almost constant. We are likely to see low-variance features because the dataset is sparse, with mostly constant values in some features (like the status features).
from sklearn.feature_selection import VarianceThreshold
var_check = eda_data.copy().drop(["uuid", "default", "name_in_email", "merchant_category", "merchant_group", "has_paid"], axis=1)
var_thres = VarianceThreshold(threshold=1)
var_thres.fit(var_check)
var_thres.get_support()
constant_columns = [col for col in var_check.columns if col not in var_check.columns[var_thres.get_support()]]
constant_columns
In order to find the dependence of the target variable default on the feature space, let's compute the point-biserial correlation between the continuous numeric features and the default variable. Features whose correlation with default is not statistically significant (p-value > 0.05) show no evidence of association with default; we will drop these features at the modelling stage later.
# Correlation between nominal default variable and continuous numeric variables
corr_value = []
p_value = []
colname = []
for col in eda_data.columns.difference(["uuid", "merchant_category", "merchant_group", "has_paid", "name_in_email",
                                        "status_max_archived_0_6_months", "status_max_archived_0_12_months",
                                        "status_max_archived_0_24_months", "status_last_archived_0_24m",
                                        "status_2nd_last_archived_0_24m", "status_3rd_last_archived_0_24m"]).to_list():
    if col != "default":
        d = eda_data[["default", col]].dropna(axis=0).copy()
        corr = stats.pointbiserialr(d.default, d[col])
        colname.append(col)
        corr_value.append(corr.correlation)
        p_value.append(corr.pvalue)
corr_result = pd.DataFrame({"feature":colname, "correlation": corr_value, "pvalue": p_value})
corr_result[corr_result.pvalue>0.05].sort_values(by="pvalue", ascending=False) # show features whose correlation with default is NOT statistically significant
The correlation matrix shows that features with overlapping timeline such as [status_max_archived_0_12_months, status_max_archived_0_24_months, status_max_archived_0_6_months, status_last_archived_0_24m, status_2nd_last_archived_0_24m, status_3rd_last_archived_0_24m], [num_arch_ok_0_12m and num_arch_ok_12_24m] and [max_paid_inv_0_12m and max_paid_inv_0_24m] have strong positive correlation.
To avoid collinearity among the features, we will only keep the features that capture the latest or larger timeline, namely status_max_archived_0_24_months, num_arch_ok_0_12m and max_paid_inv_0_24m.
correlation_matrix = eda_data[['sum_capital_paid_account_0_12m', 'age', "status_max_archived_0_12_months", "status_max_archived_0_24_months",
"recovery_debt", "status_2nd_last_archived_0_24m", "sum_capital_paid_account_12_24m", "sum_paid_inv_0_12m",
"status_3rd_last_archived_0_24m", "account_amount_added_12_24m", "status_max_archived_0_6_months",
"status_last_archived_0_24m", "max_paid_inv_0_12m", "max_paid_inv_0_24m", "time_hours", "num_active_inv",
"num_arch_dc_0_12m", "num_arch_dc_12_24m", "num_arch_ok_0_12m", "num_arch_ok_12_24m", "num_arch_rem_0_12m",
"num_unpaid_bills"]].corr()
correlation_matrix.apply(pd.to_numeric).style.background_gradient(cmap='coolwarm').set_precision(2).set_table_styles(
[dict(selector="th",props=[('max-width', '200px')]),
dict(selector="th.col_heading",
props=[("writing-mode", "vertical-rl"),
('transform', 'rotateZ(180deg)'),
('height', '200px'),
('vertical-align', 'top')])])
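As a quick cross-check, the strongly correlated pairs can also be listed programmatically. This is only a small sketch over the correlation_matrix computed above; the 0.8 cutoff is just an illustrative choice for "strong".
# Keep only the upper triangle (excluding the diagonal) and list pairs with |corr| > 0.8;
# this should reproduce the overlapping-timeline groups called out above.
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
high_corr_pairs = upper.stack()
high_corr_pairs[high_corr_pairs.abs() > 0.8].sort_values(ascending=False)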
During exploratory analysis in section 3.2, we observed that the majority of payment defaults occur in a few specific merchant groups. This supports the belief that the possibility of default depends on the type of merchant group. Let's do some hypothesis testing to verify this assumption.
$H_0$ = Merchant groups (or Merchant category) have no effect on the default values
$H_1$ = Merchant groups (or Merchant category) have some effect on the default values
# Test of independence for merchant group using chi-square
print(f"p-value for chi-square test (Merchant Groups): \
{stats.chi2_contingency(pd.crosstab(index=eda_data['merchant_group'], columns=eda_data['default'], margins = False))[1]}")
print(f"p-value for chi-square test (Merchant Category): \
{stats.chi2_contingency(pd.crosstab(index=eda_data['merchant_category'], columns=eda_data['default'], margins = False))[1]}")
Next, test the hypothesis for each individual merchant group and category against the default feature.
$H_0$ = Merchant group $m$ (category $c$) has no effect on the default values.
$H_1$ = Merchant group $m$ (category $c$) has some effect on the default values.
Note: $m$ takes 12 values from merchant_groups & $c$ takes 57 values from merchant_category.
Compute p-value for each test.
p-value Correction: We have 12 merchant groups in total (i.e. 12 hypotheses) to test against our target feature default, each at significance level $\alpha = 0.05$. Each test therefore has a 5% chance of a false positive, and with every additional test the chance of at least one false positive across the family of tests accumulates. After 12 tests the family-wise error rate can be as high as $12 \times 0.05 = 0.6$ (the Bonferroni upper bound). To control this we use a multiple-testing (family-wise error rate, FWER) correction such as the Bonferroni correction, which tests each hypothesis at $\alpha / 12$.
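To make the accumulation concrete, here is a small illustration in plain Python (assuming only $\alpha = 0.05$ and $m = 12$ tests, as above):
# Family-wise error rate for m independent tests at level alpha, and the Bonferroni-adjusted level
alpha, m = 0.05, 12
fwer = 1 - (1 - alpha) ** m        # probability of at least one false positive across the 12 tests (~0.46)
bonferroni_bound = m * alpha       # the looser upper bound of 0.6 referred to above
adjusted_alpha = alpha / m         # per-test significance level after Bonferroni correction
print(f"FWER: {fwer:.2f}, Bonferroni bound: {bonferroni_bound:.2f}, adjusted alpha: {adjusted_alpha:.4f}")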
from IPython.display import display, HTML, display_html
def post_hoc(feature, target):
    result = {}
    significant_feature = {}
    feature_coded = pd.get_dummies(feature)
    # Bonferroni-adjusted significance level: alpha divided by the number of levels tested
    adjusted_pvalue = 0.05 / feature.nunique()
    for value in feature_coded:
        if stats.chi2_contingency(pd.crosstab(target, feature_coded[value]))[1] < adjusted_pvalue:
            result[value] = 'Reject Null Hypothesis H_0'
            significant_feature[value] = 1
        else:
            result[value] = 'Fail to Reject Null Hypothesis H_0'
    return result, significant_feature
chi_post_hoc_mg, significant_mgroup = post_hoc(eda_data["merchant_group"], eda_data["default"])
chi_post_hoc_mc, significant_mcat = post_hoc(eda_data["merchant_category"], eda_data["default"])
chi_post_hoc_mg = pd.DataFrame(
data = {"Merchant Group ($m$)": chi_post_hoc_mg.keys(),
"Conclusion": chi_post_hoc_mg.values()}).style.set_table_attributes("style='display:inline'")\
.where(lambda val: 'Reject Null Hypothesis H_0' == str(val),
'background-color: lightgreen', subset=['Conclusion'])
chi_post_hoc_mc = pd.DataFrame(data = {"Merchant Category ($c$)": chi_post_hoc_mc.keys(),
"Conclusion": chi_post_hoc_mc.values()})\
.style.set_table_attributes("style='display:inline'")\
.where(lambda val: 'Reject Null Hypothesis H_0' == str(val),
'background-color: lightgreen', subset=['Conclusion'])
# pd.set_option("display.max_rows", None)
display_html("<div style='height: 200px'>" + chi_post_hoc_mg._repr_html_() + "\xa0" * 10 + chi_post_hoc_mc._repr_html_() + "</div>", raw=True)
In a previous section we found that the account_worst_status related features have very low variance. We can combine these status features, which span multiple time frames, into a new composite feature, performance_score.
from sklearn.feature_selection import chi2
acc_status_performance = eda_data[['account_worst_status_0_3m','account_worst_status_3_6m', 'account_worst_status_6_12m', 'account_worst_status_12_24m']].copy()
acc_status_performance["performance_score"] = acc_status_performance.sum(axis=1)
acc_status_performance = acc_status_performance.fillna(0)
f_p_values=chi2(acc_status_performance, y_train)
f_values=pd.Series(f_p_values[0])
f_values.index=acc_status_performance.columns
f_values.sort_values(ascending=False)
from sklearn.base import BaseEstimator, TransformerMixin
acc_worst_status_features = ['account_worst_status_0_3m','account_worst_status_3_6m',
'account_worst_status_6_12m', 'account_worst_status_12_24m']
class AccountStatusFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Combine the worst-status features into a single composite performance_score
        x["performance_score"] = x.sum(axis=1)
        x.drop(['account_worst_status_0_3m', 'account_worst_status_3_6m',
                'account_worst_status_6_12m', 'account_worst_status_12_24m'], axis=1, inplace=True)
        return x
Note: Earlier we had used frequency encoding for these features, i.e. we replaced each level of the categorical feature with its frequency of occurrence in the data; for example, the merchant group Entertainment was replaced by 30809. A drawback of this approach is that it treats merchant groups/categories with equal frequencies as identical, which can cause a loss of information. As this transformation degraded the performance of the linear model we applied later, it was removed from the final learning pipeline. The new encoding method is observed to improve the True Positive and True Negative predictions.
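For reference, the removed frequency encoding looked roughly like the sketch below (illustrative only, not part of the final pipeline):
# Frequency encoding (removed): map each level to its count in the training data,
# e.g. "Entertainment" -> 30809. Levels with equal counts become indistinguishable.
freq_map = X_train["merchant_group"].value_counts().to_dict()
X_train["merchant_group"].map(freq_map).head()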
merchant_features = ["merchant_category", "merchant_group"]
class MerchantFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        # Map each merchant category/group to 1 if its post-hoc chi-square test was significant, else 0
        self.merchant_category_mapper = significant_mcat
        self.merchant_group_mapper = significant_mgroup
        return self

    def transform(self, x):
        x['merchant_category'] = x['merchant_category'].apply(lambda cat: self.merchant_category_mapper.get(cat, 0))
        x['merchant_group'] = x['merchant_group'].apply(lambda grp: self.merchant_group_mapper.get(grp, 0))
        return x
The feature preprocessing consists of three kinds of steps:
- Drop features
- Impute missing values
- Scaling and transformation: for the skewed features we use RobustScaler(), which is robust to outliers in the data; it removes the median and then uses the quantile range to scale the features.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
dropped_features = ['uuid', 'name_in_email', 'has_paid', 'worst_status_active_inv', 'account_incoming_debt_vs_paid_0_24m',
'account_status', 'avg_payment_span_0_3m', 'avg_payment_span_0_12m', 'num_arch_written_off_12_24m',
'num_arch_written_off_0_12m', 'sum_capital_paid_account_0_12m', 'sum_capital_paid_account_12_24m',
'account_amount_added_12_24m', 'num_arch_rem_0_12m', 'status_max_archived_0_12_months',
'status_max_archived_0_6_months', 'status_last_archived_0_24m', 'status_2nd_last_archived_0_24m',
'status_3rd_last_archived_0_24m', 'num_arch_ok_12_24m', 'max_paid_inv_0_12m']
missing_value_features = ['num_active_div_by_paid_inv_0_12m', 'account_days_in_dc_12_24m',
'account_days_in_term_12_24m', 'account_days_in_rem_12_24m']
missing_value_transformer = Pipeline(steps=[
('simpleimpute', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)),
('stdscaler', StandardScaler())
])
numeric_features = ['age', "time_hours"]
numeric_transformer = Pipeline(steps=[('stdscaler', StandardScaler())])
skewed_features = ["recovery_debt", "sum_paid_inv_0_12m", "max_paid_inv_0_24m", "num_active_inv",
"num_arch_dc_0_12m", "num_arch_dc_12_24m", "num_arch_ok_0_12m", "num_unpaid_bills"]
skewed_transformer = Pipeline(steps=[ ('robustscaler', RobustScaler())])
nominal_features = ["status_max_archived_0_24_months"]
nominal_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown="ignore"))])
feat_transformer = ColumnTransformer(transformers=[
('drop_column', 'drop', dropped_features),
('missing_value_processing', missing_value_transformer, missing_value_features),
('acc_worst_status_processing', AccountStatusFeatureTransformer(), acc_worst_status_features),
('numeric_processing', numeric_transformer, numeric_features),
('skewed_processing', skewed_transformer, skewed_features),
('nominal_processing', nominal_transformer, nominal_features),
('merchant_processing', MerchantFeatureTransformer(), merchant_features)
], remainder='passthrough')
If we do not want to generate new samples of the minority class, then we can simply train the model to penalise misclassification of the minority class more. In sklearn, this can be accomplished by setting the class_weight hyperparameter. Giving more weight to the minority class in the cost function of the algorithm assigns a higher penalty to its errors, so the model will emphasise reducing them. Setting class_weight to balanced assigns each class $c$ a weight based on its frequency, $w_c = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_c}$, where $n_c$ is the number of samples in class $c$.
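As a quick sanity check on the balanced scheme, here is a small sketch using sklearn's compute_class_weight, which applies the same formula to our training labels:
from sklearn.utils.class_weight import compute_class_weight
# Weight per class under class_weight="balanced": the rarer default class gets the larger weight
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y_train), y=y_train)
dict(zip(np.unique(y_train), weights))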
Alternatively, applying an oversampling technique to the training set generates new samples that resemble the minority class in order to rebalance the classes. SMOTE will augment the data by synthesising additional samples of the minority class (default = 1). We will use imbalanced-learn's pipeline for this, as it applies SMOTE only during the fitting (training) stage, avoiding any data leakage.
y_train.value_counts()
We use sklearn's LogisticRegression implementation with the (default) lbfgs solver.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
pipeline_base = imbpipeline(steps = [['transform_features', feat_transformer],
["classifier", LogisticRegression(random_state=11,
max_iter=5000)]])
pipeline_class_weight = imbpipeline(steps = [['transform_features', feat_transformer],
["classifier", LogisticRegression(random_state=11, class_weight="balanced",
max_iter=5000)]])
pipeline_smote = imbpipeline(steps = [['transform_features', feat_transformer],
['smote', SMOTE(random_state=11)],
["classifier", LogisticRegression(random_state=11,
max_iter=5000)]])
We use the roc_auc score as the metric of model performance; later we find the logistic regression threshold value on the ROC curve for which the balance between the False Positive (FP) and True Positive (TP) rates is optimal.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
stratified_kfold = StratifiedKFold(n_splits=5,
shuffle=True,
random_state=11)
param_grid = {"classifier__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search_base = GridSearchCV(estimator=pipeline_base,
param_grid=param_grid,
scoring='roc_auc',
cv=stratified_kfold,
n_jobs=-1)
grid_search_class_weight = GridSearchCV(estimator=pipeline_class_weight,
param_grid=param_grid,
scoring='roc_auc',
cv=stratified_kfold,
n_jobs=-1)
param_grid_smote = {"classifier__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
"smote__sampling_strategy": np.linspace(0.1,1.0,10)}
grid_search_smote = GridSearchCV(estimator=pipeline_smote,
param_grid=param_grid_smote,
scoring='roc_auc',
cv=stratified_kfold,
n_jobs=-1)
# Fit all the models
grid_search_base.fit(X_train, y_train)
grid_search_class_weight.fit(X_train, y_train)
grid_search_smote.fit(X_train, y_train);
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, precision_recall_curve #plots
from numpy import argmax
# threshold for Logistic Regression
class_threshold = 0.5
def performance_scores(grid_search, class_threshold=0.5, roc_plot=False, precision_recall_plot=False):
    y_pred_prob = grid_search.best_estimator_.predict_proba(X_test)[:, 1]
    if roc_plot:
        # Compute the ROC curve
        fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
        # Find the index of the threshold with the largest J statistic
        max_jstat_index = argmax(tpr - fpr)
        class_threshold = thresholds[max_jstat_index]  # New optimal threshold
        y_pred = (y_pred_prob >= class_threshold).astype(int)
        plot_param = [fpr, tpr, max_jstat_index]
    elif precision_recall_plot:
        # Compute the precision-recall curve
        precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
        # Find the index where the F-score is maximum
        max_fscore_index = argmax((2 * precision * recall) / (precision + recall))
        class_threshold = thresholds[max_fscore_index]  # New optimal threshold
        y_pred = (y_pred_prob >= class_threshold).astype(int)
        plot_param = [precision, recall, max_fscore_index]
    else:
        y_pred = (y_pred_prob >= class_threshold).astype(int)
        plot_param = []
    cv_score = grid_search.best_score_
    test_score = grid_search.score(X_test, y_test)
    accuracy_value = accuracy_score(y_test, y_pred)
    precision_value = precision_score(y_test, y_pred)
    recall_value = recall_score(y_test, y_pred)
    f1_value = f1_score(y_test, y_pred)
    # tn, fp, fn, tp = confusion_matrix(y_test == 1.0, y_pred_prob > class_threshold).ravel()
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return [class_threshold, cv_score, test_score, accuracy_value, precision_value,
            recall_value, f1_value, tn, fp, fn, tp] + plot_param
performance = {'Models': ['Threshold', 'CV Score', 'Test Score', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'TN', 'FP', 'FN', 'TP'],
'Base Model': performance_scores(grid_search_base, class_threshold),
'Cost Sensitive Model': performance_scores(grid_search_class_weight, class_threshold),
'Smote Model': performance_scores(grid_search_smote, class_threshold)}
results = pd.DataFrame(performance, columns = ['Models', 'Base Model', 'Cost Sensitive Model', 'Smote Model'])
results.set_index("Models", inplace=True)
results = results.T
results.reset_index().rename(columns={'index': 'Models'})
results.apply(pd.to_numeric).style.highlight_max(color = 'lightgreen', axis = 0)
Change the value of the threshold from 0.5 to $\frac{0.5}{k}$, where $k$ is: $$k = \frac{\text{Fraction of default accounts after SMOTE } (F^{min}_{os})}{\text{Fraction of default accounts before SMOTE, i.e. in the training set } (F^{min}_{train})}$$
We will find the positive (minority) class percentage using the SMOTE sampling ratio selected by the best_estimator_. Since we only oversample the minority class, under that assumption the oversampling ratio in imblearn is defined as $\alpha_{os} = \frac{N_{minority}}{N_{majority}}$ after resampling, so the fraction of default accounts after SMOTE is $F^{min}_{os} = \frac{\alpha_{os}}{1 + \alpha_{os}}$.
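As a purely illustrative example of the adjustment: if SMOTE fully balances the classes ($\alpha_{os} = 1.0$, so $F^{min}_{os} = 0.5$) and defaults made up, say, 1% of the training set, then $k = 0.5 / 0.01 = 50$ and the moved threshold would be $0.5 / 50 = 0.01$. The 1% figure here is hypothetical; the code below uses the actual training fraction.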
threshold_moving = {}
threshold_moving["Techniques"]= ['Threshold', 'CV Score', 'Test Score', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'TN', 'FP', 'FN', 'TP']
threshold_moving["Default Threshold"] = results.loc["Smote Model"].tolist()
alpha_os = grid_search_smote.best_params_["smote__sampling_strategy"]
k = (alpha_os/(1 + alpha_os)) / (y_train.value_counts()[1.0]/len(y_train))
adjusted_threshold = 0.5/k
threshold_moving["Adjusted Threshold"] = performance_scores(grid_search_smote, adjusted_threshold)
💡 Intuition: An ideal model should predict both the True Negatives and True Positives correctly; it should not be biased towards predicting one class. So an optimal threshold should yield both a high TPR and a high TNR (a maximisation problem). We know that $TNR = 1 - FPR$, so we look for a threshold that keeps FPR low while retaining a reasonable TPR. As TPR and FPR are dependent on each other, we use the geometric mean $\sqrt{TPR \times (1 - FPR)}$ to average them; maximising this mean amounts to maximising $(TPR - FPR)$. Youden's $J$ statistic states that $J = TPR + TNR - 1$; reframed in terms of FPR, $J = TPR - FPR$. Thus we find the point on the ROC curve that maximises the $J$ statistic.
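Spelled out, using $TNR = 1 - FPR$ from above: $$J = TPR + TNR - 1 = TPR + (1 - FPR) - 1 = TPR - FPR$$ so the optimal point is simply the argmax of $TPR - FPR$, which is exactly what the code below computes.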
# [fpr, tpr, thresholds, class_threshold, cv_score, test_score, accuracy_value, precision_value, recall_value, f1_value, tn, fp, fn, tp]
performance_scores_roc = performance_scores(grid_search_smote, roc_plot=True)
threshold_moving["ROC Threshold"] = performance_scores_roc[:-3]
fpr, tpr, threshold_index = performance_scores_roc[-3:]
# plot the roc curve for the model
plt.plot([0,1], [0,1], linestyle='--', label='Average Model')
plt.plot(fpr, tpr, marker='.', label='Logistic Regression With SMOTE')
plt.scatter(fpr[threshold_index], tpr[threshold_index], marker='o', color='black', label='Optimal threshold')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
💡 Intuition: The optimal threshold is the value that balances correctly predicting True Positives against keeping False Positives low. Precision $(\frac{TP}{TP + FP})$ and Recall $(\frac{TP}{TP + FN})$ are just differently scaled views of the True Positives, and since there is an inverse relationship between Recall and Precision, we choose the threshold that maximises the harmonic mean of these metrics, which is nothing but the F-score.
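For reference, the quantity being maximised below is the F-score, the harmonic mean of precision and recall: $$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$ which matches the expression (2 * precision * recall) / (precision + recall) computed inside performance_scores.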
performance_scores_pr = performance_scores(grid_search_smote, precision_recall_plot=True)
threshold_moving["Precision-Recall Threshold"] = performance_scores_pr[:-3]
precision, recall, threshold_index = performance_scores_pr[-3:]
# plot the precision-recall curve for the model
average_model = len(y_test[y_test==1]) / len(y_test)
plt.plot([0,1], [average_model,average_model], linestyle='--', label='Average Model')
plt.plot(recall, precision, marker='.', label='Logistic Regression With SMOTE')
plt.scatter(recall[threshold_index], precision[threshold_index], marker='o', color='black', label='Optimal threshold')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
threshold_moving_results = pd.DataFrame(threshold_moving, columns = ['Techniques', 'Default Threshold', 'Adjusted Threshold', 'ROC Threshold', 'Precision-Recall Threshold'])
threshold_moving_results.set_index("Techniques", inplace=True)
threshold_moving_results = threshold_moving_results.T
threshold_moving_results.reset_index().rename(columns={'index': 'Techniques'})
threshold_moving_results["FPR"] = (threshold_moving_results["FP"] / (threshold_moving_results["FP"] + threshold_moving_results["TN"])) * 100
threshold_moving_results.apply(pd.to_numeric).style.highlight_max(color = 'lightgreen', axis = 0, subset=['Threshold', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'TN', 'FP', 'FN', 'TP', 'FPR'])
Find all the accounts that are likely to Default.
If it is not much of a concern that a large number of accounts are incorrectly predicted to be at risk of defaulting on the next payment, and the business goal is solely to identify as many positive (likely to default) cases as possible, then we should choose a threshold strategy that gives a LOW PRECISION, HIGH RECALL model. The Adjusted Threshold, the ROC Threshold and the default 0.5 threshold are the three cutoffs that yield a high number of True Positives on an imbalanced test set. This comes at the cost of misclassifying many trustworthy account holders as defaulters. If customer approval is automated and too many customers are declined a purchase, it can create a frustrating user experience and may discourage users from using the app in the future (causing missed sales). Two possible approaches to solve this could be:
Find some accounts that are likely to Default, capping the misclassification rate (FPR) at no more than $x\%$
If a high False Positive Rate is a concern, we can set a limit on the percentage of accounts that may be labelled as at risk of defaulting. This can be achieved by computing the FPR for various threshold values and picking one where the FPR is less than or equal to, say, $x\%$. To reduce False Positives we have to choose a threshold that gives a HIGH PRECISION, LOW RECALL model. The Precision-Recall Threshold yields the lowest number of False Positives; if we decide to allow only 4% of accounts to be mislabelled as defaulters, this model could be the right fit.
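One possible way to implement such a cap (a sketch reusing the fitted grid_search_smote and the test set from above; the 4% value for $x$ is only an example):
# Among all thresholds whose FPR stays at or below the cap x, pick the one with the highest TPR
fpr, tpr, thresholds = roc_curve(y_test, grid_search_smote.best_estimator_.predict_proba(X_test)[:, 1])
x = 0.04
capped_threshold = thresholds[np.argmax(np.where(fpr <= x, tpr, -1.0))]
print(f"Threshold with FPR <= {x:.0%}: {capped_threshold:.3f}")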
print(f'Best Estimator: \n{grid_search_smote.best_estimator_.get_params()["steps"][2]}')
print(grid_search_smote.best_params_)
The grid search selects the SMOTE model with sampling_strategy = 1.0 as the best estimator. That means the model is trained on an equal number of default and non-default samples.
pred = pd.read_csv("data/predict.csv", sep=";").drop(columns=["default"])
predictions = pred[["uuid"]].assign(pd=grid_search_smote.predict_proba(pred)[:,1].tolist())
# Create csv file to store the results
predictions.to_csv("data/result.csv", sep=";", index=False)
predictions["default"] = (predictions.pd >= threshold_moving_results.loc["Precision-Recall Threshold"]["Threshold"]).astype(int)
predictions.head()
import joblib
joblib.dump(grid_search_smote.best_estimator_, "artifacts/bestmodel.pkl")