Cross-validation and hyper-parameter tuning

In this notebook you will find example sklearn ML models estimated on the BBB data.

Loading the data

Converting categorical variables to 0/1 dummy variables
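A minimal sketch of the conversion, assuming the data live in a data frame called bbb:

```python
import pandas as pd

# drop_first=True keeps one dummy per categorical variable, but it drops
# the *first* level, so the level order matters
bbb_dum = pd.get_dummies(bbb, drop_first=True)
bbb_dum.head()
```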

Did it work as expected? No! The first level in buyer is "yes", so we get the reverse of what we wanted

Check that buyer is a categorical variable

Show levels

Change level order - now pd.get_dummies should work as intended
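One way to reorder the levels, assuming buyer is stored as a pandas Categorical:

```python
# put "no" first so that drop_first=True drops "no" and keeps the "buyer_yes" dummy
bbb["buyer"] = bbb["buyer"].cat.reorder_categories(["no", "yes"])
bbb["buyer"].cat.categories
```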

Converting categorical variables to 0/1 dummy variables

Did it work as expected this time? Yes!

An alternative approach to converting categorical variables to 0/1 dummy variables offers more control, but requires more work if you have variables with many categories
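For example, coding the dummy by hand makes the mapping explicit:

```python
# manual 0/1 coding: full control over which level maps to 1
bbb["buyer_yes"] = (bbb["buyer"] == "yes").astype(int)
```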

Did it work as expected? Yes!

Adding a random number that should not become an important variable in any model. Setting a random seed to make the results reproducible. Try running the cell below multiple times to confirm
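A sketch of what that cell might look like (the seed value is arbitrary):

```python
import numpy as np

np.random.seed(1234)  # hypothetical seed; any fixed value makes the draw reproducible
bbb["rnd"] = np.random.randn(bbb.shape[0])
```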

Creating a list of variable names

The below requires pyrsm version 0.4.2 or newer

By passing information about the training variable to rsm.scale_df, the variables will be scaled using only information from the training sample, and you will retain the column headers

Just as an exercise, we are going to create a new training variable using ShuffleSplit. This function from sklearn returns a Python generator, so we use a loop to extract the relevant values. The training-test split will be 70-30. Note the use of .copy() to ensure changes to the training or test data are not reflected back in the bbb dataframe
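A sketch, assuming X holds the explanatory variables and y the buyer_yes outcome:

```python
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=1, test_size=0.3, random_state=1234)
for train_idx, test_idx in ss.split(X):
    X_train, X_test = X.iloc[train_idx].copy(), X.iloc[test_idx].copy()
    y_train, y_test = y.iloc[train_idx].copy(), y.iloc[test_idx].copy()

y_train.mean(), y_test.mean()
```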

Notice that the proportion of 1s in y is not exactly the same across the training and test sets when we use ShuffleSplit

StratifiedShuffleSplit is a better option to use when creating a training variable because it ensures the proportions of 1s in the training and test set are as close as possible. To demonstrate, we will first use a 50-50 split. Now the proportions and number of 1s in the training and test set are identical
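A sketch of the stratified version; note that .split now also needs y to stratify on:

```python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=1234)
for train_idx, test_idx in sss.split(X, y):
    print(y.iloc[train_idx].mean(), y.iloc[test_idx].mean())  # (nearly) identical
```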

Now re-create the training index so we have a 70-30 split

We will use the new training variable from here on out. We don't want the training variable to be used in estimation, so we will remove it from X

Now we can standardize X and specify the training variable so only information from the training data is used in scaling
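A sketch using sklearn's StandardScaler to the same effect, assuming the training indicator is stored in bbb["training"] (rsm.scale_df wraps a similar idea while keeping the column headers):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training rows only, then transform all rows
scaler = StandardScaler().fit(X[bbb["training"] == 1])
Xs = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

# standardized training and test sets used in the models below
X_train, y_train = Xs[bbb["training"] == 1], y[bbb["training"] == 1]
X_test, y_test = Xs[bbb["training"] == 0], y[bbb["training"] == 0]
```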

Storing results for evaluation of different models

Let's start with a basic logistic regression model on standardized data
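A sketch with statsmodels, assuming evar holds the names of the (standardized) explanatory variables:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = Xs.copy()
dat["buyer_yes"] = y
form = "buyer_yes ~ " + " + ".join(evar)  # hypothetical list of variable names
logit = smf.glm(form, data=dat[bbb["training"] == 1], family=sm.families.Binomial()).fit()
logit.summary()
```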

Logistic regression through sklearn without any regularization gives the same results as smf.glm above
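A minimal sketch; penalty=None works in sklearn >= 1.2 (use penalty="none" in older versions):

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty=None, max_iter=1000).fit(X_train, y_train)
dict(zip(X_train.columns, lr.coef_[0]))
```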

Logistic regression through sklearn with very strong regularization. Note that if you are using regularization, you should always standardize your explanatory variables first. The level of regularization is so strong that the coefficients for all explanatory variables have been set to zero

Logistic regression through sklearn with minimal regularization gives the same result as not using any penalty

Logistic regression in sklearn with strong L1 regularization will set the coefficient for the rnd variable to zero
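A sketch: smaller C means stronger regularization, and the L1 penalty can zero out coefficients entirely:

```python
from sklearn.linear_model import LogisticRegression

# liblinear supports the L1 penalty; C is the *inverse* regularization strength
lr_l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X_train, y_train)
dict(zip(X_train.columns, lr_l1.coef_[0]))
```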

Storing predictions from LogisticRegression (sklearn)

Tuning logistic regression with L1 penalty (LASSO)
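A sketch of the tuning step; the grid of C values is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": np.logspace(-4, 2, 20)}
lr_cv = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid, scoring="roc_auc", cv=5, n_jobs=-1,
).fit(X_train, y_train)
lr_cv.best_params_
```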

Display the results from CV for logistic regression with regularization

We now have to use .iloc because the data frame has been sorted and .loc[0, "param_C"] might return the wrong row!
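For example:

```python
cv_results = pd.DataFrame(lr_cv.cv_results_).sort_values("rank_test_score")
cv_results.iloc[0]["param_C"]  # positional indexing: first row *after* sorting
```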

Note that the random variable ("rnd") was not removed or set to 0!

Estimate a Neural Net for classification from sklearn

Use CV to tune the NN. Below we tune the size of the NN and the level of regularization
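A sketch of the tuning grid; the candidate sizes and alpha values are hypothetical:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "hidden_layer_sizes": [(1,), (2,), (5,), (10,)],  # size of the NN
    "alpha": [0.0001, 0.001, 0.01, 0.1],              # level of (L2) regularization
}
nn_cv = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=1234),
    param_grid, scoring="roc_auc", cv=5, n_jobs=-1,
).fit(X_train, y_train)
nn_cv.best_params_
```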

The evalbin command below requires pyrsm >= 0.4.2. Install it from a terminal in Jupyter using the command below, and then restart the notebook kernel:

pip3 install --user 'pyrsm>=0.4.2'

Decision tree

Estimate a basic decision tree model
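A minimal sketch, with a hypothetical depth limit to keep the tree small:

```python
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X_train, y_train)
```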

Estimate a larger tree that is likely to overfit the data

Predict for the entire dataset

Random Forest

How many features to consider at each node split? Use sqrt(nr_columns) as an approximation
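A sketch; with oob_score=True each tree is evaluated on the observations it was not trained on:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # sqrt of the number of features at each split
    oob_score=True,
    random_state=1234,
    n_jobs=-1,
).fit(X_train, y_train)
rf.oob_decision_function_[:, 1]  # OOB probabilities for the training data
```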

If we do not use the OOB values for prediction, check out the AUC value in the training sample! Each tree has seen most of the training observations, so in-sample predictions are overly optimistic

Random Forest with cross validation and grid search
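A sketch of the grid search; the parameter grid is hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_features": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
rf_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=1234),
    param_grid, scoring="roc_auc", cv=5, n_jobs=-1,
).fit(X_train, y_train)
rf_cv.best_params_
```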

If we do not use OOB values, again, note the (inflated) AUC value!

If we want to use the OOB scores for the training data instead, we have to re-estimate because it is not possible to pass the 'oob_score' option when using GridSearchCV
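A sketch of the re-estimation, reusing the best hyper-parameters from the grid search above:

```python
rf_best = RandomForestClassifier(
    n_estimators=500, oob_score=True, random_state=1234, n_jobs=-1,
    **rf_cv.best_params_,  # best hyper-parameters found by GridSearchCV
).fit(X_train, y_train)
```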

XGBoost
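A sketch with hypothetical settings:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=500, learning_rate=0.1, max_depth=3,
    eval_metric="logloss", random_state=1234,
).fit(X_train, y_train)
```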

Summarize performance across all models in both training and test

When using any machine learning model you should always create plots of (1) the relative importance of variables and (2) the direction of the effect. Use permutation importance for the importance plot and partial dependence plots to get a sense of the direction of the effect (positive, negative, non-linear)

The first plot below can be used to indicate possible interaction effects
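A sketch of both plot types, using the Random Forest as an illustrative model (PartialDependenceDisplay requires sklearn >= 1.0; the choice of features is illustrative):

```python
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# relative importance of variables: drop in AUC when a column is shuffled
imp = permutation_importance(rf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=1234)
sorted(zip(X_test.columns, imp.importances_mean), key=lambda x: -x[1])

# direction of the effect (positive, negative, non-linear)
PartialDependenceDisplay.from_estimator(rf, X_test, features=X_test.columns[:4])
```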