MGTA 495: Analytics Assignment - XGBoost - Week 10

Classification on Amazon SageMaker

Perform a classification task on the given dataset.
Using the features given, you will train an XGBoost decision tree model to predict a given person's salary (the WAGP column), which will be categorized into multiple bins.


Tasks:

Due: 18th March 11:59 PM PST


Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pandas.DataFrame or sagemaker.estimator.Estimator.

Pandas API documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Amazon SageMaker API documentation: https://sagemaker.readthedocs.io/en/stable/

Amazon SageMaker Tutorials: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html


1. Import packages and Get Amazon IAM execution role & instance region
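A minimal sketch of this setup step, assuming the standard sagemaker SDK calls. Inside a SageMaker notebook instance both lookups succeed; elsewhere they fail, so the sketch falls back to None:

```python
# Sketch of the usual SageMaker setup cell; the exact imports in the
# assignment notebook may differ.
try:
    import boto3
    import sagemaker
    from sagemaker import get_execution_role

    role = get_execution_role()           # IAM execution role ARN
    region = boto3.Session().region_name  # e.g. "us-west-2"
except Exception:
    # Outside a SageMaker notebook these lookups fail.
    role, region = None, None
```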

2. Read data.

The dataset has been shared in the following S3 path: s3://rsm-emr01/week10/person_records_merged.csv We'll use this path to read the data from S3 directly.

Description of Columns

There are lots of columns in the original dataset. However, we'll only use the following columns whose descriptions are given below.

AGEP - Age

COW - Class of worker

WAGP - Wages or salary income past 12 months

JWMNP - Travel time to work

JWTR - Means of transportation to work

MAR - Marital status

PERNP - Total person's earnings

NWAV - Available for work

NWLA - On layoff from work

NWLK - Looking for work

NWAB - Temporary absence from work

SCHL - Educational attainment

WKW - Weeks worked during past 12 months
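pandas can read the S3 object directly (the notebook instance needs the s3fs package for s3:// paths), and the usecols parameter keeps only the columns listed above. The sketch below demonstrates this on a small in-memory CSV so it is self-contained; the real call would use the s3:// path shown in the comment:

```python
import io
import pandas as pd

COLS = ["AGEP", "COW", "WAGP", "JWMNP", "JWTR", "MAR",
        "PERNP", "NWAV", "NWLA", "NWLK", "NWAB", "SCHL", "WKW"]

# Real call (requires s3fs on the notebook instance):
# df = pd.read_csv("s3://rsm-emr01/week10/person_records_merged.csv", usecols=COLS)

# Self-contained stand-in: two fake rows plus an extra column that usecols drops.
csv = ("AGEP,COW,WAGP,JWMNP,JWTR,MAR,PERNP,NWAV,NWLA,NWLK,NWAB,SCHL,WKW,EXTRA\n"
       "35,1,52000,30,1,1,52000,5,2,2,2,21,1,x\n"
       "52,2,87000,15,1,3,87000,5,2,2,2,16,1,y\n")
df = pd.read_csv(io.StringIO(csv), usecols=COLS)
print(df.shape)  # (2, 13)
```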

3. Filtering data

Find the correlation of the WAGP value with all other features. You can use the following technique for finding correlation between two columns:

df['col_1'].corr(df['col_2']) gives you the correlation between col_1 and col_2.

Your task is to find the correlation between WAGP and all other columns.
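A short sketch of that loop on a toy frame (the column values here are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the person-records data.
df = pd.DataFrame({
    "WAGP":  [20000, 35000, 52000, 87000, 110000],
    "AGEP":  [22, 30, 38, 49, 55],
    "JWMNP": [10, 20, 25, 40, 45],
})

# Correlation of WAGP with every other column.
corrs = {col: df["WAGP"].corr(df[col]) for col in df.columns if col != "WAGP"}
print(corrs)
```

Equivalently, `df.corr()["WAGP"]` computes the whole column of correlations in one call.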

4. Outlier Removal

Remove outlier rows based on values in the WAGP column. This will be an important step that impacts our model's predictive performance in the classification step below.

Based on the statistics above, we need an upper limit to filter out significant outliers. We'll filter out all the data points for which WAGP is more than the mean + 3 standard deviations.

Your tasks:

  1. Filter the dataframe using a calculated upper limit for WAGP

Expected Output:

  1. Number of outlier rows removed from DataFrame
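The filter itself is one comparison against the computed limit. A minimal sketch on toy data with one extreme salary:

```python
import pandas as pd

# Toy WAGP column: 20 ordinary salaries plus one extreme value.
wagp = [30000 + 2000 * i for i in range(20)] + [5_000_000]
df = pd.DataFrame({"WAGP": wagp})

# Keep rows at or below mean + 3 standard deviations.
upper = df["WAGP"].mean() + 3 * df["WAGP"].std()
filtered = df[df["WAGP"] <= upper]
print(f"{len(df) - len(filtered)} outlier rows removed")  # 1 outlier rows removed
```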

5. Dropping NAs

Drop rows with any nulls in any of the columns.
Print the resulting DataFrame's row count.

Note: The more features you choose, the more rows with nulls you will drop. This may be desirable if you are running into memory problems.

Your tasks:

  1. Drop rows with any nulls

Expected Output:

  1. Number of rows in cleaned DataFrame
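This step is a single dropna() call; by default it drops any row containing a null in any column. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "WAGP": [52000, np.nan, 87000, 61000],
    "AGEP": [35, 52, np.nan, 44],
})

clean = df.dropna()  # drops rows with a null in ANY column
print(len(clean))    # 2
```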

6. Discretize salary

We want to convert the WAGP column, which contains continuous values, into a column with discrete labels so that we can use it as the label column for our classification problem. We're essentially turning a regression problem into a classification problem. Instead of predicting a person's exact salary, we're predicting the range in which that person's salary lies.

Note that labels are integers and should start from 0.

XGBoost expects that the Label column (WAGP_CAT) is the first column in the dataset.

Your tasks:

  1. Make a new column for discretized labels with 5 bins. Recommended column name is WAGP_CAT
    • Remember to put the label column (WAGP_CAT) first in the dataframe; XGBoost expects the label in the first column, and training won't run otherwise!
  2. Examine the label column

Expected Output:

  1. The first 5 rows of the dataframe with the discretized label column. The label column must be the first column in the dataframe.
  2. A histogram from the discretized label column
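One way to do this is pd.cut with labels=False, which yields integer labels starting at 0, followed by a column reorder. A sketch on toy data (whether to also drop the raw WAGP column is an assumption here, made to avoid leaking the answer into the features):

```python
import pandas as pd

df = pd.DataFrame({
    "AGEP": [22, 30, 38, 49, 55, 61],
    "WAGP": [15000, 32000, 48000, 66000, 90000, 120000],
})

# 5 equal-width bins; labels=False yields integer labels 0..4.
df["WAGP_CAT"] = pd.cut(df["WAGP"], bins=5, labels=False)

# Move WAGP_CAT to the front and drop the raw salary column.
df = df[["WAGP_CAT"] + [c for c in df.columns if c not in ("WAGP_CAT", "WAGP")]]
print(df.head())
```

For the histogram, `df["WAGP_CAT"].hist()` (or `value_counts()` for a quick look) shows how balanced the bins are.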

7. Splitting data and converting to CSV
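A sketch of a common recipe for this step: shuffle, slice into train / validation / test, then write CSVs with no header and no index, since SageMaker's built-in XGBoost expects headerless CSV with the label in the first column. The 70/20/10 proportions are an assumption; use whatever split the notebook specifies:

```python
import numpy as np
import pandas as pd

# Toy frame with the label already in the first column.
df = pd.DataFrame({
    "WAGP_CAT": np.arange(100) % 5,
    "AGEP": np.random.default_rng(0).integers(18, 70, 100),
})

# Shuffle, then slice ~70/20/10 into train / validation / test.
df = df.sample(frac=1, random_state=0)
n = len(df)
train = df.iloc[: int(0.7 * n)]
val = df.iloc[int(0.7 * n): int(0.9 * n)]
test = df.iloc[int(0.9 * n):]

# No header, no index: the format SageMaker XGBoost expects.
train.to_csv("train.csv", header=False, index=False)
val.to_csv("validation.csv", header=False, index=False)
```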

8. Save processed data to S3

This step is needed for using XGBoost with Amazon SageMaker. We have provided this code for you, but you should look through it to see what it does.

9. Create channels for train and validation data to feed to model

10. Create the XGBoost model

11. Set model hyperparameters

12. Train model using train and validation data channels

13. Deploying model

14. Testing the model on test data

The code has been given for you. Read through and see what it is doing.

15. Confusion matrix and classification report

The code for this is given to you. Read and see how the functions are being used.
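The provided code most likely calls library functions (e.g. scikit-learn's confusion_matrix and classification_report); the sketch below builds the matrix by hand with numpy just to show what it contains - rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for 8 test rows across the 5 salary bins.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 4])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 4])

cm = confusion_matrix(y_true, y_pred, 5)
accuracy = np.trace(cm) / cm.sum()  # diagonal / total = 6/8 = 0.75
print(cm)
```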

IMPORTANT: DELETE THE ENDPOINT

16. Hyperparameter tuning

We'll do hyperparameter tuning on two hyperparameters:

  1. min_child_weight
  2. max_depth

We'll use a random search strategy, which typically finds good values at a fraction of the cost of exhaustively trying every combination (grid search).

Tasks:

  1. Set up a hyperparameter tuning job with the values for both the hyperparameters to be in the range [1,10]
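To illustrate what the random strategy does, the sketch below samples candidate (min_child_weight, max_depth) pairs uniformly from [1, 10]. This only demonstrates the sampling idea; the actual tuning job is configured through the SageMaker SDK (sagemaker.tuner.HyperparameterTuner with IntegerParameter(1, 10) ranges), which performs this sampling server-side:

```python
import random

random.seed(0)

# Random search: sample candidate combinations instead of enumerating
# the full 10 x 10 grid.
candidates = [
    {"min_child_weight": random.randint(1, 10),
     "max_depth": random.randint(1, 10)}
    for _ in range(10)
]
print(candidates[0])
```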

17. Results of tuning job

The code is given to you for getting the results of the tuning job in a dataframe.

18. Deploy the tuned model

19. Test the tuned model on test data

We've given the code for testing the tuned model.

IMPORTANT: DELETE THE ENDPOINT

20. Screenshot of everything terminated.

You need to submit a screenshot of terminated endpoints and notebook instances once you are done with the assignment. Nothing should be in green in this screenshot since all running instances are shown in green.

You can take the screenshot of the Dashboard in the Amazon SageMaker console.