
Using Random Forest and XGBoost To Predict Income

12 min read · Oct 28, 2023

In this article, I build two predictive machine learning models — Random Forest and XGBoost — to find out which demographic factors determine income. I use a Kaggle dataset of 48,842 Americans from 1994 that contains a host of demographic variables (age, years of education, occupation and more).

First, I investigate the data to identify patterns. Next, I create new predictors with feature engineering, then specify, tune and compare the models on in-sample results. The variable importance of features will also be compared across the two models.

Getting started: Data cleaning

First, let’s load the dataset as df0 and take a look at it.
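
As a minimal setup sketch (assuming the Kaggle CSV has been downloaded locally as adult.csv — the file name and path are assumptions, so adjust them to your own copy):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle adult income dataset (file name assumed)
df0 = pd.read_csv('adult.csv')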

df0.info()
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 48842 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 educational-num 48842 non-null int64
5 marital-status 48842 non-null object
6 occupation 48842 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 gender 48842 non-null object
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 48842 non-null object
14 income 48842 non-null object
dtypes: int64(6), object(9)

Across 48,842 data entries, we have a total of 15 variables, of which the last, income, will be our outcome variable.

Next, I check for missing and duplicated values. At first glance, it seems there are no missing values.

df0.isna().sum()
age                0
workclass 0
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 0
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
income 0
dtype: int64

But this is misleading: many values are marked with a “?” instead of being left blank.

missing_values = df0[df0 == '?'].count()
missing_values
age                   0
workclass 2799
fnlwgt 0
education 0
educational-num 0
marital-status 0
occupation 2809
relationship 0
race 0
gender 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 857
income 0
dtype: int64

Upon further investigation, three variables (workclass, occupation and native-country) contain missing values. Let’s check how many rows that affects in total.

question_mark_count = (df0 == '?').any(axis=1).sum()
print(f"Rows with '?' values: {question_mark_count}")
Rows with '?' values: 3620

In total, 3,620 of 48,842 rows (7.4%) contain missing values. That’s a small fraction, so rather than imputing values based on means or modes, I’ll simply remove those rows.
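
A minimal sketch of one way to remove them:

import numpy as np

# Treat '?' as a missing value, then drop any row that contains one
df0 = df0.replace('?', np.nan).dropna()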

There are also 47 rows of duplicated data.

df0.duplicated().sum()
47

It’s unlikely for two records to be identical across all 15 variables by chance, so these are probably accidental re-entries. I remove the duplicates.

df0 = df0.drop_duplicates(keep='first')

Finally, I rename variables from kebab-case to snake_case. This isn’t a must-do, just an aesthetic preference.

df0 = df0.rename(columns={'educational-num': 'educational_num',
                          'marital-status': 'marital_status',
                          'capital-gain': 'capital_gain',
                          'capital-loss': 'capital_loss',
                          'hours-per-week': 'hours_per_week',
                          'native-country': 'native_country'})

Here is our final dataset all cleaned up and ready for EDA.

Int64Index: 45175 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 45175 non-null int64
1 workclass 45175 non-null object
2 fnlwgt 45175 non-null int64
3 education 45175 non-null object
4 educational_num 45175 non-null int64
5 marital_status 45175 non-null object
6 occupation 45175 non-null object
7 relationship 45175 non-null object
8 race 45175 non-null object
9 gender 45175 non-null object
10 capital_gain 45175 non-null int64
11 capital_loss 45175 non-null int64
12 hours_per_week 45175 non-null int64
13 native_country 45175 non-null object
14 income 45175 non-null object
dtypes: int64(6), object(9)

Exploratory data analysis

Now we can do some investigation into our data.

The tree-based models I plan to use do not require many assumptions. They don’t assume linearity and generally handle non-linear data well. Minimal preprocessing is required: scaling isn’t necessary, and outliers or missing data aren’t usually a problem.

Nonetheless, it’s good practice to clean up our dataset.

Handling outliers

First, let’s check for outliers across our continuous variables: age, educational_num (years of education), hours_per_week, capital_gain and capital_loss.

fig, axes = plt.subplots(3, 2, figsize=(12, 10))
fig.suptitle('Boxplots to Detect Outliers', fontsize=16)

sns.boxplot(ax=axes[0, 0], y=df0['educational_num']).set_title('Educational Number', fontsize=14)
sns.boxplot(ax=axes[0, 1], y=df0['capital_gain']).set_title('Capital Gain', fontsize=14)
sns.boxplot(ax=axes[1, 0], y=df0['capital_loss']).set_title('Capital Loss', fontsize=14)
sns.boxplot(ax=axes[1, 1], y=df0['hours_per_week']).set_title('Hours per Week', fontsize=14)
sns.boxplot(ax=axes[2, 0], y=df0['age']).set_title('Age', fontsize=14)
axes[2, 1].axis('off')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()

While outliers exist in all of these variables, none sit outside a logically reasonable range, e.g. someone aged 150 or someone working 150 hours a week. Perhaps only capital_gain sticks out, with values around the 100,000 mark, but that’s a plausible figure for a (lucky!) investor. Let’s move on.

Ensuring no multicollinearity among predictive variables

To look at the relationship of our predictive variables, I plot a correlation heat map.

plt.figure(figsize=(16, 9))

heatmap = sns.heatmap(df0.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);

There aren’t any strong correlations among the predictors, which is good. One thing that does stick out is that capital_gain and capital_loss could be combined during feature engineering later for better interpretability and model performance.

Ensuring a balanced dataset

When creating predictive models like Random Forest, it’s important for our outcome variable income to be balanced. I do a simple value count to check:

print(df0['income'].value_counts())
print()
print(df0['income'].value_counts(normalize=True))
<=50K    33973
>50K     11202
Name: income, dtype: int64

<=50K    0.752031
>50K     0.247969
Name: income, dtype: float64

The data shows that ~75% of individuals earn at most $50,000 annually, while ~25% earn more. A perfectly balanced dataset would have a 50:50 split, so ours is somewhat imbalanced, though not severely.
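
Because of this imbalance, I use a stratified train/test split later on. As a quick sketch of why that helps, a stratified split preserves the ~75/25 class ratio in both halves (illustrative only; the actual split happens during model construction below):

from sklearn.model_selection import train_test_split

# Stratifying on income keeps the class ratio identical in the train and test sets
train_idx, test_idx = train_test_split(df0.index, test_size=0.2,
                                       stratify=df0['income'], random_state=42)
print(df0.loc[train_idx, 'income'].value_counts(normalize=True))
print(df0.loc[test_idx, 'income'].value_counts(normalize=True))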

Let’s look at income closer. I visualise income against gender and age.

df0['age_bins'] = pd.cut(df0['age'], bins=range(0, 101, 10), labels=['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99'])

palette = {'Male': '#79C', 'Female': 'pink'}

fig, ax = plt.subplots(1, 2, figsize=(24,7))

#Plot income '>50K'
sns.histplot(data=df0, x='age_bins', hue='gender', palette=palette, multiple="dodge", shrink=0.7,
weights=(df0['income'] == '>50K').astype(int), discrete=True, common_norm=False, stat="count", ax=ax[0])
ax[0].set_title('Income >50K', fontsize=14)
ax[0].set_xlabel('Age')
ax[0].set_ylabel('Count of Individuals')

#Plot income '<=50K'
sns.histplot(data=df0, x='age_bins', hue='gender', palette=palette, multiple="dodge", shrink=0.7,
weights=(df0['income'] == '<=50K').astype(int), discrete=True, common_norm=False, stat="count", ax=ax[1])
ax[1].set_title('Income <=50K', fontsize=14)
ax[1].set_xlabel('Age')
ax[1].set_ylabel('Count of Individuals')

plt.suptitle('Impact of Age and Gender on Income', fontsize=16)
plt.show()

Takeaway: there is a gender disparity in income — males generally earn more across all age groups.

Let’s look at some visualisations of our predictive variables against income. I create histograms for numerical variables and boxplots for categorical variables.
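
Those charts aren’t reproduced here; as a rough sketch, this is how one of each type could be generated (column choices are illustrative):

fig, ax = plt.subplots(1, 2, figsize=(18, 6))

# Histogram of a numerical variable, split by income
sns.histplot(data=df0, x='educational_num', hue='income', multiple='dodge',
             discrete=True, shrink=0.7, ax=ax[0])
ax[0].set_title('Years of Education by Income')

# Boxplot for a categorical variable: age within each marital status, split by income
sns.boxplot(data=df0, x='marital_status', y='age', hue='income', ax=ax[1])
ax[1].tick_params(axis='x', rotation=45)
ax[1].set_title('Age by Marital Status and Income')

plt.tight_layout()
plt.show()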

A quick takeaway: in the six charts above, age and educational_num appear to be the stronger factors influencing income. Looking at marital_status, being married also seems to be correlated with higher income.

Another quick takeaway: as expected, income is higher with higher hours_per_week. As noted previously, there is also a gender disparity, with males out-earning females. Finally, perhaps a controversial observation: whites tend to have higher incomes too.

Feature engineering

At this point, let’s do some basic feature engineering for better interpretability and model performance. First, let’s create a new dataframe df_enc2.

df_enc2 = df0.copy()
df_enc2.head()

Based on the EDA above, the distributions of both age and hours_per_week are right-skewed with pronounced peaks.

For that reason, transforming these variables into bins may help reduce the impact of noise and outliers. It may also improve the model by making the relationship between our predictors and the outcome variable easier to capture.

What type of binning is appropriate? For age, the data is fairly evenly distributed, though not entirely. So rather than equal-width binning (dividing the range into equal-sized intervals), I’ll perform quantile-based binning with five bins (young, middle-aged, adult, older adult, senior) to ensure the later age bins have sufficient data points.

For hours_per_week, however, the data is far more heavily skewed due to a concentration at the 40-hour mark. Instead, I’ll create custom bins based on typical workweek structures:

  • Part-Time: Less than 30 hours per week
  • Full-Time: 30 to 40 hours per week
  • Over-Time: More than 40 hours per week
# Applying quantile-based binning to "age" with 5 bins
age_labels = ['Young', 'Middle-Aged', 'Adult', 'Older Adult', 'Senior']
df_enc2['age_bin'] = pd.qcut(df_enc2['age'], q=5, labels=age_labels)

# Applying custom binning to "hours_per_week" based on typical workweek structures
bin_edges = [0, 30, 40, df_enc2['hours_per_week'].max()]
bin_labels = ['Part-Time', 'Full-Time', 'Over-Time']
df_enc2['hours_per_week_bin'] = pd.cut(df_enc2['hours_per_week'], bins=bin_edges, labels=bin_labels, include_lowest=True)

# Display the resulting dataframe with the binned features
print(df_enc2[['age', 'age_bin', 'hours_per_week', 'hours_per_week_bin']].head())

Here’s the result:

   age      age_bin  hours_per_week hours_per_week_bin
0   25        Young              40          Full-Time
1   38        Adult              50          Over-Time
2   28  Middle-Aged              40          Full-Time
3   44  Older Adult              40          Full-Time
5   34  Middle-Aged              30          Part-Time

Next, I simplify capital_gain and capital_loss into one variable: capital_net_gain.

df_enc2['capital_net_gain'] = df_enc2['capital_gain'] - df_enc2['capital_loss']

Finally, the native_country variable is categorical with high cardinality (too many countries!). To keep the dimensionality of the dataset down in one-hot-encoding later, I opt to frequency encode native_country.

native_country_freq = df_enc2['native_country'].value_counts(normalize=True)
df_enc2['native_country_freq'] = df_enc2['native_country'].map(native_country_freq)

Here’s what the dataset looks like at this point, before encoding:

Data preprocessing

Now we can prepare the data for our first Random Forest model. First, I dummy encode the outcome variable income, then add the resulting columns to the dataframe df_enc2.

income_dummies = pd.get_dummies(df_enc2['income'], prefix='income')

df_enc2 = pd.concat([df_enc2, income_dummies], axis=1)

Next, I drop all the redundant variables that have been transformed and are no longer needed.

Then I dummy encode all the categorical variables in our dataset. Categorical variables like workclass, education, marital_status, and occupation have different impacts on income and will need to be encoded for modeling.

df_enc2 = df_enc2.drop(columns=['income',
                                'age',
                                'fnlwgt',
                                'capital_gain',
                                'capital_loss',
                                'hours_per_week',
                                'native_country',
                                'age_bins'])  # note: this was for earlier EDA

# Dummy encode workclass, education, marital_status, occupation, relationship, race, gender, native_country columns
df_enc2 = pd.get_dummies(df_enc2, drop_first=False)

Finally, although tree-based models don’t require feature scaling, I apply scikit-learn’s StandardScaler to the remaining numeric features for faster training and numerical stability.

from sklearn.preprocessing import StandardScaler

numerical_features_to_scale = ['educational_num', 'capital_net_gain', 'native_country_freq']

scaler = StandardScaler()
df_enc2[numerical_features_to_scale] = scaler.fit_transform(df_enc2[numerical_features_to_scale])

df_enc2.head()

Here’s what our final dataset looks like before model construction:

Constructing the Random Forest model and evaluation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Selecting the features and the target variable
X = df_enc2.drop(columns=['income_<=50K', 'income_>50K'])
y = df_enc2['income_>50K']

# Splitting the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Creating and training Random Forest model
random_forest_model = RandomForestClassifier(random_state=42, n_estimators=100)
random_forest_model.fit(X_train, y_train)

# Predictions using Random Forest
rf_predictions = random_forest_model.predict(X_test)

# Evaluation - Classification Reports
rf_classification_report2 = classification_report(y_test, rf_predictions)

# Predicting probabilities using Random Forest
rf_probs = random_forest_model.predict_proba(X_test)[:, 1]
# Calculating AUC scores
rf_auc = roc_auc_score(y_test, rf_probs)

print("\nRandom Forest:")
print(rf_classification_report2)
print(f"AUC Score: {rf_auc:.2f}")

The evaluation scores of our Random Forest model:

Random Forest:
              precision    recall  f1-score   support

           0       0.88      0.93      0.90      6795
           1       0.74      0.62      0.67      2240

    accuracy                           0.85      9035
   macro avg       0.81      0.77      0.79      9035
weighted avg       0.84      0.85      0.85      9035

AUC Score: 0.89

Hyperparameter tuning and second evaluation of Random Forest model

Pretty good scores but let’s see if we can improve them by hyperparameter tuning. Using RandomizedSearchCV, I look for my optimal parameters.

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameters grid for Random Forest
rf_param_grid = {
    'n_estimators': np.arange(50, 201, 50),
    'max_depth': np.arange(5, 21, 5),
    'min_samples_split': np.arange(2, 21, 2),
    'min_samples_leaf': np.arange(1, 21, 2)
}

# Creating the RandomizedSearchCV object for Random Forest
rf_random_search = RandomizedSearchCV(random_forest_model, rf_param_grid, n_iter=50, cv=5,
                                      verbose=1, random_state=42, n_jobs=-1)

# Fitting the random search model (this may take a few minutes)
rf_random_search.fit(X_train, y_train)

# Getting the best parameters and the best score
rf_best_params = rf_random_search.best_params_
rf_best_score = rf_random_search.best_score_

rf_best_params, rf_best_score

Optimal parameters:

({'n_estimators': 100,
'min_samples_split': 8,
'min_samples_leaf': 3,
'max_depth': 20},
0.8587437742114001)

Now I retrain my Random Forest model with the optimal parameters then make predictions on the test data and evaluate the model again.

# Optimal parameters obtained from RandomizedSearchCV
optimal_params = {
    'n_estimators': 100,
    'min_samples_split': 8,
    'min_samples_leaf': 3,
    'max_depth': 20
}

# Retraining the random forest model with optimal parameters
random_forest_model = RandomForestClassifier(**optimal_params, random_state=42)
random_forest_model.fit(X_train, y_train)

# Predicting on the test data
y_pred = random_forest_model.predict(X_test)

# Predicting probabilities using Random Forest
rf_probs = random_forest_model.predict_proba(X_test)[:, 1]
# Calculating AUC scores
rf_auc = roc_auc_score(y_test, rf_probs)

# Printing classification report for precision, recall, f1-score, and support
print("\nRandom Forest (after tuning):")
report2point1 = classification_report(y_test, y_pred)
print('Classification Report:')
print(report2point1)
print(f"AUC Score: {rf_auc:.2f}")

Our evaluation scores the second time:


Random Forest (after tuning):
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.94      0.91      6795
           1       0.78      0.59      0.67      2240

    accuracy                           0.86      9035
   macro avg       0.83      0.77      0.79      9035
weighted avg       0.85      0.86      0.85      9035

AUC Score: 0.91

When comparing the scores of both models, the tuned Random Forest model appears to be superior. The one exception is the recall score for class 1, which dipped slightly.

The tuned Random Forest model’s accuracy is 0.86, slightly higher than the first model’s 0.85. It also boasts a higher AUC score of 0.91, suggesting better performance on the trade-off between sensitivity (true positive rate) and specificity (true negative rate).
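
To visualize that trade-off directly, one can also plot the ROC curve for the tuned model; a quick sketch:

from sklearn.metrics import RocCurveDisplay

# ROC curve for the tuned Random Forest on the held-out test set
RocCurveDisplay.from_estimator(random_forest_model, X_test, y_test)
plt.title('ROC Curve (Tuned Random Forest)')
plt.show()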

Let’s plot a confusion matrix to better visualize our model’s predictive power:

conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print('Confusion Matrix:')
print(conf_matrix)

# Visualize the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', xticklabels=['Predicted 0', 'Predicted 1'],
yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix Heatmap')
plt.show()
  • True Negatives (TN): The model correctly predicted 6,421 instances where a person’s income was at or below $50,000.
  • False Positives (FP): In 374 instances, the model incorrectly predicted that a person’s income was above $50,000 when it was actually at or below.
  • False Negatives (FN): In 913 instances, the model predicted that a person’s income was at or below $50,000 when it was actually above.
  • True Positives (TP): The model correctly predicted 1,327 instances where a person’s income was above $50,000.

Finally, we create a bar chart to visualize our most important predictive variables from our Random Forest model.

# Extracting feature importance from the Random Forest model
feature_importances = random_forest_model.feature_importances_

# Creating a DataFrame to hold features and their importance scores
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sorting the features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Displaying the top 10 most important features
top_features = feature_importance_df.head(10)

# Plotting the top 10 features
plt.figure(figsize=(12, 7))
sns.barplot(data=top_features, x='Importance', y='Feature', palette='viridis')

plt.title('Top 10 Important Features in Predicting Income')
plt.show()

top_features

The plot above shows that, in our champion Random Forest model, capital_net_gain, educational_num and marital_status carry the highest predictive value in determining income.

Building an XGBoost model

Now we create an XGBoost model. First, we divide our data into a training and test set as usual. Then we instantiate the XGBClassifier and define our parameters for hyperparameter tuning.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

X = df_enc2.drop(columns=['income_<=50K', 'income_>50K'])  # Features (excluding the target variable)
y = df_enc2['income_>50K']  # Target variable

# Data Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate XGBClassifier for model construction
model = xgb.XGBClassifier(objective='binary:logistic')

# Define parameters for hyperparameter tuning
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.9, 1.0]
}

Next, we set up grid-search cross-validation with the model and parameter grid defined above. Then we fit it to our training data and make predictions.

# Grid Search
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best Model from Grid Search
best_model = grid_search.best_estimator_

# Formulate predictions on your test set (Model Evaluation)
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1] # Probabilities for the positive class
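
The evaluation mirrors the earlier Random Forest steps; a sketch (variable names here are illustrative):

# Evaluate the tuned XGBoost model with the same metrics as before
xgb_report = classification_report(y_test, y_pred)
xgb_auc = roc_auc_score(y_test, y_pred_proba)

print("XGBoost Model:")
print("Classification Report:")
print(xgb_report)
print(f"AUC Score: {xgb_auc:.2f}")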

Here are our evaluation scores:

XGBoost Model:
Classification Report:

              precision    recall  f1-score   support

           0       0.89      0.93      0.91      6842
           1       0.76      0.65      0.70      2193

    accuracy                           0.87      9035
   macro avg       0.83      0.79      0.81      9035
weighted avg       0.86      0.87      0.86      9035

AUC Score: 0.93

And here are a confusion matrix plus the top 10 features ranked by importance:
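
The figures themselves aren’t reproduced here; a sketch that generates them, mirroring the earlier Random Forest steps:

# Confusion matrix for the tuned XGBoost model
xgb_conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(xgb_conf_matrix, annot=True, fmt='g', cmap='Blues',
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix Heatmap (XGBoost)')
plt.show()

# Top 10 features by importance for the tuned XGBoost model
xgb_importance_df = pd.DataFrame({'Feature': X.columns,
                                  'Importance': best_model.feature_importances_})
xgb_top_features = xgb_importance_df.sort_values(by='Importance', ascending=False).head(10)

plt.figure(figsize=(12, 7))
sns.barplot(data=xgb_top_features, x='Importance', y='Feature')
plt.title('Top 10 Important Features (XGBoost)')
plt.show()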

Conclusion

Some closing takeaways:

  • The XGBoost model yields a slightly higher overall accuracy (0.87) in predicting income levels than the Random Forest model (0.86).
  • The relatively high number of false negatives (913) in the Random Forest model indicates that it may underestimate the number of individuals earning above 50,000 based on the demographic factors provided. This could be a concern, especially if the goal is to identify potential high earners accurately. False negatives for the XGBoost model are lower (765), which is better.
  • While the Random Forest model produces fewer false positives (374), this still means some individuals are mistakenly identified as high earners based on their demographics, even when their actual income is below the threshold.
