Unleashing the Power of XGBoost

Marcos Gois
8 min read · Feb 10, 2023

An Introduction to Extreme Gradient Boosting

XGBoost is a popular machine learning library for gradient boosting trees. It stands for Extreme Gradient Boosting, reflecting its focus on high performance and efficiency. XGBoost is designed to be scalable, both in terms of the number of samples and features it can handle, as well as the complexity of the models it can build.

Illustration of how XGBoost works

Gradient boosting trees are an ensemble method that combines the predictions of many simple models into a more accurate and robust prediction. In XGBoost, the basic building block of these models is a decision tree. A decision tree is built by recursively splitting the data on the feature and threshold that most improve the objective, until a stopping criterion such as a maximum depth is reached. XGBoost implements gradient tree boosting: trees are added one at a time, and each new tree is fitted to reduce the loss left by the trees already in the model.
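To make the idea of iteratively adding trees concrete, here is a minimal sketch of boosting for squared error, written in plain Python. It uses one-split "stumps" instead of full trees, and the names `fit_stump`, `predict_stump` and `boost` are made up for this illustration. XGBoost itself also uses second-order information and regularization, so treat this as a simplified picture rather than the library's algorithm.

```python
def fit_stump(x, residuals):
    """Find the threshold on a single feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, history = boost(x, y)
print(f"MSE after first tree: {history[0]:.4f}")
print(f"MSE after last tree:  {history[-1]:.4f}")
```

Each round only corrects part of the remaining error, scaled by the learning rate, which is the role the `eta` parameter plays in XGBoost.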

One of the strengths of XGBoost is its ability to handle large datasets with many features. Trees split the data on individual feature values, and XGBoost speeds up the search for good splits with techniques such as histogram-based split finding and parallel processing. To prevent overfitting, XGBoost adds a regularization term to the objective: L1 and L2 penalties on the leaf weights, plus a penalty on the number of leaves.
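As a rough illustration of how the regularization acts, the sketch below computes the value of a single leaf from the sums of gradients and hessians of the samples that fall into it. The function name `leaf_weight` is made up for this example; `reg_lambda` (L2) and `reg_alpha` (L1) follow the parameter names of XGBoost's scikit-learn interface. It is a simplified view of one leaf, not the full training procedure.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value: L1 soft-thresholds the gradient sum, L2 enlarges the denominator."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    return -shrunk / (hess_sum + reg_lambda)


# Squared-error loss: gradient = prediction - target, hessian = 1 per sample.
targets = [2.0, 3.0, 4.0]
prediction = 0.0
G = sum(prediction - t for t in targets)  # -9.0
H = float(len(targets))                   # 3.0

print(leaf_weight(G, H, reg_lambda=0.0))                 # 3.0, the plain mean of the targets
print(leaf_weight(G, H, reg_lambda=1.0))                 # 2.25, shrunk toward zero by L2
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=1.0))  # 2.0, shrunk further by L1
```

Larger penalties pull leaf values toward zero, so each tree makes more cautious corrections.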

Ability to handle missing data

Another key feature of XGBoost is its ability to handle missing data. This matters because real-world samples are often missing values for some features. XGBoost does not impute missing values. Instead, at each split it learns a default direction, left or right, for samples whose value is missing. Users can also specify which value should be treated as missing.
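The idea of a learned default direction can be sketched for a single split. The helper below tries sending the rows with a missing value to the left child, then to the right child, and keeps the option with the lower squared error. The function names are hypothetical, and XGBoost scores the two options with its gradient-based gain rather than the plain squared error used here.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(feature, target, threshold):
    """Route missing rows left, then right, and keep the lower-error option."""
    left = [t for f, t in zip(feature, target) if f is not None and f <= threshold]
    right = [t for f, t in zip(feature, target) if f is not None and f > threshold]
    missing = [t for f, t in zip(feature, target) if f is None]

    error_if_left = sse(left + missing) + sse(right)
    error_if_right = sse(left) + sse(right + missing)
    return "left" if error_if_left <= error_if_right else "right"


feature = [1.0, 2.0, None, 8.0, 9.0, None]
target = [10.0, 11.0, 30.0, 29.0, 31.0, 32.0]
print(best_default_direction(feature, target, threshold=5.0))  # right
```

Here the rows with missing values have targets close to those on the right side, so the split sends them right.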

How XGBoost works

Efficiency and performance

Beyond its efficiency and performance, XGBoost has become popular because of its ease of use and its wide range of features. It has a comprehensive Python package that is easy to install and use, along with interfaces for R, Julia, and other programming languages. It supports categorical features and unbalanced data, and its scikit-learn-compatible estimators can be combined with tools such as grid search for hyperparameter tuning.

XGBoost is a powerful and versatile library for gradient boosting trees that is well-suited for a wide range of applications. Whether you are working with a large dataset, dealing with missing data, or building complex models, XGBoost is a great choice for efficient and accurate machine learning.

Hands-on with XGBoost:

Description of problem

This challenge serves as the final project for the “How to win a data science competition” Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms — 1C Company.

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

About Dataset

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

File descriptions

  • sales_train.csv — the training set. Daily historical data from January 2013 to October 2015.
  • test.csv — the test set. You need to forecast the sales for these shops and products for November 2015.
  • sample_submission.csv — a sample submission file in the correct format.
  • items.csv — supplemental information about the items/products.
  • item_categories.csv — supplemental information about the item categories.
  • shops.csv — supplemental information about the shops.

Data fields

  • ID — an Id that represents a (Shop, Item) tuple within the test set.
  • shop_id — unique identifier of a shop.
  • item_id — unique identifier of a product.
  • item_category_id — unique identifier of item category.
  • item_cnt_day — number of products sold. You are predicting a monthly amount of this measure.
  • item_price — current price of an item.
  • date — date in format dd.mm.yyyy.
  • date_block_num — a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,…, October 2015 is 33.
  • item_name — name of item.
  • shop_name — name of shop.
  • item_category_name — name of item category.
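As a small illustration of the date_block_num field, the sketch below derives the month index from a date string. It assumes dates written as dd.mm.yyyy, the format parsed later in this article's code, and the function name is made up for the example.

```python
from datetime import datetime


def date_block_num(date_str):
    """Map a dd.mm.yyyy date to the number of months elapsed since January 2013."""
    d = datetime.strptime(date_str, "%d.%m.%Y")
    return (d.year - 2013) * 12 + (d.month - 1)


print(date_block_num("02.01.2013"))  # 0
print(date_block_num("15.10.2015"))  # 33
print(date_block_num("01.11.2015"))  # 34
```

By the same rule, the month to forecast, November 2015, corresponds to block 34.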

This dataset is permitted to be used for any purpose, including commercial use.

Evaluation of problem

Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into the [0, 20] range.
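A minimal sketch of this metric in plain Python is shown below. The helper names are made up, and clipping the predictions as well as the targets is an assumption of this example (a common choice, since the targets never leave that range).

```python
import math


def clip(value, low=0.0, high=20.0):
    """Limit a value to the [low, high] range."""
    return min(high, max(low, value))


def clipped_rmse(y_true, y_pred):
    """RMSE after clipping both targets and predictions to [0, 20]."""
    squared = [(clip(t) - clip(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


y_true = [0, 5, 40]
y_pred = [1, 5, 25]
print(round(clipped_rmse(y_true, y_pred), 4))  # 0.5774
```

In the example, the third pair (40 and 25) contributes no error because both values are clipped to 20.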

1. Exploratory Data Analysis (EDA)

# Import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import xgboost as xgb
import warnings

from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from collections import Counter
from operator import itemgetter
from datetime import datetime

warnings.filterwarnings("ignore")

# Load the train, test, shops, items and categories dataframes
df_sales_train = pd.read_csv('sales_train.csv')
df_sales_test = pd.read_csv('test.csv')
shops = pd.read_csv('shops.csv')
items = pd.read_csv('items.csv')
categories = pd.read_csv('item_categories.csv')

# Preview the sales training dataframe
df_sales_train.head()
Preview of the sales training dataframe
# Group the data by item_id and date_block_num to look at monthly sales per item
sales_by_item_id = df_sales_train.pivot_table(index = ['item_id'],
                                              values = ['item_cnt_day'],
                                              columns = 'date_block_num',
                                              aggfunc = 'sum',
                                              fill_value = 0).reset_index()

# Flatten the two-level column index and restore the item_id column name
sales_by_item_id.columns = sales_by_item_id.columns.droplevel().map(str)
sales_by_item_id = sales_by_item_id.reset_index(drop = True).rename_axis(None, axis = 1)
sales_by_item_id = sales_by_item_id.rename(columns = {'': 'item_id'})
sales_by_item_id.head()
Monthly sales per item after grouping by item_id and date_block_num
# Plot the total number of items sold per month
sales_by_item_id.sum()[1:].plot(legend = True, label = "Total items sold per month")
Total items sold per month
# Now group the training data by shop_id
sales_by_shop_id = df_sales_train.pivot_table(index = ['shop_id'],
                                              values = ['item_cnt_day'],
                                              columns = 'date_block_num',
                                              aggfunc = 'sum',
                                              fill_value = 0).reset_index()

# Flatten the two-level column index and restore the shop_id column name
sales_by_shop_id.columns = sales_by_shop_id.columns.droplevel().map(str)
sales_by_shop_id = sales_by_shop_id.reset_index(drop = True).rename_axis(None, axis = 1)
sales_by_shop_id = sales_by_shop_id.rename(columns = {'': 'shop_id'})
# Preview the shops dataframe
shops.head()
Preview of the shops dataframe
# Normalize the shop names, then derive the city and the shop type from them
shops['shop_name'] = (shops['shop_name'].str.lower()
                      .str.replace(r'[^\w\s]', '', regex = True)
                      .str.replace(r'\d+', '', regex = True)
                      .str.strip())
shops['shop_city'] = shops['shop_name'].str.partition(' ')[0]
# The first matching abbreviation wins, so longer ones are checked first
shop_types = ['мтрц', 'трц', 'трк', 'тц', 'тк']
shops['shop_type'] = shops['shop_name'].apply(lambda x: next((t for t in shop_types if t in x), 'NO_DATA'))
shops.head()
Shops dataframe with the derived shop_city and shop_type columns
# Preview the items dataframe
items.head()
Preview of the items dataframe
# Preview the item categories dataframe
categories.head()
Preview of the item categories dataframe
# Keep only the training rows whose (shop_id, item_id) pair also appears in the test set
good_sales = df_sales_train.merge(df_sales_test, on = ['item_id','shop_id'], how = 'inner').dropna()

print('Number of good_sales:', len(good_sales))
Number of good_sales: 1224439
# Preview the good_sales dataframe
good_sales.head()
Preview of the good_sales dataframe
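The inner merge used to build good_sales keeps a training row only when its (shop_id, item_id) pair also exists in the test set. Here is a small sketch on made-up data that shows the effect, along with one way to list the test pairs that have no training history at all. The toy values are invented for the example.

```python
import pandas as pd

train = pd.DataFrame({
    "shop_id": [1, 1, 2, 3],
    "item_id": [10, 11, 10, 12],
    "item_cnt_day": [2, 1, 5, 4],
})
test = pd.DataFrame({
    "ID": [0, 1, 2],
    "shop_id": [1, 2, 4],
    "item_id": [10, 10, 13],
})

# Inner merge: only pairs present in both dataframes survive.
kept = train.merge(test, on=["item_id", "shop_id"], how="inner")
print(kept)

# Left merge with an indicator column: find test pairs never seen in training.
seen_pairs = train[["shop_id", "item_id"]].drop_duplicates()
flagged = test.merge(seen_pairs, on=["shop_id", "item_id"], how="left", indicator=True)
unseen = flagged[flagged["_merge"] == "left_only"]
print(unseen[["ID", "shop_id", "item_id"]])
```

Pairs like the last one have no sales history, so a model has to fall back on a default value for them, such as zero.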

2. Prepare data

# Drop the "ID" column from the good_sales dataframe
good_sales = good_sales.drop(columns = ['ID'])

good_sales.info()
Information about the good_sales dataframe
# Count the duplicated rows in good_sales
good_sales.duplicated().sum()
5
# Remove the duplicated rows
good_sales = good_sales.drop_duplicates()

# Parse the dates, sort by date and reset the index
good_sales['date'] = pd.to_datetime(good_sales['date'], format = '%d.%m.%Y')
good_sales = good_sales.sort_values(by = 'date')
good_sales = good_sales.reset_index(drop = True).rename_axis(None, axis = 1)

good_sales.head()
good_sales sorted by date, with the index reset and duplicates removed
# Visualize outliers in item_price
good_sales[['item_price']].boxplot()
Outliers in item_price
# Remove possible outliers in item_price
good_sales = good_sales[good_sales['item_price'] < 50000]
# Visualize outliers in item_cnt_day
good_sales[['item_cnt_day']].boxplot()
Outliers in item_cnt_day
# Remove possible outliers in item_cnt_day (this also drops rows with zero or negative counts)
good_sales = good_sales[(good_sales['item_cnt_day'] > 0) & (good_sales['item_cnt_day'] < 400)]


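# Group the data by shop and item, and get the total sales for each month
pivot_train = good_sales.pivot_table(index = ['shop_id', 'item_id'],
                                     columns = 'date_block_num',
                                     values = 'item_cnt_day',
                                     aggfunc = 'sum').fillna(0.0)

pivot_train.head()
Preview of the pivoted training data
# Assumption: the step that builds df_train_cleaned is not shown, so this is a minimal
# reconstruction. It flattens the pivot table and casts the ids to strings to match
# the test preprocessing further down.
df_train_cleaned = pivot_train.reset_index()
df_train_cleaned['shop_id'] = df_train_cleaned.shop_id.astype('str')
df_train_cleaned['item_id'] = df_train_cleaned.item_id.astype('str')

df_train_cleaned.info()
Information about the cleaned training data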
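To see what the pivot step does to the daily records, here is a small sketch on made-up data. It turns one row per sale into one row per (shop, item) pair with one column per month, filling months without sales with zero. The toy values are invented for the example.

```python
import pandas as pd

daily = pd.DataFrame({
    "shop_id":        [1, 1, 1, 2],
    "item_id":        [10, 10, 10, 10],
    "date_block_num": [0, 0, 1, 1],
    "item_cnt_day":   [2.0, 3.0, 1.0, 4.0],
})

# One row per (shop_id, item_id), one column per month, daily counts summed.
monthly = daily.pivot_table(index=["shop_id", "item_id"],
                            columns="date_block_num",
                            values="item_cnt_day",
                            aggfunc="sum").fillna(0.0)
print(monthly)

# reset_index turns the (shop_id, item_id) index back into ordinary columns.
flat = monthly.reset_index()
print(list(flat.columns))
```

After flattening, the month columns keep their integer labels, which is why a month can later be selected by number.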

3. Train the model

# Set the parameters for the model
param = {'max_depth': 15,
         'subsample': 0.999,
         'min_child_weight': 1,
         'eta': 0.34,
         'seed': 1,
         'verbosity': 1,
         'eval_metric': 'rmse'}

# Using XGBoost
# XGBoost is short for Extreme Gradient Boosting. It is a machine learning library that implements
# gradient boosting in an optimized way, which makes it fast and accurate.

# DMatrix is XGBoost's optimized data structure, designed for memory efficiency and training speed.

# Create the DMatrix used to train the model:
# month 33 is the target, every other column is a feature
feature_cols = df_train_cleaned.columns != 33
X_train = df_train_cleaned.loc[:, feature_cols].astype('float32').values
y_train = df_train_cleaned[33].values
xgbtrain = xgb.DMatrix(X_train, label = y_train)

watchlist = [(xgbtrain, 'train')]

# Train the model (xgb.train runs 10 boosting rounds by default)
bst = xgb.train(param, xgbtrain, evals = watchlist)
preds = bst.predict(xgb.DMatrix(X_train))
# This RMSE is measured on the training data itself, so it is an optimistic estimate
rmse = np.sqrt(mean_squared_error(y_train, preds))

# Print the RMSE of the model
print(rmse)
0.875861793733841
# Take a look at the importance of the features in the dataset
fig, ax = plt.subplots(figsize = (20, 20))
xgb.plot_importance(bst, ax = ax)
Feature importance
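The RMSE printed above is computed on the same rows the model was trained on, so it says little about how the model behaves on a month it has not seen. One possible way to check this, sketched below in plain Python on made-up numbers, is a time-based hold-out: fit on one window of months and validate on the window shifted forward by one month. The function name `make_window` and the toy data are assumptions for this illustration.

```python
def make_window(monthly_sales, target_month, n_lags):
    """Features are the n_lags months before target_month; the target is that month."""
    features = [row[target_month - n_lags:target_month] for row in monthly_sales]
    target = [row[target_month] for row in monthly_sales]
    return features, target


# Toy data: 2 (shop, item) pairs with 6 months of sales each (months 0..5).
monthly_sales = [
    [0, 1, 2, 3, 4, 5],
    [5, 5, 4, 4, 3, 3],
]
n_lags = 4

# Fit on months 0..3 -> month 4, validate on months 1..4 -> month 5.
X_fit, y_fit = make_window(monthly_sales, target_month=4, n_lags=n_lags)
X_val, y_val = make_window(monthly_sales, target_month=5, n_lags=n_lags)

# To forecast the unseen month 6, use the most recent n_lags months (2..5).
X_future = [row[-n_lags:] for row in monthly_sales]

print(X_fit, y_fit)
print(X_val, y_val)
print(X_future)
```

The same shift applies when predicting: the features for a future month are the most recent months available, not the window used during training.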

4. Predicting sales with the trained model

# Preprocess the test dataframe (work on a copy so df_sales_test stays untouched)
df_test = df_sales_test.copy()
df_test['shop_id'] = df_test.shop_id.astype('str')
df_test['item_id'] = df_test.item_id.astype('str')

# Attach each pair's monthly sales history; pairs never seen in training get zeros
df_test = df_test.merge(df_train_cleaned, how = "left", on = ["shop_id", "item_id"]).fillna(0.0)

df_test.head()
Preview of the test data
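# Get the predictions, using the same feature columns as in training (everything except ID and month 33)
# Note: these are months 0-32, the window used to predict month 33 during training;
# forecasting November 2015 (month 34) would instead call for months 1-33 as input
test_features = df_test.loc[:, (df_test.columns != 'ID') & (df_test.columns != 33)].astype('float32').values
preds = bst.predict(xgb.DMatrix(test_features))

# Clip the predictions to the [0, 20] range used by the competition and build the submission
preds = np.clip(preds, 0, 20)
sub_df = pd.DataFrame({'ID': df_test.ID, 'item_cnt_month': preds})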

# Get the current datetime to use in the CSV file name
now = datetime.now()
dt_string = now.strftime("%d_%m_%Y_%H_%M_%S")

# Save the predictions to a CSV file
sub_df.to_csv(f'out_using_matrix_xgb_{dt_string}.csv', index = False)

# Read the generated CSV file back to check the predictions
df_preds = pd.read_csv(f'out_using_matrix_xgb_{dt_string}.csv')
df_preds.head()
Preview of the prediction results

Conclusion

XGBoost can be a strong choice for many applications. In this article we saw how simple it is to use and how effective it can be, but it has to be used with care: a model that overfits the training data will make poor predictions on new data.

To see the complete code, see the Kaggle notebook listed in the references (item 4).

That’s all, folks. I hope you liked it!

Thank you so much.

References:

  1. https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-HowItWorks.html
  2. https://aboutyou.tech/blog/xgboost-gone-wild-predicting-returns-with-extreme-gradient-boosting-3e2c16c5bc01/
  3. https://xgboost.readthedocs.io/en/stable/index.html
  4. https://www.kaggle.com/code/marcosgois07/predict-sales-of-after-month-with-xgboost
  5. https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales

Marcos Gois

Full stack developer | Data Science | Machine Learning | Python | C# | SQL