Part 2 of a 2-post series where I explore how I lost 1000 euros betting on CS:GO with machine learning (ML). This post covers the actual implementation of the solution: CS:GO basics, data scraping, feature engineering, modelling, validation, backtesting and lessons learned.
Author
Pedro Tabacof
Published
July 11, 2024
This is a true story of how I lost money using machine learning (ML) to bet on CS:GO. The project was done with a friend, who gave me permission to share this story in public.
Check out the first post of the series, which covers the theory and foundations necessary to understand what’s going on in this second post:
What is your edge?
Financial decision-making with ML
One bet: Expected profits and decision rule
Multiple bets: The Kelly criterion
Probability calibration
Winner’s curse
In this post, I will go over the actual implementation of the solution:
CS:GO basics
Data scraping
Feature engineering
TrueSkill
Inferential vs predictive models
Dataset
Modelling
Evaluation
Backtesting
Why I lost 1000 euros
Solution
Before we get to the actual solution, I need to explain some CS:GO basics:
CS:GO basics
Counter-Strike: Global Offensive (CS:GO) is a first-person shooter (FPS) multiplayer game. It can be played casually or competitively. When played competitively, it typically follows this format:
Two teams of 5 play against each other: Terrorists vs Counter-Terrorists
Best of 3 maps (sometimes best of 1 or 5)
Maps are played up to 30 rounds
Each round can be won by killing the other team or by planting or defusing the bomb
Each player has a number of kills (K), deaths (D), assists (A) and average damage per round (ADR)
If you don't know much about video games, don't worry: you can treat CS:GO like any other competitive team sport. Each match has a winner (in a best of 1, it could be a tie), and each team and player has stats that correlate with their performance in the match. Teams are composed of players who range in talent: there are star players and dominant teams, and players may change teams over time.
Web scraping
Data is the new oil.
As I explained in the first post, one of the reasons we chose CS:GO was data availability. Since we might have broken some terms and conditions, I won't name our exact sources, but they were easily found online.
We collected both match data and betting odds. Note that match data is easy to find retroactively, but betting odds need to be collected in real-time, which limited our ability to run backtests (more on this later). Betting odds data is super valuable, and maybe a better way to make money would have been to collect it across different websites and simply sell it1.
We collected 3 years' worth of match data, covering over 30k matches. We managed to collect only 3 months of betting odds data, covering 1,725 matches, with approximately 30 odds per match. Note that the odds fluctuate between the match announcement and its start.
Match data contained information such as the teams playing, team composition, kills and deaths for each player, rounds won, the map to be played, the final score (win-loss-tie), and much more.
To scrape the data we needed, we used Selenium with a headless browser. That was necessary because the website content was dynamically loaded with JavaScript. We then parsed the resulting HTML with BeautifulSoup. We ran a batch job every night to get the new match data and old match results; betting odds, however, were collected more frequently.
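To give a flavour of what that looks like, here is a minimal sketch of such a scraping loop. The URL, CSS selectors and column names are placeholders for illustration, not our actual sources or code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome headless so the JavaScript-rendered content still loads
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Hypothetical results page; our real sources are deliberately not named
driver.get("https://example.com/csgo/results")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Hypothetical markup: one div per match with team names and the final score
rows = []
for match in soup.select("div.match-row"):
    rows.append({
        "team1": match.select_one(".team1").get_text(strip=True),
        "team2": match.select_one(".team2").get_text(strip=True),
        "score": match.select_one(".score").get_text(strip=True),
    })

matches = pd.DataFrame(rows)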
Feature engineering
Past behavior is the best predictor of future behavior.
With the match data, we created hundreds of features2. Most features were related to past performance, such as the percentage of times team 1 won on the map to be played or against team 2. If the teams had faced each other before, who won back then is an important predictor now. We also used game score features like KD difference and ADR on a team and individual basis.
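To make this concrete, here is a minimal sketch of one such feature: the head-to-head win rate of team 1 against team 2, computed only from earlier matches. The column names ('team1', 'team2', 'team1_won', 'match_date') are an assumed schema for illustration, not our actual one:

import pandas as pd

def head_to_head_win_rate(matches: pd.DataFrame) -> pd.Series:
    """Fraction of earlier meetings between the two teams won by team 1.

    Assumes one row per match, sorted by 'match_date', with hypothetical
    columns 'team1', 'team2' and 'team1_won' (1 if team 1 won, else 0).
    """
    df = matches.copy()

    # Order-independent key so "A vs B" and "B vs A" share one history
    df['pair'] = df.apply(lambda r: tuple(sorted([r['team1'], r['team2']])), axis=1)

    # Express each result from the perspective of the alphabetically-first team
    df['first_won'] = df.apply(
        lambda r: r['team1_won'] if r['team1'] == r['pair'][0] else 1 - r['team1_won'], axis=1)

    # Expanding mean over strictly earlier matches only (the shift avoids target leakage)
    first_rate = df.groupby('pair')['first_won'].transform(lambda s: s.shift(1).expanding().mean())

    # Flip back to team 1's perspective for each row; NaN when the teams never met before
    return first_rate.where(df['team1'] == df['pair'].map(lambda p: p[0]), 1 - first_rate)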
Note that we couldn’t use the betting odds as features, even though the information there is invaluable3. The reason is simple, as previously explained: we didn’t have a backfill for historical betting odds. We could only use the odds that were available after we started to collect them, which was only (barely) enough for backtesting.
Also, we had a trump card, which ended up generating the most important set of features: TrueSkill.
TrueSkill
TrueSkill is a Bayesian skill rating system developed by Microsoft for multiplayer games, a Bayesian counterpart to the Elo rating. It aims to estimate the “true skill” of each player or team based on their performance history.
TrueSkill uses a Gaussian distribution to represent the skill level of each player, and it updates these skill levels after each match using Bayesian updates4. TrueSkill provides not just an estimate of each player's skill but also the uncertainty around it, both of which can be used as features in an ML model.
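For illustration, here is a minimal sketch of team-level rating updates with the open-source trueskill Python package. The win-probability helper is the commonly used Gaussian approximation rather than a built-in of the package, and the draw_probability setting is an assumption, not our exact configuration:

import itertools
import math

import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # assumption: treat ties as negligible

# Each team is a list of per-player ratings (mean mu, uncertainty sigma)
team1 = [env.create_rating() for _ in range(5)]
team2 = [env.create_rating() for _ in range(5)]

def win_probability(t1, t2, env=env):
    """Approximate P(team 1 beats team 2) from the Gaussian skill beliefs."""
    delta_mu = sum(r.mu for r in t1) - sum(r.mu for r in t2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(t1, t2))
    size = len(t1) + len(t2)
    denom = math.sqrt(size * (env.beta ** 2) + sum_sigma)
    return env.cdf(delta_mu / denom)

p_before = win_probability(team1, team2)  # 0.5 for two fresh teams

# Suppose team 1 wins a match: rank 0 beats rank 1
team1, team2 = env.rate([team1, team2], ranks=[0, 1])

p_after = win_probability(team1, team2)  # now above 0.5

Both the updated means and the sigmas can then be fed into the model as features.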
Inferential vs predictive models
There are two cultures in the use of statistical modeling to reach conclusions from data. -Leo Breiman5
If we have TrueSkill, which predicts the win probability between two teams, why do we even need an ML model? TrueSkill is an inferential model, which attempts to explain the world through latent variables. Of course, a perfect model of the world would also make great predictions but, in practice, there is always a trade-off between explainability and predictive power. That is the biggest tension between statistics and ML.
ML models are typically less interpretable black boxes but much more powerful at making predictions. They can incorporate a wide range of features, including but not limited to those provided by TrueSkill, with the sole focus of optimising a loss function, which generally translates into better predictions.
Dataset
Here is the matches dataset with all the features and target together, including the TrueSkill features. I don’t provide the actual feature engineering code for the sake of brevity, as this post is long enough as it is.
The modelling done here is pretty standard tabular ML with a couple of notable exceptions:
We remove ties, which represent roughly 1.5% of the dataset6.
We do data augmentation by swapping team1 and team2 features and adding both rows to the training set (see the sketch after this list)
We can do that as there is no “home advantage” in CS:GO like there is in football7
When making predictions, we average the predictions across both scenarios
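Here is a minimal sketch of the training-time half of that augmentation, assuming the feature columns are prefixed with team1_ and team2_ (as in the predictor class further below) and that the target is 1 when team 1 wins:

from typing import Tuple

import pandas as pd

def augment_by_swapping(x: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
    """Duplicate every row with team1_*/team2_* features swapped and the label flipped."""
    team1_cols = [c for c in x.columns if c.startswith('team1')]
    team2_cols = [c for c in x.columns if c.startswith('team2')]

    # Rename team1_* columns to team2_* and vice versa, then restore the original column order
    x_swapped = (x.rename(columns=dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)))
                  .reindex(columns=x.columns))

    x_aug = pd.concat([x, x_swapped], ignore_index=True)
    y_aug = pd.concat([pd.Series(y), 1 - pd.Series(y)], ignore_index=True)
    return x_aug, y_aug

The prediction-time half (averaging over both orderings) is shown in the predict_proba method of the model class below.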
Out-of-time train-test split
We use an out-of-time split instead of the more typical cross-validation. In pretty much any real-life application, a model is trained with past data and then used to predict future unseen data. Your evaluation should reflect that, as you might be interested in how your model's performance degrades over time (which could be caused, for example, by concept drift). In that sense, almost all ML problems are time series problems and should be evaluated as such8.
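In code, the split is just a date cutoff rather than a random shuffle. Here is a minimal sketch; the cutoff date and the size of the validation window are placeholders, not our actual choices:

import pandas as pd

dt_train = pd.Timestamp('2023-01-01')  # hypothetical train/test cutoff

dataset['match_date'] = pd.to_datetime(dataset['match_date'])
dataset = dataset.sort_values('match_date')
dataset_train = dataset[dataset['match_date'] < dt_train]
dataset_test = dataset[dataset['match_date'] >= dt_train]

# Hold out the last few weeks of the training period as a validation set for early stopping
dt_val = dt_train - pd.Timedelta(weeks=8)  # hypothetical validation window
dataset_val = dataset_train[dataset_train['match_date'] >= dt_val]
dataset_train = dataset_train[dataset_train['match_date'] < dt_val]

The plot below shows the weekly match counts of each split: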
dataset['match_date'] = pd.to_datetime(dataset['match_date'])

# Create weekly match counts
match_counts = dataset.groupby(dataset['match_date'].dt.to_period('W')).size().reset_index(name='count')
match_counts['match_date'] = match_counts['match_date'].dt.to_timestamp()

# Define color for each period
match_counts['period'] = 'Train'
match_counts.loc[match_counts['match_date'] >= pd.to_datetime(dt_train), 'period'] = 'Test'

# For validation, we'll consider it as part of the train set but with a different color
val_mask = dataset_val['match_date'].dt.to_period('W').value_counts().reset_index()
val_mask.columns = ['match_date', 'val_count']
val_mask['match_date'] = val_mask['match_date'].dt.to_timestamp()  # align dtypes before merging
match_counts = match_counts.merge(val_mask, on='match_date', how='left')
match_counts['val_count'] = match_counts['val_count'].fillna(0)
match_counts.loc[match_counts['val_count'] > 0, 'period'] = 'Validation'

# Create the plot
fig = px.line(
    match_counts, x='match_date', y='count', color='period',
    title='Number of Matches Over Time (Weekly)',
    labels={'count': 'Number of Matches', 'match_date': 'Date'},
    color_discrete_map={'Train': 'blue', 'Validation': 'green', 'Test': 'red'}
)

# Add a vertical line and annotation for the train/test split
fig.add_vline(x=dt_train, line_dash="dash", line_color="gray")
fig.add_annotation(x=dt_train, y=1, yref="paper", showarrow=False,
                   text="Train/Test Split", textangle=-90, xanchor="right")

# Update layout for better readability
fig.update_layout(
    legend_title_text='Dataset',
    xaxis_title="Date",
    yaxis_title="Number of Matches per Week",
)
fig.show()
Model: LightGBM
We use a standard off-the-shelf LightGBM binary classifier. There are many advantages to using LightGBM or XGBoost for tabular data problems (either choice is fine!):
Handles missing values natively
Handles categorical features natively
Early stopping to optimize the number of estimators
Blazing fast and scalable
Multiple loss function options, including custom ones
For binary classification, the default is the binary log loss (a proper scoring rule, which should lead to well-calibrated probabilities)
You can use SHAP for feature importance and explanations
class CSGOPredictor:
    """
    A predictor class for CS:GO match outcomes using LightGBM.
    """

    def __init__(self, model_params: Dict[str, Any]):
        """
        Initialize the CSGOPredictor.

        Args:
            model_params (Dict[str, Any]): Parameters for the LightGBM model.
        """
        self.model_params = model_params
        self.lgb = None  # Will be initialized in the fit method

    def fit(self, x_train: pd.DataFrame, y_train: np.ndarray,
            x_val: pd.DataFrame, y_val: np.ndarray) -> 'CSGOPredictor':
        """
        Fit the LightGBM model on the training data.

        Args:
            x_train (pd.DataFrame): Training features.
            y_train (np.ndarray): Training labels.
            x_val (pd.DataFrame): Validation features.
            y_val (np.ndarray): Validation labels.

        Returns:
            CSGOPredictor: The fitted predictor object.
        """
        self.lgb = LGBMClassifier(**self.model_params)
        self.lgb.fit(
            x_train, y_train,
            eval_set=[(x_train, y_train), (x_val, y_val)],
            eval_names=['training', 'validation'],
            callbacks=[
                early_stopping(stopping_rounds=25),
                log_evaluation(period=50),  # Log every 50 iterations
            ]
        )
        return self

    def predict_proba(self, x: pd.DataFrame) -> np.ndarray:
        """
        Predict probabilities for match outcomes.

        This method performs predictions twice with swapped team features
        and averages the results.

        Args:
            x (pd.DataFrame): Input features for prediction.

        Returns:
            np.ndarray: Predicted probabilities for each class.
        """
        # Original predictions
        original = self.lgb.predict_proba(x)

        # Create a copy of the input data for feature swapping
        x_inv = x.copy()

        # Identify team1 and team2 columns
        team1_cols = [i for i in x_inv.columns if i.startswith('team1')]
        team2_cols = [i for i in x_inv.columns if i.startswith('team2')]

        # Swap team1 and team2 features
        x_inv = x_inv.rename(dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)), axis=1)
        x_inv = x_inv.reindex(columns=x.columns)

        # Predictions with swapped features
        inv = self.lgb.predict_proba(x_inv)

        # Swap the probabilities for team1 and team2
        inv[:, 0], inv[:, 1] = inv[:, 1], inv[:, 0].copy()

        # Average the original and swapped predictions
        return (original + inv) / 2.0

    def predict(self, x: pd.DataFrame) -> np.ndarray:
        """
        Predict the class labels for the input data.

        Args:
            x (pd.DataFrame): Input features for prediction.

        Returns:
            np.ndarray: Predicted class labels.
        """
        return self.predict_proba(x).argmax(axis=1)
model_params = {
    'n_estimators': 10_000,  # With early stopping, we will use many fewer trees than that
    'learning_rate': 0.05
}
model = CSGOPredictor(model_params).fit(x_train, y_train, x_val, y_val)
Training until validation scores don't improve for 25 rounds
[50] training's binary_logloss: 0.563521 validation's binary_logloss: 0.586438
[100] training's binary_logloss: 0.536487 validation's binary_logloss: 0.58016
[150] training's binary_logloss: 0.517125 validation's binary_logloss: 0.578305
Early stopping, best iteration is:
[145] training's binary_logloss: 0.51895 validation's binary_logloss: 0.578226
Feature importance
Here is the “beeswarm” view of SHAP values. It shows not just the importance but also how each feature influences the prediction logits9. You can also apply SHAP to individual samples to understand what features caused their prediction logits.
Unsurprisingly, the TrueSkill win probability features are the most important ones. In a sense, this can be seen as a form of stacking, since TrueSkill is another model. Other important features relate to the team’s past performance, like KD ratio and ADR.
Are ~250 features really necessary? Probably not, especially with just 30k samples10. We didn’t do any feature selection, but I’d do permutation importance and adversarial validation on a time split if I had more time on my hands11.
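As a pointer, permutation importance on the out-of-time test set is a one-liner with scikit-learn. This is a sketch of what I would have tried, not something we actually ran (it uses the underlying LGBMClassifier stored in model.lgb, so it ignores the team-swap averaging):

import pandas as pd
from sklearn.inspection import permutation_importance

# The AUC drop when a feature is shuffled: the larger the drop, the more useful the feature
result = permutation_importance(
    model.lgb, x_test, y_test,
    scoring='roc_auc',
    n_repeats=10,
    random_state=42,
)

importances = pd.Series(result.importances_mean, index=x_test.columns).sort_values(ascending=False)
print(importances.head(20))

Next, let's check the model's calibration: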
def plot_calibration_curve(y_true, y_pred_proba, set_name, fig, color):
    # calibration_curve returns (fraction of positives, mean predicted value), in that order
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_pred_proba, n_bins=10)
    fig.add_trace(go.Scatter(
        x=mean_predicted_value,
        y=fraction_of_positives,
        mode='lines+markers',
        name=f'{set_name} set',
        line=dict(color=color)
    ))

# Create a new figure for the calibration plot
calibration_fig = go.Figure()

# Add the perfectly calibrated line
calibration_fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Perfectly calibrated',
    line=dict(dash='dot')
))

# Plot calibration curves for the training and test sets
plot_calibration_curve(y_train, model.predict_proba(x_train)[:, 1], 'Training', calibration_fig, 'blue')
plot_calibration_curve(y_test, model.predict_proba(x_test)[:, 1], 'Test', calibration_fig, 'red')

# Set layout properties for the calibration plot
calibration_fig.update_layout(
    title="Calibration plot",
    xaxis_title="Mean predicted value",
    yaxis_title="Fraction of positives",
    xaxis=dict(tickvals=[i / 10 for i in range(11)], range=[0, 1]),
    yaxis=dict(tickvals=[i / 10 for i in range(11)], range=[0, 1]),
    showlegend=True
)
calibration_fig.show()
The model seems well calibrated (slightly more so on the test set than on the training set, a welcome surprise), which makes it useful for betting: recall from the previous post that our betting decision rule is based on the probability of team 1 or 2 winning. If you use a probability for decision making, it generally needs to be calibrated.
If the model wasn't well calibrated, we could have used isotonic regression on a validation set to fix that. There are other options for post-hoc model calibration, like Platt scaling, but isotonic regression tends to work best for tree-based models.
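A minimal sketch of what that post-hoc step would look like, fitting the calibrator on the validation set (we did not need it in the end):

from sklearn.isotonic import IsotonicRegression

# Learn a monotone mapping from raw scores to calibrated probabilities on the validation set
raw_val = model.predict_proba(x_val)[:, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds='clip')
calibrator.fit(raw_val, y_val)

# Apply the mapping to new predictions
raw_test = model.predict_proba(x_test)[:, 1]
calibrated_test = calibrator.predict(raw_test)

Next, let's look at how the model's performance holds up over time: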
def auc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store AUC for each week
    weekly_auc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            weekly_auc[week_start_date] = auc
    return pd.Series(weekly_auc)


def acc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store accuracy for each week
    weekly_acc = {}
    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            acc = accuracy_score(y, model.predict(X))
            weekly_acc[week_start_date] = acc
    return pd.Series(weekly_acc)
# Calculate weekly AUC for training and test sets
weekly_auc_train = auc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_auc_test = auc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the AUC over time using Plotly
trace0 = go.Scatter(
    x=weekly_auc_train.index,
    y=weekly_auc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)
trace1 = go.Scatter(
    x=weekly_auc_test.index,
    y=weekly_auc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)
layout = go.Layout(
    title='AUC Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='AUC'),
    showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_auc = weekly_auc_train.mean()
avg_test_auc = weekly_auc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_auc_train.index.min(), y0=avg_train_auc,
              x1=weekly_auc_train.index.max(), y1=avg_train_auc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_auc_test.index.min(), y0=avg_test_auc,
              x1=weekly_auc_test.index.max(), y1=avg_test_auc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_auc_train.index.max(), y=avg_train_auc,
                   text=f"Train Avg: {avg_train_auc:.2f}",
                   showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_auc_test.index.max(), y=avg_test_auc,
                   text=f"Test Avg: {avg_test_auc:.2f}",
                   showarrow=False, yshift=10, bgcolor="white")
fig.show()
There is a train-test performance gap, which implies overfitting, but that's not a big concern per se. What we really care about is the out-of-time performance, which will also be evaluated with the backtest below. Overfitting is not uncommon in gradient-boosted tree models, but their generalization performance tends to still be better than that of other models like logistic regression or random forests (I will leave model comparison as an exercise to the reader).
Also, note that there is a big drop in the last 3 weeks of the test dataset. That is exactly when I lost most of the money! There was some kind of drift or event in that period which made the model perform much worse. That also suggests we should not let the model go for a long time without re-training. Unfortunately, when we first started to place the bets, those last weeks of the test set were not available to us; they were still future, unseen data.
# Calculate weekly accuracy for training and test sets
weekly_acc_train = acc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_acc_test = acc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the accuracy over time using Plotly
trace0 = go.Scatter(
    x=weekly_acc_train.index,
    y=weekly_acc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)
trace1 = go.Scatter(
    x=weekly_acc_test.index,
    y=weekly_acc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)
layout = go.Layout(
    title='Accuracy Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='Accuracy'),
    showlegend=True
)
fig = go.Figure(data=[trace0, trace1], layout=layout)
fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_acc = weekly_acc_train.mean()
avg_test_acc = weekly_acc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_acc_train.index.min(), y0=avg_train_acc,
              x1=weekly_acc_train.index.max(), y1=avg_train_acc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_acc_test.index.min(), y0=avg_test_acc,
              x1=weekly_acc_test.index.max(), y1=avg_test_acc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_acc_train.index.max(), y=avg_train_acc,
                   text=f"Train Avg: {avg_train_acc:.2f}",
                   showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_acc_test.index.max(), y=avg_test_acc,
                   text=f"Test Avg: {avg_test_acc:.2f}",
                   showarrow=False, yshift=10, bgcolor="white")
fig.show()
The accuracy plot is similar to the AUC one in almost all aspects. Note that we're doing much better than predicting at random, but that is not a good baseline here. A much better baseline would be the accuracy obtained from the probabilities implied by the betting odds.
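For reference, that baseline is easy to compute once you have decimal odds: the implied probability of an outcome is the inverse of its odds, renormalised to remove the bookmaker's margin. A sketch, assuming a hypothetical odds_df with columns team1_odds, team2_odds and target (1 if team 1 won):

from sklearn.metrics import accuracy_score

# Implied probabilities from decimal odds, renormalised so they sum to 1
p1_raw = 1.0 / odds_df['team1_odds']
p2_raw = 1.0 / odds_df['team2_odds']
p1_implied = p1_raw / (p1_raw + p2_raw)

# Baseline: predict that team 1 wins whenever the market makes it the favourite
baseline_pred = (p1_implied > 0.5).astype(int)
baseline_acc = accuracy_score(odds_df['target'], baseline_pred)
print(f"Odds-implied baseline accuracy: {baseline_acc:.3f}")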
Backtesting
Past performance is no guarantee of future results.
Backtesting is replaying the past with your model's decisions. One example of a backtesting procedure is the following (a simplified code sketch follows the list):
Train model with data up to a certain date
Sample betting odds for the next matches
Make bets for those next matches according to your betting strategy
Repeat 1-3 until you cover all the test data
Evaluate ML metrics (e.g. AUC) and business metrics (e.g. ROI) on your bets
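Here is a minimal sketch of such a loop with a single pre-trained model and a fixed stake per bet, matching the simplified setup described below. The column names ('team1_odds', 'team2_odds', 'target') and the helper itself are assumptions for illustration, not our actual backtesting code:

import pandas as pd

def backtest_fixed_stake(bets_df: pd.DataFrame, model, features,
                         stake: float = 10.0, delta: float = 0.01) -> pd.DataFrame:
    """Replay past matches and bet a fixed stake whenever the model sees an edge.

    Assumes one row per match with decimal odds in 'team1_odds'/'team2_odds'
    and 'target' equal to 1 if team 1 won, 0 otherwise.
    """
    df = bets_df.sort_values('match_date').copy()
    p1 = model.predict_proba(df[features])[:, 1]  # P(team 1 wins)

    profits = []
    for p, o1, o2, won1 in zip(p1, df['team1_odds'], df['team2_odds'], df['target']):
        profit = 0.0
        # Bet on team 1 if we are confident and see an edge over the odds-implied probability
        if p > 0.5 and p > 1.0 / o1 + delta:
            profit = stake * (o1 - 1.0) if won1 else -stake
        # Otherwise consider the symmetric bet on team 2
        elif (1 - p) > 0.5 and (1 - p) > 1.0 / o2 + delta:
            profit = stake * (o2 - 1.0) if not won1 else -stake
        profits.append(profit)

    df['profit'] = profits
    return df

# ROI = total profit divided by the total amount staked, for example:
# results = backtest_fixed_stake(odds_dataset, model, features)  # odds_dataset is hypothetical
# roi = results['profit'].sum() / (10.0 * (results['profit'] != 0).sum())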
Backtesting allows us to assess our financial performance, which matters a lot more than ML metrics. For example, is an AUC of 0.77 good or bad? That is hard to tell in general, while an ROI of 1.1 is something we can understand and compare to other strategies (including leaving your money in the bank to earn risk-free interest).
Here, we only assess the ROI of the bets, not other financial metrics like the Sharpe ratio or max drawdown.
For simplicity, we just train the model once and keep it fixed for all future bets, which makes the backtest more conservative. Also, to keep it simple and conservative, we sample the betting odds at random, while in practice we had access to more than one odds quote per match (and could have picked the most favourable one).
First, let's download the dataset of matches with betting odds. Our betting strategy was simple:
Only bet if the probability of winning is over 50%
Only bet if the probability of winning is greater than the probability implied by the odds plus a delta of 1%
The bet can either be a fixed amount or determined by the Kelly criterion (here, for simplicity, I only show fixed betting – see previous blog post for a discussion on the Kelly criterion and some variants)
Using the notation from the previous post, here are the betting equations:
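With $p$ the model's estimated probability that team 1 wins, $o_1$ and $o_2$ the decimal odds on each team, and $\delta$ the safety margin (1% here), the rules above amount to:

$$
\text{bet on team 1 if } \quad p > 0.5 \ \text{ and } \ p > \frac{1}{o_1} + \delta,
$$
$$
\text{bet on team 2 if } \quad 1 - p > 0.5 \ \text{ and } \ 1 - p > \frac{1}{o_2} + \delta.
$$

A winning bet of stake $s$ at decimal odds $o$ pays a profit of $s(o - 1)$; a losing bet costs $s$. (The exact notation in the previous post may differ; this is just a restatement of the rules above.)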
There was some trial and error involved in designing our betting strategy and I'm sure there is room for improvement. The delta of 1% is our safety margin due to model error, and we found it with a grid search. It's a parameter you can play with in the simulation below:
The ROI after 2 months is 10%, which annualized would be 63%, not bad at all! For reference, the risk-free interest rate in the US today is around 5% per year, while the average S&P 500 return is roughly 10% a year.
We did have an edge after all, or so it seemed. Let’s see the uncertainty across multiple simulations, where the randomness comes from sampling different betting odds for each match:
# Create a Plotly figure
fig = go.Figure()

# Add traces for each sample's cumulative profits
for sample_data in all_samples_data:
    # Make sure to sort the sample_data by 'match_date'
    sample_data_sorted = sample_data.sort_values(by='match_date')
    fig.add_trace(go.Scatter(
        x=sample_data_sorted['match_date'],
        y=sample_data_sorted['profit'].cumsum(),
        mode='lines',
        line=dict(width=1, color='lightgrey'),
        showlegend=False
    ))

# Add a trace for the average cumulative profits per date
fig.add_trace(go.Scatter(
    x=daily_profit_sum['match_date'],
    y=daily_profit_sum['cumulative_profit'],
    mode='lines',
    name='Avg Cum. Profits',
    line=dict(width=3, color='blue')
))

# Adding ROI text
fig.add_trace(go.Scatter(
    x=[daily_profit_sum['match_date'].iloc[-1] + pd.DateOffset(days=4)],
    y=[daily_profit_sum['cumulative_profit'].iloc[-1]],
    text=[f"ROI: {roi:.2f}"],  # The ROI text
    mode="text",
    showlegend=False,
    textfont=dict(  # Adjust the font properties here
        size=14,
        color='black',
    )
))

# Update layout to add titles and make it more informative
fig.update_layout(
    title="Cumulative Profits over Time with Average",
    xaxis_title="Match Date",
    yaxis_title="Cumulative Profit",
    legend_title="Legend",
    template="plotly_white",
    xaxis=dict(type='date')  # Ensure that x-axis is treated as date
)

# Show the figure
fig.show()