Winning Kaggle Competition is a Piece of Cake!

Introduction

This post is about how I won the KaggleX Skill Assessment Machine Learning Competition. Disclaimer: I was just a beginner in machine learning and Kaggle competitions when I started. This competition ran for about 20 days, and I improved day by day and climbed up the leaderboard. Initially, I was ranked around 368-ish on the private leaderboard, but when the competition ended, the tables turned and I was Rank 1.

Talking about my background, I had just completed my freshman year at the National Institute of Technology Karnataka (NITK) and was on my summer break from April 15th till mid-July, so I had plenty of time. My interest in machine learning started in my first semester when I went to a flagship IEEE event at NITK called IEEE Eureka. I didn’t even know what machine learning was back then, but I was fascinated. Also, it was kind of trending at that time around 2023, ChatGPT and LLMs were on a roll and AI was everywhere so I thought, why not give it a try?

That IEEE bootcamp boosted my interest in machine learning. From there, I started learning through Andrew Ng’s course on Coursera (btw it’s one of the goated courses for ML and DL). I also watched YouTube channels like:

StatQuest by Josh Starmer
3Blue1Brown for calculus and math intuition

Also, Math I and Math II in college helped a lot in understanding gradients, linear algebra, series, and calculus.

By the start of my second year, I had completed Andrew Ng’s ML course 1 and tried out models like linear regression, logistic regression, and basic data preprocessing. Then I started watching coding videos related to ML and came across Kaggle. Initially, it felt too complicated, but later I realized it’s just like LeetCode is for DSA, Kaggle is for AI and data science. A lot of Kaggle competitions are hosted by big tech companies.

I tried some starter competitions, but my performance was not that great:

Multi-Class Prediction of Obesity Risk → 1064 / 3587 (Top ~29.66%)
Binary Classification with a Bank Churn Dataset → 1188 / 3632 (Top ~32.71%)
Titanic: Machine Learning from Disaster → 967 / 2938 (Top ~32.91%)
Rohlik Orders Forecasting Challenge → 783 / 1017 (Top ~77.00%)

Then came the KaggleX Skill Assessment Machine Learning Competition, which was a 20 day long competition. I was really excited because it felt like a great opportunity to test my skills and learn from others. This competition was also an entry point for the Kaggle Cohort 2024 mentorship for building open-source stuff, so I was really hyped.

Even though my previous performance wasn’t great, I was determined to give it my best shot and learn as much as possible from the experience.

KaggleX Skill Assessment Challenge

The KaggleX Skill Assessment Challenge was part of the KaggleX Fellowship application process. Along with the application form, participants had to complete this competition to demonstrate their practical machine learning and data science skills.

The core task of the competition was simple:

Predict the price of a used vehicle based on given features like brand, mileage, engine details, transmission, etc.

This was a regression problem, and submissions were evaluated using Root Mean Squared Error (RMSE), meaning lower scores were better.

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]

Where:

$n$ = number of samples
$y_i$ = actual value
$\hat{y}_i$ = predicted value

For me, it turned out to be a really good learning experience and eventually, a winning one

Here are my final scores and results from the competition:

GitHub Repository (Full Code & Solution)

Dataset Description

The training dataset contained:

~52,893 rows (after dropping missing values)
13 columns (features + target)

Features

The dataset included the following features:

id → unique identifier
brand → car brand (BMW, Audi, Ford, etc.)
model → specific car model
model_year → year of manufacture
milage → total distance driven
fuel_type → type of fuel (Gasoline, Hybrid, etc.)
engine → engine description (contained horsepower and capacity info in text form)
transmission → type of transmission
ext_col → exterior color
int_col → interior color
accident → accident history
clean_title → whether the title is clean or not
price → target variable (car price)

Key Observation

One important thing about the dataset was:

The engine column was not structured
It contained mixed information like:
- Horsepower (HP)
- Engine capacity (L)
- Engine type

This became a very important feature engineering opportunity later.

Files Provided

The competition provided three main files:

train.csv → training data with target (price)
test.csv → test data (no price, needed for submission)
sample_submission.csv → format for submission

Feature Engineering

In this part, I mainly focused on making the raw data more useful for the model. Some columns already looked good, but the engine column had a lot of useful information hidden inside text, so I extracted that.

Extracting Horsepower and Engine Capacity

The engine column looked something like:

"375HP 3.5L V6 Cylinder Engine"
"300HP 3.0L Straight 6 Cylinder"

So I extracted:

Horsepower → from XXXHP
Engine capacity → from XXXL

This helped convert text into numerical features.

def get_horse_power(x, df):
    hp = []
    for i in x:
        if i.split()[0].endswith('HP'):
            hp.append(float(i.split()[0].strip('HP')))
        else:
            hp.append(np.nan)
    df['horse power'] = hp
    return df

train_df = get_horse_power(train_df['engine'], train_df)
test_df = get_horse_power(test_df['engine'], test_df)

def get_capacity(x, df):
    capacity = []
    x_new = [i.split() for i in x]

    for i in x_new:
        if len(i) >= 2 and i[1].endswith('L'):
            capacity.append(float(i[1].strip('L')))
        else:
            capacity.append(np.nan)

    df['capacity'] = capacity
    return df

train_df = get_capacity(train_df['engine'], train_df)
test_df = get_capacity(test_df['engine'], test_df)

Preprocessing Steps

After extracting features, some values were missing. I filled them using the mean.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_inputs[numeric_cols])

train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

Since models can’t directly use text, I converted categorical columns into numbers using one-hot encoding.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])

encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

For numerical columns, I also scaled values between 0 and 1.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_inputs[numeric_cols])

train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

What Else Could Be Done (Advanced Ideas)

I didn’t go very deep here, but some more things that could be done:

Extract number of cylinders from engine (V6, V8, etc.)
Convert model_year → car age
Combine features (like horsepower per liter)
Frequency encoding for brand
Use embeddings for categorical features
Use text embeddings on model or engine

Feature Selection

After creating new features, I didn’t use everything directly. I selected only the features that I felt were actually useful for predicting price.

I mainly focused on:

Keeping meaningful features
Removing unnecessary columns
Avoiding noise

What Features I Used

I selected the following input features:

brand
model_year
milage
fuel_type
horse power
capacity
transmission

Target variable:

price

Code for Selecting Features

input_cols = ['brand', 'model_year', 'milage', 'fuel_type', 
              'horse power', 'capacity', 'transmission']

target_col = 'price'

train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()
test_inputs = test_df[input_cols].copy()

Using Correlation & Covariance Matrix

I used a correlation matrix to understand how numerical features are related to the target (price).

Why I used this:

To see which features are strongly related to price
To identify useless features
To check if some features are redundant

Code for Correlation Matrix

import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = train_df[numeric_cols + ['price']].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

Some features had strong correlation with price
Some had very low correlation → not very useful
Helped me understand which features matter more

I also checked covariance to understand how features vary together.

Why I used this:

To see how features change together
To detect relationships between variables
To get a rough idea of feature influence

Covariance is similar to correlation but not normalized. It can show the direction of the relationship (positive or negative) and the strength in terms of actual values.

cov_matrix = train_df[numeric_cols + ['price']].cov()
print(cov_matrix)

My Approach

I didn’t overcomplicate feature selection. I mainly:

Used domain intuition (what affects car price)
Checked correlation
Removed obvious unnecessary features

Experimenting with Models

After preprocessing and selecting features, I started trying different machine learning models to see what works best.

I didn’t assume any model would work. I just tested multiple models and compared their performance.

I created a dictionary of models along with their parameters and tried all of them.

This helped me:

Quickly compare models
Avoid writing repetitive code
See which models perform better

Models

Linear Regression
SGD Regressor
Decision Tree
Random Forest
Gradient Boosting
XGBoost
LightGBM
MLP Regressor (Neural Network)
CatBoost

Code for Trying Multiple Models

from sklearn.base import RegressorMixin

models: dict[RegressorMixin, dict[str, float]] = {

    LinearRegression: {
        'n_jobs': -1
    },

    SGDRegressor: {
        'eta0': 0.001,
        'max_iter': 2000,
        'penalty': 'l1',
        'learning_rate': 'adaptive'
    },

    DecisionTreeRegressor: {
        'max_depth': 32,
        'max_features': 'sqrt',
        'max_leaf_nodes': 100,
        'min_samples_leaf': 4
    },

    RandomForestRegressor: {
        'n_estimators': 400,
        'max_depth': 32,
        'min_samples_split': 10,
        'min_samples_leaf': 4,
        'n_jobs': -1
    },

    GradientBoostingRegressor: {
        'learning_rate': 0.06,
        'n_estimators': 900,
    },

    XGBRegressor: {
        'n_estimators': 400,
        'max_depth': 10,
        'learning_rate': 0.01,
        'subsample': 0.5,
    },

    LGBMRegressor: {
        'n_estimators': 500,
        'num_leaves': 32,
        'learning_rate': 0.01,
        'subsample': 0.7,
    },

    MLPRegressor: {
        'activation': 'identity',
        'solver': 'adam',
        'batch_size': 32,
        'max_iter': 400,
        'random_state': 42,
        'beta_1': 0.5,
    },

    CatBoostRegressor: {
        'n_estimators': 500,
        'od_type': 'Iter',
        'bootstrap_type': 'Bernoulli',
    },
}

Instead of manually checking results, I used my evaluation pipeline:

Trained each model
Calculated RMSE, MAE, and R²
Compared validation scores
Linear models didn’t perform well
Decision Trees overfit easily
Random Forest was a good baseline
Boosting models performed the best

The best ones were:

CatBoost
LightGBM
XGBoost

Why Boosting Models Worked Better

From what I saw:

Data had non-linear relationships
Simple models couldn’t capture patterns
Boosting models handled it much better

Model Evaluation Strategy

After defining multiple models, I needed a proper way to evaluate and compare them. Instead of manually training each model and checking results, I created a small evaluation pipeline.

I wanted to:

Train models easily
Compare them fairly
Check both train and validation performance

I wrote a function that:

Takes a model and its parameters
Trains it on training data
Evaluates it on validation data
Returns multiple metrics

Code for Evaluating a Single Model

def evaluate_model(model : type[RegressorMixin],
                   params : dict,
                   X_train : np.ndarray,
                   train_targets : np.ndarray,
                   X_val: np.ndarray, 
                   val_targets :np.ndarray) -> tuple[float,float,float,float,float,float]:
    
    regressor = model(**params).fit(X_train, train_targets)
    train_preds = regressor.predict(X_train)
    val_preds = regressor.predict(X_val)

    train_mae = mean_absolute_error(train_targets, train_preds)
    val_mae = mean_absolute_error(val_targets, val_preds)

    train_rmse = mean_squared_error(train_targets, train_preds, squared=False)
    val_rmse = mean_squared_error(val_targets, val_preds, squared=False)

    train_r2 = r2_score(train_targets, train_preds)
    val_r2 = r2_score(val_targets, val_preds)

    return (train_mae, val_mae, train_rmse, val_rmse, train_r2, val_r2)

Trying Multiple Models Together

Then I wrote another function to test all models at once.

This helped me:

Save time
Compare all models in one place
Avoid repetitive code

Code for Trying All Models

def try_models(model_dict: dict[RegressorMixin,dict[str,float]], 
               X_train: np.array, 
               train_targets: np.array, 
               X_val: np.array, 
               val_targets: np.array) -> pd.DataFrame:
    
    results = Parallel(n_jobs=-1)(
        delayed(evaluate_model)(model, params, X_train, train_targets, X_val, val_targets) 
        for model, params in model_dict.items()
    )

    metrics = ['train_mae', 'val_mae', 'train_rmse', 'val_rmse', 'train_r2', 'val_r2']
    results_dict = {metric: [result[i] for result in results] for i, metric in enumerate(metrics)}

    df = pd.DataFrame({
        'models': list(model_dict.keys()),
        'params': list(model_dict.values()),
        **results_dict
    })

    return df

I could quickly see which model performs better
I focused mainly on validation RMSE
I also checked train vs validation gap to detect overfitting

Baseline Model (CatBoost)

After trying multiple models, I saw that boosting models were performing better. So I decided to use CatBoost as my baseline model.

From the results, I picked the best parameters for CatBoost:

best_cbr_params = {
    'n_estimators': 500,
    'od_type': 'Iter',
    'bootstrap_type': 'Bernoulli',
    'verbose': 0
}

Training the Baseline Model

I trained CatBoost using these parameters and evaluated it on validation data.

Train RMSE → ~51,454
Validation RMSE → ~77,474
Train R² → negative
Validation R² → negative
The model was performing very poorly
Negative R² means it was worse than predicting the mean
High RMSE showed predictions were far off

Hyperparameter Tuning (Optuna)

After seeing that my baseline CatBoost model was performing badly, I decided to properly tune the model instead of guessing parameters.

I used Optuna for this.

I defined a search space for CatBoost and let Optuna try different combinations to minimize RMSE.

I focused on:

Learning rate
Number of trees
Regularization
Sampling parameters

Code for Optuna Tuning

import optuna

def objective(trial):

    params = {
        "n_estimators": trial.suggest_int("n_estimators", 800, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.006),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1, 2),
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.6, 0.8),
        "subsample": trial.suggest_float("subsample", 0.8, 1),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 150, 200),
        "max_bin": trial.suggest_int("max_bin", 120, 180),
        "od_wait": trial.suggest_int("od_wait", 50, 80, step=10),
        "od_type": 'Iter',
        "bootstrap_type": 'Bernoulli',
        "random_seed": 42,
        "verbose": 0,
    }

    model = CatBoostRegressor(**params)
    model.fit(X_train, train_targets, eval_set=[(X_val, val_targets)])

    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(val_targets, preds))

    return rmse

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=150)

Final Best Parameters

After running 150 trials, these were the best parameters:

n_estimators → ~900+
learning_rate → ~0.005–0.006
l2_leaf_reg → ~1–2
colsample_bylevel → ~0.6–0.8
subsample → ~0.8–1
min_data_in_leaf → ~150–200
max_bin → ~120–180
od_wait → ~50–80

Key Learnings & Takeaways

This competition taught me a lot, especially as a beginner.

What I Learned

Start simple and improve step by step
Feature engineering is very important
Good preprocessing makes a big difference
Don’t rely on just one model
Boosting models work really well for tabular data

What Helped Me Improve

Trying multiple models
Proper evaluation (train vs validation)
Using Optuna for tuning
Learning from Kaggle discussions and notebooks

Final Takeaway

I didn’t do anything extremely complex.

I just:

Kept experimenting
Learned from mistakes
Improved step by step

References

Here are some resources that helped me during the competition:

Andrew Ng’s Machine Learning Course (Coursera)
StatQuest by Josh Starmer (YouTube)
3Blue1Brown (Math intuition)
Scikit-learn Documentation (Preprocessing, Models, Metrics)
Optuna Documentation (Hyperparameter tuning)
CatBoost Documentation
XGBoost Documentation
LightGBM Documentation
Kaggle Learn (Courses & Tutorials)
Kaggle Discussion Forums (very useful for ideas and approaches)
GitHub Repository (Full Code & Solution)