Kaggle 1st place in 30 minutes
Lately, I’ve been taking on machine learning challenges on Kaggle. After years of managing painstaking data science consulting projects, I joined an AutoML startup wondering what to expect. All of my previous machine learning and deep learning projects were done directly in Python.
I decided to see how AutoML would rank in last summer’s legendary challenge, “Mercedes Benz Greener Manufacturing.”
Mercedes-Benz sells cars that can be customized in many different ways. The company told Kaggle competitors that, when testing these cars for safety and reliability, they usually work with at most a couple of hundred cars every day. Mercedes anonymized its elaborate list of custom features, such as four-wheel-drive and head-up display, for the competition. The goal was to predict the length of time it takes for each configuration of car to pass testing. With that knowledge, the company could save time, cutting costs and pollutant emissions.
Altogether, 3,835 teams competed over a small data set afflicted by the curse of dimensionality: as the number of features ("dimensions") grows, the amount of data needed to learn and predict accurately grows exponentially. The training data had just 4,209 rows but a whopping 377 features. Mercedes-Benz tests a lot of different car configurations!
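To get a feel for how lopsided that ratio is, here is a quick sketch using a synthetic stand-in for the competition data (the real files and column names aren't reproduced here):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition data: 4,209 rows, 377 features.
# Column names X0..X376 are placeholders, not the real Mercedes columns.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.integers(0, 2, size=(4209, 377)),
                     columns=[f"X{i}" for i in range(377)])

rows, features = train.shape
print(rows, features, round(rows / features, 1))  # about 11 rows per feature
```

Roughly eleven rows per feature leaves very little signal per dimension, which is exactly why feature selection mattered so much in this competition.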
The hour was 11:15 am. I downloaded the competition's CSV files: a training set to build a model and a test set to predict on. I uploaded both data sets to Firefly Lab and let it run.
My first attempt was to set the lab to train 200 models. To me, it felt like a "cold start" to see how everything worked. That effort didn't even crack the top page of results, the top 50 entries on the Kaggle leaderboard. I was not impressed.
So I decided to give it one more try. I changed the number of models to train to 1000.
The score that won the competition was 0.55550. My submission scored 0.55792 (the competition was scored on R², so higher is better). The margin by which my submission beat the winning entry was about the same as the margin by which the winner beat the 55th-place entry. The time was now 11:47 am. I was astounded.
I had started with eight types of algorithms, including neural networks. The automatically generated metrics report showed that Firefly had settled on an ensemble of four complementary algorithm types: XGBoost, Ridge Regression, Extra Trees, and Random Forest.
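Firefly doesn't disclose exactly how it combines those models, but a minimal sketch of the same idea, averaging the four model families with scikit-learn on toy data, might look like this (GradientBoostingRegressor stands in for XGBoost to avoid an extra dependency):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy data standing in for the anonymized Mercedes features.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Simple averaging ensemble of the four model families Firefly chose.
# GradientBoostingRegressor is a stand-in for XGBoost here; swap in
# xgboost.XGBRegressor if that package is installed.
ensemble = VotingRegressor([
    ("gbm", GradientBoostingRegressor(random_state=0)),
    ("ridge", Ridge(alpha=1.0)),
    ("extra", ExtraTreesRegressor(n_estimators=100, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# The competition was scored on R^2, so evaluate with the same metric.
scores = cross_val_score(ensemble, X, y, cv=3, scoring="r2")
print(round(scores.mean(), 3))
```

The appeal of mixing a linear model (Ridge) with three tree ensembles is that their errors tend to be uncorrelated, so averaging their predictions smooths out each family's weaknesses.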
The report also shed light on how critical feature selection and engineering had been in this case. I was floored that the data cleaning and preparation that would have taken days and sometimes weeks on my own was handled within minutes.
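The report doesn't reveal the exact steps Firefly took, but one common automated feature-selection pattern, shown here on toy data with scikit-learn, is to drop features whose tree-based importance falls below the average:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Toy data: many features, few of them informative, like the Mercedes set.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Keep only features whose random-forest importance exceeds the mean
# importance (SelectFromModel's default threshold for tree models).
selector = SelectFromModel(RandomForestRegressor(n_estimators=100,
                                                 random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1])
```

On a 377-feature data set with only ~4,200 rows, pruning uninformative columns like this is often the difference between a model that generalizes and one that memorizes noise.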
I didn't have a lot of processing power at my fingertips when I approached this challenge; I relied on a standard T470s laptop without a GPU. Altogether, the preprocessing, hyperparameter tuning, and model building took just half an hour…and I didn't have to do any of those tasks myself.
This is the start of my Kaggle journey. Stay tuned for the next chapter…