When doing the Udacity Machine Learning Engineer course some years ago I had to come up with a capstone project. Being in the market for a new car, I decided to see if I could predict the price of a second-hand car. This project followed a typical data science workflow:
Coming up with an idea: This step is not always mentioned, but it is an extremely important one. If you don’t come up with an idea, you cannot realize it!
Data gathering: I chose gaspedaal.nl as the site and scraped about 350,000 cars from it. Scraping can be legally sketchy, but since this was an academic project and I didn’t overload the server, I figured it would be OK. Gaspedaal.nl is also itself a site that scrapes several car marketplaces.
Data cleaning & preprocessing: This step makes sure the data will be of use to a model. It involves removing certain outliers and processing categorical variables. We all know the garbage in, garbage out principle.
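To make the cleaning step concrete, here is a minimal sketch in plain Python. The listings, the field names and the 1.5 × IQR outlier rule are all assumptions made for illustration, not the project's actual data or code; on a real dataset a library such as pandas would be the more usual tool.

```python
from statistics import quantiles

# Hypothetical listings; field names and values are made up for illustration.
cars = [
    {"brand": "Toyota", "fuel": "hybrid", "mileage_km": 120_000, "price": 9_500},
    {"brand": "Volkswagen", "fuel": "petrol", "mileage_km": 90_000, "price": 11_000},
    {"brand": "Toyota", "fuel": "petrol", "mileage_km": 150_000, "price": 6_000},
    {"brand": "Opel", "fuel": "diesel", "mileage_km": 200_000, "price": 4_500},
    {"brand": "Volkswagen", "fuel": "diesel", "mileage_km": 60_000, "price": 14_000},
    {"brand": "Opel", "fuel": "petrol", "mileage_km": 110_000, "price": 999_999},  # typo-like outlier
]


def remove_price_outliers(rows, k=1.5):
    """Keep rows whose price lies within k * IQR of the quartiles."""
    q1, _, q3 = quantiles([r["price"] for r in rows], n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r["price"] <= high]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {key: value for key, value in r.items() if key != column}
        for c in categories:
            new_row[f"{column}_{c}"] = int(r[column] == c)
        encoded.append(new_row)
    return encoded


clean = remove_price_outliers(cars)
features = one_hot(one_hot(clean, "brand"), "fuel")
print(features[0])
```

After this, every row contains only numbers, which is the form most regression models expect.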
Model selection: The model should generate predictions, but which algorithm should you choose? Trying things out helps: pick some sensible options and keep the best-performing one. When you execute a project in a corporate setting, other considerations of course matter too, such as maintainability, robustness and speed.
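The "try a few options and keep the best" idea can be sketched with k-fold cross-validation. The two candidate models (a mean baseline and a one-feature straight line) and the data points below are assumptions chosen to keep the example dependency-free; they are not the models compared in the project.

```python
from statistics import fmean

# Hypothetical (mileage in 1000 km, asking price in euros) pairs, made up for illustration.
data = [(20, 18500), (45, 16200), (60, 14900), (80, 13100), (95, 12400),
        (110, 10800), (130, 9700), (150, 8100), (170, 6900), (200, 5200)]


def fit_mean(train):
    """Baseline: always predict the average training price."""
    avg = fmean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Least-squares straight line through the training points."""
    mx = fmean(x for x, _ in train)
    my = fmean(y for _, y in train)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    return lambda x: my + slope * (x - mx)


def cv_mae(fit, rows, k=5):
    """Mean absolute error, averaged over k cross-validation folds."""
    errors = []
    for i in range(k):
        test = rows[i::k]  # every k-th row, starting at i, is held out
        train = [r for j, r in enumerate(rows) if j % k != i]
        model = fit(train)
        errors.append(fmean(abs(model(x) - y) for x, y in test))
    return fmean(errors)


candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: cv_mae(fit, data) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Scoring every candidate on held-out folds, rather than on the data it was trained on, is what makes the comparison fair.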
Model optimizing: Many models have a set of knobs to turn, but which position gives the best results? Again, trying things out helps. I’m still looking for a nice Design of Experiments package to do this, but in this project I used a randomized grid search, which basically tries random combinations of settings and sees what works.
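A randomized search over such knobs can be sketched in a few lines. The parameter names, the candidate values and the `validation_error` function are hypothetical stand-ins (a real run would train and score a model at that point); scikit-learn offers a ready-made version of this idea as `RandomizedSearchCV`, though the sketch below does not use it.

```python
import random

# Hypothetical search space; the parameter names are illustrative only.
search_space = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 8, 12],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}


def validation_error(params):
    """Stand-in for 'train the model and score it on held-out data'.

    A made-up function whose minimum lies at
    n_estimators=200, max_depth=8, learning_rate=0.1.
    """
    return (abs(params["n_estimators"] - 200) / 100
            + abs(params["max_depth"] - 8)
            + abs(params["learning_rate"] - 0.1) * 10)


def random_search(space, score, n_iter=20, seed=0):
    """Sample n_iter random combinations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


best_params, best_score = random_search(search_space, validation_error)
print(best_params, best_score)
```

With 64 possible combinations, 20 random draws cover only part of the grid, which is the trade-off: less compute than an exhaustive search, at the risk of missing the very best setting.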
Model interpretation: People often complain about machine learning being a black box, but when you’re not dealing with neural nets that statement is incorrect. Visualisation is the easiest way in, but there are also packages that open up the black box, such as the shap package.
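As a taste of what opening the box looks like, here is a sketch of permutation importance: shuffle one feature and see how much the error grows. Note that this is a simpler, different technique from what the shap package computes, and that the `model` function, the feature names and the synthetic data are assumptions made for illustration.

```python
import random
from statistics import fmean


def model(row):
    """Stand-in for a trained price model: uses mileage, ignores the colour code."""
    mileage_k, colour_code = row
    return 20000 - 75 * mileage_k


# Synthetic data: price depends on mileage plus noise, not on colour.
rng = random.Random(0)
X = [(rng.uniform(10, 200), rng.randint(0, 5)) for _ in range(200)]
y = [20000 - 75 * mileage_k + rng.gauss(0, 300) for mileage_k, _ in X]


def mae(rows, targets):
    """Mean absolute error of the model on the given rows."""
    return fmean(abs(model(r) - t) for r, t in zip(rows, targets))


def permutation_importance(rows, targets, column, seed=0):
    """Increase in error after shuffling the values of one feature column."""
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mae(permuted, targets) - mae(rows, targets)


feature_names = ["mileage_k", "colour_code"]
importance = {name: permutation_importance(X, y, i)
              for i, name in enumerate(feature_names)}
print(importance)
```

A feature the model leans on heavily causes a large jump in error when shuffled, while a feature it ignores causes none.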
The model worked nicely and I was able to find a reasonably priced Toyota Prius with it. It has two weaknesses though:
* The prices on the website are asking prices, not final selling prices.
* If the model shows a car is ‘cheap’, the car may have other problems (no maintenance? damage?).
That’s why I had the car checked at my own garage before making the purchase.