Help with training the Linear Regression Model

So I’m currently building a multiple linear regression model trained on a dataset scraped from a used car marketplace website.

The dataset has some duplicate entries, some price errors (for example, cars that would normally cost around 3-5k are listed at between 200k and 900k), and some errors in vehicle age (some entries are older than 120 years). I decided to filter all entries that don’t make sense out of the train dataset. When I evaluate that model on the unfiltered test dataset, I get a huge RMSE of around 170k (the baseline RMSE without any filtering is around 165k), but when I apply the same filtering to the test dataset too, the RMSE drops to 7.5k, which is a huge improvement.
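
For reference, here is a minimal sketch of the kind of filtering I mean, with one cleaning function applied to both splits. The field names, thresholds, and toy rows are made up for illustration, and a mean-price predictor stands in for the actual regression model, so treat it as an outline rather than my real pipeline.

```python
import math

# Hypothetical thresholds and field names; the real ones depend on the dataset.
MAX_PLAUSIBLE_PRICE = 100_000
MAX_PLAUSIBLE_AGE_YEARS = 50


def clean(rows):
    """Drop exact duplicates and rows with an implausible price or age."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate entry
        seen.add(key)
        price_ok = 0 < row["price"] <= MAX_PLAUSIBLE_PRICE
        age_ok = 0 <= row["age"] <= MAX_PLAUSIBLE_AGE_YEARS
        if price_ok and age_ok:
            kept.append(row)
    return kept


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data, invented for this example.
train = [
    {"price": 4000, "age": 10},
    {"price": 4000, "age": 10},     # duplicate
    {"price": 6000, "age": 5},
    {"price": 900_000, "age": 12},  # price error
]
test = [
    {"price": 5000, "age": 8},
    {"price": 450_000, "age": 9},   # price error
    {"price": 3000, "age": 125},    # age error
]

# The same rules are applied to both splits.
train_clean = clean(train)
test_clean = clean(test)

# Stand-in for the regression model: always predict the mean cleaned train price.
mean_price = sum(r["price"] for r in train_clean) / len(train_clean)


def predict(rows):
    return [mean_price for _ in rows]


rmse_raw = rmse([r["price"] for r in test], predict(test))
rmse_clean = rmse([r["price"] for r in test_clean], predict(test_clean))
print(f"RMSE on raw test set:      {rmse_raw:.1f}")
print(f"RMSE on filtered test set: {rmse_clean:.1f}")
```

In this toy case the two bad test rows dominate the raw RMSE, which looks like the same effect I see on the real data.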

So my questions are:
- Should I filter the test dataset using the exact same filtering rules as the train dataset?
- Does altering the test dataset compromise the model’s predictions?

submitted by /u/Global-Fly-8517 to r/learnmachinelearning