<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Mike Anderson on Medium]]></title>
        <description><![CDATA[Stories by Mike Anderson on Medium]]></description>
        <link>https://medium.com/@mikeanderson0289?source=rss-4696f3a065b------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*-077SGjaM5gcDaPL6yT8-A.jpeg</url>
            <title>Stories by Mike Anderson on Medium</title>
            <link>https://medium.com/@mikeanderson0289?source=rss-4696f3a065b------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 13 Apr 2026 11:22:48 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mikeanderson0289/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[The Preprocessing Survival Guide for New Data Scientists / ML Engineers]]></title>
            <link>https://medium.com/@mikeanderson0289/the-preprocessing-survival-guide-for-new-data-scientists-f0035fc5b319?source=rss-4696f3a065b------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0035fc5b319</guid>
            <category><![CDATA[data-science-careers]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Mike Anderson]]></dc:creator>
            <pubDate>Sun, 22 Jun 2025 22:23:21 GMT</pubDate>
            <atom:updated>2025-06-22T22:46:44.706Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DmSTVBSLwIcMj69lmAxUDg.png" /></figure><p>Over the years, I’ve seen many new grads and junior data scientists and machine learning engineers make some common mistakes that are easily avoided…if you know to look for them.</p><p>Most people are trained on toy datasets that are perfectly arranged, cleaned of all nulls, have identical string formatting, and play super nice for teaching purposes. The reality is that most data is messy, dirty, pulled from anywhere and everywhere, requires a lot of cleanup, and would make the most advanced model available predict hot garbage.</p><p>That being said, a lot of new data scientists (through no fault of their own) just aren’t prepped for the brutal reality of production-grade data. Sometimes the tools they’ve learned <em>do</em> apply — just not in the way they’ve seen them taught.</p><p>For this first lesson, I’m focusing on <strong>preprocessing</strong>. It all ties back to one central theme:</p><blockquote><strong>KNOW. YOUR. DATA.</strong></blockquote><h3>1. Filling Nulls Without Context</h3><p>You always hear about null imputation, but very few people stop to think about <em>what exactly that means. </em>I mean, the concept is fairly straightforward, right? You fill nulls with actual values to get rid of the nulls. Pretty simple.</p><p>However, I see a <em>ton</em> of junior (and some senior) data scientists and machine learning engineers assume that all nulls are made equal. They’re not. You have to look into <em>what you’re actually doing to the data </em>and <em>what the nulls mean</em>.</p><p>For instance, if you pull in data for sales by product by month and you have some nulls in there, your first instinct may be “oh, I should impute this with the mean”. NO. Your next thought may be “I should impute it with the mean per month for that product”. 
ALSO NO.</p><p>You should instead think about what the nulls mean in context. In this instance, it means that that specific product for that specific month had NO SALES. Which means those nulls are <em>actually </em>indicating a 0, not a null. The nulls arise because the database you’re querying or dataset you’re using doesn’t have <em>any</em> data for that specific product/month combination and will return a null, but in reality that null indicates a true, actual 0, which is far more important and <em>accurate</em> than an imputed mean. Therefore, you should impute nulls with 0 in this case, not with the mean or median.</p><p>Along the same lines, people often assume that you need to impute with the mean. This isn’t always the case. If your data distribution is super-skewed or has really long tails, the mean isn’t a good indicator of the typical value, and it’s far better to impute with the median.</p><p>My favorite example is house prices. Say you want the average house price in a certain area, and in that area are 10 houses that each cost $500,000. Your mean is $500,000 and your median is $500,000. Now add one singular outlier: a guy who builds a $10 million house in the same area. Because of that one house, your mean now jumps to a whopping $1,363,636, almost 3 times the typical price. Meanwhile, your median doesn’t flinch — it’s still $500,000.</p><p>This is a simple example, but blow that up to 750 houses and around 50 outliers, and you’ll see that the median is definitely the way to go.</p><p>While I’m on the topic, I’ll also go into imputation by category. You don’t want to fill in the mean/median/mode/etc. for the <em>entire dataset</em>. If you have specific categories to split on, fill in nulls using the averages for those categories. The goal of null imputation is to make an educated guess at the values that should be in the missing fields, and filling in categorically will get you closer to what should really be there.
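</p><p>To make both ideas concrete, here’s a minimal pandas sketch (the <code>product_id</code>, <code>month</code>, and <code>sales</code> columns are hypothetical toy data, not from any real dataset):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with missing values.
df = pd.DataFrame({
    "product_id": ["A", "A", "A", "B", "B", "B"],
    "month":      [1, 2, 3, 1, 2, 3],
    "sales":      [100.0, np.nan, 120.0, 10.0, 12.0, np.nan],
})

# If a null really means "no sales that month", 0 is the honest fill:
df["sales_zero"] = df["sales"].fillna(0)

# If a null means "missing measurement", impute per category
# (median per product), not with one global mean:
df["sales_median"] = df["sales"].fillna(
    df.groupby("product_id")["sales"].transform("median")
)
```

<p>Same nulls, two very different (and both defensible) fills; which one is right depends entirely on what the nulls mean.</p><p>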
A more robust method of null imputation is K-Means imputation, but I’ll save that for another post.</p><p>Bottom line: There’s no one-size-fits-all for nulls. Context is king.</p><h3>2. Dropping Duplicates</h3><p>Everyone probably knows this little block of code (just remember that it returns a new DataFrame, so assign the result):</p><pre>df = df.drop_duplicates()</pre><p>However, again, we need to think about what it’s <em>actually </em>doing. It’s looking at every row of the dataframe and comparing each entry to see if any two rows are identical. Pretty simple.</p><p>Now, what if you have 2 rows for the same product, same category, same price, etc., but you have <em>one singular space </em>at the end of the product description for one of them? Just like that, drop_duplicates acts like it doesn’t even see them — because technically, it doesn’t.</p><p>Instead, intuitively think about how you’re dropping duplicates. Try dropping duplicates by ID or by several columns like so:</p><pre>df = df.drop_duplicates(subset=[&#39;product_id&#39;])<br>df = df.drop_duplicates(subset=[&#39;product_id&#39;, &#39;product_category&#39;])</pre><p>This will save you a lot of headache, especially for text-heavy dataframes like product descriptions or social media posts.</p><h3>3. Forgetting to Scale or Normalize</h3><p>Please, please, please remember to scale/normalize your data. Some models don’t care if your data’s all over the place (looking at you, tree-based models). But others? They’ll have a full-blown meltdown if your features aren’t scaled properly.</p><p>Tree-based models (Random Forest, XGBoost, LightGBM, etc.) determine predictions based on node splits and thresholds, which work fine with unscaled data. Under the hood, a tree-based house price prediction model with features like square feet and number of stories will go:</p><p>“Is the square feet above or below 2500?
Great, now let’s check if the number of stories is more than 2.”</p><p>However, models like KNN, Neural Networks, Logistic Regression, SVM, etc. are sensitive to the scale of your features. Distance-based models are literally computing <em>distances between points</em>, and gradient-based models struggle to converge, so things can get really, really messy and inaccurate if you have some features WAY bigger or smaller than others. To correct for this, you use scaling with functions like StandardScaler or MinMaxScaler. Essentially, it puts all of your data into a uniform, confined space.</p><p>Now, when do you use MinMaxScaler vs. StandardScaler? Here’s a quick breakdown:</p><p><strong>MinMaxScaler</strong></p><ul><li>The data is already bounded between certain numbers — think percentages, which are bounded between 0 and 100 or 0 and 1.</li><li>You’re using distance-based models (KNN, K-Means, Neural Networks, etc.)</li><li>You want to preserve the shape of the data distribution</li></ul><p><strong>StandardScaler</strong></p><ul><li>Data is normally distributed (or at least centered around a mean)</li><li>You’re using models that work best with centered, unit-variance features, like Logistic Regression, Linear Regression, SVM, etc.</li><li>You want outliers to remain visible and not shoved into a predetermined range</li></ul><h3>4. One-Hot Encoding Explosion</h3><p>If you had some data for demographics all over the world and had a categorical column for “Country”, how many potential columns would you have after one-hot encoding that singular column?</p><p>That’s right: <strong>195</strong></p><p>Now, multiply that by every categorical column you have in your data and you’ll soon end up with hundreds, if not thousands, of columns of nothing but 1s and 0s. Not only does it make your model run slower, but model explainability would be an <em>absolute nightmare</em>.</p><p>Instead, try feature engineering. Get rid of the “Country” column and group things by statistical similarity using something like K-Means clustering, or try bucketing the countries into “Europe”, “Asia”, etc. or into population buckets.
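</p><p>A minimal sketch of that bucketing idea (the country-to-region mapping here is hypothetical and deliberately tiny):</p>

```python
import pandas as pd

# Hypothetical demographics data with a high-cardinality column.
df = pd.DataFrame({"country": ["France", "Germany", "Japan", "India", "Brazil"]})

# Collapse the high-cardinality country column into a few regions.
region_map = {
    "France": "Europe", "Germany": "Europe",
    "Japan": "Asia", "India": "Asia",
    "Brazil": "South America",
}
df["region"] = df["country"].map(region_map).fillna("Other")

# One-hot encoding the region column yields 3 columns here instead of 5;
# at world scale, a handful of regions instead of ~195 countries.
encoded = pd.get_dummies(df["region"], prefix="region")
```

<p>The mapping itself is a judgment call, but the payoff is a model with a handful of interpretable region columns rather than hundreds of sparse country columns.</p><p>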
Do whatever you can to preserve accuracy while keeping the number of columns under control.</p><p>Sometimes, the one-hot encoding explosion is unavoidable, but in those instances you would use a feature decomposition algorithm like PCA after the fact, which I’ll go into in another post (and no, PCA isn’t just for reducing dimensions — it can also help your sanity).</p><h3>5. Removing Outliers Before Analysis and Imputation</h3><p>Say you have a set of 10 balls: 6 red, 2 yellow, and 2 green, with the yellow balls being super heavy. Now, say you remove outliers based on weight and all of a sudden your yellow balls are out of the picture. What would your data look like?</p><p>First, your model would assume that the only categories that balls could <em>ever</em> be are red and green. Next, it would assume that the average weight is somewhere between the weight of the red balls and the green balls. Your model would be based entirely upon these assumptions and wouldn’t reflect <em>reality</em> in the slightest.</p><p>When you’re removing outliers, you need to carefully examine them to determine whether they’re truly outliers or indicators of a deeper meaning. In this instance, the higher weights indicate a specific kind of ball, to the point that if a model sees that weight, it’ll automatically go “AHA! YELLOW!” You don’t want to remove that, as it’s incredibly useful information.</p><p>This goes back to my main mantra: <strong>Know your data</strong>.</p><h3>Conclusion</h3><p>Preprocessing might not be the most awesome or hi-tech part of data science, but it’s where most of the modeling magic (or train wrecks) begins. If you skip over it or do it blindly, you’re basically building a castle on quicksand.</p><p>So, next time you’re wrangling data, slow down. Ask yourself what the data <em>means</em>, not just what shape it’s in.
There’s a reason seasoned data scientists spend 90% of their time on preprocessing and data collection, and maybe 5% on actual model development.</p>]]></content:encoded>
        </item>
    </channel>
</rss>