
Jose Jorge

Want to know a new, simple, useful, and unpopular machine learning technique? 🤓
This is a thread 🧵 on basic and practical Density Estimation 👇

Look at a histogram of some data: in this example, it tells us that there is a lot of data in the range (90, 110) but not many examples below 60
We can assume that the data comes from some random variable… The histogram is an approximation of the density function of that variable
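
To make that concrete, here is a minimal sketch with NumPy. The data is made up (a normal distribution with mean 100 and standard deviation 15, picked only so the numbers resemble the example above); it is not the data behind the original histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 1,000 draws from a normal distribution (assumed, for illustration)
data = rng.normal(loc=100, scale=15, size=1000)

# density=True rescales the bar heights so that the total area is 1,
# which is what makes the histogram a (rough) density estimate
heights, edges = np.histogram(data, bins=20, density=True)
total_area = float(np.sum(heights * np.diff(edges)))

# Fraction of the samples in (90, 110) versus below 60
frac_mid = float(np.mean((data > 90) & (data < 110)))
frac_low = float(np.mean(data < 60))

print(f"area under the histogram: {total_area:.3f}")
print(f"fraction in (90, 110): {frac_mid:.3f}, fraction below 60: {frac_low:.3f}")
```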

So, a density function tells us what will happen if we sample a random variable
What values will appear more frequently and what values will appear rarely, or never
So, what’s the deal with these functions and why do I need to estimate them?

Imagine you have two populations of organisms…
And you want to tell what population a given organism belongs to just by looking at some characteristics like size, color, lifetime, etc
Let’s do it simply and intuitively… Using density estimation 👇

If we have samples of each population, we can approximate the density function of the data for each of those populations
Then, when classifying a new organism, we check which density function is larger when evaluated at the new data point
That’s it!
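
Here is a rough sketch of that recipe, using plain histograms as the density estimates and a single feature (size). The two populations, the bin edges, and the `classify` helper are all invented for illustration, and the comparison quietly assumes both populations are equally common.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples of one feature (size) for two populations
size_a = rng.normal(loc=10.0, scale=1.0, size=500)
size_b = rng.normal(loc=15.0, scale=1.5, size=500)

# Shared bin edges so both density estimates live on the same grid
edges = np.linspace(5.0, 22.0, 35)
dens_a, _ = np.histogram(size_a, bins=edges, density=True)
dens_b, _ = np.histogram(size_b, bins=edges, density=True)

def classify(x):
    """Return the population whose estimated density is larger at x."""
    # Index of the bin that contains x, clipped to the valid range
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(dens_a) - 1)
    return "A" if dens_a[i] > dens_b[i] else "B"

print(classify(9.7), classify(16.2))
```

Ties (for example, both densities equal to zero in an empty bin) fall to "B" here, which is an arbitrary choice of this sketch.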

Ok, I know! How can I estimate the freaking function?!
Well, you could make a histogram! That’s the simplest density estimation technique ever…
But of course, it has several issues, because a histogram is very sensitive to the choice of bin width and bin edges
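
A quick way to see that sensitivity: estimate the density at the same point with different numbers of bins. This is only a sketch on made-up data, and `hist_density_at` is a helper written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data, same kind of example as before (assumed, for illustration)
data = rng.normal(loc=100, scale=15, size=1000)

def hist_density_at(x, bins):
    """Histogram-based density estimate at x for a given number of bins."""
    heights, edges = np.histogram(data, bins=bins, density=True)
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(heights) - 1)
    return float(heights[i])

# Same data, same point, different binning
estimates = {bins: hist_density_at(100.0, bins) for bins in (5, 20, 200)}
for bins, value in estimates.items():
    print(f"{bins:>3} bins -> estimated density at x=100: {value:.4f}")
```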
Then what?!

Well, there are other, more sophisticated techniques
I’m not going to go too deep into those, but I’ll mention just one
Kernel Density Estimation (KDE)
I won’t tell you what it is about, but I’ll tell you how to use it instead! 👇

Python practitioners…
Just use the sklearn.neighbors.KernelDensity class
Its two main parameters are:
kernel: the type of kernel to use, Gaussian by default
bandwidth: how much smoothing to apply; you can think of it as the size of the bins in a histogram
And what’s the next step? 👇

You can fit the model with a set of data, which must always be numerical (a 2D array with one row per sample)
Then you can use the score_samples method to get the log of the estimated density at each new input
There you go!
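
Put together, the fit-then-score steps could look like this. The training data and the bandwidth value are assumptions made for the example; in practice you would tune the bandwidth for your own data.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical training data with shape (n_samples, n_features) = (500, 1)
X = rng.normal(loc=100, scale=15, size=(500, 1))

# bandwidth=5.0 is an arbitrary choice for this sketch
kde = KernelDensity(kernel="gaussian", bandwidth=5.0)
kde.fit(X)

# score_samples returns the log of the estimated density at each row
new_points = np.array([[100.0], [40.0]])
log_density = kde.score_samples(new_points)
density = np.exp(log_density)  # back to the density scale

print(log_density)
print(density)
```

Note that these are density values, not probabilities: they can be larger than 1 when the data is tightly concentrated.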

You have your new, sophisticated, density function estimation!
But, beware!
Density estimation is not exempt from common ML problems
It also suffers from the curse of dimensionality
Having too small a bandwidth will result in overfitting
Having too big a bandwidth… Guess what
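
One simple way to check the bandwidth, sketched below under made-up data: fit on one sample, then compare the average log-density on held-out data. A tiny bandwidth memorizes the training points, a huge one smooths everything flat, and both score worse than a moderate value. The three candidate bandwidths are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical data: train and held-out samples from the same distribution
train = rng.normal(size=(300, 1))
held_out = rng.normal(size=(300, 1))

scores = {}
for bw in (0.01, 0.4, 10.0):  # too small, moderate, too big (assumed values)
    kde = KernelDensity(kernel="gaussian", bandwidth=bw).fit(train)
    # Mean log-density on data the model has not seen; higher is better
    scores[bw] = float(kde.score_samples(held_out).mean())

best_bw = max(scores, key=scores.get)
for bw, score in scores.items():
    print(f"bandwidth={bw:>5}: mean held-out log-density = {score:.2f}")
print("best of the three:", best_bw)
```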

And that’s it!
Hope you have enjoyed the thread and have learned/remembered something useful
If so, consider following me cause I like to post content like this from time to time
Stay tuned!