Density Estimation in Machine Learning: Here is a Primer for you

Jose Jorge
Want to know a new, simple, useful, and unpopular machine learning technique? 🤓 This is a thread 🧵 on basic and practical Density Estimation 👇
Take a histogram of some data. It tells us that there is a lot of data in the range (90, 110), but not many examples below 60. We can assume that the data comes from some random variable… The histogram is an approximation of the density function of that variable.
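To make the histogram idea concrete, here is a minimal sketch with NumPy. The data is synthetic and made up for illustration (1,000 draws centered at 100), not the dataset from the thread. It shows that a histogram normalized with density=True has total bar area 1, which is what lets us read it as a rough density estimate.

```python
import numpy as np

# Synthetic data, assumed for illustration only: 1,000 draws centered at 100.
rng = np.random.default_rng(0)
samples = rng.normal(loc=100, scale=15, size=1000)

# density=True rescales the bar heights so that the total bar area is 1,
# which makes the histogram a piecewise-constant density estimate.
heights, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)
total_area = float(np.sum(heights * widths))

# Share of samples falling in (90, 110) versus below 60.
share_mid = float(np.mean((samples > 90) & (samples < 110)))
share_low = float(np.mean(samples < 60))

print(f"area={total_area:.3f} mid={share_mid:.2f} low={share_low:.3f}")
```

The two shares are the "a lot of data here, very little there" reading of the histogram, expressed as numbers.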
So, a density function tells us what will happen if we sample a random variable: which values will appear more frequently and which will appear rarely, or never. So, what’s the deal with these functions, and why do I need to estimate them?
Imagine you have two populations of organisms… And you want to tell which population a given organism belongs to just by looking at some characteristics like size, color, lifetime, etc. Let’s do it simply and intuitively… Using density estimation 👇
If we have samples of each population, we can approximate the density function of the data for each of those populations. Then, when classifying a new organism, we check which density function is larger when evaluated at the new data point. That’s it!
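Here is one way that comparison could look, as a sketch only. It uses a single feature (body size), made-up numbers for both populations, and plain normalized histograms as the density estimates. The names sizes_a, sizes_b, and classify are hypothetical, and picking the larger density silently assumes both populations are equally common.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical body sizes for two populations (made-up numbers).
sizes_a = rng.normal(loc=5.0, scale=1.0, size=500)
sizes_b = rng.normal(loc=8.0, scale=1.0, size=500)

# Shared bin edges so both histograms are evaluated on the same grid.
edges = np.linspace(0.0, 13.0, 27)
density_a, _ = np.histogram(sizes_a, bins=edges, density=True)
density_b, _ = np.histogram(sizes_b, bins=edges, density=True)

def classify(size):
    """Return 'A' or 'B', whichever estimated density is larger at `size`."""
    # Locate the bin containing `size`; clip values that fall outside the grid.
    idx = int(np.clip(np.searchsorted(edges, size, side="right") - 1,
                      0, len(density_a) - 1))
    return "A" if density_a[idx] >= density_b[idx] else "B"

print(classify(4.8), classify(8.3))
```

With more than one characteristic, the same idea applies, but the density estimate has to be multivariate.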
Ok, I know! How can I estimate the freaking function?! Well, you could make a histogram! That’s the simplest density estimation technique ever… But of course, it has several issues, because a histogram is very sensitive to the choice of bins (their width and where they start). Then what?!
Well, there are other, more sophisticated techniques. I’m not going to get too deep into those, but I’ll mention just one: Kernel Density Estimation (KDE). I won’t tell you what it is about, but I’ll tell you how to use it instead! 👇
Python practitioners… Just use the sklearn.neighbors.KernelDensity class. It has two main parameters. kernel: the type of kernel to use, Gaussian by default. bandwidth: you can think of it as the size of the bins in a histogram. And what is the next step? 👇
You can fit the model with a set of data, which should always be numerical. Then you can use the score_samples method to get the log of the estimated density at a new input. There you go!
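The fit-then-score steps could look roughly like this. It is a sketch on synthetic data: the bandwidth of 5.0 and the two query points are arbitrary choices made for illustration, not recommendations.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic 1-D data (assumed for illustration).
# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = rng.normal(loc=100, scale=15, size=(500, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=5.0)
kde.fit(X)

# score_samples returns the log of the estimated density at each point.
new_points = np.array([[100.0], [40.0]])
log_density = kde.score_samples(new_points)

# Exponentiate to get back to the density scale.
density = np.exp(log_density)
print(density)
```

Working with the log values directly is usually safer when densities get very small, which is why the method returns logs in the first place.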
You have your new, sophisticated density function estimate! But beware! Density estimation is not exempt from common ML problems. It also suffers from the curse of dimensionality. Having too small a bandwidth will result in overfitting. Having too big a bandwidth… Guess what.
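One hedged way to see the bandwidth trade-off is to compare the average log-likelihood on the training data with the same quantity on held-out data. The sketch below uses synthetic data and three arbitrary bandwidths (0.1, 5.0, 100.0) chosen only to exaggerate the effect.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Two independent synthetic samples from the same distribution (assumed).
train = rng.normal(loc=100, scale=15, size=(200, 1))
held_out = rng.normal(loc=100, scale=15, size=(200, 1))

results = {}
for bandwidth in (0.1, 5.0, 100.0):
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(train)
    # score() returns the total log-likelihood, so divide by the sample size.
    results[bandwidth] = (kde.score(train) / len(train),
                          kde.score(held_out) / len(held_out))

for bandwidth, (train_ll, test_ll) in results.items():
    print(f"bandwidth={bandwidth:>5}: train={train_ll:.2f} held-out={test_ll:.2f}")
```

A tiny bandwidth looks great on the training data and poor on new data, while a huge one is mediocre on both. In practice the bandwidth is often picked by cross-validation on the held-out log-likelihood.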
And that’s it! I hope you have enjoyed the thread and have learned or remembered something useful. If so, consider following me, because I like to post content like this from time to time. Stay tuned!
