Statistics, Machine Learning, Causal Inference, …

People coming from these different modeling cultures often have misunderstandings, and fight over whether logistic regression is “AI” or statistics.

Why the big differences?

1/n 🧵

2/n The tools and methods overlap quite a bit. Linear regression models are used in statistics, ML, and causal inference. ML draws heavily from statistical models. So the misunderstandings and differences can’t be explained solely by the tools that are used.

But what is it then?

3/n The difference is subtle, but fundamental.

Each modeling culture has its own definition of what constitutes a good model.

Let’s have a look at these different “generalization mindsets”.

5/n Model fitting is usually done by maximizing the likelihood of the probability model, and evaluation is also based on the likelihood (AIC, BIC, Bayes factor, likelihood ratio, …). A good statistician puts a lot of thought about the data-generating process into the model.
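To make the likelihood mindset concrete, here’s a minimal sketch (the toy data and the Gaussian model are my own assumptions, not from the thread): fit by maximum likelihood, then score with AIC = 2k − 2·log-likelihood.

```python
import math

# Toy data; we assume a Normal(mu, sigma^2) probability model.
data = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2]

# MLE for a Gaussian: sample mean and (biased) sample variance.
mu = sum(data) / len(data)
var = sum((x - mu) ** 2 for x in data) / len(data)

# Log-likelihood of the data under the fitted model.
log_lik = sum(
    -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    for x in data
)

k = 2  # number of fitted parameters (mu, sigma^2)
aic = 2 * k - 2 * log_lik  # lower AIC = better model, penalized for complexity
print(f"mu={mu:.3f}  AIC={aic:.3f}")
```

Note the evaluation here is in-sample: the same data that fit the model also scores it, with a complexity penalty standing in for a test set.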

6/n A good statistical model is one that reflects the data-generating process well (qualitatively + quantitatively) and has a good fit based on the distributional assumptions (quantitatively).

If the model captures the data-generating process, it generalizes well.

8/n ML therefore has a very simple generalization mindset: a model is a good model if it performs well on test data. That’s easily quantifiable, which makes evaluation largely automatable.
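A minimal sketch of that mindset (the threshold classifier and the simulated data are hypothetical): the verdict on the model is a single number computed on held-out data that fitting never touched.

```python
import random

random.seed(0)
# Simulated data: label is 1 when x > 0.5, with ~10% label noise.
data = [(x, int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5))
        for x in (random.random() for _ in range(200))]

# Split once; the test set is never used during fitting.
train, test = data[:150], data[150:]

def accuracy(t, rows):
    """Fraction of rows the threshold classifier (x > t) gets right."""
    return sum((x > t) == bool(y) for x, y in rows) / len(rows)

# "Fit": pick the threshold that maximizes accuracy on the training set.
best_t = max((i / 100 for i in range(100)), key=lambda t: accuracy(t, train))

# The ML verdict on the model is just this one number.
print(f"test accuracy: {accuracy(best_t, test):.2f}")
```

No distributional assumptions appear anywhere; only the held-out score decides whether the model is "good".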

9/n In causal modeling, a good model is one that captures causal structures well. If you got the assumptions about causal structures right, the model should be robust, explain effects well and generalize to similar situations.
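A toy simulation of why the causal structure matters (all numbers here are assumed for illustration): a confounder z drives both treatment t and outcome y, so the naive contrast is biased, while adjusting for z recovers the true effect.

```python
import random

random.seed(1)
# True causal effect of t on y is 1.0; the confounder z adds 2.0 to y
# and also makes treatment more likely.
rows = []
for _ in range(10000):
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)
    y = 1.0 * t + 2.0 * z + random.gauss(0, 0.1)
    rows.append((z, t, y))

def mean_y(sel):
    ys = [y for z, t, y in rows if sel(z, t)]
    return sum(ys) / len(ys)

# Naive estimate ignores z and is biased upward.
naive = mean_y(lambda z, t: t) - mean_y(lambda z, t: not t)

# Adjusted estimate: within-stratum contrasts, weighted by P(z) = 0.5.
adjusted = sum(
    0.5 * (mean_y(lambda z, t, zv=zv: z == zv and t)
           - mean_y(lambda z, t, zv=zv: z == zv and not t))
    for zv in (True, False)
)
print(f"naive {naive:.2f}  adjusted {adjusted:.2f}")
```

Both estimators see the same data; only the assumed causal graph tells you that z must be adjusted for.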

10/n All generalization mindsets have their flaws. Performance on test data can produce non-robust models that are vulnerable to adversarial attacks and rely on non-causal associations (Clever Hans predictors).

11/n Evaluating statistical models in-sample and against the distributional assumptions is possibly biased and relies on many assumptions of the modeler. Similarly, causal inference makes simplifying assumptions about causal structures, whereas the world might be more complex.

12/n I was educated with the statistical modeling mindset. When I failed with that approach in my first Kaggle competition, I “converted” to the performance-on-test-data mindset. Later I learned a bit about causal inference.

Today I believe the best approach is to be adaptive.

13/n Once you accept that there is not one mindset that is better than all the others, you become a better modeler.

You can think about the data-generating process, encode causal assumptions and measure test performance.

14/n You can decide against a better-performing model based on considerations of a probability model. But why not also include a benchmark on test data, which puts the model’s performance in context?

You should also learn about causal inference, so you know which variables to include and which to exclude.

15/n If you unquestioningly follow one generalization mindset, you are seriously missing out. I know that many people have a mix of views, but I also know many that are very stubborn in how data modeling should work.

Don’t let your education hold back your modeling skills.