Causal Inference — Smoking and Sleep Quality

Causal effect estimation of smoking on sleep quality across 413,768 observations using Double Machine Learning.

Project overview

This project estimates the causal effect of current smoking on the probability of poor sleep quality from observational data. It applies three complementary methods (naive difference, OLS, DML). The two adjusted estimates agree on an effect of approximately 13 percentage points (OLS: 0.132, DML: 0.134).

Current state

Full analysis completed on 413,768 observations.
Three methods applied and compared: naive, OLS, Double ML.
Causal effect estimated at about 13 percentage points; the OLS and DML estimates agree (0.132 vs 0.134).
Final report produced with rigorous interpretation.

Tech stack

Python, Scikit-learn, Statsmodels, Pandas, Jupyter

Tags & Code

Causal Inference, Machine Learning, Python, Statistics

Private code (academic project)

Vision

Estimate a real causal effect — not a mere association — between smoking and sleep.
Apply rigorous causal methodology to large-scale observational data.
Demonstrate the value of Double Machine Learning over classical approaches.

Architecture

Data: 413,768 observations, 16 variables — cleaning and filtering (current smokers vs non-smokers).
Causal modeling: DAG, backdoor criterion, confounders (age, alcohol consumption).
Method 1, naive difference in means: 0.099 (reported as the ATT, but biased because no confounder is adjusted for).
Method 2, OLS with HC3 robust standard errors: θ = 0.132 (linear adjustment).
Method 3, Double ML (Random Forest, 5-fold cross-fitting): θ = 0.134 (flexible, non-linear adjustment).
Sensitivity analysis: re-estimation with confounders removed, to check how strongly the result depends on the adjustment set implied by the DAG.
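To illustrate Methods 1 and 2 and the confounder-removal check, here is a minimal sketch on synthetic data. The column names (`smoker`, `poor_sleep`, `age`, `alcohol`), the data-generating process, and the helper `fit_theta` are assumptions made for this example. The project's code is private, so this is not its actual implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data (all coefficients are illustrative).
rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(18, 80, n)
alcohol = rng.normal(0, 1, n)          # standardized alcohol consumption
a = (age - 49) / 18                    # roughly standardized age

# Treatment depends on both confounders.
p_smoke = 1 / (1 + np.exp(1 + 0.8 * a - 0.5 * alcohol))
smoker = rng.binomial(1, p_smoke)

# Outcome: linear probability model with a true smoking effect of 0.13.
p_sleep = np.clip(0.30 + 0.13 * smoker + 0.10 * a + 0.03 * alcohol, 0, 1)
poor_sleep = rng.binomial(1, p_sleep)

df = pd.DataFrame(
    {"smoker": smoker, "poor_sleep": poor_sleep, "age": age, "alcohol": alcohol}
)

# Method 1: naive difference in means, no adjustment.
group_means = df.groupby("smoker")["poor_sleep"].mean()
naive = group_means.loc[1] - group_means.loc[0]


def fit_theta(formula):
    """Fit OLS with HC3 robust standard errors; return (theta, se) for smoker."""
    result = smf.ols(formula, data=df).fit(cov_type="HC3")
    return result.params["smoker"], result.bse["smoker"]


# Method 2: OLS adjusted for both confounders.
theta_ols, se_ols = fit_theta("poor_sleep ~ smoker + age + alcohol")

# Sensitivity check: drop one confounder at a time and re-estimate.
sensitivity = {
    "age": fit_theta("poor_sleep ~ smoker + alcohol")[0],
    "alcohol": fit_theta("poor_sleep ~ smoker + age")[0],
}

print(f"naive difference: {naive:.3f}")
print(f"OLS theta: {theta_ols:.3f} (HC3 se {se_ols:.3f})")
for dropped, theta in sensitivity.items():
    print(f"theta without {dropped}: {theta:.3f}")
```

Comparing `naive` with `theta_ols` shows how much the adjustment moves the estimate. Comparing `theta_ols` with the entries of `sensitivity` shows how much each confounder contributes to that shift.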

Roadmap

Phase 1: exploration, cleaning, and construction of treatment and outcome variables.
Phase 2: causal modeling with DAG and backdoor path identification.
Phase 3: naive estimation, adjusted OLS, and Double Machine Learning.
Phase 4: sensitivity analysis, method comparison, and report writing.
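The backdoor path identification in Phase 2 can be sketched in plain Python. The edge list below is an assumed DAG built from the two confounders named on this page (age and alcohol consumption); the project's real DAG may contain more nodes. The functions `backdoor_paths` and `is_blocked` are hypothetical helpers, and the blocking check is simplified: it ignores descendants of colliders.

```python
# Assumed DAG as (parent, child) edges; illustrative only.
EDGES = [
    ("age", "smoking"),
    ("age", "poor_sleep"),
    ("alcohol", "smoking"),
    ("alcohol", "poor_sleep"),
    ("smoking", "poor_sleep"),
]


def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome whose first edge points INTO treatment."""
    edge_set = set(edges)
    neighbors = {}
    for parent, child in edges:
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    paths = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(node, ())):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if len(path) == 1 and (nxt, node) not in edge_set:
                continue  # a backdoor path must start with an arrow into the treatment
            walk(path + [nxt])

    walk([treatment])
    return paths


def is_blocked(path, edges, adjustment_set):
    """Simplified blocking check (descendants of colliders are not considered)."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edge_set and (nxt, node) in edge_set
        if collider and node not in adjustment_set:
            return True   # an unconditioned collider blocks the path
        if not collider and node in adjustment_set:
            return True   # a conditioned non-collider blocks the path
    return False


paths = backdoor_paths(EDGES, "smoking", "poor_sleep")
for path in paths:
    blocked = is_blocked(path, EDGES, {"age", "alcohol"})
    print(" - ".join(path), "| blocked by {age, alcohol}:", blocked)
```

Under this assumed DAG, adjusting for age alone leaves the path through alcohol open, which is why both variables belong in the adjustment set.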

Engineering decisions

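Double ML to relax the linearity assumption of OLS while keeping valid inference on θ (standard errors and confidence intervals) despite machine-learned nuisance models.
Random Forest in DML to model non-linear relationships between the confounders and both the treatment and the outcome.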
5-fold cross-fitting to avoid overfitting bias in nuisance estimation.
Systematic sensitivity analysis to validate the robustness of conclusions.
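The decisions above can be illustrated with a hand-rolled partialling-out estimator for the partially linear model. This is a sketch under stated assumptions, not the project's private code: the function name `dml_partial_linear`, the forest hyperparameters, the synthetic data, and the use of a regression forest for the binary treatment are all choices made for this example. Whether the project used this exact DML variant is not stated on this page.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def dml_partial_linear(y, d, X, n_splits=5, seed=0):
    """Partialling-out DML estimate of theta in y = theta * d + g(X) + noise."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    def make_forest():
        # Hyperparameters are illustrative, not tuned.
        return RandomForestRegressor(
            n_estimators=100, min_samples_leaf=20, random_state=seed
        )

    # Cross-fitting: each prediction comes from a model that never saw that row.
    y_hat = cross_val_predict(make_forest(), X, y, cv=folds)  # estimates E[y | X]
    d_hat = cross_val_predict(make_forest(), X, d, cv=folds)  # estimates E[d | X]

    y_res = y - y_hat
    d_res = d - d_hat

    # Regress the outcome residuals on the treatment residuals.
    theta = (d_res @ y_res) / (d_res @ d_res)

    # Standard error from the orthogonal score of the partialling-out estimator.
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi**2) / np.mean(d_res**2) ** 2 / len(y))
    return theta, se


# Synthetic data with a non-linear confounder effect and a true effect of 0.13.
rng = np.random.default_rng(0)
n = 3000
a = (rng.uniform(18, 80, n) - 49) / 18     # roughly standardized age
alcohol = rng.normal(0, 1, n)
X = np.column_stack([a, alcohol])

p_smoke = 1 / (1 + np.exp(1 - 0.5 * alcohol + 0.5 * a))
d = rng.binomial(1, p_smoke).astype(float)

p_sleep = np.clip(0.25 + 0.13 * d + 0.04 * alcohol + 0.04 * a**2, 0, 1)
y = rng.binomial(1, p_sleep).astype(float)

theta_dml, se_dml = dml_partial_linear(y, d, X)
ci_low, ci_high = theta_dml - 1.96 * se_dml, theta_dml + 1.96 * se_dml
print(f"DML theta: {theta_dml:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Using the same `KFold` object for both nuisance models keeps the fold assignment identical for the outcome and treatment predictions. A classifier with `predict_proba` would be a natural alternative for the treatment model.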

Possible improvements

Test additional causal methods (difference-in-differences, instrumental variables).
Enrich the analysis with additional confounders (stress, physical activity).
Study causal effect heterogeneity across subgroups.

Lessons learned

OLS/DML convergence is a strong signal of result robustness.
An explicit DAG forces you to formalize and justify every causal assumption.
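Double Machine Learning remains the methodologically stronger choice even when relationships are close to linear: it does not depend on a linear functional form, and here it agrees with OLS (0.134 vs 0.132).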
