Completed · 2025 · Full causal analysis: modeling, estimation, interpretation
Causal Inference — Smoking and Sleep Quality
Causal effect estimation of smoking on sleep quality across 413,768 observations using Double Machine Learning.

Project overview
This project estimates the causal effect of current smoking on the probability of poor sleep quality from observational data. It applies three complementary methods (naive difference, OLS, DML) and concludes that current smoking raises the probability of poor sleep by approximately 13 percentage points, an estimate on which the OLS and DML results converge.
Current state
- Full analysis completed on 413,768 observations.
- Three methods applied and compared: naive, OLS, Double ML.
- Causal effect estimated at ~13 percentage points (OLS 0.132, DML 0.134), consistent across the two adjusted methods.
- Final report produced with rigorous interpretation.
Tech stack
Python, Scikit-learn, Statsmodels, Pandas, Jupyter
Tags & Code
Causal Inference, Machine Learning, Python, Statistics
Private code (academic project)
Vision
- Estimate a real causal effect — not a mere association — between smoking and sleep.
- Apply rigorous causal methodology to large-scale observational data.
- Demonstrate the value of Double Machine Learning over classical approaches.
Architecture
- Data: 413,768 observations, 16 variables — cleaning and filtering (current smokers vs non-smokers).
- Causal modeling: DAG, backdoor criterion, confounders (age, alcohol consumption).
- Method 1 (naive difference in means): 0.099, reported as ATT; biased because no confounder is adjusted for.
- Method 2 (OLS with HC3 robust standard errors): θ = 0.132, with linear adjustment for the confounders.
- Method 3 (Double ML with Random Forest nuisance models and 5-fold cross-fitting): θ = 0.134, allowing non-linear confounder effects.
- Sensitivity analysis: re-estimation with individual confounders removed, to check how strongly the estimate depends on the DAG's adjustment set.
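The first two methods and the sensitivity check can be sketched on synthetic data. The example below is illustrative only: the data-generating process, the coefficients, the variable names, and the `ols_hc3` helper are assumptions, not the project's private code or its dataset. It re-implements OLS with HC3 standard errors in NumPy to stay self-contained; with Statsmodels the equivalent would be `sm.OLS(y, X).fit(cov_type="HC3")`.

```python
import numpy as np


def ols_hc3(y, *columns):
    """OLS of y on an intercept plus the given columns, with HC3 standard errors."""
    X = np.column_stack([np.ones(len(y)), *columns])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_i = x_i' (X'X)^-1 x_i; HC3 inflates each squared residual by 1 / (1 - h_i)^2.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    weights = (resid / (1.0 - leverage)) ** 2
    cov = xtx_inv @ (X.T * weights) @ X @ xtx_inv
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(0)
n = 50_000

# Synthetic, standardized stand-ins for the two confounders (assumption).
age = rng.normal(0.0, 1.0, n)
alcohol = rng.normal(0.0, 1.0, n)

# Treatment: in this simulation, older people smoke less and drinkers smoke more.
p_smoke = 1.0 / (1.0 + np.exp(-(-1.0 - 0.8 * age + 0.3 * alcohol)))
smoker = rng.binomial(1, p_smoke).astype(float)

# Outcome: poor sleep with a known effect of smoking on the probability scale.
TRUE_EFFECT = 0.13
p_poor = np.clip(0.40 + TRUE_EFFECT * smoker + 0.08 * age + 0.02 * alcohol, 0.0, 1.0)
y = rng.binomial(1, p_poor).astype(float)

# Method 1: naive difference in means, no adjustment.
naive = y[smoker == 1].mean() - y[smoker == 0].mean()

# Method 2: OLS adjusted for both confounders; index 1 is the smoking coefficient.
beta_adj, se_adj = ols_hc3(y, smoker, age, alcohol)

# Sensitivity check: drop one confounder and see how far the estimate moves.
beta_no_age, _ = ols_hc3(y, smoker, alcohol)

print(f"naive difference:      {naive:.3f}")
print(f"OLS, age + alcohol:    {beta_adj[1]:.3f} (HC3 SE {se_adj[1]:.4f})")
print(f"OLS, alcohol only:     {beta_no_age[1]:.3f}")
```

In this simulated setup, age lowers the chance of smoking but raises the chance of poor sleep, so the unadjusted difference understates the true effect, and dropping age from the regression brings most of that bias back. This matches the direction of the gap between 0.099 and 0.132 reported above, though the real confounding structure may differ.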
Roadmap
- Phase 1: exploration, cleaning, and construction of treatment and outcome variables.
- Phase 2: causal modeling with DAG and backdoor path identification.
- Phase 3: naive estimation, adjusted OLS, and Double Machine Learning.
- Phase 4: sensitivity analysis, method comparison, and report writing.
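Phase 2 can be made concrete with a small sketch that encodes a DAG, lists the backdoor paths from treatment to outcome, and checks whether an adjustment set blocks them. The edge list below is an assumed, simplified graph built from the two confounders named in this project (age, alcohol consumption); the actual DAG is not published here, and the function names are hypothetical.

```python
# Hypothetical DAG: each pair (a, b) is a directed edge a -> b.
EDGES = [
    ("age", "smoking"),
    ("age", "poor_sleep"),
    ("alcohol", "smoking"),
    ("alcohol", "poor_sleep"),
    ("smoking", "poor_sleep"),
]


def parents(node):
    return {a for a, b in EDGES if b == node}


def children(node):
    return {b for a, b in EDGES if a == node}


def descendants(node):
    seen, stack = set(), [node]
    while stack:
        for child in children(stack.pop()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def backdoor_paths(treatment, outcome):
    """Simple paths from treatment to outcome whose first edge points into treatment."""
    found = []

    def walk(path):
        last = path[-1]
        if last == outcome:
            found.append(path)
            return
        # Follow edges in either direction, never revisiting a node.
        for nxt in sorted(parents(last) | children(last)):
            if nxt not in path:
                walk(path + [nxt])

    for parent in sorted(parents(treatment)):
        walk([treatment, parent])
    return found


def is_blocked(path, adjustment):
    """Apply the d-separation rule to a single path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = prev in parents(node) and nxt in parents(node)
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is adjusted for.
            if node not in adjustment and not (descendants(node) & adjustment):
                return True
        elif node in adjustment:
            # A chain or fork node blocks when it is adjusted for.
            return True
    return False


def satisfies_backdoor(treatment, outcome, adjustment):
    adjustment = set(adjustment)
    # The adjustment set must not contain descendants of the treatment.
    if adjustment & descendants(treatment):
        return False
    return all(is_blocked(p, adjustment) for p in backdoor_paths(treatment, outcome))


for path in backdoor_paths("smoking", "poor_sleep"):
    print(" - ".join(path))
print(satisfies_backdoor("smoking", "poor_sleep", {"age", "alcohol"}))
print(satisfies_backdoor("smoking", "poor_sleep", {"age"}))
```

Under this assumed graph, both confounders must be in the adjustment set; adjusting for age alone leaves the path through alcohol open.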
Engineering decisions
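- Double ML to relax the linearity assumption of OLS while keeping valid inference: orthogonal scores combined with cross-fitting give a √n-consistent, asymptotically normal estimate of θ, provided the nuisance models converge fast enough.
- Random Forest in DML to model non-linear relationships between the confounders and both the treatment (smoking) and the outcome (poor sleep).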
- 5-fold cross-fitting to avoid overfitting bias in nuisance estimation.
- Systematic sensitivity analysis to validate the robustness of conclusions.
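The decisions above can be illustrated with a minimal partialling-out sketch of Double ML on synthetic data. This is not the project's code: the data-generating process, the hyperparameters, and the `cross_fit_residuals` helper are assumptions chosen for illustration, and it assumes a partially linear model in which the effect θ is constant. Each nuisance function, E[Y | X] and E[D | X], is fit with a Random Forest on four folds and predicted on the held-out fold, so no observation is residualized by a model that saw it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 6_000

# Synthetic confounders (assumption); the confounding here is deliberately non-linear.
age = rng.uniform(-2.0, 2.0, n)
alcohol = rng.uniform(-2.0, 2.0, n)
X = np.column_stack([age, alcohol])

p_smoke = 1.0 / (1.0 + np.exp(-(0.5 - 0.7 * age**2 + 0.5 * alcohol)))
d = rng.binomial(1, p_smoke).astype(float)

TRUE_EFFECT = 0.13
p_poor = 0.25 + TRUE_EFFECT * d + 0.05 * age**2 + 0.03 * alcohol
y = rng.binomial(1, p_poor).astype(float)


def cross_fit_residuals(target, X, n_splits=5, seed=0):
    """Out-of-fold residuals: target minus a Random Forest prediction of E[target | X]."""
    residuals = np.empty_like(target)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = RandomForestRegressor(
            n_estimators=100, min_samples_leaf=25, random_state=seed
        )
        model.fit(X[train_idx], target[train_idx])
        residuals[test_idx] = target[test_idx] - model.predict(X[test_idx])
    return residuals


# Residualize outcome and treatment on the confounders (same folds for both).
y_res = cross_fit_residuals(y, X)
d_res = cross_fit_residuals(d, X)

# Final stage: regress outcome residuals on treatment residuals, no intercept.
theta = (d_res @ y_res) / (d_res @ d_res)

# Standard error from the orthogonal score psi = (y_res - theta * d_res) * d_res.
score = (y_res - theta * d_res) * d_res
se = np.sqrt(np.mean(score**2) / np.mean(d_res**2) ** 2 / n)

naive = y[d == 1].mean() - y[d == 0].mean()
print(f"naive difference: {naive:.3f}")
print(f"DML estimate:     {theta:.3f} (SE {se:.3f})")
```

Because the simulated confounding runs through `age**2`, a regression that is linear in age would not remove it, which is the situation the Random Forest nuisance models are meant to handle. Exact numbers depend on the seed and hyperparameters.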
Possible improvements
- Test additional causal methods (difference-in-differences, instrumental variables).
- Enrich the analysis with additional confounders (stress, physical activity).
- Study causal effect heterogeneity across subgroups.
Lessons learned
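- Agreement between OLS (0.132) and DML (0.134) is a strong signal that the result is robust to functional-form assumptions; it does not, by itself, rule out unobserved confounding.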
- An explicit DAG forces you to formalize and justify every causal assumption.
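- Double Machine Learning is worth using even when relationships are close to linear: it does not rely on the linearity assumption, and its agreement with OLS then serves as a check on the linear specification.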