Elements of Causal Inference: Reading Notes 1/N

1.1 Probability Theory and Statistics

AKA “Machine learning”

Problem formulation: Given observations of random variables, infer the underlying causal structure.

Observations:

  • inputs = cases = covariates
  • outputs = targets = labels

Critical assumption on observations

  • i.i.d. = independent and identically distributed

i.i.d. can be violated if

  • the distribution of a random variable changes
  • OR an intervention has taken place
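
A minimal sketch of the second point (a toy structural equation I made up, not from the book): samples of $Y$ drawn before and after an intervention on $X$ follow different distributions, so pooling them breaks the "identically distributed" part of i.i.d.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational regime: X ~ N(0, 1), Y := X + noise.
x_obs = rng.normal(size=n)
y_obs = x_obs + rng.normal(scale=0.5, size=n)

# Interventional regime: X is forced to 2 (do(X := 2)); the mechanism for Y is unchanged.
x_int = np.full(n, 2.0)
y_int = x_int + rng.normal(scale=0.5, size=n)

print(y_obs.mean(), y_obs.std())  # approx. 0.0 and 1.12
print(y_int.mean(), y_int.std())  # approx. 2.0 and 0.5
```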

We can answer these questions:

1: Regression:

\[f(x) = \mathbb{E}[Y \mid X = x]\]

2: Classification:

\[f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} P(Y = y \mid X = x), \text{ where } \mathcal{Y} \text{ is the set of classes}\]

3: Density estimation: estimate the density $p_{X, Y}$ of $P_{X, Y}$ (when it exists)
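
A minimal sketch of all three tasks on toy synthetic data (numpy only; the data and function names are my own, not from the book): $\mathbb{E}[Y \mid X = x]$ and $\operatorname{argmax}_y P(Y = y \mid X = x)$ are estimated with $k$-nearest neighbors, and $p_{X, Y}$ with a Gaussian kernel density estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                              # covariate X
y_class = (x > 0).astype(int)                       # binary label for classification
y_cont = 2.0 * x + rng.normal(scale=0.5, size=n)    # continuous target for regression

def regression_estimate(x0, k=50):
    """k-NN estimate of E[Y | X = x0] for the continuous target."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y_cont[idx].mean()

def classification_estimate(x0, k=50):
    """argmax_y P(Y = y | X = x0), estimated from k-NN class frequencies."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return np.bincount(y_class[idx], minlength=2).argmax()

def density_estimate(x0, y0, h=0.3):
    """Gaussian kernel density estimate of p_{X,Y}(x0, y0) for (X, continuous Y)."""
    kern = np.exp(-((x - x0) ** 2 + (y_cont - y0) ** 2) / (2 * h ** 2))
    return kern.sum() / (n * 2 * np.pi * h ** 2)

print(regression_estimate(1.0))       # approx. 2.0
print(classification_estimate(1.0))   # 1
print(density_estimate(1.0, 2.0))     # joint density estimate at (1, 2)
```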

1.2 Learning Theory

Statistical learning is ill-posed:

  • Extending $ f(x) $ from seen inputs to unseen inputs is a “leap of faith”

Solution:

  • Must make assumptions in order to extend $ f(x) $ from seen inputs to unseen inputs
  • These assumptions are formalized in terms of a capacity measure of function classes
  • In practice, these assumptions translate to choosing a function class, which incorporates a priori knowledge
  • In Bayesian approaches, we specify a prior distribution over function classes and a noise model
  • In regularization theory, we construct suitable regularizers to add to the optimization problem (sketched below).
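
A minimal sketch of the regularization view (my own toy example, using ridge regression): the penalty $\lambda \| w \|^2$ added to the empirical risk restricts the effective capacity and, in the Bayesian view, corresponds to a Gaussian prior over $w$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + rng.normal(scale=0.1, size=n)

lam = 1.0  # regularization strength: larger lambda means smaller effective capacity
# Minimize ||X w - y||^2 + lam * ||w||^2, which has the closed-form solution below.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.round(w_hat[:5], 2))  # close to [2.0, -1.0, 0.5, 0.0, 0.0]
```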

Causal learning is harder than statistical learning because of an additional ill-posedness:

  • “Even complete knowledge of an observational distribution usually does not determine the underlying causal model”
  • Namely, the same observational distribution can typically be generated by several different causal models, so it does not uniquely determine the causal structure (see the sketch below).
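
A minimal sketch of this non-identifiability (the standard linear-Gaussian illustration, written in my own notation): two different causal structures, X → Y and Y → X, that induce the same observational joint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model A: X -> Y, with X ~ N(0, 1) and Y := X + N_Y, N_Y ~ N(0, 1).
x_a = rng.normal(size=n)
y_a = x_a + rng.normal(size=n)

# Model B: Y -> X, with Y ~ N(0, 2) and X := Y / 2 + N_X, N_X ~ N(0, 1/2);
# the parameters are chosen so the joint Gaussian matches Model A.
y_b = rng.normal(scale=np.sqrt(2.0), size=n)
x_b = y_b / 2.0 + rng.normal(scale=np.sqrt(0.5), size=n)

# Same means and covariance structure, hence the same joint (Gaussian) distribution.
print(np.cov(x_a, y_a))  # approx. [[1, 1], [1, 2]]
print(np.cov(x_b, y_b))  # approx. [[1, 1], [1, 2]]
```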

Statistical learning example

  • Problem: binary pattern recognition
  • Goal: minimize the expected error or risk, over some function class
  • Since $P_{X, Y}$ is unknown, we need to appeal to an induction principle, such as empirical risk minimization (see the sketch after this list)
  • This learning procedure is consistent (i.e. it converges) only if the function class is “small”
  • The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity or size of a function class.
  • “Universal consistency”: convergence of a learning algorithm to the lowest risk achievable with any function, i.e. everything can be learned in the limit of infinite data. This does not imply that every problem can be learned well from finite data, due to the phenomenon of slow rates.
  • “Universally consistent” algorithms: nearest neighbor classifiers and Support Vector Machines
  • “Phenomenon of slow rates”: For any learning algorithm, there exist problems for which the learning rates are arbitrarily slow.
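
A minimal sketch of empirical risk minimization over a deliberately “small” function class (threshold classifiers on the real line; the toy data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
y = (x > 0.2).astype(int)
flip = rng.random(n) < 0.1          # 10% label noise, so the best achievable risk is about 0.1
y = np.where(flip, 1 - y, y)

def empirical_risk(t):
    """Fraction of training points misclassified by the rule 1{x > t}."""
    return np.mean((x > t).astype(int) != y)

# ERM: pick the threshold with the smallest training error.
candidates = np.sort(x)
t_hat = candidates[np.argmin([empirical_risk(t) for t in candidates])]
print(t_hat, empirical_risk(t_hat))  # t_hat close to 0.2, training risk close to 0.1
```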

“Deep learning” only works if “the underlying distribution does not differ between training and testing, be it by interventions or other change”, and it “does not provide us with the right means to describe what might change”.
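
A minimal sketch of this failure mode (an ordinary-least-squares toy I made up, standing in for a learned predictor): a model fitted on the training distribution performs poorly when the mechanism generating Y changes before testing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Training distribution: Y := 2 X + noise.
x_tr = rng.normal(size=n)
y_tr = 2.0 * x_tr + rng.normal(scale=0.3, size=n)

# Fit f(x) close to E[Y | X = x] by ordinary least squares (no intercept).
slope = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)

# Test distribution: the mechanism has changed, Y := -2 X + noise.
x_te = rng.normal(size=n)
y_te = -2.0 * x_te + rng.normal(scale=0.3, size=n)

print(np.mean((slope * x_tr - y_tr) ** 2))  # training MSE: small (about 0.09)
print(np.mean((slope * x_te - y_te) ** 2))  # test MSE: large (about 16)
```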

References

  • Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.