Boaz Barak and David Steurer
UCSD winter school on Sum-of-Squares, January 2017
Better question: when is the mathematical theory of probability applicable?
“The probability of winning a battle” has no place in our theory because it does not belong to any collective. Probability cannot be applied to this problem any more than the physical concept of work can be applied to the “work” done by an actor reciting his part. Richard Von Mises, 1928 (paraphrased)
The theory of inverse probability is founded upon an error, and must be wholly rejected. Ronald Fisher, 1925
If anyone wishes to study frequencies he is free to do so and we wish him every success. But why does he insist on appropriating the word “probability”? E.T. Jaynes, 1978
I am unable to see why ‘objectivity’ requires us to interpret every probability as a frequency in some random experiment; particularly when in most problems probabilities are frequencies only in an imaginary universe invented just for the purpose of allowing a frequency interpretation. E.T. Jaynes, 1976
Bayesian: Probability of \(A\) is the odds I would give that \(A\) happened.
\(A\) can even be in the past, if I don’t know about it.
Frequentist: Probability of \(A\) is the fraction of times it would happen in independent experiments.
\(A\) is some future event that depends on our random choices.
If \(A\) already happened then \({\mathbb{P}}[A] \in \{0,1\}\).
Bayesian: Distribution \(\Theta\) over \(\theta\) encoding prior+observations; make bet \(b\) minimizing \(\color{green}{{\mathbb{E}}_{\theta \sim \Theta}} {\mathbb{E}}_{x \sim p(x|\theta)} loss_b(x)\).
Frequentist: \(\theta\) already happened. Make bet \(b\) that minimizes \(\color{red}{\max_\theta} {\mathbb{E}}_{x\sim p(x|\theta)} loss_b(x)\).
Bayesian could do better in a single race, if the prior is accurate. (average case over both \(\theta\) and \(x\))
Frequentist is calibrated: minimizes loss in the long run if \(\theta\) is fixed over many races. (worst case over \(\theta\), average case over \(x\))
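To make the contrast concrete, here is a minimal numerical sketch (the horse-race setup, the squared loss, and all numbers are invented for illustration): a horse wins with unknown probability \(\theta \in \{0.2, 0.8\}\), the bet is \(b \in [0,1]\) with loss \((b-x)^2\), and we compare the Bayes-optimal bet under a prior against the minimax bet.

```python
import numpy as np

# Toy setting (all numbers invented): a horse wins (x = 1) with unknown
# probability theta; a bet b in [0, 1] suffers squared loss (b - x)^2.
thetas = np.array([0.2, 0.8])      # possible values of theta
prior = np.array([0.9, 0.1])       # Bayesian prior over theta
bets = np.linspace(0, 1, 1001)     # candidate bets b

def risk(b, theta):
    """Expected loss E_{x ~ Bernoulli(theta)} (b - x)^2."""
    return theta * (b - 1) ** 2 + (1 - theta) * b ** 2

R = np.array([[risk(b, t) for t in thetas] for b in bets])

bayes_bet = bets[np.argmin(R @ prior)]        # average case over theta
minimax_bet = bets[np.argmin(R.max(axis=1))]  # worst case over theta

print("Bayes bet  :", bayes_bet)    # 0.26 = E[theta]: better if prior is right
print("Minimax bet:", minimax_bet)  # 0.50: equalizes risk across both thetas
print("Prior-avg risk :", risk(bayes_bet, thetas) @ prior, "(Bayes)",
      risk(minimax_bet, thetas) @ prior, "(minimax)")
print("Worst-case risk:", risk(bayes_bet, thetas).max(), "(Bayes)",
      risk(minimax_bet, thetas).max(), "(minimax)")
```

The Bayes bet wins on prior-averaged risk, while the minimax bet wins on worst-case risk over \(\theta\), exactly the trade-off above.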
Frequentist might be incoherent: Bets could be “arbitraged” if they don’t correspond to a model (Cox 1946).
A set of estimates \(\{ {\tilde{\mathbb{E}}}f \}_f\) is coherent if \({\tilde{\mathbb{E}}}f \geq {\tilde{\mathbb{E}}}g\) whenever \(\forall_x f(x) \geq g(x)\).
You may contradict reality, but at least you don’t contradict yourself
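A minimal sketch of that arbitrage (payoffs and prices invented): if an agent posts incoherent prices \({\tilde{\mathbb{E}}}f < {\tilde{\mathbb{E}}}g\) even though \(f(x) \geq g(x)\) for every \(x\), then buying \(f\) and selling \(g\) locks in a profit at every outcome.

```python
# Invented payoffs with f(x) >= g(x) at every outcome x:
f = {0: 3.0, 1: 5.0}
g = {0: 2.0, 1: 4.0}
# Incoherent posted prices: the agent values f below g.
price_f, price_g = 1.0, 1.5

# Arbitrage: buy the f-contract, sell the g-contract. We pocket
# price_g - price_f > 0 up front, and f(x) - g(x) >= 0 at resolution.
for x in (0, 1):
    print(f"x = {x}: guaranteed profit = {(price_g - price_f) + (f[x] - g[x])}")
```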
Fundamental unpleasantness: Need to choose between coherence and efficiency
e.g., if \(\varphi\) is unsatisfiable then \(\varphi(x) < 1\) for all \(x\), but efficient algorithms can’t know this.
Bayesian: Always coherent, sometimes at a steep price.
SOS: pseudo-coherent, via the “diving board principle”:
If \(\vdash_d f \geq g\) then \({\tilde{\mathbb{E}}}f \geq {\tilde{\mathbb{E}}}g\) for every degree-\(d\) pseudo-distribution.
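Why the implication holds (using the standard facts that \(\vdash_d f \geq g\) means \(f - g\) has a degree-\(d\) sum-of-squares certificate, and that pseudo-expectations are linear with \({\tilde{\mathbb{E}}}\, s^2 \geq 0\)):

\[
f - g = \sum_i s_i^2 \quad\Longrightarrow\quad {\tilde{\mathbb{E}}}f - {\tilde{\mathbb{E}}}g = \sum_i {\tilde{\mathbb{E}}}\, s_i^2 \;\geq\; 0 .
\]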
If a computationally bounded agent can’t tell that there isn’t a real distribution, we can pretend that there is.
Algorithms: Pretend that we get the moments of an actual distribution, and worry about SOS-ing later.
Lower bounds: Don’t construct pseudo-distributions that fail to respect observable correlations (see the sketch below).
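A minimal numpy sketch of the classic triangle example (here with \(\pm 1\)-valued variables): a degree-2 pseudo-distribution whose moment matrix is PSD, so it is pseudo-coherent and passes every degree-2 “observable correlation” test, yet it pretends all three edges of a triangle are cut, which no real distribution can do.

```python
import numpy as np

# Degree-2 pseudo-moments over x in {-1, +1}^3 for the triangle graph:
# Etilde[x_i] = 0 and Etilde[x_i x_j] = -1/2 for every pair i != j.
n = 3
M = np.full((n, n), -0.5)       # M[i, j] = Etilde[x_i x_j]
np.fill_diagonal(M, 1.0)        # Etilde[x_i^2] = 1

# PSD moment matrix: Etilde[(c . x)^2] = c^T M c >= 0 for every linear form,
# so every degree-2 square gets a nonnegative pseudo-expectation.
print("eigenvalues:", np.linalg.eigvalsh(M))   # ~[0, 1.5, 1.5]: PSD

# Pretended cut value: each edge contributes (1 - Etilde[x_i x_j]) / 2.
edges = [(0, 1), (1, 2), (0, 2)]
cut = sum((1 - M[i, j]) / 2 for i, j in edges)
print("pretended cut value:", cut)   # 2.25, but a triangle's true max cut is 2
```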
Input: Graph \(G\) s.t. \(\exists\) \(x\in {\{0,1\}}^n\) with \(\sum x_i = n/2\) and \(f_G(x) = \sum_{(i,j)\in E_G} (x_i-x_j)^2 \leq \varphi |E| \;\;(*)\)
Goal: Find balanced \(x'\) s.t. \(f_G(x') \leq O(\sqrt{\log n}\,\varphi)|E|\).
(in fact, assume \({\mathbb{E}}(x_i-x_j)^2 \leq \varphi\) for all \((i,j) \in E_G\))
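Concretely, \(f_G(x)\) just counts the edges cut by \(x \in {\{0,1\}}^n\). A toy instance (graph and numbers invented for illustration): two cliques joined by a single edge, where the balanced clique-vs-clique cut witnesses the promise \((*)\) with small \(\varphi\).

```python
from itertools import combinations

# Toy instance: two 4-cliques joined by one edge.
n = 8
left, right = range(n // 2), range(n // 2, n)
edges = (list(combinations(left, 2)) + list(combinations(right, 2))
         + [(0, n // 2)])

def f_G(x):
    """f_G(x) = sum over edges (i, j) of (x_i - x_j)^2 = number of cut edges."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)

x = [1] * (n // 2) + [0] * (n // 2)   # balanced: sum x_i = n/2
assert sum(x) == n // 2
phi = f_G(x) / len(edges)
print(f"f_G(x) = {f_G(x)} of {len(edges)} edges; phi = {phi:.3f}")  # 1 of 13
```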
Reduced partitioning an arbitrary graph to an isoperimetric question on the cube.
(including highly nontrivial isoperimetric results)