Boaz Barak and David Steurer
UCSD winter school on Sum-of-Squares, January 2017
optimization: find solution with
best possible quality\[ \min_{x\in \Omega} f(x) \]
arises in many applications, e.g., machine learning
challenge: avoid searching through all solutions (takes forever)
not always possible (assuming P \(\neq\) NP)
goal: identify the best algorithm (as efficient, reliable, and accurate as possible)
convexity: average of two good solutions is also good
\[ f\left(\frac{x+y}{2}\right) \le \frac{f(x)+f(y)}{2} \]
in this case: local minimum \(\equiv\) global minimum
\(\leadsto\) versions of gradient descent work well
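illustration (ours, not from the lecture): a minimal Python sketch of gradient descent on a convex least-squares objective; the objective, step size, and iteration count are illustrative assumptions

```python
import numpy as np

# illustrative convex objective: f(x) = ||Ax - b||^2 with gradient 2 A^T (Ax - b)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    return 2 * A.T @ (A @ x - b)

x = np.zeros(5)
eta = 0.01                     # assumed step size, small enough for this A
for _ in range(2000):
    x -= eta * grad(x)

# convexity: the local minimum reached this way is the global minimum
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_star, atol=1e-4)
print("global minimizer found:", x)
```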
bad news: applications often require non-convex or even discrete optimization
sometimes gradient descent or local search still works
but even then, we often have no strong guarantees
sum-of-squares (SOS): [Shor’85,Parrilo’00,Lasserre’00] powerful unified approach to efficient and reliable algorithms for general optimization, including non-convex and discrete
generalizes efficient algorithms with best known guarantees
for a wide range of problems
based on generalization of classical probability, where uncertainty stems from computational difficulty
goal: understand the strengths and limitations of SOS for wide ranges of non-convex and discrete optimization problems
when does SOS achieve strictly stronger guarantees
than other efficient algorithms?
for what kinds of problems could SOS be optimal?
can we prove limitations of SOS to argue about
inherent typical-case difficulty of problems?
what are the practical implications and
how does SOS relate to popular heuristics?
overview of what is known about SOS so far
main take-away:
pseudo-probability: powerful tool to understand both strengths and limitations of SOS and potentially many other algorithms
generalizes classical probability; uncertainty arises from computational complexity in a restricted but powerful and intuitive proof system
real-valued function on \(n\)-dim. hypercube
\[ f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\]
astounding number of applications
rich mathematical structure
(fills at least one book …)
naive representation of \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) requires \(2^n\) numbers
(table of all evaluations of \(f\))
“low-degree functions” have more concise representations
def’n: \(f\) has degree \(\le d\) if \(\exists\) scalars \(\{c_S\}\),
\[ f(x) = \sum_{S \subseteq [n],~ {\lvert S \rvert}\le d} c_S \cdot \underbrace{\prod_{i\in S} x_i}_{\text{multilinear monomial}} \]
\(\leadsto\) only \(n^{O(d)}\) parameters to represent degree-\(d\) functions (one coefficient \(c_S\) per set \({\lvert S \rvert}\le d\))
multilinear monomials form a linear basis for real-valued functions on the hypercube
consequence: every function \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) has unique representation as multilinear polynomial (and degree at most \(n\))
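for concreteness, a small Python sketch (the helper `multilinear_coeffs` is our own name) that recovers this unique representation by Möbius inversion over subsets, \(c_S=\sum_{T\subseteq S}(-1)^{\lvert S\rvert-\lvert T\rvert}f(\mathbf 1_T)\):

```python
import itertools

def multilinear_coeffs(f, n):
    """Coefficients c_S of the unique multilinear representation of f on {0,1}^n,
    via Moebius inversion: c_S = sum over T subset of S of (-1)^{|S|-|T|} f(1_T)."""
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            c = 0
            for k in range(len(S) + 1):
                for T in itertools.combinations(S, k):
                    point = tuple(1 if i in T else 0 for i in range(n))
                    c += (-1) ** (len(S) - len(T)) * f(point)
            if c:
                coeffs[S] = c
    return coeffs

# AND of 3 bits is the single degree-3 monomial x_0 x_1 x_2
print(multilinear_coeffs(lambda x: x[0] & x[1] & x[2], 3))  # {(0, 1, 2): 1}
# XOR of 2 bits is x_0 + x_1 - 2 x_0 x_1, of degree 2
print(multilinear_coeffs(lambda x: x[0] ^ x[1], 2))         # {(0,): 1, (1,): 1, (0, 1): -2}
```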
great number of discrete optimization problems boil down to deciding non-negativity of low-degree polynomials
given: degree-\(\le d\) function \(f{\colon}{\{0,1\}}^n\to{\mathbb{R}}\)
(represented in monomial basis)
goal: either find \(x\in {\{0,1\}}^n\) such that \(f(x)\lt0\) or certify that \(f\ge 0\) over \({\{0,1\}}^n\)
challenge: avoid checking \(f(x)\ge 0\) for all \(x\in {\{0,1\}}^n\)
examples: sparsest cut (this lecture), max cut (next lecture)
\(d\)-regular undirected graph \(G\), vertex set \(V=[n]\), subset \(S\subseteq V\)
how well is \(S\) connected to the rest of the graph?
sparsity \(\displaystyle \Phi_G(S) = \frac{E_G(S,V\setminus S)}{\frac d n{\lvert S \rvert}{\lvert V\setminus S \rvert}}\)
\(\Phi_G(S)\) ranges from \(0\) to \(2\); most sets have sparsity \(\approx 1\)
sparsest cut: given \(G\), find \(S\) so as to minimize \(\Phi_G(S)\)
NP-hard; outstanding testbed for approximation algorithms
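to make the definition concrete, a brute-force Python sketch on a toy instance of our own choosing (the 4-cycle):

```python
import itertools

# toy instance (ours): the 4-cycle, a 2-regular graph on V = {0, 1, 2, 3}
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

def sparsity(S):
    """Phi_G(S) = |E_G(S, V \\ S)| / ((d/n) |S| |V \\ S|)."""
    cut = sum(1 for i, j in edges if (i in S) != (j in S))
    return cut / ((d / n) * len(S) * (n - len(S)))

# sparsest cut by exhaustive search -- exactly what we want to avoid for large n
best = min(((sparsity(set(S)), S) for r in range(1, n)
            for S in itertools.combinations(range(n), r)), key=lambda t: t[0])
print(best)   # (1.0, (0, 1)): cutting the 4-cycle into two paths has sparsity 1
```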
identify subsets of \([n]\) with points in \({\{0,1\}}^n\)
\(x \mapsto \{i\in [n] \mid x_i = 1\}\)
notation: \({\lvert x \rvert}=\sum_{i=1}^n x_i\) (weight)
sparsest cut in \(G\) has sparsity at least \(\color{red}{{\varepsilon}\gt0}\)
\(\Leftrightarrow\) the following deg-2 f’n is nonneg. over \({\{0,1\}}^n\):
\[ \underbrace{\sum_{\{i,j\}\in E_G} (x_i-x_j)^2}_{ \text{sparsity numerator}} - \color{red}{\varepsilon}\underbrace{\tfrac dn \cdot {\lvert x \rvert} (n-{\lvert x \rvert})}_{\text{sparsity denominator}} \]
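a quick numerical check of this equivalence (our sketch, same toy 4-cycle as above):

```python
import itertools

# the 4-cycle again; its sparsest cut has sparsity exactly 1
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

def poly(x, eps):
    num = sum((x[i] - x[j]) ** 2 for i, j in edges)   # sparsity numerator
    w = sum(x)                                        # |x|
    return num - eps * (d / n) * w * (n - w)          # minus eps times denominator

for eps in (0.9, 1.0, 1.1):
    nonneg = all(poly(x, eps) >= 0 for x in itertools.product((0, 1), repeat=n))
    print(eps, nonneg)   # True, True, False: nonneg iff eps <= sparsest-cut value 1
```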
this abstract viewpoint turns out to be surprisingly useful!
given low-degree \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\), SOS algorithm outputs either a certificate that \(f\ge 0\) or an object \(\mu\) (a “pseudo-distribution”) that “looks like” a distribution over solutions \(x\) with \(f(x)\lt 0\)
level \(\ell\): parameter of algorithm we can choose to trade off running time against the extent to which \(\mu\) “looks like” a distribution
many classical algorithms captured by level-2 SOS
will see that higher-level SOS can lead to better algorithms
idea: decompose function into “obviously” nonnegative parts
def’n: degree-\(\ell\) SOS certificate for \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) consists of \(g_1,\ldots,g_r{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) with \(\deg g_i \le \ell/2\) such that
\[ f = g_1^2 + \dots + g_r^2 \]
key property: can find this certificate in time \(n^{O(\ell)}\) if it exists
→ very useful for designing efficient algorithms ☻
we say inequality \(\{f\ge 0\}\) has deg.-\(\ell\) SOS proof, denoted \(\vdash_\ell \{ f\ge 0\}\), if \(f\) has a degree-\(\ell\) SOS certificate
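tiny worked example (ours): on the cube, XOR satisfies \(x_0+x_1-2x_0x_1=(x_0-x_1)^2\) since \(x_i^2=x_i\), so it has a degree-2 certificate with a single square; the sketch verifies this point by point:

```python
import itertools

n = 2
f  = lambda x: x[0] + x[1] - 2 * x[0] * x[1]   # XOR as a multilinear polynomial
g1 = lambda x: x[0] - x[1]                     # degree 1 = l/2 for level l = 2

# degree-2 SOS certificate: f = g1^2 as functions on {0,1}^n (uses x_i^2 = x_i)
assert all(f(x) == g1(x) ** 2 for x in itertools.product((0, 1), repeat=n))
print("certificate verified")
```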
idea: characterize SOS certificates in terms of PSD matrices
let \(v_d(x)=(1,x)^{\otimes d}\) (“Veronese map”, deg-\(\le d\) monomials)
claim: \(\vdash_d \{ f \ge 0 \}\) iff some positive semidefinite matrix \(A\) has
\[ f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\text{ for all }x\in{\{0,1\}}^n \]
proof: if \(f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\) for psd \(A\), then
\[ f(x)={\left\| A^{1/2} v_{d/2}(x) \right\|}_2^2 \]
is a sum of squares of degree-\(d/2\) functions; thus \(\vdash_d \{ f \ge 0 \}\)
conversely, if \(f=g_1^2+\dots+g_r^2\) with \(\deg g_i\le d/2\), write \(g_i(x)={\langle w_i, v_{d/2}(x)\rangle}\); then \(A=\sum_i w_i w_i^{\!\top}\) is psd and \(f(x)={\langle v_{d/2}(x), A v_{d/2}(x)\rangle}\) \({\square}\)
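a numpy sketch of the “if” direction for \(d=2\) (the random psd matrix is an illustrative assumption): extract the squares from \(A\) via a symmetric square root:

```python
import itertools
import numpy as np

n = 3
rng = np.random.default_rng(1)
B = rng.standard_normal((n + 1, n + 1))
A = B @ B.T                                   # an arbitrary psd matrix

v = lambda x: np.array([1.0, *x])             # v_1(x) = (1, x): monomials of degree <= 1
f = lambda x: v(x) @ A @ v(x)                 # f(x) = <v_1(x), A v_1(x)>

# symmetric square root A^{1/2} via eigendecomposition
w, Q = np.linalg.eigh(A)
sqrtA = Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

# f = sum_i g_i^2 with g_i(x) = (A^{1/2} v_1(x))_i, each of degree <= 1
for x in itertools.product((0, 1), repeat=n):
    g = sqrtA @ v(x)
    assert np.isclose(f(x), np.sum(g ** 2))
print("f is a sum of squares of degree-1 functions")
```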
does larger degree help? yes!
claim: every nonneg. \(f\) on \(n\)-cube has deg-\(2n\) SOS certificate
proof: \(f=g^2\) with \(g=\sqrt f\); since \(f\ge 0\), \(g\) is a well-defined function on the cube and hence has degree at most \(n\) \({\square}\)
this certificate requires writing down \(2^n\) numbers in general
→ not useful for efficient algorithms 🙁
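to see the claim and its exponential cost concretely, a small sketch (ours): tabulate \(g=\sqrt f\) on all \(2^n\) points and confirm \(f=g^2\):

```python
import itertools, math, random

n = 4
random.seed(0)
points = list(itertools.product((0, 1), repeat=n))

# an arbitrary nonnegative f on the cube, and g = sqrt(f)
f = {x: random.uniform(0, 10) for x in points}   # a table of 2^n numbers
g = {x: math.sqrt(f[x]) for x in points}         # another 2^n numbers

# g is a function on the cube, hence a multilinear polynomial of degree <= n,
# so f = g^2 is a single square of degree <= n: a degree-2n certificate
assert all(math.isclose(f[x], g[x] ** 2) for x in points)
print("degree-2n certificate: f = (sqrt f)^2, tabulated with", len(points), "numbers")
```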
what low-degree polynomials have low-degree certificates?
given graph \(G\) and two nodes \(s\) and \(t\),
find minimum cut between \(s\) and \(t\)
efficient algorithm using maximum flows
will see:
SOS can also solve this problem, but
without explicitly using any combinatorial structure
Laplacian \(f_G(x)=\sum_{\{i,j\}\in E_G} (x_i-x_j)^2\) (# edges cut by \(x\))
minimum cut in \(G\) between \(s\) and \(t\) is at least \(k\)
iff \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative
(restriction to subcube with \(x_s=0\) and \(x_t=1\))
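a brute-force check of this equivalence on a toy graph of our own:

```python
import itertools

# toy graph (ours): two parallel s-t paths, so the min s-t cut is 2
n, s, t = 4, 0, 3
edges = [(0, 1), (1, 3), (0, 2), (2, 3)]

def laplacian(x):
    return sum((x[i] - x[j]) ** 2 for i, j in edges)   # number of edges cut by x

def restricted_nonneg(k):
    """Is (f_G)|_{x_s=0, x_t=1} - k nonnegative on the subcube?"""
    return all(laplacian(x) - k >= 0
               for x in itertools.product((0, 1), repeat=n)
               if x[s] == 0 and x[t] == 1)

print(restricted_nonneg(2), restricted_nonneg(3))   # True False: min s-t cut is 2
```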
claim: If \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative,
then it has deg-4 SOS certificate
\(\leadsto\) poly-time algorithm for min cut
without explicit use of combinatorial structure
proof: suppose min \(s\)-\(t\) cut is \(\ge k\)
by Menger’s th’m, \(\exists\) \(k\) edge-disjoint \(s\)-\(t\) paths \(P_1,\ldots,P_k\)
thus, \(f_G=\underbrace{f_{P_1}+\cdots+f_{P_k}}_{\text{edge disjoint paths}}+\underbrace{f_{G-P_1\cdots-P_k}}_{\text{remaining graph}}\)
to show: \(\color{red}{\vdash_4 \{f_{P_i}\ge (x_s-x_t)^2\}}\) and \(\vdash_4 \{f_{G-P_1\cdots-P_k}\ge 0\}\)
idea: use \(\vdash_4 \{ (x_i-x_j)^2+(x_j-x_k)^2\ge (x_i-x_k)^2\}\) repeatedly \({\square}\)
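one way to check the base inequality (our derivation, using only \(x_j^2=x_j\) on the cube): the difference of the two sides satisfies the polynomial identity
\[ (x_i-x_j)^2+(x_j-x_k)^2-(x_i-x_k)^2 = 2(x_j-x_i)(x_j-x_k) =: 2g(x) \]
on the cube, \(g\) takes only the values \(0\) and \(1\) (if \(x_j=0\) then \(g=x_ix_k\); if \(x_j=1\) then \(g=(1-x_i)(1-x_k)\)), so \(g=g^2\) as a function on \({\{0,1\}}^n\); writing \(g\) multilinearly as \(x_j-x_ix_j-x_jx_k+x_ix_k\) (degree 2, using \(x_j^2=x_j\)) gives
\[ (x_i-x_j)^2+(x_j-x_k)^2-(x_i-x_k)^2 = \left(\sqrt 2\, g\right)^2 , \]
a square of a degree-2 function, i.e., a degree-4 SOS certificate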