Boaz Barak and David Steurer
UCSD winter school on Sum-of-Squares, January 2017
optimization: find solution with
best possible quality\[ \min_{x\in \Omega} f(x) \]
arises in many applications, e.g., machine learning
challenge: avoid searching through all solutions (takes forever)
not always possible (assuming P \(\neq\) NP)
goal: identify the best algorithm (as efficient, reliable, and accurate as possible)
convexity: average of two good solutions is also good
\[ f\left(\frac{x+y}{2}\right) \le \frac{f(x)+f(y)}{2} \]
in this case: local minimum \(\equiv\) global minimum
\(\leadsto\) versions of gradient descent work well
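illustration (ours, not from the lecture): a minimal Python sketch of gradient descent on a convex least-squares objective; the objective, step size, and iteration count are illustrative assumptions

```python
import numpy as np

# illustrative convex objective: f(x) = ||Ax - b||^2 with gradient 2 A^T (Ax - b)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(x):
    return 2 * A.T @ (A @ x - b)

x = np.zeros(5)
eta = 0.01                     # assumed step size, small enough for this A
for _ in range(2000):
    x -= eta * grad(x)

# convexity: the local minimum reached this way is the global minimum
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_star, atol=1e-4)
print("global minimizer found:", x)
```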
bad news: applications often require non-convex or even discrete optimization
sometimes gradient descent or local search still works
but even then, we often have no strong guarantees
sum-of-squares (SOS): [Shor’85,Parrilo’00,Lasserre’00] powerful unified approach to efficient and reliable algorithms for general optimization, including non-convex and discrete
generalizes efficient algorithms with best known guarantees
for a wide range of problems
based on generalization of classical probability, where uncertainty stems from computational difficulty
goal: understand the strengths and limitations of SOS for wide ranges of non-convex and discrete optimization problems
when does SOS achieve strictly stronger guarantees
than other efficient algorithms?
for what kinds of problems could SOS be optimal?
can we prove limitations of SOS to argue about
inherent typical-case difficulty of problems?
what are the practical implications and
how does SOS relate to popular heuristics?
overview of what is known about SOS so far
main take-away:
pseudo-probability: powerful tool to understand both strengths and limitations of SOS and potentially many other algorithms
generalizes classical probability; uncertainty arises from computational complexity in a restricted but powerful and intuitive proof system
real-valued function on \(n\)-dim. hypercube
\[ f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\]
astounding number of applications
rich mathematical structure
(fills at least one book …)
naive representation of \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) requires \(2^n\) numbers
(table of all evaluations of \(f\))
“low-degree functions” have more concise representations
def’n: \(f\) has degree \(\le d\) if \(\exists\) scalars \(\{c_S\}\),
\[ f(x) = \sum_{S \subseteq [n],~ {\lvert S \rvert}\le d} c_S \cdot \underbrace{\prod_{i\in S} x_i}_{\text{multilinear monomial}} \]
\(\leadsto\) only \(n^{O(d)}\) parameters to represent degree-\(d\) functions (one coefficient \(c_S\) per set \({\lvert S \rvert}\le d\))
multilinear monomials form a linear basis for real-valued functions on the hypercube
consequence: every function \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) has unique representation as multilinear polynomial (and degree at most \(n\))
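for concreteness, a small Python sketch (the helper `multilinear_coeffs` is our own name) that recovers this unique representation by Möbius inversion over subsets, \(c_S=\sum_{T\subseteq S}(-1)^{\lvert S\rvert-\lvert T\rvert}f(\mathbf 1_T)\):

```python
import itertools

def multilinear_coeffs(f, n):
    """Coefficients c_S of the unique multilinear representation of f on {0,1}^n,
    via Moebius inversion: c_S = sum over T subset of S of (-1)^{|S|-|T|} f(1_T)."""
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            c = 0
            for k in range(len(S) + 1):
                for T in itertools.combinations(S, k):
                    point = tuple(1 if i in T else 0 for i in range(n))
                    c += (-1) ** (len(S) - len(T)) * f(point)
            if c:
                coeffs[S] = c
    return coeffs

# AND of 3 bits is the single degree-3 monomial x_0 x_1 x_2
print(multilinear_coeffs(lambda x: x[0] & x[1] & x[2], 3))  # {(0, 1, 2): 1}
# XOR of 2 bits is x_0 + x_1 - 2 x_0 x_1, of degree 2
print(multilinear_coeffs(lambda x: x[0] ^ x[1], 2))         # {(0,): 1, (1,): 1, (0, 1): -2}
```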
great number of discrete optimization problems boil down to deciding non-negativity of low-degree polynomials
given: degree-\(\le d\) function \(f{\colon}{\{0,1\}}^n\to{\mathbb{R}}\)
(represented in monomial basis)
goal: either find \(x\in {\{0,1\}}^n\) such that \(f(x)\lt0\) or certify that \(f\ge 0\) over \({\{0,1\}}^n\)
challenge: avoid checking \(f(x)\ge 0\) for all \(x\in {\{0,1\}}^n\)
examples: sparsest cut (this lecture), max cut (next lecture)
\(d\)-regular undirected graph \(G\), vertex set \(V=[n]\), subset \(S\subseteq V\)
how well is \(S\) connected to the rest of the graph?
sparsity \(\displaystyle \Phi_G(S) = \frac{E_G(S,V\setminus S)}{\frac d n{\lvert S \rvert}{\lvert V\setminus S \rvert}}\)
\(\Phi_G(S)\) ranges from \(0\) to \(2\); most sets have sparsity \(\approx 1\)
sparsest cut: given \(G\), find \(S\) so as to minimize \(\Phi_G(S)\)
NP-hard; outstanding testbed for approximation algorithms
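to make the definition concrete, a brute-force Python sketch on a toy instance of our own choosing (the 4-cycle):

```python
import itertools

# toy instance (ours): the 4-cycle, a 2-regular graph on V = {0, 1, 2, 3}
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

def sparsity(S):
    """Phi_G(S) = |E_G(S, V \\ S)| / ((d/n) |S| |V \\ S|)."""
    cut = sum(1 for i, j in edges if (i in S) != (j in S))
    return cut / ((d / n) * len(S) * (n - len(S)))

# sparsest cut by exhaustive search -- exactly what we want to avoid for large n
best = min(((sparsity(set(S)), S) for r in range(1, n)
            for S in itertools.combinations(range(n), r)), key=lambda t: t[0])
print(best)   # (1.0, (0, 1)): cutting the 4-cycle into two paths has sparsity 1
```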
identify subsets of \([n]\) with points in \({\{0,1\}}^n\)
\(x \mapsto \{i\in [n] \mid x_i = 1\}\)
notation: \({\lvert x \rvert}=\sum_{i=1}^n x_i\) (weight)
sparsest cut in \(G\) has sparsity at least \(\color{red}{{\varepsilon}\gt0}\)
\(\Leftrightarrow\) the following deg-2 f’n is nonneg. over \({\{0,1\}}^n\):
\[ \underbrace{\sum_{\{i,j\}\in E_G} (x_i-x_j)^2}_{ \text{sparsity numerator}} - \color{red}{\varepsilon}\underbrace{\tfrac dn \cdot {\lvert x \rvert} (n-{\lvert x \rvert})}_{\text{sparsity denominator}} \]
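a quick numerical check of this equivalence (our sketch, same toy 4-cycle as above):

```python
import itertools

# the 4-cycle again; its sparsest cut has sparsity exactly 1
n, d = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

def poly(x, eps):
    num = sum((x[i] - x[j]) ** 2 for i, j in edges)   # sparsity numerator
    w = sum(x)                                        # |x|
    return num - eps * (d / n) * w * (n - w)          # minus eps times denominator

for eps in (0.9, 1.0, 1.1):
    nonneg = all(poly(x, eps) >= 0 for x in itertools.product((0, 1), repeat=n))
    print(eps, nonneg)   # True, True, False: nonneg iff eps <= sparsest-cut value 1
```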
this abstract viewpoint turns out to be surprisingly useful!
given low-degree \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\), SOS algorithm outputs either a certificate that \(f\ge 0\) or an object \(\mu\) (a “pseudo-distribution”) that “looks like” a distribution over solutions \(x\) with \(f(x)\lt 0\)
level \(\ell\): parameter of algorithm we can choose to trade off running time against the extent to which \(\mu\) “looks like” a distribution
many classical algorithms captured by level-2 SOS
will see that higher-level SOS can lead to better algorithms
idea: decompose function into “obviously” nonnegative parts
def’n: degree-\(\ell\) SOS certificate for \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) consists of \(g_1,\ldots,g_r{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) with \(\deg g_i \le \ell/2\) such that
\[ f = g_1^2 + \dots + g_r^2 \]
key property: can find this certificate in time \(n^{O(\ell)}\) if it exists
→ very useful for designing efficient algorithms ☻
we say inequality \(\{f\ge 0\}\) has deg.-\(\ell\) SOS proof, denoted \(\vdash_\ell \{ f\ge 0\}\), if \(f\) has a degree-\(\ell\) SOS certificate
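tiny worked example (ours): on the cube, XOR satisfies \(x_0+x_1-2x_0x_1=(x_0-x_1)^2\) since \(x_i^2=x_i\), so it has a degree-2 certificate with a single square; the sketch verifies this point by point:

```python
import itertools

n = 2
f  = lambda x: x[0] + x[1] - 2 * x[0] * x[1]   # XOR as a multilinear polynomial
g1 = lambda x: x[0] - x[1]                     # degree 1 = l/2 for level l = 2

# degree-2 SOS certificate: f = g1^2 as functions on {0,1}^n (uses x_i^2 = x_i)
assert all(f(x) == g1(x) ** 2 for x in itertools.product((0, 1), repeat=n))
print("certificate verified")
```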
idea: characterize SOS certificates in terms of PSD matrices
let \(v_d(x)=(1,x)^{\otimes d}\) (“Veronese map”, deg-\(\le d\) monomials)
claim: \(\vdash_d \{ f \ge 0 \}\) iff some positive semidefinite matrix \(A\) has
\[ f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\text{ for all }x\in{\{0,1\}}^n \]
proof: if \(f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\) for psd \(A\), then
\[ f(x)={\left\| A^{1/2} v_{d/2}(x) \right\|}_2^2 \]
is a sum of squares of degree-\(d/2\) functions; thus \(\vdash_d \{ f \ge 0 \}\)
conversely, if \(f=g_1^2+\dots+g_r^2\) with \(\deg g_i\le d/2\), write \(g_i(x)={\langle w_i, v_{d/2}(x)\rangle}\); then \(A=\sum_i w_i w_i^{\!\top}\) is psd and \(f(x)={\langle v_{d/2}(x), A v_{d/2}(x)\rangle}\) \({\square}\)
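a numpy sketch of the “if” direction for \(d=2\) (the random psd matrix is an illustrative assumption): extract the squares from \(A\) via a symmetric square root:

```python
import itertools
import numpy as np

n = 3
rng = np.random.default_rng(1)
B = rng.standard_normal((n + 1, n + 1))
A = B @ B.T                                   # an arbitrary psd matrix

v = lambda x: np.array([1.0, *x])             # v_1(x) = (1, x): monomials of degree <= 1
f = lambda x: v(x) @ A @ v(x)                 # f(x) = <v_1(x), A v_1(x)>

# symmetric square root A^{1/2} via eigendecomposition
w, Q = np.linalg.eigh(A)
sqrtA = Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

# f = sum_i g_i^2 with g_i(x) = (A^{1/2} v_1(x))_i, each of degree <= 1
for x in itertools.product((0, 1), repeat=n):
    g = sqrtA @ v(x)
    assert np.isclose(f(x), np.sum(g ** 2))
print("f is a sum of squares of degree-1 functions")
```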
does larger degree help? yes!
claim: every nonneg. \(f\) on \(n\)-cube has deg-\(2n\) SOS certificate
proof: \(f=g^2\) with \(g=\sqrt f\); since \(f\ge 0\), \(g\) is a well-defined function on the cube and hence has degree at most \(n\) \({\square}\)
this certificate requires writing down \(2^n\) numbers in general
→ not useful for efficient algorithms 🙁
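to see the claim and its exponential cost concretely, a small sketch (ours): tabulate \(g=\sqrt f\) on all \(2^n\) points and confirm \(f=g^2\):

```python
import itertools, math, random

n = 4
random.seed(0)
points = list(itertools.product((0, 1), repeat=n))

# an arbitrary nonnegative f on the cube, and g = sqrt(f)
f = {x: random.uniform(0, 10) for x in points}   # a table of 2^n numbers
g = {x: math.sqrt(f[x]) for x in points}         # another 2^n numbers

# g is a function on the cube, hence a multilinear polynomial of degree <= n,
# so f = g^2 is a single square of degree <= n: a degree-2n certificate
assert all(math.isclose(f[x], g[x] ** 2) for x in points)
print("degree-2n certificate: f = (sqrt f)^2, tabulated with", len(points), "numbers")
```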
what low-degree polynomials have low-degree certificates?
given graph \(G\) and two nodes \(s\) and \(t\),
find minimum cut between \(s\) and \(t\)
efficient algorithm using maximum flows
will see:
SOS can also solve this problem, but
without explicitly using any combinatorial structure
Laplacian \(f_G(x)=\sum_{\{i,j\}\in E_G} (x_i-x_j)^2\) (# edges cut by \(x\))
minimum cut in \(G\) between \(s\) and \(t\) is at least \(k\)
iff \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative
(restriction to subcube with \(x_s=0\) and \(x_t=1\))
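a brute-force check of this equivalence on a toy graph of our own:

```python
import itertools

# toy graph (ours): two parallel s-t paths, so the min s-t cut is 2
n, s, t = 4, 0, 3
edges = [(0, 1), (1, 3), (0, 2), (2, 3)]

def laplacian(x):
    return sum((x[i] - x[j]) ** 2 for i, j in edges)   # number of edges cut by x

def restricted_nonneg(k):
    """Is (f_G)|_{x_s=0, x_t=1} - k nonnegative on the subcube?"""
    return all(laplacian(x) - k >= 0
               for x in itertools.product((0, 1), repeat=n)
               if x[s] == 0 and x[t] == 1)

print(restricted_nonneg(2), restricted_nonneg(3))   # True False: min s-t cut is 2
```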
claim: If \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative,
then it has deg-4 SOS certificate
\(\leadsto\) poly-time algorithm for min cut
without explicit use of combinatorial structure
proof: suppose min \(s\)-\(t\) cut is \(\ge k\)
by Menger’s th’m, \(\exists\) \(k\) edge-disjoint \(s\)-\(t\) paths \(P_1,\ldots,P_k\)
thus, \(f_G=\underbrace{f_{P_1}+\cdots+f_{P_k}}_{\text{edge disjoint paths}}+\underbrace{f_{G-P_1\cdots-P_k}}_{\text{remaining graph}}\)
to show: \(\color{red}{\vdash_4 \{f_{P_i}\ge (x_s-x_t)^2\}}\) and \(\vdash_4 \{f_{G-P_1\cdots-P_k}\ge 0\}\)
idea: use \(\vdash_4 \{ (x_i-x_j)^2+(x_j-x_k)^2\ge (x_i-x_k)^2\}\) repeatedly \({\square}\)
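one way to check the base inequality (our derivation, using only \(x_j^2=x_j\) on the cube): the difference of the two sides satisfies the polynomial identity
\[ (x_i-x_j)^2+(x_j-x_k)^2-(x_i-x_k)^2 = 2(x_j-x_i)(x_j-x_k) =: 2g(x) \]
on the cube, \(g\) takes only the values \(0\) and \(1\) (if \(x_j=0\) then \(g=x_ix_k\); if \(x_j=1\) then \(g=(1-x_i)(1-x_k)\)), so \(g=g^2\) as a function on \({\{0,1\}}^n\); writing \(g\) multilinearly as \(x_j-x_ix_j-x_jx_k+x_ix_k\) (degree 2, using \(x_j^2=x_j\)) gives
\[ (x_i-x_j)^2+(x_j-x_k)^2-(x_i-x_k)^2 = \left(\sqrt 2\, g\right)^2 , \]
a square of a degree-2 function, i.e., a degree-4 SOS certificate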