Boaz Barak and David Steurer

UCSD winter school on Sum-of-Squares, January 2017

why might you care?

optimization: find the solution with the best possible quality\[ \min_{x\in \Omega} f(x) \]

arises in many applications, e.g., machine learning

challenge: avoid searching through all solutions (takes forever)

not always possible (assuming P \(\neq\) NP)

goal: identify the best algorithm (as efficient, reliable, and accurate as possible)

convexity: the average of two good solutions is also good\[ f\left(\frac{x+y}{2}\right) \le \frac{f(x)+f(y)}{2} \]

in this case: *local minimum \(\equiv\) global minimum*

\(\leadsto\) versions of **gradient descent** work well

bad news: applications often require *non-convex* or even *discrete* optimization

sometimes gradient descent or local search still works

but even then, we often have *no strong guarantees*

sum-of-squares (SOS):
[Shor ’85, Parrilo ’00, Lasserre ’00]
*powerful unified approach* to efficient and reliable algorithms for general optimization, including **non-convex** and **discrete**

generalizes efficient algorithms with *best known* guarantees

for a **wide range of problems**

based on *generalization of classical probability*, where **uncertainty stems from computational difficulty**

goal: understand the strengths and limitations of SOS for a wide range of non-convex and discrete optimization problems

when does SOS achieve *strictly stronger guarantees* than other efficient algorithms?

for what kinds of problems could SOS be *optimal*?

can we prove limitations of SOS to argue about *inherent typical-case difficulty* of problems?

what are the *practical implications*, and how does SOS relate to popular heuristics?

*overview* of what is known about SOS so far

main *take-away*:

pseudo-probability: a powerful tool to understand both strengths and limitations of SOS and potentially many other algorithms; generalizes classical probability;

uncertainty arises from complexity in a restricted but powerful and intuitive proof system

real-valued functions on the \(n\)-dim. hypercube\[ f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\]

astounding number of *applications*

rich *mathematical structure*

(fills at least one book …)

naive representation of \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) requires \(2^n\) numbers

(table of all evaluations of \(f\))

“low-degree functions” have more *concise representations*

def’n: \(f\) has degree \(\le d\) if \(\exists\) scalars \(\{c_S\}\) such that\[ f(x) = \sum_{S \subseteq [n],~ {\lvert S \rvert}\le d} c_S \cdot \underbrace{\prod_{i\in S} x_i}_{\text{multilinear monomial}} \]

\(\leadsto\) *number of parameters* to represent degree-\(d\) functions,

\[
\binom{n}{1} + \cdots + \binom{n}{d} \approx n^d
\]
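As a quick sanity check of this count (a sketch, not from the slides; the helper name is ours), the exact sum of binomials is easy to compute:

```python
from math import comb

def num_parameters(n: int, d: int) -> int:
    """Number of multilinear monomials with 1 <= |S| <= d on n
    variables: C(n,1) + ... + C(n,d), which is roughly n^d."""
    return sum(comb(n, k) for k in range(1, d + 1))

print(num_parameters(10, 2))  # C(10,1) + C(10,2) = 10 + 45
```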

\[
x_S {\stackrel{\mathrm{def}}{=}}\prod_{i\in S} x_i, \quad S\subseteq [n]
\]

multilinear monomials form a linear basis for real-valued functions on the hypercube

**consequence:**
every function \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) has *unique representation* as multilinear polynomial (and degree at most \(n\))
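The unique representation can be computed explicitly by Möbius inversion over subsets: \(c_S = \sum_{T\subseteq S} (-1)^{{\lvert S\rvert}-{\lvert T\rvert}} f(\mathbf{1}_T)\). A small sketch (function names are ours):

```python
from itertools import combinations

def multilinear_coeffs(f, n):
    """Coefficients c_S of the unique multilinear representation
    f(x) = sum_S c_S * prod_{i in S} x_i, via Moebius inversion:
    c_S = sum_{T subseteq S} (-1)^{|S|-|T|} f(indicator of T)."""
    coeffs = {}
    for d in range(n + 1):
        for S in combinations(range(n), d):
            total = 0
            for k in range(len(S) + 1):
                for T in combinations(S, k):
                    x = [1 if i in T else 0 for i in range(n)]
                    total += (-1) ** (len(S) - len(T)) * f(x)
            coeffs[S] = total
    return coeffs

# example: OR of two bits has representation x1 + x2 - x1*x2
c = multilinear_coeffs(lambda x: max(x), 2)
```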

a great number of discrete optimization problems boil down to deciding *non-negativity of low-degree polynomials*

given: degree-\(\le d\) function \(f{\colon}{\{0,1\}}^n\to{\mathbb{R}}\)

(represented in monomial basis)

goal: either find \(x\in {\{0,1\}}^n\) such that \(f(x)\lt0\) or certify that \(f\ge 0\) over \({\{0,1\}}^n\)

*challenge:* avoid checking \(f(x)\ge 0\) for all \(x\in {\{0,1\}}^n\)

*examples:* sparsest cut (this lecture), max cut (next lecture)

\(d\)-regular undirected graph \(G\), vertex set \(V=[n]\), subset \(S\subseteq V\)

*how well is \(S\) connected to the rest of the graph?*

sparsity: \(\displaystyle \Phi_G(S) = \frac{E_G(S,V\setminus S)}{\frac d n{\lvert S \rvert}{\lvert V\setminus S \rvert}}\)

\(\Phi_G(S)\) ranges from \(0\) to \(2\); most sets have sparsity \(\approx 1\)
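For concreteness, a tiny sketch of \(\Phi_G\) (graph and names are ours) showing both extreme-ish values on a 4-cycle:

```python
def sparsity(edges, n, d, S):
    """Phi_G(S) = E_G(S, V\\S) / ((d/n) |S| |V\\S|) for a d-regular graph."""
    S = set(S)
    crossing = sum(1 for (i, j) in edges if (i in S) != (j in S))
    return crossing / ((d / n) * len(S) * (n - len(S)))

# 4-cycle: a 2-regular graph on vertices 0..3
c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(sparsity(c4, 4, 2, {0, 1}))  # adjacent pair: sparsity 1.0
print(sparsity(c4, 4, 2, {0, 2}))  # opposite pair: sparsity 2.0
```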

sparsest cut: given \(G\), find \(S\) so as to minimize \(\Phi_G(S)\)

NP-hard; outstanding testbed for *approximation algorithms*

identify subsets of \([n]\) with points in \({\{0,1\}}^n\)

\(x \mapsto \{i\in [n] \mid x_i = 1\}\)

notation: \({\lvert x \rvert}=\sum_{i=1}^n x_i\) (weight)

sparsest cut in \(G\) has sparsity at least \(\color{red}{{\varepsilon}\gt0}\)

\(\Leftrightarrow\) following deg-2 f’n is nonneg. over \({\{0,1\}}^n\)\[ \underbrace{\sum_{\{i,j\}\in E_G} (x_i-x_j)^2}_{ \text{sparsity numerator}} - \color{red}{\varepsilon}\underbrace{\tfrac dn \cdot {\lvert x \rvert} (n-{\lvert x \rvert})}_{\text{sparsity denominator}} \]
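The equivalence can be checked by brute force on a small instance (a sketch, names ours): the 4-cycle has sparsest cut 1, so the deg-2 function is nonnegative for \(\varepsilon=1\) but takes a negative value for \(\varepsilon=1.5\):

```python
from itertools import product

def cut_poly(edges, n, d, eps, x):
    """Deg-2 function whose nonnegativity over {0,1}^n certifies that
    every cut of the d-regular graph has sparsity at least eps."""
    numerator = sum((x[i] - x[j]) ** 2 for (i, j) in edges)
    w = sum(x)  # |x|
    return numerator - eps * (d / n) * w * (n - w)

c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-cycle, 2-regular

def nonneg(eps):
    return all(cut_poly(c4, 4, 2, eps, x) >= 0
               for x in product([0, 1], repeat=4))
```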

*this abstract viewpoint* turns out to be **surprisingly useful!**

given low-degree \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\), SOS algorithm outputs

- either, a *short certificate* that \(f\ge 0\)
- or, an object \(\mu\) that *“looks like” a distribution* over \(x\in{\{0,1\}}^n\) under which \(f\) has *negative* expected value

level \(\ell\): a parameter of the algorithm that we can choose to trade off running time against the extent to which \(\mu\) “looks like” a distribution

many classical algorithms captured by *level-2 SOS*

will see that *higher-level SOS* can lead to **better algorithms**

*idea:* decompose function into “obviously” nonnegative parts

def’n: a degree-\(\ell\) SOS certificate for \(f{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) consists of \(g_1,\ldots,g_r{\colon}{\{0,1\}}^n\to {\mathbb{R}}\) with \(\deg g_i \le \ell/2\) such that\[ f = g_1^2 + \dots + g_r^2 \]

*key property:* can find this certificate in time \(n^{O(\ell)}\) if it exists

→ very useful for designing efficient algorithms **☻**

we say inequality \(\{f\ge 0\}\) has *deg.-\(\ell\) SOS proof*, denoted \(\vdash_\ell \{ f\ge 0\}\), if \(f\) has a degree-\(\ell\) SOS certificate

*idea:* characterize SOS certificates in terms of PSD matrices

let \(v_d(x)=(1,x)^{\otimes d}\) (“Veronese map”, deg-\(\le d\) monomials)

claim: \(\vdash_d \{ f \ge 0 \}\) iff some positive semidefinite matrix \(A\) has\[ f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\text{ for all }x\in{\{0,1\}}^n \]

proof: if \(f(x)={\langle v_{d/2}(x), A v_{d/2}(x) \rangle}\) for psd \(A\), then\[ f(x)={\left\| A^{1/2} v_{d/2}(x) \right\|}_2^2 \]is a sum of squares of degree-\(d/2\) functions; thus \(\vdash_d \{ f \ge 0 \}\)
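A numerical sanity check of this direction (a numpy-based sketch; variable names ours): writing \(A=BB^\top\) for a random \(B\), the function \(f(x)=\langle v(x), A\,v(x)\rangle = \lVert B^\top v(x)\rVert_2^2\) is nonnegative on the whole cube:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 3
B = rng.standard_normal((n + 1, n + 1))
A = B @ B.T                      # psd by construction

def v1(x):
    """Veronese map for d = 2: the vector (1, x) of deg <= 1 monomials."""
    return np.concatenate(([1.0], x))

# f(x) = <v(x), A v(x)> = ||B^T v(x)||_2^2: a sum of squares of
# degree-1 functions, hence nonnegative everywhere on {0,1}^n
vals = [float(v1(np.array(x, dtype=float)) @ A @ v1(np.array(x, dtype=float)))
        for x in product([0, 1], repeat=n)]
```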

*does larger degree help?* yes!

claim: every nonneg. \(f\) on the \(n\)-cube has a deg-\(2n\) SOS certificate

*proof:* \(f=g^2\) with \(g=\sqrt f\) of degree at most \(n\) \({\square}\)

this proof requires writing down \(2^n\) numbers in general

→ not useful for efficient algorithms 🙁

*what low-degree polynomials have low-degree certificates?*

given graph \(G\) and two nodes \(s\) and \(t\),

find minimum cut between \(s\) and \(t\)

efficient algorithm using maximum flows

**will see:**
SOS can also solve this problem, but

without explicitly using any combinatorial structure

*Laplacian*
\(f_G(x)=\sum_{\{i,j\}\in E_G} (x_i-x_j)^2\)
(# edges cut by \(x\))

minimum cut in \(G\) between \(s\) and \(t\) is at least \(k\)

iff \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative

(restriction to subcube with \(x_s=0\) and \(x_t=1\))
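To make the restriction concrete, a brute-force sketch (graph and names ours): on a 4-cycle with \(s=0\), \(t=2\), the minimum of \(f_G\) over the subcube \(x_s=0, x_t=1\) is the min \(s\)-\(t\) cut, here \(2\):

```python
from itertools import product

c4 = [(0, 1), (1, 2), (2, 3), (3, 0)]   # 4-cycle, s = 0, t = 2

def f_G(edges, x):
    """Laplacian quadratic f_G(x): number of edges cut by x."""
    return sum((x[i] - x[j]) ** 2 for (i, j) in edges)

# restriction to the subcube x_s = 0, x_t = 1
restricted = [f_G(c4, (0, a, 1, b)) for a, b in product([0, 1], repeat=2)]
k = min(restricted)                      # min s-t cut value
```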

claim: if \((f_G)_{| x_s=0,x_t=1}-k\) is nonnegative, then it has a deg-4 SOS certificate

\(\leadsto\) poly-time algorithm for min cut

without explicit use of combinatorial structure

proof: suppose the min \(s\)-\(t\) cut is \(\ge k\); by Menger’s th’m, \(\exists\) \(k\) edge-disjoint \(s\)-\(t\) paths \(P_1,\ldots,P_k\)

thus, \(f_G=\underbrace{f_{P_1}+\cdots+f_{P_k}}_{\text{edge disjoint paths}}+\underbrace{f_{G-P_1\cdots-P_k}}_{\text{remaining graph}}\)

to show: \(\color{red}{\vdash_4 \{f_{P_i}\ge (x_s-x_t)^2\}}\) and \(\vdash_4 \{f_{G-P_1\cdots-P_k}\ge 0\}\)

idea: use \(\vdash_4 \{ (x_i-x_j)^2+(x_j-x_k)^2\ge (x_i-x_k)^2\}\) repeatedly \({\square}\)
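A one-line brute-force check of the plain inequality behind this last step (a sketch; it verifies only the inequality over the cube, not its SOS proof):

```python
from itertools import product

# the "triangle inequality" applied repeatedly along each path: over {0,1},
# (x_i - x_j)^2 + (x_j - x_k)^2 >= (x_i - x_k)^2
ok = all((a - b) ** 2 + (b - c) ** 2 >= (a - c) ** 2
         for a, b, c in product([0, 1], repeat=3))
```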