Jian-Feng Cai, Meng Huang, Dong Li and Yang Wang

1 Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
2 SUSTech International Center for Mathematics and Department of Mathematics, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China
Abstract. A fundamental task in phase retrieval is to recover an unknown signal $x \in \mathbb{R}^n$ from a set of magnitude-only measurements $y_i = |\langle a_i, x\rangle|$, $i = 1, \dots, m$. In this paper, we propose two novel perturbed amplitude models (PAMs) which have non-convex, quadratic-type loss functions. When the measurement vectors $a_i \in \mathbb{R}^n$ are Gaussian random vectors and the number of measurements $m \ge Cn$, we rigorously prove that the PAMs admit no spurious local minimizers with high probability, i.e., the target solution $x$ is the unique local minimizer (up to a global phase) and the loss function has a negative directional curvature around each saddle point. Thanks to the well-tamed benign geometric landscape, one can employ the vanilla gradient descent method to locate the global minimizer $x$ (up to a global phase) without spectral initialization. We carry out extensive numerical experiments to show that the gradient descent algorithm with random initialization outperforms state-of-the-art algorithms with spectral initialization in empirical success rate and convergence speed.
Key words: Phase retrieval, landscape analysis, non-convex optimization.
The basic amplitude model for phase retrieval can be written as
$$y_i = |\langle a_i, x\rangle|, \qquad i = 1, \dots, m,$$
where $x \in \mathbb{R}^n$ is the unknown signal and $a_i \in \mathbb{R}^n$ are the measurement vectors. Note that $x$ and $-x$ generate the same measurements, so recovery is only possible up to this global phase ambiguity.
Many algorithms have been designed to solve the phase retrieval problem, and they can be categorized into convex algorithms and non-convex ones. The convex algorithms usually rely on a "matrix-lifting" technique, which lifts the phase retrieval problem into a low-rank matrix recovery problem. Using convex relaxation, one can then recast the matrix recovery problem as a convex optimization problem. The corresponding algorithms include PhaseLift [2,4], PhaseCut [33], etc. It has been shown [2] that PhaseLift achieves exact recovery at the optimal sampling complexity with Gaussian random measurements.
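To illustrate the lifting step: writing $X = xx^{\top}$, each magnitude measurement becomes linear in $X$, and the rank-one constraint is relaxed to a semidefinite program; a standard trace-minimization formulation (the programs in [2,4] differ only in details) reads
$$y_i^2 = |\langle a_i, x\rangle|^2 = \langle a_i a_i^{\top}, X\rangle, \qquad \min_{X \succeq 0}\ \operatorname{tr}(X) \ \ \text{s.t.} \ \ \langle a_i a_i^{\top}, X\rangle = y_i^2, \quad i = 1, \dots, m.$$
A rank-one solution $X = xx^{\top}$ then recovers $x$ up to a global sign.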
Although convex methods have good theoretical guarantees of convergence, they tend to be computationally inefficient for large-scale problems. In contrast, many non-convex algorithms bypass the lifting step and operate directly on the lower-dimensional ambient space, making them much more computationally efficient. Early non-convex algorithms were mostly based on the technique of alternating projections, e.g., Gerchberg-Saxton [14] and Fienup [9]. The main drawback, however, is the lack of theoretical guarantees. Later, Netrapalli et al. [25] proposed the AltMinPhase algorithm based on a technique known as spectral initialization. They proved that the algorithm converges linearly to the true solution with $O(n \log n)$ resampling Gaussian random measurements. This work led further to several other non-convex algorithms based on spectral initialization. A common thread is to first choose a good initial guess through spectral initialization, and then solve an optimization model through gradient descent. Two widely used optimization estimators are the intensity-based loss
$$\min_{z \in \mathbb{R}^n}\ F(z) := \frac{1}{2m} \sum_{i=1}^m \left( |\langle a_i, z\rangle|^2 - y_i^2 \right)^2 \tag{1.1}$$
and the amplitude-based loss
$$\min_{z \in \mathbb{R}^n}\ F(z) := \frac{1}{2m} \sum_{i=1}^m \left( |\langle a_i, z\rangle| - y_i \right)^2. \tag{1.2}$$
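For concreteness, here is a minimal numerical sketch of these two losses and their (sub)gradients in the real case, assuming the standard forms displayed above:

```python
import numpy as np

def intensity_loss_grad(z, A, y):
    # Loss (1.1): F(z) = (1/2m) * sum_i ((a_i.z)^2 - y_i^2)^2, real case.
    m = A.shape[0]
    Az = A @ z
    r = Az**2 - y**2
    loss = np.sum(r**2) / (2 * m)
    grad = 2 * A.T @ (r * Az) / m
    return loss, grad

def amplitude_loss_grad(z, A, y):
    # Loss (1.2): F(z) = (1/2m) * sum_i (|a_i.z| - y_i)^2; the gradient below
    # is a subgradient choice at points where a_i.z = 0.
    m = A.shape[0]
    Az = A @ z
    r = np.abs(Az) - y
    loss = np.sum(r**2) / (2 * m)
    grad = A.T @ (r * np.sign(Az)) / m
    return loss, grad
```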
For the intensity-based loss (1.1), the Wirtinger Flow (WF) algorithm [3] was shown to converge linearly to the true solution with $O(n \log n)$ Gaussian random measurements. Chen and Candès [6] improved the result to $O(n)$ Gaussian random measurements by incorporating a truncation, which leads to the novel Truncated Wirtinger Flow (TWF) algorithm. Other methods based on (1.1) include the Gauss-Newton method [11], the trust-region method [29] and the like [17]. For the amplitude flow estimator (1.2), several algorithms have also been developed recently, such as the Truncated Amplitude Flow (TAF) algorithm [35], the Reshaped Wirtinger Flow (RWF) algorithm [37], randomized Kaczmarz methods [18,19,31,36] and the Perturbed Amplitude Flow (PAF) algorithm [10]. These algorithms have been shown to converge linearly to the true solution, up to a global phase, with $O(n)$ Gaussian random measurements. Furthermore, there is ample evidence from numerical simulations that algorithms based on the amplitude flow loss (1.2) tend to outperform those based on the loss (1.1) in empirical success rate and convergence speed. As mentioned earlier, producing a good initial guess by spectral initialization seems to be a prerequisite for prototypical non-convex algorithms to succeed with good theoretical guarantees. A natural and fundamental question is:
Is it possible for non-convex algorithms to achieve successful recovery with a random initialization (i.e., without spectral initialization or any additional truncation)?

For the intensity-based estimator (1.1), the answer is affirmative. In the recent work [29], Sun et al. carried out a deep study of the global geometric structure of the loss function (1.1). They proved that the loss function $F(z)$ does not have any spurious local minima under $O(n \log n)$ Gaussian random measurements. More specifically, it was shown in [29] that all minimizers coincide with the target signal $x$ up to a global phase, and the loss function has a negative directional curvature around each saddle point. Thanks to this benign geometric landscape, any algorithm which can avoid the saddle points converges to the true solution with high probability. A trust-region method was employed in [29] to find the global minimizers with random initialization. To reduce the sampling complexity, it has been shown in [22] that a combination of the loss function (1.1) with a judiciously chosen activation function also possesses the benign geometric structure under $O(n)$ Gaussian random measurements. Recently, a smoothed amplitude flow estimator was proposed in [5], and the authors show that its loss function has a benign geometric structure under the optimal sampling complexity. Numerical tests show that the estimator in [5] yields very stable and fast convergence with random initialization, and performs as well as or even better than existing gradient descent methods with spectral initialization. The emerging concept of a benign geometric landscape has also recently been explored in many other applications in signal processing and machine learning, e.g., matrix sensing [1,26], tensor decomposition [12], dictionary learning [30] and matrix completion [13]. For general optimization problems, there exists a plethora of loss functions with well-behaved geometric landscapes, in the sense that all local optima are also global optima and each saddle point has a negative directional curvature in its vicinity. Correspondingly, several techniques have been developed to guarantee that standard gradient-based optimization algorithms can escape such saddle points efficiently; see e.g. [8,20,21].
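As a concrete illustration of one such technique, below is a simplified sketch of perturbed gradient descent in the spirit of [20]; the step size, gradient threshold, perturbation radius and cooldown are illustrative assumptions rather than the tuned choices of [20].

```python
import numpy as np

def perturbed_gd(grad, x0, eta=1e-2, g_thresh=1e-3, radius=1e-1,
                 cooldown=100, iters=10_000, seed=None):
    # Plain gradient steps; when the gradient is small (a candidate saddle
    # point) and we have not perturbed recently, jump to a uniformly random
    # point in a small ball around the iterate to escape the saddle region.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    last = -cooldown
    for k in range(iters):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and k - last >= cooldown:
            xi = rng.standard_normal(x.shape)
            # uniform radius in a ball: r = R * U^(1/d)
            x = x + radius * rng.random() ** (1 / x.size) * xi / np.linalg.norm(xi)
            last = k
        else:
            x = x - eta * g
    return x
```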
In this paper, we propose two novel perturbed amplitude models (PAMs) and rigorously prove that both loss functions have a benign geometric landscape under the optimal sampling complexity $O(n)$, namely, the loss functions have no spurious local minimizers and have a negative directional curvature around each saddle point. Such properties allow first-order methods like gradient descent to find a global minimum with a random initial guess. We carry out extensive numerical experiments and show that the gradient descent algorithm with random initialization outperforms state-of-the-art algorithms with spectral initialization in empirical success rate and convergence speed. We now give a slightly more detailed summary of the main theoretical results proved in this paper. Consider the loss function, which is akin to the estimator (1.2):

The following theorem shows that the loss function above has a benign geometric structure under the optimal sampling complexity.



The geometric landscape is stated below.

Remark 1.1. In a concurrent work [5], we considered another new smoothed amplitude-based estimator, which is based on a piecewise smooth modification of the amplitude estimator (1.2). The estimator takes the form

where $\gamma(t)$ is taken to be

For $\beta \le 1/2$, we prove that the loss function has a benign landscape under the optimal sampling threshold $m = O(n)$. There are subtle technical difficulties in connection with the piecewise smoothness of the loss function which make the overall proof therein quite special. On the other hand, there is exciting evidence that the machinery developed in this work can be generalized significantly in various directions (including the complex-valued case, etc.). We plan to address some of these important issues in forthcoming works.

The rest of this paper is organized as follows. In Section 2, we establish the benign geometric landscape of the first perturbed amplitude model (PAM1), and in Section 3 we do the same for the second model (PAM2), both under the optimal sampling complexity $O(n)$. In Section 4, we give some numerical experiments to demonstrate the efficiency of our proposed estimators. In the Appendix, we collect the technical lemmas which are used in the proofs.



where $c > 0$, $C > 0$ are constants. The constants $c$ and $C$ are allowed to depend on $\beta$ and the small constants $\epsilon, \delta$ mentioned before. Recall the loss function of the perturbed amplitude model (PAM1) (1.3):

where $\beta > 0$ is a parameter. Here, we denote $a_i$ as $a$ and the target signal as $x$ to alleviate the notation. The global geometric structure of the above empirical loss is stated below.
Remark 2.1. We shall show that most of the statements can be proved with high probability $1 - e^{-cm}$. The only part where the weaker probability $1 - O(m^{-1})$ is used comes in the analysis of the strong convexity near the global minimizers $u = \pm x$ (see e.g. Lemma A.10). This can be refined, but we shall not dwell on it here. In view of this homogeneity and the rotation invariance of the Gaussian distribution, we may assume without loss of generality that $x = e_1$ when studying the landscape of $f(u)$. Thus throughout the rest of the proof we shall assume $x = e_1$.
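In symbols: each summand of the losses considered here depends on $a_i$ only through the inner products $a_i \cdot u$ and $a_i \cdot x$ (this is the assumption behind the reduction). Writing the summand as $\ell(a_i \cdot u, |a_i \cdot x|)$ and using $a_i \overset{d}{=} R a_i$ for any orthogonal $R$, we get
$$f_x(u) = \frac{1}{m} \sum_{i=1}^m \ell\big(a_i \cdot u,\ |a_i \cdot x|\big) \overset{d}{=} \frac{1}{m} \sum_{i=1}^m \ell\big(a_i \cdot (Ru),\ |a_i \cdot (Rx)|\big) = f_{Rx}(Ru),$$
so choosing $R$ with $Rx = \|x\| e_1$ shows that the landscape for a general $x$ is, in distribution, a rotation of the landscape for $x = \|x\| e_1$.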

where $\rho$ is


Proof. To prove this lemma, we need to lower bound the first term and upper bound the last two terms of $\nabla f$. For the first term, by using Bernstein's inequality, we have with high probability,

It immediately gives

For the second term, a simple calculation leads to

Finally, it is easy to derive from (2.3) that

Putting all of the above estimates into (2.2) gives




Proof. By Bernstein's inequality, we have with high probability,

Putting this into (2.2) gives


The point $u = 0$ needs to be treated with care since our loss function $f(u)$ is only Lipschitz at this point. To this end, we define the one-sided directional derivative of $f$ along a direction $\xi \in \mathbb{S}^{n-1}$ as
$$Df(0)[\xi] := \lim_{s \to 0^+} \frac{f(s\xi) - f(0)}{s}. \tag{2.4}$$
It is easy to check that

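For orientation, consider the toy function $f_0(u) = \|u\|$, which is likewise Lipschitz but not differentiable at the origin; its one-sided directional derivative nevertheless exists:
$$D f_0(0)[\xi] = \lim_{s \to 0^+} \frac{\|s\xi\| - 0}{s} = \|\xi\| = 1 \qquad \text{for every } \xi \in \mathbb{S}^{n-1}.$$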


Proof. Clearly, with high probability and uniformly in $\xi \in \mathbb{S}^{n-1}$,

This completes the proof.
In summary, we have the following theorem.


where $Df(0)$ was defined in (2.4).

Proof. This follows from Lemmas 2.1, 2.2 and 2.3.

where $\tilde e \in \mathbb{S}^{n-1}$ satisfies $\tilde e \cdot e_1 = 0$. Clearly, in the regime $\rho \sim 1$, $|t| < 1$, we have the smooth representation
$$u = \psi(\rho, t, \tilde e) := \rho\big(t\, e_1 + \sqrt{1 - t^2}\, \tilde e\big).$$
The following pedestrian proposition shows that the landscape of a smooth function undergoes only mild changes under a smooth change of variables.
Proposition 2.1 (Criteria for no local minimum). In the regime $\rho \sim 1$, $|t| < 1$, consider $f(u) = f(\psi(\rho, t, \tilde e)) =: g(\rho, t, \tilde e)$. Then the following hold:

1. If at some point $|\nabla g| > 0$, then $\|\nabla f\| > 0$ at the corresponding point.
Proof. These easily follow from the formulae
$$\nabla g = (D\psi)^{\top}\, (\nabla f) \circ \psi, \qquad \nabla^2 g = (D\psi)^{\top} \big((\nabla^2 f) \circ \psi\big)\, (D\psi) + \sum_k (\partial_k f \circ \psi)\, \nabla^2 \psi_k,$$
where $\nabla^2 f = (\partial_{ij} f)$ denotes the Hessian matrix of $f$.

Proposition 2.1 allows us to simplify the computation greatly by looking only at the derivatives $\partial_\rho$ and $\partial_t$. We shall use these in the regime $|t| < 1 - \epsilon$, where $0 < \epsilon \ll 1$. Now observe that

where $a \sim \mathcal{N}(0, I_n)$. Denote
Lemma 2.4 (The limiting profile). For any $0 < \eta \ll 1$, the following hold:


Proof. See the appendix.
1. $\|\nabla f(u)\| > 0$;
2. $\nabla f(u) = 0$, and $f$ has a negative directional curvature at this point.

Proof. Denote






Clearly

for $t \ll 1$ such that

Taking $\epsilon > 0$ sufficiently small, we can then guarantee that

The desired result then follows from Proposition 2.1.



where $c(\eta) \to 0$ as $\eta \to 0$.
Proof of Lemma 2.5. Recall

Without loss of generality we assume

It immediately gives

Clearly it holds with high probability that

where $B > 0$ is a constant. For $\rho \le 1$, we have

For $\rho < 1$, it holds with high probability that

Lemma 2.6 ($\nabla f$ is good). Almost surely it holds that

where $\alpha > 0$ is a constant depending only on $(c_1, c_2, \beta)$.

$x = 0$. Now define $\tilde h(x) = h(x)$. Then we can rewrite $\nabla f$ as

It then follows from (2.5) that

Thus, we complete the proof.


where $c(\eta) \to 0$ as $\eta \to 0$.
First observe that

For the $H$ term in Lemma 2.5, we have with high probability that

where $c(\eta) \to 0$ as $\eta \to 0$. Now by Lemma 2.6, it holds with high probability that

Choosing $c(\eta)$ suitably then yields the result. The argument for $\rho \le 1 - c(\eta)$ is similar. We omit the details.

It remains to consider the neighborhood of the global minimizers $u = \pm e_1$, where we show that $f$ must be strictly convex so that $u = \pm e_1$ are the unique minimizers. To this end consider

where $a \sim \mathcal{N}(0, I_n)$.

Theorem 2.5 (Strong convexity of $\mathbb{E}f$ when $\|u \pm e_1\| \ll 1$). Consider $h$ defined by (2.6). There exist $0 < \epsilon \ll 1$ and a positive constant $\gamma$ such that the following hold:

1. If $\|u - e_1\| \le \epsilon$, then for any $\xi \in \mathbb{S}^{n-1}$ it holds

2. If $\|u + e_1\| \le \epsilon$, then for any $\xi \in \mathbb{S}^{n-1}$ it holds

Proof. We shall only consider the case $\|u - e_1\| \ll 1$. The other case $\|u + e_1\| \ll 1$ is similar and therefore omitted. Note that


We now need to make a change of variables. The representation



This establishes the strong convexity of $f(u)$ near the global minimizers $u = \pm e_1$.

We now complete the proof of the main theorem.
Proof of Theorem 2.1. We proceed in several steps.

1. By Theorem 2.2, we see that with high probability the function $f(u)$ has a non-vanishing gradient in the regimes

2. There exists $\epsilon > 0$ sufficiently small such that, with probability at least $1 - O(m^{-1})$, $f(u)$ is strongly convex in the neighborhood $\|u \pm e_1\| \le \epsilon$.

3. By Theorem 2.4, we have that with high probability

This completes the proof.

Note that for $\beta > 0$, the function

is smooth. In particular, we can compute (for each realization) the derivatives of the summands in (3.1) without any problem.
Remark 3.2. Thanks to the regularization term $(a \cdot x)$, our new model (3.1) enjoys a better probability concentration bound $1 - e^{-cm}$ than the model (2.1), for which only the weaker probability concentration $1 - O(m^{-1})$ is proved. Without loss of generality we shall assume $x = e_1$ throughout the rest of the proof.










We next consider the neighborhood of the global minimizers $u = \pm e_1$, where

This establishes the strong convexity of $f(u)$ near the global minimizers $u = \pm e_1$.


In this section, we demonstrate the numerical efficiency of our estimators by simple gradient descent and compare their performance with other competitive algorithms.
In a concurrent work [5], we considered the following piecewise smoothed amplitude flow (SAF) loss:

We have shown theoretically that any gradient descent algorithm will not get trapped in a local minimum for the loss functions above. Here we present numerical experiments to show that the estimators perform very well with a random initial guess.
We use the following vanilla gradient descent algorithm
$$u_{k+1} = u_k - \mu \nabla f(u_k), \qquad k = 0, 1, 2, \dots,$$
with $f(u)$ given above. The pseudocode for the algorithm is as follows.
Algorithm 4.1 Gradient descent algorithm based on our new models.
Input: Measurement vectors $a_i \in \mathbb{R}^n$, $i = 1, \dots, m$; observations $y \in \mathbb{R}^m$; parameter $\beta$; step size $\mu$; tolerance $\epsilon > 0$.
1: Choose a random initial guess $u_0 \in \mathbb{R}^n$.
2: For $k = 0, 1, 2, \dots$: if $\|\nabla f(u_k)\| \ge \epsilon$, update $u_{k+1} = u_k - \mu \nabla f(u_k)$.
3: End for.
Output: The vector $u_T$.
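To make Algorithm 4.1 concrete, here is a minimal sketch in Python for the real Gaussian case. The routine `grad_surrogate` below is only an illustrative smooth amplitude-type placeholder, not the gradient of the PAM losses themselves, which would be plugged in instead; the step size, tolerance and iteration cap follow the choices reported in this section.

```python
import numpy as np

def grad_surrogate(u, A, y, beta=1.0):
    # Gradient of an illustrative smooth amplitude-type surrogate
    #   f(u) = (1/2m) * sum_i (sqrt((a_i.u)^2 + beta) - sqrt(y_i^2 + beta))^2,
    # which is smooth for beta > 0; it stands in for the PAM gradients here.
    m = A.shape[0]
    z = A @ u                                  # inner products a_i . u
    r = np.sqrt(z**2 + beta)                   # smoothed magnitudes
    return A.T @ ((r - np.sqrt(y**2 + beta)) * z / r) / m

def vanilla_gd(A, y, mu=0.6, tol=1e-12, T=2500, seed=None):
    # Algorithm 4.1: gradient descent from a random Gaussian initial guess.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[1])
    for _ in range(T):
        g = grad_surrogate(u, A, y)
        if np.linalg.norm(g) < tol:            # tolerance test on the gradient
            break
        u = u - mu * g                         # u_{k+1} = u_k - mu * grad f(u_k)
    return u

# Usage: try to recover x (up to a global sign) from y_i = |a_i . x|.
n, m = 128, 6 * 128
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
u = vanilla_gd(A, np.abs(A @ x), seed=1)
print(min(np.linalg.norm(u - x), np.linalg.norm(u + x)) / np.linalg.norm(x))
```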
The performance of our PAM1 and PAM2 algorithms is evaluated via a series of numerical experiments in comparison against SAF, Trust Region [30], WF [3], TWF [6] and TAF [35]. Here, it is worth emphasizing that random initialization is used for SAF, Trust Region [30] and our PAM1 and PAM2 algorithms, while all the other algorithms adopt a spectral initialization. Our theoretical results are for the real Gaussian case, but the algorithms can be easily adapted to the complex Gaussian and CDP cases. All experiments are carried out on a MacBook Pro with a 2.3 GHz Intel Core i5 processor and 8 GB 2133 MHz LPDDR3 memory.
The target signal $x \in \mathbb{R}^n$ is chosen randomly from the standard Gaussian distribution, and the measurement vectors $a_i$, $i = 1, \dots, m$, are generated randomly from the standard Gaussian distribution or the CDP model. For the real Gaussian case, the signal $x \sim \mathcal{N}(0, I_n)$ and the measurement vectors $a_i \sim \mathcal{N}(0, I_n)$ for $i = 1, \dots, m$. For the complex Gaussian case, the signal $x \sim \mathcal{N}(0, I_n) + i\mathcal{N}(0, I_n)$ and the measurement vectors $a_i \sim \mathcal{N}(0, I_n/2) + i\mathcal{N}(0, I_n/2)$. For the CDP model, we use masks of octanary patterns as in [3]. For simplicity, our parameters and step sizes are fixed for all experiments. Specifically, we adopt the parameter $\beta = 1/2$ and step size $\mu = 1$ for SAF. We choose the parameter $\beta = 1$, and step sizes $\mu = 0.6$ and $\mu = 2.5$ for PAM1 and PAM2, respectively. For Trust Region, WF, TWF and TAF, we use the codes provided in the original papers with the suggested parameters.

Example 4.1. In this example, we test the empirical success rate of PAM1 and PAM2 versus the number of measurements. We conduct the experiments for the real Gaussian, complex Gaussian and CDP cases, respectively. We choose $n = 128$ and the maximum number of iterations $T = 2500$. For the real and complex Gaussian cases, we vary $m$ within the range $[n, 10n]$. For the CDP case, we set the ratio $m/n = L$ from 2 to 10. For each $m$, we run 100 trials to calculate the success rate. Here, we say a trial has successfully reconstructed the target signal if the relative error satisfies $\mathrm{dist}(u, x)/\|x\| \le 10^{-5}$. The results are plotted in Fig. 1. It can be seen that $6n$ Gaussian phaseless measurements or 7 octanary patterns are enough for exact recovery by PAM2.
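A sketch of this success-rate experiment for the real Gaussian case, reusing `vanilla_gd` from the sketch above; since the loss there is a placeholder rather than PAM1/PAM2, the resulting rates are only indicative. The threshold and trial count follow the text.

```python
import numpy as np

def success_rate(n=128, ratios=range(2, 11), trials=100, tol=1e-5, seed=0):
    # Empirical success rate versus m/n in the spirit of Example 4.1
    # (real Gaussian case), reusing vanilla_gd defined in the sketch above.
    rng = np.random.default_rng(seed)
    rates = {}
    for L in ratios:
        m, ok = L * n, 0
        for _ in range(trials):
            x = rng.standard_normal(n)
            A = rng.standard_normal((m, n))
            u = vanilla_gd(A, np.abs(A @ x), seed=int(rng.integers(2**32)))
            err = min(np.linalg.norm(u - x), np.linalg.norm(u + x))
            ok += err <= tol * np.linalg.norm(x)   # success if relative error small
        rates[L] = ok / trials
    return rates
```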
Figure 1: The empirical success rate for different $m/n$ based on 100 random trials. (a) Success rate for the real Gaussian case; (b) success rate for the complex Gaussian case; (c) success rate for the CDP case.

Figure 2: Relative error versus number of iterations for PAM1, PAM2, SAF, WF, TWF and TAF: (a) real Gaussian case; (b) complex Gaussian case.
Example 4.2. In this example, we compare the convergence rates of PAM1 and PAM2 with those of SAF, WF, TWF and TAF for the real Gaussian and complex Gaussian cases. We choose $n = 128$ and $m = 6n$. The results are presented in Fig. 2. Since the PAM1 and PAM2 algorithms choose a random initial guess according to the standard Gaussian distribution instead of adopting a spectral initialization, they sometimes need to escape the saddle points within a small number of iterations. Due to their high efficiency in escaping the saddle points, they still perform well compared with state-of-the-art algorithms with spectral initialization.

Example 4.3. In this example, we compare the time elapsed and the iterations needed for WF, TWF, TAF, SAF and our PAM1, PAM2 to achieve the relative errors $10^{-5}$ and $10^{-10}$, respectively. We choose $n = 1000$ with $m = 8n$. We adopt the same spectral initialization method for WF, TWF and TAF, where the initial guess is obtained by the power method with 50 iterations. We run 50 trials to calculate the average time elapsed and iteration number for those algorithms. The results are shown in Table 1. The numerical results show that PAM2 takes around 15 and 42 iterations to escape the saddle points for the real and complex Gaussian cases, respectively.
Table 1: Time elapsed and iteration number among algorithms on Gaussian signals with $n = 1000$.
We use $L = 20$ random octanary patterns to obtain the Fourier intensity measurements for each R/G/B channel as in [3]. Table 2 lists the averaged time elapsed and the iterations needed to achieve the relative errors $10^{-5}$ and $10^{-10}$ over the three RGB channels. We can see that our algorithms perform well in comparison with state-of-the-art algorithms with spectral initialization. Furthermore, our algorithms perform well even with $L = 10$ under 300 iterations, while WF fails. Fig. 3 shows the image recovered by PAM2.
Table 2: Time elapsed and iteration number among algorithms on the recovery of the galaxy image.
We consider the noisy model $y_i = |\langle a_i, x\rangle| + \eta_i$ and add different levels of Gaussian noise to explore the relationship between the signal-to-noise ratio (SNR) of the measurements and the mean square error (MSE) of the recovered signal. Specifically, the SNR and MSE are evaluated by

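Assuming the usual dB-scale conventions (consistent with the axes of Fig. 4), these quantities take the form
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{i=1}^m |\langle a_i, x\rangle|^2}{\|\eta\|^2}, \qquad \mathrm{MSE} = 10 \log_{10} \frac{\mathrm{dist}^2(u, x)}{\|x\|^2},$$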
Figure 3: The Milky Way galaxy image: PAM2 with $L = 10$ takes 300 iterations; the computation time is 524.1 s and the relative error is $7.26 \times 10^{-13}$.

Figure 4: SNR versus relative MSE on a dB scale under the noisy Gaussian model: (a) real Gaussian case; (b) complex Gaussian case.
where $u$ is the output of the algorithms given above after 2500 iterations. We choose $n = 128$ and $m = 8n$. The SNR varies from 20 dB to 60 dB. The results are shown in Fig. 4. We can see that our algorithms are stable for noisy phase retrieval.

Appendix A: Auxiliary estimates for Section 2




































As in Appendix A, it is tedious but not difficult to check that the terms (B.3), (B.5) and (B.6) can be easily controlled with the help of Lemma B.3. The term (B.7) can be estimated in a similar way as the estimate of (A.9) in the proof of Lemma A.10 (note that this is done with high probability therein!). The term (B.4) is also easy to handle. We omit further details.

Acknowledgements
J.-F. Cai was supported in part by the Hong Kong Research Grant Council General Research Grant Nos. 16309518, 16309219, 16310620 and 16306821. Y. Wang was supported in part by the Hong Kong Research Grant Council General Research Grant Nos. 16306415 and 16308518.