Jian-Feng Cai, Meng Huang, Dong Li and Yang Wang

1 Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
2 SUSTech International Center for Mathematics and Department of Mathematics, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China
Abstract. A fundamental task in phase retrieval is to recover an unknown signal $x \in \mathbb{R}^n$ from a set of magnitude-only measurements $y_i = |\langle a_i, x\rangle|$, $i = 1, \dots, m$. In this paper, we propose two novel perturbed amplitude models (PAMs) which have non-convex, quadratic-type loss functions. When the measurement vectors $a_i \in \mathbb{R}^n$ are Gaussian random vectors and the number of measurements $m \ge Cn$, we rigorously prove that the PAMs admit no spurious local minimizers with high probability, i.e., the target solution $x$ is the unique local minimizer (up to a global phase) and the loss function has a negative directional curvature around each saddle point. Thanks to the well-tamed benign geometric landscape, one can employ the vanilla gradient descent method to locate the global minimizer $x$ (up to a global phase) without spectral initialization. We carry out extensive numerical experiments to show that the gradient descent algorithm with random initialization outperforms state-of-the-art algorithms with spectral initialization in empirical success rate and convergence speed.
Key words: Phase retrieval, landscape analysis, non-convex optimization.
The basic amplitude model for phase retrieval can be written as
$$y_i = |\langle a_i, x\rangle|, \qquad i = 1, \dots, m,$$
where $x \in \mathbb{R}^n$ is the unknown signal and $a_i \in \mathbb{R}^n$ are the measurement vectors. Note that $x$ and $-x$ generate the same measurements, so recovery is only possible up to this global phase ambiguity.
Many algorithms have been designed to solve the phase retrieval problem, and they can be categorized into convex algorithms and non-convex ones. The convex algorithms usually rely on a "matrix-lifting" technique, which lifts the phase retrieval problem into a low-rank matrix recovery problem. Using convex relaxation, one can then recast the matrix recovery problem as a convex optimization problem. The corresponding algorithms include PhaseLift [2,4], PhaseCut [33], etc. It has been shown [2] that PhaseLift achieves exact recovery at the optimal sampling complexity with Gaussian random measurements.
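To illustrate the lifting step: writing $X = xx^{\top}$, each magnitude measurement becomes linear in $X$, and the rank-one constraint is relaxed to a semidefinite program; a standard trace-minimization formulation (the programs in [2,4] differ only in details) reads
$$y_i^2 = |\langle a_i, x\rangle|^2 = \langle a_i a_i^{\top}, X\rangle, \qquad \min_{X \succeq 0}\ \operatorname{tr}(X) \ \ \text{s.t.} \ \ \langle a_i a_i^{\top}, X\rangle = y_i^2, \quad i = 1, \dots, m.$$
A rank-one solution $X = xx^{\top}$ then recovers $x$ up to a global sign.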
Although convex methods have good theoretical guarantees of convergence, they tend to be computationally inefficient for large-scale problems. In contrast, many non-convex algorithms bypass the lifting step and operate directly on the lower-dimensional ambient space, making them much more computationally efficient. Early non-convex algorithms were mostly based on the technique of alternating projections, e.g., Gerchberg-Saxton [14] and Fienup [9]. The main drawback, however, is the lack of theoretical guarantees. Later, Netrapalli et al. [25] proposed the AltMinPhase algorithm based on a technique known as spectral initialization. They proved that the algorithm converges linearly to the true solution with $O(n \log n)$ resampling Gaussian random measurements. This work led further to several other non-convex algorithms based on spectral initialization. A common thread is to first choose a good initial guess through spectral initialization, and then solve an optimization model through gradient descent. Two widely used optimization estimators are the intensity-based loss
$$\min_{z \in \mathbb{R}^n}\ F(z) := \frac{1}{2m} \sum_{i=1}^m \left( |\langle a_i, z\rangle|^2 - y_i^2 \right)^2 \tag{1.1}$$
and the amplitude-based loss
$$\min_{z \in \mathbb{R}^n}\ F(z) := \frac{1}{2m} \sum_{i=1}^m \left( |\langle a_i, z\rangle| - y_i \right)^2. \tag{1.2}$$
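For concreteness, here is a minimal numerical sketch of these two losses and their (sub)gradients in the real case, assuming the standard forms displayed above:

```python
import numpy as np

def intensity_loss_grad(z, A, y):
    # Loss (1.1): F(z) = (1/2m) * sum_i ((a_i.z)^2 - y_i^2)^2, real case.
    m = A.shape[0]
    Az = A @ z
    r = Az**2 - y**2
    loss = np.sum(r**2) / (2 * m)
    grad = 2 * A.T @ (r * Az) / m
    return loss, grad

def amplitude_loss_grad(z, A, y):
    # Loss (1.2): F(z) = (1/2m) * sum_i (|a_i.z| - y_i)^2; the gradient below
    # is a subgradient choice at points where a_i.z = 0.
    m = A.shape[0]
    Az = A @ z
    r = np.abs(Az) - y
    loss = np.sum(r**2) / (2 * m)
    grad = A.T @ (r * np.sign(Az)) / m
    return loss, grad
```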
For the intensity-based loss (1.1), the Wirtinger Flow (WF) algorithm [3] was shown to converge linearly to the true solution with $O(n \log n)$ Gaussian random measurements. Chen and Candès [6] improved the result to $O(n)$ Gaussian random measurements by incorporating a truncation, which leads to the novel Truncated Wirtinger Flow (TWF) algorithm. Other methods based on (1.1) include the Gauss-Newton method [11], the trust-region method [29] and the like [17]. For the amplitude flow estimator (1.2), several algorithms have also been developed recently, such as the Truncated Amplitude Flow (TAF) algorithm [35], the Reshaped Wirtinger Flow (RWF) algorithm [37], randomized Kaczmarz methods [18,19,31,36] and the Perturbed Amplitude Flow (PAF) algorithm [10]. These algorithms have been shown to converge linearly to the true solution, up to a global phase, with $O(n)$ Gaussian random measurements. Furthermore, there is ample evidence from numerical simulations that algorithms based on the amplitude flow loss (1.2) tend to outperform those based on the loss (1.1) in empirical success rate and convergence speed. As mentioned earlier, producing a good initial guess by spectral initialization seems to be a prerequisite for prototypical non-convex algorithms to succeed with good theoretical guarantees. A natural and fundamental question is:
Is it possible for non-convex algorithms to achieve successful recovery with a random initialization (i.e., without spectral initialization or any additional truncation)?

For the intensity-based estimator (1.1), the answer is affirmative. In the recent work [29], Sun et al. carried out a deep study of the global geometric structure of the loss function (1.1). They proved that the loss function $F(z)$ does not have any spurious local minima under $O(n \log n)$ Gaussian random measurements. More specifically, it was shown in [29] that all minimizers coincide with the target signal $x$ up to a global phase, and the loss function has a negative directional curvature around each saddle point. Thanks to this benign geometric landscape, any algorithm which can avoid the saddle points converges to the true solution with high probability. A trust-region method was employed in [29] to find the global minimizers with random initialization. To reduce the sampling complexity, it has been shown in [22] that a combination of the loss function (1.1) with a judiciously chosen activation function also possesses the benign geometric structure under $O(n)$ Gaussian random measurements. Recently, a smoothed amplitude flow estimator was proposed in [5], and the authors show that its loss function has a benign geometric structure under the optimal sampling complexity. Numerical tests show that the estimator in [5] yields very stable and fast convergence with random initialization, and performs as well as or even better than existing gradient descent methods with spectral initialization. The emerging concept of a benign geometric landscape has also recently been explored in many other applications in signal processing and machine learning, e.g., matrix sensing [1,26], tensor decomposition [12], dictionary learning [30] and matrix completion [13]. For general optimization problems, there exists a plethora of loss functions with well-behaved geometric landscapes, in the sense that all local optima are also global optima and each saddle point has a negative directional curvature in its vicinity. Correspondingly, several techniques have been developed to guarantee that standard gradient-based optimization algorithms can escape such saddle points efficiently; see e.g. [8,20,21].
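As a concrete illustration of one such technique, below is a simplified sketch of perturbed gradient descent in the spirit of [20]; the step size, gradient threshold, perturbation radius and cooldown are illustrative assumptions rather than the tuned choices of [20].

```python
import numpy as np

def perturbed_gd(grad, x0, eta=1e-2, g_thresh=1e-3, radius=1e-1,
                 cooldown=100, iters=10_000, seed=None):
    # Plain gradient steps; when the gradient is small (a candidate saddle
    # point) and we have not perturbed recently, jump to a uniformly random
    # point in a small ball around the iterate to escape the saddle region.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    last = -cooldown
    for k in range(iters):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and k - last >= cooldown:
            xi = rng.standard_normal(x.shape)
            # uniform radius in a ball: r = R * U^(1/d)
            x = x + radius * rng.random() ** (1 / x.size) * xi / np.linalg.norm(xi)
            last = k
        else:
            x = x - eta * g
    return x
```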
In this paper, we propose two novel perturbed amplitude models (PAMs) and rigorously prove that both loss functions have a benign geometric landscape under the optimal sampling complexity $O(n)$, namely, the loss functions have no spurious local minimizers and have a negative directional curvature around each saddle point. Such properties allow first-order methods like gradient descent to find a global minimum with a random initial guess. We carry out extensive numerical experiments and show that the gradient descent algorithm with random initialization outperforms state-of-the-art algorithms with spectral initialization in empirical success rate and convergence speed. We now give a slightly more detailed summary of the main theoretical results proved in this paper. Consider the loss function, which is akin to the estimator (1.2):

The following theorem shows that the loss function above has a benign geometric structure under the optimal sampling complexity.



The geometric landscape is stated below.

Remark 1.1. In a concurrent work [5], we considered another new smoothed amplitude-based estimator, which is based on a piecewise smooth modification of the amplitude estimator (1.2). The estimator takes the form

where $\gamma(t)$ is taken to be

For $\beta \le 1/2$, we prove that the loss function has a benign landscape under the optimal sampling threshold $m = O(n)$. There are subtle technical difficulties in connection with the piecewise smoothness of the loss function which make the overall proof therein quite special. On the other hand, there is exciting evidence that the machinery developed in this work can be generalized significantly in various directions (including the complex-valued case, etc.). We plan to address some of these important issues in forthcoming works.

The rest of this paper is organized as follows. In Section 2, we establish the benign geometric landscape of the first perturbed amplitude model (PAM1), and in Section 3 we do the same for the second model (PAM2), both under the optimal sampling complexity $O(n)$. In Section 4, we give some numerical experiments to demonstrate the efficiency of our proposed estimators. In the Appendix, we collect the technical lemmas which are used in the proofs.



where $c > 0$, $C > 0$ are constants. The constants $c$ and $C$ are allowed to depend on $\beta$ and the small constants $\epsilon, \delta$ mentioned before. Recall the loss function of the perturbed amplitude model (PAM1) (1.3):

where $\beta > 0$ is a parameter. Here, we denote $a_i$ as $a$ and the target signal as $x$ to alleviate the notation. The global geometric structure of the above empirical loss is stated below.
Remark 2.1. We shall show that most of the statements can be proved with high probability $1 - e^{-cm}$. The only part where the weaker probability $1 - O(m^{-1})$ is used comes in the analysis of the strong convexity near the global minimizers $u = \pm x$ (see e.g. Lemma A.10). This can be refined, but we shall not dwell on it here. In view of this homogeneity and the rotation invariance of the Gaussian distribution, we may assume without loss of generality that $x = e_1$ when studying the landscape of $f(u)$. Thus throughout the rest of the proof we shall assume $x = e_1$.
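In symbols: each summand of the losses considered here depends on $a_i$ only through the inner products $a_i \cdot u$ and $a_i \cdot x$ (this is the assumption behind the reduction). Writing the summand as $\ell(a_i \cdot u, |a_i \cdot x|)$ and using $a_i \overset{d}{=} R a_i$ for any orthogonal $R$, we get
$$f_x(u) = \frac{1}{m} \sum_{i=1}^m \ell\big(a_i \cdot u,\ |a_i \cdot x|\big) \overset{d}{=} \frac{1}{m} \sum_{i=1}^m \ell\big(a_i \cdot (Ru),\ |a_i \cdot (Rx)|\big) = f_{Rx}(Ru),$$
so choosing $R$ with $Rx = \|x\| e_1$ shows that the landscape for a general $x$ is, in distribution, a rotation of the landscape for $x = \|x\| e_1$.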

where $\rho$ is


Proof. To prove this lemma, we need to lower bound the first term and upper bound the last two terms of $\nabla f$. For the first term, by using Bernstein's inequality, we have with high probability,

It immediately gives

For the second term, a simple calculation leads to

Finally, it is easy to derive from (2.3) that

Putting all of the above estimates into (2.2) gives




Proof. By Bernstein's inequality, we have with high probability,

Putting this into (2.2) gives


The point $u = 0$ needs to be treated with care since our loss function $f(u)$ is only Lipschitz at this point. To this end, we define the one-sided directional derivative of $f$ along a direction $\xi \in \mathbb{S}^{n-1}$ as
$$Df(0)[\xi] := \lim_{s \to 0^+} \frac{f(s\xi) - f(0)}{s}. \tag{2.4}$$
It is easy to check that

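For orientation, consider the toy function $f_0(u) = \|u\|$, which is likewise Lipschitz but not differentiable at the origin; its one-sided directional derivative nevertheless exists:
$$D f_0(0)[\xi] = \lim_{s \to 0^+} \frac{\|s\xi\| - 0}{s} = \|\xi\| = 1 \qquad \text{for every } \xi \in \mathbb{S}^{n-1}.$$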


Proof. Clearly, with high probability and uniformly in $\xi \in \mathbb{S}^{n-1}$,

This completes the proof.
In summary, we have the following theorem.


where $Df(0)$ was defined in (2.4).

Proof. This follows from Lemmas 2.1, 2.2 and 2.3.

where $\tilde e \in \mathbb{S}^{n-1}$ satisfies $\tilde e \cdot e_1 = 0$. Clearly, in the regime $\rho \sim 1$, $|t| < 1$, we have the smooth representation
$$u = \psi(\rho, t, \tilde e) := \rho\big(t\, e_1 + \sqrt{1 - t^2}\, \tilde e\big).$$
The following pedestrian proposition shows that the landscape of a smooth function undergoes only mild changes under a smooth change of variables.
Proposition 2.1 (Criteria for no local minimum). In the regime $\rho \sim 1$, $|t| < 1$, consider $f(u) = f(\psi(\rho, t, \tilde e)) =: g(\rho, t, \tilde e)$. Then the following hold:

1. If at some point $|\nabla g| > 0$, then $\|\nabla f\| > 0$ at the corresponding point.
Proof. These easily follow from the formulae
$$\nabla g = (D\psi)^{\top}\, (\nabla f) \circ \psi, \qquad \nabla^2 g = (D\psi)^{\top} \big((\nabla^2 f) \circ \psi\big)\, (D\psi) + \sum_k (\partial_k f \circ \psi)\, \nabla^2 \psi_k,$$
where $\nabla^2 f = (\partial_{ij} f)$ denotes the Hessian matrix of $f$.

Proposition 2.1 allows us to simplify the computation greatly by looking only at the derivatives $\partial_\rho$ and $\partial_t$. We shall use these in the regime $|t| < 1 - \epsilon$, where $0 < \epsilon \ll 1$. Now observe that

where $a \sim \mathcal{N}(0, I_n)$. Denote
Lemma 2.4 (The limiting profile). For any $0 < \eta \ll 1$, the following hold:


Proof. See the appendix.
1. $\|\nabla f(u)\| > 0$;
2. $\nabla f(u) = 0$, and $f$ has a negative directional curvature at this point.

Proof. Denote






Clearly

for $t \ll 1$ such that

Taking $\epsilon > 0$ sufficiently small, we can then guarantee that

The desired result then follows from Proposition 2.1.



where $c(\eta) \to 0$ as $\eta \to 0$.
Proof of Lemma 2.5. Recall

Without loss of generality we assume

It immediately gives

Clearly it holds with high probability that

where $B > 0$ is a constant. For $\rho \le 1$, we have

For $\rho < 1$, it holds with high probability that

Lemma 2.6 ($\nabla f$ is good). Almost surely it holds that

where $\alpha > 0$ is a constant depending only on $(c_1, c_2, \beta)$.

$x = 0$. Now define $\tilde h(x) = h(x)$. Then we can rewrite $\nabla f$ as

It then follows from (2.5) that

Thus, we complete the proof.


where $c(\eta) \to 0$ as $\eta \to 0$.
First observe that

For the $H$ term in Lemma 2.5, we have with high probability that

where $c(\eta) \to 0$ as $\eta \to 0$. Now by Lemma 2.6, it holds with high probability that

Choosing $c(\eta)$ suitably then yields the result. The argument for $\rho \le 1 - c(\eta)$ is similar. We omit the details.

It remains to consider the neighborhood of the global minimizers $u = \pm e_1$, where we show that $f$ must be strictly convex so that $u = \pm e_1$ are the unique minimizers. To this end consider

where $a \sim \mathcal{N}(0, I_n)$.

Theorem 2.5 (Strong convexity of $\mathbb{E}f$ when $\|u \pm e_1\| \ll 1$). Consider $h$ defined by (2.6). There exist $0 < \epsilon \ll 1$ and a positive constant $\gamma$ such that the following hold:

1. If $\|u - e_1\| \le \epsilon$, then for any $\xi \in \mathbb{S}^{n-1}$ it holds

2. If $\|u + e_1\| \le \epsilon$, then for any $\xi \in \mathbb{S}^{n-1}$ it holds

Proof. We shall only consider the case $\|u - e_1\| \ll 1$. The other case $\|u + e_1\| \ll 1$ is similar and therefore omitted. Note that


We now need to make a change of variables. The representation



This establishes the strong convexity of $f(u)$ near the global minimizers $u = \pm e_1$.

We now complete the proof of the main theorem.
Proof of Theorem 2.1. We proceed in several steps.

1. By Theorem 2.2, we see that with high probability the function $f(u)$ has a non-vanishing gradient in the regimes

2. There exists $\epsilon > 0$ sufficiently small such that, with probability at least $1 - O(m^{-1})$, $f(u)$ is strongly convex in the neighborhood $\|u \pm e_1\| \le \epsilon$.

3. By Theorem 2.4, we have that with high probability

This completes the proof.

Note that for $\beta > 0$, the function

is smooth. In particular, we can compute (for each realization) the derivatives of the summands in (3.1) without any problem.
Remark 3.2. Thanks to the regularization term $(a \cdot x)$, our new model (3.1) enjoys a better probability concentration bound $1 - e^{-cm}$ than the model (2.1), for which only the weaker probability concentration $1 - O(m^{-1})$ is proved. Without loss of generality we shall assume $x = e_1$ throughout the rest of the proof.










We next consider the neighborhood of the global minimizers $u = \pm e_1$, where

This establishes the strong convexity of $f(u)$ near the global minimizers $u = \pm e_1$.


In this section, we demonstrate the numerical efficiency of our estimators by simple gradient descent and compare their performance with other competitive algorithms.
In a concurrent work [5], we considered the following piecewise smoothed amplitude flow (SAF) loss:

We have shown theoretically that any gradient descent algorithm will not get trapped in a local minimum for the loss functions above. Here we present numerical experiments to show that the estimators perform very well with a random initial guess.
We use the following vanilla gradient descent algorithm
$$u_{k+1} = u_k - \mu \nabla f(u_k), \qquad k = 0, 1, 2, \dots,$$
with $f(u)$ given above. The pseudocode for the algorithm is as follows.
Algorithm 4.1 Gradient descent algorithm based on our new models.
Input: Measurement vectors $a_i \in \mathbb{R}^n$, $i = 1, \dots, m$; observations $y \in \mathbb{R}^m$; parameter $\beta$; step size $\mu$; tolerance $\epsilon > 0$.
1: Choose a random initial guess $u_0 \in \mathbb{R}^n$.
2: For $k = 0, 1, 2, \dots$: if $\|\nabla f(u_k)\| \ge \epsilon$, update $u_{k+1} = u_k - \mu \nabla f(u_k)$.
3: End for.
Output: The vector $u_T$.
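To make Algorithm 4.1 concrete, here is a minimal sketch in Python for the real Gaussian case. The routine `grad_surrogate` below is only an illustrative smooth amplitude-type placeholder, not the gradient of the PAM losses themselves, which would be plugged in instead; the step size, tolerance and iteration cap follow the choices reported in this section.

```python
import numpy as np

def grad_surrogate(u, A, y, beta=1.0):
    # Gradient of an illustrative smooth amplitude-type surrogate
    #   f(u) = (1/2m) * sum_i (sqrt((a_i.u)^2 + beta) - sqrt(y_i^2 + beta))^2,
    # which is smooth for beta > 0; it stands in for the PAM gradients here.
    m = A.shape[0]
    z = A @ u                                  # inner products a_i . u
    r = np.sqrt(z**2 + beta)                   # smoothed magnitudes
    return A.T @ ((r - np.sqrt(y**2 + beta)) * z / r) / m

def vanilla_gd(A, y, mu=0.6, tol=1e-12, T=2500, seed=None):
    # Algorithm 4.1: gradient descent from a random Gaussian initial guess.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[1])
    for _ in range(T):
        g = grad_surrogate(u, A, y)
        if np.linalg.norm(g) < tol:            # tolerance test on the gradient
            break
        u = u - mu * g                         # u_{k+1} = u_k - mu * grad f(u_k)
    return u

# Usage: try to recover x (up to a global sign) from y_i = |a_i . x|.
n, m = 128, 6 * 128
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
u = vanilla_gd(A, np.abs(A @ x), seed=1)
print(min(np.linalg.norm(u - x), np.linalg.norm(u + x)) / np.linalg.norm(x))
```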
The performance of our PAM1 and PAM2 algorithms is evaluated via a series of numerical experiments in comparison against SAF, Trust Region [30], WF [3], TWF [6] and TAF [35]. Here, it is worth emphasizing that random initialization is used for SAF, Trust Region [30] and our PAM1 and PAM2 algorithms, while all the other algorithms adopt a spectral initialization. Our theoretical results are for the real Gaussian case, but the algorithms can be easily adapted to the complex Gaussian and CDP cases. All experiments are carried out on a MacBook Pro with a 2.3 GHz Intel Core i5 processor and 8 GB 2133 MHz LPDDR3 memory.
The target signal $x \in \mathbb{R}^n$ is chosen randomly from the standard Gaussian distribution, and the measurement vectors $a_i$, $i = 1, \dots, m$, are generated randomly from the standard Gaussian distribution or the CDP model. For the real Gaussian case, the signal $x \sim \mathcal{N}(0, I_n)$ and the measurement vectors $a_i \sim \mathcal{N}(0, I_n)$ for $i = 1, \dots, m$. For the complex Gaussian case, the signal $x \sim \mathcal{N}(0, I_n) + i\mathcal{N}(0, I_n)$ and the measurement vectors $a_i \sim \mathcal{N}(0, I_n/2) + i\mathcal{N}(0, I_n/2)$. For the CDP model, we use masks of octanary patterns as in [3]. For simplicity, our parameters and step sizes are fixed for all experiments. Specifically, we adopt the parameter $\beta = 1/2$ and step size $\mu = 1$ for SAF. We choose the parameter $\beta = 1$, and step sizes $\mu = 0.6$ and $\mu = 2.5$ for PAM1 and PAM2, respectively. For Trust Region, WF, TWF and TAF, we use the codes provided in the original papers with the suggested parameters.

Example 4.1. In this example, we test the empirical success rate of PAM1 and PAM2 versus the number of measurements. We conduct the experiments for the real Gaussian, complex Gaussian and CDP cases, respectively. We choose $n = 128$ and the maximum number of iterations $T = 2500$. For the real and complex Gaussian cases, we vary $m$ within the range $[n, 10n]$. For the CDP case, we set the ratio $m/n = L$ from 2 to 10. For each $m$, we run 100 trials to calculate the success rate. Here, we say a trial has successfully reconstructed the target signal if the relative error satisfies $\mathrm{dist}(u, x)/\|x\| \le 10^{-5}$. The results are plotted in Fig. 1. It can be seen that $6n$ Gaussian phaseless measurements or 7 octanary patterns are enough for exact recovery by PAM2.
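A sketch of this success-rate experiment for the real Gaussian case, reusing `vanilla_gd` from the sketch above; since the loss there is a placeholder rather than PAM1/PAM2, the resulting rates are only indicative. The threshold and trial count follow the text.

```python
import numpy as np

def success_rate(n=128, ratios=range(2, 11), trials=100, tol=1e-5, seed=0):
    # Empirical success rate versus m/n in the spirit of Example 4.1
    # (real Gaussian case), reusing vanilla_gd defined in the sketch above.
    rng = np.random.default_rng(seed)
    rates = {}
    for L in ratios:
        m, ok = L * n, 0
        for _ in range(trials):
            x = rng.standard_normal(n)
            A = rng.standard_normal((m, n))
            u = vanilla_gd(A, np.abs(A @ x), seed=int(rng.integers(2**32)))
            err = min(np.linalg.norm(u - x), np.linalg.norm(u + x))
            ok += err <= tol * np.linalg.norm(x)   # success if relative error small
        rates[L] = ok / trials
    return rates
```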
Figure 1: The empirical success rate for different $m/n$ based on 100 random trials. (a) Success rate for the real Gaussian case; (b) success rate for the complex Gaussian case; (c) success rate for the CDP case.

Figure 2: Relative error versus number of iterations for PAM1, PAM2, SAF, WF, TWF and TAF: (a) real Gaussian case; (b) complex Gaussian case.
Example 4.2. In this example, we compare the convergence rates of PAM1 and PAM2 with those of SAF, WF, TWF and TAF for the real Gaussian and complex Gaussian cases. We choose $n = 128$ and $m = 6n$. The results are presented in Fig. 2. Since the PAM1 and PAM2 algorithms choose a random initial guess according to the standard Gaussian distribution instead of adopting a spectral initialization, they sometimes need to escape the saddle points within a small number of iterations. Due to their high efficiency in escaping the saddle points, they still perform well compared with state-of-the-art algorithms with spectral initialization.

Example 4.3. In this example, we compare the time elapsed and the iterations needed for WF, TWF, TAF, SAF and our PAM1, PAM2 to achieve the relative errors $10^{-5}$ and $10^{-10}$, respectively. We choose $n = 1000$ with $m = 8n$. We adopt the same spectral initialization method for WF, TWF and TAF, where the initial guess is obtained by the power method with 50 iterations. We run 50 trials to calculate the average time elapsed and iteration number for those algorithms. The results are shown in Table 1. The numerical results show that PAM2 takes around 15 and 42 iterations to escape the saddle points for the real and complex Gaussian cases, respectively.
Table 1: Time elapsed and iteration number among algorithms on Gaussian signals with $n = 1000$.
We use $L = 20$ random octanary patterns to obtain the Fourier intensity measurements for each R/G/B channel as in [3]. Table 2 lists the averaged time elapsed and the iterations needed to achieve the relative errors $10^{-5}$ and $10^{-10}$ over the three RGB channels. We can see that our algorithms perform well in comparison with state-of-the-art algorithms with spectral initialization. Furthermore, our algorithms perform well even with $L = 10$ under 300 iterations, while WF fails. Fig. 3 shows the image recovered by PAM2.
Table 2: Time elapsed and iteration number among algorithms on the recovery of the galaxy image.
We consider the noisy model $y_i = |\langle a_i, x\rangle| + \eta_i$ and add different levels of Gaussian noise to explore the relationship between the signal-to-noise ratio (SNR) of the measurements and the mean square error (MSE) of the recovered signal. Specifically, the SNR and MSE are evaluated by

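Assuming the usual dB-scale conventions (consistent with the axes of Fig. 4), these quantities take the form
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{i=1}^m |\langle a_i, x\rangle|^2}{\|\eta\|^2}, \qquad \mathrm{MSE} = 10 \log_{10} \frac{\mathrm{dist}^2(u, x)}{\|x\|^2},$$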
Figure 3: The Milky Way galaxy image: PAM2 with $L = 10$ takes 300 iterations; the computation time is 524.1 s and the relative error is $7.26 \times 10^{-13}$.

Figure 4: SNR versus relative MSE on a dB scale under the noisy Gaussian model: (a) real Gaussian case; (b) complex Gaussian case.
where $u$ is the output of the algorithms given above after 2500 iterations. We choose $n = 128$ and $m = 8n$. The SNR varies from 20 dB to 60 dB. The results are shown in Fig. 4. We can see that our algorithms are stable for noisy phase retrieval.

Appendix A: Auxiliary estimates for Section 2




































As in Appendix A, it is tedious but not difficult to check that the terms (B.3), (B.5) and (B.6) can be easily controlled with the help of Lemma B.3. The term (B.7) can be estimated in a similar way as the estimate of (A.9) in the proof of Lemma A.10 (note that this is done with high probability therein!). The term (B.4) is also easy to handle. We omit further details.

Acknowledgements
J.-F. Cai was supported in part by the Hong Kong Research Grant Council General Research Grant Nos. 16309518, 16309219, 16310620 and 16306821. Y. Wang was supported in part by the Hong Kong Research Grant Council General Research Grant Nos. 16306415 and 16308518.