
CONVERGENCE OF CONTROLLED MODELS FOR CONTINUOUS-TIME MARKOV DECISION PROCESSES WITH CONSTRAINED AVERAGE CRITERIA

Annals of Applied Mathematics, 2019, No. 4

Wenzhao Zhang, Xianzhu Xiong

(College of Math. and Computer Science, Fuzhou University, Fuzhou 350108, Fujian, PR China)

Abstract This paper studies the convergence of optimal values and optimal policies of continuous-time Markov decision processes (CTMDP for short) under constrained average criteria. For a given original model M∞ of CTMDP with denumerable states and a sequence {Mn} of CTMDP with finite states, we give a new convergence condition to ensure that the optimal values and optimal policies of {Mn} converge to the optimal value and optimal policy of M∞, respectively, as the state space Sn of Mn converges to the state space S∞ of M∞. The transition rates and cost/reward functions of M∞ are allowed to be unbounded. Our approach can be viewed as a combination of the linear programming and Lagrange multiplier methods.

Keywords continuous-time Markov decision processes; optimal value; optimal policies; constrained average criteria; occupation measures

1 Introduction

Markov decision processes have wide applications in queueing systems, telecommunications systems, etc.; see, for instance, [2,11,13,16,18] and the references therein. The existence and computation of optimal values and optimal policies form an active research area in Markov decision processes. The basic methods for studying the existence of optimal policies include the dynamic programming approach and the linear programming and duality methods. Based on the above methods, value iteration algorithms, policy iteration algorithms, linear programming algorithms for unconstrained optimality problems and linear programming algorithms for constrained optimality problems have been proposed; see, for instance, [4,11,13,15]. However, these algorithms are only suited to optimality problems with finite state spaces. It is therefore natural to use finite-state models to approximate an original model with denumerable state space or general Borel state space. Hence, from both the theoretical and the practical point of view, the convergence of optimal values and optimal policies is an important and interesting issue in Markov decision processes.
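To make the finite-state case concrete, the following sketch sets up the occupation-measure linear program for a finite-state, finite-action constrained average CTMDP, which is the type of problem that the approximating models Mn introduced below reduce to. It is only an illustrative sketch under simplifying assumptions (finite action sets, one reward and p cost functions given as arrays); the function name, the use of scipy.optimize.linprog and the uniform tie-breaking on states with zero occupation mass are our own choices, not part of the development in this paper.

```python
import numpy as np
from scipy.optimize import linprog

def constrained_average_ctmdp_lp(q, r, c, d):
    """Occupation-measure LP for a finite constrained average CTMDP (sketch).

    q : (S, A, S) array of transition rates q(j|i,a), conservative in the
        sense that q[i, a, :].sum() == 0 for every admissible pair (i, a).
    r : (S, A) array, reward to be maximized.
    c : (p, S, A) array of cost functions; d : length-p constraint levels.
    Returns an optimal average occupation measure mu(i, a) and the induced
    randomized stationary policy phi(a | i) = mu(i, a) / mu(i, A).
    """
    S, A = r.shape
    n = S * A
    # linprog minimizes, so maximize the average reward by minimizing -r.
    obj = -np.asarray(r, dtype=float).reshape(n)
    # Balance equations sum_{i,a} q(j|i,a) mu(i,a) = 0 for every state j,
    # together with the normalization sum_{i,a} mu(i,a) = 1.
    A_eq = np.vstack([np.asarray(q, dtype=float)[:, :, j].reshape(n)
                      for j in range(S)] + [np.ones(n)])
    b_eq = np.append(np.zeros(S), 1.0)
    # Average-cost constraints sum_{i,a} c_l(i,a) mu(i,a) <= d_l.
    A_ub = np.asarray(c, dtype=float).reshape(len(d), n)
    res = linprog(obj, A_ub=A_ub, b_ub=np.asarray(d, dtype=float),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n,
                  method="highs")
    if res.status != 0:
        raise ValueError("LP did not converge or is infeasible: " + res.message)
    mu = res.x.reshape(S, A)
    marginal = mu.sum(axis=1, keepdims=True)
    # Randomized stationary policy; uniform on states with zero occupation mass.
    phi = np.where(marginal > 1e-12, mu / np.maximum(marginal, 1e-12), 1.0 / A)
    return mu, phi
```

Integrating the reward against the returned occupation measure gives the optimal constrained average reward of the finite model; the question studied below is when such finite-model quantities and policies converge to those of the original denumerable-state model M∞.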

In the discrete-time context, [3] considered the convergence of optimal values and optimal policies of Markov decision processes with denumerable states under the constrained expected discounted cost criteria. [5,6] developed approximation methods for the optimal values and optimal policies of Markov decision processes with Borel state and action spaces under the constrained expected discounted cost criteria.

In the continuous-time formulation, [16] studied the convergence of optimal values and optimal policies of Markov decision processes with denumerable states under the expected discounted cost and average cost criteria. [17,18] developed approximation procedures for CTMDP with denumerable state space under the finite-horizon expected total cost criterion and the risk-sensitive finite-horizon cost criterion, respectively. For constrained optimal problems, [12] proposed an approach based on occupation measures to study the convergence of optimal values and optimal policies, and gave conditions imposed on an original model with denumerable states to ensure that it can be approximated by a sequence of CTMDP with finite states.

In this paper, we consider a convergence problem similar to that in [12], with denumerable states but under constrained expected average criteria. More precisely, the original controlled model has the following features: 1) the state space is denumerable and the action space is a Polish space; 2) the transition rates, cost and reward functions may be unbounded from above and from below. Firstly, by introducing average occupation measures and Lagrange multipliers, we prove that the constrained optimality problem of each model Mn of CTMDP is equivalent to an unconstrained optimality problem, and deduce the corresponding optimality equation, which involves Lagrange multipliers. These results extend the results in [16] for the constrained optimality problem with one constraint. Then, we derive a bound for the Lagrange multipliers in each model Mn. Secondly, using the optimality equations, we give an explicit bound on the difference between the optimal values of the finite-state model Mn and the original model M∞. Finally, using some approximation properties of the expected average reward/cost, we obtain the asymptotic convergence of the optimal policies of the finite-state models to an optimal policy of the original model.

The rest of the paper is organized as follows. In Section 2, we introduce the constrained average model we are concerned with. In Section 3, we deduce the optimality equation of each constrained model Mn and give error bounds for the Lagrange multipliers in the optimality equations. In Section 4, we obtain the asymptotic convergence of the optimal values of the finite-state models to the optimal value of the original model, and the asymptotic convergence of the optimal policies of the finite-state models to an optimal policy of the original model.

2 The Models

In this section we introduce the models we are concerned with.

Notation If X is a Polish space, we denote by B(X) its Borel σ-algebra, by P(X) the set of probability measures on B(X) endowed with the topology of weak convergence, by B_b(X) the Banach space of all bounded measurable functions on X, and by C_b(X) the Banach space of all bounded continuous functions on X. Given a measurable function w ≥ 1 on X, let Bw(X) denote the Banach space of all measurable functions u on X with finite w-norm ∥u∥_w := sup_{x∈X} |u(x)|/w(x).

Consider the sequence of models {Mn} for constrained CTMDP:
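Each model is understood as the usual collection of primitives; the arrangement of the tuple below follows the standard formulation in [11,16]:

$$M_n=\big\{S_n,\;A,\;(A(i),\,i\in S_n),\;q_n(\cdot\mid i,a),\;r_n(i,a),\;(c_n^l(i,a),\,d_n^l,\,1\le l\le p),\;\gamma_n\big\},\qquad n\in\mathbb{N}\cup\{\infty\},\tag{2.1}$$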

where Sn is the state space. We assume Sn := {0,1,···,n} for each n ∈ N and S∞ := {0,1,···}. As a consequence, for each i ∈ S∞, we can define n(i) := min{n ≥ 1 : i ∈ Sn}. The set A is the action space, which is assumed to be a Polish space, and the set A(i) ∈ B(A) represents the set of all available actions or decisions at state i ∈ Sn for each n ∈ N ∪ {∞}. Let Kn := {(i,a) : i ∈ Sn, a ∈ A(i)} represent the set of all feasible state-action pairs. For fixed n ∈ N ∪ {∞}, the function qn(·|i,a) in (2.1) denotes the transition rates, that is, qn(j|i,a) ≥ 0 for all (i,a) ∈ Kn and j ≠ i. Furthermore, qn(·|i,a) is assumed to be conservative, that is, ∑_{j∈Sn} qn(j|i,a) = 0 for all (i,a) ∈ Kn,

and stable, that is, q*_n(i) := sup_{a∈A(i)} [−qn(i|i,a)] < ∞ for each i ∈ Sn.

Moreover, qn(j|i,a) is a measurable function on A(i) for each fixed i, j ∈ Sn. Finally, rn corresponds to the reward function that is to be maximized, and c^l_n corresponds to the cost function on which the constraint d^l_n is imposed, for each 1 ≤ l ≤ p. The measure γn denotes the initial distribution for Mn.

Next, we briefly recall the construction of the stochastic basis (Ωn, Fn, {Ft,n}t≥0, P^π_{γn,n}) for each n ∈ N. Let i∞ ∉ S∞ be an isolated point. Thus, we obtain the sample space (Ωn, Fn), where Fn is the standard Borel σ-algebra. For each m ≥ 1 and each sample ω = (i0, θ1, i1, ···, θm, im, ···) ∈ Ωn, we define the maps T0(ω) := 0, Tm(ω) := θ1 + ··· + θm, T∞(ω) := lim_{m→∞} Tm(ω), Θm(ω) := Tm(ω) − Tm−1(ω) and Xm(ω) := im on Ωn. Here, Θm, Tm, Xm denote the sojourn time, the jump moment and the state of the process on the interval [Tm, Tm+1), respectively. Define a process {ξt, t ≥ 0} on (Ωn, Fn) by ξt(ω) := ∑_{m≥0} 1{Tm ≤ t < Tm+1} im + 1{t ≥ T∞} i∞.

In what follows, hn(ω) := (i0, θ1, i1, ···, θn, in) is the n-component internal history; the argument ω = (i0, θ1, i1, ···, θn, in, ···) ∈ Ωn is often omitted. Since we do not consider the process after T∞, the state i∞ is regarded as absorbing. Define A(i∞) := {a∞}, where a∞ ∉ A is an isolated point, Ā := A ∪ {a∞} and q(i∞|i∞, a∞) := 0. Let Ft,n := σ({Tm ≤ s, Xm = j} : j ∈ Sn, s ≤ t, m ≥ 0) be the internal history up to time t for the model Mn, and let Pn := σ(B × (s,∞) : B ∈ Fs−,n, s > 0) denote the predictable σ-algebra on Ωn × (0,∞).

Below we introduce the concept of policies. Let Φn denote the set of all stochastic kernels φ on Ā given Sn ∪ {i∞} such that φ(A(i)|i) = 1 for each i.

Definition 2.1 (i) A Pn-measurable transition probability function π(·|ω,t) on (Ā, B(Ā)), concentrated on A(ξt−(ω)), is called a randomized Markov policy if there exists φ(·|·,t) ∈ Φn for each t > 0 such that π(·|ω,t) = φ(·|ξt−(ω), t).

(ii) A randomized Markov policy π is said to be randomized stationary if there exists a stochastic kernel φ ∈ Φn such that π(·|ω,t) = φ(·|ξt−(ω)) for each t > 0. Such policies are denoted by φ.

We denote by Πn the family of all randomized Markov policies of Mn and by Φn the set of all randomized stationary policies. For each given n ∈ N and policy π ∈ Πn, according to Theorem 4.27 in [14], there exists a unique probability measure P^π_{γn,n} on (Ωn, Fn). Expectation with respect to P^π_{γn,n} is denoted by E^π_{γn,n}. When γn(i) = 1, we write P^π_{i,n} for P^π_{γn,n} and E^π_{i,n} for E^π_{γn,n}, respectively.

To guarantee that the state process {ξt, t ≥ 0} of each model Mn is non-explosive, we impose the following so-called drift conditions.

Assumption 2.1 There exist a nondecreasing function w ≥ 1 on S∞, constants κ1 ≥ ρ1 > 0, L > 0 and a finite set Cn ⊂ Sn for each n ∈ N ∪ {∞} such that

Remark 2.1 Under Assumptions 2.1(a)-(c), it follows from Theorem 2.13 in [16] that, for each 0 < τ < 2, there exist constants such that

The expected average criteria, for each given policy π ∈ Πn and initial state i ∈ Sn, are defined as follows:
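In line with the standard average-criteria formulation (see, e.g., [11,16]), we write the criteria as J^n_l(i,π); they take the form

$$J_l^n(i,\pi):=\limsup_{T\to\infty}\frac{1}{T}\,E^{\pi}_{i,n}\!\left[\int_0^T\!\!\int_{A}c_l^n(\xi_t,a)\,\pi(da\mid\omega,t)\,dt\right],\qquad i\in S_n,\ 0\le l\le p,$$

with the convention c^n_0 := rn and J^n_l(γn,π) := ∑_{i∈Sn} γn(i) J^n_l(i,π); the limsup convention above is the usual one (for the reward criterion l = 0, a liminf convention is also common).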

Definition 2.2 (i) For any n ∈ N ∪ {∞}, a policy π ∈ Un is called a (constrained) optimal policy of Mn if J^n_0(γn,π) = sup_{π′∈Un} J^n_0(γn,π′), where Un := {π′ ∈ Πn : J^n_l(γn,π′) ≤ d^l_n for each 1 ≤ l ≤ p} denotes the set of feasible policies of Mn.

(ii) A sequence {φn} with φn ∈ Φn for each n ∈ N is said to converge weakly to φ ∈ Φ∞ if the sequence {φn(·|i)} converges weakly to φ(·|i) in P(A(i)) for each i ∈ S∞ and n ≥ n(i). We denote this by φn → φ.
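Accordingly, the constrained optimality problem for Mn is to maximize the average reward over the feasible policies, and we write its optimal value as (the symbol V*(n) is introduced here only for readability)

$$V^*(n):=\sup_{\pi\in U_n}J_0^n(\gamma_n,\pi)=\sup\big\{J_0^n(\gamma_n,\pi):\pi\in\Pi_n,\ J_l^n(\gamma_n,\pi)\le d_n^l,\ 1\le l\le p\big\},\qquad n\in\mathbb{N}\cup\{\infty\}.$$

Theorem 4.1 below concerns the convergence of V*(n) to V*(∞), and Theorem 4.2 the convergence of the corresponding optimal policies.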

3 Preliminary Results

Assumption 3.1 (a) For each i ∈ Sn, A(i) is a compact set.

(b) The functions qn(j|i,a), rn(i,a) and c^l_n(i,a) are continuous in a ∈ A(i), for each fixed i, j ∈ Sn and 1 ≤ l ≤ p.

(c) There exists a constant M > 0 such that |rn(i,a)| ≤ M w(i) and |c^l_n(i,a)| ≤ M w(i) for all (i,a) ∈ Kn and 1 ≤ l ≤ p.

Under Assumptions 2.1 and 3.1(c), we can obtain the finiteness of the expected average criteria.

Lemma 3.1 [11,16] Suppose that Assumptions 2.1 and 3.1(c) hold. Then

To obtain our main results, we consider the following assumptions:

Assumption 3.2 (a) for each i, j ∈ S∞, where n ≥ max{n(i), n(j)};

Assumption 3.3 For each n ∈ N ∪ {∞} and each stationary policy φ ∈ Φn, the corresponding Markov process ξt in model Mn is irreducible.

Remark 3.2 (i) Under Assumptions 2.1 and 3.3, Theorem 2.5 in [16] yields that for each n ∈ N ∪ {∞} and each stationary policy φ ∈ Φn, the Markov chain {ξt} has a unique invariant probability measure, denoted by μ^n_φ.

(ii) Under Assumptions 2.1, 3.1 and 3.3, Theorem 2.11 in [16] implies that the control model (2.1) is uniformly w-exponentially ergodic, that is, there exist constants δn > 0 and βn > 0 such that |E^φ_{i,n}[u(ξt)] − μ^n_φ(u)| ≤ δn e^{−βn t} ∥u∥_w w(i) for all u ∈ Bw(Sn), i ∈ Sn, t ≥ 0 and stationary policies φ ∈ Φn, where μ^n_φ(u) := ∑_{j∈Sn} u(j) μ^n_φ(j).

(iii) Under Assumptions 2.1, 3.1 and 3.3, for each 0 ≤ l ≤ p and each stationary policy φ ∈ Φn, the average criterion J^n_l(i,φ) is a constant that does not depend on the initial state i; more precisely, J^n_l(i,φ) = ∑_{j∈Sn} μ^n_φ(j) ∫_{A(j)} c^n_l(j,a) φ(da|j) for all i ∈ Sn, where c^n_0 := rn.

Assumption 3.4 (Slater condition) There exist a policy π̃ ∈ Π∞ and a constant θ > 0 such that J^∞_l(γ∞, π̃) ≤ d^l_∞ − θ for each 1 ≤ l ≤ p.

Lemma 3.2 Suppose Assumptions 2.1, 3.1, 3.2(a)-(b) and 3.3 hold. Then, for each 0 ≤ l ≤ p,

This statement has been established by Theorem 4.21 in [16] for deterministic stationary policies. It is easy to extend the result to the class of randomized stationary policies.

In particular, under Assumptions 2.1, 3.1 and 3.3, the representation in Remark 3.2(iii) holds for each n ∈ N ∪ {∞} and each stationary policy φ ∈ Φn. Now, we introduce the following sets of average occupation measures and of feasible average occupation measures.
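The displays defining these sets follow the standard occupation-measure formulation for average criteria (cf. [16, Chapter 8]); the presentation below is one standard way of writing them, with Pw(Kn) and Λn as in the text, while the symbol Λ^c_n for the constrained set is ours:

$$P_w(K_n):=\Big\{\mu\in P(K_n):\ \int_{K_n}w\,d\mu<\infty\Big\},$$

$$\Lambda_n:=\Big\{\mu\in P_w(K_n):\ \sum_{i\in S_n}\int_{A(i)}q_n(j\mid i,a)\,\mu(i,da)=0\ \text{for all }j\in S_n\Big\},$$

$$\Lambda_n^{c}:=\Big\{\mu\in\Lambda_n:\ \int_{K_n}c_n^l\,d\mu\le d_n^l\ \text{for each }1\le l\le p\Big\}.$$

By Remark 3.2, each stationary policy φ ∈ Φn induces an element μ(i,da) = μ^n_φ(i)φ(da|i) of Λn, and Lemma 3.5 below provides the converse direction.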

Definition 3.1 The w-weak topology on Pw(Kn) is the coarsest topology for which all mappings μ ↦ ∫_{Kn} f dμ, with f continuous on Kn and |f| ≤ c·w on Kn for some constant c, are continuous.

The following lemma characterizes the w-weak topology; for a proof, see [8, Corollary A.45].

Lemma 3.3 A sequence {μm} ⊂ Pw(Kn) converges w-weakly to μ if and only if ∫ f dμm → ∫ f dμ for every measurable function f which is μ-a.e. continuous on Kn and for which there exists a constant c such that |f| ≤ c·w μ-almost everywhere. In this case, we write μm →w μ.

Lemma 3.4 Suppose that Assumptions 2.1, 3.1 and 3.3 hold. Then, the sets Λn and Λ^c_n are convex, compact and closed under the w-weak topology for each n ∈ N ∪ {∞}.

Proof Under Assumptions 2.1, 3.1 and 3.3, it follows from Lemma 8.12 in [16] that Λn is convex and compact. Let {μm} ⊂ Λn with μm →w μ; by the definition of w-weak convergence and Assumption 3.1(b), it follows from Lemma 8.11 in [16] that

which implies that μ ∈ Λn. Hence, Λn is closed. Furthermore, suppose that μm ∈ Λ^c_n for each m and μm →w μ; then it follows that

for each 1 ≤ l ≤ p, which implies that μ ∈ Λ^c_n and that Λ^c_n is closed. Now, suppose that μ1, μ2 ∈ Λ^c_n and λ ∈ (0,1); we have that

for each 1 ≤ l ≤ p, which implies the convexity of Λ^c_n. The proof is completed.

Lemma 3.5 [10] Suppose that Assumptions 2.1, 3.1 and 3.3 hold. Then, for each μ ∈ Λn there exists a stationary policy φ ∈ Φn such that μ is the average occupation measure induced by φ and the corresponding average criteria of φ coincide with the integrals of the cost functions against μ for each 1 ≤ l ≤ p.

By Assumption 3.4 and Lemma 3.5, there exists a stationary policy that satisfies the constraints of M∞ with strict inequality for each 1 ≤ l ≤ p. Next, we need to introduce some notation:

Lemma 3.6 Suppose that Assumptions 2.1 and 3.1-3.4 hold. Then there exist an integer N* and, for each n ≥ N*, an element μn ∈ Λn such that

Proof For each fixed δ < θ, it follows from Lemma 3.2 and Assumption 3.2(c) that there exists an integer N such that, for each n ≥ N, we have ε1(n) + ε2(n) ≤ δ and

for each 1 ≤ l ≤ p. Hence,

The following lemma gives the optimality equation for the constrained problem, which has been established by Theorem 8.13 in [16] for the case of a single constraint. For completeness, we extend Theorem 8.13 in [16] to any finite number of constraints. First, we need to introduce some notation. For each x = (x1,···,xp) and y = (y1,···,yp) in R^p, we define ⟨x,y⟩ := ∑_{k=1}^{p} xk yk and say x ≤ y if xk ≤ yk for each 1 ≤ k ≤ p.

The proof of the following lemma is similar to that of Theorem 4.10 in [5] for discrete-time MDP; we prove it here only for completeness.

Lemma 3.7 Suppose that Assumptions 2.1 and 3.1-3.4 hold. Then, for each n,

(i) there exist a function hn ∈ Bw(Sn) and a vector λn ∈ (−∞,0]^p such that

where cn(i,a) − dn denotes the p-dimensional vector whose components are c^l_n(i,a) − d^l_n for each 1 ≤ l ≤ p;

(ii) there exists a stationary optimal policy of Mn.
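For orientation, the optimality equation in part (i) has the form obtained by applying the unconstrained average-reward optimality equation of [16, Theorem 3.20] to the Lagrangian reward rn(i,a) + ⟨λn, cn(i,a) − dn⟩; written in this standard form (the constant gn below, denoting the associated optimal average value, is our notation), it reads

$$g_n=\sup_{a\in A(i)}\Big\{r_n(i,a)+\big\langle\lambda_n,\,c_n(i,a)-d_n\big\rangle+\sum_{j\in S_n}q_n(j\mid i,a)\,h_n(j)\Big\}\qquad\text{for all }i\in S_n,$$

where ⟨·,·⟩ is the inner product on R^p introduced above.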

Proof (i) Let n be fixed and

For each z = (z1, z2, ···, zp) ∈ On, we define

and

Next, we need to show that Un(z) is concave in z. Let zk ∈ On for each k = 1, 2. There exist μ1 ∈ Nn(z1) and μ2 ∈ Nn(z2) such that the suprema defining Un(z1) and Un(z2) are attained at μ1 and μ2, respectively. Let λ ∈ (0,1). We have

and

which implies that λμ1 + (1 − λ)μ2 ∈ Nn(λz1 + (1 − λ)z2). Thus, it follows that Un(λz1 + (1 − λ)z2) ≥ λUn(z1) + (1 − λ)Un(z2), that is, Un(z) is concave on On.

For arbitrary z1, z2 ∈ On satisfying z^l_1 ≤ z^l_2 for each 1 ≤ l ≤ p, we have Un(z1) ≤ Un(z2). By Lemma 3.6, dn lies in the interior of On. Then, it follows from Theorem 7.12 in [1] that Un admits a supporting hyperplane at dn, determined by some vector and valid for each z ∈ On. Take z to be the vector of cost integrals of some μ ∈ Λn. Then,

for each μ ∈ Λn. That is,

which implies that

Under Assumptions 2.1, 3.1 and 3.3, by (3.6), Theorem 3.20 in [16] and Remark 3.2(ii), there exists hn ∈ Bw(Sn) such that

for each 1 ≤ l ≤ p; then we have

On the other hand, by the definition of Un(dn), we have the reverse inequality. Hence, the desired equality follows. (ii) By Assumption 3.1(b) and Lemma 3.4, there exist a probability measure in Λ^c_n and a stationary policy such that

The proof of the following lemma is similar to that of Theorem 8.14 in [16] with one constraint; we prove it here for the sake of completeness.
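For orientation, the function Tn(μ,λ) appearing in the proof below is the Lagrangian of the constrained linear program over average occupation measures; in the standard formulation (cf. [16, Chapter 8]) it is, for μ ∈ Λn and λ ∈ (−∞,0]^p,

$$T_n(\mu,\lambda):=\int_{K_n}r_n\,d\mu+\sum_{l=1}^{p}\lambda_l\Big(\int_{K_n}c_n^l\,d\mu-d_n^l\Big),$$

so that, for feasible μ (that is, μ ∈ Λ^c_n), the infimum over λ ∈ (−∞,0]^p equals ∫_{Kn} rn dμ, while for infeasible μ it equals −∞; Lemma 3.8 records the resulting saddle-point identities relating sup_μ inf_λ Tn, inf_λ sup_μ Tn and the constrained optimal value.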

Lemma 3.8 Suppose that Assumptions 2.1 and 3.1-3.4 hold. Then, for each n, there exist a vector λn ∈ (−∞,0]^p and a stationary policy of Mn such that

Proof Let λn be the vector given in Lemma 3.7; then we can get the third equality. For each fixed n, suppose that μ ∈ Λn is such that the l-th constraint is violated, that is, ∫_{Kn} c^l_n dμ > d^l_n. It follows that

Hence,

which together with Lemma 3.7 implies the first equality.

Hence, Tn(μ,λ) is convex in μ for each fixed λ.

which implies that Tn(μ,λ) is upper semi-continuous in μ for each fixed λ.

For each fixed μ ∈ Λn, let λ1, λ2 ∈ (−∞,0]^p. It follows that

which implies that Tn(μ,λ) is concave in λ for each μ. By the minimax theorem in [2, p.129] or [7, Theorem 2], we have

that is

which implies the second equality. By Lemma 3.7, there exists a stationary policy such that the optimality equation in Lemma 3.7(i) holds; then

Hence, the last inequality holds. The proof of Lemma 3.8 is finished.

4 The Main Results

Theorem 4.1 Suppose that Assumptions 2.1 and 3.1-3.4 hold. Then,

Then, we have

As in Lemma 3.6, there exists a stationary policy satisfying the required constraint inequalities for each 1 ≤ l ≤ p. Similarly, we have

which implies that

Under Assumptions 2.1, 3.1 and 3.3, by Lemma 3.8, there exists λn such that

By (4.1) and (4.3), there exists φ ∈ Φ∞ attaining the corresponding supremum. Let φ|Sn denote the restriction of φ to the set Sn; then

Similarly, by Lemma 3.8, there exists λ∞ such that

It follows from (4.2) and (4.4)-(4.5) that

Hence, there exists an integer N such that the above bound holds for each n ≥ N, which implies the desired result. The proof is completed.

Lemma 4.1 [11] Suppose that Assumptions 2.1, 3.1 and 3.3 hold. For any fixed initial state, if there exists a sequence of stationary policies φn ∈ Φn such that φn → φ, then the corresponding average criteria converge accordingly for each 0 ≤ l ≤ p.

Theorem 4.2 Suppose that Assumptions 2.1 and 3.1-3.4 hold. If φn ∈ Φn is an optimal policy of Mn for each n ∈ N and φn → φ∞, then φ∞ is an optimal policy of M∞.

Proof First, we show that φ∞ is a feasible policy of M∞. Consider an extension of φn to S∞, obtained by replacing φn in (4.5) accordingly. Under Assumptions 2.1 and 3.1-3.3, by Lemmas 3.2 and 4.1, for each ε > 0 there exists an integer N such that the corresponding bounds hold for each n ≥ N and 1 ≤ l ≤ p. Hence,

which together with Assumption 3.2(c) implies that

for each 1 ≤ l ≤ p. Hence, φ∞ is feasible for M∞. By Lemma 4.1 and Theorem 4.1, we have that

Hence, φ∞ is optimal for M∞. The proof is completed.
