
Learning control of fermentation process with an improved DHP algorithm☆

Chinese Journal of Chemical Engineering, 2016, Issue 10

Dazi Li*, Ningjia Meng, Tianheng Song

Institute of Automation, Beijing University of Chemical Technology, Beijing 100029, China

1.Introduction

Ethanol fermentation is a complex chemical reaction process. Its noisy, nonlinear, and dynamic characteristics make it difficult to control. An optimal substrate feeding policy problem was proposed in [1], where the fed-batch fermentation process was described by four inner state variables including the substrate concentration and the product concentration. In [2], the dynamic reoptimization problem of a fed-batch biochemical process was studied to demonstrate the applicability of a class of adaptive critic designs (ACDs) for online reoptimization.

As pointed out in [3], the objectives of fed-batch processes include maximizing the biomass concentration and minimizing ethanol formation. Also in [3], genetic search approaches are used to solve these problems. Nonlinear optimal control methods developed for the fed-batch fermentation process have been widely researched [4]. However, due to the difficulties in designing nonlinear controllers for chemical processes with uncertain dynamics, intelligent control methods, including various learning control techniques, have been studied in recent years [5,6].

As studied in [7], a controller based on a dynamic programming method for a fed-batch bioreactor was developed to find the optimal glucose feed rate profile. Based on the Bellman optimality principle, dynamic programming methods are mainly used to solve optimization problems of dynamic processes with constraints. However, the original dynamic programming algorithms suffer from the "curse of dimensionality". Consequently, these methods do not perform well in complex, nonlinear, continuous environments.

Developments of the Q-learning algorithm [8] still cannot avoid the "curse of dimensionality". Aiming at this problem, some progress has been made. ACDs [9] can approximately obtain the system state value functions by establishing a "critic"; that is, in order to obtain the optimal solution, ACDs make use of a function approximation structure (e.g., a neural network) to approximate the value functions. As a class of ACDs, heuristic dynamic programming (HDP) has been widely studied as an important learning control method for a bioreactor system [10]. As is well known, the dual heuristic programming (DHP) algorithm needs more model information but achieves better learning control performance, especially when the system parameters and environmental conditions are time-variant [11,12]. HDP, DHP, and globalized DHP (GDHP) [13] all belong to adaptive dynamic programming (ADP) [14]. They are applied to online learning control and optimal control problems [15,16]. In recent years, Xu et al. proposed kernel DHP, whose performance has been analyzed both theoretically and empirically [17,18].

In the critic learning of DHP, the critic network constructed with a neural network showed low efficiency in previous research. Furthermore, the derivatives of the state value functions are affected by the setting of initial weights, structure selection, and iteration speed in the neural network. To solve the above problems, a new DHP algorithm integrating the least squares temporal difference with gradient correction (LSTDC) algorithm [19] as the critic is developed in this paper. The main purpose is to replace the neural network structure with LSTDC, thereby simplifying the weight adjustment process and improving the approximation precision and convergence speed. Then the improved DHP algorithm based on LSTDC (LSTDC-DHP) is applied to realize efficient online learning control for the fed-batch ethanol fermentation process. Simulation results demonstrate that LSTDC-DHP can obtain the near-optimal feed rate trajectory in continuous spaces.

The rest of this paper is organized as follows. Following the introduction, the model of the fed-batch ethanol fermentation process is described in Section 2. A review of Markov decision processes (MDPs) and ACDs is presented in Section 3. The proposed LSTDC-DHP is derived in Section 4. To show the applicability of LSTDC-DHP, simulation experiments are implemented and empirical results are presented in Section 5. Finally, the main conclusions are summarized in Section 6.

2.Problem Formulation

The studied process is the fed-batch ethanol fermentation process presented in [20]. The main components in the reaction tank are cell mass, nutrients, water, and substrate. The most prominent feature of this process is that the plant is continuously fed with concentrated substrate, as shown in Fig. 1. The chemical reaction is shown in Eq. (1).

Fig.1.The ethanol fermentation plant.

At a certain temperature, the raw materials, including glucose, zymase, and ethanol, undergo complex dynamic biochemical reactions.

The reaction mechanistic model can be described by the following equations:

where c1 is the cell mass concentration (g·L−1), c2 denotes the substrate concentration (g·L−1), c3 stands for the product concentration (g·L−1), and V is the liquid volume of the reactor (L). The feed rate to the reactor is denoted by u, which is constrained by 0 L·h−1 ≤ u ≤ 12 L·h−1. The specific growth rate, ε, and the specific productivity, τ, are functions of c2 and c3.

In Eqs. (2) and (3), the assumed measurable initial state contains four variables, which are specified as [c1, c2, c3, V] = [1, 150, 0, 10]. The liquid volume V of the reactor is limited to 200 L. The final batch time tf is fixed at 3 h.

The target of the optimal control is to maximize the yield at the end of the fixed processing time of 3 h. The manipulated variable is the substrate feed rate u. By combining and integrating the last two differential equations of Eq. (2), the maximum product output can be obtained as follows:
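As a reading aid only, the following Python sketch shows a generic fed-batch mass-balance structure consistent with the state variables above. The rate laws eps() and tau(), the feed substrate concentration C_FEED, and the biomass yield Y are placeholder assumptions, since the paper's Eqs. (2) and (3) are not reproduced in this text; the yield is taken here as the product amount c3(tf)·V(tf), a common choice for this kind of problem.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch only: a generic fed-batch mass-balance structure for [c1, c2, c3, V].
# The paper's exact rate laws (Eq. (3)), the feed substrate concentration
# C_FEED and the biomass yield Y are placeholder assumptions.
C_FEED = 150.0   # assumed feed substrate concentration, g/L
Y = 0.5          # assumed biomass yield on substrate

def eps(c2, c3):
    # placeholder specific growth rate, a function of c2 and c3
    return 0.4 * c2 / (10.0 + c2) / (1.0 + c3 / 50.0)

def tau(c2, c3):
    # placeholder specific productivity, also a function of c2 and c3
    return 0.5 * c2 / (10.0 + c2) / (1.0 + c3 / 50.0)

def fed_batch_rhs(t, x, u):
    """Generic fed-batch balances: cells, substrate, product, volume."""
    c1, c2, c3, V = x
    D = u / V                                   # dilution effect of feeding
    dc1 = eps(c2, c3) * c1 - D * c1
    dc2 = -eps(c2, c3) * c1 / Y + D * (C_FEED - c2)
    dc3 = tau(c2, c3) * c1 - D * c3
    dV = u
    return [dc1, dc2, dc3, dV]

# One 3-h batch at a constant feed rate of 6 L/h from the stated initial state.
x0 = [1.0, 150.0, 0.0, 10.0]
sol = solve_ivp(fed_batch_rhs, (0.0, 3.0), x0, args=(6.0,), max_step=5 / 60)
c3_tf, V_tf = sol.y[2, -1], sol.y[3, -1]
print("product amount c3(tf)*V(tf) =", c3_tf * V_tf)
```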

3.Brief Overview of MDPs and ACDs

3.1.Markov decision processes

A Markov chain is a discrete stochastic process with the Markov property: the next state of the process depends only on the current state and not on any earlier states. An MDP is composed of a tuple {X, A, P, R}, where X is the state space, A is the action space, P is the state transition probability, and R is the reward function. When a certain state xt is reached at time step t, an action at is chosen under a fixed policy πt. This deterministic stationary policy directly maps states to actions, which is denoted as

When the actions at satisfy Eq. (7), π is called a Markov policy. The objective of a decision maker is to estimate the optimal policy π* that satisfies

where γ ∈ (0,1) is a discount factor for infinite-horizon problems, and Eπ[·] stands for the expectation with respect to the policy π and the state transition probabilities.
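For reference, a standard way of writing this infinite-horizon discounted objective (the indexing convention here is ours and may differ slightly from the paper's corresponding equation) is

\[ \pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right]. \]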

The state value function V(xt) is defined as the expected discounted total future reward starting from xt, where future rewards are discounted by the factor γ:

For any policy π, the value function is defined as the infinite-horizon accumulated expected discounted reward, namely,
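A standard form of this definition (our notation; the paper's equation may differ in indexing) is

\[ V^{\pi}(x_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, x_t \right]. \]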

3.2.Adaptive critic designs

In the last few years, ACDs have been widely studied in the field of reinforcement learning. ACDs mainly consist of three parts: the critic network, the model network, and the actor network. The basic idea of ACDs is to construct an approximate function that meets the Bellman optimality principle by adjusting the weights of the critic network. ACDs are applicable in noisy, nonlinear, and nonstationary environments. DHP is a very important class of ACD methods. DHP uses both value function approximation (VFA) and a policy-gradient learning strategy to search for a near-optimal control policy in continuous spaces.

Generally, the training of ACDs is carried out through the policy iteration process of dynamic programming. The critic network evaluates the control performance of the actor network, which is called the policy evaluation process. Meanwhile, the actor network produces the control actions and improves its policy according to the evaluation of the critic network, namely, the policy improvement process.

4.The Improved DHP Algorithm Based on LSTDC

4.1.Previous work on value function approximation

Since real-world engineering problems with large or continuous state spaces are widespread, tabular representations of value functions suffer from heavy computation and large storage requirements. In order to deal with the generalization of the Markov chain learning prediction problem over large and continuous state spaces, and to realize temporal difference (TD) learning, the TD(λ) algorithm [21] based on VFA has been widely studied in the reinforcement learning and operations research communities.

The estimated value function represented by a general linear VFA can be denoted as
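A standard linear VFA takes the form (our notation, with φ(x) the basis-function vector and θ the weight vector)

\[ \hat{V}(x) = \phi(x)^{\top}\theta = \sum_{i=1}^{n} \theta_i\, \phi_i(x). \]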

For arbitrary 0 ≤ λ ≤ 1, the linear TD(λ) algorithm converges with probability 1 to the unique solution of the following equation:

The least squares temporal difference (LSTD) algorithm was proposed to overcome the difficulty of selecting the learning factor in the linear TD(λ) algorithm, and to improve the data efficiency and the convergence speed of the algorithm [22]. In LSTD(0), the performance function is defined as

By employing the instrumental variables approach, the least-squares solution of the above equation is given as

where ?t is the instrumental variable, which is uncorrelated with the input and output observation noises.

In order to reduce the computational burden of the matrix inversion and to realize online learning, the LSTD(0) algorithm was further derived into the recursive form RLS-TD(0) [22]. The update rules of RLS-TD(0) take the form

In LSTD(0) and RLS-TD(0), the decay factor is 0; equivalently, these two algorithms do not employ the eligibility trace. Accordingly, the RLS-TD(λ) algorithm [23] is considered to further improve the efficiency of the TD algorithm. Therefore, Eq. (19) is further modified as

Based on the above derivation, the weight vector update rules of RLS-TD(λ) have the following form

where Kt is the recursive gain matrix and Pt denotes the variance matrix calculated in the weight-updating process.
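As an illustration, the following Python sketch shows one commonly cited recursive form of the RLS-TD(λ) update (cf. [22,23]). The eligibility-trace, gain, and variance-matrix recursions below are our reconstruction under an assumed forgetting factor μ and are not a verbatim copy of the paper's equations.

```python
import numpy as np

def rls_td_lambda_step(theta, P, z, phi_t, phi_tp1, r,
                       gamma=0.95, lam=0.6, mu=1.0):
    """One recursive RLS-TD(lambda) update (sketch of the form used in [22,23]).

    theta : weight vector, P : variance matrix, z : eligibility trace,
    phi_t / phi_tp1 : feature vectors of the current and next states,
    r : observed reward, mu : forgetting factor (assumed to be 1 here).
    """
    z = gamma * lam * z + phi_t            # eligibility trace accumulation
    d = phi_t - gamma * phi_tp1            # temporal-difference feature vector
    K = P @ z / (mu + d @ P @ z)           # recursive gain vector K_t
    theta = theta + K * (r - d @ theta)    # correct weights by the TD error
    P = (P - np.outer(K, d @ P)) / mu      # update the variance matrix P_t
    return theta, P, z

# Example call with random features (illustration only).
n = 4
theta, P, z = np.zeros(n), np.eye(n), np.zeros(n)
theta, P, z = rls_td_lambda_step(theta, P, z,
                                 np.random.rand(n), np.random.rand(n), r=1.0)
```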

4.2.LSTDC-DHP algorithm

When facing more complex control problems, an online reinforcement learning algorithm is needed to further speed up convergence and to make the control algorithm more stable. Therefore, another improved algorithm, TD with gradient correction (TDC) [24–27], is derived. The mean-square projected Bellman error performance function employed in TDC is described by

where T is the Bellman operator and Π is a projection operator that projects back onto the VFA function space.

Based on the above performance function, the weight update rules of the TDC algorithm can be obtained

where αt and ηt are the learning factors, and δt is the temporal difference error.
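For concreteness, the TDC update in its widely used two-time-scale form from [25] can be written as the following Python sketch; the variable names are ours, and alpha and eta correspond to the learning factors αt and ηt above.

```python
import numpy as np

def tdc_step(theta, w, phi_t, phi_tp1, r, alpha=0.1, eta=0.1, gamma=0.95):
    """One TDC update in the standard two-time-scale form of [25].

    theta : value-function weights, w : auxiliary weights tracking the
    expected TD error, phi_t / phi_tp1 : feature vectors, r : reward.
    """
    delta = r + gamma * (phi_tp1 @ theta) - phi_t @ theta          # TD error
    theta = theta + alpha * (delta * phi_t
                             - gamma * phi_tp1 * (phi_t @ w))      # corrected gradient step
    w = w + eta * (delta - phi_t @ w) * phi_t                      # auxiliary weight update
    return theta, w
```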

Next, a framework of LSTDC is introduced for the weight adjustment in the critic learning of DHP. Based on this framework, we present the idea of how to improve the critic network in DHP. Different from previous works, our work focuses on extending LSTDC-DHP to learning control problems in the batch fermentation process. The aim is to obtain better performance compared with previous DHPs using the TD algorithm. A general framework of DHP is shown in Fig. 2. The main components include a critic, an actor, a model, and a plant. The model is the established system model, and the plant is the system process. The critic is used to approximate the derivatives of the value functions. The actor receives the plant's current state xt and outputs the control at.

Fig.2.Learning control structure of DHP.

Then the model receives the control at and estimates the next state xt+1. The corresponding polynomials are derived in the next section.

4.2.1.Critic network

ACDs are forms of actor–critic learning control architecture. As shown in Fig. 2, the critic network estimates the state value function V(xt) by using the following Bellman equation of dynamic programming:

To approximate the derivatives of the state value functions, the main route to improving the DHP algorithm is to change the approximation method for the two key quantities λ(xt) and λ(xt+1). Therefore, the VFA is constructed by LSTDC, which replaces the critic neural network structure, namely
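In DHP, λ(x) conventionally denotes the costate, i.e., the derivative of the value function with respect to the state, which is exactly what the critic approximates (our notation):

\[ \lambda(x_t) = \frac{\partial V(x_t)}{\partial x_t}. \]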

Since the value function parameters make the sum of TD updates over the observed trajectory zero, the sums of the TDC updates in Eqs. (31) and (32) should also be zero, as follows:

In order to make sure that Gt is full rank, ridge regression is employed by initializing Gt with an identity matrix multiplied by a small positive number g. As in a basic ridge regression problem, the effect of g is not only to make Gt invertible but also to regularize the critic weights. Since little previous literature has discussed the sensitivity to the choice of g, the value of g is chosen by trial and error in the simulation experiments. The LSTDC update rules for critic learning in LSTDC-DHP are as follows:

In LSTDC-DHP, LSTDC with a proper basis function replaces the critic neural network structure, so the difficulties of selecting parameters in the previous neural-network-based DHP can be overcome. Therefore, the approximation precision and convergence speed of the critic network are improved. The critic weights can be estimated rapidly once a suitable basis function is constructed from the continuous observation data.
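To illustrate the role of the ridge constant g only, the snippet below shows a plain ridge-regularized least-squares accumulation with Gt initialized as g·I. This is not the paper's LSTDC recursion (Eqs. (47) and (48)); the features and targets are synthetic stand-ins.

```python
import numpy as np

# Illustration of the ridge constant g only: a plain ridge-regularized
# least-squares accumulation with G_t initialized to g*I. This is NOT the
# paper's LSTDC recursion (Eqs. (47)-(48)); features and targets are synthetic.
rng = np.random.default_rng(0)
n = 8                       # number of basis functions (assumed)
g = 1e-3                    # small positive ridge constant, chosen by trial and error
G = g * np.eye(n)           # G_0 = g*I guarantees full rank from the first sample
h = np.zeros(n)

for _ in range(100):
    phi = rng.normal(size=n)        # stand-in feature vector
    target = rng.normal()           # stand-in regression target
    G += np.outer(phi, phi)         # accumulate feature statistics
    h += phi * target               # accumulate feature-weighted targets

theta = np.linalg.solve(G, h)       # critic weights; solvable even with few samples
```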

4.2.2.Actor network

The actor network is used to generate the control actions based on the observed states of the plant. The output of the actor network is given by

4.2.3.Model network

The model network approximates the characteristics of the dynamic process. The mechanistic model of the fed-batch ethanol fermentation process is used as the model network.

Algorithm 1.Learning control of fermentation process based on LSTDC-DHP

(1) Give the initial critic weights and actor weights;

(2) Set t = 0 and J0 = 0;

(3) According to the current state xt, compute the output of the actor, select the action ut, observe the next state xt+1, and get the next reward rt+1;

(4) Compute the performance index function Jt according to Eq. (4);

(5) Apply LSTDC described in Eqs. (47) and (48) to update the weights of the critic network;

(6) Update the weights of the actor network using Eq. (52);

(7) Stop when Jt+1 − Jt ≤ 0 or the fermentation time reaches tf;

(8) Otherwise, let t = t + 1 and return to step (3).

In the weight adjustment process of LSTDC-DHP, the weights are estimated by using continuous observation data, which are generally easy to obtain. Therefore, LSTDC-DHP can be applied to the batch process with greatly improved efficiency of the weight update and reduced parameter tuning. The procedure of LSTDC-DHP is briefly summarized in Algorithm 1.
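A high-level Python skeleton of Algorithm 1 is given below as a reading aid. Every callable (plant model, CMAC features, actor output, LSTDC critic update, actor update, and reward) is a placeholder, since Eqs. (4), (47), (48), and (52) are not reproduced in this text; the cumulative form assumed for Jt is our simplification.

```python
import numpy as np

def run_lstdc_dhp_batch(theta_c, theta_a, plant_step, features, actor_output,
                        critic_update, actor_update, reward, t_f_steps=36):
    """Skeleton of Algorithm 1. Every callable is a placeholder standing in
    for the paper's model network, CMAC actor, LSTDC critic (Eqs. (47)-(48)),
    actor update (Eq. (52)) and performance index (Eq. (4))."""
    x = np.array([1.0, 150.0, 0.0, 10.0])     # measured initial state [c1, c2, c3, V]
    J_prev = 0.0                              # step (2): t = 0, J_0 = 0
    for t in range(t_f_steps):                # 36 steps of 5 min = 3 h batch time
        phi = features(x)                                  # e.g. CMAC feature vector
        u = actor_output(theta_a, phi)                     # step (3): control action
        x_next = plant_step(x, u)                          # next state from the model
        r = reward(x, u, x_next)                           # next reward
        J = J_prev + r                                     # step (4): index, assumed cumulative
        theta_c = critic_update(theta_c, phi,
                                features(x_next), r)       # step (5): LSTDC critic update
        theta_a = actor_update(theta_a, theta_c, phi, u)   # step (6): actor weight update
        if J - J_prev <= 0:                                # step (7): stop when no improvement
            break
        x, J_prev = x_next, J                              # step (8): advance one time step
    return theta_c, theta_a
```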

5.Simulation Studies and Discussion

The learning control of the fed-batch ethanol fermentation process is studied to illustrate the effectiveness of LSTDC-DHP applied to a continuous process. To solve the reinforcement learning problem with a continuous state space, a class of linear function approximators called the Cerebellar Model Articulation Controller (CMAC) is used. It is well known that CMAC has been widely used in process control and function approximation. In the LSTDC-DHP learning algorithm, a CMAC neural network with four inputs and one output is used in the actor network.

In the experiment, the actor's learning factor is βt = 0.3. The weights of the critic network and the actor network are all initialized to 0. Gt is initialized to an identity matrix and ht is initialized to 0. The other parameters are set as γ = 0.95, c1 = 0.4, and c2 = 0.5. The initial state vector [c1, c2, c3, V] is initialized as [1, 150, 0, 10]. The time step is 5 min in the simulation process. A learning control run of the ethanol fermenter is defined as starting from the initial state and lasting 36 time steps. The learning control results are shown in Fig. 3. The optimal control policy curve is continuous. The cell mass concentration, the substrate concentration, and the product concentration reach steady states at the end of the fixed batch time. We can find that the yield of the ethanol fermentation process becomes smooth gradually during the learning control process. It can also be seen that the performance index eventually converges to its steady-state value.
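Collecting the stated settings in one place, a configuration for reproducing this experiment might look as follows (the key names are ours; c1 and c2 here are the learning-related constants of this section, not the state concentrations of Section 2).

```python
# Settings reported in the text, collected in one place (key names are ours).
sim_config = {
    "actor_learning_factor": 0.3,                # beta_t
    "discount_factor": 0.95,                     # gamma
    "c1": 0.4,
    "c2": 0.5,
    "initial_state": [1.0, 150.0, 0.0, 10.0],    # [c1, c2, c3, V]
    "time_step_minutes": 5,
    "steps_per_batch": 36,                       # 36 x 5 min = 3 h batch time
    "critic_G_init": "identity",                 # G_t initialized to an identity matrix
    "critic_h_init": 0.0,                        # h_t initialized to 0
    "initial_weights": 0.0,                      # critic and actor weights start at 0
}
```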

Considering the production benefit, the specific growth rate and the specific productivity in the reaction mechanistic model should be strictly increasing until the reaction terminates. According to Eq. (3), the product concentration shown in Fig. 3 should decline when the substrate concentration rises, so that ε and τ increase. The liquid volume of the reactor increases throughout. Although the cell mass concentration decreases, the total cell mass increases on the whole. In order to improve production, the cell mass needs to consume more substrate; therefore, the feed rate to the reactor rises suddenly.

In addition, two other DHP methods that use RLS-TD(λ) and TDC as the critic algorithms, named RLSTD(λ)-DHP and TDC-DHP, are also implemented for the same control problem. Together with RLSTD(0)-DHP by Xu et al. [18], three methods are compared with LSTDC-DHP. For TDC-DHP, the critic's learning factor is αt = 0.1 and ηt is initialized as 0.1. Similarly, the variance matrix Pt is set to the identity matrix in RLSTD(0)-DHP. In RLSTD(λ)-DHP, λ is set to 0.6 and the eligibility trace vector is initialized as 0. The performances of the LSTDC-DHP, RLSTD(0)-DHP, RLSTD(λ)-DHP, and TDC-DHP algorithms are compared in Fig. 4. The yield of the reactor is chosen as the performance index in the simulation experiments. As shown in Table 1, LSTDC-DHP obtains a better yield of the ethanol fermentation process and better values of the performance index owing to the use of LSTDC. Fig. 5 shows a comparison of the product concentration obtained by the different algorithms, and Fig. 6 shows the control policy curves of the ethanol fermentation process. The learning control of the fed-batch ethanol fermentation process is carried out continuously; therefore, the suitable feed rate obtained by LSTDC-DHP leads to the maximum yield.

Fig.4.Performance indexes of four different algorithms.

Table 1 Performance comparison under different algorithms

Fig.3.The learning control results of the fed-batch ethanol fermentation process.

Fig.5.Comparison of the product concentration by different algorithms.

Fig.6.Curves of control policies by four different algorithms.

6.Conclusions

In this paper, LSTDC-DHP has been developed to deal with learning control problems. LSTDC is chosen as the critic to replace the neural network structure in LSTDC-DHP. Based on the simulations, the performance of LSTDC-DHP is analyzed. From the experimental results, LSTDC-DHP is effective in simplifying the weight adjustment process and improving the critic's approximation precision. LSTDC-DHP can be used to design the optimal feed rate trajectory continuously. Consequently, the aim of obtaining the maximum ethanol product is achieved. Simulation results on the learning control of the fed-batch ethanol fermentation process illustrate the effectiveness of LSTDC-DHP and verify its excellent performance in continuous spaces. This research also shows that it is promising to further study the LSTDC-DHP learning control algorithm and apply it efficiently to complex, nonlinear, dynamic chemical industrial processes.

[1]J.Hong,Optimal substrate feeding policy for a fed batch fermentation with substrate and product inhibition kinetics,Biotechnol.Bioeng.28(9)(1986)1421–1431.

[2]M.S.Iyer,D.C.Wunsch,Dynamic re-optimization of a fed-batch fermentor using adaptive critic designs,IEEE Trans.Neural Netw.12(6)(2001)1433–1444.

[3]U.Yüzgeç,M.Türker,A.Hocalar,On-line evolutionary optimization of an industrial fed-batch yeast fermentation process,ISA Trans.48(1)(2009)79–92.

[4]A.Ashoori,B.Moshiri,A.Khaki-Sedigh,M.R.Bakhtiari,Optimal control of a nonlinear fed-batch fermentation process using model predictive approach,J.Process Control 19(7)(2009)1162–1173.

[5]C.V.Peroni,N.S.Kaisare,J.H.Lee,Optimal control of a fed-batch bioreactor using simulation-based approximate dynamic programming,IEEE Trans.Control Syst.Technol.13(5)(2005)786–790.

[6]S.Syafiie,F.Tadeo,M.Villafín,A.Alonso,Learning control for batch thermal sterilization of canned foods,ISA Trans.50(1)(2011)82–90.

[7]C.Valencia,G.Espinosa,J.Giralt,F.Giralt,Optimization of invertase production in a fed-batch bioreactor using simulation based dynamic programming coupled with a neural classifier,Comput.Chem.Eng.31(9)(2007)1131–1140.

[8]D.Z.Li,L.Qian,Q.B.Jin,T.W.Tan,Reinforcement learning control with adaptive gain for a Saccharomyces cerevisiae fermentation process,Appl.Soft Comput.J.11(8)(2011)4488–4495.

[9]D.V.Prokhorov,D.C.Wunsch,Adaptive critic designs,IEEE Trans.Neural Netw.8(5)(1997)997–1007.

[10]C.Q.Lian,X.Xu,L.Zuo,Learning control of a bioreactor system using kernel-based heuristic dynamic programming,Proc.World Congr.Intelligent Control Autom.WCICA 2012,pp.316–321.

[11]B.Wang,D.B.Zhao,C.Alippi,D.R.Liu,Dual heuristic dynamic programming for nonlinear discrete-time uncertain systems with state delay,Neurocomputing 134(2014)222–229.

[12]C.Q.Lian,X.Xu,L.Zuo,Z.H.Huang,Adaptive critic design with graph Laplacian for online learning control of nonlinear systems,Int.J.Adapt.Control Signal Process.28(2014)290–304.

[13]M.Fairbank,E.Alonso,D.Prokhorov,Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks,IEEE Trans.Neural Netw.Learn.Syst.23(10)(2012)1671–1676.

[14]J.Fu,H.B.He,X.M.Zhou,Adaptive learning and control for MIMO system based on adaptive dynamic programming,IEEE Trans.Neural Netw.22(7)(2011)1133–1148.

[15]Z.Ni,H.B.He,J.Y.Wen,Adaptive learning in tracking control based on the dual critic network design,IEEE Trans.Neural Netw.Learn.Syst.24(6)(2013)913–928.

[16]F.X.Tan,D.R.Liu,X.P.Guan,Online optimal control for VTOL aircraft system based on DHP algorithm,Proc.of the 33rd Chinese Control Conf.,CCC 2014,pp.2882–2886.

[17]X.Xu,Z.S.Hou,C.Q.Lian,H.B.He,Online learning control using adaptive critic designs with sparse kernel machines,IEEE Trans.Neural Netw.Learn.Syst.24(5)(2013)762–775.

[18]X.Xu,C.Q.Lian,L.Zuo,H.B.He,Kernel-based approximate dynamic programming for real-time online learning control:An experimental study,IEEE Trans.Control Syst.Technol.22(1)(2014)146–156.

[19]T.H.Song,D.Z.Li,L.L.Cao,K.Hirasawa,Kernel-based least squares temporal difference with gradient correction,IEEE Trans.Neural Netw.Learn.Syst.27(4)(2016)771–782.

[20]Z.H.Xiong,J.Zhang,Neural network model-based on-line re-optimisation control of fed-batch processes using a modified iterative dynamic programming algorithm,Chem.Eng.Process.44(4)(2005)477–484.

[21]R.S.Sutton,Learning to predict by the method of temporal differences,Mach.Learn.3(1988)9–44.

[22]S.J.Bradtke,A.G.Barto,Linear least-squares algorithms for temporal difference learning,Mach.Learn.22(1–3)(1996)33–57.

[23]X.Xu,H.G.He,D.W.Hu,Efficient reinforcement learning using recursive least-squares methods,J.Artif.Intell.Res.16(2002)259–292.

[24]S.Bhatnagar,D.Precup,D.Silver,Convergent temporal-difference learning with arbitrary smooth function approximation,Adv.Neural Inf.Process.Syst.-Proc.Conf 2009,pp.1204–1212.

[25]R.S.Sutton,H.R.Maei,D.Precup,Fast gradient-descent methods for temporal difference learning with linear function approximation,Proc.Int.Conf.Mach.Learn.,ICML 2009,pp.993–1000.

[26]M.Geist,O.Pietquin,Algorithmic survey of parametric value function approximation,IEEE Trans.Neural Netw.Learn.Syst.24(6)(2013)845–867.

[27]H.R.Maei,C.Szepesvári,S.Bhatnagar,R.S.Sutton,Toward off-policy learning control with function approximation,Proc.Int.Conf.Mach.Learn.,ICML 2010,pp.719–726.
