Zhijie Zhao and Jörn Ostermann
(Institut für Informationsverarbeitung, Leibniz Universität Hannover, D-30167 Hannover, Germany)
Abstract In this paper, low-complexity error-resilience and error-concealment methods for the scalable video coding (SVC) extension of H.264/AVC are described. At the encoder, multiple-description coding (MDC) is used for error-resilient coding. Balanced scalable multiple descriptions are generated by mixing pre-encoded scalable bit streams. Each description is wholly decodable using a standard SVC decoder. A preprocessor can be placed before an SVC decoder to extract the packets from the highest-quality bit stream. At the decoder, error concealment involves using a lightweight decoder preprocessor to generate a valid bit stream from the available network abstraction layer (NAL) units when medium-grain scalability (MGS) layers are used. Modifications are made to the NAL unit header or slice header if some NAL units of MGS layers are lost. The number of additional packets that a decoder discards as a result of a packet loss is minimized. The proposed error-resilience and error-concealment methods require little computation, which makes them suitable for real-time video streaming. Experimental results show that the proposed methods significantly reduce the quality degradation caused by packet loss.
Keywords error resilience; error concealment; SVC; MDC
Real-time video streaming over packet-switched networks can be impeded by packet loss, which often produces undesirable effects at the decoder. Packet loss can make part of a frame, a whole frame, or even several frames undecodable using a standard decoder. Therefore, error-resilient coding and error-concealment techniques are widely used in video streaming systems to reduce the effect of transmission errors and to minimize end-to-end distortion in error-prone environments.
Error-resilient coding at the encoder introduces redundancy, which limits the effect of packet loss. Error-resilient coding tools for scalable video coding (SVC) can be classified as standard or non-standard. SVC supports several standard error-resilient coding tools, including intra-MB/picture refresh, slice coding, parameter sets, flexible MB order, and redundant slices/pictures [1]. Loss-aware rate-distortion-optimized mode decision [1], forward-error correction, and multiple-description coding (MDC) are non-standard error-resilient coding tools. MDC is used to code a video sequence into two or more bit streams, called descriptions, and these descriptions are transmitted over independent paths. Each description can be decoded independently so that the reproduction of the original source reaches a basic level of quality. A high level of quality can be achieved when all descriptions are reconstructed together. Some early forays into MDC include the multiple-description scalar quantizer [2], MDC with pairwise correlating transform [3], multiple-description motion-compensation schemes [4], multiple-state video coding [5], and multiple-description coding based on forward-error correction [6].
Alternatively, error concealment can be used at the decoder to passively reduce the effect of transmission errors. In this way, available and correctly decoded information is used without modifying the source- and channel-coding schemes. The tools and structure of a specific codec can also be exploited to reduce video quality degradation. The temporal and spatial correlation between frames or within a frame is frequently used to conceal the artifacts caused by transmission errors. Motion data is one of the most important types of data for decoding a frame in hybrid video codecs, so motion copy and motion prediction are widely used.
With SVC, a video sequence is coded into one or more layers, and data-rate adaptation is allowed. This is an attractive solution for dealing with heterogeneous networks and different terminal capacities. The scalable extension of H.264/AVC is the latest SVC standard [7]. SVC provides temporal, spatial, and quality scalability that can be combined for greater adaptability to different network conditions or terminal capacities.
In this paper, we describe low-complexity error-control methods for SVC. In particular, we propose a flexible, standard-compatible MDC method for error-resilient coding [8] and a low-complexity error-concealment method for SVC at the decoder when an MGS layer is used [9]. In section 2, we give an overview of work related to SVC and introduce our proposed methods. In section 3, we present simulation results. Section 4 concludes the paper.
The SVC extension of H.264/AVC incorporates the key features of H.264/AVC as well as new techniques to improve scalability and coding efficiency. Temporal scalability can be achieved by using hierarchical prediction structures, for example, hierarchical B-pictures or non-dyadic hierarchical prediction structures. Quality scalability can be achieved by using coarse-grain quality-scalable coding (CGS) and medium-grain quality-scalable coding (MGS). Spatial scalability can be achieved by using multilayer coding, where each layer corresponds to a supported spatial resolution. Redundancy between spatial layers can be further exploited by interlayer prediction mechanisms such as interlayer motion prediction, interlayer residual prediction, and interlayer intraprediction.
H.264/AVC has a video coding layer (VCL) and a network abstraction layer (NAL). In the VCL, a coded representation of the input video signal is generated, and in the NAL, this coded representation is fragmented into NAL units. The NAL provides header information to ease the use of VCL data. Being an extension of H.264/AVC, SVC has the H.264/AVC NAL unit structure and extends the NAL unit header with scalability information.
MGS supports packet-based quality scalability by distributing the transform coefficients of a slice. When MGS is used to provide quality scalability, an access unit can include several MGS NAL units. With MGS, the enhancement-layer transform coefficients can be distributed between a maximum of 16 slices, and each slice corresponds to an NAL unit. MGS layers inside each dependency layer are identified by a quality identifier.
Combined SVC and MDC has attracted the interest of researchers. Multiple-state video coding [5] splits the input video into subsequences in the temporal domain and encodes each subsequence as an independent description. This coding can be used to generate scalable multiple descriptions with half the temporal resolution. In [10], two complementary descriptions are generated for the high-pass frame of each enhancement layer. These descriptions are generated by assigning only half the motion vectors and texture information of the original coded stream to each description. Alternatively, two descriptions can be generated for the base layer of SVC by downsampling the residual data [11]. A general multirate allocation scheme for multiple-description coding is proposed in [12]. This scheme has been used in JPEG 2000 and H.264/AVC to produce multiple descriptions [13]. A fully redundant unbalanced MDC scheme for SVC is proposed in [14]. In [15], an SVC bit stream produces two balanced descriptions by assigning MGS NAL units to one of the descriptions on alternating frames with a period of two groups of pictures (GOPs). The spatial scalability of SVC is not taken into account.
Several error-concealment methods have been proposed for SVC [1], [16]. Intralayer error concealment and interlayer error concealment deal with frame loss when there are two spatial layers. Error-concealment algorithms copy a picture from another layer, generate a motion vector from the same spatial layer, or upsample motion and residual data from the available base layer to generate a lost picture. With slice support, the error-concealment scheme in [16] not only uses reference-frame information but also uses correctly received information from the same frame and from higher spatial layers. In this way, packet loss can be concealed on the base layer of the compressed stream [17]. In [18], a frame-loss error-concealment algorithm based on hallucination is proposed for the spatial-enhancement layer. A training database of hallucinations for missing enhancement frames is generated from the two most recently decoded I or P frames. This algorithm performs better than the motion and residual upsampling proposed in [16]. Another error-concealment method has been proposed for whole-picture loss in hierarchical B-picture coding [19]. This method performs better than the error-concealment method in the SVC reference software. In contrast, a low-complexity error-concealment algorithm can be used in the network abstraction layer of SVC [20]. The algorithm in [20] recognizes the bit-stream structure and creates a valid sequence of packets from the received packets. In this paper, we call this method NAL unit removal. Unlike other error-concealment algorithms, NAL unit removal does not generate missing frames. In [20], the case of two spatial layers with fine-grain scalability (FGS) is considered. A similar work is [21], which builds on the work in [16] by supporting quality scalability using FGS layers [22].
In SVC, each quality base layer and quality-enhancement layer of a spatial layer is usually quantized using different step sizes. The coefficients of a quality-refinement picture are quantized with a quantization parameter (QP) and can be distributed over several layers. Each of these layers contains partial refinement coefficients that use MGS. QPs can be cascaded over the temporal levels according to a given pattern, or default QP cascading can be used.
The parameters for quantization step sizes are stored in several places, including the picture parameter set (PPS), slice header (SH), and macroblock layer. The luma quantization parameter is initially QPY, and this value is used for all macroblocks in the slice until modified by mb_qp_delta in the macroblock layer. QPY is given as

QPY = 26 + pic_init_qp_minus26 + slice_qp_delta

where pic_init_qp_minus26 specifies the initial QPY minus 26 for each slice and is stored in the PPS, and slice_qp_delta changes the quantizer step size at each slice and is stored in the SH. The slice header contains a codeword that indicates the PPS to be used, and the PPS includes the identifier of the active sequence parameter set (SPS). An active SPS remains unchanged throughout a coded video sequence, and an active PPS remains unchanged within a coded picture.

▲Figure 1. Two descriptions are generated by combining bit streams.
Two standard-compatible scalable descriptions can be produced by combining streams pre-encoded at different bit rates. To do this, we change only the quantization step size in order to generate low- and high-bit-rate streams. Moreover, in order to combine NAL units of different descriptions at the decoder, both descriptions use the same PPSs and SPSs. Different quantization step sizes are indicated by the slice_qp_delta parameter in the SH. Each description generated by combining pre-encoded streams supports spatial, temporal, and quality scalability and can be decoded by a standard SVC decoder.
To generate balanced multiple descriptions, different bit streams are combined so that high- and low-rate NAL units from alternating frames are assigned to a description over a period of two GOPs. Fig. 1 shows the proposed combination scheme for generating descriptions.
For the first description in Fig. 1, the even-numbered frames of the first GOP come from a high-bit-rate stream, and the odd-numbered frames come from a low-bit-rate stream. The even-numbered frames of the second GOP come from a low-bit-rate stream, and the odd-numbered frames come from a high-bit-rate stream. For the second description, the assignment is reversed: the even-numbered frames of the first GOP come from a low-bit-rate stream, and the odd-numbered frames come from a high-bit-rate stream; the even-numbered frames of the second GOP come from a high-bit-rate stream, and the odd-numbered frames come from a low-bit-rate stream. This produces two descriptions that are balanced in terms of bit rate and quality.
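This assignment rule can be stated compactly. The following Python sketch illustrates it; the lists high_rate and low_rate (aligned per-frame data from the two pre-encoded streams) and the parameter gop_size are hypothetical names used only for illustration, not part of any codec interface.

def generate_descriptions(high_rate, low_rate, gop_size):
    """Assign high- and low-rate frames to two balanced descriptions.

    The assignment alternates per frame and flips every GOP, so the
    pattern repeats with a period of two GOPs, as in Fig. 1.
    """
    desc1, desc2 = [], []
    for i, (hi, lo) in enumerate(zip(high_rate, low_rate)):
        gop_index = i // gop_size
        # In even-numbered GOPs, description 1 takes high-rate data on
        # even-numbered frames; in odd-numbered GOPs the roles are swapped.
        high_in_desc1 = (i % 2 == 0) == (gop_index % 2 == 0)
        if high_in_desc1:
            desc1.append(hi)
            desc2.append(lo)
        else:
            desc1.append(lo)
            desc2.append(hi)
    return desc1, desc2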
At the decoder, a preprocessor is placed before a standard SVC decoder to parse newly arrived packets and extract the packets from the highest-quality bit stream. Without the preprocessor, each description is still decodable using an SVC decoder. In our proposed scheme, a side decoder is not needed for an MDC description. The received packets from both descriptions are parsed and arranged into a new stream that is passed to an SVC decoder.
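A minimal sketch of this merging step is given below, under the simplifying assumption that each received packet is already tagged with its frame number and with a flag indicating whether it came from the high- or low-rate stream; a real preprocessor would derive this information from the packet and NAL unit headers.

from collections import namedtuple

# Hypothetical packet representation used only for this illustration.
Packet = namedtuple("Packet", ["frame", "is_high_rate", "payload"])

def merge_descriptions(received1, received2):
    """Build one bit stream, preferring the high-rate copy of each frame."""
    best = {}
    for pkt in list(received1) + list(received2):
        current = best.get(pkt.frame)
        # Keep the high-rate packet when both copies of a frame arrived.
        if current is None or (pkt.is_high_rate and not current.is_high_rate):
            best[pkt.frame] = pkt
    # Emit the surviving packets in order for a standard SVC decoder.
    return [best[f] for f in sorted(best)]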
In H.264/AVC and SVC, an NAL unit starts with a single-byte header that signals the type of the contained data. In SVC, a three-byte extension header is used to indicate the scalability information for the coded slice in scalable extension and for the prefix NAL unit. The parameters dependency identifier (D), temporal identifier (T), and quality identifier (Q) determine which spatial layer, temporal layer, and quality layer an NAL unit belongs to. An access unit corresponds to one picture after decoding and comprises several consecutive NAL units with specific properties. In the final SVC design, MGS is used rather than FGS; however, NAL unit removal has not been studied for MGS quality scalability. We therefore propose extending the NAL-unit-removal algorithm to deal with packet loss in the MGS layer. Our approach is motivated by multilayer adaptation for an MGS-based SVC bit stream [23]. NAL unit headers or slice headers are parsed to produce a valid bit stream from the available NAL units at the receiver. When a frame belonging to the highest temporal level is lost, the handling of the NAL-unit-removal algorithm is changed.
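Because these identifiers sit at fixed bit positions in the extension header, extracting them is inexpensive. The following sketch reads D, T, and Q from a four-byte SVC NAL unit header (one-byte H.264/AVC header plus three-byte extension), following the bit layout of H.264 Annex G:

def parse_svc_extension(nal):
    """Return (D, T, Q) for an SVC NAL unit, or None for other types."""
    nal_unit_type = nal[0] & 0x1F
    # Types 14 (prefix NAL unit) and 20 (coded slice in scalable
    # extension) carry the three-byte SVC extension header.
    if nal_unit_type not in (14, 20):
        return None
    dependency_id = (nal[2] >> 4) & 0x07  # D: spatial/CGS layer
    quality_id = nal[2] & 0x0F            # Q: MGS quality layer
    temporal_id = (nal[3] >> 5) & 0x07    # T: temporal level
    return dependency_id, temporal_id, quality_id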
As described above, an access unit can include several MGS NAL units, and MGS layers inside each dependency layer are identified by a quality identifier. For the quality base layer of a spatial-enhancement layer, a syntax element called ref_layer_dq_id in the slice header signals which MGS layer is used for interlayer prediction (assuming that interlayer prediction is enabled). For quality-refinement MGS layers with quality identifier Q > 0, the preceding quality layer with quality identifier Q-1 is used for interlayer prediction. Fig. 2 shows a layer-prediction structure in an access unit with two spatial layers and two MGS layers. If an MGS layer is employed as a reference layer for interlayer prediction and is lost, the received bit stream becomes invalid for a standard decoder. For example, the decoding of MGS layer Q=2 in spatial layer 0 depends on MGS layer Q=1 in spatial layer 0. If MGS layer Q=1 in spatial layer 0 is lost and the other NAL units are received, the bit stream cannot be decoded by a standard decoder. The packets of MGS layer Q=2 in spatial layer 0 and the whole spatial layer 1 are discarded [20]. In the following, we discuss how to deal with MGS-layer loss.
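These dependency rules determine which received NAL units of an access unit remain decodable after a loss. The following sketch expresses the baseline behavior of NAL unit removal; units are hypothetical (D, Q) tuples, and ref_quality[d] stands for the quality layer of dependency layer d-1 that layer d references via ref_layer_dq_id.

def decodable_units(received, ref_quality):
    """Return the (D, Q) units a standard decoder can still use."""
    usable = set()
    for d, q in sorted(received):
        # Quality layer q needs layer q-1 in the same dependency layer.
        if q > 0 and (d, q - 1) not in usable:
            continue
        # The base quality of layer d needs its interlayer reference.
        if d > 0 and (d - 1, ref_quality[d]) not in usable:
            continue
        usable.add((d, q))
    return usable

For the example above, with received = {(0, 0), (0, 2), (1, 0), (1, 1), (1, 2)} after the loss of (0, 1) and ref_quality = {1: 2}, only (0, 0) remains usable, which matches the behavior of [20].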

▲Figure 2. A layer-prediction structure in an access unit.
To simplify the description without losing generality, we consider the case where the source video is coded with two spatial layers and two MGS layers within each spatial layer. Table 1 shows the NAL unit order in a bit stream for a group of pictures (GOP) of size four that has three temporal levels. With the NAL-unit-removal method, if an NAL unit of a GOP is lost, a valid NAL unit order with lower spatial resolution and/or lower frame rate is chosen. With multiple-quality-layer coding, if an NAL unit not from the highest MGS layer is lost, a valid NAL unit order with a lower-quality layer is chosen. For example, if the 11th NAL unit (MGS layer 1) of a GOP in Table 1 is lost, the 12th NAL unit, which belongs to the dependent MGS layer 2, is also discarded to create a valid bit stream, even if the 12th NAL unit is correctly received. Because the slice data of the MGS quality-refinement layers carry different subsets of the transform coefficients, the 12th NAL unit can still be used to improve the decoded image quality. Therefore, we do not discard the higher MGS layers if one or several lower MGS layers are lost. At the client, we use the same layer-dependent modification described in [23], which is made for data-rate adaptation at the server. If the lost MGS layers belong to spatial layer 1, only the quality_id parameter in the NAL headers of the remaining MGS NAL units in spatial layer 1 needs to be modified so that continuity of quality_id values is maintained. Because an NAL unit header is not compressed, this modification requires very little computing power.
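A sketch of this renumbering is given below, reusing the header layout shown earlier; the surviving NAL units of the affected dependency layer are assumed to be passed in decoding order as mutable byte arrays (a hypothetical representation used only for illustration).

def renumber_quality_ids(surviving_nals):
    """Rewrite quality_id so the surviving quality layers are consecutive."""
    new_q = 0
    for nal in surviving_nals:  # units of one dependency layer, in order
        # quality_id occupies the low four bits of the third header byte.
        nal[2] = (nal[2] & 0xF0) | (new_q & 0x0F)
        new_q += 1
    return surviving_nals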
When the 15th NAL unit in Table 1 is lost and the layer-prediction structure in Fig. 2 is used, NAL units 16 to 18 are discarded to create a valid bit stream. To use the received higher MGS layers of a spatial-enhancement layer and to maintain a standard-decoder-compliant bit stream, both the NAL unit header and the slice header are modified. When an MGS NAL unit in spatial layer 0 is lost, the headers of the NAL units within spatial layer 0, and the slice header of the quality base layer in spatial layer 1, need to be changed. If the maximum quality identifier in spatial layer 0 is changed, ref_layer_dq_id in the slice header of the quality base layer in spatial layer 1 is updated according to the new maximum quality identifier. The slice header is coded using Exp-Golomb codes in SVC, so parsing and modification are not time consuming. Fig. 3 shows the reconstructed video quality for the joint test sequence used in section 3. We discard the first MGS layer of the spatial-enhancement layer. In our proposed method, the NAL headers of the second MGS layer are modified in order to maintain a valid bit stream. In the NAL-unit-removal method, only the quality base layer of the spatial-enhancement layer is kept in order to maintain a valid bit stream. The proposed method improves average luma PSNR by 0.57 dB. Our method may introduce drift, so a two-alternative forced-choice test was performed to assess the subjective quality of our method and the NAL-unit-removal method in case the first MGS layer of the spatial-enhancement layer is lost. Two short videos were shown sequentially, and observers had to choose the one they thought had higher quality. For low and medium qualities, the proposed method is preferred, but for higher qualities, the NAL-unit-removal method is preferred because of its smoother motion rendition.
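Fields such as ref_layer_dq_id are unsigned Exp-Golomb (ue(v)) codewords, so updating one amounts to decoding a value and re-encoding a new one. A minimal ue(v) codec is sketched below, with bits represented as a list of 0/1 integers for clarity; a real preprocessor would operate on the byte stream and also handle emulation-prevention bytes.

def ue_decode(bits, pos):
    """Decode one ue(v) value; return (value, next bit position)."""
    leading_zeros = 0
    while bits[pos + leading_zeros] == 0:
        leading_zeros += 1
    pos += leading_zeros + 1  # skip the zeros and the marker '1' bit
    suffix = 0
    for _ in range(leading_zeros):
        suffix = (suffix << 1) | bits[pos]
        pos += 1
    return (1 << leading_zeros) - 1 + suffix, pos

def ue_encode(value):
    """Encode one ue(v) value as a list of bits."""
    code = value + 1
    n = code.bit_length()
    return [0] * (n - 1) + [(code >> i) & 1 for i in range(n - 1, -1, -1)]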

▼Table 1. NAL unit order in a bit stream for a GOP (four frames) with two spatial layers, three temporal layers, and two MGS layers

▲Figure 3. Average distortion when the first MGS layer in spatial layer 1 is discarded.
With the NAL-unit-removal method, if a quality-base-layer NAL unit of the highest temporal layer is lost, an entire temporal layer is removed. For example, if the 13th NAL unit is lost, NAL units 14 to 24 are discarded to arrange a valid bit stream, even if these units are received. However, if hierarchical B-pictures are used for temporal scalability, the frames of the highest temporal layer are B-pictures and are not used as reference frames. This means that if one frame of the highest temporal layer is lost, the other frames are not affected. In our proposal, the remaining NAL units of the highest temporal layer are retained. The missing frame can be concealed using frame copy or other error-concealment methods for whole-frame loss.
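A minimal sketch of frame-copy concealment for such a lost non-reference frame is given below; decoded_frames is a hypothetical display-order list in which None marks a missing picture.

def conceal_by_frame_copy(decoded_frames):
    """Replace each missing frame with the most recent decoded one."""
    last_good = None
    for i, frame in enumerate(decoded_frames):
        if frame is None and last_good is not None:
            decoded_frames[i] = last_good  # repeat the previous picture
        else:
            last_good = frame
    return decoded_frames

Because the highest temporal layer is not used for prediction, this copy affects only the displayed picture and introduces no drift into other frames.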
In this section, we present experimental results for the proposed error-resilience and error-concealment methods. JSVM 9.18 SVC reference software was used to encode the input sequences [21]. Three video sequences (Foreman, Mobile, and Akiyo) were combined into a single sequence to produce long test bit streams. Spatial layer 0 had quarter common intermediate format (QCIF) resolution, and spatial layer 1 had common intermediate format (CIF) resolution. The joint sequence contained 897 frames; the GOP size was 8 frames; and an I-frame was used as the key picture. The RTP packet size was limited to 1400 bytes, and packet loss in the transmission channel was simulated by a two-state Markov model, where a good state means packets are received correctly and promptly, and a bad state means packets are lost.
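The loss pattern can be reproduced with a few lines of Python. In the sketch below, the transition probabilities p_gb (good to bad) and p_bg (bad to good) are hypothetical parameters; the stationary loss rate of this model is p_gb / (p_gb + p_bg), so they can be chosen to match a target packet-loss ratio.

import random

def simulate_losses(num_packets, p_gb, p_bg, seed=0):
    """Return a list of booleans, True where a packet is lost."""
    rng = random.Random(seed)
    bad = False  # start in the good state
    lost = []
    for _ in range(num_packets):
        # Change state according to the transition probabilities.
        if bad and rng.random() < p_bg:
            bad = False
        elif not bad and rng.random() < p_gb:
            bad = True
        lost.append(bad)
    return lost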
In this experiment, we used a streaming scenario with path diversity. The two descriptions are delivered over independent paths, and both paths have the same packet-loss probability. Five packet-loss ratios were used: 1%, 3%, 5%, 8%, and 10%. We assume that parameter sets are conveyed using a reliable transport mechanism. If spatial scalability is supported, a coded bit stream contains two spatial layers (QCIF and CIF), four temporal levels, and one quality layer. CIF resolution is used if spatial scalability is not considered. For simplicity, the spatial base layers of the pre-encoded streams are quantized with the same QP, and only the spatial-enhancement layers or quality-enhancement layers are quantized using two different QPs. Where an SVC receiver cannot decode the base layer, or the scalable MDC (SMDC) receiver lacks both descriptions, one or several frames cannot be decoded. In this case, frame copy is used for error concealment.
First, we compare our proposed SMDC scheme with single-description SVC and method V proposed in [15]. Method V is based on SVC, with both descriptions containing the base layer and every other quality-enhancement layer. Therefore, in method V, the redundancy is only the base layer. In the case of no packet loss, the proposed scheme and method V pay a penalty of reduced coding efficiency compared with single-description SVC. Method V has slightly less coding-efficiency loss than the proposed method. Fig. 4 shows the average luma PSNR as a function of the network packet-loss rate for the joint sequence of Foreman, Mobile, and Akiyo. The proposed scheme outperforms single-description SVC when the packet-loss rate is greater than 3% and outperforms method V when the packet-loss rate is greater than 5% (Fig. 4). At a 10% packet-loss rate, the gain over single-description SVC is 2.6 dB, and the gain over method V is about 1.1 dB. When the packet-loss rate is less than 2%, the proposed SMDC scheme is inferior to single-description SVC, and when the packet-loss rate is less than 3%, it is inferior to method V, because the additional redundancy introduced in the proposed scheme plays a minor role at low packet-loss rates. However, method V in [15] cannot be extended to support spatial scalability.

▲Figure 4. Average Y-PSNR vs. packet-loss rate, without spatial scalability.

▲Figure 5. Average Y-PSNR vs. packet-loss rate, with spatial scalability.

▲Figure 6. Y-PSNR compared with TD-MDC and SD-MDC at 1% packet-loss rate.
To test performance when supporting spatial scalability, we compare our proposed scheme with single-description SVC, spatial-downsampling MDC (SD-MDC), and temporal-downsampling MDC (TD-MDC). In SD-MDC and TD-MDC, the original video is first downsampled into two subsequences in the spatial and temporal domain, respectively. Then, the two subsequences are independently encoded by an SVC encoder. Fig. 5 shows how our scheme performs compared with single-description SVC, SD-MDC, and TD-MDC in terms of average Y-PSNR versus packet-loss rate. The proposed scheme performs best at packet-loss rates of 1% to 10%. At a 10% loss rate, the proposed scheme outperforms SD-MDC and TD-MDC by approximately 3.5 dB and 3.9 dB, respectively. Because the descriptions of SD-MDC and TD-MDC are separately encoded, the loss of packets from one description cannot be effectively compensated by the received packets of the other description. In contrast, the proposed scheme can still produce video at full spatial and temporal resolution when one description is corrupted. Hence, the redundancy introduced in the proposed scheme is more beneficial than that of SD-MDC and TD-MDC in the case of packet loss. Although single-description SVC has the highest coding efficiency, the proposed scheme achieves a similar gain over single-description SVC; the gain is 4.3 dB at a 10% packet-loss rate. Fig. 6 shows the average Y-PSNR versus bit rate compared with SD-MDC and TD-MDC at a 1% packet-loss rate. The results show that the proposed method is superior to SD-MDC and TD-MDC at encoding bit rates of 912 kbit/s, 1460 kbit/s, and 2294 kbit/s, where the redundancies are 28%, 31%, and 33%, respectively.
The proposed error-concealment method is implemented as a preprocessing unit before a standard decoder in order to arrange a valid bit stream from the received packets. In these tests, each spatial layer has four temporal layers and two MGS layers. The first MGS layer contains four transform coefficients, and the second MGS layer contains 12 transform coefficients for each spatial layer. The QP difference between the quality base layer and the quality-enhancement layer is set to three. The default cascading of quantization parameters over the temporal levels is used, and hierarchical B-pictures are used. In the experiments, we assume that packets of the quality base layer in spatial layer 0 are protected and not lost. Packet-loss ratios of 3%, 5%, and 10% are used. For decoded frames with QCIF spatial resolution, we use the upsampling filter in SVC to produce CIF resolution.

▲Figure 7. Average distortion as a function of encoding rate at 5% packet-loss rate (joint sequence Foreman, Mobile, and Akiyo).
Fig. 7 shows the rate-distortion curves for the header-modification and NAL-unit-removal methods at a 5% packet-loss rate, and Fig. 8 shows the corresponding curves at a 10% packet-loss rate. We compare the proposed method only with the NAL-unit-removal method because the proposed method does not substitute for methods that conceal a frame loss but only complements them. Figs. 7 and 8 show that header modification outperforms NAL unit removal over the entire considered range of bit rates. For a 5% packet-loss rate, header modification gains 2.16 dB on average, and for a 10% packet-loss rate, it gains 1.55 dB on average for the joint sequence.

▲Figure 8. Average distortion vs. encoding rate at 10% packet-loss rate (joint sequence Foreman, Mobile, and Akiyo).

▲Figure 10. Average distortion vs. packet-loss rate (joint sequence Foreman, Mobile, and Akiyo, 488 kbit/s).
To further evaluate the performance of the proposed method, we determine how the average luma PSNR changes in relation to the packet-loss rate. Figs. 9 and 10 show that the proposed method outperforms the NAL-unit-removal method at all three simulated packet-loss rates. The proposed method gains a maximum of 1.64 dB over the NAL-unit-removal method when the packet-loss rate for the 488 kbit/s stream is 5%, and the gain is as high as 2.16 dB when the packet-loss rate is 5% for the 1590 kbit/s stream. At a 10% packet-loss rate, the proposed method improves average luma PSNR by 1.4 dB over the NAL-unit-removal method at both bit rates.

▲Figure 11. Y-PSNR of the NAL-unit-removal and proposed methods at 10% packet-loss rate (Foreman, from the joint sequence at 1590 kbit/s).
Fig. 11 shows luma PSNR as a function of frame number for the Foreman part of the joint sequence coded at 1590 kbit/s with a 10% packet-loss rate. Because the proposed method still uses the correctly received packets of higher MGS layers when a lower MGS layer is lost, it provides much better video quality than NAL unit removal.

▲Figure 9. Average distortion vs. packet-loss rate (joint sequence Foreman, Mobile, and Akiyo, 1590 kbit/s).
In this paper, we have proposed a standard-compatible MDC scheme for SVC based on combining pre-encoded streams. This scheme is designed for video streaming applications in error-prone environments. At the decoder, an error-concealment method in the NAL for the case in which MGS is used in the scalable extension of H.264/AVC has been presented. Experimental results show that the proposed MDC and error-concealment methods can improve video quality in error-prone environments. The proposed methods have low computational complexity and require little computing power. Hence, they are suitable for real-time scalable video streaming.