Information and Computation 182 (2003) 73–94


www.elsevier.com/locate/ic

Predicting a binary sequence almost as well as the optimal biased coin

Yoav Freund

Computer Science and Engineering, The Hebrew University, Jerusalem, Israel

Received 14 March 2000; revised 4 January 2002

Abstract

We apply the exponential weight algorithm, introduced by Littlestone and Warmuth [26] and by Vovk [35], to the problem of predicting a binary sequence almost as well as the best biased coin. We first show that for the case of the logarithmic loss, the derived algorithm is equivalent to the Bayes algorithm with Jeffreys' prior, which was studied by Xie and Barron [38] under probabilistic assumptions. We derive a uniform bound on the regret which holds for any sequence. We also show that if the empirical distribution of the sequence is bounded away from 0 and from 1, then, as the length of the sequence increases to infinity, the difference between this bound and a corresponding bound on the average-case regret of the same algorithm (which is asymptotically optimal in that case) is only 1/2. We show that this gap of 1/2 is necessary by calculating the regret of the min–max optimal algorithm for this problem and showing that the asymptotic upper bound is tight. We also study the application of this algorithm to the square loss and show that the algorithm that is derived in this case is different from the Bayes algorithm and is better than it for prediction in the worst case.

© 2003 Elsevier Science (USA). All rights reserved.

1. Introduction

In this paper we show how some methods developed in computational on-line learning theory can be applied to a very basic statistical inference problem, and give what we think are surprisingly strong results. In order to present these results within context, we find it necessary to review some of the standard statistical methods used for this problem.

E-mail addresses: yoavf@cs.huji.ac.il, YFreund@banter.com. URL: http://www.cs.huji.ac.il/~yoavf.

0890-5401/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0890-5401(02)00033-0


Consider the following very simple prediction problem. You observe a sequence of bits $x_1, x_2, \ldots$ one bit at a time. Before observing each bit you have to predict its value. The prediction of the $t$th bit $x_t$ is given in terms of a number $p_t \in [0,1]$. Outputting $p_t$ close to 1 corresponds to a confident prediction that $x_t = 1$, while $p_t$ close to 0 corresponds to a confident prediction that $x_t = 0$. Outputting $p_t = 1/2$ corresponds to making a vacuous prediction.¹ Formally, we define a loss function $\ell(p,x)$ from $[0,1] \times \{0,1\}$ to the non-negative real numbers $\mathbb{R}^+$. The value of $\ell(p_t,x_t)$ is the loss we associate with making the prediction $p_t$ and then observing the bit $x_t$. We shall consider the following three loss functions: the square loss $\ell_2(p,x) = (x-p)^2$, the log loss $\ell_{\log}(p,x) = -x\log p - (1-x)\log(1-p)$ and the absolute loss $\ell_1(p,x) = |x-p|$.
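The three loss functions are straightforward to write down; here is a minimal sketch in Python (the function names are mine, not the paper's):

```python
import math

def square_loss(p, x):
    # l_2(p, x) = (x - p)^2
    return (x - p) ** 2

def log_loss(p, x):
    # l_log(p, x) = -x log p - (1 - x) log(1 - p), natural logarithm
    return -x * math.log(p) - (1 - x) * math.log(1 - p)

def absolute_loss(p, x):
    # l_1(p, x) = |x - p|
    return abs(x - p)
```

The vacuous prediction $p = 1/2$ incurs the same loss on either outcome: $1/4$, $\log 2$ and $1/2$, respectively.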

The goal of the prediction algorithm is to make predictions that incur minimal loss. Of course, one has to make some assumption about the sequences in order to have any hope of making predictions that are better than predicting 1/2 on all of the turns. Perhaps the most popular simple assumption is that the sequence is generated by independent random coin flips of a coin with some fixed bias $p$. The goal is to find an algorithm that minimizes the total loss incurred along the sequence. However, as the minimal achievable loss depends on $p$, it is more informative to consider the difference between the incurred loss and the minimal loss achievable by an "omniscient statistician" who knows the true value of $p$ and chooses the optimal prediction $\tilde p$ for this distribution.

Let $x^T = x_1,\ldots,x_T$ denote a binary sequence of length $T$ and let $x^T \sim p^T$ denote the distribution of such sequences where each bit is chosen independently at random to be 1 with probability $p$. The average regret for the prediction algorithm $A$ is the average difference between the total loss incurred by $A$ and the total loss incurred by an omniscient statistician. In symbols, this is²

$$R_{\mathrm{av}}(A,T,p) := E_{x^T \sim p^T}\left[\sum_{t=1}^T \ell(p_t,x_t) - \sum_{t=1}^T \ell(\tilde p,x_t)\right]. \quad (1)$$

We use the term average regret to differentiate it from the worst-case regret, which we define below.
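For small $T$, the expectation in Eq. (1) can be computed exactly by enumerating all $2^T$ sequences. The sketch below does this for the log loss, where the omniscient prediction is $\tilde p = p$; as algorithm $A$ it uses the rule $p_t = (n+1/2)/(t+1)$ that the paper analyzes later (Eq. (11)), so the specific numbers are illustrative only:

```python
import itertools
import math

def log_loss(p, x):
    return -x * math.log(p) - (1 - x) * math.log(1 - p)

def avg_regret(T, p):
    """Exact R_av(A, T, p) for the prediction rule p_t = (n + 1/2) / (t + 1)."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=T):
        k = sum(bits)
        prob = p ** k * (1 - p) ** (T - k)           # probability of this sequence
        n, alg, opt = 0, 0.0, 0.0
        for t, x in enumerate(bits):
            alg += log_loss((n + 0.5) / (t + 1), x)  # algorithm's loss
            opt += log_loss(p, x)                    # omniscient statistician's loss
            n += x
        total += prob * (alg - opt)
    return total
```

Each per-trial term $E_x[\ell(p_t,x)-\ell(p,x)]$ is a KL divergence between the coin and the prediction, so the average regret is nonnegative.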

In general, the optimal algorithm for minimizing the average regret is the Bayes algorithm. There are two variants of the Bayesian methodology; we call these variants subjective Bayesianism and logical Bayesianism:

Subjective Bayesianism. In this view, the notion of a distribution over the set of models is defined, axiomatically, as a representation of the knowledge, or state of mind, of the statistician regarding the identity of the correct model in the model class. Choosing an appropriate prior distribution $\mu$ is, in this case, the act of compiling all such knowledge, that exists before seeing the data, into the form of a prior distribution. After $\mu$ has been chosen, the goal of a prediction algorithm is to minimize the expected average regret, where $p$ is chosen at random according to the prior distribution and then $x^T$ is generated according to $p$, i.e., to find $A$ that minimizes

1 Strictly speaking, there exist non-symmetric loss functions for which the vacuous prediction is not 1/2. In this paper we concentrate on symmetric loss functions, for which the vacuous prediction is always 1/2.
2 For the sake of simplicity, we restrict the discussion in the introduction to deterministic prediction algorithms. In this case the prediction is a function of the previously observed bits.


$E_{p\sim\mu} R_{\mathrm{av}}(A,T,p)$. It is easy to show that the Bayes algorithm with $\mu$ as a prior achieves this optimality criterion with respect to the log loss.

Logical Bayesianism. In this view, which is attributed to Wald, no assumption is made on how the model $p$ is chosen, and the optimality criterion is that the average regret for the worst-case choice of $p$ should be minimized. Interestingly, if the number of trials $T$ is fixed ahead of time then the min–max optimal strategy for the adversary who chooses the value of $p$ is to choose it according to some fixed distribution, $P_{\mathrm{worst}}(T)$, over $[0,1]$ (see e.g. [5,15,21]). The min–max optimal prediction strategy for this case is the Bayes prediction algorithm with the prior set to $P_{\mathrm{worst}}(T)$. However, these prior distributions are very peculiar (see e.g. [41]), calculating them is hard, and, most importantly, they are optimal only if $T$ is known in advance. A more attractive option is to find an algorithm which does not need to know $T$ in advance. Bernardo [3] suggested using the Bayes algorithm with Jeffreys' prior³ (which we denote here by BJ) and Clarke and Barron [8] proved that this choice is asymptotically optimal for the models that are in the interior of the set. Finally, Xie and Barron [38] performed a detailed analysis of the BJ algorithm for the model class of biased coins and have shown⁴ that for any $\epsilon > 0$ and any $0 < a < 1$,

$$\lim_{T\to\infty}\left[\max_{\epsilon/T^a \le p \le 1-\epsilon/T^a} R_{\mathrm{av}}(BJ,T,p) - \frac12\ln T\right] = \frac12\ln\frac{\pi}{2e}. \quad (2)$$

These two methodologies, and most prediction methodologies in general, are based on the following statistical assumption. One assumes that the sequence to be predicted is generated by independent random draws from some unknown distribution that is selected, once and for all, from a known class of distributions. The difference between Bayesian and worst-case methodologies regards only the way in which the specific distribution is chosen from the class. However, the assumption that a particular source of sequences is really a so-called "random" source is problematic, because in most real-world cases it is almost impossible to verify that a particular data source is indeed a random source.⁵ Moreover, because of the lack of data or computing power, one often wants to use a simple stochastic model even in situations where it is known that the sequence is generated by a much more complex random process or by a mechanism which is not random at all!

The question is: what weaker formal assumption can be made that would still correspond to our intuition that a biased coin is a good approximate model for the data? The assumption we study in this paper is that there exists a fixed model whose total loss on the sequence is non-trivial. This direction builds upon the ideas of "prediction in the worst-case" [16], "on-line learning" [10,25,26,35], "universal coding" [12,29,31,33,37], "stochastic complexity" [2,28,30], "universal

3 Jeffreys' prior for this case is also known as the Krichevsky–Trofimov prior or the Dirichlet-(1/2,1/2) prior; see Eq. (28) for the exact definition.
4 Xie and Barron give their results in a slightly stronger form; our results can be presented in that form too.
5 It seems that the most accepted approach to measuring the randomness of a particular sequence is to measure its Kolmogorov complexity (see e.g. [24]); in other words, to compare the length of the sequence with the length of the shortest program for generating it. Random sequences are those for which the program is not significantly shorter than that of the sequence. While this measure is mathematically very elegant, it is usually not a practical measure to compute.


portfolios" [9,11], and "universal prediction" [14]. Also closely related are methods for adaptive strategies in repeated games studied by Hannan [19], Blackwell [4], and more recently by Foster and Vohra [17] and Hart and Mas-Colell [20].

The results presented here were initially published in [18]. At the same time, similar results have been published by Xie and Barron [39].

We define the trivial per-trial loss as the maximal loss incurred by the best prediction. More formally, it is

$$L := \inf_{p\in[0,1]}\max_{x\in\{0,1\}} \ell(p,x). \quad (3)$$

It is easy to verify that the prediction that minimizes the loss for the log loss, the square loss and the absolute loss is 1/2, and that the corresponding values of the trivial loss are $\log 2$, $1/4$ and $1/2$, respectively. We say that a prediction algorithm makes non-trivial predictions on a particular sequence $x_1,\ldots,x_T$ if the total loss the algorithm incurs on the whole sequence is significantly smaller than $TL$.
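The trivial per-trial losses quoted above can be checked by a crude grid search over $p$ (a sketch; the grid resolution is arbitrary):

```python
import math

def trivial_loss(loss, grid_size=1000):
    # L = inf_p max_x loss(p, x), approximated on a grid of interior points
    grid = (i / grid_size for i in range(1, grid_size))
    return min(max(loss(p, 0), loss(p, 1)) for p in grid)

square = trivial_loss(lambda p, x: (x - p) ** 2)
log_l  = trivial_loss(lambda p, x: -x * math.log(p) - (1 - x) * math.log(1 - p))
absol  = trivial_loss(lambda p, x: abs(x - p))
```

All three maxima are minimized at $p = 1/2$, which the grid contains exactly.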

In the case of the biased coins, the assumption is that there exists some fixed value of $p\in[0,1]$ for which the total loss $\sum_{t=1}^T \ell(p,x_t)$ is significantly smaller than $TL$. We do not define directly the concept of a "significant" difference. Instead, we define our requirements from the prediction algorithm relative to the total loss incurred by the optimal value of $p$. We can then say that $\epsilon$ is a significant difference in the total prediction loss if there exists a prediction algorithm whose total loss is guaranteed to be smaller than $TL$ if $TL - \sum_{t=1}^T \ell(p,x_t) > \epsilon$.

We measure the performance of a prediction algorithm on a specific sequence $x_1,\ldots,x_T$ using the difference

$$\sum_{t=1}^T \ell(p_t,x_t) - \min_{p\in[0,1]}\sum_{t=1}^T \ell(p,x_t).$$

The worst-case regret of a prediction algorithm $A$ for sequences of length $T$ is defined to be the maximum of this difference:

$$R_{\mathrm{wc}}(A,T) := \max_{x^T}\left[\sum_{t=1}^T \ell(p_t,x_t) - \min_{p\in[0,1]}\sum_{t=1}^T \ell(p,x_t)\right]. \quad (4)$$
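For small $T$, the maximum in Eq. (4) can be computed by brute force. The sketch below does this for the log loss and the rule $p_t = (n+1/2)/(t+1)$ (Eq. (11) later in the paper); for the log loss the best fixed model's total loss is $T$ times the binary entropy of the fraction of 1's:

```python
import itertools
import math

def log_loss(p, x):
    return -x * math.log(p) - (1 - x) * math.log(1 - p)

def best_fixed_loss(bits):
    # min_p sum_t loss(p, x_t) = T * H(h_hat) for the log loss
    T, n = len(bits), sum(bits)
    h = n / T
    if h in (0.0, 1.0):
        return 0.0
    return T * (-h * math.log(h) - (1 - h) * math.log(1 - h))

def worst_case_regret(T):
    """Exact R_wc of the rule p_t = (n + 1/2) / (t + 1), by enumeration."""
    worst = 0.0
    for bits in itertools.product([0, 1], repeat=T):
        n, alg = 0, 0.0
        for t, x in enumerate(bits):
            alg += log_loss((n + 0.5) / (t + 1), x)
            n += x
        worst = max(worst, alg - best_fixed_loss(bits))
    return worst
```

For this rule the maximum tends to be attained at the extreme all-zeros and all-ones sequences.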

We denote the argument that minimizes the second term by $\hat p$. Note that for the log loss and for the square loss $\hat p$ is equal to the fraction of 1's in $x^T$. We refer to the fraction of 1's in $x^T$ as the empirical distribution and denote it by $\hat h$. Clearly, $\hat h$ is always a rational number of the form $i/T$ for some integer $i$ in the range $0,\ldots,T$. For the absolute loss, $\hat p$ is 1 if $\hat h > 1/2$ and is 0 otherwise. As we shall see, we can prove slightly better bounds on the regret if we allow a dependence on $\hat h$; we therefore define the quantity

$$R_{\mathrm{wc}}(A,T,\hat h) := \max_{x^T:\;\hat h(x^T)=\hat h}\left[\sum_{t=1}^T \ell(p_t,x_t) - \min_{p\in[0,1]}\sum_{t=1}^T \ell(p,x_t)\right]. \quad (5)$$

Consider the situation in which the sequence is generated by a random source and the analysis is done in terms of the worst-case regret. We have that

$$R_{\mathrm{wc}}(A,T) = \max_{x^T}\max_{\hat p\in[0,1]}\sum_{t=1}^T\left(\ell(p_t,x_t)-\ell(\hat p,x_t)\right) \quad (6)$$

$$\ge \max_p E_{x^T\sim p^T}\max_{\hat p\in[0,1]}\sum_{t=1}^T\left(\ell(p_t,x_t)-\ell(\hat p,x_t)\right) \quad (7)$$

$$\ge \max_p E_{x^T\sim p^T}\sum_{t=1}^T\left(\ell(p_t,x_t)-\ell(p,x_t)\right) \quad (8)$$

$$= \max_p R_{\mathrm{av}}(A,T,p). \quad (9)$$

The first inequality follows from replacing a maximum with an average, and the second inequality follows from replacing the optimal per-sequence choice of $\hat p$ with a global choice of $p$.

We thus see that the worst-case regret is an upper bound on the average-case regret. This relation is not very surprising. However, the surprising result, which we prove in this paper, is that for the class of biased coins and with respect to the log loss the worst-case regret is only very slightly larger than the average-case regret, as described in Eq. (2). In this paper we prove the following bound on the worst-case regret of the BJ algorithm for this problem. For any $\epsilon > 0$ and for any $0 < a < 1$:

$$\lim_{T\to\infty}\left[\max_{\hat h(x^T)\in[\epsilon/T^a,\,1-\epsilon/T^a]} R_{\mathrm{wc}}(BJ,T,\hat h) - \frac12\ln T\right] \le \frac12\ln\frac{\pi}{2}. \quad (10)$$

Observe that the difference between this bound and the bound given in Eq. (2) is just 1/2.

We also show that this tiny gap is necessary. We do this by calculating the min–max optimal algorithm for the worst-case regret and showing that its asymptotic difference from $(1/2)\ln T$ is also $(1/2)\ln(\pi/2)$. The only asymptotic advantage of the min–max algorithm is for sequences with empirical distribution very close to either 0 or 1, in which case the regret is larger by an additional 1/2. Thus the algorithm suggested by the logical Bayesian analysis is also (almost) optimal with respect to the worst-case regret. This result complements the results of Xie and Barron [38].

This result also merges nicely with the method of stochastic complexity advocated by Rissanen [28]. That is because any prediction algorithm can be translated into a coding algorithm and vice versa (for example, using arithmetic coding [29]). The length of the code for a sequence $x^T$ is equal (within one bit) to the cumulative log loss of the corresponding prediction algorithm on the same sequence. Thus our result means that the Bayes method using Jeffreys' prior is the optimal universal coding algorithm for the model class of biased coins, including the additive constant in the expression for the min–max redundancy, for all sequences but those whose empirical distribution is very extreme.

Thus, if our goal is to minimize the cumulative log loss or the total code length, then stochastic complexity, logical Bayesianism and worst-case prediction all suggest using the same algorithm and all achieve essentially identical bounds. However, if we are interested in a loss function different from the log loss, then the worst-case methodology suggests an algorithm that differs significantly from the Bayes algorithm. Specifically, we show that the algorithm suggested for $\ell_2(p,x)$ is very different from the algorithm that is suggested by any Bayesian approach. Moreover, we


give an example for which the worst-case regret of the Bayesian algorithm is significantly larger than that of the exponential weights algorithm.⁶ The prediction algorithms presented in this paper are very efficient. The prediction rule is a function only of $t$, the number of bits that have been observed, and $n$, the number of bits that were equal to 1. The suggested prediction rule for the cumulative log loss is

$$p_t = \frac{n+1/2}{t+1} \quad (11)$$

and a possible prediction rule for the square loss is

$$p_t = \frac{n+1/2}{t+1} + \frac14\ln\frac{\operatorname{erf}\!\left(\sqrt{\tfrac{2}{t+1}}\,(n+1)\right)+\operatorname{erf}\!\left(\sqrt{\tfrac{2}{t+1}}\,(t-n)\right)}{\operatorname{erf}\!\left(\sqrt{\tfrac{2}{t+1}}\,n\right)+\operatorname{erf}\!\left(\sqrt{\tfrac{2}{t+1}}\,(t+1-n)\right)} \approx \frac{n+1/2}{t+1}, \quad (12)$$

where

$$\operatorname{erf}(p) := \frac{2}{\sqrt{\pi}}\int_0^p e^{-x^2}\,dx$$

is a rescaling of the cumulative distribution function of the normal distribution, and the approximation is a first-order approximation for small values of $t$ and $n \approx t/2$. Both prediction rules are close to $p_t = n/t$ for large $n$ and $t$, but are slightly different for the initial part of the sequences.
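Both rules take constant time per trial; a minimal sketch in Python (the function names are mine):

```python
import math

def predict_log(n, t):
    # Eq. (11): prediction for the cumulative log loss
    return (n + 0.5) / (t + 1)

def predict_square(n, t):
    # Eq. (12): erf-based prediction for the square loss
    a = math.sqrt(2.0 / (t + 1))
    num = math.erf(a * (n + 1)) + math.erf(a * (t - n))
    den = math.erf(a * n) + math.erf(a * (t + 1 - n))
    return (n + 0.5) / (t + 1) + 0.25 * math.log(num / den)
```

Both rules start at $1/2$ when $t = 0$ and approach $n/t$ for large $n$ and $t$, as the text states.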

The paper is organized as follows. In Section 2 we review the exponential weights prediction algorithm and the well-known bound for the case where the number of models is finite. In Section 3 we show how the algorithm and its analysis can be extended to the case in which the class of models is uncountably infinite. In Section 4 we present our basic bound, which is based on the Laplace method of integration. In Section 5 we apply our method to the case of the log-loss and in Section 6 we compare our bound to other bounds regarding the cumulative log loss. In Section 7 we apply our method to the square loss and in Section 8 we briefly review what is known about the absolute loss. We conclude with some general comments and open problems in Section 9. Details of the proofs are given in Appendices A, B, and C.

2. The algorithm

The algorithm we study is a direct generalization of the "aggregating strategy" of Vovk [34,35], and the "Weighted Majority" algorithm of Littlestone and Warmuth [26]. We refer to it as the "exponential weights" algorithm and denote it by EW. We denote a class of models by $\mathcal{P}$. In this section we assume that $\mathcal{P}$ is a finite set of values in $[0,1]$ whose elements are $p_1,\ldots,p_N$. We denote the cumulative loss of the model $p_i$ at time $t$ by $L(p_i,t) = \sum_{t'=1}^t \ell(p_i,x_{t'})$.

6 It is a trivial observation that any prediction algorithm can be viewed as a Bayesian algorithm if the prior distribution is defined over the set of sequences, rather than over the set of models. However, this observation is of little value, because it does not suggest an interesting way for finding this distribution or for calculating the predictions that would be generated by using it.

The algorithm is simple. It receives two positive real numbers $\eta$ and $c$ as parameters. With each model in the class, at each time step $t$, it associates a weight $\omega_{t,i} = \exp(-\eta L(p_i,t))$. The initial weights sum to 1, and, when the set of models is finite, the initial weights are usually set to $\omega_{1,i} = 1/N$ for all models $i=1,\ldots,N$. The prediction of the algorithm at time $t$, which we denote by $\phi_t$, is a function of the weights associated with the models at that time. Before describing how the predictions are chosen, we describe the bound on the total loss of the algorithm. This might seem backwards, but the choice of prediction is trivial once the bound is given. The bound on the total loss of the algorithm, for every time step $t$, is of the form

$$\sum_{t=1}^T \ell(\phi_t,x_t) \le -c\ln\sum_{i=1}^N \omega_{T+1,i}. \quad (13)$$

If we want this bound to hold at all times for all sequences, it is clear how to make the predictions. The prediction should be chosen so that for both possible values of $x_{t+1}$ the bound will hold at $t+1$ given that it holds at $t$. Specifically, this means that the prediction $\phi_{t+1}$ is chosen so that

$$-c\ln\sum_{i=1}^N \omega_{t,i} + \ell(\phi_t,1) \le -c\ln\sum_{i=1}^N \omega_{t+1,i} \quad\text{and}\quad -c\ln\sum_{i=1}^N \omega_{t,i} + \ell(\phi_t,0) \le -c\ln\sum_{i=1}^N \omega_{t+1,i}. \quad (14)$$

The way this bound is usually used is to observe that if the total loss of the best model after observing all of $x^T$ is $L^*_T := \min_i L(p_i,T)$, then the weight associated with the best model, and thus also the total weight, are lower bounded by $\exp(-\eta L^*_T)$. Plugging this into Eq. (13), we find, after some simple algebra, that

$$\sum_{t=1}^T \ell(\phi_t,x_t) \le c\ln N + c\eta L^*_T. \quad (15)$$

Thus if $c\eta = 1$ we immediately get a nice simple bound on the worst-case regret of the algorithm:

$$R_{\mathrm{wc}}(EW,T) \le c\ln N. \quad (16)$$

It is thus also clear that we would like $c = 1/\eta$ to be as small as possible.
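For the log loss with $c = \eta = 1$, the prediction satisfying Eq. (14) is the weight-averaged model, and the bound $c\ln N = \ln N$ on the regret can be checked directly on any sequence; a sketch (the model grid and test sequence are arbitrary choices of mine):

```python
import math
import random

def ew_regret(models, bits):
    """Regret of EW with c = eta = 1 on the log loss over a finite model class."""
    N = len(models)
    w = [1.0 / N] * N                      # initial weights 1/N
    alg = 0.0
    for x in bits:
        phi = sum(wi * p for wi, p in zip(w, models)) / sum(w)   # prediction
        alg += -math.log(phi if x == 1 else 1 - phi)
        # multiplicative update: w_i <- w_i * exp(-log_loss(p_i, x))
        w = [wi * (p if x == 1 else 1 - p) for wi, p in zip(w, models)]
    n, T = sum(bits), len(bits)
    best = min(-(n * math.log(p) + (T - n) * math.log(1 - p)) for p in models)
    return alg - best

models = [0.1, 0.3, 0.5, 0.7, 0.9]         # an arbitrary finite class
random.seed(0)
bits = [random.randint(0, 1) for _ in range(200)]
```

Because the weights telescope, the algorithm's total loss equals $-\ln$ of the final total weight, which yields the $\ln N$ bound exactly.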

Haussler et al. [22] studied the problem of combining models for predicting a binary sequence in detail. They give a formula for calculating the minimal value of $c$ for any loss function within a broad class such that the bound (13) holds for $\eta = 1/c$. In this paper we use their results for the log loss, the square loss and the absolute loss.

3. Uncountably infinite sets of models

We now apply the exponential weights algorithm to the case where the set of models is the set of all biased coins. The natural extension of the notion of the weights that are associated with each model in a finite class is to assume that there is a measure defined over the set of models $\mathcal{P}=[0,1]$,⁷ and as the initial weights sum to one, the initial measure, denoted by $\mu_1$, is a probability measure. We shall sometimes refer to this initial probability measure as the prior. For $t>1$ we define a measure $\mu_t(A)$ as follows:

$$\mu_{t+1}(A) := \int_A e^{-(1/c)\ell(x_t,p)}\,d\mu_t(p),$$

where $d\mu_t(p)$ denotes integration with respect to the measure $\mu_t$ over $p\in A\subseteq\mathcal{P}$. Similarly to the prediction rule given in Eq. (14), the prediction at time $t$ is any $\phi_t\in[0,1]$ which satisfies

$$\ell(0,\phi_t) \le -c\ln\left(\frac{\int_0^1 e^{-(1/c)\ell(0,p)}\,d\mu_t(p)}{\int_0^1 d\mu_t(p)}\right) \quad\text{and}\quad \ell(1,\phi_t) \le -c\ln\left(\frac{\int_0^1 e^{-(1/c)\ell(1,p)}\,d\mu_t(p)}{\int_0^1 d\mu_t(p)}\right). \quad (17)$$

The bound that one is guaranteed in this case is

$$\sum_{t=1}^T \ell(x_t,\phi_t) \le -c\ln\left[\int_0^1 d\mu_{T+1}(p)\right]. \quad (18)$$

Interestingly, the set of pairs $(c,\eta)$ for which the bound of Eq. (18) is guaranteed is identical to the set for which Eq. (13) holds. The proofs are also identical; one has only to replace sums by integrals in the proofs given by Haussler et al. However, it is not immediately clear how to relate this bound on the total loss to the worst-case regret. As the number of models is infinite, a bound of the form $c\log N$ is meaningless and we need a different bound. In the rest of the paper we develop a bound which is appropriate for the model class of the biased coins and is based on the method of Laplace integration.

From Eq. (18) we get the following bound on the worst-case regret:

$$R_{\mathrm{wc}}(EW,T) \le \max_{\hat h=i/T,\;i=0,\ldots,T}\left[-c\ln\left(\int_0^1 d\mu_{T+1}(p)\right) - L(\hat h,T)\right]. \quad (19)$$

Notice now that in the case of the biased coin the cumulative loss of model $p$ on the sequence $x^T$ can be written in the form

$$L(p,T) = T\left[\hat h\,\ell(p,1) + (1-\hat h)\,\ell(p,0)\right].$$

Using this expression for both $L(\hat h,T)$ and $\mu_{T+1}(p)$, and rewriting the second term (which is always positive) in an exponential form, we get

7 A measure over a space $X$ is a function from the set of measurable sets (in our case, the Borel sets in $X=[0,1]$) to the real numbers in the range $[0,1]$. A probability measure is a measure that assigns the value 1 to the set that consists of the whole domain.


$$R_{\mathrm{wc}}(EW,T) \le \max_{\hat h=i/T,\;i=0,\ldots,T}\left\{-c\ln\int_0^1 \exp\left(-\frac{T}{c}\left[\hat h\,\ell(p,1)+(1-\hat h)\,\ell(p,0)\right]\right)d\mu_1(p) + c\ln\exp\left(-\frac{T}{c}\left[\hat h\,\ell(\hat p,1)+(1-\hat h)\,\ell(\hat p,0)\right]\right)\right\}$$

and we can combine the exponents of the two terms and get:

$$R_{\mathrm{wc}}(EW,T) \le \max_{\hat h=i/T,\;i=0,\ldots,T}\left[-c\ln\int_0^1 e^{-Tg(\hat h,p)}\,d\mu_1(p)\right], \quad (20)$$

where

$$g(\hat h,p) := \frac{1}{c}\left[\hat h\left(\ell(p,1)-\ell(\hat p,1)\right) + (1-\hat h)\left(\ell(p,0)-\ell(\hat p,0)\right)\right]. \quad (21)$$

We refer to $g(\hat h,p)$ as the gap function. The gap function is proportional to the additional loss-per-trial that the model $p$ suffers over and above the model $\hat p$, which is the optimal model for any sequence whose empirical distribution is $\hat h$. Thus the exponent of the integral in Eq. (20) is zero when $p$ is an optimal model and is negative elsewhere.

4. Laplace method of integration

In this section we describe a general method for calculating the integral in Eq. (20). The derivation given in this section was done independently, for a much more general scenario, by Yamanishi in [40]. However, as we shall see, this method, by itself, is not sufficient to prove a bound on the worst-case regret. Later in this paper we describe the additional steps required to do that. We require that the loss function $\ell(p,x)$ has the following three properties. It is easy to verify that the log loss and the square loss over the model class of biased coins have properties 2 and 3. The proof that property 1 holds for these loss functions can be found in [22].

1. $\ell(p,x)$ satisfies Eq. (16) for some $0 < c < \infty$ with $\eta = 1/c$.
2. For all values of $x$, $\ell(p,x)$ has a continuous second derivative as a function of $p$.
3. There exists a function $\hat p:[0,1]\to[0,1]$ such that the unique optimal model for any sequence $x^T$ whose empirical distribution is $\hat h$ is $\hat p(\hat h)$. We use $\hat p$ to denote $\hat p(\hat h)$ when $\hat h$ is clear from the context.

The setting of the EW algorithm we suggest is to use the constants $c$ and $\eta = 1/c$ defined in condition 1 and to use as the initial probability measure the following density measure:

$$\mu_1(A) = \int_A \omega(x)\,dx, \quad (22)$$

where

$$\omega(x) = \frac{1}{Z}\sqrt{\left.\frac{\partial^2}{\partial p^2}g(x,p)\right|_{p=\hat p(x)}}$$

and

$$Z = \int_0^1 \sqrt{\left.\frac{\partial^2}{\partial p^2}g(x,p)\right|_{p=\hat p(x)}}\,dx. \quad (23)$$

The following theorem gives a bound on the performance of this algorithm:

Theorem 1. For any fixed $0 < \hat h < 1$, the loss suffered by the exponential weights algorithm described above on any sequence $x^T$ whose empirical distribution is $\hat h$ satisfies

$$R_{\mathrm{wc}}(A,T,\hat h) \le \frac{c}{2}\ln\frac{T}{2\pi} + c\ln Z + O(1/T), \quad (24)$$

where $c$ and $Z$ are as defined above.

This bound is almost a bound on the worst-case regret. However, it is an asymptotic result which applies only to sets of finite sequences in which all the sequences have the same empirical distribution, $\hat h$. Of course, any sequence has some empirical distribution, and so it belongs to some set of sequences for which the theorem holds. However, the term $O(1/T)$ might have a hidden dependence on $\hat h$.⁸ What we need is a uniform bound, i.e. a bound that does not have any dependence on properties of the sequence. To get such a bound we need a more refined analysis which, at this point, we know how to do only for the special cases described in later sections. However, Theorem 1 is important because it bounds the regret for important sets of sequences, and because it suggests a choice for the initial probability measure.

The proof of Theorem 1 is based on the Laplace method of integration, which is a method for approximating integrals of the form

$$\int_a^b f(t)\,e^{-Th(t)}\,dt \quad (25)$$

for large values of $T$, when $f$ and $h$ are sufficiently smooth functions from $[a,b]$ to the reals.⁹ The intuition behind this method is that for large $T$ the contribution of a small neighborhood of $t_{\min} := \operatorname{argmin}_t h(t)$ dominates the integral. Thus, by using a Taylor expansion of $h(t)$ around $t = t_{\min}$ one can get a good estimate of the integral. The dependence of the contribution of the maximum on $T$ depends on whether the maximum is also a point of derivative zero. This is always the case if $a < t_{\min} < b$; in that case we have¹⁰

8 As we show later, such a dependence does indeed exist, but it vanishes as $T\to\infty$ if $p\in[\epsilon,1-\epsilon]$ for any fixed $\epsilon>0$.
9 For a good description of the Laplace method, see Chapter 4 of De Bruijn's book [13] or Chapter 2 of Murray's book [27].
10 See the derivation of Equation (2.33) in [27].


$$\int_a^b f(t)\,e^{-Th(t)}\,dt = f(t_{\min})\,e^{-Th(t_{\min})}\sqrt{\frac{2\pi}{T\,\frac{d^2}{dt^2}h(t)\big|_{t=t_{\min}}}} + O(T^{-3/2}). \quad (26)$$

We can now prove the theorem.

Proof of Theorem 1. We fix an empirical distribution $\hat h$ and let $x^T$ be any sequence whose empirical distribution is $\hat h$. We use the fact that $\mu_1$ is defined by the density function $\omega$ and rewrite the bound given in Eq. (20), without taking the maximum over the sequence:

$$\sum_{t=1}^T \ell(p_t,x_t) - \sum_{t=1}^T \ell(\hat p,x_t) \le -c\ln\left[\int_0^1 \exp\left(-Tg(\hat h,p)\right)\omega(p)\,dp\right].$$

This integral is of the form defined in Eq. (25), where $f(t)=\omega(t)$, $h(t)=g(\hat h,t)$, and $t_{\min}=\hat p$. From Watson's lemma we thus get that, for any fixed value of $\hat h$,

$$\int_0^1 e^{-Tg(\hat h,p)}\,d\mu_1(p) = \omega(\hat h)\sqrt{\frac{2\pi}{T\,\frac{\partial^2}{\partial p^2}g(\hat h,p)\big|_{p=\hat p(\hat h)}}} + O(T^{-3/2}). \quad (27)$$

In order to minimize the bound, we want the integral to be large. More precisely, we want to choose $\omega$ so as to maximize the minimal value achieved over all choices of $\hat h\in[0,1]$.¹¹ As $\omega$ is a distribution, it is easy to see that the minimum of the first term in the bound is maximized if the value of this term is equal for all values of $\hat h$. We thus arrive at the choice of $\omega$ given in Eq. (22), which exactly cancels the dependence on $\hat h$ of the first term. We thus get

$$\max_{\hat h\in[0,1]}\;\omega(\hat h)\sqrt{\frac{2\pi}{T\,\frac{\partial^2}{\partial p^2}g(\hat h,p)\big|_{p=\hat p(\hat h)}}} + O(T^{-3/2}) = \frac{1}{Z}\sqrt{\frac{2\pi}{T}}\left(1+O(1/T)\right),$$

and when we plug this estimate into the bound given in Eq. (20) we get the statement of the theorem. □

We now move on to show that the suggested exponential weights algorithm does indeed achieve a very strong bound on the worst-case regret for the log loss and for the square loss on the class of biased coins.

5. Log-loss

The loss function in this case is $\ell_{\log}(p,x) = -\log(1-|x-p|)$, the optimal parameters are $c = 1/\eta = 1$ and the optimal value of $p$ for a given sequence is $\hat p = \hat h$.

11 Actually, if we fix $T$, we need to consider only values of $\hat h$ of the form $i/T$. Indeed, we can find slightly better choices of $\omega$ for fixed values of $T$. However, our goal is to choose a single distribution that will work well for all large $T$. We thus have to consider all rational $\hat h$, which, as the second derivative of $h$ is continuous, is equivalent to considering all $\hat h\in[0,1]$.


It is easy to check that in this case the prediction rule

$$\phi_t = \frac{\int_0^1 p\,d\mu_t(p)}{\int_0^1 d\mu_t(p)}$$

satisfies Eq. (14) for the log loss. Note also that this rule is equivalent to the Bayes optimal prediction rule using the prior distribution $\mu_1$.

The gap function in this case is the KL-divergence:

$$g(\hat h,p) = -\hat h\log\frac{p}{\hat h} - (1-\hat h)\log\frac{1-p}{1-\hat h} = D_{KL}\!\left(\hat h\,\big\|\,p\right).$$

The second derivative of $g$ is the Fisher information:

$$\left.\frac{\partial^2}{\partial p^2}g(x,p)\right|_{p=\hat p(x)} = \frac{1}{x(1-x)}.$$

So the optimal prior is

$$\omega(p) = \frac{1}{Z}\sqrt{\left.\frac{\partial^2}{\partial p^2}g(x,p)\right|_{p=\hat p(x)}} = \frac{1}{\pi\sqrt{p(1-p)}}. \quad (28)$$

This prior is the Jeffreys prior for this model class; thus the algorithm suggested in this case is the Bayes algorithm using Jeffreys' prior (BJ).
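For the biased-coin class, the BJ prediction is the posterior mean of $p$ under the Jeffreys prior, and it has the closed form of Eq. (11). The sketch below checks this numerically; the substitution $p = \sin^2\theta$, which removes the endpoint singularities of the prior, is my implementation choice:

```python
import math

def bj_prediction(n, t, steps=20000):
    """Posterior mean of p under the Jeffreys prior after n ones in t flips."""
    num = den = 0.0
    for k in range(steps):
        theta = (k + 0.5) * (math.pi / 2) / steps
        p = math.sin(theta) ** 2          # p = sin^2(theta) absorbs the prior density
        like = p ** n * (1 - p) ** (t - n)
        num += p * like
        den += like
    return num / den
```

The closed form is $(n+1/2)/(t+1)$, i.e., exactly the rule of Eq. (11).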

To bound the worst-case regret we calculate the integral of Eq. (18) which, in this case, is equal to

$$\int_0^1 \omega(p)\,e^{-T D_{KL}(\hat h\|p)}\,dp.$$

Details of this calculation will be given in Appendix A. Here we state the resulting bound.

Theorem 3. The regret of the exponential weights algorithm over the class of biased coins, which uses the prior distribution described in Eq. (28), with respect to the log loss is bounded, for any $T\ge 1$, by

$$R_{\mathrm{wc}}(EW,T,\hat h) \le \frac12\ln(T+1) + \frac12\ln\frac{\pi}{2} - \frac{1}{360(T+1)^3} + \frac{1}{2+T\min(\hat h,1-\hat h)}, \quad (29)$$

which implies that

$$R_{\mathrm{wc}}(EW,T) \le \frac12\ln(T+1) + 1. \quad (30)$$
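Since the BJ predictions are given by the simple rule of Eq. (11), the uniform bound (30) is easy to spot-check on individual sequences, including the extreme all-zeros and all-ones ones; a sketch:

```python
import math
import random

def bj_regret(bits):
    """Per-sequence log-loss regret of BJ, whose predictions follow Eq. (11)."""
    n, alg = 0, 0.0
    for t, x in enumerate(bits):
        p = (n + 0.5) / (t + 1)
        alg += -math.log(p if x == 1 else 1 - p)
        n += x
    T, h = len(bits), sum(bits) / len(bits)
    best = 0.0 if h in (0.0, 1.0) else T * (-h * math.log(h) - (1 - h) * math.log(1 - h))
    return alg - best

random.seed(1)
random_bits = [random.randint(0, 1) for _ in range(100)]
bound = 0.5 * math.log(101) + 1            # Eq. (30) with T = 100
```

The regret is nonnegative on every sequence because BJ's cumulative loss equals minus the log of a mixture probability, which is at most the best model's probability.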

The last inequality shows that the bound on the regret of the exponential weights algorithm holds uniformly for all $\hat h\in[0,1]$, i.e., for all sequences. It is worthwhile to consider the more precise bound given in Inequality (29). If for some fixed $\epsilon>0$ we have that $\hat h\in[\epsilon,1-\epsilon]$, then, for $T\to\infty$, the last two terms converge to zero and the bound converges to $(1/2)\ln T+(1/2)\ln(\pi/2)$. This bound also holds if $\hat h\in[\epsilon/T^a,\,1-\epsilon/T^a]$ for any $a<1$. However, if $\hat h=0$, we get a slightly larger bound of $(1/2)\ln T+(1/2)\ln(\pi/2)+(1/2)$.¹² Finally, if $\hat h=\Theta(1/T)$ then we get an intermediate bound.

6. Comparison to other results regarding log-loss

It is interesting to compare these bounds to the ones given by Xie and Barron [38]. They analyze the same algorithm on a very similar problem, but they consider the expected regret and not the worst-case regret. As was shown in the introduction, the worst-case regret upper bounds the average-case regret. However, our definition of regret is much stronger than theirs, because we make no probabilistic assumption about the mechanism that is generating the sequence. It is therefore surprising that the bounds that we get are so very close to their bounds.

From the arguments given in the previous section we get that Theorem 3 implies the bound given in Eq. (10). The difference between this bound and the bound derived by Xie and Barron [38] described in Eq. (2) is $(1/2)\ln(\pi/2)-(1/2)\ln(\pi/2e) = 0.5$ nats $= 0.721$ bits. In other words, knowing that the sequence is actually generated by independent random draws of a random coin is worth less than one bit!

As Xie and Barron show, the BJ algorithm is not an asymptotically optimal algorithm with respect to the average regret. That is because on the endpoints $p=0$ and $p=1$ the loss is larger than for the interior points, and this difference does not vanish as $T\to\infty$. In order to achieve the asymptotic min–max value they suggest multiplying Jeffreys' prior by $1-2g$, where $g > O(T^{-a})$ for some $a>1/2$, and putting a probability mass of $g$ on each of the two points $c/T$ and $1-c/T$ for some constant $c$. This reduces the asymptotic average regret at the endpoints to the asymptotically optimal value without changing the asymptotic average regret in the interior points. Similar observations and the same fix hold for the worst-case regret.

As we show at the end of the last section, the regret of algorithm BJ on the interior of $[0,1]$ is $(1/2)\ln T+(1/2)\ln(\pi/2)$, larger by 1/2 than the optimal performance with respect to the average regret. We now show that this small gap cannot be removed. We do this by calculating the regret of the min–max optimal algorithm for the worst-case regret with respect to the log loss.

We use a rather well known lemma, stated for instance in Shtarkov [31,32] and in Cesa-Bianchi et al. [6]. The lemma states that the min–max optimal prediction algorithm defines the following distribution over the set of sequences of length $T$:

$$P(x^T) = \frac{1}{Z}\max_p P_p(x^T), \quad\text{where}\quad Z = \sum_{x^T}\max_p P_p(x^T), \quad (31)$$

and that the regret suffered by this optimal algorithm on any sequence is equal to $\ln Z$. Using this we can explicitly calculate the worst-case regret of the min–max optimal algorithm (this result was previously shown by Shtarkov [31]).

12 The actual asymptotic value of the regret (both worst-case and average-case) of the BJ algorithm for $\hat h=0$ or $\hat h=1$ is slightly smaller: $(1/2)\ln T+(1/2)\ln\pi$.


Lemma 1. The worst-case regret of the min–max optimal prediction algorithm for sequences of length $T$, which we denote by $MM_T$, with respect to the class of biased coins and the log loss, is

$$R_{\mathrm{wc}}(MM_T,T) = \ln\left[\sum_{i=0}^T \binom{T}{i} e^{-T H(i/T)}\right] = \frac12\ln(T+1) + \frac12\ln(\pi/2) - O(1/T), \quad (32)$$

where $H(q) = -q\ln q - (1-q)\ln(1-q)$ is the binary entropy.
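The quantity $\ln Z$ in Lemma 1 is a finite sum, so the asymptotic expression in Eq. (32) can be compared against the exact value directly; a sketch (the choice $T=200$ is arbitrary):

```python
import math

def shtarkov_ln_z(T):
    """ln Z with Z = sum_i C(T, i) e^{-T H(i/T)}, H the binary entropy in nats."""
    Z = 0.0
    for i in range(T + 1):
        log_term = 0.0
        if i > 0:
            log_term += i * math.log(i / T)
        if i < T:
            log_term += (T - i) * math.log(1 - i / T)
        Z += math.comb(T, i) * math.exp(log_term)
    return math.log(Z)

T = 200
exact = shtarkov_ln_z(T)
asymptotic = 0.5 * math.log(T + 1) + 0.5 * math.log(math.pi / 2)
```

At $T=200$ the exact value and the asymptotic expression already agree to within a few thousandths of a nat.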

The proof is given in Appendix B.

7. Square loss

In this section we consider the loss function $\ell(x,p)=(x-p)^2$. As was shown by Vovk [35] and Haussler et al., the optimal parameters in this case are $c=1/2$, $\eta=2$ and the optimal model is $\hat p=\hat h$. The gap function in this case is

$$g(\hat h,p) = 2\left[\hat h\left((1-p)^2-(1-\hat h)^2\right) + (1-\hat h)\left((0-p)^2-(0-\hat h)^2\right)\right] = 2(p-\hat h)^2,$$

and its second derivative is a constant:

$$g''(p) = 4.$$

So the optimal prior is the uniform distribution:

$$\omega(p) = 1.$$

To bound the worst-case regret we calculate the integral of Eq. (18) which, in this case, is equal to

$$\int_0^1 \exp\left(-2T(p-\hat h)^2\right)dp.$$

Details are given in Appendix C. The resulting bound is:

Theorem 4. The regret of the exponential weights algorithm over the class of biased coins, which uses the uniform prior distribution, with respect to the square loss is bounded, for any $T\ge 1$, by

$$R_{\mathrm{wc}}(EW,T,\hat h) \le \frac14\ln T + \frac12\ln\frac{2}{\operatorname{erf}\!\left(\hat h\sqrt{2T}\right)+\operatorname{erf}\!\left((1-\hat h)\sqrt{2T}\right)} - \frac14\ln\frac{\pi}{2}, \quad (33)$$

which implies

$$R_{\mathrm{wc}}(EW,T) \le \frac14\ln T + \frac12\ln\frac{2}{\operatorname{erf}(\sqrt 2)} - \frac14\ln\frac{\pi}{2}. \quad (34)$$

Similar to the detailed analysis of the log-loss case, Inequality (34) gives us a uniform upper bound that does not depend on the sequence, while if we assume that $\hat h\in[\epsilon,1-\epsilon]$ for some constant $\epsilon>0$ then Inequality (33) gives a slightly better asymptotic bound. In the second case each of the two $\operatorname{erf}(\cdot)$ functions converges to 1, so the second term vanishes and we are left with the negative term $-(1/4)\ln(\pi/2)$. If $\hat h=0$ or $\hat h=1$ then only one of the two $\operatorname{erf}(\cdot)$ terms converges to 1 while the other remains 0, and we get an additional term of $\ln(2)/2$ in the regret. For the square loss the restriction on the distance between $\hat h$ and 0 (or 1) is a bit stronger than in the log-loss case. Here we have that if $\hat h\in[\epsilon/T^a,\,1-\epsilon/T^a]$ for some $a<1/2$ then the better bound holds, and if $\hat h=\Theta(1/T)$ then we get a bound that is between the interior bound and the bound for $\hat h=0$.

Two comments are in order. First, although we have not concerned ourselves with the computational efficiency of our algorithms, both the log-loss version and the square-loss version require small constant time to calculate the predictions, whose formulas are given in Eqs. (11) and (12). Second, it is not hard to give examples of finite model classes in which using the EW algorithm is much better than using any Bayes algorithm when the data is generated by a model outside the class (see Appendix D). We conjecture that such an example exists also for the continuous model class $\mathcal{P}=[0,1]$.

8. Absolute loss

In this section we consider the absolute loss, which has very different properties than the log loss and the square loss. Haussler et al. [22] show that there is no finite value of $c$ such that the bound of Eq. (13) holds for $\eta=1/c$. Moreover, as was shown by Cesa-Bianchi et al. [6], there is no prediction algorithm whose worst-case regret does not depend on the loss of the best model. On the other hand, there are choices of $c$ and $\eta>1/c$ for which Eq. (13) holds, and Cesa-Bianchi et al. have shown that, based on this fact, an exponential weights algorithm for finite model classes can be devised whose worst-case regret is bounded by $O\!\left(\sqrt{\ln N\cdot L^*_T}\right)$.
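As one concrete instance of such a finite-class construction, the sketch below runs an exponentially weighted average forecaster over the two constant models $p=0$ and $p=1$; the tuning $\eta=\sqrt{8\ln N/T}$ and the pathwise regret bound $\sqrt{(T/2)\ln N}$ are the standard ones for convex losses in $[0,1]$ (following the general analysis of [6], not a formula from this paper):

```python
import math
import random

def two_model_regret(bits):
    """Exponentially weighted average of the models p=0 and p=1, absolute loss."""
    T = len(bits)
    eta = math.sqrt(8 * math.log(2) / T)   # standard tuning for N = 2 models
    L0 = L1 = alg = 0.0                    # cumulative losses: model 0, model 1, algorithm
    for x in bits:
        m = min(L0, L1)                    # shift for numerical stability
        w0 = math.exp(-eta * (L0 - m))
        w1 = math.exp(-eta * (L1 - m))
        phi = w1 / (w0 + w1)               # weighted-average prediction in [0, 1]
        alg += abs(x - phi)
        L0 += x                            # model p=0 loses |x - 0| = x
        L1 += 1 - x                        # model p=1 loses |x - 1| = 1 - x
    return alg - min(L0, L1)

random.seed(0)
bits = [random.randint(0, 1) for _ in range(1000)]
```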

As we cannot choose c finite and $\eta = 1/c$ for the multiplicative weights algorithm, we cannot use the technique of Theorem 1 for this case. However, we do not need to use an infinite set of models in our analysis for this case. This is because in this case the optimal model in the class P = [0, 1] is always either p = 0 or p = 1. Thus we can consider an EW algorithm that combines only these two models and get close to optimal bounds on the regret.

9. Conclusions and open problems

We have demonstrated, in a simple case, that the Bayes algorithm that has been shown by Xie and Barron to be optimal with respect to the average-case regret is also optimal with respect to the worst-case regret. Moreover, the bound on the worst-case regret is only very slightly worse than the average-case regret.

We have also shown that a very different algorithm results if one is interested in the square loss, rather than in the log loss.

These results give evidence that sometimes accurate statistical inference can be done without assuming that the world is random. In recent years the results given here have been generalized to much more complex classes of distributions, most notably by Yamanishi [40], Haussler and Opper [23], and Cesa-Bianchi and Lugosi [7].


Acknowledgments

Special thanks to Sebastian Seung for his help on understanding and solving the integral equations used in this paper. Thanks to Andrew Barron, Meir Feder, David Haussler, Rob Schapire, Volodya Vovk, Qun Xie, and Kenji Yamanishi for helpful discussions and suggestions.

Appendix A. Proof of Theorem 3

We want to calculate the following integral:

$$\int_0^1 x(p)\, e^{-T D_{KL}(\hat\theta \,\|\, p)}\, dp,$$

which we can expand and write as follows:

$$\int_0^1 \frac{1}{\pi\sqrt{p(1-p)}}\, \exp\!\left(T\hat\theta\log\frac{p}{\hat\theta} + T(1-\hat\theta)\log\frac{1-p}{1-\hat\theta}\right) dp
= \frac{1}{\pi\, \hat\theta^{T\hat\theta}(1-\hat\theta)^{T(1-\hat\theta)}} \int_0^1 p^{T\hat\theta - 1/2}(1-p)^{T(1-\hat\theta)-1/2}\, dp.$$

Luckily, the last integral is a well studied quantity, called the Beta function. More specifically, it is $B(T\hat\theta + 1/2,\, T(1-\hat\theta)+1/2)$, which can also be expressed in terms of the Gamma function, $B(x,y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$ (see footnote 13). Using these relations we get

$$\int_0^1 x(p)\, e^{-T D_{KL}(\hat\theta \,\|\, p)}\, dp = \frac{B(T\hat\theta+1/2,\, T(1-\hat\theta)+1/2)}{\pi\, \hat\theta^{T\hat\theta}(1-\hat\theta)^{T(1-\hat\theta)}} = \frac{\Gamma(T\hat\theta+1/2)\,\Gamma(T(1-\hat\theta)+1/2)}{\pi\,\Gamma(T+1)\, \hat\theta^{T\hat\theta}(1-\hat\theta)^{T(1-\hat\theta)}}.$$

Plugging this formula into Eq. (20) we get

$$\ln\Gamma(T+1) + \ln\pi + T\hat\theta\ln\hat\theta + T(1-\hat\theta)\ln(1-\hat\theta) - \ln\Gamma(T\hat\theta+1/2) - \ln\Gamma(T(1-\hat\theta)+1/2).$$
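The Beta/Gamma identity above is easy to verify numerically with Python's `math.lgamma`; the sketch below uses illustrative parameter choices (T = 20, $\hat\theta$ = 0.3) and also checks that the resulting regret respects the $(1/2)\ln(T+1)+1$ bound of Theorem 3:

```python
import math

T, theta = 20, 0.3
a, b = T * theta, T * (1 - theta)

# Closed form via the Beta/Gamma identity.
closed = math.exp(math.lgamma(a + 0.5) + math.lgamma(b + 0.5) - math.lgamma(T + 1)
                  - math.log(math.pi) - a * math.log(theta) - b * math.log(1 - theta))

# Midpoint-rule evaluation of the integral of x(p) exp(-T KL(theta||p)) over [0,1],
# with Jeffreys prior x(p) = 1 / (pi * sqrt(p(1-p))).
n = 200000
s = 0.0
for k in range(n):
    p = (k + 0.5) / n
    s += p ** (a - 0.5) * (1 - p) ** (b - 0.5)
s /= n * math.pi * theta ** a * (1 - theta) ** b

assert abs(s - closed) / closed < 1e-4
# The resulting regret -ln(integral) obeys the (1/2) ln(T+1) + 1 bound.
assert -math.log(closed) <= 0.5 * math.log(T + 1) + 1
```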

13. Essentially, the Gamma function is an extension of the factorial to the reals and the Beta function is an extension of the reciprocal of the binomial coefficient.

The asymptotic expansion of $\ln\Gamma(z)$ for large values of $z$ can be used to give upper and lower bounds on this function for positive values of $z$ (see Eqs. (6.1.41) and (6.1.42) in [1]):

$$\left(z-\frac12\right)\ln z - z + \frac12\ln(2\pi) + \frac{1}{12z} - \frac{1}{360z^3} \;\le\; \ln\Gamma(z) \;\le\; \left(z-\frac12\right)\ln z - z + \frac12\ln(2\pi) + \frac{1}{12z}. \qquad (A.1)$$
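The two-sided bounds (A.1) can be checked directly against `math.lgamma` (a quick sketch; the sample points are arbitrary):

```python
import math

def stirling_lower(z):
    """Lower bound of (A.1)."""
    return ((z - 0.5) * math.log(z) - z + 0.5 * math.log(2 * math.pi)
            + 1 / (12 * z) - 1 / (360 * z ** 3))

def stirling_upper(z):
    """Upper bound of (A.1)."""
    return (z - 0.5) * math.log(z) - z + 0.5 * math.log(2 * math.pi) + 1 / (12 * z)

# The bounds hold for all positive z, including the half-integer
# arguments T*theta + 1/2 used in the proof.
for z in [0.5, 1.0, 2.5, 10.0, 100.0]:
    assert stirling_lower(z) <= math.lgamma(z) <= stirling_upper(z)
```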

Using these bounds we get the statement of the theorem:

$$\ln\Gamma(T+1) - \ln\Gamma(T\hat\theta+1/2) - \ln\Gamma(T(1-\hat\theta)+1/2) + \ln\pi + T\hat\theta\ln\hat\theta + T(1-\hat\theta)\ln(1-\hat\theta)$$
$$\le \left(T+\frac12\right)\ln(T+1) - (T+1) + \frac12\ln(2\pi) + \frac{1}{12(T+1)}$$
$$\quad - T\hat\theta\ln\!\left(T\hat\theta+\frac12\right) + \left(T\hat\theta+\frac12\right) - \frac12\ln(2\pi) - \frac{1}{12(T\hat\theta+1/2)} + \frac{1}{360(T\hat\theta+1/2)^3}$$
$$\quad - T(1-\hat\theta)\ln\!\left(T(1-\hat\theta)+\frac12\right) + \left(T(1-\hat\theta)+\frac12\right) - \frac12\ln(2\pi) - \frac{1}{12(T(1-\hat\theta)+1/2)} + \frac{1}{360(T(1-\hat\theta)+1/2)^3}$$
$$\quad + \ln\pi + T\hat\theta\ln\hat\theta + T(1-\hat\theta)\ln(1-\hat\theta)$$
$$= \frac12\ln(T+1) + \frac12\ln\frac{\pi}{2} + T\hat\theta\ln\frac{2T\hat\theta+2\hat\theta}{2T\hat\theta+1} + T(1-\hat\theta)\ln\frac{2T(1-\hat\theta)+2(1-\hat\theta)}{2T(1-\hat\theta)+1}$$
$$\quad + \frac{1}{12(T+1)} - \frac{1}{12(T\hat\theta+1/2)} - \frac{1}{12(T(1-\hat\theta)+1/2)} + \frac{1}{360(T\hat\theta+1/2)^3} + \frac{1}{360(T(1-\hat\theta)+1/2)^3}$$
$$\le \frac12\ln(T+1) + \frac12\ln\frac{\pi}{2} + \frac12 + \frac{2}{45} - \frac{1}{2+(T\min(\hat\theta,1-\hat\theta))^{-1}} \qquad (A.2)$$
$$\le \frac12\ln(T+1) + 1. \qquad (A.3)$$

Here the step to (A.2) writes $T\hat\theta\ln\frac{2T\hat\theta+2\hat\theta}{2T\hat\theta+1} = T\hat\theta\ln\frac{T+1}{T} - T\hat\theta\ln\bigl(1+\frac{1}{2T\hat\theta}\bigr)$ (and similarly for $1-\hat\theta$), bounds the contribution of the larger of $\hat\theta,1-\hat\theta$ by $\frac{T}{2}\ln\bigl(1+\frac1T\bigr)\le\frac12$, uses $\ln(1+x)\ge x/(1+x)$ for the smaller one, drops the negative combination of the $1/12$ terms, and bounds each $1/360$ term by $1/45$.

Appendix B. Proof of Lemma 1

Given that the bound on the regret of the min–max optimal algorithm is equal to $\ln Z_T$, where $Z_T$ is the normalization factor of the min–max distribution for sequences of length $T$, our goal is to calculate

$$\lim_{T\to\infty}\ln Z_T = \lim_{T\to\infty}\ln\sum_{x^T}\max_p P_p(x^T).$$

The probability assigned to a sequence $x^T$ by the model $p$ depends only on the number of 1's in $x^T$, i.e., on $T\hat\theta$. It is

$$P_p(x^T) = p^{T\hat\theta}(1-p)^{T(1-\hat\theta)} = \left(p^{\hat\theta}(1-p)^{1-\hat\theta}\right)^T. \qquad (B.1)$$

The value of $P_p(x^T)$ for a given $x^T$ is maximized when $p=\hat\theta$, thus we get that

$$Z_T = \sum_{x^T}\left(\hat\theta^{\hat\theta}(1-\hat\theta)^{1-\hat\theta}\right)^T.$$

As there are $\binom{T}{T\hat\theta}$ sequences of length $T$ in which the fraction of 1's is $\hat\theta$, and as $\hat\theta$ achieves the values $i/T$ for $i=0,\ldots,T$, we can rewrite the last equation in the following form:

$$Z_T = \sum_{i=0}^{T}\binom{T}{i}e^{-TH(i/T)}, \qquad (B.2)$$

where $H(p) = -p\ln p - (1-p)\ln(1-p)$ is the binary entropy of $p$. To approximate the value of Eq. (B.2) we replace $\binom{T}{i}$ by the equivalent expression $\Gamma(T+1)/(\Gamma(i+1)\Gamma(T-i+1))$, move the log of this expression to the exponent, and get

$$Z_T = \sum_{i=0}^{T}\exp\left(\ln\Gamma(T+1) - \ln\Gamma(i+1) - \ln\Gamma(T-i+1) - TH(i/T)\right). \qquad (B.3)$$
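Equations (B.2) and (B.3) give two ways of computing $Z_T$; the following sketch confirms numerically that they agree (`math.comb` and `math.lgamma` are implementation conveniences, not part of the original derivation):

```python
import math

def H(q):
    """Binary entropy in nats, with H(0) = H(1) = 0."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log(q) - (1 - q) * math.log(1 - q)

def z_direct(T):
    """Z_T via Eq. (B.2): binomial coefficient times exp(-T H(i/T))."""
    return sum(math.comb(T, i) * math.exp(-T * H(i / T)) for i in range(T + 1))

def z_gamma(T):
    """Z_T via Eq. (B.3): the whole summand moved into the exponent."""
    return sum(math.exp(math.lgamma(T + 1) - math.lgamma(i + 1)
                        - math.lgamma(T - i + 1) - T * H(i / T))
               for i in range(T + 1))

for T in [1, 5, 50, 500]:
    assert abs(z_direct(T) - z_gamma(T)) / z_direct(T) < 1e-9
```

Working entirely in the exponent, as (B.3) does, is also what keeps the computation stable for large T, where the binomial coefficients overflow but the summands themselves are moderate.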

Replacing $\ln\Gamma(\cdot)$ by its asymptotic expansion and $H(p)$ by its definition, we get, in the exponent, the expression

$$\left(T+\frac12\right)\ln(T+1) - (T+1) + \frac12\ln 2\pi + O(1/T) - \left[\left(i+\frac12\right)\ln(i+1) - (i+1) + \frac12\ln 2\pi + O(1/T)\right]$$
$$\quad - \left[\left(T-i+\frac12\right)\ln(T-i+1) - (T-i+1) + \frac12\ln 2\pi + O(1/T)\right] + i\ln\frac{i}{T} + (T-i)\ln\frac{T-i}{T}$$
$$= 1 - \frac12\ln 2\pi + O(1/T) + \left(T+\frac12\right)\ln(T+1) - \left(i+\frac12\right)\ln(i+1) - \left(T-i+\frac12\right)\ln(T-i+1) + i\ln i + (T-i)\ln(T-i) - T\ln T$$
$$= 1 - \frac12\ln 2\pi + O(1/T) + T\ln\left(1+\frac1T\right) - i\ln\left(1+\frac1i\right) - (T-i)\ln\left(1+\frac{1}{T-i}\right) + \frac12\ln\frac{T+1}{(i+1)(T-i+1)}.$$

In the last expression the first two terms are constant and the third is $o(1)$. The three terms of the form $z\ln(1+1/z)$ are all in the range $[0,1]$; the first of them equals $1-O(1/T)$, and if $\hat\theta\in[\epsilon,1-\epsilon]$ for some fixed $\epsilon>0$ then the second and third have the same asymptotic behavior. The dominant term, in any case, is the last one.

Returning to Eq. (B.2), we separate the sum into three parts as follows:

$$Z_T = \sum_{i=0}^{T}\binom{T}{i}e^{-TH(i/T)} = \sum_{i=0}^{\epsilon T}\binom{T}{i}e^{-TH(i/T)} + \sum_{i=\epsilon T}^{(1-\epsilon)T}\binom{T}{i}e^{-TH(i/T)} + \sum_{i=(1-\epsilon)T}^{T}\binom{T}{i}e^{-TH(i/T)}.$$

Using the approximations shown above we can upper bound the summands in the first and third sums by

$$\binom{T}{i}e^{-TH(i/T)} \le e^{2-(1/2)\ln 2\pi + O(1/T)}\sqrt{\frac{T+1}{(i+1)(T-i+1)}}$$

and estimate the summands in the second sum by

$$\binom{T}{i}e^{-TH(i/T)} = e^{-(1/2)\ln 2\pi + O(1/T)}\sqrt{\frac{T+1}{(i+1)(T-i+1)}}.$$

A convenient way of writing the common factor is

$$\sqrt{\frac{T+1}{(i+1)(T-i+1)}} = \frac{\sqrt{T+1}}{T+2}\cdot\frac{1}{\sqrt{\dfrac{i+1}{T+2}\cdot\dfrac{T-i+1}{T+2}}}.$$

Using these equalities we can write the second and major summation as follows:

$$\sum_{i=\epsilon T}^{(1-\epsilon)T}\binom{T}{i}e^{-TH(i/T)} = e^{O(1/T)}\frac{\sqrt{T+1}}{\sqrt{2\pi}}\sum_{i=\epsilon T}^{(1-\epsilon)T}\frac{1}{T+2}\cdot\frac{1}{\sqrt{\dfrac{i+1}{T+2}\left(1-\dfrac{i+1}{T+2}\right)}}.$$

Observing the last sum, it is easy to see that it is a Riemann sum which is a finite approximation to the integral

$$\int_{\epsilon}^{1-\epsilon}\sqrt{\frac{1}{p(1-p)}}\,dp.$$

As $T\to\infty$, since the function $1/\sqrt{p(1-p)}$ is Riemann integrable on $[\epsilon,1-\epsilon]$, this sum approaches the value of the integral, so we get

$$\sum_{i=\epsilon T}^{(1-\epsilon)T}\binom{T}{i}e^{-TH(i/T)} = e^{O(1/T)}\sqrt{T+1}\,\frac{1}{\sqrt{2\pi}}\int_{\epsilon}^{1-\epsilon}\sqrt{\frac{1}{p(1-p)}}\,dp.$$

Similarly, we get that the sums that correspond to the first and last parts satisfy

$$\sum_{i=0}^{\epsilon T}\binom{T}{i}e^{-TH(i/T)} \le e^{2+O(1/T)}\sqrt{T+1}\,\frac{1}{\sqrt{2\pi}}\int_{0}^{\epsilon}\sqrt{\frac{1}{p(1-p)}}\,dp$$

and

$$\sum_{i=(1-\epsilon)T}^{T}\binom{T}{i}e^{-TH(i/T)} \le e^{2+O(1/T)}\sqrt{T+1}\,\frac{1}{\sqrt{2\pi}}\int_{1-\epsilon}^{1}\sqrt{\frac{1}{p(1-p)}}\,dp.$$


The last two integrals are $O(\sqrt{\epsilon})$; thus, after we take the limit $T\to\infty$ we can take the limit $\epsilon\to 0$ and get that

$$\lim_{T\to\infty}\frac{Z_T}{\sqrt{T+1}} = \lim_{\epsilon\to 0}\lim_{T\to\infty}\frac{1}{\sqrt{T+1}}\left(\sum_{i=0}^{\epsilon T}\binom{T}{i}e^{-TH(i/T)} + \sum_{i=\epsilon T}^{(1-\epsilon)T}\binom{T}{i}e^{-TH(i/T)} + \sum_{i=(1-\epsilon)T}^{T}\binom{T}{i}e^{-TH(i/T)}\right)$$
$$= \lim_{\epsilon\to 0}\left(e^2\frac{1}{\sqrt{2\pi}}\int_{0}^{\epsilon}\sqrt{\frac{1}{p(1-p)}}\,dp + \frac{1}{\sqrt{2\pi}}\int_{\epsilon}^{1-\epsilon}\sqrt{\frac{1}{p(1-p)}}\,dp + e^2\frac{1}{\sqrt{2\pi}}\int_{1-\epsilon}^{1}\sqrt{\frac{1}{p(1-p)}}\,dp\right)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{0}^{1}\sqrt{\frac{1}{p(1-p)}}\,dp = \sqrt{\frac{\pi}{2}}.$$

Finally, taking the log we get that

$$\lim_{T\to\infty}\left(\ln Z_T - \frac12\ln(T+1)\right) = \frac12\ln\frac{\pi}{2},$$
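The limit can also be observed numerically. The sketch below computes $\ln Z_T$ via a log-sum-exp (a stability device, not part of the original proof) and checks that the gap to $\frac12\ln(T+1) + \frac12\ln(\pi/2)$ shrinks; the $2/\sqrt{T}$ tolerance is an illustrative guess at the rate:

```python
import math

def ln_Z(T):
    """ln Z_T = ln sum_i C(T,i) exp(-T H(i/T)), computed stably via log-sum-exp."""
    logs = []
    for i in range(T + 1):
        q = i / T
        h = 0.0 if i in (0, T) else -q * math.log(q) - (1 - q) * math.log(1 - q)
        logs.append(math.lgamma(T + 1) - math.lgamma(i + 1)
                    - math.lgamma(T - i + 1) - T * h)
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

limit = 0.5 * math.log(math.pi / 2)   # claimed limit of ln Z_T - (1/2) ln(T+1)
for T in [100, 1000, 10000]:
    gap = abs(ln_Z(T) - 0.5 * math.log(T + 1) - limit)
    assert gap < 2.0 / math.sqrt(T)
```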

which completes the proof of the lemma. □

Appendix C. Proof of Theorem 4

We need to calculate a lower bound on the following integral:

$$\int_0^1 \exp\left(2T\left[\hat\theta\left((1-\hat\theta)^2 - (1-p)^2\right) + (1-\hat\theta)\left((0-\hat\theta)^2 - (0-p)^2\right)\right]\right)dp.$$

This integral simplifies to

$$\int_0^1 \exp\left(-2T(\hat\theta - p)^2\right)dp.$$

This is another well-known integral; it is (up to a multiplicative constant) an integral of the normal distribution over part of the real line. The integral does not have a closed algebraic form, but can be expressed using the error function erf(·):

$$\int_0^1 e^{-2T(\hat\theta-p)^2}dp = \sqrt{\frac{\pi}{2T}} - \int_0^\infty e^{-2T(\hat\theta+p)^2}dp - \int_0^\infty e^{-2T(1-\hat\theta+p)^2}dp = \frac12\sqrt{\frac{\pi}{2T}}\left(\operatorname{erf}\!\left(\hat\theta\sqrt{2T}\right) + \operatorname{erf}\!\left((1-\hat\theta)\sqrt{2T}\right)\right), \qquad (C.1)$$

where

$$\operatorname{erf}(p) = \frac{2}{\sqrt{\pi}}\int_0^p e^{-x^2}dx$$

is, up to rescaling, the cumulative distribution function of the normal distribution. As erf(x) is an increasing function we find that $\operatorname{erf}(\hat\theta\sqrt{2T}) + \operatorname{erf}((1-\hat\theta)\sqrt{2T})$ is minimized when $T=1$, and as erf(x) is concave for $x\ge 0$ we find that the minimum is achieved for $\hat\theta=0$ or $\hat\theta=1$. Thus we get that the integral is uniformly (w.r.t. $\hat\theta$) lower bounded by

$$\frac{\operatorname{erf}(\sqrt{2})}{2}\sqrt{\frac{\pi}{2T}} \;\le\; \int_0^1 \exp\left(-2T(\hat\theta-p)^2\right)dp. \qquad (C.2)$$

We now plug this lower bound into Eq. (20) and get that the regret is uniformly upper bounded by

$$-\frac12\ln\left(\frac{\operatorname{erf}(\sqrt{2})}{2}\sqrt{\frac{\pi}{2T}}\right) = \frac14\ln T + \frac12\ln\frac{2}{\operatorname{erf}(\sqrt{2})} - \frac14\ln\frac{\pi}{2}. \qquad (C.3)$$
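Identity (C.1) and the lower bound (C.2) can be verified numerically with `math.erf` (a sketch; the midpoint-rule resolution and the sampled values of T and $\hat\theta$ are arbitrary):

```python
import math

def gauss_integral(T, theta, n=100000):
    """Midpoint-rule value of the integral of exp(-2T (theta - p)^2) over [0, 1]."""
    return sum(math.exp(-2 * T * (theta - (k + 0.5) / n) ** 2) for k in range(n)) / n

T = 50
lower = 0.5 * math.erf(math.sqrt(2)) * math.sqrt(math.pi / (2 * T))  # bound (C.2)
for theta in [0.0, 0.1, 0.5, 0.9, 1.0]:
    closed = 0.5 * math.sqrt(math.pi / (2 * T)) * (
        math.erf(theta * math.sqrt(2 * T)) + math.erf((1 - theta) * math.sqrt(2 * T)))
    assert abs(gauss_integral(T, theta) - closed) < 1e-6  # identity (C.1)
    assert closed >= lower - 1e-12                        # uniform lower bound (C.2)
```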

Appendix D. An example for which the exponential weights algorithm is better than any Bayes algorithm

Suppose the model class consists of just three biased coins: $p_1=0$, $p_2=1/2$, $p_3=1$, and that a sequence is generated by flipping a random coin whose bias is 0.9. Consider first the Bayes algorithm. Whatever the choice of the prior distribution, once the algorithm observes both a zero and a one in the sequence, the posterior distribution becomes concentrated on $p_2=1/2$, and all predictions from there on would necessarily be $p_t=1/2$. This causes the algorithm an average loss per trial of $(0.5-0.9)^2=0.16$. On the other hand, consider the EW algorithm which uses the uniform prior distribution over the three models. The total loss of this algorithm is guaranteed to be larger than that of the best model in the class by at most $\ln(2)/4$. For large $T$, with very high probability, the best model is $p_3=1$, whose average loss per trial is $(0.9-1)^2=0.01$. Thus the average loss per trial of the EW algorithm is guaranteed to quickly approach 0.01, making it a clearly better choice than the Bayes algorithm for this problem. We conjecture that a gap between the performance of the two algorithms exists also when the class of models is the set of all biased coins. However, we have not yet been able to calculate the prior for the optimal Bayes algorithm in this case.

References

[1] M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions, National Bureau of Standards, 1970.
[2] A. Barron, J. Rissanen, B. Yu, The minimum description length principle in coding and modeling, IEEE Trans. Inform. Theory 44 (6) (1998) 2743–2760; Information Theory: 1948–1998.
[3] J.M. Bernardo, Reference posterior distributions for Bayesian inference, J. Roy. Stat. Soc. B 41 (1979) 113–147.
[4] D. Blackwell, An analog of the minimax theorem for vector payoffs, Pacific J. Math. 6 (1) (1956) 1–8.
[5] D. Blackwell, M.A. Girshick, Theory of Games and Statistical Decisions, Dover, New York, 1954.
[6] N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, M.K. Warmuth, How to use expert advice, in: Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993, pp. 382–391.
[7] N. Cesa-Bianchi, G. Lugosi, On prediction of individual sequences, Ann. Stat. 27 (6) (1999) 1865–1895.
[8] B.S. Clarke, A.R. Barron, Jeffreys' prior is asymptotically least favorable under entropic risk, J. Stat. Planning Inference 41 (1994) 37–60.
[9] T.M. Cover, E. Ordentlich, Universal portfolios with side information, IEEE Trans. Inform. Theory (March) (1996).
[10] T.M. Cover, Behavior of sequential predictors of binary sequences, in: Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (Prague, 1965), Academia, Prague, 1967, pp. 263–272.
[11] T.M. Cover, Universal portfolios, Math. Finance 1 (1) (1991) 1–29.
[12] L.D. Davisson, Universal noiseless coding, IEEE Trans. Inform. Theory 19 (1973) 783–795.
[13] N.G. de Bruijn, Asymptotic Methods in Analysis, Dover, New York, 1958, 1981. A good introduction to the Laplace method and the saddle point method for approximating integrals.
[14] M. Feder, N. Merhav, M. Gutman, Universal prediction of individual sequences, IEEE Trans. Inform. Theory 38 (1992) 1258–1270.
[15] T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.
[16] D.P. Foster, Prediction in the worst case, Ann. Stat. 19 (2) (1991) 1084–1090.
[17] D.P. Foster, R. Vohra, Regret in the on-line decision problem, Games Econ. Behavior 29 (1–2) (1999) 7–35. Learning in games: a symposium in honor of David Blackwell.
[18] Y. Freund, Predicting a binary sequence almost as well as the optimal biased coin, in: Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.
[19] J. Hannan, Approximation to Bayes risk in repeated play, in: M. Dresher, A.W. Tucker, P. Wolfe (Eds.), Contributions to the Theory of Games, vol. III, Princeton University Press, Princeton, NJ, 1957, pp. 97–139.
[20] S. Hart, A. Mas-Colell, A simple adaptive procedure leading to correlated equilibrium, Econometrica 68 (5) (2000) 1127–1150.
[21] D. Haussler, A general minimax result for relative entropy, IEEE Trans. Inform. Theory 43 (4) (1997) 1276–1280.
[22] D. Haussler, J. Kivinen, M.K. Warmuth, Tight worst-case loss bounds for predicting with expert advice, in: Computational Learning Theory: Second European Conference, EuroCOLT '95, Springer, Berlin, 1995, pp. 69–83.
[23] D. Haussler, M. Opper, Worst case prediction over sequences under log loss, in: D. O'Leary, G. Cybenko, J. Rissanen (Eds.), The Mathematics of Information Coding, Extraction and Distribution, Springer, Berlin, 1998.
[24] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and its Applications, Texts and Monographs in Computer Science, Springer, Berlin, 1993.
[25] N. Littlestone, Learning when irrelevant attributes abound, in: 28th Annual Symposium on Foundations of Computer Science, October 1987, pp. 68–77.
[26] N. Littlestone, M.K. Warmuth, The weighted majority algorithm, Inform. Comput. 108 (1994) 212–261.
[27] J.D. Murray, Asymptotic Analysis, Springer, Berlin, 1973.
[28] J. Rissanen, Stochastic Complexity in Statistical Inquiry, Series in Computer Science, vol. 15, World Scientific, Singapore, 1989.
[29] J. Rissanen, G.G. Langdon Jr., Universal modeling and coding, IEEE Trans. Inform. Theory IT-27 (1) (1981) 12–23.
[30] J.J. Rissanen, Fisher information and stochastic complexity, IEEE Trans. Inform. Theory 42 (1) (1996) 40–47.
[31] Yu.M. Shtar'kov, Universal sequential coding of single messages, Problems Inform. Transmission (Trans. Russian) 23 (July–September) (1987) 175–186.
[32] Yu.M. Shtarkov, Coding of discrete sources with unknown statistics, in: I. Csiszár, P. Elias (Eds.), Topics in Information Theory, North-Holland, Amsterdam, 1975, pp. 559–574.
[33] J. Suzuki, Some notes on universal noiseless coding, IEICE Trans. Fundam. E78-A (12) (1995).
[34] V.G. Vovk, A game of prediction with expert advice, J. Comp. Syst. Sci. 56 (2) (1998) 153–173.
[35] V.G. Vovk, Aggregating strategies, in: Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990, pp. 371–383.
[36] G.N. Watson, Theory of Bessel Functions, Cambridge University Press, Cambridge, 1952.
[37] F.M.J. Willems, Y.M. Shtarkov, T.J. Tjalkens, The context tree weighting method: basic properties, IEEE Trans. Inform. Theory 41 (3) (1995) 653–664.
[38] Q. Xie, A. Barron, Minimax redundancy for the class of memoryless sources, IEEE Trans. Inform. Theory 43 (1997) 646–657.
[39] Q. Xie, A.R. Barron, Asymptotic minimax regret for data compression, gambling, and prediction, IEEE Trans. Inform. Theory 46 (2) (2000) 431–445.
[40] K. Yamanishi, A decision-theoretic extension of stochastic complexity and its applications to learning, IEEE Trans. Inform. Theory 44 (4) (1998) 1424–1439.
[41] Z. Zhang, Discrete noninformative priors, Ph.D. Thesis, Yale University, 1994.
