Asweshowlater,suchadependencedoesindeedexist,butitvanishesasT!1ifp2½;1Àforanyfixed>0.ForagooddescriptionoftheLaplacemethod,seeChapter4ofDeBruijnÕsbook[13]orChapter2ofMurrayÕsbook[27].10SeethederivationofEquation(2.33)in[27].98Y.Freund/InformationandComputation182(2003)73–9483
Z
a
b
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiuÀ2pÀThðtÞ
iþOðTÀ3=2Þ:fðtÞedt¼fðtminÞuth2
dTdhðtÞt2t¼tmin
ð26Þ
Wecannowprovethetheorem.
^andletxTbeanysequencewhoseem-ProofofTheorem1.Wefixanempiricaldistributionh
^.weusethefactthelisdefinedbythedensityfunctionxandrewritethepiricaldistributionish1
boundgiveninEq.(20),withouttakingthemaximumoverthesequence
Z1TTXX
^;pÞxðpÞdp:‘ðpt;xtÞÀ‘ðp;xtÞ6ÀclnexpÀTgðh
t¼1
t¼1
0
^;tÞ,andtmin¼p^.ThisintegralisoftheformdefinedinEq.(25),wherefðtÞ¼xðtÞ=Z,hðtÞ¼gðh
^FromWatsonÕslemmawethusgetthat,foranyfixedvalueofh
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiZ1uÀ2pc^;pÞÀTgðh^iedl1ðpÞ¼xðhÞuð27ÞþOðTÀ3=2Þ:th2
d^0Tdp2gðh;pÞ^
^ðhÞp¼p
Inordertominimizethebound,wewanttheintegraltobelarge.Moreprecisely,wewantto
^2½0;1.11Asxisachoosexsoastomaximizetheminimalvalueachievedoverallchoicesofh
distribution,itiseasytoseethattheminimumofthefirsttermintheboundismaximizedifthe
^.WethusarriveatthechoiceofxgiveninEq.(22)valueofthistermisequalforallvaluesofh
^ofthefirstterm.Wethusgetwhichexactlycancelsthedependenceonh
rffiffiffiffiffiffiffiffirffiffiffiffiffiffiffiffivffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
uÀ2pc2pZ2pZ^Þuh2iþOðTÀ3=2Þ¼ð1þOð1=TÞÞþOðTÀ3=2Þ6maxxðht^dTT^h2½0;1Tdp2gðh;pÞ^
p¼p^ðhÞ
andwhenweplugthisestimateintotheboundgiveninEq.(20)wegetthestatementofthe
theorem.Ã
Wenowmoveontoshowthatthesuggestedexponentialweightsalgorithmdoesindeedachieveaverystrongboundontheworst-caseregretfortheloglossandforthesquarelossontheclassofbiasedcoins.5.Log-loss
Thelossfunctioninthiscaseis‘logðð;xÞ;pÞ¼Àlogð1ÀjxÀpjÞ,theoptimalparametersare
^.^¼hc¼1=g¼1andtheoptimalvalueofpforagivensequenceisp
^oftheformi=T.Indeed,wecanfindslightlybetterActually,ifwefixT,weneedtoconsideronlyvaluesofh
choicesofxforfixedvaluesofT.However,ourgoalistochooseasingledistributionthatwillworkwellforalllargeT.
^,which,asthesecondderivativeofhiscontinuous,isequivalenttoconsideringWethushavetoconsiderallrationalh
^2½0;1.allh
1184Y.Freund/InformationandComputation182(2003)73–94
Itiseasytocheckthatinthiscasethepredictionrule
Z1
^^^^/t¼pexpÀThlnhþð1ÀhÞlnð1ÀhÞdl1ðpÞ
0
satisfiesEq.(14)forthelogloss.NotealsothatthisruleisequivalenttotheBayesoptimal
predictionruleusingthepriordistributionl1.
Thegapfunctionhinthiscaseis(minus)theKL-divergence
! !
^^1Àhh^Þlog^;pÞ¼Àh^log^jjp:¼ÀDKLhÀð1Àhgðh
1ÀppThesecondderivativeofhistheFisherinformation
2!o
gðx;pÞ¼pð1ÀpÞ:op2p¼p^ðxÞSotheoptimalprioris
11
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi:¼xðpÞ¼rhi2ppð1ÀpÞo
Zgðx;pÞop2^ðxÞp¼p
ð28Þ
ThisprioristheJeffreyÕspriorforthismodelclass,thusthealgorithmsuggestedinthiscaseisthe
BayesalgorithmusingJeffreyÕsprior(BJ).
Toboundtheworst-caseregretwecalculatetheintegralofEq.(18)which,inthiscase,isequalto
Z1
^
xðpÞeÀTDKLðhjjpÞdp:
0
DetailsofthiscalculationwillbegiveninAppendixA.Herewestatetheresultingbound.Theorem3.TheregretoftheExponentialweightsalgorithmovertheclassofbiasedcoins,whichusesthepriordistributiondescribedinEq.(28)withrespecttotheloglossisbounded,foranyTP1by
11^Þ61lnðTþ1Þþ1lnpþ1ÀRwcðEW;T;hþ^;1Àh^ÞÞÀ1360ðTþ1Þ322222þðTminðhwhichimpliesthat
RwcðEW;TÞ6
1
lnðTþ1Þþ1:2
ð30Þð29Þ
Thelastinequalityshowsthattheregretoftheexponentialweightsalgorithmholdsuniformly
^2½0;1,i.e.,forallsequences.Itisworthwhiletoconsiderthemorepreciseboundgiveninforallh
Y.Freund/InformationandComputation182(2003)73–9485
^2½;1À,then,forT!1,thelasttwoInequality(29).Ifforsomefixed>0wehavethath
termsconvergetozeroandtheboundconvergestoð1=2ÞlnTþð1=2Þlnðp=2Þ.Thisboundalso
^¼0,wegetaslightlylargerboundof^2½ð=TaÞ;1Àð=TaÞforanya<1.However,ifhholdsifh
^¼Hð1=TÞthenwegetanintermediatebound.ð1=2ÞlnTþð1=2Þlnðp=2Þþð1=2Þ.12Finally,ifh6.Comparisontootherresultsregardinglog-loss
ItisinterestingtocomparetheseboundstotheonesgivenbyXieandBarron[38].Theyanalyze
thesamealgorithmonaverysimilarproblem,buttheyconsidertheexpectedregretandnottheworst-caseregret.Aswasshownintheintroduction,theworst-caseregretupperboundstheaverage-caseregret.However,ourdefinitionofregretismuchstrongerthentheirs,becausewemakenoprobabilisticassumptionaboutthemechanismthatisgeneratingthesequence.Itisthereforsurprisingthattheboundsthatwegetaresoveryclosetotheirbounds.
Fromtheargumentsgivenintheprevioussectionwegetthat,Theorem3impliestheboundgiveninEq.(10).ThedifferencebetweenthisboundandtheboundderivedbyXieandBarron[38]describedinEq.(2)isð1=2Þlnðp=2ÞÀð1=2Þlnðp=2eÞ¼0:5bits¼0.721bits.Inotherwords,knowingthatthesequenceisactuallygeneratedbyindependentrandomdrawsofarandomcoinisworthlessthanonebit!
AsXieandBarronshow,theBJalgorithmisnotanasymptoticallyoptimalalgorithmwithrespecttotheaverageprior.Thatisbecauseontheendpointsp¼0andp¼1thelossislargerthanfortheinteriorpoints,andthisdifferencedoesnotvanishasT!1.Inordertoachievetheasymptoticmin–maxtheysuggestmultiplyingJeffreyÕspriorby1À2gwhereg>OðTÀaÞforsomea>1=2andputtingaprobabilitymassofgoneachofthetwopointsc=Tand1Àc=Tforsomeconstantc.Thisreducestheasymptoticaverageregretattheendpointstotheasymptoticallyoptimalvaluewithoutchangingtheasymptoticaverageregretintheinteriorpoints.Similarobservationsandthesamefixholdfortheworst-caseregret.
Asweshowattheendofthelastsection,theregretofalgorithmBJontheinteriorof½0;1isð1=2ÞlnTþð1=2Þlnðp=2Þ,largerby1/2fromtheoptimalperformancewithrespecttotheaverageregret.Wenowshowthatthissmallgapcannotberemoved.Wedothisbycalculatingtheregretofthemin–maxoptimalalgorithmfortheworst-caseregretwithrespecttothelogloss.
Weusearatherwellknownlemma,statedforinstanceinShtarkov[31,32]andinCesa-Bianchietal.[6].Thelemmastatesthatthemin–maxoptimalpredictionalgorithmdefinesthefollowingdistributionoverthesetofsequencesoflengthT
X1TT
maxPpðxTÞ;ð31ÞPðxÞ¼maxPpðxÞ;whereZ¼
pZp
xTandthattheregretsufferedbythisoptimalalgorithmonanysequenceisequaltolnZ.Usingthiswecanexplicitlycalculatetheworst-caseregretofthemin–maxoptimalalgorithm(thisresultwaspreviouslyshownbyShtarkov[31]).
^¼0orTheactualasymptoticvalueoftheregret(bothworst-caseandaverage-case)oftheBJalgorithmforh
^¼1isslightlysmaller:ð1=2ÞlnTþð1=2Þlnp.h
1286Y.Freund/InformationandComputation182(2003)73–94
Lemma1.Theworst-caseregretofthemin–maxoptimalpredictionalgorithmforsequencesoflengthT,whichwedenotebyMMT,withrespecttotheclassofbiasedcoinsandthelogloss,is
!TXpffiffiffiffiTÀTHði=TÞ11
¼lnðTþ1Þþlnðp=2ÞÀOð1=TÞ:eð32ÞRwcðMMT;TÞ¼ln
i22i¼0
TheproofisgiveninAppendixB.7.Squareloss
Inthissectionweconsiderthelossfunctionlðx;pÞ¼ðxÀpÞ2.AswasshownbyVovk[35]and
Haussleretal.theoptimalparametersinthiscasearec¼1=2,g¼2andtheoptimalmodelis
^.Thegapfunctioninthiscaseis^¼hp
2222^^^^gðpÞ¼hð1ÀhÞÀð1ÀpÞþð1ÀhÞð0ÀhÞÀð0ÀpÞ:Anditssecondderivativeisaconstant
h00ðpÞ¼4:
Sotheoptimalprioristheuniformdistribution
xðpÞ¼1:
Toboundtheworst-caseregretwecalculatetheintegralofEq.(18)which,inthiscase,isequal
to
Z1
2222^ð1Àh^ÞÀð1ÀpÞþ2Tð1Àh^Þð0Àh^ÞÀð0ÀpÞexp2Thdp:
0
DetailsaregiveninAppendixC.Theresultingboundis:
Theorem4.TheregretoftheExponentialweightsalgorithmovertheclassofbiasedcoins,which
usestheuniformpriordistributionwithrespecttothethesquaredloss,foranyTP1,isboundedby
21p^Þ61lnTþ1lnpffiffiffiffiRwcðEW;T;hÀln;pffiffiffiffi42erfh42^Tþerfð1Àh^ÞTwhichimplies
RwcðEW;TÞ6
1121p
lnTþlnÀpffiffiffiÁÀln42erf242
ð34Þð33Þ
Similartothedetailedanalysisofthelog-losscase,Inequality(34)givesusauniformupper
^2½;1Àforsomeboundthatdoesnotdependonthesequence,whileifweassumethath
constant>0thenInequality(33)givesaslightlybetterasymptoticbound.Inthesecondcase
Y.Freund/InformationandComputation182(2003)73–9487
eachofthetwoerfðÁÞfunctionsconvergesto1andsothesecondtermvanishesandweareleft
^¼0orh^¼1thenonlyoneofthetwoerfðÁÞtermswiththenegativetermÀð1=4Þlnðp=2Þ.Ifh
convergesto1whiletheotherremains0,andwegetanadditionaltermoflnð2Þ=2intheregret.
^and0(or1)isabitstrongertheninForthesquarelosstherestrictiononthedistancebetweenh
aa^thelog-losscase.Herepweffiffiffiffihavethatifh2½=T;1À=Tforsomea<1=2thenthebetterbound
^¼Hð1=TÞthenwegetaboundthatisbetweentheinteriorboundandtheboundholdsandifh
^¼0.forh
Twocommentsareinorder.First,althoughwehavenotconcernedourselveswiththecom-putationalefficiencyofouralgorithms,boththelog-lossversionandthesquare-lossversionre-quiresmallconstanttimetocalculatethepredictions,whoseformulasaregiveninEqs.(11)and(12).Second,itisnothardtogiveexamplesoffinitemodelclassesinwhichusingtheEWal-gorithmismuchbetterthanusinganyBayesalgorithmwhenthedataisgeneratedbyamodeloutsidetheclass(seeAppendixD).Weconjecturethatsuchanexampleexistsalsoforthecon-tinuousmodelclassP¼½0;1.8.AbsoluteLoss
Inthissectionweconsidertheabsoluteloss,whichhasverydifferentpropertiesthantheloglossandthesquareloss.Haussleretal.[22]showthatthereisnofinitevalueofcsuchthattheboundeq.(13)holdsforg¼1=c.Moreover,aswasshownbyCesa-Bianchiatal.[6],thereisnopredictionalgorithmwhoseworst-caseregretdoesnotdependonthelossofthebestmodel.Ontheotherhand,therearechoicesofcandg>1=cforwhichEq.(13)holdsandCesa-Bianchiet.al.haveshownthat,basedonthisfact,anexponentialweightsalgorithmforfinitemodelclassescanbedevised.Andtheworst-caseregretofthisalgorithmisboundedbypffiffiffiffiffiffiffiffiffiffiffiffiffiOðlnNLÃTÞ.
Aswecannotchoosecfiniteandg¼1=cforthemultiplicativeweightsalgorithm,wecannotusethetechniqueofTheorem1forthiscase.However,wedonotneedtouseaninfinitesetofmodelsinouranalysisforthiscase.ThisisbecauseinthiscasetheoptimalmodelintheclassP¼½0;1isalwayseitherp¼0orp¼1.ThuswecanconsideraEWalgorithmthatcombinesonlythesetwomodelsandgetclosetooptimalboundsontheregret.9.Conclusionsandopenproblems
Wehavedemonstrated,inasimplecase,thattheBayesalgorithmthathasbeenshownbyXieandBarrontobeoptimalwithrespecttotheaverage-caseregretisalsooptimalwithrespecttotheworst-caseregret.Moreover,theboundontheworst-caseregretisonlyveryslightlyworsethantheaverage-caseregret.
Wehavealsoshownthataverydifferentalgorithmresultsifoneisinterestedinthesquareloss,ratherthaninthelogloss.
Theseresultsgiveevidencethatsometimesaccuratestatisticalinferencecanbedonewithoutassumingthattheworldisrandom.Inrecentyearstheresultsgivenherehavebeengeneralizedtomuchmorecomplexclassesofdistributions.MostnotablyYamanishi[40],HausslerandOpper[23],andCesa-BianchiandLugosi[7].
88Y.Freund/InformationandComputation182(2003)73–94
Acknowledgments
SpecialthankstoSebastianSeungforhishelponunderstandingandsolvingtheintegralequationsusedinthispaper.ThankstoAndrewBarron,MeirFeder,DavidHaussler,RobSchapire,VolodyaVovk,QunXie,andKenjiYamanishiforhelpfuldiscussionandsugges-tions.
AppendixA.ProofofTheorem3
Wewanttocalculatethefollowingintegral:Z1
^
xðpÞeÀTDKLðhjjpÞdp;
0
whichwecanexpandandwriteasfollows:
Z
1p1Àp^log^ÞlogpffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiexpThdpTð1Àh
^^ppð1ÀpÞ1Àhh0
Z1
1^ÞÀ1=2^À1=2Tð1ÀhTh
¼pð1ÀpÞdp:^Þ^Tð1ÀhTh^^0phð1ÀhÞ
1
Luckily,thelastintegralisawellstudiedquantity,calledtheBetafunction.Morespecifically,itis^þ1=2;Tð1Àh^Þþ1=2Þ,whichcanalsobeexpressedintermsoftheGammafunction,BðTh
Bðx;yÞ¼CðxÞCðyÞ=CðxþyÞ.13Usingtheserelationsweget
Z
01
xðpÞe
^jjpÞÀTDKLðh
^þ1=2;Tð1Àh^Þþ1=2ÞCðTh^þ1=2ÞCðTð1Àh^Þþ1=2ÞBðTh
dp¼¼:^Þ^Þ^^Tð1ÀhTð1ÀhThTh^^^^phð1ÀhÞpCðTþ1Þhð1ÀhÞ
PluggingthisformulaintoEq.(20)weget
^þ1=2ÞCðTð1Àh^Þþ1=2ÞCðTh^^^^¼lnCðTþ1ÞþlnpþThlnðhÞþð1ÀhÞlnð1ÀhÞÀln^Þ^Tð1ÀhTh^^pCðTþ1Þhð1ÀhÞ
^þ1=2ÞÀlnCðTð1Àh^Þþ1=2Þ:ÀlnCðTh
TheasymptoticexpansionoflnCðzÞforlargevaluesofzcanbeusedtogiveupperandlower
boundsonthisfunctionforpositivevaluesofz(seeEqs.(6.1.41)and(6.1.42)in[1])
Essentially,theGammafunctionisanextensionoftheFactorialtotherealsandtheBetafunctionisanextensionofthereciprocaloftheBinomialfunction.
13Y.Freund/InformationandComputation182(2003)73–9489
11ÀðzÀ1=2ÞlnzÀzþð1=2Þlnð2pÞþ6lnCðzÞ6ðzÀ1=2ÞlnzÀzþð1=2Þ12z360z3
1:Âlnð2pÞþ12z
ðA:1Þ
Usingtheseboundswegetthestatementofthetheorem(detailsinthissection)
^^^^^^lnCðTþ1ÞÀlnCðThþ1=2ÞÀlnCðTð1ÀhÞþ1=2ÞþlnpþThlnðhÞþð1ÀhÞlnð1ÀhÞ
6ðTþ1=2ÞlnðTþ1ÞÀðTþ1Þþð1=2Þlnð2pÞþ
1
12ðTþ1Þ
11
þ
^þ1=2Þ360ðTh^þ1=2Þ312ðTh
1
^Þþ1=2Þ12ðTð1Àh
^lnðTh^þ1=2ÞþðTh^þ1=2ÞÀð1=2Þlnð2pÞÀÀTh
^ÞlnðTð1Àh^Þþ1=2ÞþðTð1Àh^Þþ1=2ÞÀð1=2Þlnð2pÞÀÀTð1Àhþ
1
^Þþ1=2Þ3360ðTð1Àh^lnh^þð1Àh^Þlnð1Àh^ÞþlnpþTh
!
^^^^1p11^ln2Thþ2hþð1Àh^Þln2Tð1ÀhÞþ2ð1ÀhÞþ6lnþlnðTþ1ÞþTh3^þ1^Þþ1222360ðTþ1Þ2Th2Tð1Àh66
11p111
þlnðTþ1ÞþlnþÀ
^;1Àh^ÞÞÀ1360ðTþ1Þ322222þðTminðh1
lnðTþ1Þþ1:2
ðA:2ÞðA:3Þ
AppendixB.ProofofLemma1
Giventhattheboundontheregretofthemin–maxoptimalalgorithmisequaltolnZT,whereZTisthenormalizationfactorofthemin–maxdistributionforsequencesoflengthT,ourgoalistocalculate
X
limlnZT¼limlnZTmaxPpðxTÞ:
T!1
T!1
xT
p
TheprobabilityassignedtoasequencexTbythemodelpdependsonlyonthenumberof1ÕsinxT,
^.Itisi.e.,onTh
^Þ^T^^Tð1Àh1ÀhTThh
¼pð1ÀpÞ:ðB:1ÞPpðxÞ¼pð1ÀpÞ
90Y.Freund/InformationandComputation182(2003)73–94
^thuswegetthatThevalueofPpðxTÞforagivenxTismaximizedwhenp¼h
T
Xh^^
^Þ1Àh:^ð1ÀhZT¼h
ÀÁ
^^AsthereareTTh^sequencesoflengthTinwhichthefractionof1Õsish,andashachievesthevalues
i=Tfori¼0;...;T,wecanrewritethelastequationinthefollowingform:
ZT¼
TXTi¼0xT
i
eÀTHði=TÞ;
ðB:2Þ
whereHðpÞ¼ÀplnpÀð1ÀpÞlnð1ÀpÞisthebinaryentropyofp.ÀTÁ
ToapproximatethevalueofEq.(B.2)wereplaceibytheequivalentexpressionCðTþ1Þ=ðCðiþ1ÞCðTÀiþ1ÞÞmovethelogofthisexpressiontotheexponent,andget
ZT¼
TXi¼0
expðlnCðTþ1ÞÀlnCðiþ1ÞÀlnCðTÀiþ1ÞÀTHði=TÞÞ:ðB:3Þ
ReplacingCðÁÞbyitsseriesexpansionaroundT¼1andHðpÞbyitsdefinition,weget,inthe
exponent,theexpression
1Tþ
2
111
lnðTþ1ÞÀðTþ1Þþln2pþOð1=TÞÀiþlnðTþ1Þþðiþ1ÞÀln2p
222
11i
þOð1=TÞÀTÀiþlnðTÀiþ1ÞþðTÀiþ1ÞÀln2pþOð1=TÞþiln
22TþðTÀiÞln
TÀi1
¼1Àln2pþOð1=TÞþðTþ1=2ÞlnðTþ1ÞÀðiþ1=2Þlnðiþ1ÞT2
1
ln2pþOð1=TÞ2
ÀðTÀiþ1=2ÞlnðTÀiþ1ÞþilniþðTÀiÞlnðTÀiÞÀTlnðTÞ¼1À1
þTln1þ
T
111Tþ1
Àiln1þÀðTÀiÞln1þþln:
iTÀi2ðiþ1ÞðTÀiþ1Þ
Inthefirstlineofthelastexpression,thefirsttwotermsareconstantandthethirdtermisoð1Þ.
Allthetermsinthesecondlineareintherange½0;1.Thefirsttermisequalto1ÀOð1=TÞ,andif^2½;1Àforsomefixed>0thenthesecondandthirdtermshavethesameasymptotich
behavior.Thedominantterm,inanycase,isthelastone.
ReturningtoEq.(B.2),weseparatethesumintothreepartsasfollows:
ZT¼
TXTi¼0
i
eÀTHði=TÞ¼
TXTi¼0
i
eÀTHði=TÞþ
ð1ÀÞTXi¼T
TXTÀTHði=TÞTÀTHði=TÞ
þ:ee
iii¼ð1ÀÞT
Y.Freund/InformationandComputation182(2003)73–9491
Usingtheapproximationsshownabovewecanupperboundthesummandsinthefirstandthirdsumsby
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
Tþ1TÀTHði=TÞ
6e2Àð1=2Þln2pþOð1=TÞe
ðiþ1ÞðTÀiþ1Þiandestimatethesummandsinthesecondtermby
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
TÀTHði=TÞTþ1
:e¼eÀð1=2Þln2pþOð1=TÞ
iðiþ1ÞðTÀiþ1ÞAconvenientwayforwritingthecommonfactorissffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffisffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
pffiffiffiffiffiffiffiffiffiffiffiffi1Tþ11¼Tþ1iþ1TÀiþ1ðiþ1ÞðTÀiþ1ÞTþ2Tþ2Tþ2Usingtheseequalitieswecanwritethesecondandmajorsummationasfollows:
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffið1ÀÞTð1ÀÞTXXpffiffiffiffiffiffiffiffiffiffiffiffiTÀTHði=TÞ111
:¼eOð1=TÞpffiffiffiffiffiffiTþ1eiþ1TÀiþ1iTþ22pTþ2Tþ2i¼Ti¼T
Observethelastsum,itiseasytoseethatitaRiemannSumwhichisafiniteapproximationtothe
integral
Z1Àsffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1
dp:
pð1ÀpÞpffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
And,asT!1,andthefunction1=pð1ÀpÞisRiemannintegrablethissumapproachesthevalueoftheintegral,soweget
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiZð1ÀÞT1ÀXTpffiffiffiffiffiffiffiffiffiffiffiffi11ÀTHði=TÞOð1=TÞ
ffidp:Tþ1pffiffiffiffiffi¼ee
ipð1ÀpÞ2pi¼TSimilarly,wegetthatthesumthatcorrespondstothefirstandlasttermsapproach
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiZTXTpffiffiffiffiffiffiffiffiffiffiffiffi11ÀTHði=TÞ2þOð1=TÞ
dpTþ1pffiffiffiffiffiffi6ee
ipð1ÀpÞ2p0i¼0and
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiZT1Xpffiffiffiffiffiffiffiffiffiffiffiffi1TÀTHði=TÞ1
dp:6e2þOð1=TÞTþ1pffiffiffiffiffiffie
ipð1ÀpÞ2p1Ài¼ð1ÀÞT
92Y.Freund/InformationandComputation182(2003)73–94
ThelasttwointegralsareOðÞ,thus,afterwetakethelimitT!1wecantakethelimit!1andgetthat
!ð1ÀÞTTTXTXXTZTTÀTHði=TÞ
eÀTHði=TÞþeÀTHði=TÞþelimpffiffiffiffiffiffiffiffiffiffiffiffi¼limlim
T!1T!1!0iiiTþ1i¼Ti¼0i¼ð1ÀÞT
ssffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ZZ1À
1111
dpþpffiffiffiffiffiffidp¼lime2pffiffiffiffiffiffi!0pð1ÀpÞpð1ÀpÞ2p02p
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi!Z1
11
dpþe2pffiffiffiffiffiffi2p1Àpð1ÀpÞ
rffiffiffiZ1sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
11p
dp¼¼pffiffiffiffiffiffi:
pð1ÀpÞ22p0Finallytakingthelogwegetthat
T!1
limlnZTÀ
11p
lnðTþ1Þ¼ln222
whichcompletestheproofofthetheorem.ÃAppendixC.ProofofTheorem4
Weneedtocalculatealowerboundonthefollowingintegral:
Z1
2222^^^^exp2Thð1ÀhÞÀð1ÀpÞþ2Tð1ÀhÞð0ÀhÞÀð0ÀpÞdp:
0
Thisintegralsimplifiesto
Z1
2^ÀpÞdp:expÀ2Tðh
0
Thisisanotherwell-knownintegral,itis(uptoamultiplicativeconstant)anintegralofthe
normaldistributiononpartoftherealline.Thisintegraldoesnothaveaclosedalgebraicform,butcanbeexpressedusingtheerrorfunctionerfðÁÞ.
Z1Z1Z1rffiffiffiffiffiffi22p^^^ÀpÞ2¼expÀ2TðheÀ2TðhþpÞdpÀeÀ2Tð1ÀhþpÞdp1À
2T00rffiffiffiffiffiffi0
pffiffiffiffipffiffiffiffi1p^Tþerfð1Àh^ÞT;¼erfhðC:1Þ
22Twhere
:2erfðpÞ¼pffiffiffip
Z
0p
eÀxdx
2
Y.Freund/InformationandComputation182(2003)73–9493
isthecumulativedistributionfunctionnormaldistribution.AserfðxÞisanincreasingffiffiffiffiffiffiÁÀpffiffiffiffiffiffiÁÀofthep^^functionwefindthaterfh2Tþerfð1ÀhÞ2TisminimizedwhenT¼1andaserfðxÞis
^¼0orh^¼1.ThuswegetthattheconcaveforxP0wefindthattheminimumisachievedforh
^)lowerboundbyintegralisuniformly(w.r.t.h
ffiÁrffiffiffiffiffiffiZ1erfÀpffiffi2p^ÀpÞ26expÀ2Tðh:ðC:2Þ22T0WenowplugthislowerboundintoEq.(20)andgetthattheregretisuniformlyupperboundedby
ÀpffiffiffiÁrffiffiffiffiffiffi!erf21p1121p
¼lnTþlnÀpffiffiffiÁÀln:ðC:3ÞÀln
222T42erf242AppendixD.AnexampleforwhichtheexponentialweightsalgorithmisbetterthananyBayes
algorithm
Supposethemodelclassconsistsofjustthreebiasedcoins:p1¼0;p2¼1=2;p3¼1,andthatasequenceisgeneratedbyflippingarandomcoinwhosebiasis0.9.ConsiderfirsttheBayesalgo-rithm,whateveristhechoiceofthepriordistribution,oncethealgorithmobservesbothazeroandaoneinthesequence,theposteriordistributionbecomesconcentratedonp2¼1=2andallpredictionsfromthereonwouldnecessarilybe/t¼1=2.Thiswouldcausesthealgorithmanaveragelosspertrialofð0:5À0:1Þ2¼0:16.Ontheotherhand,considertheEWalgorithmwhichusestheuniformpriordistributionoverthethreemodels.Thetotallossofthisalgorithmisguaranteedtobelargerthanthatofthebestmodelintheclassbyatmostlnð2Þ=4.ForlargeT,withveryhighprobability,
2
thebestmodelisp2¼1whoseaveragelosspertrialisð0:9À1Þ¼0:01.ThustheaveragelosspertrialoftheEWalgorithmisguaranteedtoquicklyapproach0.01,makingitaclearlybetterchoicethantheBayesalgorithmforthisproblem.Weconjecturethatagapbetweentheperformanceoftherealgorithmsexistswhentheclassofmodelsisthesetofallthebiasedcoins.However,wehaveyetnotbeenabletocalculateoptimalpriorfortheoptimalBayesalgorithmforthiscase.References
[1]Abramowitz,Stegun,HandbookofMathematicalFunctions,NationalBureauofStandards,1970.
[2]A.Barron,J.Rissanen,B.Yu,Theminimumdescriptionlengthprincipleincodingandmodeling,EEETrans.Inform.Theory44(6)(1998)2743–2760;Inform.Theory:1948–1998.
[3]J.M.Bernardo,ReferenceposteriordistributionsforBayesianinference,J.Roy.Stat.Soc.B.41(1979)113–147.[4]D.Blackwell,Ananalogoftheminimaxtheoremforvectorpayoffs,PacificJ.Math.6(1)(1956)1–8.[5]D.Blackwell,M.A.Girshick,TheoryofGamesandStatisticalDecisions,Dover,NewYork,1954.
[6]N.Cesa-Bianchi,Y.Freund,D.P.Helmbold,D.Haussler,R.E.Schapire,M.K.Warmuth.Howtouseexpertadvice,in:ProceedingsoftheTwenty-FifthAnnualACMSymposiumontheTheoryofComputing,1993,pp382–391.[7]N.Cesa-Bianchi,G.Lugosi,Onpredictionofindividualsequences,Ann.Stat.27(6)(1999)1865–1895.
[8]B.S.Clarke,A.R.Barron,JeffreyÕspriorisasymptoticallyleastfavorableunderentropicrisk,J.Stat.PlanningInference41(1994)37–60.
[9]T.M.Cover,E.Ordentlich,Universalportfolioswithsideinformation,EEETrans.nform.Theory(March)(1996).[10]T.M.Cover,Behaviorofsequentialpredictorsofbinarysequences,in:TransactionsoftheFourthPrague
ConferenceonInformationTheory,StatisticalDecisionFunctions,RandomProcesses(Prague,1965),Academia,Prague,1967,pp.263–272.
94Y.Freund/InformationandComputation182(2003)73–94
[11]T.M.Cover,Universalportfolios,Math.Finance1(1)(1991)1–29.[12]L.D.Davisson,Universalnoiselesscoding,EEETrans.nform.Theory19(1973)783–795.
[13]N.G.deBuijn,AsymptoticMethodsinAnalysis.Dover,NewYork,1958,1981.AgoodintroductiontoLaplace
MethodandtheSaddlePointmethodforapproximatingintegrals.[14]M.Feder,N.Merhav,M.Gutman,Universalpredictionofindividualsequences,EEETrans.nform.Theory38
(1992)1258–1270.
[15]T.S.Ferguson,MathematicalStatistics:ADecisionTheoreticApproach,AcademicPress,NewYork,1967.[16]D.P.Foster,Predictionintheworstcase,Ann.Stat.19(2)(1991)1084–1090.
[17]D.P.Foster,R.Vohra,Regretintheon-linedecisionproblem.GamesEcon.Behavior29(1–2)(1999)7–35.
Learningingames:asymposiuminhonorofDavidBlackwell.
optimalbiased[18]Y.Freund,Predictingabinarysequencealmostaswellasthecoin,in:ProceedingsoftheNinthAnnualConferenceonComputationalLearningTheory,1996.
[19]J.Hannan,ApproximationtoBayesriskinrepeatedplay,in:M.Dresher,A.W.Tucker,P.Wolfe(Eds.),
ContributionstotheTheoryofGames,vol.III,PrincetonUniversityPress,Princeton,NJ,1957,pp.97–139.[20]S.Hart,A.Mas-Colell,Asimpleadaptiveprocedureleadingtocorrelatedequilibrium,Econometrica68(5)(2000)
1127–1150.
[21]D.Haussler,Ageneralminimaxresultforrelativeentropy,EEETrans.nform.Theory43(4)(1997)1276–1280.[22]D.Haussler,J.Kivinen,M.K.Warmuth,Tightworst-caselossboundsforpredictingwithexpertadvice,in:
ComputationalLearningTheory:SecondEuropeanConference,EuroCOLTÕ95,Springer,Berlin,1995,pp.69–83.[23]D.Haussler,M.Opper,Worstcasepredictionoversequencesunderlogloss,in:D.OÕLeary,G.Cybenko,J.
Rissanen(Eds.),TheMathematicsofInformationCoding,ExtractionandDistribution,Springer,Berlin,1998.[24]M.Li,P.Vitanyi,in:AnIntroductiontoKolmogorovComplexityanditsApplications,TextsandMonogaraphs
inComputerScience,Springer,Berlin,1993.
[25]N.Littlestone,Learningwhenirrelevantattributesabound,in:28thAnnualSymposiumonFoundationsof
ComputerScience,October1987,pp.68–77.
[26]N.Littlestone,M.K.Warmuth,Theweightedmajorityalgorithm,nform.Comput.108(1994)212–261.[27]J.D.Murray,AsymptoticAnalysis,Springer,Berlin,1973.
[28]J.Rissanen,in:StochasticComplexityinStatisticalnquiry,SeriesinComputerScience,vol.15,WorldScientific,
Singapore,1989.
[29]J.Rissanen,G.G.LangdonJr,Universalmodelingandcoding,EEETrans.nform.TheoryT-27(1)(1981)12–23.[30]J.J.Rissanen,Fisherinformationandstochasticcomplexity,EEETrans.nform.Theory42(1)(1996)40–47.[31]Y.M.ShtarÔkov,Universalsequentialcodingofsinglemessages,ProblemsInform.Transmission(Trans.Russian)
23(July–September)(1987)175–186.
[32]Yu.M.Shtarkov,Codingofdescretesourceswithunknownstatistics,in:.Csiszar,P.Elias(Eds.),Topicsin
InformationTheory,North-Holland,Amsterdam,1975,pp.559–574.
[33]J.Suzuki,Somenotesonuniversalnoiselesscoding,ECETrans.Fundam.E78-A(12)(1995).[34]V.G.Vovk,Agameofpredictionwithexpertadvice,J.Comp.Syst.Sci.56(2)(1998)153–173.
[35]V.G.Vovk,Aggregatingstrategies,in:ProceedingsoftheThirdAnnualWorkshoponComputationalLearning
Theory,1990,pp.371–383.
[36]G.N.Watson,TheoryofBesselfunctions,CambridgeUniversityPress,Cambridge,1952.
[37]F.M.J.Willems,Y.M.Shtarkov,T.J.Tjalkens,Thecontexttreeweightingmethod:basicproperties,EEETrans.
Inform.Theory41(3)(1995)653–664.
[38]Q.Xie,A.Barron,Minimaxredundancyfortheclassofmemorylesssources,EEETrans.nform.Theory43
(1997)446–657.
[39]Q.Xie,A.R.Barron,Asymptoticminimaxregretfordatacompression,gambling,andprediction,EEETrans.
Inform.Theory46(2)(2000)431–445.
[40]K.Yamanishi,Adecision-theoreticextensionofstochasticcomplexityanditsapplicationstolearning,EEE
Trans.Inform.Theory44(4)(1998)1424–1439.
[41]Z.Zhang,Discretenoninformativeprioirs,Ph.D.Thesis,YaleUniversity,1994.