My name is Dustin Boswell and I will be presenting:

Ensemble Methods in Machine Learning
by Thomas G. Dietterich
Oregon State University, Corvallis, Oregon
Classification problem through supervised learning (Notation)

- Given a set S of training examples {x1, x2, ..., xm} with corresponding class labels {y1, y2, ..., ym}
- Each xi is a vector with n features.
- Each yi is one of K class labels.
- A learning algorithm takes S and produces a hypothesis h.
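The notation can be made concrete with a short sketch. The 1-nearest-neighbor learner below is my illustrative assumption (it is not on the slides); any learning algorithm mapping S to a hypothesis h would do.

```python
# Sketch: a learning algorithm maps a training set
# S = [(x1, y1), ..., (xm, ym)] to a hypothesis h.
# Illustrative base learner: 1-nearest-neighbor (an assumption).

def learn(S):
    """Takes the training set S and produces a hypothesis h."""
    def h(x):
        # Predict the label of the training point closest to x.
        def dist(a):
            return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
        nearest_x, nearest_y = min(S, key=lambda pair: dist(pair[0]))
        return nearest_y
    return h

# Each xi is a vector of n features; each yi is one of K class labels.
S = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
h = learn(S)
print(h((0.1, 0.1)))  # -> "A"
```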
Pictorial View of the Learning Process

[Figure: the training pairs (x1, y1), (x2, y2), ..., (xm, ym) feed into a Learning Algorithm (Neural Net, Nearest Neighbor, etc.), which outputs a Hypothesis: "give me an x, I'll give you a y."]
The Ensemble Method

[Figure: the same training pairs (x1, y1), ..., (xm, ym) feed into an Ensembled Learning Algorithm, which internally trains hypotheses H1, H2, H3, ..., HL. "Hey, what's going on inside there?" "Give us an x, we'll give you a y."]
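The box in the figure can be sketched in a few lines: the ensemble holds hypotheses H1, ..., HL and answers "give us an x, we'll give you a y" by majority vote. The toy threshold hypotheses here are my assumptions for illustration.

```python
from collections import Counter

def ensemble_predict(hypotheses, x):
    # Each member votes; the most common label wins.
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]

# Toy component hypotheses (assumptions for illustration).
H1 = lambda x: 1 if x > 0.3 else 0
H2 = lambda x: 1 if x > 0.5 else 0
H3 = lambda x: 1 if x > 0.7 else 0

print(ensemble_predict([H1, H2, H3], 0.6))  # two of three vote 1 -> 1
```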
Characteristics of Ensemble Classifiers

- An ensemble classifier is often more accurate than any of its individual members.
- A necessary and sufficient condition for this is that the individuals be accurate and diverse.
- "Accurate" means a classifier has an error rate e of less than 1/2.
- Two individual classifiers are "diverse" when their out-of-sample errors are uncorrelated (the e's are independent random variables).
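Under the idealized "accurate and diverse" assumption, the benefit is easy to quantify: if L classifiers each err with probability e < 1/2 and their errors are independent, the majority vote errs only when more than half are wrong, which is a tail of a Binomial(L, e) distribution:

```python
from math import comb

def majority_vote_error(L, e):
    # P(more than L/2 of L independent classifiers err), each with error e.
    return sum(comb(L, k) * e**k * (1 - e)**(L - k)
               for k in range(L // 2 + 1, L + 1))

# 21 independent classifiers, each with individual error 0.3:
print(majority_vote_error(21, 0.3))  # well below the individual 0.3
```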
Fundamental Reasons why Ensembles might perform better

- Statistical. When the training data is too small, there are many hypotheses which satisfy it. Ensembling reduces the chance of picking a bad classifier.
- Computational. Depending on the learning algorithm, individual hypotheses might get stuck in local minima of the training error. Ensembling reduces the chance of getting stuck in a bad minimum.
- Representational. An ensembled cast may represent a classifier which was not possible in the original set of hypotheses.
Methods for obtaining an Ensemble

Problem: Given that we only have one training set and one learning algorithm, how can we produce multiple hypotheses?
Solution: Fiddle with everything we can get our hands on!
- Manipulate the training examples
- Manipulate the input data points
- Manipulate the target output (the class labels) of the training data
- Inject randomness
Manipulating the training examples

- Run the learning algorithm multiple times with subsets of the training data.
- This works well for unstable learning algorithms.

Unstable: decision trees, neural networks, rule learning
Stable: linear regression, nearest neighbor, linear threshold
Bagging (manipulating training examples)

- Take multiple bootstrap replicates of the training data.
- Question: If you sample N points from a batch of N (with replacement), how many of the original N do you expect to have? Poll the audience...

- On each sample, a given point has an (N-1)/N chance of being missed.
- To be completely left out after N samples occurs with probability [(N-1)/N]^N.
- Thus to be included at least once occurs with probability 1 - [(N-1)/N]^N = 1 - [1 - 1/N]^N -> 1 - 1/e ≈ 0.63 as N gets large.
- Thus we expect 63% of the points to be in the bootstrap replicate.
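The 63% figure is easy to check empirically. This sketch (function name and parameters are mine) draws bootstrap samples and counts how many distinct originals each one contains:

```python
import random

def bootstrap_fraction(N, trials=200, seed=0):
    # Draw N indices from range(N) with replacement, count distinct ones,
    # and average the distinct fraction over many trials.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sample = {rng.randrange(N) for _ in range(N)}
        total += len(sample)
    return total / (trials * N)

print(round(bootstrap_fraction(1000), 3))  # close to 1 - 1/e ~ 0.632
```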
AdaBoosting (still manipulating the training set)

AdaBoost chooses a series of hypotheses, but the latter ones are designed to excel in the places (the training examples) where the earlier hypotheses did not.
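The "excel where the earlier ones did not" mechanism is re-weighting. A compact sketch on hypothetical 1-D data (the data, the threshold-stump base learner, and all names are my assumptions, not from the slides): each round fits a stump to the weighted training set, then raises the weights of the examples it got wrong, steering the next stump toward them.

```python
import math

def fit_stump(xs, ys, w):
    # Best weighted threshold classifier: predict `sign` if x > t else -sign.
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (sign if xi > t else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=6):
    m = len(xs)
    w = [1.0 / m] * m
    stumps = []
    for _ in range(rounds):
        err, t, sign = fit_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((alpha, t, sign))
        # Re-weight: mistakes get heavier, correct examples lighter.
        w = [wi * math.exp(-alpha * yi * (sign if xi > t else -sign))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    def H(x):  # weighted vote of all stumps
        s = sum(a * (sg if x > t else -sg) for a, t, sg in stumps)
        return 1 if s > 0 else -1
    return H

xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
ys = [-1, -1, 1, -1, 1, 1]       # no single threshold fits this
H = adaboost(xs, ys, rounds=6)
print([H(x) for x in xs])        # -> [-1, -1, 1, -1, 1, 1]
```

No single stump classifies this data perfectly, but the boosted combination does, because later rounds concentrate on the points the first stump missed.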
Manipulation of the input data

- Each input x is a vector of n features.
- Train multiple hypotheses based on the same training set, but for each hypothesis take only a subset of the n features.
- Cherkauer (1996) used this method to train an ensemble of neural nets to identify volcanoes on Venus:
  - there were 119 input features
  - they were grouped (by hand) into subsets of features based on different image-processing operations (PCA, Fourier, etc.)
  - the resulting ensemble matched the ability of expert humans
- Tumer and Ghosh (1996) applied this technique to sonar data and found that removing any of the input features hurt performance.
- The technique only works when the features contain redundant data.
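A minimal sketch of the feature-subset idea, under assumptions of mine: random subsets instead of Cherkauer's hand-chosen groups, and a 1-nearest-neighbor base learner instead of neural nets. Each member sees only some of the n features, then the members vote.

```python
import random
from collections import Counter

def subspace_ensemble(S, n_features, n_members, subset_size, seed=0):
    rng = random.Random(seed)
    # Each member is just a random subset of feature indices.
    members = [rng.sample(range(n_features), subset_size)
               for _ in range(n_members)]
    def predict(x):
        votes = Counter()
        for feats in members:
            # 1-NN restricted to this member's features.
            def dist(a):
                return sum((a[f] - x[f]) ** 2 for f in feats)
            _, y = min(S, key=lambda p: dist(p[0]))
            votes[y] += 1
        return votes.most_common(1)[0][0]
    return predict

# Toy data: the third feature is redundant (constant).
S = [((0, 0, 5), "A"), ((1, 1, 5), "B")]
predict = subspace_ensemble(S, n_features=3, n_members=5, subset_size=2)
print(predict((0.1, 0.2, 5)))  # -> "A"
```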
Manipulation of the output targets (of the input data)

- Each x is mapped to one of K classes (where K is large).
- Divide the set of K classes into two groups A and B.
- Learn that new (and simpler) learning problem for various partitions A and B.
- Each member of the ensemble then implicitly votes for the K/2 classes in A or B (whichever it voted for).
- Think of it like classifying cities to the states where they are, but first classifying which region (southwest, northwest, etc.) they are in.
- Benefit: any 2-class classifier can be used to classify an arbitrary K-class problem.
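A sketch of the voting step, with two deliberate simplifications of mine: the partitions come deterministically from the bits of the class index (a variant close to error-correcting output codes) rather than from random halves, and each binary member is simulated as answering its A-or-B question correctly; in practice it would be a learned 2-class hypothesis.

```python
from collections import Counter

def partition_predict(classes, true_class):
    # One binary partition per bit of the class index; each member votes
    # for every class in the group it answers.
    n_bits = max(1, (len(classes) - 1).bit_length())
    votes = Counter({c: 0 for c in classes})
    true_idx = classes.index(true_class)
    for bit in range(n_bits):
        # Simulated binary member: answers which side of the partition
        # the input's class falls on (assumed correct here).
        answer = (true_idx >> bit) & 1
        for i, c in enumerate(classes):
            if (i >> bit) & 1 == answer:
                votes[c] += 1
    return votes.most_common(1)[0][0]

# Cities-to-states flavor of the slide's analogy.
states = ["AZ", "CA", "NV", "OR", "WA", "UT"]
print(partition_predict(states, "OR"))  # -> "OR"
```

The true class is on the winning side of every partition, while any other class disagrees on at least one bit, so it wins the vote outright.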
Injection of Randomness

- Neural Networks:
  - initial weights can be randomly chosen.
  - the ensemble consists of NNs trained with different initial weights.
- Comparing a) 10-fold cross-validated committees, b) bagging, and c) random initial weights, they performed in that order: a) was the best, c) the worst.
- Injecting randomness into the input vectors is another option.
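A sketch of randomness injection via initial weights, with a simple perceptron standing in for the slide's neural networks (that substitution, the toy data, and all names are my assumptions): the same learner is trained from several random starting weights, and the resulting hypotheses vote.

```python
import random

def train_perceptron(data, seed, epochs=100):
    # Random initial weights -- the injected randomness.
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(data[0][0]) + 1)]  # +1 bias
    for _ in range(epochs):
        for x, y in data:                 # y is +1 or -1
            act = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            if y * act <= 0:              # misclassified: update
                w[0] += y
                w[1:] = [wi + y * xi for wi, xi in zip(w[1:], x)]
    return w

def vote(ws, x):
    # Majority vote of the trained linear hypotheses.
    s = sum(1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            for w in ws)
    return 1 if s > 0 else -1

data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((2, 2), 1), ((3, 2), 1)]
ensemble = [train_perceptron(data, seed) for seed in range(5)]
print(vote(ensemble, (2.5, 2.5)))  # -> 1
```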
Comparison of Ensemble Methods (empirical)

1) C4.5
2) C4.5 with injected randomness (in the tree-building)
3) Bagged C4.5
4) AdaBoost C4.5

- On 33 data sets with little or no noise, AdaBoost performed the best.
- On the same 33 data sets with artificial 20% class-label noise, Bagging was the best (AdaBoost overfit).
- Analogy: AdaBoost tries to come up with a theory that explains everything; Bagging makes sure to know most of it.
Interpretation of the Methods by appealing to the Fundamental Reasons for Ensemble performance

- Bagging and Randomness work by attacking the Statistical issue.
- AdaBoost attacks the Representational problem (it recognizes and uses the fact that not every hypothesis will be correct for all the training points).
D0ηoh ? ̙33xVylEږ)mR5WhmhJyGbP(V) )D)᐀#ÊU9ߎ=xLy;;3;3>r=*f2f5c;=PdO';km;i&d/l$dw֭/3eg/Sߐt==]GX)~&mI5^Bv3vy0d1վ<,2:RÙ3Εop5oڬdKr墴=f;Zsh1e2S;(]SGT})>2C<]9b9mgkj(V6KI'SQϞnPDk[AjAJL7?ee5*] :V>ZmTLN*֎\eqKU'?EʤDc0P֟UXmTN\`X؈=<Ώa^_+<6h
"n5qh7#L?tu^(y8Z*> ch+xwB[X<bx]1GV"0XIkl~Jq;[+hm*M ]Hkd' '# ݖ'a}ypXmP ,"K'ȊJ&(qb`JSvy
X1&cL@+?[x PrܨE?m>nXGa>@O}N_h}?_4^ՎIm?h\BAAl&:̃[,.?+O@,S`ZZ!\EK}Z\ބkpsX^2vY+b'Avv!NmU!f(SWCkmN{CbqzB7^36+pd)Yy0~TݕЕip%53Ҝ$9K2R\I)4ʦ
%UjaroKlJ:M@艜lW],m4h,erCv4r̿
کxRrP0 O+C=IVxMQV5]g`hlq4upV~b
Oh+'0
px
@LX
dpxNo Slide TitleYaser Abu MostafaeCC:\Program Files\Microsoft Office\Templates\Blank Presentation.potYaser Abu Mostafaic23eMicrosoft PowerPointoso@u@`Aq@@X}KV@ rGof 8T&~ &&#TNPPp0D
v
&
TNPP &&TNPP~
 " ! "&G3q&
 &G3yq& {q2 "Arial0 .T2
;3My name is Dustin Boswell and I will be presenting:
."Arial0 .=2
l$Ensemble Methods in Machine Learning" ! ! ."Arial0 .
2
by. .%2
7CThomas G. Dietterich
. .F2
m*Oregon State University, Corvallis, Oregon
."System&TNPP &
՜.+,D՜.+,p,
CustomCaltechҎjTimes New RomanArialSymbolBlank Presentation.potMicrosoft Clip GalleryNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide Title)Bagging (manipulating training examples)No Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleNo Slide TitleFonts UsedDesign TemplateEmbedded OLE Servers
Slide Titles 6>
_PID_GUIDAN{909BE6E0741111D5A4B700A0C95DF01A})_~bYaser Abu Mostafa
!"#$%&'()*+,./0123456789:;<=>?@ABCDEFGIJKLMNOQRSTUVWYZ[\]^_bRoot EntrydO)RrPicturesCurrent UserXSummaryInformation(HPowerPoint Document
(~ DocumentSummaryInformation8P