Machine learning question; the solution needs to be typed in LaTeX. Only Question 2 is required.

UNIVERSITY COLLEGE LONDON
Faculty of Engineering Sciences
Department of Computer Science
COMP0036: Assignment 1
Dr. Dariush Hosseini (dariush.hosseini@ucl.ac.uk)
Tuesday 10th November 2020
Overview
• Assignment Release Date: 10th November 2020
• Assignment Hand-in Date: 17th November 2020 at 4.00 p.m.
• Weighting: 30% of module total
• Format: Problems
Guidelines
• You should answer all TWO questions.
• Note that not all questions carry equal marks.
• Note that several questions are marked ‘Bonus’. It is possible to score full marks without
attempting any such ‘Bonus’ questions.
• Your score will be calculated as the lesser of 50 or your mark total (including any ‘Bonus’
marks).
• You should submit your final report as a pdf via the module’s Moodle page.
• Within your report you should begin each question on a new page.
• You should preface your report with a single page containing, on two lines:
– The module code and assignment title: ‘COMP0036: Assignment 1’
– Your candidate number: ‘Candidate Number: [YOUR NUMBER]’
• Your report should be neat and legible.
• You must use LaTeX to format the report. A template, ‘COMP0036 Solution Template.tex’,
is provided on the module’s Moodle page for this purpose.
• Please attempt to express your answers as succinctly as possible.
• Please note that if your answer to a question or sub-question is illegible or incomprehensible to the marker then you will receive no marks for that question or sub-question.
• Please remember to detail your working, and state clearly any assumptions which you
make.
• Unless a question specifies otherwise, please make use of the Notation section as a guide to the definition of objects.
• Failure to adhere to any of the guidelines may result in question-specific deduction of
marks. If warranted these deductions may be punitive, and on occasion may result in no
marks being awarded for the assignment.
Notation & Formulae
Inputs:
\[
\mathbf{x} = [1, x_1, x_2, \dots, x_m]^T \in \mathbb{R}^{m+1}
\]
Outputs:
$y \in \mathbb{R}$ for regression problems
$y \in \{0, 1\}$ for binary classification problems
Training Data:
\[
S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}
\]
Input Training Data:
The design matrix, X, is defined as:
\[
X = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{bmatrix}
  = \begin{bmatrix} 1 & x^{(1)}_1 & \cdots & x^{(1)}_m \\ 1 & x^{(2)}_1 & \cdots & x^{(2)}_m \\ \vdots & \vdots & & \vdots \\ 1 & x^{(n)}_1 & \cdots & x^{(n)}_m \end{bmatrix}
\]
Output Training Data:
\[
\mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
\]
Data-Generating Distribution:
S is drawn i.i.d. from a data-generating distribution, D
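
For concreteness, the following minimal numpy sketch (with made-up input and output values) builds a design matrix of the form defined above, prepending the column of ones, together with the corresponding output vector:

    # Illustrative only: the raw inputs and outputs below are made-up values.
    import numpy as np

    raw_X = np.array([[0.5, 1.2],      # x^(1) without the leading 1
                      [2.0, 0.3],      # x^(2)
                      [1.1, 4.0]])     # x^(3)
    y = np.array([1.7, 2.4, 3.9])      # y^(1), y^(2), y^(3)

    n = raw_X.shape[0]
    X = np.hstack([np.ones((n, 1)), raw_X])   # prepend the column of ones: X is n x (m+1)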
Dirichlet Distribution:
Let $\mathbf{X} = [X_1, X_2, \dots, X_m]^T$ be a continuous vector of random variables, the outcomes of which are $\mathbf{x} = [x_1, x_2, \dots, x_m]^T$, where $x_i \in [0, 1]\ \forall i$ and $\sum_{i=1}^{m} x_i = 1$, and which follow a Dirichlet distribution, $\widetilde{\mathcal{D}}$:
\[
\mathbf{x} \sim \mathrm{Dir}(\alpha_1, \alpha_2, \dots, \alpha_m) \quad \text{where: } \alpha_1, \alpha_2, \dots, \alpha_m > 0
\]
This has a characteristic probability density function, $f_{\mathbf{X}}$:
\[
f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_m)} \prod_{i=1}^{m} x_i^{\alpha_i - 1}
\]
Where:
$\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \dots, \alpha_m]^T$
$\alpha_0 = \sum_{i=1}^{m} \alpha_i$.
The expectation is given by:
\[
\mathbb{E}_{\widetilde{\mathcal{D}}}[X_i] = \frac{\alpha_i}{\alpha_0}
\]
The mode is given by:
\[
\mathrm{mode}[X_i] = \frac{\alpha_i - 1}{\alpha_0 - m} \quad \text{where: } \alpha_i > 1
\]
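
A quick numerical sanity check of the expectation and mode formulae above, using an arbitrary illustrative parameter vector α = [2, 3, 5]:

    # Sanity check of the Dirichlet mean and mode formulae with made-up parameters.
    import numpy as np
    from scipy.stats import dirichlet

    alpha = np.array([2.0, 3.0, 5.0])      # illustrative alpha, all > 1 so the mode formula applies
    alpha0 = alpha.sum()
    m = alpha.size

    mean = alpha / alpha0                  # E[X_i] = alpha_i / alpha_0
    mode = (alpha - 1.0) / (alpha0 - m)    # mode[X_i] = (alpha_i - 1) / (alpha_0 - m)

    print(mean)                            # [0.2 0.3 0.5]
    print(dirichlet.mean(alpha))           # agrees with the formula above
    print(mode)                            # [0.14285714 0.28571429 0.57142857]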

1. In a linear regression setting in which $w_0 = 0$, in general we seek to learn a linear mapping, $f_{\mathbf{w}}$, characterised by a weight vector, $\mathbf{w} \in \mathbb{R}^m$, and drawn from a function class, $\mathcal{F}$:
\[
\mathcal{F} = \left\{ f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} \;\middle|\; \mathbf{w} = [w_1, \dots, w_m]^T \in \mathbb{R}^m \right\}
\]
Here $\mathbf{x} = [x_1, x_2, \dots, x_m]^T$.
We assume that data, $S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, is drawn i.i.d., and that the associated design matrix is described as follows:
\[
X = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{bmatrix}
  = \begin{bmatrix} x^{(1)}_1 & \cdots & x^{(1)}_m \\ x^{(2)}_1 & \cdots & x^{(2)}_m \\ \vdots & & \vdots \\ x^{(n)}_1 & \cdots & x^{(n)}_m \end{bmatrix}
\]
Now, consider a data-generating distribution described by a Gaussian additive noise model, such that:
\[
y = \mathbf{w} \cdot \mathbf{x} + \varepsilon \quad \text{where: } \varepsilon \sim \mathcal{N}(0, \alpha),\ \alpha > 0
\]
Here $y$ is the outcome of a random variable, $Y$, which characterises the output of a particular data point, and $\mathbf{x}$ is the outcome of a random variable, $\mathbf{X}$, which characterises the input to a particular data point.
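
To make this data-generating distribution concrete, here is a minimal simulation sketch; the weight vector and noise variance are made-up illustrative values (not the parameters behind the data below), and α is treated as the noise variance:

    # Simulate y = w.x + eps with eps ~ N(0, alpha); all parameter values are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, 0.5])                  # hypothetical w (no intercept, since w_0 = 0)
    alpha = 0.5                                    # hypothetical noise variance

    X_sim = rng.uniform(-5.0, 5.0, size=(8, 2))    # 8 random inputs with m = 2
    eps = rng.normal(0.0, np.sqrt(alpha), size=8)
    y_sim = X_sim @ w_true + eps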
In a particular setting for which $m = 2$, we are given the following sample training data:
\[
X = \begin{bmatrix} 2.0 & 3.0 \\ 3.0 & 0.1 \\ 4.0 & 6.0 \\ 5.0 & 8.0 \\ 6.0 & 15.0 \\ 7.0 & -3.0 \\ 8.0 & -7.0 \\ 7.5 & 3.0 \end{bmatrix}
\qquad
\mathbf{y} = \begin{bmatrix} 6.65 \\ 8.91 \\ 13.06 \\ 16.88 \\ 19.33 \\ 20.53 \\ 24.41 \\ 23.34 \end{bmatrix}
\]
In what follows, any numerical answers should be given to 3 significant figures.
(a) [5 marks]
Analytically derive the Maximum Likelihood Estimate (MLE) of w and of α in the
context of this data-generating distribution and data.
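
For anyone wishing to sanity-check a hand derivation numerically, the sketch below assumes the standard result that, under Gaussian additive noise, the MLE of w coincides with the least-squares solution and the MLE of α with the mean squared residual; it is only a checking aid, not the analytic derivation the question asks for:

    # Numerical cross-check only; the analytic derivation is still required.
    import numpy as np

    X = np.array([[2.0, 3.0], [3.0, 0.1], [4.0, 6.0], [5.0, 8.0],
                  [6.0, 15.0], [7.0, -3.0], [8.0, -7.0], [7.5, 3.0]])
    y = np.array([6.65, 8.91, 13.06, 16.88, 19.33, 20.53, 24.41, 23.34])

    w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution = Gaussian MLE of w
    residuals = y - X @ w_mle
    alpha_mle = np.mean(residuals ** 2)             # MLE of the noise variance

    print(np.round(w_mle, 3), round(float(alpha_mle), 3))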
(b) Assume a prior distribution, $p_{\mathbf{W}}(\mathbf{w})$, over $\mathbf{w}$, such that each instance of $\mathbf{w}$ is an outcome of a Gaussian random variable, $\mathbf{W}$, where:
\[
\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \beta I_m) \quad \text{where: } \beta > 0
\]
(i) [5 marks]
Assuming that α = 0.5 and β = 0.05, analytically derive the Maximum A Posteriori (MAP) estimate of w in the context of this model and data. (You should not
explicitly use differentiation in your derivation).
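
A similar numerical cross-check for this part can assume the standard Gaussian-prior closed form $\mathbf{w}_{\mathrm{MAP}} = (X^T X + (\alpha/\beta) I_m)^{-1} X^T \mathbf{y}$; again, this is only a checking aid under that assumed form, not the derivation itself:

    # Numerical cross-check of a Gaussian-prior MAP estimate (ridge-style closed form).
    import numpy as np

    # X and y as defined in the question above.
    X = np.array([[2.0, 3.0], [3.0, 0.1], [4.0, 6.0], [5.0, 8.0],
                  [6.0, 15.0], [7.0, -3.0], [8.0, -7.0], [7.5, 3.0]])
    y = np.array([6.65, 8.91, 13.06, 16.88, 19.33, 20.53, 24.41, 23.34])

    alpha, beta = 0.5, 0.05
    lam = alpha / beta                                       # regularisation strength implied by the prior
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(np.round(w_map, 3))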
(ii) [7 marks]
Assuming that α = 0.5, explain whether it is possible to set β such that the
MAP estimate of w2 = 0. If so, provide the setting(s) of β which lead to such an
estimate.
(c) Now, instead assume a prior distribution, $p_{\mathbf{W}}(\mathbf{w})$, over $\mathbf{w}$, such that:
\[
p_{\mathbf{W}}(\mathbf{w}) = \prod_{i=1}^{m} \frac{1}{2b} \exp\left(-\frac{|w_i|}{b}\right) \quad \text{where: } b > 0
\]
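
Under this Laplace prior there is no closed form in general, but a hand-derived MAP estimate can still be cross-checked numerically by minimising the negative log-posterior directly; the sketch below (a checking aid only) does this with scipy.optimize.minimize:

    # Numerically minimise the negative log-posterior under the Laplace prior (checking aid only).
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 3.0], [3.0, 0.1], [4.0, 6.0], [5.0, 8.0],
                  [6.0, 15.0], [7.0, -3.0], [8.0, -7.0], [7.5, 3.0]])
    y = np.array([6.65, 8.91, 13.06, 16.88, 19.33, 20.53, 24.41, 23.34])
    alpha, b = 0.5, 0.05

    def neg_log_posterior(w):
        # (1/(2*alpha)) * ||y - Xw||^2 + (1/b) * sum |w_i|, dropping w-independent constants
        return np.sum((y - X @ w) ** 2) / (2 * alpha) + np.sum(np.abs(w)) / b

    w_map = minimize(neg_log_posterior, x0=np.zeros(2), method="Nelder-Mead").x
    print(np.round(w_map, 3))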
(i) [6 marks]
Assuming that α = 0.5 and b = 0.05, and given that all elements of the optimal
weight vector are positive, analytically derive the Maximum a Posteriori (MAP)
estimate of w in the context of this model and data.
(ii) [Bonus: 10 marks]
Assuming that α = 0.5, explain whether it is possible to set b such that the MAP estimate of w2 = 0 and w1 > 0. If so, provide the setting(s) of b which lead to such an estimate.
(d) [2 marks]
In general, explain what behaviour is exhibited in the MAP estimator of the model described in part (c) (for certain settings of α, b) which is not exhibited in the MAP estimator of the model in part (b) or the MLE estimator of the model in part (a).
[Total for Question 1: 25 marks + Bonus 10 marks]

2. Consider the Categorical Naive Bayes approach to binary classification. Here $y \in \{0, 1\}$ is the outcome of a random variable, $Y$, which characterises the binary output label of a particular data point, and $\mathbf{x} = [x_1, x_2, \dots, x_m]^T$ is the $m$-dimensional outcome of a random variable, $\mathbf{X} = [X_1, X_2, \dots, X_m]^T$, which characterises the input attributes of a particular data point. The outcome of the $i$-th attribute, $x_i$, can take one of $m_i$ discrete settings, and we write $x_{ij}$ to denote the $j$-th discrete setting of the attribute $x_i$.
Here, we assume that $Y$ is a Bernoulli random variable:
\[
y \sim \mathrm{Bern}(\theta_y)
\]
Where: $\theta_y \in [0, 1]$.
And we assume that the class-conditional input attribute random variables, $(X_i \mid y = k)$, are Categorical random variables:
\[
(x_i \mid y = k) \sim \mathrm{Categorical}(\boldsymbol{\theta}_{ik})
\]
Where: $\boldsymbol{\theta}_{ik} = [\theta_{i1k}, \theta_{i2k}, \dots, \theta_{i m_i k}]^T$; $\theta_{ijk} \in [0, 1]$; $\sum_{j=1}^{m_i} \theta_{ijk} = 1$.
In this setting in general we seek to learn a posterior probability distribution, $p_Y(y \mid \mathbf{x}) = f_{\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}}(\mathbf{x})$, where $f_{\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}}$ is drawn from a function class, $\mathcal{F}$:
\[
\mathcal{F} = \left\{ f_{\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}}(\mathbf{x}) = \frac{p_Y(y) \prod_{i=1}^{m} p_{X_i}(x_i \mid y)}{p_{\mathbf{X}}(\mathbf{x})} \;\middle|\; p_Y(y = 1) = \theta_y,\ p_{X_i}(x_i = x_{ij} \mid y = k) = \theta_{ijk},\ \boldsymbol{\theta}_{ik} = [\theta_{i1k}, \theta_{i2k}, \dots, \theta_{i m_i k}]^T \right\}
\]
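
To make the function class concrete, the sketch below evaluates the Naive Bayes posterior $p_Y(y = 1 \mid \mathbf{x})$ for a single test point; all parameter values and the test point are made up purely for illustration (here m = 2, with m_1 = 2 and m_2 = 3 settings):

    # Evaluate p_Y(y = 1 | x) for a Categorical Naive Bayes model with made-up parameters.
    import numpy as np

    theta_y = 0.6                                          # p(y = 1)
    # theta[i][k][j] = p(x_i takes its j-th setting | y = k); each row sums to 1
    theta = [
        {0: np.array([0.7, 0.3]),      1: np.array([0.2, 0.8])},       # attribute 1: m_1 = 2 settings
        {0: np.array([0.5, 0.4, 0.1]), 1: np.array([0.1, 0.3, 0.6])},  # attribute 2: m_2 = 3 settings
    ]

    x_test = [1, 2]                                        # j-th setting index (1-based) for each attribute

    def joint(k):
        prior = theta_y if k == 1 else 1.0 - theta_y
        lik = np.prod([theta[i][k][j - 1] for i, j in enumerate(x_test)])
        return prior * lik

    posterior_1 = joint(1) / (joint(0) + joint(1))         # p(y = 1 | x) via Bayes' rule
    print(round(posterior_1, 3))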
We are given a sample data set, $S = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $n_k$ of the points are those for which $y^{(i)} = k$, and $\tilde{n}_{\tilde{i}jk}$ of the points are those for which $(y^{(i)} = k) \wedge (x^{(i)}_{\tilde{i}} = x_{\tilde{i}j})$.
Given a novel test point with input $\tilde{\mathbf{x}}$, we seek to make statements about the (unknown) output, $\tilde{y}$, of this novel test point. The statements will be phrased in terms of the posterior probability of the output $\tilde{y}$, conditional on observing the input $\tilde{\mathbf{x}}$.
(a) [9 marks]
Derive the Maximum Likelihood Estimate (MLE) of the posterior probability of the output $\tilde{y}$, $p_Y\!\left(\tilde{y} = 1 \mid \tilde{\mathbf{x}};\ \theta_y^{\mathrm{MLE}}, \{\boldsymbol{\theta}_{ik}^{\mathrm{MLE}}\}_{i=1,k=0}^{m,1}\right)$.
(Do not use explicit constrained optimisation techniques in your answer. Instead enforce the constraints by writing $\theta_{ijk} = \frac{e^{\eta_{ijk}}}{\sum_{j'=1}^{m_i} e^{\eta_{ij'k}}}$, where $\eta_{ijk} \in \mathbb{R}$.)
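
The reparameterisation in the hint is simply a softmax over the unconstrained η values; the snippet below illustrates, for one made-up (i, k) pair, that the resulting θ values automatically lie in [0, 1] and sum to 1:

    # Softmax reparameterisation: theta_ijk = exp(eta_ijk) / sum_j' exp(eta_ij'k).
    import numpy as np

    eta = np.array([0.3, -1.2, 2.0])           # unconstrained, made-up values for one (i, k) pair
    theta = np.exp(eta) / np.exp(eta).sum()    # lies in [0, 1] and sums to 1 by construction

    print(theta, theta.sum())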
(b) [9 marks]
Given a factorised prior distribution over $\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}$, namely $p_{\Theta_y, \{\Theta_{ik}\}_{i=1,k=0}^{m,1}}\!\left(\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}\right)$, such that each instance of $\theta_y$ is the outcome of a Beta distributed random variable, $\Theta_y$, and each instance of $\boldsymbol{\theta}_{ik}$ is the outcome of a Dirichlet distributed random variable, $\Theta_{ik}$, where:
\[
p_{\Theta_y, \{\Theta_{ik}\}_{i=1,k=0}^{m,1}}\!\left(\theta_y, \{\boldsymbol{\theta}_{ik}\}_{i=1,k=0}^{m,1}\right) = p_{\Theta_y}(\theta_y) \prod_{i=1}^{m} \prod_{k=0}^{1} p_{\Theta_{ik}}(\boldsymbol{\theta}_{ik})
\]
\[
\theta_y \sim \mathrm{Beta}(\beta_1, \beta_0)
\qquad
\left\{ \boldsymbol{\theta}_{ik} \sim \mathrm{Dir}\!\left(\alpha^{ik}_1, \alpha^{ik}_2, \dots, \alpha^{ik}_{m_i}\right) \right\}_{i=1,k=0}^{m,1}
\]
Here we constrain $\beta_1 > 1$, $\beta_0 > 1$, $\alpha^{ik}_j > 1\ \forall i, j, k$.
Derive the Maximum A Posteriori (MAP) estimate of the posterior probability of the output $\tilde{y}$, $p_Y\!\left(\tilde{y} = 1 \mid \tilde{\mathbf{x}};\ \theta_y^{\mathrm{MAP}}, \{\boldsymbol{\theta}_{ik}^{\mathrm{MAP}}\}_{i=1,k=0}^{m,1}\right)$.
(Hint: Use the attached ‘Notation & Formulae’ sheet.)
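
For intuition about this factorised prior, the following sketch draws one sample of θ_y and of a single θ_ik using scipy; the hyperparameter values are made up and merely respect the constraints (> 1) stated above:

    # Draw illustrative samples from the factorised prior; hyperparameters are made up (> 1).
    import numpy as np
    from scipy.stats import beta, dirichlet

    beta1, beta0 = 2.0, 3.0                                   # Beta(beta_1, beta_0) for theta_y
    alpha_ik = np.array([2.0, 2.0, 4.0])                      # Dir(alpha) for one theta_ik with m_i = 3

    theta_y_sample = beta.rvs(beta1, beta0, random_state=0)
    theta_ik_sample = dirichlet.rvs(alpha_ik, random_state=0)[0]   # components sum to 1
    print(theta_y_sample, theta_ik_sample)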
(c) [4 marks]
As n becomes very large, succinctly describe the relationship between your answers to parts (a) and (b).
(d) [3 marks]
What deficiency associated with the MLE estimate of part (a) is resolved via the MAP approach of part (b)?
[Total for Question 2: 25 marks]