ML Week 3 Logistic Regression

VI. Logistic Regression - Feedback

You achieved a score of 3.50 out of 5.00. Your answers, as well as our explanations, are shown below.

To review the material and deepen your understanding of the course content, please answer the review questions below, and hit submit at the bottom of the page when you're done. You are allowed to take/re-take these review quizzes multiple times, and each time you will see a slightly different set of questions or answers. We will use only your highest score, and strongly encourage you to continue re-taking each quiz until you get a 100% score at least once. (Even after that, you can re-take it to review the content further, with no risk of your final score being reduced.) To prevent rapid-fire guessing, the system enforces a minimum of 10 minutes between each attempt.

Question 1

Suppose that you have trained a logistic regression classifier, and it outputs on a new example x a prediction hθ(x) = 0.2. This means (check all that apply):

- Our estimate for P(y=0|x;θ) is 0.8.
- Our estimate for P(y=1|x;θ) is 0.8.
- Our estimate for P(y=1|x;θ) is 0.2.
- Our estimate for P(y=0|x;θ) is 0.2.

Your answer / Score / Choice explanation

- Our estimate for P(y=0|x;θ) is 0.8. (Score: 0.25)
  Explanation: Since we must have P(y=0|x;θ) = 1 − P(y=1|x;θ), the former is 1 − 0.2 = 0.8.
- Our estimate for P(y=1|x;θ) is 0.8. (Score: 0.25)
  Explanation: hθ(x) gives P(y=1|x;θ), not 1 − P(y=1|x;θ).
- Our estimate for P(y=1|x;θ) is 0.2. (Score: 0.25)
  Explanation: hθ(x) is precisely P(y=1|x;θ), so each is 0.2.
- Our estimate for P(y=0|x;θ) is 0.2. (Score: 0.25)
  Explanation: hθ(x) is P(y=1|x;θ), not P(y=0|x;θ).

Total: 1.00 / 1.00
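As a concrete check of this reading of hθ(x), here is a minimal sketch (assuming NumPy; the θ and x values are made up so that hθ(x) comes out near 0.2) that prints both probability estimates.

    import numpy as np

    def sigmoid(z):
        # Logistic function g(z) = 1 / (1 + e^(-z)); always strictly between 0 and 1.
        return 1.0 / (1.0 + np.exp(-z))

    theta = np.array([-1.386, 0.0])   # hypothetical parameters (theta_0, theta_1)
    x = np.array([1.0, 3.0])          # x_0 = 1 (intercept term), x_1 = 3 (made-up example)

    h = sigmoid(theta @ x)            # h_theta(x) = estimated P(y=1 | x; theta)
    print("P(y=1|x;theta) =", round(float(h), 3))      # about 0.2
    print("P(y=0|x;theta) =", round(float(1 - h), 3))  # about 0.8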

Question 2

Suppose you train a logistic classifier hθ(x) = g(θ0 + θ1x1 + θ2x2). Suppose θ0 = −6, θ1 = 0, θ2 = 1. Which of the following figures represents the decision boundary found by your classifier?

Your answer / Score / Choice explanation

- Selected figure (not reproduced here). (Score: 0.00)
  Explanation: In this figure, we transition from negative to positive when x1 goes from below 6 to above 6, but for the given values of θ, the transition occurs when x2 goes from below 6 to above 6.

Total: 0.00 / 1.00
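For these parameters the hypothesis reduces to hθ(x) = g(−6 + x2), which crosses 0.5 exactly where x2 = 6, regardless of x1; the decision boundary is therefore the horizontal line x2 = 6. A small sketch (assuming NumPy; the sample points are arbitrary) makes this visible:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    theta = np.array([-6.0, 0.0, 1.0])   # theta_0, theta_1, theta_2 from the question

    # A few points written as [x_0 = 1, x1, x2]; x1 varies but has no effect here.
    points = np.array([[1.0, 0.0, 5.0],
                       [1.0, 4.0, 6.0],
                       [1.0, 9.0, 7.0]])

    h = sigmoid(points @ theta)
    print(h)           # roughly [0.27, 0.50, 0.73]
    print(h >= 0.5)    # [False, True, True]: predict y = 1 exactly when x2 >= 6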

Question 3

Suppose you have the following training set, and fit a logistic regression classifier hθ(x) = g(θ0 + θ1x1 + θ2x2).

x1    x2    y
1     0.5   0
1     1.5   0
2     1     1
3     1     0

Which of the following are true? Check all that apply.

- The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge.
- At the optimal value of θ (e.g., found by fminunc), we will have J(θ) ≥ 0.
- Adding polynomial features (e.g., instead using hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1x2 + θ5x2²)) would increase J(θ) because we are now summing over more terms.
- J(θ) will be a convex function, so gradient descent should converge to the global minimum.

Your answer / Score / Choice explanation

- The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge. (Score: 0.25)
  Explanation: While it is true they cannot be separated, gradient descent will still converge to the optimal fit. Some examples will remain misclassified at the optimum.
- At the optimal value of θ (e.g., found by fminunc), we will have J(θ) ≥ 0. (Score: 0.25)
  Explanation: The cost function J(θ) is always non-negative for logistic regression.
- Adding polynomial features (e.g., instead using hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1x2 + θ5x2²)) would increase J(θ) because we are now summing over more terms. (Score: 0.25)
  Explanation: The summation in J(θ) is over examples, not features. Furthermore, the hypothesis will now be more accurate (or at least just as accurate) with new features, so the cost function will decrease.
- J(θ) will be a convex function, so gradient descent should converge to the global minimum. (Score: 0.25)
  Explanation: The cost function J(θ) is guaranteed to be convex for logistic regression.

Total: 1.00 / 1.00
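The non-negativity claim can be checked numerically. With J(θ) = −(1/m) ∑_{i=1}^m [y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i)))], every summand is non-negative because hθ(x) lies strictly between 0 and 1. Here is a minimal sketch (assuming NumPy; the θ vectors are arbitrary illustrations, not the optimum a solver such as fminunc would find) that evaluates J(θ) on the training set above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # Logistic regression cost J(theta); each term is >= 0 since h is in (0, 1).
        h = sigmoid(X @ theta)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    # Training set from Question 3, with x_0 = 1 prepended to each example.
    X = np.array([[1.0, 1.0, 0.5],
                  [1.0, 1.0, 1.5],
                  [1.0, 2.0, 1.0],
                  [1.0, 3.0, 1.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0])

    for theta in (np.zeros(3), np.array([-1.0, 0.5, 0.2])):   # arbitrary parameter choices
        print(theta, "->", cost(theta, X, y))                 # always >= 0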

Question 4

For logistic regression, the gradient is given by ∂J(θ)/∂θj = (1/m) ∑_{i=1}^m (hθ(x^(i)) − y^(i)) x_j^(i). Which of these is a correct gradient descent update for logistic regression with a learning rate of α? Check all that apply.

- θj := θj − α (1/m) ∑_{i=1}^m (θ^T x − y^(i)) x_j^(i) (simultaneously update for all j).
- θ := θ − α (1/m) ∑_{i=1}^m (hθ(x^(i)) − y^(i)) x^(i).
- θj := θj − α (1/m) ∑_{i=1}^m (1/(1 + e^(−θ^T x^(i))) − y^(i)) x_j^(i) (simultaneously update for all j).
- θ := θ − α (1/m) ∑_{i=1}^m (θ^T x − y^(i)) x^(i).

Your answer / Score / Choice explanation

- θj := θj − α (1/m) ∑_{i=1}^m (θ^T x − y^(i)) x_j^(i) (simultaneously update for all j). (Score: 0.25)
  Explanation: This uses the linear regression hypothesis θ^T x instead of that for logistic regression.
- θ := θ − α (1/m) ∑_{i=1}^m (hθ(x^(i)) − y^(i)) x^(i). (Score: 0.25)
  Explanation: This is a vectorized version of the direct substitution of ∂J(θ)/∂θj into the gradient descent update.
- θj := θj − α (1/m) ∑_{i=1}^m (1/(1 + e^(−θ^T x^(i))) − y^(i)) x_j^(i) (simultaneously update for all j). (Score: 0.25)
  Explanation: This substitutes the exact form of hθ(x^(i)) used by logistic regression into the gradient descent update.
- θ := θ − α (1/m) ∑_{i=1}^m (θ^T x − y^(i)) x^(i). (Score: 0.25)
  Explanation: This vectorized version uses the linear regression hypothesis θ^T x instead of that for logistic regression.

Total: 1.00 / 1.00
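Here is a minimal sketch of the correct vectorized update θ := θ − α (1/m) ∑_{i=1}^m (hθ(x^(i)) − y^(i)) x^(i) (assuming NumPy; the learning rate and iteration count are illustrative choices, and the data reuses the small training set from Question 3):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Design matrix with x_0 = 1 in the first column, and 0/1 labels.
    X = np.array([[1.0, 1.0, 0.5],
                  [1.0, 1.0, 1.5],
                  [1.0, 2.0, 1.0],
                  [1.0, 3.0, 1.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0])

    m = len(y)
    alpha = 0.1                      # learning rate (illustrative)
    theta = np.zeros(X.shape[1])

    for _ in range(1000):            # fixed number of iterations (illustrative)
        h = sigmoid(X @ theta)                 # h_theta(x^(i)) for every example at once
        grad = X.T @ (h - y) / m               # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) x^(i)
        theta = theta - alpha * grad           # simultaneous update of all theta_j

    print(theta)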

Question 5

Which of the following statements are true? Check all that apply.

- The cost function J(θ) for logistic regression trained with m ≥ 1 examples is always greater than or equal to zero.
- The sigmoid function g(z) = 1/(1 + e^(−z)) is never greater than one (> 1).
- For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc.).
- Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression.

Your answer / Score / Choice explanation

- The cost function J(θ) for logistic regression trained with m ≥ 1 examples is always greater than or equal to zero. (Score: 0.00)
  Explanation: The cost for any example x^(i) is always ≥ 0 since it is the negative log of a quantity less than one. The cost function J(θ) is a summation over the cost for each example, so the cost function itself must be greater than or equal to zero.
- The sigmoid function g(z) = 1/(1 + e^(−z)) is never greater than one (> 1). (Score: 0.25)
  Explanation: The denominator ranges from ∞ to 1 as z grows, so the result is always in (0, 1).
- For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc.). (Score: 0.00)
  Explanation: The cost function for logistic regression is convex, so gradient descent will always converge to the global minimum. We still might use a more advanced optimization algorithm since they can be faster and don't require you to select a learning rate.
- Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression. (Score: 0.25)
  Explanation: As demonstrated in the lecture, linear regression often classifies poorly since its training procedure focuses on predicting real-valued outputs, not classification.

Total: 0.50 / 1.00
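The bound on the sigmoid can also be spot-checked: since e^(−z) > 0 for every finite z, the denominator 1 + e^(−z) is strictly greater than 1, so g(z) stays strictly between 0 and 1. A quick numerical check (assuming NumPy; the z range is arbitrary, and for very large |z| floating-point rounding would report exactly 0 or 1 even though the true value never reaches them):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-30, 30, 601)           # an arbitrary range of inputs
    g = sigmoid(z)
    print(g.min(), g.max())                 # both strictly inside (0, 1)
    print(np.all(g > 0) and np.all(g < 1))  # True for this range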
