02 May, 2017

Try alternative cost functions

In the following, we will try alternative cost functions for data set data1.

original cost function

\[\begin{eqnarray} J \left(\begin{pmatrix}\theta_0\\ \theta_1\end{pmatrix} \right) &=& \frac{1}{2m}\sum_{i=1}^{m} (h(x^{(i)})-y^{(i)})^2 \\ &=& \frac{1}{2m}\sum_{i=1}^{m} ((\theta_0 + \theta_1 x^{(i)})-y^{(i)})^2 \\ &=& \frac{1}{2m}(X\times\Theta'-y)'\times (X\times\Theta'-y) \end{eqnarray}\]
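
The helper functions used below (computeCost, computeCost1, computeCost2, plotCostSurface_pure, plotLinearFit, gradientDescent, plotCostDev) are defined elsewhere and not listed in this post. As a reference, a minimal sketch of computeCost implementing the vectorized form above could look like this; the signature computeCost(data, theta) is only inferred from the calls below:

# Sketch of a possible computeCost(data, theta) (assumed, not the original
# helper): vectorized squared-error cost J(theta).
computeCost <- function(data, theta) {
  m <- nrow(data)                # number of samples
  X <- cbind(1, data[, 1])       # design matrix with intercept column
  y <- data[, 2]
  error <- X %*% theta - y       # h(x^(i)) - y^(i) for all samples
  sum(error^2) / (2 * m)
}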

# experiment with cost functions
### original cost function 
# graphical parameter
par(lwd=2,pch=4)

### Play with data and linear regression
n = 120 # number of samples
m = 10  # max value of x (min is zero)
X <- seq(from=0,to=m,length.out=n)
y <- rep(1,times=n)
y[seq(from=1,to=n,by=3)] <- 4
data1 <- cbind(X,y)
data <- data1
plot(data,col="red",main="data1")

Looking at the data, it is clear that the optimal value for \(\theta_1\) is \(0\). Let \(J_0\) denote the cost function with \(\theta_1=0\) held fixed, i.e. \(J_0(\theta_0)=J(\theta_0,0)\).

# vary only theta0
theta0_vals <- seq(from=0,to=4,length.out=100)
cost_vals <- rep(1,times=100)
for (i in 1:100){
  cost_vals[i] <- computeCost(data,c(theta0_vals[i],0))
  }
plot(theta0_vals,cost_vals,type="l",main="Original cost function with fixed theta1=0")
plotCostSurface_pure(data,computeCost,c(-3,3),c(-2,2))
val_vec<-c(0,1,1.5,2,2.5,3,4)
val_df <- data.frame()
for (i in val_vec){
  val_df<- rbind(val_df,c(i,round(computeCost(data,c(i,0)),digits=3)))
}
names(val_df) <- c("t0","cost") 
val_df$t1 <- 0
val_df <-val_df[,c(1,3,2)]
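
plotCostSurface_pure, used above, conceptually evaluates the supplied cost function on a grid over the given \(\theta_0\)- and \(\theta_1\)-ranges and draws the resulting surface. A rough sketch (assumed; the signature is taken from the call above) could be:

# Rough sketch of plotCostSurface_pure (assumed): evaluate costFun on a
# theta0/theta1 grid and draw a contour plot of the cost surface.
plotCostSurface_pure <- function(data, costFun, theta0_range, theta1_range,
                                 n_grid = 60) {
  t0 <- seq(theta0_range[1], theta0_range[2], length.out = n_grid)
  t1 <- seq(theta1_range[1], theta1_range[2], length.out = n_grid)
  J  <- outer(t0, t1, Vectorize(function(a, b) costFun(data, c(a, b))))
  contour(t0, t1, J, xlab = "theta0", ylab = "theta1", main = "cost surface")
}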


The optimal \(\theta_0\)-value for the original cost function is \(2\). That gives the following linear fit:

plotLinearFit(data,c(2,0))
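
plotLinearFit is not listed either; a minimal sketch consistent with its use here (it plots the data, the fitted line and, as noted further below, the original cost in the subtitle) could be:

# Sketch of plotLinearFit (assumed): data points plus the line
# y = theta0 + theta1 * x, with the original cost shown in the subtitle.
plotLinearFit <- function(data, theta) {
  plot(data, col = "red",
       sub = paste("cost:", round(computeCost(data, theta), digits = 3)))
  abline(a = theta[1], b = theta[2], col = "blue")
}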


1st alternative cost function

\[\begin{eqnarray} J(\Theta) &=& \frac{1}{m}\sum_{i=1}^{m} |h_{\Theta}(x^{(i)})-y^{(i)}| \\ &=& \frac{1}{m}\sum_{i=1}^{m} |\theta_0 + \theta_1 x^{(i)}-y^{(i)}| \end{eqnarray}\]
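
A matching sketch of computeCost1 (again an assumption, analogous to the computeCost sketch above), implementing this mean-absolute-error cost:

# Sketch of computeCost1 (assumed): mean absolute error instead of mean
# squared error.
computeCost1 <- function(data, theta) {
  m <- nrow(data)
  X <- cbind(1, data[, 1])
  y <- data[, 2]
  sum(abs(X %*% theta - y)) / m
}
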
### cost function 1
data <- data1
# vary only theta0
theta0_vals <- seq(from=0,to=4,length.out=100)
cost_vals <- rep(1,times=100)
for (i in 1:100){
  cost_vals[i] <- computeCost1(data,c(theta0_vals[i],0))
  }
plot(theta0_vals,cost_vals,type="l",main="1st alternative cost function with fixed theta1=0")
plotCostSurface_pure(data,computeCost1,c(-3,3),c(-2,2))

cost1<- c()
for (i in val_vec){
  cost1<- c(cost1,round(computeCost1(data,c(i,0)),digits=3))
}
val_df$cost1 <- cost1


The optimal \(\theta_0\)-value for the 1st alternative cost function is \(1\). That gives the following linear fit:

plotLinearFit(data,c(1,0))


Careful: the cost shown in the subtitle of the plot is the cost computed with the original cost function. The cost under the 1st alternative cost function for the parameters \(\theta_0=1\) and \(\theta_1=0\) is \(1\) \(\left(=\frac{1}{m}\left(0\cdot\frac{2m}{3}+3\cdot\frac{m}{3}\right)\right)\), where \(m\) here denotes the number of samples.
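
This value can be checked directly on data1 (using the column name set by the cbind above):

# 1st alternative cost at theta0 = 1, theta1 = 0: 2/3 of the absolute errors
# are 0 and 1/3 are 3, so the mean absolute error is 1
mean(abs(1 - data1[, "y"]))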

2nd alternative cost function

\[\begin{eqnarray} J(\Theta) &=& \left|\sum_{i=1}^{m} (h_{\Theta}(x^{(i)})-y^{(i)})\right|\\ &=& \left|\sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)}-y^{(i)})\right| \end{eqnarray}\]
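
The corresponding sketch of computeCost2 (assumed, same conventions as above): the absolute value of the summed residuals.

# Sketch of computeCost2 (assumed): absolute value of the sum of residuals.
computeCost2 <- function(data, theta) {
  X <- cbind(1, data[, 1])
  y <- data[, 2]
  abs(sum(X %*% theta - y))
}
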
## experiment with cost functions
### cost function 2
data <- data1
# vary only theta0
theta0_vals <- seq(from=0,to=4,length.out=100)
cost_vals <- rep(1,times=100)
for (i in 1:100){
  cost_vals[i] <- computeCost2(data,c(theta0_vals[i],0))
  }
plot(theta0_vals,cost_vals,type="l",main="2nd alternative cost function with fixed theta1=0")
plotCostSurface_pure(data,computeCost2,c(-3,3),c(-2,2))

cost2<-c()
for (i in val_vec){
  cost2<- c(cost2,round(computeCost2(data,c(i,0)),digits=3))
}
val_df$cost2 <- cost2
print("cost values for selected t0 values")
## [1] "cost values for selected t0 values"
print(val_df,row.names=FALSE)
##   t0 t1  cost cost1 cost2
##  0.0  0 3.000 2.000   240
##  1.0  0 1.500 1.000   120
##  1.5  0 1.125 1.167    60
##  2.0  0 1.000 1.333     0
##  2.5  0 1.125 1.500    60
##  3.0  0 1.500 1.667   120
##  4.0  0 3.000 2.000   240


The optimal \(\theta_0\)-value for the 2nd alternative cost function is \(2\). That gives the following linear fit:

plotLinearFit(data,c(2,0))


Careful: the cost shown in the subtitle of the plot is again the cost computed with the original cost function. The cost under the 2nd alternative cost function for the parameters \(\theta_0=2\) and \(\theta_1=0\) is \(0\) \(\left(=\left|1\cdot\frac{2m}{3}+(-2)\cdot\frac{m}{3}\right|\right)\).
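
Again, this can be checked directly on data1:

# 2nd alternative cost at theta0 = 2, theta1 = 0: the positive and negative
# residuals cancel out, 80 * (+1) + 40 * (-2) = 0
abs(sum(2 - data1[, "y"]))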

Gradient descent: initial parameters

Example 1: gradient descent converges

#######################################################
## Experiments on setting the initial parameters
# data preparation
X <- seq(from=0,to=m,length.out=n)
y <- 2 + 4*X 
res_vec <- rnorm(n, mean = 0, sd = 8) 
y <- y + res_vec
data4 <- cbind(X,y)

data <- data4 # select data you are interested in 
X <- cbind(rep(1,times=dim(data)[1]),data[,1])
y <- data[,2]

# plot data
plot(data,col="red")
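
The gradientDescent helper used below is not listed in this post either. The fields accessed later (grad_desc$theta, grad_desc$cost_vec, grad_desc$theta_vec) suggest a structure roughly like the following sketch for the original squared-error cost; treat it as an assumption, not the original implementation:

# Sketch of gradientDescent (assumed): batch gradient descent on the original
# squared-error cost, recording the cost and parameters at every iteration.
gradientDescent <- function(data, theta_init, alpha, iterations) {
  m <- nrow(data)
  X <- cbind(1, data[, 1])                        # design matrix with intercept
  y <- data[, 2]
  theta     <- theta_init
  cost_vec  <- numeric(iterations)
  theta_vec <- matrix(NA_real_, nrow = iterations, ncol = 2)
  for (k in 1:iterations) {
    error <- X %*% theta - y                      # residuals h(x) - y
    theta <- theta - alpha / m * t(X) %*% error   # simultaneous update
    cost_vec[k]    <- sum((X %*% theta - y)^2) / (2 * m)
    theta_vec[k, ] <- theta
  }
  list(theta = as.vector(theta), cost_vec = cost_vec, theta_vec = theta_vec)
}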


# gradient descent
theta_init = c(-20,-10) 
iterations = 1000
alpha = 0.058   # try different values
grad_desc <- gradientDescent(data, theta_init, alpha, iterations)
plotCostSurface(data,grad_desc)


  • data: data4
  • \(\alpha= 0.058\)
  • iterations: 1000
  • theta_init=-20, -10

Example 2: gradient descent does not converge

# gradient descent
iterations = 100
alpha = 0.06   # try different values
grad_desc <- gradientDescent(data, theta_init, alpha, iterations)
plotCostSurface(data,grad_desc)


\(\alpha=0.06\)

Is there a way to see, before we draw the contour plot of the cost function, that something is going wrong?

Yes: plot the cost as a function of the iteration step:

# plot cost development
plotCostDev(grad_desc)
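
plotCostDev is another helper that is not shown; a minimal version might simply plot the recorded cost at each iteration step (again an assumed sketch):

# Possible minimal plotCostDev (assumed): cost value per iteration step
plotCostDev <- function(grad_desc) {
  plot(grad_desc$cost_vec, type = "l",
       xlab = "iteration", ylab = "cost",
       main = "Cost development over iterations")
}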


How can we solve the problem?

  1. adapt iterations?
  2. adapt theta_init?
  3. adapt alpha?

adapt the iterations value:

Not a good idea: \(\alpha\) is still too large, so the cost still diverges; we just stop earlier.

iterations = 10
grad_desc <- gradientDescent(data, theta_init, alpha, iterations)
plotCostDev(grad_desc)


adapt the theta_init value:

Could work, but only if one already knows the optimal fit.

iterations = 1000
alpha = 0.02
grad_desc <- gradientDescent(data, theta_init,0.02, iterations)
theta_opt <- grad_desc$theta
alpha = 0.058
grad_desc <- gradientDescent(data, theta_opt, alpha, iterations)
plotCostDev(grad_desc)
plotCostSurface(data,grad_desc)


theta_init = 3.4571955, 3.7885747
iterations = 1000

adapt the alpha value:

Yes, but don't make it too small. Here is what happens with \(\alpha = 0.00002\):

theta_init = c(-20,-10) 
iterations = 1000
alpha = 0.00002   # try different values
grad_desc <- gradientDescent(data, theta_init, alpha, iterations)
plotCostDev(grad_desc)
plotLinearFit(data,grad_desc$theta)
plotCostSurface(data,grad_desc)


Note that the cost per iteration step is still decreasing, i.e. gradient descent has not converged yet.

Tips

  • always check the cost development plot!
  • run enough iteration steps
  • set theta_init to some reasonable value
theta_init = c(0,0) 
iterations = 1000
alpha = 0.02   # try different values
grad_desc <- gradientDescent(data, theta_init, alpha, iterations)
plotCostDev(grad_desc)
plotLinearFit(data,grad_desc$theta)
plotCostSurface(data,grad_desc)


Also check the last cost values you obtained and how much the theta values are still changing:

# kable() comes from the knitr package; last() is assumed to come from dplyr
# (or data.table)
values <- cbind(last(grad_desc$cost_vec,10),
                last(grad_desc$theta_vec[,1],10),
                last(grad_desc$theta_vec[,2],10))
colnames(values) <- c("last 10 cost values","last 10 theta0 values","last 10 theta1 values")
kable(values,caption="important control values",row.names=FALSE)
important control values
last 10 cost values last 10 theta0 values last 10 theta1 values
36.90818 3.582604 3.769703
36.90818 3.582712 3.769686
36.90818 3.582819 3.769670
36.90817 3.582925 3.769654
36.90817 3.583031 3.769638
36.90817 3.583136 3.769622
36.90817 3.583241 3.769607
36.90817 3.583345 3.769591
36.90817 3.583449 3.769575
36.90817 3.583552 3.769560

This looks good: there are nearly no changes anymore, so gradient descent has converged.
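
As a simple numerical check (a sketch using base R's tail, not part of the original helpers), the relative spread of the last cost values should be close to zero once gradient descent has settled:

# relative spread of the last 10 cost values; a value near zero means the
# cost has effectively stopped changing
cost_tail <- tail(grad_desc$cost_vec, 10)
(max(cost_tail) - min(cost_tail)) / max(cost_tail)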