Machine Learning: Cost Function

Aparna Joshi
4 min read · Aug 30, 2020

Machine Learning — The Supervised kind

Machine learning is the ability of computer algorithms to improve continuously through experience. One of the most common machine learning techniques is supervised learning. In supervised learning, we work with two sets of values: the Input Features and the Output Variables. Most supervised learning algorithms are classified into two types of problems:

  1. Regression: In regression problems, we have a set of continuous input features mapped against the output variables. The problem is to predict a real-valued output for an unseen input, as close to the actual value as possible.
  2. Classification: In classification problems, we have a set of inputs, each belonging to a given category. The problem is to map unseen input values into discrete categories.

In this article, we shall discuss regression problems in machine learning, the definition of the cost function, and the need to minimize it.

What is Regression? What scenarios require the regression method of problem-solving in machine learning?

Regression is a mathematical problem-solving method in which we try to formulate a function that predicts an unknown variable whose value depends upon the values of known variables.

Assume that we have the problem of predicting the monetary value of a house based on three factors: Dimensions, Number of bedrooms, and Age of the house. One can say that the value of the house increases if the Dimensions and Number of bedrooms increase. On the other hand, the value of the house decreases if the Age increases.

In this problem, the regression function takes the form:

f(x) = θ1x1 + θ2x2 + θ3x3

Where x1, x2, and x3 represent Dimensions, No. of bedrooms, and Age respectively, and y is the correct value of the house. The coefficients θ1, θ2, and θ3 of the independent variables x1, x2, and x3 will change according to the training samples given while formulating the regression function f(x).
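As a quick illustration (the coefficient values below are made up for this example, not learned from any data), such a regression function could be evaluated like this:

```python
# Assumed coefficients for Dimensions, No. of bedrooms, and Age (illustrative only)
theta = [150.0, 10000.0, -500.0]

def predict_price(x):
    """Hypothesis f(x) = θ1*x1 + θ2*x2 + θ3*x3 for one house."""
    return sum(t_j * x_j for t_j, x_j in zip(theta, x))

# A house with 1200 sq. ft., 3 bedrooms, 15 years of age
print(predict_price([1200.0, 3.0, 15.0]))  # -> 202500.0
```

Note how the negative coefficient on Age captures the intuition that older houses are worth less.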

If we have only one independent variable (x), we call this linear regression. For the sake of simplicity, I will be using linear regression in the rest of the article.

Cost of the hypothesis function

Consider the linear regression problem containing only one independent variable. Let’s define the linear regression function by:

f(x) = θ0 + θ1x

This is a hypothesis function: we say that, for certain values of θ0 and θ1, given the value of x, we get predictions f(x) very close to the actual value.
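Here is a minimal sketch of this hypothesis (the values of θ0 and θ1 are placeholders chosen only for illustration, not learned parameters):

```python
theta0, theta1 = 2.0, 0.5  # assumed intercept and slope

def f(x):
    """Hypothesis f(x) = θ0 + θ1*x."""
    return theta0 + theta1 * x

print(f(10.0))  # prediction for x = 10 -> 7.0
```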

Let us consider two graphs. Graph 1 contains the plot of the output y versus the input values x as dots on a graph. Graph 2 contains the hypothesis function f(x) plotted as a straight line. If we merge these two graphs and represent them in a single graph, the distance between the predicted value (a point on the hypothesis line) and the actual value (the training sample value) represents the cost of an individual prediction.

In this combined graph:

  1. The individual points on the graph represent the training samples (y vs. x).
  2. The line on the graph represents the hypothesis function f(x).
  3. The vertical distance between a point and the line represents the cost for that individual training sample.

Cost representation: We have seen cost represented graphically. Let us try to derive a mathematical equation out of this graphical representation. We define the following notation used in the cost function.

  • J(θ) => the overall cost function.
  • f(xi) => value of the hypothesis function for the i-th training example.
  • yi => actual output value of the i-th training example.

The individual cost of a training example can be defined as the difference between the value of the hypothesis function and the actual value: f(x1) − y1 for the first example, f(x2) − y2 for the second, and so on.

The total cost over all the values present in the training set can be represented as: Total cost = Σi (f(xi) − yi)
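To make this concrete, here is a small sketch (the training set and parameter values are made up for illustration) that computes the individual costs and their plain sum. Notice how positive and negative differences partly cancel, which motivates the squaring introduced in the next section:

```python
# Tiny made-up training set: inputs x and actual outputs y
xs = [1.0, 2.0, 3.0]
ys = [3.0, 2.5, 4.0]

# Placeholder hypothesis f(x) = θ0 + θ1*x with assumed parameters
theta0, theta1 = 2.0, 0.5

def f(x):
    """Placeholder hypothesis f(x) = θ0 + θ1*x."""
    return theta0 + theta1 * x

# Individual cost for each training example: f(xi) - yi
individual_costs = [f(x) - y for x, y in zip(xs, ys)]
print(individual_costs)       # [-0.5, 0.5, -0.5]
print(sum(individual_costs))  # -0.5: errors partly cancel out
```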

Minimizing the cost function: We don't want to spend too much now, do we?

The goal of linear regression is to find, using the training samples, a hypothesis function whose total cost is minimal. Let us derive the equation that represents the cost function to be minimized.

  1. We know that the total cost of the hypothesis function, given a training set, can be defined as: Total cost = Σi (f(xi) − yi)
  2. We want this cost to be minimal, in other words, the difference between f(xi) and yi should be as small as possible. Note that the individual differences can be positive or negative, so they can cancel each other out when summed (as shown in the sketch above). Squaring each difference makes every term non-negative and penalizes larger errors more strongly: Total cost = J(θ) = Σi (f(xi) − yi)²
  3. Let us consider that the training set has m values in it. The average cost can then be represented as: (Σi (f(xi) − yi)²)/m

Hence the average cost function to be minimized can be represented as: J(θ) = (Σi (f(xi) − yi)²)/m

If we substitute the hypothesis function f(x) = θ0 + θ1x into this expression, we get the cost function as:

J(θ) = (Σi ((θ0 + θ1xi) − yi)²)/m
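A minimal sketch of this average cost function in code (again with made-up data and placeholder parameters):

```python
def cost(theta0, theta1, xs, ys):
    """Average squared cost J(θ) = (Σ ((θ0 + θ1*xi) - yi)²) / m."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / m

xs = [1.0, 2.0, 3.0]
ys = [3.0, 2.5, 4.0]
print(cost(2.0, 0.5, xs, ys))  # 0.25 for these placeholder parameters
```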

There are many algorithms that can be implemented to minimize this cost function. Gradient descent is one commonly used algorithm, but note that there is more than one way to minimize the cost function. I hope this article gave you an insight into how the average cost function is derived from the hypothesis function in linear regression.
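As a rough sketch of how gradient descent could drive this cost down (the data, learning rate, and iteration count below are arbitrary choices for illustration, not a tuned implementation):

```python
# Made-up training data that follows y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate (assumed)
m = len(xs)

for _ in range(5000):
    # Partial derivatives of J(θ) = (Σ ((θ0 + θ1*xi) - yi)²) / m
    errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = 2 * sum(errors) / m
    grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / m
    # Step in the direction that lowers the cost
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # approaches θ0 ≈ 1.0, θ1 ≈ 2.0
```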

Originally published at https://aparnajoshi.netlify.app.
