May 9, 2021

In a previous post on this blog I discussed the task of **classification**, which involves determining automatically which class a given object belongs to. A classification system assigns to each object a class label, where the number of possible labels is defined in advance.

In **regression**, on the other hand, a machine learning system assigns to each object a certain numerical value, out of an infinite range of possibilities.

Let us consider the example of a set of TripAdvisor reviews for a particular restaurant. This is a selected review from the set:

For a dataset like this, we might define at least a few different classification tasks – for example:

- *assign the review to one of three classes based on the author’s overall view: positive, negative or neutral;*
- *determine the location of the restaurant being described, choosing from among the 16 Polish provinces;*
- *define the type of food the restaurant serves (Polish, European, Asian, etc.).*

By contrast, an example regression task for the same dataset might be the following:

*Rate the quality of the service being described in the review by means of a real number in the interval from 0 to 5.*

The value computed by a regression system based on a review might be, for instance, 3.45. We expect the value returned by the system to correspond adequately to the opinion expressed in the review: if the opinion is very positive, the value should be close to 5, but if it is strongly negative, the value should be close to zero.

Regression provides more precise information than classification. We can use it, for example, to track the progress a restaurant is making in improving its quality of service, by comparing the values derived from its reviews at different times. The same kind of analysis can also be used to predict future values of stock market indices, and in this case accuracy, even to many decimal places, can be crucial.

The regression method can be used to compute a predicted value based on an object’s **features**. In machine learning, a feature is a measurable property of an object, such as the number of positive words in a review.

In the method of linear regression, it is assumed that the value to be predicted can be computed using a **linear function** of feature values. We may assume, for example, that the value of an opinion is directly proportional to the number of words with positive meaning contained in it. This is expressed by the formula:

**VALUE = α x NUMBER OF POSITIVE WORDS**

where α is some positive number.

We might additionally assume that the predicted value is reduced by an amount proportional to the number of words with negative meaning:

**VALUE = α x NUMBER OF POSITIVE WORDS + 𝛽 x NUMBER OF NEGATIVE WORDS**

where 𝛽 is some negative number.

If the function defined in this way returns, for example, values ranging from –1 to 4, and we want the results to lie in the range 0 to 5, we can shift the range of returned values by adding a “displacement parameter”:

**VALUE = α x NUMBER OF POSITIVE WORDS + 𝛽 x NUMBER OF NEGATIVE WORDS + 𝛾**

In this case, the linear regression task is to find values of the coefficients α, 𝛽 and 𝛾 such that the values calculated using the above formula **conform to expectations**.
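The formula translates directly into code. The sketch below is only an illustration: the function name `predict_value` and the coefficient values in the example call are assumptions, not values taken from a trained system.

```python
def predict_value(num_positive, num_negative, alpha, beta, gamma):
    """Linear formula from the text: alpha * positives + beta * negatives + gamma."""
    return alpha * num_positive + beta * num_negative + gamma


# Illustrative (assumed) coefficients: each positive word adds half a point,
# each negative word removes half a point, and the baseline is 2.5.
print(predict_value(3, 1, alpha=0.5, beta=-0.5, gamma=2.5))  # 3.5
```

Finding good values for `alpha`, `beta` and `gamma` is the job of the regression method itself.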

If we use regression to evaluate objects (e.g.: What is the reviewer’s attitude to the reviewed service, on a scale of 0 to 5?), the method is said to *conform to expectations* if the values that it returns are largely in agreement with human intuition.

If we use a regression method for forecasting (e.g.: What will be the value of the stock market index next month? What will be the value of textbook sales in September?), the results *conform to expectations* if there is only a small difference between the predictions and the actual values (which can be checked after the fact).

Regression coefficients are computed from **training data**: a set of objects for which we know both the values of their features and the values being predicted. For example, a training data object might be the restaurant review shown earlier in this post. This contains five positive words (*mega*, *nice*, *best*, *thank*, *pleasant*) and one negative word (*not*), and the point score awarded by the reviewer (which in this case is the value we will be trying to predict) is 5, as shown by the five green dots at the top of the review. If we had to determine the regression coefficients on the basis of just this one example, they might take the following values:

α = 1

𝛽 = –1

𝛾 = 1

In this case the predicted value of the review, computed from the formula given above, would be:

VALUE = 1 x 5 + (–1) x 1 + 1 = 5

The value computed from the linear regression formula would thus be identical to the point score of the review from the training set. Bravo!

Let us assume, however, that another review from the training set – again with five points awarded – contains the following opinion:

*Superb restaurant! I can’t wait to go there again!*

Here the numbers of positive and negative words are 1 each (*Superb* and *can’t*). If we apply the same values of α, 𝛽 and 𝛾, the above formula will give a predicted value of:

VALUE = 1 x 1 + (–1) x 1 + 1 = 1
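As an aside, word counts like these could be produced by a simple lexicon lookup. The sketch below is a minimal illustration under stated assumptions: the two word lists are hand-made from the words mentioned in this post, and the function name `count_sentiment_words` is hypothetical. A real system would need far larger lexicons and better handling of negation.

```python
import re

# Tiny hand-made lexicons (assumed for illustration only).
POSITIVE_WORDS = {"mega", "nice", "best", "thank", "pleasant", "superb"}
NEGATIVE_WORDS = {"not", "can't"}


def count_sentiment_words(review):
    """Return (number of positive words, number of negative words)."""
    # Lowercase and normalise curly apostrophes before splitting into tokens.
    text = review.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+", text)
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives, negatives


print(count_sentiment_words("Superb restaurant! I can’t wait to go there again!"))  # (1, 1)
```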

Here the value computed from the linear regression formula differs from the user’s actual score by as many as 4 points! The inclusion of this second data object in the training set thus makes it necessary to change the coefficients – to the following values, for example:

α = 0.5

𝛽 = –0.5

𝛾 = 4

After this modification, the predicted score for the first object will be 6 (one point too high), while for the second object it will be 4 (one point too low). For both objects jointly, therefore, the difference between the predicted and real values is just 2, whereas with the previous choice of values of coefficients it was 4.
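This comparison can be checked with a few lines of Python. The sketch assumes that the "difference" is measured as an absolute value; the helper names `predict_value` and `total_absolute_error` are hypothetical.

```python
def predict_value(num_positive, num_negative, alpha, beta, gamma):
    return alpha * num_positive + beta * num_negative + gamma


# (positive words, negative words, actual score) for the two reviews in the text
training_data = [(5, 1, 5), (1, 1, 5)]


def total_absolute_error(alpha, beta, gamma):
    """Sum of |predicted - actual| over the training data."""
    return sum(
        abs(predict_value(p, n, alpha, beta, gamma) - score)
        for p, n, score in training_data
    )


print(total_absolute_error(1, -1, 1))      # 4   (first choice of coefficients)
print(total_absolute_error(0.5, -0.5, 4))  # 2.0 (second choice of coefficients)
```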

Regression coefficients are determined in such a way that the sum of the differences between the predicted and actual values, over the entire set of training data, is **as small as possible**. The differences are counted without regard to sign (in practice they are usually squared), so that errors in opposite directions do not cancel each other out. This sum of differences between the results returned by the linear function and the actual values is called a **cost function**.

In mathematical language: the goal of the regression method is to find values of the regression coefficients such that the cost function attains its **minimum**.

The method of gradient descent is used to find the minimum point of a function – that is, the value(s) of its argument(s) for which the function takes its lowest value. We can illustrate this using an example in which the function has only one argument.

As we can see from the above graph, the analysed function takes its lowest value (–6) for the argument value x = 2. The value of 2 is thus the minimum point of this function.

To see how gradient descent works, consider the *Tale of the Empty Tank*.

Imagine you have gone for a drive in the mountains. Halfway up a hill the car suddenly comes to a stop – and for the most prosaic reason you could think of: its petrol tank is empty. You will now have to go on foot to the nearest petrol station.

You aren’t sure where the station is, although you know for certain that it is at the lowest point along the road. So which way should you go? You look around: the road leads upwards in one direction, and downwards in the other. So you take a step in the downwards direction, and then repeat the procedure.

Using this procedure, can you be sure of reaching the petrol station? If the cross-section of the road looks like the left-hand diagram, you will certainly reach your goal. However, if it looks something like the right-hand diagram, success is no longer guaranteed.

Gradient descent is analogous to the method used in the *Tale of the Empty Tank*:

1. Start at an arbitrary point (e.g. with the argument equal to 0), and then repeat steps 2 and 3.
2. If the function is decreasing at the current point (i.e. its value gets smaller as the value of the argument increases), move along the graph of the function to the right.
3. In the opposite case (i.e. the value of the function gets smaller as the value of the argument decreases), move along the graph to the left.

Keep moving in this way until you reach a point where the function does not decrease on either side. You have now reached your minimum!
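The procedure can be sketched in Python as follows. This is a simplified illustration, not a production algorithm: the function `f` is a hypothetical example chosen so that its lowest value is –6 at x = 2, and the fixed step size of 0.01 is an assumption.

```python
def f(x):
    # Hypothetical example function: lowest value -6 at x = 2.
    return (x - 2) ** 2 - 6


def descend(func, start=0.0, step=0.01, max_steps=10_000):
    """Walk downhill in fixed-size steps until neither neighbour is lower."""
    x = start
    for _ in range(max_steps):
        if func(x + step) < func(x):    # function decreases to the right
            x += step
        elif func(x - step) < func(x):  # function decreases to the left
            x -= step
        else:                           # no lower neighbour: stop here
            break
    return x


x_min = descend(f)
print(round(x_min, 2), round(f(x_min), 2))  # 2.0 -6.0
```

In practice, gradient descent usually takes steps proportional to the slope of the function (its gradient) rather than steps of a fixed size, so that the steps become smaller as the minimum gets closer.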

Such an algorithm will find a minimum effectively for all functions with a graph like the one in the left-hand diagram – those that have exactly one “valley”. These are called **convex functions**. However, the algorithm may not work correctly in the case of functions like the one in the right-hand diagram.

Luckily, it turns out that the function we are interested in – the cost function – is convex. Therefore, the method of gradient descent works perfectly in this case.
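To make this concrete, here is a minimal sketch of gradient descent applied to the two example reviews from this post. Several things are assumptions made for illustration: the cost is taken to be the mean squared error, the learning rate and the number of steps are picked by hand, and the names `cost` and `fit` are hypothetical.

```python
# (positive words, negative words, actual score) for the two reviews in the text
training_data = [(5, 1, 5), (1, 1, 5)]


def cost(alpha, beta, gamma):
    """Mean squared error between predicted and actual scores (assumed cost)."""
    return sum(
        (alpha * p + beta * n + gamma - y) ** 2 for p, n, y in training_data
    ) / len(training_data)


def fit(learning_rate=0.02, steps=2000):
    """Plain gradient descent on the cost, starting from all-zero coefficients."""
    alpha = beta = gamma = 0.0
    m = len(training_data)
    for _ in range(steps):
        grad_a = grad_b = grad_g = 0.0
        for p, n, y in training_data:
            error = alpha * p + beta * n + gamma - y
            grad_a += 2 * error * p / m  # partial derivative w.r.t. alpha
            grad_b += 2 * error * n / m  # partial derivative w.r.t. beta
            grad_g += 2 * error / m      # partial derivative w.r.t. gamma
        # Step against the gradient, i.e. downhill on the cost function.
        alpha -= learning_rate * grad_a
        beta -= learning_rate * grad_b
        gamma -= learning_rate * grad_g
    return alpha, beta, gamma


alpha, beta, gamma = fit()
print(round(cost(alpha, beta, gamma), 6))  # 0.0
```

With only two training reviews, many different choices of coefficients give a cost of zero, so the values found depend on the starting point. A realistic training set contains far more examples.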

Unfortunately, the results obtained by linear regression do not always leave the user fully satisfied. Of course, when we expect the system to provide a predicted temperature, we are pleased to get a result expressed in degrees Celsius, and when we need a stock price forecast, we are happy for the system to give us a value in dollars. However, a review score expressed as a number like 3.45 remains unclear if we do not know what scale is being used. Perhaps in that case it would be better to return a result expressed as a percentage (e.g. 69%)? And if we want to predict Nancy Pelosi’s chances of being elected US president in 2024, we would certainly expect a result expressed in percentage terms, not a number from some unknown range.

Logistic regression makes it possible to recalculate the results obtained by linear regression so that they lie in the interval from 0 to 1. To do this, we take a result given by linear regression and apply a **logistic function**, whose values always lie within that interval.

An example of a logistic function is the sigmoid function:

As the graph shows, if the value returned by the linear function is 0, the sigmoid function will convert it to 0.5. The sigmoid function maps negative quantities to values in the range from 0 to 0.5, and positive quantities to values from 0.5 to 1. Therefore, all values of the sigmoid function are contained in the interval from 0 to 1. Which is just what we want!
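The sigmoid function is commonly defined as 1 / (1 + e^(–z)), and that is the definition assumed in the short sketch below. The function name `sigmoid` and the sample inputs are chosen for illustration only.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


print(sigmoid(0))                                   # 0.5
print(round(sigmoid(-2), 3), round(sigmoid(2), 3))  # 0.119 0.881
```

Applying `sigmoid` to the value returned by the linear formula gives a number that can be read as a percentage.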

Logistic regression is one of the fundamental methods of machine learning. It is used, for example, in neural networks, which thanks to new technological developments (specialised graphics cards, high-performance tensor processing units, etc.) are taking over more and more bastions of human intelligence.

But neural networks will be the topic of another post on our blog!