Machine Learning Tips

Part 1

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”

Machine learning is all about drawing lines through data. We decide what purpose the line serves, such as a decision boundary in a classification algorithm, or a predictor that models real-world behavior. And these lines in turn just come from finding the minimum of a cost function using gradient descent.

The key to choosing the parameters that best approximate the data is to find a way to characterize how “wrong” our boundary or predictor is. We do this by using a cost function (also called a loss function).
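
To make this concrete, here is a minimal sketch (my own illustration, with a made-up toy dataset) of a mean-squared-error cost function and a gradient-descent step for fitting a line y = w*x + b:

```python
# Minimal sketch: fit y = w*x + b by minimizing a mean-squared-error cost
# with gradient descent. The data and learning rate are arbitrary choices.
import numpy as np

def cost(w, b, x, y):
    # Mean squared error: how "wrong" the predictor is on average.
    return np.mean((w * x + b - y) ** 2)

def gradient_descent_step(w, b, x, y, lr=0.01):
    # Slopes (partial derivatives) of the cost with respect to w and b,
    # then one small step downhill.
    error = w * x + b - y
    dw = np.mean(2 * error * x)
    db = np.mean(2 * error)
    return w - lr * dw, b - lr * db

# Toy data that roughly follows y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_descent_step(w, b, x, y)

print(w, b, cost(w, b, x, y))  # w ends up near 2, b near 1
```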

The perceptron algorithm forms the basis for many modern-day ML algorithms, most notably neural networks.

Like neurons, perceptrons take in several inputs and spit out an output.

We use a different weight for each input.

We use the bias as a measure of how difficult it is for the perceptron to say ‘yes’ or ‘no’.

A logistic model can predict probabilities while a perceptron can only predict yes or no.
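
A quick sketch (my own illustration) of the contrast: both models compute the same weighted sum plus bias, but the perceptron thresholds it into a hard yes/no while the logistic model squashes it into a probability:

```python
# Minimal sketch: perceptron vs. logistic neuron on the same inputs.
# The example weights, bias, and inputs are arbitrary.
import numpy as np

def perceptron(x, w, b):
    # Hard yes/no: fire only if the weighted sum plus bias is positive.
    return 1 if np.dot(w, x) + b > 0 else 0

def logistic(x, w, b):
    # Smooth output between 0 and 1, interpretable as a probability.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([0.5, -1.0])   # two inputs
w = np.array([2.0, 1.0])    # a different weight for each input
b = -0.5                    # the bias: how hard it is to say "yes"

print(perceptron(x, w, b))  # 0  (a hard "no")
print(logistic(x, w, b))    # ~0.38  (a probability)
```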

Part 2

A support vector is a vector from the data point with the smallest margin to the decision boundary.

The idea behind a support vector machine is to classify data by drawing a decision boundary that maximizes the length of the support vectors. By maximizing the support vectors, we’re also maximizing the margin in the data set, and thus the decision boundary ends up as far away as possible from the data points.

The SVM algorithm fails when you have a linearly inseparable dataset.

Before, we wanted every single data point to be as far away (to the correct side) from the decision boundary as possible. Now, we’ll allow a data point to stray toward the wrong side of the boundary, but we’ll add a “cost” to that happening (remember cost functions from the last post?). This is something that happens very often in machine learning and is called regularization.
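
Here is a minimal sketch (my own illustration; the weight vector, data, and regularization strength are arbitrary) of that idea written as a soft-margin objective: a hinge loss that charges a cost for points that stray toward or past the boundary, plus a regularization term that keeps the margin wide:

```python
# Minimal sketch of the soft-margin SVM objective:
# average hinge loss (the "cost" for straying) plus a regularization term.
import numpy as np

def soft_margin_objective(w, b, X, y, lam=0.1):
    # y contains +1/-1 labels. Points with margin >= 1 cost nothing;
    # points inside the margin or on the wrong side are penalized linearly.
    margins = y * (X @ w + b)
    hinge_loss = np.mean(np.maximum(0.0, 1.0 - margins))
    return hinge_loss + lam * np.dot(w, w)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0], [0.2, -0.1]])
y = np.array([1, 1, -1, -1, 1])  # the last point strays close to the boundary

print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y))
# Only the last point is charged a hinge cost; regularization adds the rest.
```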

You can think of a kernel as mapping a set of data from one coordinate system to another. In the original coordinate system the data may not be linearly separable at all, whereas in the new coordinate system, if we choose the correct kernel, the data set becomes easily linearly separable.
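
A tiny sketch (my own illustration) of that idea using an explicit feature map rather than a full kernelized SVM: XOR-style data that no straight line can separate in the original coordinates becomes trivially separable once we add the product x1*x2 as a new coordinate:

```python
# Minimal sketch: map XOR-style data into a new coordinate system
# (x1, x2) -> (x1, x2, x1*x2), where it becomes linearly separable.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])  # positive when signs agree: no line separates this

def feature_map(X):
    # Add the product of the two coordinates as a third coordinate.
    return np.column_stack([X, X[:, 0] * X[:, 1]])

Z = feature_map(X)
# In the new coordinates, the plane x1*x2 = 0 separates the classes perfectly.
print(np.sign(Z[:, 2]) == y)  # [ True  True  True  True ]
```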

Part 3

A neuron takes in any number of numerical inputs and spits out just one output. To get to this output, the neuron calculates an intermediate value called s by multiplying each input by a different weight, adding them all together, and adding an additional number called the bias.

An activation function is any function that takes in our s and gives the output of our neuron, called the activation.

Using a step function makes training very difficult because there’s no way to tell whether the neural network is getting closer to or farther from the correct answer. The sigmoid function, by contrast, is a nice activation function because it is smooth everywhere: a small change in the weights produces a small change in the output, so it’s easier to figure out whether you’re moving in the right direction.
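
A small sketch (my own illustration, with arbitrary weights) of why that smoothness matters: nudge one weight slightly and the step output doesn’t move at all, while the sigmoid output moves by a small, measurable amount that tells you which direction to keep going:

```python
# Minimal sketch: nudging a weight gives no signal through a step activation,
# but a small, usable signal through a sigmoid activation.
import numpy as np

def neuron(x, w, b, activation):
    s = np.dot(w, x) + b  # weighted sum of inputs plus the bias
    return activation(s)

def step(s):
    return 1.0 if s > 0 else 0.0

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x = np.array([1.0, 2.0])
w = np.array([0.3, -0.2])
b = 0.2

for name, act in (("step", step), ("sigmoid", sigmoid)):
    before = neuron(x, w, b, act)
    after = neuron(x, w + np.array([0.01, 0.0]), b, act)  # tiny weight nudge
    print(name, after - before)
# step:    change is 0.0      -> no clue which way to adjust the weights
# sigmoid: change is ~0.0025  -> a usable slope for training
```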

Artificial neural networks are composed of layers of artificial neurons in a similar way. In general, there are three types of layers: an input layer, one or more hidden layers, and an output layer.
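
For example, here is a minimal sketch (my own illustration, with random weights) of one forward pass through the three kinds of layers:

```python
# Minimal sketch: forward pass through an input layer, one hidden layer,
# and an output layer, using sigmoid activations.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input (2) -> hidden (3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden (3) -> output (1)

x = np.array([0.5, -1.0])     # the input layer is just the raw inputs
h = sigmoid(W1 @ x + b1)      # hidden layer activations
out = sigmoid(W2 @ h + b2)    # output layer activation
print(out)
```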

At its very core, training a neural network just means adjusting the parameters.

When the neural network outputs the wrong answer, you find the slopes of the output layer (the domino closest to the marble) first because it was the direct cause of the incorrect answer. And since the output layer depends on the hidden layer, you’ll have to fix that too by finding the slopes and using gradient descent. Eventually you’ll work your way back to the hidden layer closest to the input layer.

Changing the parameters in one layer will affect the outputs of the next layer, which will affect the outputs of the layer after that, and so on until the cost function itself is affected. It turns out that once we calculate the slopes of a given layer, we can easily find the slopes of the previous layer.

We calculate slopes by starting from the back and propagating our algorithm backwards through the neural network until we have all the slopes we need for gradient descent. This, in a nutshell, is the back-propagation algorithm.
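
Putting those pieces together, here is a minimal sketch (my own illustration; the OR-style toy data, network size, and learning rate are arbitrary choices) of back-propagation on a tiny two-layer network, computing the output-layer slopes first and then propagating them back to the hidden layer:

```python
# Minimal sketch of back-propagation: slopes flow from the output layer
# back to the hidden layer, then every parameter takes a gradient-descent step.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # hidden -> output

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])              # a simple OR-style target
lr = 1.0

for _ in range(2000):
    for x, t in zip(X, y):
        # Forward pass.
        h = sigmoid(W1 @ x + b1)
        out = sigmoid(W2 @ h + b2)

        # Backward pass: output layer first (cost = 0.5 * (out - t)^2).
        d_out = (out - t) * out * (1 - out)
        dW2, db2 = np.outer(d_out, h), d_out

        # ...then propagate the slopes back to the hidden layer.
        d_h = (W2.T @ d_out) * h * (1 - h)
        dW1, db1 = np.outer(d_h, x), d_h

        # Gradient descent on every parameter.
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

# Predictions after training: approximately [0, 1, 1, 1].
print(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2))
```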

Part 4

Todo…