I would use probability theory to start with, and then pick whichever algorithm best calculates what probability theory tells you to do. So you have training data $T$, some new predictors $X$, an object to classify $Y$, and your prior information $I$.
So you want to know about $Y$. Then probability theory says: just calculate its probability, conditional on all the information you have available to you,
$$P(Y\mid T,X,I)$$
Now we can use any of the rules of probability theory to manipulate this into things that we do know how to calculate. Using Bayes' theorem, you get:
$$P(Y\mid T,X,I)=\frac{P(Y\mid T,I)\,P(X\mid Y,T,I)}{P(X\mid T,I)}$$
Now $P(Y\mid T,I)$ is usually easy: unless your prior information tells you something about $Y$ beyond the training data (e.g. correlations), it is given by the rule of succession - basically the observed fraction of times $Y$ was true in the training data set.
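For example, if the training set contains $n$ cases of which $k$ had $Y$ true, the rule of succession gives (assuming no prior structure beyond the counts):
$$P(Y\mid T,I)=\frac{k+1}{n+2}$$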
For the second term $P(X\mid Y,T,I)$ - this is your model, where most of your work will go, and where different algorithms will do different things. $P(X\mid T,I)$ is a bit of a vicious beast to calculate, so we use the following trick to avoid it: take the odds of $Y$ against $\bar{Y}$ (i.e. not $Y$). And we get:
$$O(Y\mid T,X,I)=\frac{P(Y\mid T,X,I)}{P(\bar{Y}\mid T,X,I)}=\frac{P(Y\mid T,I)}{P(\bar{Y}\mid T,I)}\,\frac{P(X\mid Y,T,I)}{P(X\mid\bar{Y},T,I)}$$
Now you basically need a decision rule: when the odds/probability is above a certain threshold, you classify $Y$ as "true", otherwise you classify it as "false". Nobody can really help you with this - it is a decision that depends on the consequences of making right and wrong calls, and only the specific context can supply those. Of course the "subjectivity" only matters if there is high uncertainty (i.e. you have a "crap" model/data which can't distinguish the two very well).
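As a sketch of how the consequences enter (the loss symbols $L_{FP}$ and $L_{FN}$ are mine, introduced only for illustration): if a false positive costs $L_{FP}$ and a false negative costs $L_{FN}$, then minimising expected loss gives the threshold
$$\text{classify } Y \text{ as true} \iff O(Y\mid T,X,I)>\frac{L_{FP}}{L_{FN}}$$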
The second quantity - the model $P(X\mid Y,T,I)$ - is a "predictive" model. Suppose the prior information indicates a single model which depends on parameter $\theta_Y$. Then the quantity is given by:
$$P(X\mid Y,T,I)=\int P(X,\theta_Y\mid Y,T,I)\,d\theta_Y=\int P(X\mid\theta_Y,Y,T,I)\,P(\theta_Y\mid Y,T,I)\,d\theta_Y$$
Now if your model is of the "iid" variety, then $P(X\mid\theta_Y,Y,T,I)=P(X\mid\theta_Y,Y,I)$. But if you have a dependent model, such as an autoregressive one, then $T$ may still matter. And $P(\theta_Y\mid Y,T,I)$ is the posterior distribution for the parameters in the model - this is the part that the training data determine, and it is probably where most of the work will go.
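As a minimal worked example (a single binary feature and a uniform prior, which are my illustrative assumptions, not part of the problem): if $X$ is Bernoulli with parameter $\theta_Y$, the training data contain $n$ cases with $Y$ true, of which $k$ had the feature present, and the prior on $\theta_Y$ is uniform, then the posterior is $\text{Beta}(k+1,\,n-k+1)$ and the integral above collapses to
$$P(X=1\mid Y,T,I)=\int_0^1\theta_Y\,P(\theta_Y\mid Y,T,I)\,d\theta_Y=\frac{k+1}{n+2},$$
which is the rule of succession again.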
But what if the model is not known with certainty? Well, it just becomes another nuisance parameter to integrate out, just as was done for $\theta_Y$. Call the $i$th model $M_i$ and its set of parameters $\theta^{(i)}_Y$, and the equation becomes:
$$P(X\mid Y,T,I)=\sum_i P(M_i\mid Y,T,I)\int P(X\mid\theta^{(i)}_Y,M_i,Y,T,I)\,P(\theta^{(i)}_Y\mid M_i,Y,T,I)\,d\theta^{(i)}_Y$$
Where
$$P(M_i\mid Y,T,I)\propto P(M_i\mid Y,I)\int P(\theta^{(i)}_Y\mid M_i,Y,I)\,P(T\mid\theta^{(i)}_Y,M_i,Y,I)\,d\theta^{(i)}_Y$$
(NOTE: $M_i$ is a proposition of the form "the $i$th model is the best in the set being considered". And no improper priors are allowed if you are integrating over models - the infinities do not cancel out in this case, and you will be left with nonsense.)
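To make the last two equations concrete, here is a minimal numerical sketch assuming a single Bernoulli feature and two candidate "models" that differ only in their (proper) Beta priors - the counts, priors and model names are made up purely for illustration:

```python
import numpy as np
from scipy.special import betaln

# Training data for the Y-true class: n cases, k of them with the feature present
# (made-up numbers, purely for illustration)
n, k = 20, 14

# Two candidate "models" M_i: a Bernoulli likelihood with different proper Beta priors
models = {"M1": (1.0, 1.0),   # Beta(1,1): uniform prior on theta
          "M2": (5.0, 5.0)}   # Beta(5,5): prior concentrated around theta = 0.5

# Marginal likelihood P(T | M_i, Y, I) = B(a+k, b+n-k) / B(a, b)
# (the theta integral done analytically for the conjugate Beta-Bernoulli pair;
#  the binomial coefficient is common to both models and cancels in the comparison)
log_marginal = {name: betaln(a + k, b + n - k) - betaln(a, b)
                for name, (a, b) in models.items()}

# Posterior model probabilities P(M_i | Y, T, I), taking equal priors P(M_i | Y, I)
logs = np.array(list(log_marginal.values()))
weights = np.exp(logs - logs.max())
weights /= weights.sum()

# Model-averaged predictive P(X=1 | Y, T, I): each model contributes its
# posterior-predictive mean (a + k) / (a + b + n), weighted by P(M_i | Y, T, I)
predictive = np.array([(a + k) / (a + b + n) for (a, b) in models.values()])
print(dict(zip(models, np.round(weights, 3))))
print("P(X=1 | Y, T, I) ≈", float(weights @ predictive))
```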
Now, up to this point, all results are exact and optimal (this is option 2 - apply some awesome algorithm to the data). But this is a daunting task to undertake. In the real world, the mathematics required may not be feasible to do in practice, so you will have to compromise. You should always "have a go" at the exact equations, for any maths you can simplify will save you time at the PC. However, this first step is important, because it sets "the target" and makes it clear what is to be done. Otherwise you are left (as you seem to be) with a whole host of potential options and nothing to choose between them.
Now at this stage, we are still in "symbolic logic" world, where nothing has a concrete meaning yet. So you need to link these quantities to your specific problem:
- $P(M_i\mid Y,I)$ is the prior probability for the $i$th model - generally will be equal for all $i$.
- $P(\theta^{(i)}_Y\mid M_i,Y,I)$ is the prior for the parameters in the $i$th model (must be proper!).
- $P(T\mid\theta^{(i)}_Y,M_i,Y,I)$ is the likelihood function for the training data, given the $i$th model.
- $P(\theta^{(i)}_Y\mid T,M_i,Y,I)$ is the posterior for the parameters in the $i$th model, conditional on the training data.
- $P(M_i\mid Y,T,I)$ is the posterior for the $i$th model, conditional on the training data.
There will be another set of equations for $\bar{Y}$.
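To tie these symbols together, here is a minimal sketch of the whole pipeline under strong simplifying assumptions (one known model per class, features treated as conditionally independent given the class, rule-of-succession estimates everywhere - these simplifications are mine, for illustration, and all numbers are made up):

```python
import numpy as np

# Toy training data T: rows are cases, columns are binary features, y is the class label
# (made-up numbers, purely for illustration)
X_train = np.array([[1, 0, 1],
                    [1, 1, 1],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

def succession(count, total):
    """Rule of succession: posterior predictive probability from counts."""
    return (count + 1) / (total + 2)

def posterior_odds(x_new, X, y, threshold=1.0):
    """O(Y | T, X, I) for a new case, treating the features as conditionally
    independent given the class (a naive-Bayes-style simplification)."""
    pos, neg = X[y == 1], X[y == 0]
    # Prior odds P(Y | T, I) / P(not-Y | T, I) from the class counts
    o = succession(len(pos), len(y)) / succession(len(neg), len(y))
    # Likelihood ratio P(X | Y, T, I) / P(X | not-Y, T, I), one factor per feature
    for j, xj in enumerate(x_new):
        p1 = succession(pos[:, j].sum(), len(pos))  # P(feature j on | Y, T, I)
        p0 = succession(neg[:, j].sum(), len(neg))  # P(feature j on | not-Y, T, I)
        o *= (p1 / p0) if xj == 1 else (1 - p1) / (1 - p0)
    # Decision rule: the threshold encodes the consequences (here 1, i.e. symmetric losses)
    return o, bool(o > threshold)

print(posterior_odds(np.array([1, 0, 1]), X_train, y_train))
```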
Note that the equations will simplify enormously if a) one model is a clear winner, so that $P(M_j\mid Y,T,I)\approx 1$, and b) within this model, its parameters are pinned down very precisely, so the integrand resembles a delta function (and integration is very close to substitution, i.e. plug-in estimates). If both these conditions are met you have:
$$P(X\mid Y,T,I)\approx P(X\mid\theta^{(j)}_Y,M_j,Y,T,I)\Big|_{\theta^{(j)}_Y=\hat{\theta}^{(j)}_Y}$$
which is the "standard" approach to this kind of problem.
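Continuing the single-feature illustration from above: the plug-in version uses $\hat{\theta}_Y=k/n$ and gives $P(X=1\mid Y,T,I)\approx k/n$, whereas the full calculation gave $(k+1)/(n+2)$; the two agree once $n$ is large and the posterior is sharply peaked, which is exactly condition b) above.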