I would use probability theory to start with, and then pick whichever algorithm best calculates what probability theory tells you to do. So you have training data $T$, some new predictors $X$, an object to classify $Y$, and your prior information $I$.
So you want to know about $Y$. Then probability theory says: just calculate its probability, conditional on all the information you have available to you,
$$P(Y\mid T,X,I)$$
Now we can use any of the rules of probability theory to manipulate this into things that we do know how to calculate. Using Bayes' theorem, you get:
$$P(Y\mid T,X,I)=\frac{P(Y\mid T,I)\,P(X\mid Y,T,I)}{P(X\mid T,I)}$$
Now $P(Y\mid T,I)$ is usually easy: unless your prior information tells you something about $Y$ beyond the training data (e.g. correlations), it is given by the rule of succession - basically the observed fraction of times $Y$ was true in the training data set.
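For example, if the training set contains $n$ cases of which $k$ had $Y$ true, the rule of succession gives (assuming no prior structure beyond the counts):
$$P(Y\mid T,I)=\frac{k+1}{n+2}$$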
For the second term $P(X\mid Y,T,I)$ - this is your model, where most of your work will go, and where different algorithms will do different things. $P(X\mid T,I)$ is a bit of a vicious beast to calculate, so we use the following trick to avoid it: take the odds of $Y$ against $\bar{Y}$ (i.e. not $Y$). And we get:
$$O(Y\mid T,X,I)=\frac{P(Y\mid T,X,I)}{P(\bar{Y}\mid T,X,I)}=\frac{P(Y\mid T,I)}{P(\bar{Y}\mid T,I)}\,\frac{P(X\mid Y,T,I)}{P(X\mid\bar{Y},T,I)}$$
Now you basically need a decision rule: when the odds/probability is above a certain threshold, you classify $Y$ as "true", otherwise you classify it as "false". Nobody can really help you with this - it is a decision that depends on the consequences of making right and wrong calls, and only the specific context can supply those. Of course the "subjectivity" only matters if there is high uncertainty (i.e. you have a "crap" model/data which can't distinguish the two very well).
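As a sketch of how the consequences enter (the loss symbols $L_{FP}$ and $L_{FN}$ are mine, introduced only for illustration): if a false positive costs $L_{FP}$ and a false negative costs $L_{FN}$, then minimising expected loss gives the threshold
$$\text{classify } Y \text{ as true} \iff O(Y\mid T,X,I)>\frac{L_{FP}}{L_{FN}}$$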
The second quantity - the model $P(X\mid Y,T,I)$ - is a "predictive" model. Suppose the prior information indicates a single model which depends on parameter $\theta_Y$. Then the quantity is given by:
$$P(X\mid Y,T,I)=\int P(X,\theta_Y\mid Y,T,I)\,d\theta_Y=\int P(X\mid\theta_Y,Y,T,I)\,P(\theta_Y\mid Y,T,I)\,d\theta_Y$$
Now if your model is of the "iid" variety, then $P(X\mid\theta_Y,Y,T,I)=P(X\mid\theta_Y,Y,I)$. But if you have a dependent model, such as an autoregressive one, then $T$ may still matter. And $P(\theta_Y\mid Y,T,I)$ is the posterior distribution for the parameters in the model - this is the part that the training data determine, and it is probably where most of the work will go.
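As a minimal worked example (a single binary feature and a uniform prior, which are my illustrative assumptions, not part of the problem): if $X$ is Bernoulli with parameter $\theta_Y$, the training data contain $n$ cases with $Y$ true, of which $k$ had the feature present, and the prior on $\theta_Y$ is uniform, then the posterior is $\text{Beta}(k+1,\,n-k+1)$ and the integral above collapses to
$$P(X=1\mid Y,T,I)=\int_0^1\theta_Y\,P(\theta_Y\mid Y,T,I)\,d\theta_Y=\frac{k+1}{n+2},$$
which is the rule of succession again.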
But what if the model is not known with certainty? Well, it just becomes another nuisance parameter to integrate out, just as was done for $\theta_Y$. Call the $i$th model $M_i$ and its set of parameters $\theta^{(i)}_Y$, and the equation becomes:
$$P(X\mid Y,T,I)=\sum_i P(M_i\mid Y,T,I)\int P(X\mid\theta^{(i)}_Y,M_i,Y,T,I)\,P(\theta^{(i)}_Y\mid M_i,Y,T,I)\,d\theta^{(i)}_Y$$
Where
$$P(M_i\mid Y,T,I)\propto P(M_i\mid Y,I)\int P(\theta^{(i)}_Y\mid M_i,Y,I)\,P(T\mid\theta^{(i)}_Y,M_i,Y,I)\,d\theta^{(i)}_Y$$
(NOTE: $M_i$ is a proposition of the form "the $i$th model is the best in the set being considered". And no improper priors are allowed if you are integrating over models - the infinities do not cancel out in this case, and you will be left with nonsense.)
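To make the last two equations concrete, here is a minimal numerical sketch assuming a single Bernoulli feature and two candidate "models" that differ only in their (proper) Beta priors - the counts, priors and model names are made up purely for illustration:

```python
import numpy as np
from scipy.special import betaln

# Training data for the Y-true class: n cases, k of them with the feature present
# (made-up numbers, purely for illustration)
n, k = 20, 14

# Two candidate "models" M_i: a Bernoulli likelihood with different proper Beta priors
models = {"M1": (1.0, 1.0),   # Beta(1,1): uniform prior on theta
          "M2": (5.0, 5.0)}   # Beta(5,5): prior concentrated around theta = 0.5

# Marginal likelihood P(T | M_i, Y, I) = B(a+k, b+n-k) / B(a, b)
# (the theta integral done analytically for the conjugate Beta-Bernoulli pair;
#  the binomial coefficient is common to both models and cancels in the comparison)
log_marginal = {name: betaln(a + k, b + n - k) - betaln(a, b)
                for name, (a, b) in models.items()}

# Posterior model probabilities P(M_i | Y, T, I), taking equal priors P(M_i | Y, I)
logs = np.array(list(log_marginal.values()))
weights = np.exp(logs - logs.max())
weights /= weights.sum()

# Model-averaged predictive P(X=1 | Y, T, I): each model contributes its
# posterior-predictive mean (a + k) / (a + b + n), weighted by P(M_i | Y, T, I)
predictive = np.array([(a + k) / (a + b + n) for (a, b) in models.values()])
print(dict(zip(models, np.round(weights, 3))))
print("P(X=1 | Y, T, I) ≈", float(weights @ predictive))
```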
Now, up to this point, all results are exact and optimal (this is option 2 - apply some awesome algorithm to the data). But this is a daunting task to undertake. In the real world, the mathematics required may not be feasible to do in practice, so you will have to compromise. You should always "have a go" at the exact equations, for any maths you can simplify will save you time at the PC. However, this first step is important, because it sets "the target" and makes it clear what is to be done. Otherwise you are left (as you seem to be) with a whole host of potential options and nothing to choose between them.
Now at this stage, we are still in "symbolic logic" world, where nothing has a concrete meaning yet. So you need to link these quantities to your specific problem:
- $P(M_i\mid Y,I)$ is the prior probability for the $i$th model - generally will be equal for all $i$.
- $P(\theta^{(i)}_Y\mid M_i,Y,I)$ is the prior for the parameters in the $i$th model (must be proper!).
- $P(T\mid\theta^{(i)}_Y,M_i,Y,I)$ is the likelihood function for the training data, given the $i$th model.
- $P(\theta^{(i)}_Y\mid T,M_i,Y,I)$ is the posterior for the parameters in the $i$th model, conditional on the training data.
- $P(M_i\mid Y,T,I)$ is the posterior for the $i$th model, conditional on the training data.
There will be another set of equations for $\bar{Y}$.
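To tie these symbols together, here is a minimal sketch of the whole pipeline under strong simplifying assumptions (one known model per class, features treated as conditionally independent given the class, rule-of-succession estimates everywhere - these simplifications are mine, for illustration, and all numbers are made up):

```python
import numpy as np

# Toy training data T: rows are cases, columns are binary features, y is the class label
# (made-up numbers, purely for illustration)
X_train = np.array([[1, 0, 1],
                    [1, 1, 1],
                    [0, 0, 1],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

def succession(count, total):
    """Rule of succession: posterior predictive probability from counts."""
    return (count + 1) / (total + 2)

def posterior_odds(x_new, X, y, threshold=1.0):
    """O(Y | T, X, I) for a new case, treating the features as conditionally
    independent given the class (a naive-Bayes-style simplification)."""
    pos, neg = X[y == 1], X[y == 0]
    # Prior odds P(Y | T, I) / P(not-Y | T, I) from the class counts
    o = succession(len(pos), len(y)) / succession(len(neg), len(y))
    # Likelihood ratio P(X | Y, T, I) / P(X | not-Y, T, I), one factor per feature
    for j, xj in enumerate(x_new):
        p1 = succession(pos[:, j].sum(), len(pos))  # P(feature j on | Y, T, I)
        p0 = succession(neg[:, j].sum(), len(neg))  # P(feature j on | not-Y, T, I)
        o *= (p1 / p0) if xj == 1 else (1 - p1) / (1 - p0)
    # Decision rule: the threshold encodes the consequences (here 1, i.e. symmetric losses)
    return o, bool(o > threshold)

print(posterior_odds(np.array([1, 0, 1]), X_train, y_train))
```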
Note that the equations will simplify enormously if a) one model is a clear winner, so that $P(M_j\mid Y,T,I)\approx 1$, and b) within this model, its parameters are pinned down very precisely, so the integrand resembles a delta function (and integration is very close to substitution, i.e. plug-in estimates). If both these conditions are met you have:
$$P(X\mid Y,T,I)\approx P(X\mid\theta^{(j)}_Y,M_j,Y,T,I)\Big|_{\theta^{(j)}_Y=\hat{\theta}^{(j)}_Y}$$
which is the "standard" approach to this kind of problem.
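Continuing the single-feature illustration from above: the plug-in version uses $\hat{\theta}_Y=k/n$ and gives $P(X=1\mid Y,T,I)\approx k/n$, whereas the full calculation gave $(k+1)/(n+2)$; the two agree once $n$ is large and the posterior is sharply peaked, which is exactly condition b) above.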