Can you overfit by training machine learning algorithms using CV/Bootstrap?

34

This question may well be too open-ended to get a definitive answer, but hopefully not.

Machine learning algorithms such as SVM, GBM and Random Forest generally have some free parameters that, beyond some rule-of-thumb guidance, need to be tuned for each data set. This is usually done with some kind of resampling technique (bootstrap, CV, etc.) in order to find the set of parameters that gives the best generalisation error.

My question is: can you go too far here? People talk about doing grid searches and so forth, but why not simply treat this as an optimisation problem and drill down to the best possible set of parameters? I asked about some of the mechanics of this in another question, but it hasn't received much attention. Maybe the question was badly asked, but perhaps the question itself represents a bad approach that people generally don't take?

What bothers me is the lack of regularisation. I might find by resampling that the best number of trees to grow in a GBM for this data set is 647 with an interaction depth of 4, but how sure can I be that this will hold for new data (assuming the new population is identical to the training set)? With no reasonable value to "shrink" towards (or, if you will, no informative prior information), resampling seems like the best we can do. I just don't hear any talk about this, so it makes me wonder whether there is something I'm missing.

Obviously there is a large computational cost associated with doing many, many iterations to squeeze every last bit of predictive power out of a model, so clearly this is something you would only do if you have the time and computing power for the optimisation and every bit of performance improvement is valuable.

— Bogdanovist

CV can be used for different things. To be clear, when you say "grid search" or "hyperparameter tuning", you are talking about model selection, not feature selection or even just estimating the classification error.

— smci

30

There is a definitive answer to this question: "yes, it is certainly possible to over-fit a cross-validation based model selection criterion and end up with a model that generalises poorly!". In my view, this appears not to be widely appreciated, but it is a substantial pitfall in the application of machine learning methods, and it is the main focus of my current research. I have written two papers on the subject so far:

G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.

which demonstrates that over-fitting in model selection is a substantial problem in machine learning (and that you can get severely biased performance estimates if you cut corners in model selection during performance evaluation), and

G. C. Cawley and N. L. C. Talbot, "Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters", Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007.

where the cross-validation based model selection criterion is regularised to try to ameliorate over-fitting in model selection (which is a key problem if you use a kernel with many hyper-parameters).

I am writing up a paper on grid-search based model selection at the moment, which shows that it is certainly possible to use a grid that is too fine, so that you end up with a model that is statistically inferior to a model selected using a much coarser grid (it was a question on StackExchange that inspired me to look into grid search).

Hope this helps.

P.S. Unbiased performance evaluation and reliable model selection can indeed be computationally expensive, but in my experience it is well worthwhile. Nested cross-validation, where the outer cross-validation is used for performance estimation and the inner cross-validation for model selection, is a good basic approach.
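As an illustration of the nested scheme just described, here is a minimal pure-Python sketch. It is not taken from the papers above: the synthetic data, the 1-D k-nearest-neighbour regressor, the grid of `k` values and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
import random

def make_data(n, rng):
    """Synthetic 1-D regression data: y = sin(x) + Gaussian noise (illustrative only)."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error on `test` of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def folds(data, n_folds):
    """Yield (train, test) pairs for n_folds-fold cross-validation."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test

def cv_error(data, k, n_folds):
    """Cross-validation estimate of the error for a fixed k."""
    errs = [mse(tr, te, k) for tr, te in folds(data, n_folds)]
    return sum(errs) / len(errs)

def select_k(data, grid, n_folds):
    """Inner loop: model selection by minimising the CV error over the grid."""
    return min(grid, key=lambda k: cv_error(data, k, n_folds))

def nested_cv(data, grid, outer_folds=5, inner_folds=4):
    """Outer loop: performance estimation; k is re-selected inside every outer fold."""
    outer_errs, chosen = [], []
    for train, test in folds(data, outer_folds):
        k = select_k(train, grid, inner_folds)  # sees the outer training part only
        chosen.append(k)
        outer_errs.append(mse(train, test, k))  # the test part never influenced k
    return sum(outer_errs) / len(outer_errs), chosen

rng = random.Random(0)
data = make_data(120, rng)
grid = [1, 3, 5, 9, 15, 25]

nested_estimate, chosen_ks = nested_cv(data, grid)
best_k = select_k(data, grid, 5)
naive_estimate = cv_error(data, best_k, 5)  # minimum of the selection criterion itself

print("nested CV estimate:", nested_estimate)
print("k chosen in each outer fold:", chosen_ks)
print("minimised (non-nested) CV error:", naive_estimate)
```

The non-nested figure is the minimised selection criterion, so it is optimistic in expectation; on any single data set it need not come out below the nested estimate.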

— Dikran Marsupial

Perfect! Looks like those papers are exactly what I was after. Thanks for that.

— Bogdanovist

Do let me know if you have any questions about the papers (via email - I am the first author and my email address is on the paper).

— Dikran Marsupial

@DikranMarsupial How do you distinguish overfitting due to model selection from overfitting due to a sampling mismatch between the training and test sets?

— image_doctor

1

In principle, with a synthetic dataset where ground truth is available, it is straightforward, as there is then no sampling mismatch: the training set is just a random sample from the underlying distribution, and you can estimate the error from the distribution itself rather than from a finite sample. For real-world datasets, however, AFAICS the best you can manage is to use resampling and determine the effects of over-fitting the model selection criterion over many random test/training splits.

— Dikran Marsupial

2

Sadly, the paper was rejected, but I will revise it to take into account the reviewers' (very useful) comments and resubmit it to another journal.

— Dikran Marsupial

7

Cross-validation and the bootstrap have been shown to give nearly unbiased estimates of the error rate, and in some cases the bootstrap is more accurate than cross-validation. The problem with other methods such as resubstitution is that, by estimating the error on the same data set you used to fit the classifier, you can grossly underestimate the error rate. That may lead you to algorithms that include too many parameters and will not predict future values as accurately as an algorithm fit with a small set of parameters. The key to the use of statistical methods is that the data you have to train the classifier is typical of the data you will see in the future, where the classes are missing and must be predicted by the classifier. If you think the future data could be very different, then statistical methods can't help, and I don't know what could.
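A small sketch of how badly resubstitution can underestimate the error rate, under assumptions that are mine rather than the answer's: a pure-noise problem (labels independent of `x`, so no classifier can beat a 50% error rate) and a 1-nearest-neighbour classifier, compared against leave-one-out cross-validation.

```python
import random

rng = random.Random(3)
# Pure-noise problem: labels are independent of x, so the true error rate is 0.5.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]

def predict_1nn(train, x):
    """Return the label of the single nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def error_rate(train, test):
    """Fraction of test points misclassified by 1-NN built on `train`."""
    return sum(predict_1nn(train, x) != y for x, y in test) / len(test)

# Resubstitution: test on the training data, where each point is its own neighbour.
resubstitution = error_rate(data, data)

# Leave-one-out cross-validation: each point is predicted without itself.
loo = sum(
    predict_1nn(data[:i] + data[i + 1:], x) != y
    for i, (x, y) in enumerate(data)
) / len(data)

print("resubstitution error:", resubstitution)
print("leave-one-out error:", loo)
```

Resubstitution reports a perfect classifier on data that contain no signal at all, while the leave-one-out figure lands near chance level.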

— Michael R. Chernick

Thanks for the answer. I've edited the question to make it clear that I'm not asking about changes in the population between train and test sets. I realise that is a whole different question that I am not interested in for this question.

— Bogdanovist

1

+1 In this case unbiasedness is essentially irrelevant. The variance of the cross-validation estimate can be much more of a problem. For a model selection criterion you need the minimum of the criterion to be reliably close to the minimum of the generalisation error (as a function of the hyper-parameters). It is of no use if on average it is in the right place, but the spread of the minima for different finite samples of data is all over the place.
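The point about the spread of the minima can be illustrated with a toy simulation. Everything here is a hypothetical stand-in: `true_error` is an invented generalisation-error curve, and the "CV estimate" is that curve plus independent Gaussian noise at each grid point (real CV estimates at neighbouring hyper-parameter values are correlated, so this is a simplification).

```python
import random
import statistics

def true_error(h):
    """Hypothetical generalisation error as a function of one hyper-parameter h."""
    return 1.0 + (h - 0.5) ** 2

def noisy_criterion(h, rng, noise_sd=0.1):
    """Stand-in for a CV estimate: unbiased at every h, but noisy."""
    return true_error(h) + rng.gauss(0.0, noise_sd)

rng = random.Random(1)
grid = [i / 20 for i in range(21)]  # h = 0.00, 0.05, ..., 1.00

selected, min_values = [], []
for _ in range(2000):  # 2000 hypothetical finite samples of data
    estimates = [noisy_criterion(h, rng) for h in grid]
    best = min(range(len(grid)), key=estimates.__getitem__)
    selected.append(grid[best])
    min_values.append(estimates[best])

print("true optimum h: 0.5, true minimum error: 1.0")
print("mean selected h:", round(statistics.mean(selected), 3))
print("sd of selected h:", round(statistics.stdev(selected), 3))
print("mean minimised criterion:", round(statistics.mean(min_values), 3))
```

In this toy setting the selected `h` is centred on the true optimum but scattered around it, and the minimised criterion is on average below the true minimum error even though the criterion is unbiased at every individual `h`.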

— Dikran Marsupial

1

Of course accuracy is a combination of bias and variance, and an unbiased estimator with a large variance is not as good as a slightly biased estimator with a small variance. The naive estimate of the error rate is resubstitution, and it has a large bias. The bootstrap .632 and .632+ estimators work so well because they do a good job of adjusting for the bias without much increase in variance. That is why, for linear and quadratic discriminant functions, they work much better than the leave-one-out version of cross-validation.
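For reference, the .632 estimator mentioned here is usually written as a weighted combination of two error rates. The notation below is my own choice, not the commenter's: $\overline{\mathrm{err}}$ is the resubstitution (training) error and $\widehat{\mathrm{Err}}^{(1)}$ is the bootstrap error measured only on the points left out of each bootstrap sample.

$$\widehat{\mathrm{Err}}_{.632} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}$$

The weight $0.632 \approx 1 - e^{-1}$ is the limiting probability that a given observation appears in a bootstrap sample of the same size as the data. The .632+ variant replaces the fixed weights with ones that depend on an estimate of how strongly the classifier overfits.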

— Michael R. Chernick

With classification tree ensembles the bootstrap has not been demonstrated to do better.

— Michael R. Chernick

1

Perhaps one of the difficulties is that over-fitting often means different things in machine learning and statistics. It seems to me that statisticians sometimes use over-fitting to mean that a model with more parameters than necessary is being used, rather than that it has been fit too closely to the observations (as measured by the training criterion). I would normally use "over-parameterised" in that situation, and use "over-fit" to mean a model has been fitted too closely to the observations at the expense of generalisation performance. Perhaps this is where we may be talking at cross-purposes?

— Dikran Marsupial

4

I suspect one answer here is that, in the context of optimisation, what you are trying to find is a global minimum of a noisy cost function. So you have all the challenges of multi-dimensional global optimisation plus a stochastic component added to the cost function.

Many of the approaches for dealing with local minima and an expensive search space, such as simulated annealing or Monte Carlo methods, themselves have parameters which may need tuning.

In an ideal, computationally unbounded universe, I suspect you could attempt to find a global minimum of your parameter space with suitably tight limits on the bias and variance of your estimate of the error function. In this scenario regularisation wouldn't be an issue, as you could re-sample ad infinitum.

In the real world I suspect you may easily find yourself in a local minimum.

As you mention, it is a separate issue, but this still leaves you open to overfitting due to sampling issues associated with the data available to you and its relationship to the real underlying distribution of the sample space.

— image_doctor

4

It strongly depends on the algorithm, but you certainly can -- though in most cases it will be just a benign waste of effort.

The core of this problem is that this is not a strict optimization -- you don't have some $f(\mathbf{x})$ defined on some domain which simply has an extremum for at least one value of $\mathbf{x}$, say $\mathbf{x}_{\text{opt}}$, so that all you have to do is find it. Instead, you have $f(\mathbf{x})+\epsilon$, where $\epsilon$ has some crazy distribution, is often stochastic and depends not only on $\mathbf{x}$ but also on your training data and CV/bootstrap details. So the only reasonable thing you can search for is some subspace of $f$'s domain, say $X_{\text{opt}}\ni \mathbf{x}_{\text{opt}}$, on which all the values of $f+\epsilon$ are insignificantly different (statistically speaking, if you wish).

Now, while you can't find $\mathbf{x}_{\text{opt}}$, in practice any value from $X_{\text{opt}}$ will do -- and usually it is just a search grid point from $X_{\text{opt}}$ selected at random, to minimize computational load, to maximize some sub-$f$ performance measure, you name it.
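One way to make the idea of $X_{\text{opt}}$ concrete is sketched below. The assumptions are all mine: `true_error` is an invented noise-free $f$, $\epsilon$ is independent Gaussian noise, each grid point is scored over 25 independent resamples, and "insignificantly different" is approximated by lying within two standard errors of the best mean score. A proper statistical test may be more appropriate in practice.

```python
import random
import statistics

def true_error(x):
    """Hypothetical noise-free f(x), flat near its minimum at x = 0.5."""
    return 1.0 + (x - 0.5) ** 2

rng = random.Random(2)
grid = [i / 10 for i in range(11)]  # x = 0.0, 0.1, ..., 1.0
n_resamples = 25                    # e.g. repeated CV or bootstrap replicates

means, std_errs = [], []
for x in grid:
    # Each score is f(x) + epsilon for one resample (independent noise assumed).
    scores = [true_error(x) + rng.gauss(0.0, 0.1) for _ in range(n_resamples)]
    means.append(statistics.mean(scores))
    std_errs.append(statistics.stdev(scores) / n_resamples ** 0.5)

best = min(range(len(grid)), key=means.__getitem__)
threshold = means[best] + 2 * std_errs[best]

# Grid points whose mean score cannot be told apart from the apparent best.
x_opt = [x for x, m in zip(grid, means) if m <= threshold]

print("apparent optimum:", grid[best])
print("candidates within two standard errors:", x_opt)
```

Any member of `x_opt` is then a defensible choice, and a secondary criterion such as computational cost can break the tie.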

Serious overfitting can happen if the $f$ landscape has a sharp extremum -- yet this "shouldn't happen", i.e. it is a characteristic of a very badly selected algorithm/data pair and a bad prognosis for the generalization power.

Thus (based on the practices present in good journals), full external validation of parameter selection is not something you rigorously have to do (unlike validating feature selection), but only if the optimization is cursory and the classifier is rather insensitive to the parameters.

4

Yes, the parameters can be "overfitted" to the training and test sets during cross-validation or bootstrapping. However, there are methods to prevent this. The simplest is to divide your dataset into three partitions: one for testing (~20%), one for evaluating the optimized parameters (~20%) and one for fitting the classifier with the chosen parameters. This is only possible if you have a fairly large dataset. Otherwise, double cross-validation is suggested.

Romain François and Florent Langrognet, "Double Cross Validation for Model Based Classification", 2006
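The three-partition approach described above could be sketched as follows. The helper names `three_way_split` and `tune_and_evaluate`, the toy "model" (a shrunken mean) and the candidate shrinkage factors are hypothetical, chosen only so the example runs on its own.

```python
import random
import statistics

def three_way_split(data, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle, then split into (train, validation, test) lists."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = round(test_frac * len(items))
    n_val = round(val_frac * len(items))
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def tune_and_evaluate(train, val, test, candidates, fit, error):
    """Fit every candidate on train, choose by validation error, report test error once."""
    models = {c: fit(train, c) for c in candidates}
    best = min(candidates, key=lambda c: error(models[c], val))
    return best, error(models[best], test)

def fit(rows, shrink):
    """Toy model: predict a constant equal to the shrunken training mean."""
    return shrink * statistics.mean(rows)

def error(model, rows):
    """Mean squared error of the constant prediction."""
    return statistics.mean((y - model) ** 2 for y in rows)

rng = random.Random(4)
data = [rng.gauss(5.0, 1.0) for _ in range(100)]
train, val, test = three_way_split(data)

best, test_error = tune_and_evaluate(train, val, test, [0.0, 0.5, 1.0], fit, error)
print(len(train), len(val), len(test))  # 60 20 20
print("chosen shrinkage:", best, "test error:", test_error)
```

The test partition is touched exactly once, after the parameter has been fixed, which is what keeps its error estimate free of the selection effect.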

— spinus