# High-order feature-based mixture models of classification learning predict individual learning curves and enable personalized teaching


Edited by Ranulfo Romo, Universidad Nacional Autonoma de Mexico, Mexico City, D.F., Mexico, and approved November 13, 2012 (received for review July 7, 2012)

## Abstract

Pattern classification learning tasks are commonly used to explore learning strategies in human subjects. The universal and individual traits of learning such tasks reflect our cognitive abilities and have been of interest both psychophysically and clinically. From a computational perspective, these tasks are hard, because the number of patterns and rules one could consider even in simple cases is exponentially large. Thus, when we learn to classify we must use simplifying assumptions and generalize. Studies of human behavior in probabilistic learning tasks have focused on rules in which pattern cues are independent, and also described individual behavior in terms of simple, single-cue, feature-based models. Here, we conducted psychophysical experiments in which people learned to classify binary sequences according to deterministic rules of different complexity, including high-order, multicue-dependent rules. We show that human performance on such tasks is very diverse, but that a class of reinforcement learning-like models that use a mixture of features captures individual learning behavior surprisingly well. These models reflect the important role of subjects’ priors, and their reliance on high-order features even when learning a low-order rule. Further, we show that these models predict future individual answers to a high degree of accuracy. We then use these models to build personally optimized teaching sessions and boost learning.

We regularly learn to classify sensory stimuli into novel categories, often using an impoverished sampling of the stimulus space and the underlying rule: even a “simple” task of classifying patterns of *n* bits into two categories requires an implicit mapping of the 2^{n} possible patterns, which means there are 2^{2^{n}} potential deterministic classification rules. It is clear then that when we learn to classify, we cannot simply explore the space of rules and patterns, but instead must rely on simplifying assumptions.
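As a quick sanity check of this combinatorial explosion, the counts above can be reproduced directly (a minimal sketch; the function names are ours):

```python
def n_patterns(n: int) -> int:
    """Number of distinct binary patterns of n squares."""
    return 2 ** n

def n_rules(n: int) -> int:
    """Number of deterministic binary classification rules:
    one independent label choice per pattern, so 2^(2^n)."""
    return 2 ** (2 ** n)

print(n_patterns(4), n_rules(4))  # 16 65536
```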

Analysis of human learning of deterministic classification rules has focused on modeling of the average behavior of subjects (1⇓–3), and explored the effect of rule complexity on the average level of success (4). Learning to classify according to a probabilistic rule is inherently ambiguous, and so studies of such tasks have focused on simpler rules than those used in deterministic classification. For example, the weather prediction (WP) task (5) requires learning probabilistic associations between multiple cues and a label, where each cue carries independent information about the correct label. Analysis of the learning strategy of individual subjects in this task has compared single-cue or single feature-based strategies, and the possibility of switching between such strategies (6, 7). Associative or Bayesian learning models that rely on simple stimulus features were used to describe the diversity of individual learning dynamics that subjects exhibited and compare between subjects (8), and reflected differences between healthy subjects and patients (9, 10). However, these models were mostly evaluated in terms of their ability to describe subjects’ performance, rather than cross-validated predictive power. Second, the complexity of the rules studied was limited (i.e., cues carried independent information). Third, these tasks often relied on a strong bias in the presentation rate of different patterns, which changes the available information for subjects.

To characterize learning behavior of complex rules at the individual level, we used a psychophysical task of classifying binary visual patterns into two abstract classes, with no bias in the set of presented patterns. Our task involved only deterministic rules, but included rules of different complexity, focusing on high-order dependencies between elements in the pattern. We extended ideas of feature-based learning (6, 11⇓–13) to present a reinforcement learning-like family of maximum entropy-based models to describe how individuals learn different high-order classification rules. We found that these models capture individual behavior to a high degree of accuracy, and can also predict individual behavior. We then used such models, which were fitted to individuals during a learning session, to pick the optimal samples to present them to help them learn the rule faster.

## Results

To characterize individual classification learning in terms of accurate quantitative models, we presented 41 healthy subjects with a psychophysical task in which they had to classify patterns of black and white squares into two abstract categories labeled “red” and “blue” (Fig. 1*A*). The correct label of each pattern was determined by a deterministic rule—a fact that was unknown to the subjects. In each session, patterns of size *n* = 4 or 5 squares were presented one at a time, and the correct label was shown after the subject classified the pattern. To enable quantitative analysis and comparison between subjects, all subjects were presented with the same order of patterns; moreover, each block of 16 examples (for *n* = 4; 32 for *n* = 5) contained all possible patterns in a “frozen” random order. Given the huge number of deterministic rules (65,536 for *n* = 4, and 9 × 10^{9} for *n* = 5), we chose six different balanced rules (equal number of red- and blue-labeled patterns) of different complexity. Denoting the patterns as *x* = (*x*_{1}, …, *x*_{n}), where *x*_{i} = ±1, and the label as *y* = ±1, the label of each pattern was determined according to single, pairwise, or triple-wise dependencies between the squares in the pattern, or according to “holistic” features of the whole pattern (one-bit, two-bit, three-bit, majority, symmetry, and middle-symmetry rules; Fig. 1). Each subject performed four sessions on the same day, with a different rule in each session.
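To make the rule families concrete, here is a minimal sketch of how such rules can be written as functions on ±1 patterns. The specific bit indices (e.g., *x*_{3} for the one-bit rule) are illustrative assumptions, not the exact rules used in the experiment:

```python
import numpy as np

# Patterns are vectors of +/-1; labels are +/-1.
# Bit indices below are illustrative, not the experiment's exact rules.
def one_bit(x):        # label = a single square, e.g. y = x_3
    return x[2]

def two_bit(x):        # parity of a pair of squares
    return x[0] * x[1]

def three_bit(x):      # parity of a triple of squares
    return x[0] * x[1] * x[2]

def majority(x):       # sign of the sum of squares (odd n assumed)
    return np.sign(np.sum(x))

def symmetry(x):       # +1 iff the pattern is a palindrome
    return 1 if np.array_equal(x, x[::-1]) else -1

x = np.array([1, -1, 1, -1, 1])
print(one_bit(x), majority(x), symmetry(x))
```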

The diversity of learning dynamics among subjects was wide, ranging from no learning, to incremental learning, to abrupt transition to full success. Fig. 1*B* shows the fraction of correct answers as a function of time (over a running-average window) for 10 different subjects learning the same one-bit rule (*y* = *x*_{3}). Similar diversity of learning curves was seen for the other rules we tested. Fig. 1*C* shows the average learning curve over all subjects for the different rules, which reflects the average difficulty of each rule. The large SDs over subjects result from the individual differences between them; in particular, for each rule there was at least one subject who failed to learn it, and no subject succeeded in learning all rules. The performance for the majority rule demonstrates the strong effect of subjects’ priors, as 2 of 16 subjects learned the rule without any mistakes, and another two made a single mistake (12). The marked differences between the population learning curves of the different rules show that memorization without generalization can be ruled out as the sole mechanism of learning. In *SI Text*, *Ruling out Simple Memorization as the Strategy That Subjects Use*, we show that both gradual memorization and pattern-specific memorization can also be ruled out as pure strategies.

Intuitively, one might assume that subjects seek distinctive features in the patterns according to which they classify, as has been studied in similar tasks (1, 4, 6). We therefore quantified the relation between a set of different features of the patterns, *f*_{i}(*x*), and subjects’ answers. Fig. 1 *D* and *E* show the mutual information between subjects' choices and pattern features, *I*(*f*_{i}; *answers*), as a function of time for different one-, two-, three-, and four-bit features (this set is a complete basis from which one could linearly construct any other feature). In a few cases, subjects indeed relied on single features (Fig. 1*D*), but in most cases, they used a different strategy, which could be described as mixing of features, or very fast switching between features (Fig. 1*E*).
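The complete feature basis referred to above can be built as the set of parity functions over all non-empty subsets of bit positions; a minimal sketch (keying the dictionary by index subsets is our own convention):

```python
from itertools import combinations
import numpy as np

def parity_features(x):
    """Parity features f_J(x) = prod_{i in J} x_i over every non-empty
    subset J of bit positions; for n bits this yields 2^n - 1 features,
    a complete basis for functions of the +/-1 pattern."""
    n = len(x)
    feats = {}
    for k in range(1, n + 1):
        for J in combinations(range(n), k):
            feats[J] = int(np.prod([x[i] for i in J]))
    return feats

feats = parity_features([1, -1, -1, 1])
print(len(feats))  # 15 features for n = 4
```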

We therefore used a mixture model of these features, which form a basis that can span any rule, to explore individual behavior and characterize the effect of subjects' priors on learning (Fig. 2*A*). Using Bayes’ rule, we can represent any probabilistic classifier of pattern *x* into label *y* in terms of “internal models” that a subject has for each category *c* (the probability that pattern *x* belongs to category *c*) as a weighted mixture of features of *x*:

$$P(\vec{x} \mid c) = \frac{1}{Z_c} \exp\Bigl(\beta \sum_i \alpha_i^{(c)} f_i(\vec{x})\Bigr), \qquad [1]$$

where *f*_{i}(*x*) are features of *x*; *α*_{i}^{(c)} are the weights given to each feature, normalized such that the norm of the weight vector is 1; *β* quantifies the certainty, or inverse “temperature”; and *Z*_{c} is a normalization factor, or partition function. This form of exponential function is commonly used in machine learning, statistics, psychophysics, and neuroscience (14⇓–16). We note that for a specific choice of features, this is the maximum-entropy distribution of *x*, given the average value of that set of features in each category, ⟨*f*_{i}⟩_{c}. The classifier *p*(*y*|*x*) is then given by

$$P(y = 1 \mid \vec{x}) = \frac{1}{1 + \exp\bigl(-\beta \sum_i \alpha_i f_i(\vec{x}) - \gamma\bigr)}, \qquad [2]$$

where *α*_{i} = *α*_{i}^{(1)} − *α*_{i}^{(−1)}, *γ* is given by the category prior, *γ* = log[*P*(*y* = 1) *Z*_{−1} / *P*(*y* = −1) *Z*_{1}], and *Z*_{1} is the partition function of label 1 in Eq. **1**. This model can describe any classifier and learning dynamics in terms of arbitrary changes to each of the *α*_{i}s at each time step.
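A sketch of the classifier of Eq. **2** as a logistic function of the weighted feature sum (variable names are ours; `alpha` here denotes the difference of the two categories' weight vectors):

```python
import numpy as np

def classify_prob(f, alpha, beta, gamma):
    """P(y=1 | x) for the mixture-of-features classifier (Eq. 2):
    a logistic function of the weighted feature sum.
    f: feature values f_i(x); alpha: weight differences between the
    two categories; beta: inverse temperature; gamma: prior bias."""
    h = beta * np.dot(alpha, f) + gamma
    return 1.0 / (1.0 + np.exp(-h))
```

With features aligned to the weights, the probability exceeds 0.5; with zero net drive it is exactly 0.5.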

Importantly, to find a compact and useful representation of individual learning in terms of a subject’s prior and a simple set of learning parameters, we used a very restricted version of learning the model described by Eqs. **1** and **2**. Instead of changing the weight of each feature independently, we consider models in which learning occurs by changing all *α*_{i}s in a coupled way, according to a single learning rate and a gradient-based rule: given a labeled example (*x*_{t}, *y*_{t}), the change in weights is given by

$$\Delta\alpha_i^{(y_t)} = \eta \bigl[1 - P(y_t \mid \vec{x}_t)\bigr] \bigl(f_i(\vec{x}_t) - \langle f_i \rangle_{y_t}\bigr), \qquad [3]$$

where *η* is a single static learning rate, ⟨*f*_{i}⟩_{y_t} is the expected value of feature *f*_{i} over the model distribution given by Eq. **1**, and *γ* has a learning rule similar to Eq. **3**. Thus, the model starts with a prior set of weights for each of the features *α*_{i} at *t* = 0, and then updates the weights according to a learning rule that can be interpreted as a product of a standard reinforcement learning term (13, 17) and a gradient-based term. The parameters of our models are therefore just the prior at *t* = 0 and two static parameters: the certainty *β* and learning rate *η* (Fig. 2*B*). We note that this is a generalized form of the models used by Speekenbrink and Shanks (11) for the WP task, but here they include a specific prior term and high-order features.
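The coupled update of Eq. **3** can be sketched as follows, computing the model expectation ⟨*f*_{i}⟩ by brute-force enumeration over all patterns (feasible only for small *n*; the exact functional form is our reconstruction from the text, and all names are ours):

```python
import numpy as np
from itertools import product

def model_expectations(alpha_c, beta, features, n):
    """<f_i> under P(x|c) ~ exp(beta * sum_i alpha_i^c f_i(x)) (Eq. 1),
    by enumerating all +/-1 patterns of length n (small n only)."""
    pats = [np.array(p) for p in product([-1, 1], repeat=n)]
    F = np.array([[f(x) for f in features] for x in pats])  # (2^n, k)
    w = np.exp(beta * (F @ alpha_c))
    w = w / w.sum()
    return w @ F

def update_weights(alpha_c, beta, eta, x, p_correct, features):
    """One learning step: the correct category's weights move along the
    gradient of log P(x|c), scaled by an RL-like surprise term
    (1 - p_correct), then are renormalized to unit norm."""
    f_x = np.array([float(f(x)) for f in features])
    grad = f_x - model_expectations(alpha_c, beta, features, len(x))
    new = alpha_c + eta * (1.0 - p_correct) * grad
    return new / np.linalg.norm(new)
```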

These models can exhibit a wide range of learning dynamics, which we mapped by simulating the performance of models with different values of *η* and *β* on different rules, with a large set of diverse priors. Fig. 2*C* shows the success rate of the simulated models in learning the one-bit rule (Fig. 1*C*) as a function of the ratio *η:β*. We found that for low *η:β*, the prior dominates the model’s behavior. For large *η:β*, the prior has no effective role, because the learning rule is overfitting to the last seen sample, resulting in poor generalization. Optimal learning in this case would occur at *η:β* ≃ 0.1, which would balance the prior with appropriate learning rates. (We show in *SI Text, Dependence of Learning in the Mixture of Features Model on η:β*, that the ratio *η:β* dominates the learning dynamics of these models).

We thus fitted the subjects’ learning sessions with our mixture of features model, and found that the detailed structure of their individual learning curves can be approximated to a high degree of accuracy. Specifically, for each subject and session, we used a heuristic approach to find the best prior and combination of *β* and *η* values that would maximize the likelihood of the observed sequence of answers (*Methods*). Fig. 3*A* shows examples of the high similarity between the empirical learning curve of subjects and the expected learning curve of the model that was fit to them. Because the models are probabilistic, we estimated the variance of the model’s learning curve, and found that the models are almost as similar to the observed behavior as one could expect. Fig. 3*B* shows the overlap in answers between subjects and the models that were fit to them, for all rules. Averaging over all rules and subjects, the accuracy was 87%; the average overlap for the different rules ranged from 80% to 98%. To further validate the accuracy of the model, we asked, how typical is the observed sequence of answers according to the model we fitted? Fig. 3*C* shows the overlap between the model and the empirical answers, normalized by the SD of the model’s predictions. The Z-score values show that the deviation between models and the data were within the probabilistic variance expected from the model.

Though our model relies on a spanning set of features, it is possible that subjects do not use or have access to all these features. We asked then how removing features from the basis set of the model would affect the accuracy and predictive power of the model. We found that mixture models that used only single-bit or pairwise features were significantly less accurate than models that used higher-order features in fitting subjects’ behavior (Fig. 4), almost regardless of rule complexity. A more general learning rule, which evaluated the change in *α⃗* based on feature values over a window of several examples, rather than just the last one, did not significantly improve our results. Neither did other reinforcement learning models and neural network-based ones (*SI Text, Comparisons of the Mixture of Features Model to Other Models and Effect of the Length of the Learning Window on the Mixture of Features Model*).

We asked then whether our models can predict individual behavior. Fig. 3*D* shows examples of the agreement between the subjects’ learning curves on the second half of the session, and the most likely learning curves predicted by the model that was fit just to the first half of that session. On average, we predicted ∼75% of the future answers (Fig. 3*E*). Fig. 3*F* shows Z-score measures of the prediction accuracy, reflecting that though these models are missing some part of the subjects’ learning dynamics, they are very informative of individual subjects.

Having a predictive model of a subject’s future behavior suggests that we could use the model to estimate which patterns would be most useful for the subject in terms of learning the rule. We thus repeated the same experiment, but now presented to the subjects a personalized sequence of patterns in the fourth block of samples. To determine the individual set of patterns we should show each subject, we used the first three blocks (where all subjects had the exact same patterns shown to them) to fit a mixture model to each subject (online). If we now assume that the model is an accurate predictor of the subject’s behavior, we can simulate the effect of any sequence of patterns in the fourth block on learning. We then sought the optimal set of patterns to present the subject in that fourth block (“personalized training block”) that would result in the most accurate classifier at the end of that block, according to the model. In *SI Text, Pattern Specific Performance Before Personalized Training Block, and the Order of Patterns in That Block*, we show that the optimal sequences used in the fourth block did not take a trivial form and also contained patterns that were classified well in the first three blocks. To evaluate the effect of the individually chosen group of patterns, all subjects had the exact same patterns shown to them in blocks 5–8. We found that both the test group and our control group, which was presented with a random set of patterns in the fourth block, improved during the session. However, the test group showed a significant improvement in the two blocks following the intervention (unpaired *t* test, *P* < 0.04; Fig. 5). The control group closed this gap later in the session. In *SI Text, Optimality of the Personalized Training Sequence: Comparison with Alternative Models*, we show that choosing the sequences according to other models is expected to give worse learning improvement.
Thus, we succeeded in boosting individual performance by personalized, model-based teaching. We note that this approach resembles ideas from optimal sampling and active learning in statistics and machine learning (18, 19, *), which suggest future analysis of optimal exploration and exploitation strategies in learning.
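The model-based selection of teaching patterns can be illustrated with a greedy stand-in for the sequence search described in *Methods* (the `simulate` interface and all names are hypothetical; the paper's actual search operated offline over whole sequences):

```python
import numpy as np

def pick_teaching_sequence(simulate, candidates, rule, seq_len):
    """Greedy sketch of personalized teaching: at each step, present the
    pattern whose simulated effect maximizes the fitted model's agreement
    with the target rule over all candidate patterns.
    simulate(history) -> predicted labels for every candidate pattern
    after the model has seen `history` (a list of (pattern, label))."""
    history = []
    for _ in range(seq_len):
        best, best_score = None, -1.0
        for x in candidates:
            preds = simulate(history + [(x, rule(x))])
            score = np.mean([pred == rule(c)
                             for c, pred in zip(candidates, preds)])
            if score > best_score:
                best, best_score = x, score
        history.append((best, rule(best)))
    return [x for x, _ in history]
```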

## Discussion

We have shown that in a pattern classification task, human performance can be described to a high degree of accuracy by a probabilistic model that dynamically adapts the weight subjects give to different features of the patterns. These models rely on a prior and two static parameters, and were accurate enough to enable personally tailored teaching, by picking for individual subjects the best patterns to show them to help them learn.

Our results reflect the important role of the prior that a subject has in such learning tasks (12), and that these priors already rely on high-order correlations in the patterns that the subjects classify. This finding suggests that subjects’ history and experience are instrumental in shaping their learning.

We note that it is possible that more-detailed models, in particular those using adaptive learning rates (*η*) or adaptive subject certainty (*β*), could yield a better fit to subjects’ answers and greater predictive power. However, we submit that the power of the current approach lies in its relative simplicity.

The mathematical nature of our models suggests that learning may be seen as a dynamic weighting of “experts” ^{†} by combining, linearly, prototypes of the stimulus or exemplar representation of the stimulus space (2, 20, 21) (*SI Text, Mixture Model as a Simple Case of Prototype or Exemplar Representation*). Because the classification rule could rely on an arbitrary feature of the patterns—and, in particular, some of the rules we studied were nonlinear functions of simple features of the patterns (in contrast to refs. 5 and 22)—our model included in the mixture a set of features that allowed describing any rule, i.e., a spanning set of features. This features set could correspond to neural correlates of decision-making (23, 24), probabilistic inference (13, 25), and learning and strategy shifts (26, 27) that were observed in single-unit recordings in primate and mammalian cortex. Seeking neural correlates of the model presented here would be of particular interest in light of the characterization of the role of memory systems involved in WP (28, 29) and other learning and decision-making tasks (30, 31), and theoretical models of incremental learning through spike timing-dependent plasticity (32, 33).

## Methods

### Subjects.

A total of 78 healthy adults (34 male) between the ages of 22 and 42 performed four learning sessions, taking short breaks between sessions, for a total duration of ∼1 h. The experimental setup was explained to all subjects before the first session, and they went through one brief training session. The detailed nature of the complexity of classification rules was not discussed with the subjects. Subjects signed a formal participation agreement according to Helsinki protocol TLV-0287-09 approved by the Weizmann Institute's institutional review board. Subjects were rewarded for participation, regardless of their performance. *SI Text, Instructions to Subjects in the Pattern Classification Task*, contains the instructions that were given to the subjects.

### Experimental Setup.

Stimulus presentation, feedback, and recording of responses were done using MatLab (MathWorks) and Psychtoolbox (34) on a standard desktop computer (Intel-based; Windows XP; 4 GB RAM). Subjects indicated their choices using a standard computer mouse.

### Behavior analysis.

Subjects’ answers were binary and so their performance was measured via (*i*) learning curves, using a moving average window of 16 steps to smooth the stepwise performance, and (*ii*) block averages, averaging the performances in consecutive blocks of 16 steps.
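The moving-average learning curve in (*i*) amounts to a simple convolution:

```python
import numpy as np

def learning_curve(correct, window=16):
    """Moving-average learning curve over a binary correctness sequence,
    smoothing stepwise performance with a 16-trial window."""
    kernel = np.ones(window) / window
    return np.convolve(correct, kernel, mode="valid")

curve = learning_curve([0] * 8 + [1] * 24)
print(curve[0], curve[-1])  # 0.5 1.0
```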

The mutual information between the subject’s choice, *r*, and the value of a parity feature, *f*_{J}, is estimated from the empirical estimates of the probabilities, *P*(*r*, *f*_{J}), as

$$I(r; f_J) = \sum_{r, f_J} P(r, f_J) \log_2 \frac{P(r, f_J)}{P(r)\,P(f_J)}.$$
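A plug-in estimator of this quantity for binary choices and features:

```python
import numpy as np

def mutual_information(r, f):
    """I(r; f) in bits between two binary sequences, from empirical
    joint frequencies (plug-in estimator)."""
    r, f = np.asarray(r), np.asarray(f)
    info = 0.0
    for rv in np.unique(r):
        for fv in np.unique(f):
            p_joint = np.mean((r == rv) & (f == fv))
            if p_joint > 0:
                p_indep = np.mean(r == rv) * np.mean(f == fv)
                info += p_joint * np.log2(p_joint / p_indep)
    return info

print(mutual_information([1, 1, -1, -1], [1, 1, -1, -1]))  # 1.0 (identical)
print(mutual_information([1, -1, 1, -1], [1, 1, -1, -1]))  # 0.0 (independent)
```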

### Modeling.

The model’s learning dynamics is given by a gradient ascent step (Eq. **3**) as described in the text, followed by renormalizing the weight vector *α⃗* to unit norm.

For a given set of model coefficients, Θ, including a prior, *α⃗*_{t=0}, at *t* = 0 (with *γ*_{t=0} = 0), a certainty, *β*, and a learning rate, *η*, the learning algorithm produces a set of decision probabilities, *P*(1|*x*_{t}; Θ), for the pattern sequence {*x*_{t}}. We then use these for model-fitting and prediction of future answers.

### Model-fitting.

We treat the subject’s sequence of answers, {*r*_{t}}, as a realization of independent binary random variables, *r*_{t} ~ *P*(*r*|*x*_{t}; Θ), and fit each session of each subject with a model that maximizes the log-likelihood:

$$\mathcal{L}(\Theta) = \sum_t \log P(r_t \mid \vec{x}_t; \Theta).$$

Because this function is not concave, we heuristically looked for the optimal point, Θ*, by maximizing the likelihood through a combination of a genetic algorithm and simulated annealing.
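The objective being maximized can be sketched as follows (assuming answers are coded so that `1` matches the model's *P*(1|*x*_{t}; Θ); the heuristic search over Θ is not shown):

```python
import numpy as np

def log_likelihood(answers, probs):
    """Log-likelihood of a subject's binary answers r_t under the model's
    decision probabilities P(1|x_t; Theta): sum_t log P(r_t|x_t; Theta)."""
    answers, probs = np.asarray(answers), np.asarray(probs)
    p_r = np.where(answers == 1, probs, 1.0 - probs)
    return float(np.sum(np.log(p_r)))
```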

### Prediction of subjects’ future behavior based on their individually fitted model.

For prediction, we fit the model based only on the first 64 answers in a session. We then estimated the agreement between the model and the subject as the fraction of subject answers that match the model’s most likely answers over the predicted trials *T*. We compared this with the expected agreement if the subject’s answers were indeed a realization of the model’s decision probabilities, *μ* = 〈max(*P*(1|*x*_{t}; Θ), 1 − *P*(1|*x*_{t}; Θ))〉_{t∈T}, and its SD, *σ*, and used these to estimate the Z-score, *Z* = (agreement − *μ*)/*σ*, as a measure of the difference between the subject and the model.
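A sketch of this Z-score computation, using the per-trial Bernoulli variance to obtain the SD of the expected agreement (our reading of the definition above):

```python
import numpy as np

def agreement_zscore(answers, probs):
    """Agreement of a subject's answers with the model's most likely
    choices, compared with the agreement expected if the answers were
    drawn from the model: Z = (observed - mu) / sigma, where
    mu = <max(p, 1-p)> and sigma follows from per-trial Bernoulli
    variance of the mean agreement."""
    answers, probs = np.asarray(answers), np.asarray(probs)
    most_likely = np.where(probs >= 0.5, 1, -1)
    observed = np.mean(answers == most_likely)
    p_max = np.maximum(probs, 1.0 - probs)
    mu = np.mean(p_max)
    sigma = np.sqrt(np.sum(p_max * (1.0 - p_max))) / len(probs)
    return (observed - mu) / sigma
```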

### Personalized Teaching.

Subjects had to learn a two-bit rule (with *n* = 4). All subjects were presented with the same patterns, except for the fourth block, in which they were shown a sequence of patterns that was optimized for them individually. To fit the model online, we prepared a set of 80 million candidate models, covering a large region of parameter space, for which we stored the decision probabilities for learning steps 1–48 and the most helpful sequence for steps 49–64. The most helpful sequences were found using a Glauber-dynamics search (35) that aimed to bring the models to the closest point to the target rule. While subjects performed steps 1–48, their answers were transmitted over TCP/IP to another computer that calculated the best fit among the 80 million models and transmitted back the pattern sequence for the fourth block. The entire transmission always took less than the waiting time between samples (1 s), and so did not affect the session progress.
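A generic sketch of such a Glauber-dynamics search over candidate teaching sequences (the `score` function, standing in for closeness of the simulated learner to the target rule, and all other names are hypothetical):

```python
import math
import random

def glauber_search(score, init_seq, patterns, n_iters=1000, beta=5.0, seed=0):
    """Glauber-dynamics search over sequences: propose replacing one
    position with a random pattern, and accept with a logistic (heat-bath)
    probability in the score difference, so high-score sequences are
    favored while some exploration remains."""
    rng = random.Random(seed)
    seq = list(init_seq)
    current = score(seq)
    for _ in range(n_iters):
        i = rng.randrange(len(seq))
        proposal = list(seq)
        proposal[i] = rng.choice(patterns)
        new = score(proposal)
        # Heat-bath acceptance: probability 1/(1 + exp(-beta * delta)).
        if rng.random() < 1.0 / (1.0 + math.exp(-beta * (new - current))):
            seq, current = proposal, new
    return seq
```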

## Acknowledgments

We thank Rony Paz and Peter Dayan for valuable suggestions and comments. This work was supported by the Peter and Patricia Gruber Foundation, the Israel Science Foundation, and the Clore Center for Biological Physics (E.S.).

## Footnotes

^{1}To whom correspondence should be addressed. E-mail: elad.schneidman{at}weizmann.ac.il.

Author contributions: Y.C. and E.S. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

*Roy N, McCallum A (June 28–July 1, 2001) Toward optimal active learning through sampling estimation of error reduction. *Proceedings of the 18th International Conference on Machine Learning* (Morgan Kaufmann, San Francisco), pp 441–448.

^{†}Roth S, Black MJ (June 20–25, 2005) Fields of experts: A framework for learning image priors. *2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition* (IEEE, San Diego), Vol 2, pp 860–867.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1211606110/-/DCSupplemental.

## References

1. Shepard RN, Hovland CI, Jenkins HM
5. Knowlton BJ, Squire LR, Gluck MA
6. Gluck MA, Shohamy D, Myers C
8. Shanks D, Speekenbrink M
11. Speekenbrink M, Shanks DR
17. Rescorla R, Wagner A
18. Cohn DA, Ghahramani Z, Jordan MI
27. Nassar MR, Wilson RC, Heasly B, Gold JI
29. Foerde K, Knowlton BJ, Poldrack RA
32. Farries MA, Fairhall AL
## Article Classifications

- Biological Sciences
- Neuroscience

- Social Sciences
- Psychological and Cognitive Sciences