Here we provide a brief review of current findings in this domain, with a particular emphasis on neuroimaging and behavioral findings in humans. The goal of an RL agent is to determine a policy (a set of actions to be taken in different states of the world) so as to maximize expected future reward [1]. Some RL algorithms accomplish this by learning the expected reward that follows from taking a given action (i.e. an action value), and then selecting a policy favoring more valuable actions. Interest in the application of RL to neuroscience emerged following the finding that the phasic activity of dopamine neurons resembles the implementation of a prediction error from a temporal difference algorithm, in which the difference between successive predictions of future reward, plus the reward available at a given time, is used to learn an updated representation of the value of a given action in a particular state [3,4]. Neuroimaging studies have also identified BOLD correlates of temporal difference prediction error (TDPE) signals in target areas of dopamine neurons, including the ventral and dorsal striatum [5–7] (Figure 1A), and in midbrain dopaminergic nuclei [8]. In addition to prediction errors, RL value signals have been found in the ventromedial prefrontal cortex (vmPFC) in human neuroimaging studies, but also in intra-parietal and supplementary motor cortices [9–11]. Collectively, these findings provide support for the explanatory power of simple RL models in accounting for key aspects of the neural mechanisms underpinning learning from reward.
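As a minimal illustration of the kind of algorithm referred to here, the sketch below implements a temporal-difference action-value update in which the prediction error is exactly the quantity described above: the reward available now, plus the next prediction of future reward, minus the current prediction. The toy interface, parameter values and epsilon-greedy policy are illustrative assumptions, not details taken from the studies reviewed.

```python
import random
from collections import defaultdict

class TDLearner:
    """Minimal model-free temporal-difference (TD) learner (SARSA-style update)."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # cached action values Q(s, a)
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # temporal discount factor
        self.epsilon = epsilon        # exploration rate

    def select_action(self, state):
        # epsilon-greedy policy: mostly favor more valuable actions
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_action):
        # TD prediction error: reward now plus the discounted next prediction,
        # minus the current prediction (the TDPE signal discussed in the text)
        target = reward + self.gamma * self.q[(next_state, next_action)]
        delta = target - self.q[(state, action)]
        # move the cached action value a fraction alpha toward the target
        self.q[(state, action)] += self.alpha * delta
        return delta
```

On each step the agent observes a reward and the next state, computes the prediction error delta, and nudges the cached action value accordingly; a positive delta means outcomes were better than predicted, a negative delta that they were worse.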

It has been proposed that there are multiple systems for RL, as opposed to just a single system. One system is ‘Model-Free’ (MF), in that this algorithm does not learn a model of the structure of the world, but instead learns about the value of actions on the basis of past reinforcement, using the TDPE signal reviewed earlier. By contrast, in ‘Model-Based’ (MB) RL, the agent encodes an internal model of the world, that is, the relationship between states, actions and subsequent states, and the outcomes experienced in those states, and then computes values on-line by searching prospectively through that internal model [12–14]. Interest in the applicability of MB RL schemes emerged because MF RL algorithms alone cannot explain the behavioral distinction between goal-directed action selection, in which actions are chosen with respect to the current incentive value of an associated outcome, and habitual action selection, in which an action is elicited by an antecedent stimulus without reference to the current incentive value of the outcome [15,16].
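To make the MF/MB contrast concrete, the sketch below evaluates the same choice in two ways: a model-free agent reads out cached values built up from past reinforcement (e.g. by the TD update sketched above), while a model-based agent holds an explicit transition-and-reward model and computes values by searching forward through it. The two-step task, its transitions and rewards, and the cached values are invented for illustration.

```python
# Illustrative contrast between model-free (cached) and model-based
# (prospective) valuation on a toy two-step task.

GAMMA = 0.95

# Internal model of the world: state -> action -> (next_state, reward)
MODEL = {
    "start":  {"left": ("room_a", 0.0), "right": ("room_b", 0.0)},
    "room_a": {"left": ("end", 1.0),    "right": ("end", 0.0)},
    "room_b": {"left": ("end", 0.0),    "right": ("end", 2.0)},
    "end":    {},
}

def model_based_value(state, action, depth=3):
    """Compute a value on-line by searching prospectively through the model."""
    next_state, reward = MODEL[state][action]
    if depth == 0 or not MODEL[next_state]:
        return reward
    # value of the best continuation from the successor state
    future = max(model_based_value(next_state, a, depth - 1)
                 for a in MODEL[next_state])
    return reward + GAMMA * future

# Model-free agent: no world model, only cached action values
# accumulated from past reinforcement.
cached_q = {("start", "left"): 0.9, ("start", "right"): 0.4}

def model_free_value(state, action):
    return cached_q.get((state, action), 0.0)

if __name__ == "__main__":
    for a in ("left", "right"):
        print(a,
              "MB:", round(model_based_value("start", a), 2),
              "MF:", round(model_free_value("start", a), 2))
```

If the reward at the end of the right branch were devalued in the model, the model-based value of choosing "right" would fall immediately, whereas the cached model-free value would only change after further reinforcement, which parallels the goal-directed versus habitual distinction described above.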
