Reinforcement Learning

RL's relation to supervised and unsupervised learning. Source: Jones 2017.

Reinforcement Learning (RL) is a subset of Machine Learning (ML). Whereas supervised ML learns from labelled data and unsupervised ML finds hidden patterns in data, RL learns by interacting with a dynamic environment.

Humans learn from experience. A parent may reward a child for getting good grades, or punish for bad grades. By interacting with peers, parents and teachers, the child learns which habits lead to good grades and which lead to bad ones. The child then follows good habits to obtain good grades and higher rewards. In RL, this sort of feedback is called reward or reinforcement. The essence of RL is learning how to act or behave in order to maximize rewards.

A suitable definition is:

Reinforcement learning is the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment.


  • Could you explain reinforcement learning with an example?
    The game of Pong. Source: Karpathy 2016.

    Thomas Edison, the inventor of the light bulb, reportedly performed thousands of experiments before arriving at a carbon filament taken from a shred of bamboo. Edison said, "I have not failed 10,000 times. I have succeeded in proving that those 10,000 ways will not work." This anecdote is relevant to reinforcement learning. RL welcomes mistakes. RL is about learning what works and what doesn't through many trial-and-error experiments.

    In the game of Pong, a ball bounces back and forth between two paddles. Our RL-trained agent software controls one paddle. The rewards in this simple game are clear: +1 if the ball beats the opponent, -1 if we miss the ball, 0 otherwise. Our agent can see the current state of the game in terms of pixels. Based on this input and what it has learned so far, the agent decides whether to move its paddle up or down. The agent may lose the game many times, but via negative rewards it learns to avoid those paddle actions. Likewise, it learns which paddle actions lead to high rewards.
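    The reward scheme above is simple enough to sketch directly. The function and flag names below are illustrative choices for this example, not taken from any actual Pong implementation.

```python
def pong_reward(ball_passed_opponent: bool, ball_passed_agent: bool) -> int:
    """Reward signal for the Pong example: +1 when the ball gets past
    the opponent, -1 when it gets past our paddle, 0 otherwise."""
    if ball_passed_opponent:
        return 1
    if ball_passed_agent:
        return -1
    return 0
```

    Almost every time step yields a reward of 0; the agent must learn which earlier paddle moves eventually caused the +1 or -1.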

  • Why do we need reinforcement learning?

    Reinforcement learning differs from other ML approaches. Unlike supervised learning, there's no supervision in RL, only a reward signal. Feedback is delayed, not instantaneous. In a chess game, long-term reward is apparent only after a series of moves.

    In supervised or unsupervised learning, a static dataset is usually the input. Reinforcement learning happens within a dynamic environment. Data is not independent and identically distributed. The agent can take many different exploratory paths through the environment. Immediate action affects the subsequent data the agent receives.

    In many complex problems, the only way to learn is by interacting with the environment. In complex board games, it's hard for humans to provide evaluations of a large number of positions. It's easier to learn the evaluation function via rewards.

  • Which are some practical applications of reinforcement learning?

    RL has wide application in many fields: robotics and industrial automation, data science and machine learning, personalized education and training, healthcare, natural language processing, media and advertising, trading and finance, autonomous driving, gaming and many more.

    RL can optimize operations in many industries. Google used it to reduce energy consumption in its data centers. Process planning, demand forecasting, warehouse operations (such as picking and placing), fleet logistics, inventory management, delivery management, fault detection, camera tuning, and computer networking are some applications that RL can optimize.

    We note a few specific areas:

    • Machine Learning: For a given problem, identify suitable neural network architectures. Google's AutoML, based on RL, does exactly this. RL models could also assist software developers in writing computer programs.
    • NLP: Generate summaries from long text. Help chatbots learn from user interactions and respond to queries more naturally. Useful for question answering and machine translation.
    • Media & Advertising: Give personalized content recommendations. Use in cross-channel marketing optimization and real-time bidding systems. Optimize video streaming quality. Deliver more meaningful notifications to users.
    • Healthcare: Dynamic treatment regimes, automated medical diagnosis, process control, drug discovery, health management.
  • How is reinforcement learning related to other fields?
    RL shares similarities with other fields. Source: Sheng 2018.

    RL has interesting parallels with other fields:

    • Neuroscience: The brain learns via reinforcement. Synaptic associations are strengthened or weakened based on conditioned and unconditioned stimuli. In humans, dopamine serves as the reward.
    • Psychology: New behaviours are learned via associations. Reward and punishment are used to modify behaviour. Operant conditioning closely relates to RL.
    • Economics: Behavioural economics is concerned with decision making. While rational economic agents maximize utility, humans are not fully rational. We often optimize for satisfaction. The similarity with RL is that agents often have incomplete information about the environment.
    • Mathematics: Operations Research (OR) is a field that's similar to RL. OR employs analytical methods to aid decision making. System simulations leading to reward maximization or cost minimization are OR concerns.
    • Engineering: Control problems involve a cost function of state and control variables. Differential equations model the path of control variables. Equivalent RL concepts are policy (controller), actions (actuator commands), observations (state feedback) and reward (reference signal).
  • Which are some essential terms in RL?

    It's easier to understand RL once we're familiar with these essential terms:

    • Agent: An entity that's being trained to perform a specific task by interacting with the environment. It's sometimes called a controller.
    • Environment: This is everything except the agent. This includes system dynamics.
    • State: Parameter values that define the current configuration of the environment.
    • Action: An agent outputs an action to the environment and thereby changes the environment's state. The action itself is determined by the observed state and the policy.
    • Reward: The numerical result of taking an action in a given state. Return or utility is the sum of current and future rewards when following a sequence of states. Utility is also called long-term reward.
    • Policy: A function that takes state observations as inputs and outputs actions. It has a logical structure and tunable parameters that the agent learns. The agent's goal is to learn the optimal policy to maximize reward or utility. Equivalently, it's a probabilistic mapping from states to actions.
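    Two of these terms, reward and utility, can be made concrete with a short sketch. Utility (return) is computed here as a discounted sum of rewards; the discount factor `gamma` is a common refinement not spelled out above, and the function name is our own.

```python
def discounted_return(rewards, gamma=0.9):
    """Utility or long-term reward: the sum of current and future
    rewards, with each future reward discounted by gamma per step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward received now is worth more than the same reward later.
print(discounted_return([1, 0, 0]))  # → 1.0
print(discounted_return([0, 0, 1]))  # ≈ 0.81
```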
  • How does learning happen in RL?
    Learning happens via interactions between the agent and its environment. Source: Arulkumaran et al. 2017, fig. 2.

    The agent starts by observing the environment's state. Based on the current policy, the agent acts. This action changes the environment's state. In return, the agent gets a reward. From the reward, the agent determines if the action was beneficial. A positive reward reinforces the action or agent's behaviour. A negative reward informs the agent to avoid such an action, at least in a particular state. This feedback cycle of observe-act-reward-update is repeated many times. The agent learns with each iteration.
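    The observe-act-reward-update cycle can be written down as a minimal loop. The `reset`/`step` interface below mimics the style of popular RL toolkits, and the toy environment is invented purely for illustration.

```python
class TwoStateEnv:
    """Toy environment with states 0 and 1: taking action 1 in
    state 1 pays a reward of +1; everything else pays 0."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = 1 - self.state          # deterministic dynamics
        return self.state, reward

def run_episode(env, policy, steps=10):
    state = env.reset()                      # observe initial state
    total = 0.0
    for _ in range(steps):
        action = policy(state)               # act, per current policy
        state, reward = env.step(action)     # environment changes state
        total += reward                      # reward feedback
        # a learning agent would update its policy here
    return total

# A fixed policy that always picks action 1 earns a reward
# every second step in this environment.
print(run_episode(TwoStateEnv(), policy=lambda s: 1))  # → 5.0
```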

    A perfect policy is learned when the agent knows exactly what action to take in every possible state. In practice, this is rare. The environment is not static. The agent might encounter new states never seen before during the learning phase. In some cases, it's not possible to accurately observe the environment's state. The technical term for this is Partially Observable Markov Decision Process (POMDP).

    Therefore, in practice, we need an RL algorithm. Its role is to update the policy based on states, actions and rewards. In this way, it responds to changing environments.

  • What are the main approaches to reinforcement learning?
    Consult the model before interacting with environment. Source: Adapted from MathWorks 2020, slide 19.

    If the agent has knowledge of the environment and models it, we call it model-based RL. A model guides the agent to learn faster by ignoring low-reward states. If learning is done only via interactions with the environment, without any knowledge of how the environment works, we call it model-free RL.

    In passive RL, the policy is fixed and the agent learns the utilities of states, possibly by learning a model. In active RL, the agent must learn what actions to take. An active agent explores the state space to obtain a better model and higher rewards.

    During state space exploration, the current policy may not be followed. We call this off-policy learning. Otherwise, the algorithm is on-policy learning. A stochastic policy allows exploration.
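    The on-policy/off-policy distinction shows up directly in the update rules of two classic model-free algorithms: Q-Learning (off-policy) and SARSA (on-policy). A tabular sketch with illustrative names; `Q` is assumed to be a dict mapping each state to a dict of action values:

```python
import random

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best next action, regardless
    of which action the exploratory behaviour policy will take."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually chosen by the
    current policy in the next state."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def epsilon_greedy(Q, s, epsilon=0.1):
    """A stochastic policy that allows exploration: act randomly
    with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(list(Q[s]))
    return max(Q[s], key=Q[s].get)
```

    Because Q-Learning bootstraps from the greedy action while behaviour can follow an epsilon-greedy policy, exploration does not bias what it learns; SARSA instead learns the value of the exploratory policy it actually follows.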

    Instead of interacting with the environment, it's possible to learn from large datasets of past interactions. This data-driven approach is called offline or batch RL.

    Where deep neural networks are employed, the term Deep RL is common.

  • What are some specialized approaches in reinforcement learning?

    Typically a single agent learns. When multiple agents coordinate and learn to maximize a common utility function, we call it distributed RL. Multiple agents that typically don't coordinate their actions and have separate utility functions make up multiagent RL. In fact, one agent may work against another (such as in Markov games), thus making the environment non-stationary.

    Hierarchical RL considers hierarchies of policies. Top-level policies might focus on high-level goals while others offer finer control.

    Where it's not easy to define the reward function, the agent can learn from examples without explicit rewards. This is called apprenticeship or imitation learning. Learning the reward function from examples is called inverse RL.

    Learning typically happens with atomic states. Instead, if we use structured representations, the approach is called relational RL.

  • In RL, should an agent learn via simulations or within a real environment?
    Real-world observations train a model that can be used to optimize a policy. Source: Kaiser and Erhan 2019.

    A real environment is better if it's difficult to model or constantly changing. Learning requires lots of samples, and collecting them in a real environment is time consuming. Simulations run faster than real time, and multiple simulations can be executed in parallel. A simulated environment is also safe, and useful for testing rare events such as car crashes. But a simulated environment may not accurately model many aspects of the real one.

    Balancing an inverted pendulum is a task that can be learned in a real environment because it's safe. Training a walking robot that has had no prior training may not be safe or effective in a real environment. It's better to train it in a simulated environment and then use a real environment for scenarios that simulations didn't cover.

    In practice, it's common for an agent to gather real-world observations. These are used to update the current model. The agent then learns within a simulated environment based on the updated model.

  • What are some challenges with or shortcomings of RL?

    RL has progressed in areas such as gaming and robotics where lots of simulated data is available. Translating these advances to practical applications is not trivial. RL demands much more training data than supervised ML.

    Learning from scratch with no prior knowledge, called pure RL, has been criticized because learning is too slow. RL shares some common problems with AI: algorithms are not predictable or explainable, can be trained only for a narrowly-defined task, and don't generalize well unless trained on massive amounts of data.

    Basic assumptions may not hold in real environments. The environment may not be fully observable. Even observed states can be inaccurate. Sometimes it's not obvious or easy to figure out a suitable reward function, especially when there are multiple objectives. While the agent learns by making mistakes, sometimes there's limited freedom to explore. For complex problems, it's not clear how to trade off simulation complexity, training time and real-time performance constraints.

    Many approaches use discrete actions and states. In the real world, agents have to interact in a continuous space, where policy optimization becomes a lot harder. RL algorithms can also get stuck in local optima.



In an English translation of Pavlov's work on conditioned reflexes, the term reinforcement is used for the first time in the context of animal learning. Pavlovian conditioning comes from the work of Ivan Pavlov in the 1890s. Pavlov discovered how dogs would salivate upon seeing food but also when given associated stimuli even without food.


Alan Turing describes in a report the design of a pleasure-pain system. He notes that the computer makes and records a random choice when faced with incomplete data. Subsequently, "When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent."


This decade sees the development of optimal control whose aim is to design a controller that minimizes some measure of a dynamic system's behaviour over time. Richard Bellman develops the relevant theory including the Bellman Equation, dynamic programming and Markov Decision Process (MDP). In 1960, Ronald Howard devises the policy iteration method for MDPs.


Arthur Samuel, as part of his program to play checkers, implements a learning method based on temporal differences. This relies on differences between successive estimates of the same quantity. It's inspired by animal learning psychology and the idea of secondary reinforcers. In 1972, Klopf brings together trial-and-error learning with temporal-difference learning.


Minsky publishes a paper titled Steps Toward Artificial Intelligence that raises many issues relevant to reinforcement learning. He writes about the credit assignment problem, that is, how to credit success when a sequence of decisions has led to the final result. Through the 1960s, the terms "reinforcement" and "reinforcement learning" are increasingly used in the literature. At the same time, some researchers working on pattern recognition and perceptual learning (these really belong to supervised ML) confuse these with RL.

The cart-pole problem has four state variables: cart's position and velocity, pole's angle and its differential. Source: Michie and Chambers 1968, fig. 6.

Improving on their work from the early 1960s, Michie and Chambers train an RL algorithm to play tic-tac-toe and another to balance a pole on a movable cart. The pole-balancing task was learned with incomplete knowledge of the environment. This work influences later research in the field. In 1974, Michie notes that trial-and-error learning is an essential aspect of AI.


Chris Watkins integrates the separate threads of dynamic programming and online learning. He formalizes reinforcement learning with MDP, subsequently adopted by other researchers. He proposes Q-Learning, a model-free method. The work of Watkins was preceded by Paul Werbos in 1977, who saw how dynamic programming could be related to learning methods.

Architecture of Dyna. Source: Sutton 1990, fig. 1.

Sutton proposes Dyna, a class of architectures that integrate reinforcement learning and execution-time planning. The system can alternate between real world and a learned model of the world. Using simulated experiences (planning steps) in the world model, the optimal path is discovered faster. He applies Dyna to both policy iteration and Q-Learning.
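The idea of alternating direct learning with planning steps in a learned model can be sketched as the tabular Dyna-Q variant. The names and structure here are a minimal illustration, not Sutton's original code:

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next,
                n_planning=5, alpha=0.1, gamma=0.9):
    """One Dyna-Q iteration: a direct Q-Learning update from real
    experience, then n_planning updates replayed from the model."""
    def update(s, a, r, s_next):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

    update(s, a, r, s_next)           # learn from the real world
    model[(s, a)] = (r, s_next)       # remember the observed transition
    for _ in range(n_planning):       # learn from simulated experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        update(ps, pa, pr, ps_next)
```

The planning steps reuse remembered transitions, so value estimates improve with far fewer real interactions.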


Gerry Tesauro develops TD-Gammon, which can compete with human experts in the game of backgammon. TD-Gammon learned from self-play alone, without human intervention. Its only rewards came at the end of each game. The evaluation function is a fully connected neural network with one hidden layer of 40 nodes. Previously, Tesauro attempted to train a neural network in a supervised manner with experts assigning relative values to moves. This approach was tedious and the program failed against human players.


Developed by DeepMind Technologies, AlphaGo beats Lee Sedol, a human champion, in the game of Go. AlphaGo was trained using RL from games involving human and computer play. The architecture combines model-based learning using Monte Carlo Tree Search (MCTS) with model-free learning using neural networks. In 2017, AlphaGo Zero is released and beats AlphaGo 100-0. Also in 2017, AlphaZero is released as a generalization of AlphaGo Zero. AlphaZero can play chess, shogi and Go.


  1. Arulkumaran, Kai, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. "A Brief Survey of Deep Reinforcement Learning." arXiv, v2, September 28. Accessed 2021-02-28.
  2. Ashraf, Muhammad. 2018. "Reinforcement Learning Demystified: A Gentle Introduction." Towards Data Science, on Medium, April 8. Accessed 2021-03-01.
  3. Das, Aneek. 2017. "The very basics of Reinforcement Learning." Becoming Human, on Medium, March 26. Accessed 2021-03-01.
  4. Furr, Nathan. 2011. "How Failure Taught Edison to Repeatedly Innovate." Forbes, June 9. Accessed 2021-03-01.
  5. Gadaleta, Francesco. 2019. "Top 4 reasons why reinforcement learning sucks (Ep. 83)." Podcast, Data Science at Home, Amethix Technologies, October 21. Accessed 2021-03-01.
  6. Google Developers. 2020. "Machine Learning Glossary: Reinforcement Learning." Google Developers, February 11. Accessed 2021-02-28.
  7. Hu, Junling. 2016. "Reinforcement learning explained: Learning to act based on long-term payoffs." O'Reilly, December 8. Accessed 2021-02-28.
  8. Hui, Jonathan. 2018. "RL — Reinforcement Learning Terms." Medium, September 18. Accessed 2021-02-28.
  9. Jones, M. Tim. 2017. "Train a software agent to behave rationally with reinforcement learning." Article, Artificial Intelligence, IBM, October 11. Accessed 2021-03-02.
  10. Kaelbling, L. P., M. L. Littman, and A. W. Moore. 1996. "Reinforcement Learning: A Survey." JAIR, vol. 4, pp. 237-285, May 1. Accessed 2021-02-28.
  11. Kaiser, Łukasz and Dumitru Erhan. 2019. "Simulated Policy Learning in Video Models." Google AI Blog, March 25. Accessed 2021-02-28.
  12. Karpathy, Andrej. 2016. "Deep Reinforcement Learning: Pong from Pixels." Blog, May 31. Accessed 2021-02-28.
  13. Kumar, Aviral and Avi Singh. 2020. "Offline Reinforcement Learning: How Conservative Algorithms Can Enable New Applications." Blog, Berkeley Artificial Intelligence Research, December 7. Accessed 2021-02-28.
  14. Kurenkov, Andrey. 2018. "Reinforcement learning’s foundational flaw." The Gradient, July 8. Accessed 2021-03-01.
  15. Li, Yuxi. 2018. "Deep Reinforcement Learning." arXiv, v1, October 15. Accessed 2021-02-28.
  16. Liu, Naijun, Yinghao Cai, Tao Lu, Rui Wang and Shuo Wang. 2020. "Real–Sim–Real Transfer for Real-World Robot Control Policy Learning with Deep Reinforcement Learning." Applied Sciences, MDPI, vol. 10, no. 5, February 25. Accessed 2021-02-28.
  17. Lorica, Ben. 2017. "Practical applications of reinforcement learning in industry." O'Reilly, December 14. Accessed 2021-02-28.
  18. Maruti Techlabs. 2017. "Reinforcement Learning and Its Practical Applications." Chatbots Magazine, on Medium, April 26. Accessed 2021-02-28.
  19. MathWorks. 2020. "Reinforcement Learning with MATLAB: Understanding the Basics and Setting Up the Environment." MathWorks, January. Accessed 2021-02-28.
  20. Mcleod, Saul. 2007. "Pavlov's Dogs." Simply Psychology, February 5. Updated 2018-10-08. Accessed 2021-03-01.
  21. Michie, D. and R. A. Chambers. 1968. "BOXES: An experiment in adaptive control." In: Machine Intelligence 2, E. Dale and D. Michie, Eds. Edinburgh: Oliver and Boyd, pp. 137–152. Accessed 2021-03-02.
  22. Millikin, Mike. 2020. "MIT system trains driverless cars using reinforcement learning with data-driven simulation." Green Car Congress, BioAge Group, LLC, March 26. Accessed 2021-02-28.
  23. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. "Playing Atari with Deep Reinforcement Learning." arXiv, v1, December 19. Accessed 2021-02-28.
  24. Mwiti, Derrick. 2021. "10 Real-Life Applications of Reinforcement Learning." Blog, Neptune, February 25. Accessed 2021-02-28.
  25. Russell, Stuart and Peter Norvig. 2016. "Artificial Intelligence: A Modern Approach." Third Edition, Pearson. Accessed 2021-02-28.
  26. Rutgers. 2012. "The Edisonian - Volume 9 Fall 2012." Newsletter, vol. 9, Thomas A. Edison Papers, School of Arts and Science, Rutgers. Accessed 2021-03-01.
  27. Sheng, Sasha. 2018. "Reinforcement Learning Isomorphisms - Part 1." Blog, January 8. Accessed 2021-02-28.
  28. State, Gavriel. 2020. "How GPUs Can Democratize Deep Reinforcement Learning for Robotics Development." Blog, NVIDIA, December 10. Accessed 2021-02-28.
  29. Sutton, R. S. 1990. "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming." In Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224, Morgan Kaufmann. Accessed 2021-03-01.
  30. Sutton, Richard S. and Andrew G. Barto. 2015. "Reinforcement Learning: An Introduction." Second edition (in progress), The MIT Press, April. Accessed 2021-02-28.
  31. Wikipedia. 2021. "AlphaGo." Wikipedia, January 28. Accessed 2021-03-01.
  32. Xie, Jason. 2017. "When reinforcement learning should not be used?" KDNuggets, December. Accessed 2021-03-01.
  33. Yasser, Shehab. 2020. "A Brief History Of Reinforcement Learning In Game Play." The Startup, on Medium, May 13. Accessed 2021-03-01.
  34. 2021. "Reinforcement Learning Glossary." Accessed 2021-02-28.

Further Reading

  1. Hu, Junling. 2016. "Reinforcement learning explained: Learning to act based on long-term payoffs." O'Reilly, December 8. Accessed 2021-02-28.
  2. Sutton, Richard S. and Andrew G. Barto. 2015. "Reinforcement Learning: An Introduction." Second edition (in progress), The MIT Press, April. Accessed 2021-02-28.
  3. Gosavi, Abhijit. 2019. "A Tutorial for Reinforcement Learning." Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology, September 30. Accessed 2021-02-28.
  4. Dulac-Arnold, Gabriel, Daniel Mankowitz, and Todd Hester. 2019. "Challenges of Real-World Reinforcement Learning." arXiv, v1, April 29. Accessed 2021-03-01.
  5. Kaelbling, L. P., M. L. Littman, and A. W. Moore. 1996. "Reinforcement Learning: A Survey." JAIR, vol. 4, pp. 237-285, May 1. Accessed 2021-02-28.
  6. Kung-Hsiang, Huang. 2018. "Introduction to Various Reinforcement Learning Algorithms. Part I (Q-Learning, SARSA, DQN, DDPG)." Towards Data Science, on Medium, January 12. Accessed 2021-02-28.

Cite As

Devopedia. 2022. "Reinforcement Learning." Version 16, February 15. Accessed 2024-06-25.
Contributed by 4 authors. Last updated on 2022-02-15.