Reinforcement Learning

Reinforcement learning is a type of machine learning that allows software agents and machines to automatically determine the ideal behavior within a specific context so as to maximize their performance.

Reinforcement learning is a popular type of AI. It is sometimes compared to supervised learning with only partial information: the agent receives a reward signal rather than the correct answer.

Reinforcement learning is a kind of machine learning in which the system being trained for a particular job learns on its own, based on its previous experiences and the outcomes of doing similar jobs.

Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.


  • Why Reinforcement learning...?

    Reinforcement learning is a powerful machine learning technique for solving problems in dynamic and adaptive environments. Combined with a simulation or digital twin, reinforcement learning can train models to automate or optimize the efficiency of industrial systems and processes.

  • Concepts with Reinforcement learning
    Parallel concepts with RL. Source: SASHA SHENG Blog

    It's important to highlight the isomorphisms between reinforcement learning techniques and other fields.

    • Neuroscience: Neuroscientists have been studying how the brain generates behaviors for decades. At the neural level, reinforcement allows for the strengthening of synaptic associations between pathways carrying conditioned and unconditioned stimulus information.
    • Psychology: Classical conditioning is learning new behaviors through a series of associations. Operant conditioning is a learning process through which the strength of a behavior is modified by reward and punishment. RL is more closely related to operant conditioning, because that is literally how an agent is trained.
    • Economics: Economic agents were traditionally portrayed as fully rational Bayesian maximizers of subjective utility. However, studies have shown that the agents (us humans) aren't fully rational. We frequently optimize for satisfaction rather than optimality.
    • Mathematics: Operations research is a field that focuses on using analytical methods to make better business decisions. A central question is how to simulate a system efficiently and accurately so that optimizations can be performed on top of it to minimize cost, maximize reward, and so on.
    • Engineering: Optimal control is a research area focused on finding a control law for a given system such that a certain optimality criterion is achieved.
  • How Reinforcement Learning works
    Agent, Action, Environment and State Relation. Source:

    To have a reference frame for the type of problem we want to solve, we will start by going back to a mathematical concept developed in the late 1950s, called the Markov decision process.

    • Markov decision process: In reinforcement learning, the problem to be optimized is modeled as a Markov decision process. This is a mathematical model that aids decision making in situations where the outcomes are partly random and partly under the control of an agent.

    The main elements of this model are an Agent, an Environment, and a State, as shown in the diagram:

    • The agent can perform certain actions (such as moving the paddle left or right).
    • Actions change the environment and can lead to a new state $s_{t+1}$, where the agent can perform another action $a_{t+1}$.

    State is the information used to determine what happens next. The set of states, actions, and rewards, together with the rules for transitioning from one state to another, make up a Markov decision process. Formally, the state is a function of the history:

    $S_t = f(H_t)$
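
    The interaction described above can be sketched as a simple loop. This is a minimal illustration, not a reference implementation: the `ToyEnvironment` class, its five-position track, and the choice of $f$ as "keep the latest observation" are all assumptions made for the example.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D track: positions 0..4, reward 1 on reaching position 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the track
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1 if done else 0
        return self.position, reward, done


def state_from_history(history):
    # S_t = f(H_t): here f keeps only the latest observation, which is
    # sufficient for this toy problem because it is Markov
    return history[-1][0]


def run_episode(seed=0, max_steps=100):
    rng = random.Random(seed)
    env = ToyEnvironment()
    history = [(env.position, 0)]        # H_t as (observation, reward) pairs
    for _ in range(max_steps):
        state = state_from_history(history)   # what a learning agent would condition on
        action = rng.choice([-1, 1])          # this random agent ignores the state
        observation, reward, done = env.step(action)
        history.append((observation, reward))
        if done:
            break
    return history
```

    A learning agent would replace the random choice with a rule that depends on `state`; the rest of the loop stays the same.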

  • RL techniques: Q-learning
    Q-Learning table of states by actions that is initialized to zero, then each cell is updated through training. Source:

    Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-Learning is to learn a policy, which tells an agent which action to take under which circumstances. It does not require a model of the environment and can handle problems with stochastic transitions and rewards, without requiring adaptations.

    For any finite Markov decision process (FMDP), Q-learning eventually finds an optimal policy.

    Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function that represents the maximum discounted future reward when we perform action a in state s.
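
    To make "discounted future reward" concrete, here is a small sketch that computes it for a given sequence of rewards. The function name `discounted_return` and the default discount factor of 0.9 are assumptions for illustration only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

    For example, with `gamma=0.5` the rewards `[1, 1, 1]` give 1 + 0.5 + 0.25 = 1.75, so rewards further in the future count for less.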

    We can define the Q-function for a transition point $(s_t, a_t, r_t, s_{t+1})$ in terms of the Q-function at the next point $(s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2})$, similar to what we did with the total discounted future reward. This equation is known as the Bellman equation for Q-learning:
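
    A common way to write this equation, assuming deterministic transitions and using $\gamma$ for the discount factor (a symbol the text does not name, so it is an assumption here), is:

$$
Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
$$

    For stochastic transitions, the right-hand side is taken in expectation over the next state.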

  • Algorithm for Q-Learning
    Q-Learning formula. Source:
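
    A standard way to write the Q-learning update, assuming $\gamma$ denotes the discount factor (a symbol not named in the text), is:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$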

    where $r_t$ is the reward observed for the current state $s_t$ and $\alpha$ is the learning rate ($0 < \alpha \le 1$).

    An episode of the algorithm ends when state $s_{t+1}$ is a final or terminal state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.
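
    The following is a minimal sketch of tabular Q-learning, not a definitive implementation. The five-state chain environment, the `step` and `train` functions, the epsilon-greedy exploration rule, and all hyperparameter values are assumptions chosen to keep the example small.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment: reward 1 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q-table of states by actions, initialized to zero
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                # Explore: pick any action at random
                action = rng.choice(ACTIONS)
            else:
                # Exploit: pick a best-valued action, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # A terminal state has no future value, so the episode ends there
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

    In this toy chain, the learned value of moving right should end up higher than the value of moving left in every non-terminal state, and because the discount factor is below 1 the values stay finite even though the agent can wander back and forth.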

  • Example of Reinforcement learning
    The game of Pong. Source: Andrej Karpathy blog

    The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we’ll use, you play as one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player. At the low level, the game works as follows: we receive an image frame (a 210x160x3 byte array of integers from 0 to 255 giving pixel values) and we get to decide if we want to move the paddle UP or DOWN (i.e. a binary choice). After every single choice, the game simulator executes the action and gives us a reward: either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.

  • Graph of PONG game
    Example of a simple MDP with three states (green circles) and two actions (orange circles), with two rewards (orange arrows). Source:

    The game can be viewed as a graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.
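
    When the graph is small and fully known, the optimal way of acting can be computed directly. The sketch below uses value iteration, a dynamic programming technique that, unlike Q-learning, assumes the transition probabilities are given. The three-state MDP in `TRANSITIONS`, its state and action names, and its rewards are hypothetical and are not taken from the figure.

```python
# Hypothetical MDP: TRANSITIONS[state][action] = [(probability, next_state, reward), ...]
TRANSITIONS = {
    "A": {"stay": [(1.0, "A", 0.0)],
          "go":   [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 1.0)],
          "go":   [(1.0, "C", 0.0)]},
    "C": {"stay": [(1.0, "C", 2.0)],
          "go":   [(1.0, "A", 0.0)]},
}


def value_iteration(transitions, gamma=0.9, tolerance=1e-10):
    values = {state: 0.0 for state in transitions}

    def action_value(state, action):
        # Expected immediate reward plus discounted value of the successor state
        return sum(p * (reward + gamma * values[nxt])
                   for p, nxt, reward in transitions[state][action])

    while True:
        delta = 0.0
        for state in transitions:
            best = max(action_value(state, a) for a in transitions[state])
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tolerance:   # stop once no state value changes noticeably
            break

    # The greedy policy picks the highest-valued action in each state
    policy = {state: max(transitions[state], key=lambda a: action_value(state, a))
              for state in transitions}
    return values, policy
```

    For this toy graph the values can be checked by hand: staying in C is worth 2 / (1 - 0.9) = 20, going from B to C is worth 0.9 × 20 = 18, and A is worth 12.96 / 0.82, roughly 15.8, so the greedy policy is "go" in A and B and "stay" in C.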

  • Libraries for Reinforcement learning
    • OpenAI Gym: A popular environment for developing and comparing reinforcement learning models. It is compatible with numerical computation libraries such as TensorFlow. It is a Python-based, rich AI simulation environment. Gym also offers APIs that feed observations and rewards back to agents.
    • TensorFlow: This is another well-known open-source library by Google, followed by more than 95,000 developers every day in areas of natural language processing, intelligent chatbots, robotics, and more.

    The TensorFlow community supports framework development in most popular languages, such as Python, C, Java, JavaScript and Go.

    • Keras: Keras makes it simple to implement neural networks with just a few lines of code. It provides a high-level interface to the tensor computation framework TensorFlow and focuses on the model architecture.
    • DeepMind Lab: DeepMind Lab is a customizable 3D platform from Google for agent-based AI research. It is used to understand how autonomous artificial agents learn complicated tasks in large, partially observed environments.
    • PyTorch: PyTorch, open sourced by Facebook, is another well-known deep learning library adopted by many reinforcement learning researchers. It has been used to implement well-known RL algorithms such as policy gradient and a simplified Actor-Critic method.
  • What makes Reinforcement learning different from other ML paradigms?
    • There is no supervisor, only a reward signal.
    • Feedback is delayed, not instantaneous.
    • Time really matters (sequential, non-i.i.d. data).
    • The agent's actions affect the subsequent data it receives.
  • Practical Applications
    Manufacture, Inventory using RL. Source: chatbotsmagazine
    • Manufacturing: A robot uses deep RL to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do this job with great speed and precision. Many e-commerce sites and supermarkets use these intelligent robots.
    • Inventory Management: A major issue in supply chain inventory management is the coordination of inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand.
    • Delivery Management: Reinforcement learning is used to solve the problem of Split Delivery Vehicle Routing. Q-learning is used to serve appropriate customers with just one vehicle.
    • Power Systems: Reinforcement learning and optimization techniques are used to assess the security of electric power systems and to enhance microgrid performance.
    • Finance Sector: A Q-learning algorithm can learn a trading strategy from one simple instruction: maximize the value of the portfolio. It can potentially generate income without the trader worrying about the market price or the risks involved, since the algorithm takes these into consideration when making a trade.



The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others by extending a nineteenth-century theory of Hamilton and Jacobi.


The class of methods for solving optimal control problems by solving the Bellman equation came to be known as dynamic programming (Bellman, 1957a).


Bellman (1957b) introduced Markovian decision processes (MDPs), and Ron Howard (1960) devised the policy iteration method for MDPs.


In 1961 and 1963, Donald Michie described a simple trial-and-error learning system for learning how to play tic-tac-toe.


Michie has consistently emphasized the role of trial and error and learning as essential aspects of artificial intelligence.


Sutton developed Klopf's ideas further, particularly the links to animal learning theories. He and Barto refined these ideas and developed a psychological model of classical conditioning based on temporal-difference learning.


Finally, the temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning.


The first demonstrations of multi-headed learning in reinforcement learning were by Jaderberg et alia.


  1. Andrej Karpathy. 2016. "Deep Reinforcement Learning: Pong from Pixels." Accessed 2018-07-28.
  2. Ben Lorica. 2017. "Practical applications of reinforcement learning in industry." Accessed 2018-07-28.
  3. Igor Halperin. 2018. "Reinforcement Learning in Finance." Accessed 2018-07-28.
  4. Kaelbling et al. 1996. "Reinforcement Learning: A Survey." Article. Accessed 2018-07-28.
  5. Mark Hammond. 2018. "Get your hard hat: Intelligent industrial systems with deep reinforcement learning." Accessed 2018-07-28.
  6. Maruti Techlabs. 2017. "Reinforcement Learning and Its Practical Applications." Accessed 2018-07-28.
  7. Pravin Dhandre. 2017. "How Reinforcement Learning works." Accessed 2018-07-28.
  8. Pravin Dhandre. 2018. "Top 5 tools for reinforcement learning." Accessed 2018-07-28.
  9. Q-learning. 2018. "Q-learning." Accessed 2018-07-28.
  10. Q-Learning Matrix. 2018. "File:Q-Learning Matrix Initialized and After Training.png." Accessed 2018-07-28.
  11. SASHA SHENG Blog. 2018. "Reinforcement Learning Isomorphisms - Part 1." Accessed 2018-07-28.
  12. Sugandha Lahoti. 2018. "How Google’s DeepMind is creating images with artificial intelligence." Accessed 2018-07-28.
  13. Sutton, Barto. 1978. "History of Reinforcement Learning." Accessed 2018-07-27.

Further Reading

  1. Junling Hu. 2016. "Reinforcement learning explained." Learning to act based on long-term payoffs. December 8 2016. Accessed 2018-07-27.
  2. Richard S. Sutton and Andrew G. Barto. 2014, 2015. "Reinforcement Learning: An Introduction." Accessed 2018-07-27.
  3. Abhijit Gosavi. February 11, 2017. "A Tutorial for Reinforcement Learning." Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology. Accessed 2018-07-27.

Cite As

Devopedia. 2020. "Reinforcement Learning." Version 10, January 6. Accessed 2021-02-08.
Contributed by
2 authors

Last updated on
2020-01-06 08:36:25
