Reinforcement Learning
 Summary

Discussion
 Why Reinforcement learning...?
 Concepts with Reinforcement learning
 How Reinforcement Learning works
 RL techniques: Q-learning
 Algorithm for Q-learning
 Example of Reinforcement learning
 Graph of PONG game
 Libraries for Reinforcement learning
 What makes Reinforcement learning different from other ML paradigms?
 Practical Applications
 Milestones
 References
 Further Reading
 Article Stats
 Cite As
Reinforcement learning is a type of machine learning that allows software agents and machines to automatically determine the ideal behavior within a specific context, so as to maximize performance.
Reinforcement learning is a popular form of AI; it resembles supervised learning, but the agent is given only partial feedback in the form of rewards rather than labelled examples.
Reinforcement learning is a kind of machine learning in which the system to be trained learns on its own, based on its previous experiences and outcomes while doing a similar kind of job.
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Discussion
Why Reinforcement learning...? Reinforcement learning is a powerful machine learning technique for solving problems in dynamic and adaptive environments. Combined with a simulation or digital twin, reinforcement learning can train models to automate industrial systems and processes or optimize their efficiency.
Concepts with Reinforcement learning It's important to highlight reinforcement learning's isomorphisms with other fields.
 Neuroscience: Neuroscientists have been studying how the brain generates behaviors for decades. At the neural level, reinforcement strengthens synaptic associations between pathways carrying conditioned and unconditioned stimulus information.
 Psychology: Classical conditioning is learning new behaviors through a series of associations. Operant conditioning is a learning process in which the strength of a behavior is modified by reward and punishment. RL is more closely related to operant conditioning, since that is essentially how an agent is trained.
 Economics: Economic agents were long portrayed as fully rational Bayesian maximizers of subjective utility. However, studies have shown that agents (we humans) aren't fully rational; we frequently settle for satisfactory outcomes rather than optimal ones.
 Mathematics: Operations research is a field that focuses on using analytical methods to make better business decisions: how do you efficiently and accurately simulate a system so that you can optimize on top of it to minimize cost, maximize reward, and so on?
 Engineering: Optimal control is a research area focused on finding a control law for a given system such that a certain optimality criterion is achieved.
How Reinforcement Learning works To have a frame of reference for the type of problem we want to solve, we will start by going back to a mathematical concept developed in the 1950s, called the Markov decision process.
 Markov decision process: When talking about reinforcement learning, we want to solve a Markov decision process. It is a mathematical model that aids decision making in situations where the outcomes are partly random and partly under the control of an agent.
The main elements of this model are an Agent, an Environment, and a State, as shown in the diagram:
 The agent can perform certain actions (such as moving the paddle left or right).
 Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1.
 The state is the information used to determine what happens next. The set of states, actions, and rewards, together with the rules for transitioning from one state to another, make up a Markov decision process. Formally, the state is a function of the history:
St = f(Ht)
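A Markov decision process can be sketched as a small data structure: for each (state, action) pair, a distribution over (next state, reward) outcomes. The states, actions, and numbers below are purely illustrative, not taken from any particular benchmark:

```python
import random

# Toy Markov decision process: for each (state, action) pair, a list of
# (probability, next_state, reward) outcomes. All names are illustrative.
transitions = {
    ("cool", "run"):  [(0.5, "cool", 2), (0.5, "hot", 2)],
    ("cool", "idle"): [(1.0, "cool", 1)],
    ("hot", "run"):   [(1.0, "broken", -10)],
    ("hot", "idle"):  [(1.0, "cool", 1)],
}

def step(state, action):
    """Sample a next state and reward from the transition model."""
    outcomes = transitions[(state, action)]
    r = random.random()
    cumulative = 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return outcomes[-1][1:]  # guard against floating-point rounding
```

Because the transition probabilities depend only on the current state and action, not on the full history, this table satisfies the Markov property.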
RL techniques: Q-learning Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-learning is to learn a policy that tells an agent which action to take under which circumstances. It does not require a model of the environment and can handle problems with stochastic transitions and rewards without requiring adaptations.
For any finite Markov decision process (FMDP), Q-learning eventually finds an optimal policy.
Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function, which represents the maximum discounted future reward when we perform action a in state s.
We can define the Q-function for a transition (st, at, rt, st+1) in terms of the Q-function at the next transition (st+1, at+1, rt+1, st+2), similar to what we did with the total discounted future reward. This equation is known as the Bellman equation for Q-learning:

Q(st, at) ← Q(st, at) + α [rt + γ max_a Q(st+1, a) − Q(st, at)]

Algorithm for Q-learning Here rt is the reward observed for the current state st, α (alpha) is the learning rate (0 < α ≤ 1), and γ is the discount factor.
An episode of the algorithm ends when state st+1 is a final or terminal state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem contains infinite loops.
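The update rule described above can be sketched in a few lines of tabular Q-learning. The state and action names below are hypothetical, and the learning-rate and discount values are just common defaults:

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Hypothetical two-action task: one update after observing reward 1.0
# for taking "right" in state "s0" and landing in state "s1".
Q = defaultdict(float)           # unseen (state, action) pairs default to 0
actions = ["left", "right"]
q_learning_update(Q, "s0", "right", 1.0, "s1", actions)
```

After this single step, Q[("s0", "right")] moves a fraction alpha of the way toward the observed target, reflecting how the table converges gradually over many transitions.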
Example of Reinforcement learning The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we'll use, you play one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player. At the low level, the game works as follows: we receive an image frame (a 210x160x3 byte array, with integers from 0 to 255 giving pixel values) and we get to decide if we want to move the paddle UP or DOWN (a binary choice). After every single choice, the game simulator executes the action and gives us a reward: +1 if the ball went past the opponent, -1 if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.
Graph of PONG game The game can be viewed as a graph where each node is a particular game state and each edge a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.
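On a known graph like this, one standard way to compute the optimal way of acting is value iteration (a dynamic-programming method, distinct from Q-learning, which does not need the graph in advance). The tiny deterministic graph below is invented purely for illustration:

```python
# Deterministic toy game graph: state -> {action: (next_state, reward)}.
# Terminal states have no outgoing edges. All names are invented.
graph = {
    "start": {"up": ("mid", 0), "down": ("lose", -1)},
    "mid":   {"up": ("win", 1), "down": ("start", 0)},
    "win":   {},
    "lose":  {},
}

def value_iteration(graph, gamma=0.9, sweeps=50):
    """Repeatedly back up each state's value from its best action."""
    V = {s: 0.0 for s in graph}
    for _ in range(sweeps):
        for s, acts in graph.items():
            if acts:
                V[s] = max(r + gamma * V[s2] for s2, r in acts.values())
    return V

V = value_iteration(graph)
# Greedy policy: in each non-terminal state, pick the edge with the
# best one-step reward plus discounted value of the successor.
policy = {s: max(acts, key=lambda a: acts[a][1] + 0.9 * V[acts[a][0]])
          for s, acts in graph.items() if acts}
```

Here the backup converges quickly because the graph is tiny; Pong's real state graph is far too large to enumerate, which is why model-free methods like Q-learning are used instead.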
Libraries for Reinforcement learning  OpenAI Gym: The most popular environment for developing and comparing reinforcement learning models; it is fully compatible with computational libraries such as TensorFlow. It's a Python-based, rich AI simulation environment, and it offers APIs that facilitate feeding observations, along with rewards, back to agents.
 TensorFlow: Another well-known open-source library by Google, followed by more than 95,000 developers every day in areas such as natural language processing, intelligent chatbots, robotics, and more.
The TensorFlow community enables framework development in popular languages such as Python, C, Java, JavaScript, and Go.
 Keras: Keras offers simplicity in implementing neural networks, with just a few lines of code and fast execution. It provides senior developers and principal scientists with a high-level interface to a heavy tensor computation framework, TensorFlow, and focuses on the model architecture.
 DeepMind Lab: DeepMind Lab is a Google 3D platform with customization for agent-based AI research. It is used to understand how self-sufficient artificial agents learn complicated tasks in large, partially observed environments.
 PyTorch: PyTorch, open-sourced by Facebook, is another well-known deep learning library adopted by many reinforcement learning researchers, notably for renowned RL algorithms like policy gradient and the simplified actor-critic method.
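Most of these libraries interact with environments through the Gym-style reset/step calling convention. A minimal sketch of that convention, using a stand-in environment written here for illustration rather than the real gym package:

```python
import random

class CoinFlipEnv:
    """Stand-in environment exposing a Gym-style reset/step interface.
    This is not the real gym package, only the same calling convention."""

    def reset(self):
        self.steps_left = 10
        return 0                      # initial observation

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.steps_left == 0
        return 0, reward, done, {}    # observation, reward, done, info

env = CoinFlipEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([0, 1])    # random policy for the sketch
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

Swapping in a real environment (such as Gym's Pong) changes only the construction line; the agent loop of observe, act, and receive reward stays the same.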
What makes Reinforcement learning different from other ML paradigms?  There is no supervisor, only a reward signal.
 Feedback is delayed, not instantaneous.
 Time really matters (sequential, non-i.i.d. data).
 An agent's actions affect the subsequent data it receives.
Practical Applications  Manufacturing: A robot uses deep RL to pick a device from one box and put it in a container. Whether it succeeds or fails, it memorizes the object, gains knowledge, and trains itself to do the job with great speed and precision. Many e-commerce sites and supermarkets use such intelligent robots.
 Inventory Management: A major issue in supply chain inventory management is coordinating the inventory policies adopted by different supply chain actors, such as suppliers, manufacturers, and distributors, so as to smooth material flow and minimize costs while responsively meeting customer demand.
 Delivery Management: Reinforcement learning is used to solve the problem of split delivery vehicle routing. Q-learning is used to serve appropriate customers with just one vehicle.
 Power Systems: Reinforcement learning and optimization techniques are used to assess the security of electric power systems and to enhance microgrid performance.
 Finance Sector: A Q-learning algorithm can learn an optimal trading strategy with one simple instruction: maximize the value of the portfolio. It can potentially gain income without worrying about the market price or the risks involved, since the algorithm takes all of these into consideration when making a trade.
Milestones
The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. One approach to this problem was developed in the mid-1950s by Richard Bellman and others, by extending a nineteenth-century theory of Hamilton and Jacobi.
The class of methods for solving optimal control problems by solving the Bellman equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem, known as Markovian decision processes (MDPs), and Ron Howard (1960) devised the policy iteration method for MDPs.
In 1961 and 1963, Donald Michie described a simple trial-and-error learning system for learning how to play tic-tac-toe. Michie has consistently emphasized the role of trial and error and learning as essential aspects of artificial intelligence.
Sutton developed Klopf's ideas further, particularly the links to animal learning theories. He and Barto refined these ideas and developed a psychological model of classical conditioning based on temporal-difference learning.
Finally, the temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning.
The first demonstrations of multiheaded learning in reinforcement learning were by Jaderberg et al.
References
 Andrej Karpathy. 2016. "Deep Reinforcement Learning: Pong from Pixels." Accessed 2018-07-28.
 Ben Lorica. 2017. "Practical applications of reinforcement learning in industry." Accessed 2018-07-28.
 Igor Halperin. 2018. "Reinforcement Learning in Finance." Accessed 2018-07-28.
 Kaelbling et al. 1996. "Reinforcement Learning: A Survey." Article, Accessed 2018-07-28.
 Mark Hammond. 2018. "Get your hard hat: Intelligent industrial systems with deep reinforcement learning." Accessed 2018-07-28.
 Maruti Techlabs. 2017. "Reinforcement Learning and Its Practical Applications." Accessed 2018-07-28.
 Pravin Dhandre. 2017. "How Reinforcement Learning works." Accessed 2018-07-28.
 Pravin Dhandre. 2018. "Top 5 tools for reinforcement learning." Accessed 2018-07-28.
 Q-learning. 2018. "Q-learning." Accessed 2018-07-28.
 Q-Learning Matrix. 2018. "File:Q-Learning Matrix Initialized and After Training.png." Accessed 2018-07-28.
 Sasha Sheng Blog. 2018. "Reinforcement Learning Isomorphisms - Part 1." Accessed 2018-07-28.
 Sugandha Lahoti. 2018. "How Google's DeepMind is creating images with artificial intelligence." Accessed 2018-07-28.
 Sutton, Barto. 1978. "History of Reinforcement Learning." Accessed 2018-07-27.
Further Reading
 Junling Hu. 2016. "Reinforcement learning explained: Learning to act based on long-term payoffs." December 8, 2016. Accessed 2018-07-27.
 Richard S. Sutton and Andrew G. Barto. 2014, 2015. "Reinforcement Learning: An Introduction." Accessed 2018-07-27.
 Abhijit Gosavi. February 11, 2017. "A Tutorial for Reinforcement Learning." Department of Engineering Management and Systems Engineering, Missouri University of Science and Technology. Accessed 2018-07-27.
Article Stats
Cite As
See Also
 Q-Learning
 Error-Driven Learning
 Data Science
 Machine Learning
 Artificial Neural Network
 Dynamic Programming