As featured in “Artificial Intelligence for Robotics”

By Francis X. Govers

Copyright © 2018.  Francis X. Govers III

“Control. Control.  You must learn control!” – Yoda, Star Wars: The Empire Strikes Back (1)

Generally, this is the first programming lesson I try to teach any new person approaching robotics.  The first item you build when laying out the software for a robot, or any other autonomous or remotely operated machine, is the control loop.  We will introduce some basic control loop concepts, with an emphasis on soft real time control.

Any robot has some sort of master control loop that takes data in, processes the data, makes a decision, and then takes some action.  The data received could be remote control commands from a user, images from a video camera, or sensor data from a sonar rangefinder.


A central concept in maintaining control of your robot is that this loop needs to run at a steady interval.  If the loop takes a different amount of time for each step, the robot’s movements and decisions will be erratic and unstable.  It also takes quite a bit more math to do any sort of planning or forecasting if the time steps jump around, or jitter.

What we want is to divide each second into a set of intervals the same size, which we will call “frames”, in the same manner as the frames in a motion picture film.  Let’s say that we make each frame 1/20 of a second.  Then we have 20 frames in a second, each lasting 50 milliseconds.   We need to divide our processing into frame-sized chunks that can fit into these intervals.

There are two ways of maintaining a constant loop control rate – one is called “hard real time control” and the other is “soft real time control”, with the primary distinction being that soft real time is any type of control loop that is not hard real time.

Hard real time control is measured and enforced by the computer hardware and operating system.  It generally requires special measures in the OS kernel to manage interrupts and provide process time slices.  In hard real time, if a control cycle takes longer than the allocated time, then a hardware fault occurs, or the process is cut off without completing, leaving the robot in an unknown state.  An operating system with these capabilities is called a Real Time Operating System, or RTOS.  These special operating systems are found on aircraft and spacecraft, where control is critical.

In soft real time, the program or application is responsible for policing its own time intervals – the operating system is not involved.   The program has to monitor the amount of time each frame has taken, and then release the rest of the time in the frame to the operating system.  This is generally done using a “sleep” command.  The advantage of soft real time is that it can be used just about anywhere without requiring a special operating system.  The disadvantage is that frame times can vary based on the care the programmer takes to manage tasks and make sure that enough frame time is available for the robot to do all of its processing.

In Chapter 1 of the book, we present one method for providing soft real time control in Python.  The process is fairly straightforward.  The robot’s main processing loop is set up to run at a frame rate of 20 hertz.  At the beginning of the loop, the system time is recorded.  The robot steps through processing any available input data by polling serial ports and network interfaces, interprets any commands that have come in, looks for decisions to be made, and sends out motor commands, also via serial port.

At the end of each frame, the time is again measured, and the program looks to see how long the frame took.  Let’s say we expended 10 milliseconds of our allotted 50.  We send the Python “sleep” command with a value of 40 milliseconds, allowing the CPU to do other things during that time.

Now the next task is a bit of extra work that can help keep your timing more accurate.  The sleep command is not all that precise – at least, not as precise as we would like.  We take another time measurement after the sleep command has returned, and see if we have indeed used our 50 milliseconds.  The number that comes back may be 5 or 10 milliseconds off, or, more rarely, as much as 30 or 40 milliseconds off.  We call this difference “error”, for timing error, and subtract that value from the next sleep we request.  We are now actively correcting our frame back in line with our desired frame rate.  This extra bit of work makes a big difference in keeping the frame rate as constant as possible compared to the wall clock.  Without this correction, our frame intervals will drift and let problems creep into driving the robot.
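As a rough illustration, the frame-timing scheme described above can be sketched in Python.  The function and frame count here are illustrative, not code from the book:

```python
import time

FRAME_TIME = 0.05  # 50 ms per frame gives a 20 Hz frame rate

def run_frames(process_frame, frames):
    """Run process_frame at a steady rate, correcting for sleep() error."""
    error = 0.0  # timing error carried over from the previous frame
    for _ in range(frames):
        frame_start = time.monotonic()
        process_frame()  # poll inputs, decide, send motor commands
        elapsed = time.monotonic() - frame_start
        # Release the unused remainder of the frame to the OS,
        # minus the amount by which the previous frame overshot
        spare = FRAME_TIME - elapsed - error
        if spare > 0:
            time.sleep(spare)
        # Measure how far this frame deviated from the 50 ms target;
        # the next frame sleeps less (or more) by this amount
        error = (time.monotonic() - frame_start) - FRAME_TIME
```

Run against a wall clock, twenty frames of this loop should take very close to one second, even though each individual sleep is imprecise.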

A lot of robotics texts either gloss over soft real time control, or don’t emphasize the importance of maintaining a constant frame rate when controlling an unmanned vehicle or robot.  If you advance your robotics skills to the point where you are integrating data from an inertial measurement unit, with accelerometers and gyroscopes, then having a constant time interval becomes very important.



Artificial Intelligence for Robotics by Francis X. Govers is published by Packt Publishing and is available on the Packt website and on Amazon.


1) Yoda quote: Star Wars: The Empire Strikes Back

#MachineLearning #ArtificialIntelligence #Technology #Author #ebook #Robotics

Learn how to balance a CartPole using machine learning in this article by Sean Saito, the youngest ever Machine Learning Developer at SAP and the first bachelor hire for the position. He currently researches and develops machine learning algorithms that automate financial processes.

This article will show you how to solve the CartPole balancing problem. The CartPole is an inverted pendulum, where the pole is balanced against gravity. Traditionally, this problem is solved by control theory, using analytical equations. However, in this article, you’ll learn to solve the problem with machine learning.

OpenAI Gym

OpenAI is a non-profit organization dedicated to researching artificial intelligence, and the technologies developed by OpenAI are free for anyone to use.


Gym provides a toolkit to benchmark AI-based tasks. The interface is easy to use, and the goal is to enable reproducible research. An agent can be taught inside the gym, and it can learn activities such as playing games or walking. The gym is a library of problems, called environments.

The standard set of problems presented in the gym is as follows:

  • CartPole
  • Pendulum
  • Space Invaders
  • Lunar Lander
  • Ant
  • Mountain Car
  • Acrobot
  • Car Racing
  • Bipedal Walker

Any algorithm can be tried out in the gym by training on these tasks. All of the problems share the same interface, so any general reinforcement learning algorithm can be used through it.

Installing Gym

The primary interface of the gym is used through Python. Once you have Python 3 in an environment with the pip installer, the gym can be installed as follows:

sudo pip install gym

Advanced users who want to modify the source can compile from the source using the following commands:

git clone

cd gym

pip install -e .

A new environment can be added to the gym with the source code. There are several environments that need more dependencies. For macOS, install the dependencies using the following command:

brew install cmake boost boost-python sdl2 swig wget

For Ubuntu, use the following commands:

apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig

Once the dependencies are present, install the complete gym as follows:

pip install 'gym[all]'

This will install most of the environments that are required.

Running an environment

Any gym environment can be initialized and run using a simple interface:

  1. First, import the gym library:

import gym

  2. Next, create an environment by passing an argument to make. In the following code, CartPole is used as an example:

environment = gym.make('CartPole-v0')

  3. Next, reset the environment:

environment.reset()
  4. Then, start an iteration, take a random action from the action space at every step, and render the environment:

for dummy in range(100):
    environment.render()
    environment.step(environment.action_space.sample())

Running the preceding program should produce a visualization, with the CartPole moving as the random actions are taken. The scene should start as follows:


The preceding image is called a CartPole. The CartPole is made up of a cart that can move horizontally and a pole that can move rotationally, with respect to the center of the cart. The pole is pivoted to the cart. After some time, you will notice that the pole is falling to one side, as shown in the following image:


After a few more iterations, the pole will swing back, as shown in the following image. All movements are constrained by the laws of physics. The steps are taken randomly:


Other environments can be run in a similar way, by replacing the argument of the gym environment, such as MsPacman-v0 or MountainCar-v0.

Markov models

The problem is set up as a reinforcement learning problem, with a trial and error method. The environment is described using state_values, and the state_values are changed by actions. The actions are determined by an algorithm, based on the current state_value, in order to achieve a particular state_value; this is termed a Markov model.

In an ideal case, past state_values do have an influence on future state_values, but here, you assume that the current state_value encodes all of the previous state_values. There are two types of state_values: one is observable, and the other is non-observable. A model that also takes the non-observable state_values into account is called a Hidden Markov model.


At each step of the cart and pole, several variables can be observed, such as the position, velocity, angle, and angular velocity. The possible actions of the cart are moving right and left:

  • state_values: four dimensions of continuous values
  • Actions: two discrete values

The dimensions, or spaces, can be referred to as the state_value space and the action space.

  1. Start by importing the required libraries, as follows:

import gym

import numpy as np

import random

import math

  2. Next, make the environment for playing CartPole, as follows:

environment = gym.make('CartPole-v0')

  3. Define the number of buckets and the number of actions, as follows:

no_buckets = (1, 1, 6, 3)

no_actions = environment.action_space.n

  4. Define the state_value_bounds, as follows:

state_value_bounds = list(zip(environment.observation_space.low, environment.observation_space.high))

state_value_bounds[1] = [-0.5, 0.5]

state_value_bounds[3] = [-math.radians(50), math.radians(50)]

  5. Next, define the action_index, as follows:

action_index = len(no_buckets)

  6. Now, define the q_value_table, as follows:

q_value_table = np.zeros(no_buckets + (no_actions,))

  7. Define the minimum exploration rate and the minimum learning rate:

min_explore_rate = 0.01

min_learning_rate = 0.1

  8. Define the maximum episodes, the maximum time steps, the streak to the end, the solving time, the discount, and the number of streaks, as constants:

max_episodes = 1000

max_time_steps = 250

streak_to_end = 120

solved_time = 199

discount = 0.99

no_streaks = 0

  9. Define the select_action function, which decides the action, as follows:

def select_action(state_value, explore_rate):
    if random.random() < explore_rate:
        # Explore: pick a random action
        action = environment.action_space.sample()
    else:
        # Exploit: pick the best known action
        action = np.argmax(q_value_table[state_value])
    return action

  10. Now, define the explore rate, as follows:

def select_explore_rate(x):
    return max(min_explore_rate, min(1, 1.0 - math.log10((x+1)/25)))

  11. Define the learning rate, as follows:

def select_learning_rate(x):
    return max(min_learning_rate, min(0.5, 1.0 - math.log10((x+1)/25)))

  12. Next, bucketize the state_value, as follows:

def bucketize_state_value(state_value):
    bucket_indexes = []
    for i in range(len(state_value)):
        if state_value[i] <= state_value_bounds[i][0]:
            bucket_index = 0
        elif state_value[i] >= state_value_bounds[i][1]:
            bucket_index = no_buckets[i] - 1
        else:
            bound_width = state_value_bounds[i][1] - state_value_bounds[i][0]
            offset = (no_buckets[i]-1)*state_value_bounds[i][0]/bound_width
            scaling = (no_buckets[i]-1)/bound_width
            bucket_index = int(round(scaling*state_value[i] - offset))
        bucket_indexes.append(bucket_index)
    return tuple(bucket_indexes)


  13. Train the episodes, as follows:

for episode_no in range(max_episodes):
    explore_rate = select_explore_rate(episode_no)
    learning_rate = select_learning_rate(episode_no)

    observation = environment.reset()

    start_state_value = bucketize_state_value(observation)
    previous_state_value = start_state_value

    for time_step in range(max_time_steps):
        selected_action = select_action(previous_state_value, explore_rate)
        observation, reward_gain, completed, _ = environment.step(selected_action)
        state_value = bucketize_state_value(observation)
        best_q_value = np.amax(q_value_table[state_value])
        q_value_table[previous_state_value + (selected_action,)] += learning_rate * (
            reward_gain + discount * best_q_value - q_value_table[previous_state_value + (selected_action,)])

  14. Print all relevant metrics for the training process, as follows:

        print('Episode number : %d' % episode_no)
        print('Time step : %d' % time_step)
        print('Selected action : %d' % selected_action)
        print('Current state : %s' % str(state_value))
        print('Reward obtained : %f' % reward_gain)
        print('Best Q value : %f' % best_q_value)
        print('Learning rate : %f' % learning_rate)
        print('Explore rate : %f' % explore_rate)
        print('Streak number : %d' % no_streaks)


        if completed:
            print('Episode %d finished after %f time steps' % (episode_no, time_step))
            if time_step >= solved_time:
                no_streaks += 1
            else:
                no_streaks = 0
            break

        previous_state_value = state_value

    if no_streaks > streak_to_end:
        break

  15. After training for a period of time, the CartPole will be able to balance itself, as shown in the following image:


You have successfully written a program that stabilizes the CartPole using a trial and error approach.
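The exploration-rate schedule defined in step 10 above decays logarithmically with the episode number: early episodes explore at the maximum rate of 1, and by episode 249 the rate is clamped to its floor. This can be checked with a standalone sketch using the same constants as above:

```python
import math

min_explore_rate = 0.01

def select_explore_rate(x):
    # Full exploration for roughly the first 25 episodes,
    # then a logarithmic decay down to min_explore_rate
    return max(min_explore_rate, min(1, 1.0 - math.log10((x + 1) / 25)))

print(select_explore_rate(0))    # 1 (capped at the maximum)
print(select_explore_rate(249))  # 0.01 (clamped to min_explore_rate)
```

The learning-rate schedule behaves the same way, with a cap of 0.5 and a floor of min_learning_rate.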

If you found this article interesting, you can explore Python Reinforcement Learning Projects to implement state-of-the-art deep reinforcement learning algorithms using Python and its powerful libraries. Python Reinforcement Learning Projects will give you hands-on experience with eight reinforcement learning projects, each addressing different topics and/or algorithms.

Learn the basics of ML-Agents and run a sample game in this guest post by Micheal Lanham, a software architect and the author of Learn Unity ML-Agents – Fundamentals of Unity Machine Learning.

Before you start using the ML-Agents platform with Unity to build ML models you need to pull down the ML-Agents package from GitHub using git. Open up a command prompt or shell window and follow along:

  1. Navigate to your work or root folder:


  2. Execute the following command:

      mkdir ML-Agents

  3. This will create the folder ML-Agents. Now, execute the following:

      cd ML-Agents

      git clone

  4. This uses git to pull down the required files for ML-Agents into a new folder called ml-agents. git will show the files as they are getting pulled into the folder. You can verify that the files have been pulled down successfully by changing to the new folder and executing the following commands:

      cd ml-agents


Good, that should have been fairly painless. If you had issues pulling the code down, you can always visit the ML-Agents page on GitHub and manually pull the code down.

Now that you have ML-Agents installed, take a look at one of the sample projects that ship with the toolkit in the next section.

Running a sample

Unity ships the ML-Agents package with a number of prepared samples that demonstrate various aspects of learning and training scenarios. Open up Unity, load a sample project and get a feel for how the ML-Agents run by following this exercise:

  1. Open the Unity editor and go to the starting Project dialog.
  2. Click the Open button at the top of the dialog. Navigate to and select the ML-Agents/ml-agents/unity-environment folder, as shown in the following screenshot:


  3. This will load the unity-environment project into the Unity editor. Depending on the Unity version you are using, you may get a warning that the version needs to be upgraded. As long as you are using a recent version of Unity, you can just click Continue. If you do experience problems, try upgrading or downgrading your version of Unity.
  4. Locate the Scene file in the Assets/ML-Agents/Examples/3DBall folder of the Project window, as shown in the following screenshot:


  5. Double-click the 3DBall scene file to open the scene in the editor.
  6. Press the Play button at the top center of the editor to run the scene. You will see that the scene starts running and that the balls are being dropped, but the balls just fall off the platforms. This is because the scene starts up in Player mode, which means you can control the platforms with keyboard input. Try to balance the balls on the platforms using the arrow keys on the keyboard.
  7. When you are done running the scene, click the Play button again to stop it.

Setting the agent Brain

As you just witnessed, the scene is currently set for Player control, but you’d obviously want to see how some of this ML-Agents stuff works. In order to do that, you need to change the Brain type the agent is using. Follow along to switch the Brain type in the 3D Ball agent:

  1. Locate the Ball3DAcademy object in the Hierarchy window and expand it to reveal the Ball3DBrain object.
  2. Select the Ball3DBrain object and then look at the Inspector window, as shown in the following screenshot:


  3. Switch the Brain component's Brain Type to Heuristic. The Heuristic brain setting is for agents that are internally coded within Unity scripts in a heuristic manner. Heuristic programming is nothing more than selecting a simpler, quicker solution where a classic ML algorithm might take longer. Writing a Heuristic brain can often help you better define a problem. A majority of current game AIs fall within the category of heuristic algorithms.
  4. Press Play to run the scene. Now, you will see the platforms balancing each of the balls – very impressive for a heuristic algorithm. Next, open the script with the heuristic brain and take a look at some of the code.
  5. Click the Gear icon beside the Ball 3D Decision (Script) component, and from the context menu, select Edit Script, as shown in the following screenshot:


  6. Take a look at the Decide method in the script, as follows:

      public float[] Decide(
              List<float> vectorObs,
              List<Texture2D> visualObs,
              float reward,
              bool done,
              List<float> memory)
      {
          if (gameObject.GetComponent<Brain>().brainParameters.vectorActionSpaceType
              == SpaceType.continuous)
          {
              List<float> act = new List<float>();

              // state[5] is the velocity of the ball in the x orientation.
              // We use this number to control the Platform's z axis rotation
              // so that the Platform is tilted in the x orientation
              act.Add(vectorObs[5] * rotationSpeed);

              // state[7] is the velocity of the ball in the z orientation.
              // We use this number to control the Platform's x axis rotation
              // so that the Platform is tilted in the z orientation
              act.Add(-vectorObs[7] * rotationSpeed);

              return act.ToArray();
          }

          // If the vector action space type is discrete, then we don't do anything
          return new float[1] { 1f };
      }


  7. Look at how simple the code is. This is the heuristic brain that is balancing the balls on the platform, which is fairly impressive when you see the code. The question that may hit you is: why bother with ML programming, then? The simple answer is that the 3D ball problem is deceptively simple and can be easily modeled with eight states. Take a look at the code again and you can see that only eight states are used (0 to 7), with each state representing the direction the ball is moving in. This works well for this problem, but when you get to more complex examples, you may have millions upon billions of states – hardly something you could easily solve using heuristic methods.

That’s it! If you found this article interesting and want to try your hands at Unity ML, you can refer to Learn Unity ML-Agents – Fundamentals of Unity Machine Learning. Packed with numerous hands-on examples, the book takes you from the basics of reinforcement and Q learning to building Deep Recurrent Q-Network agents that cooperate or compete in a multiagent ecosystem.

Learn about the Markov Chain and the Markov Decision Process in this guest post by Sudarshan Ravichandran, a data scientist and AI enthusiast, and the author of Hands-On Reinforcement Learning with Python.

A mathematical framework for solving reinforcement learning (RL) problems, the Markov Decision Process (MDP) is widely used in various optimization problems; almost all RL problems can be modeled as an MDP. This tutorial will take you through the nuances of MDP and its applications.

Before going into MDP, you must first understand the Markov chain and Markov process, which form the foundation of MDP.

The Markov property states that the future depends only on the present and not on the past. The Markov chain is a probabilistic model that depends solely on the current state to predict the next state, and not on the previous states. This means that the future is conditionally independent of the past. The Markov chain strictly follows the Markov property.

For example, if you know that the current state is cloudy, you can predict that the next state could be rainy. You came to the conclusion that the next state could be rainy only by considering the current state (cloudy) and not the past states, which might be sunny or windy.

However, the Markov property does not hold true for all processes. For example, throwing a die (the next state) has no dependency whatsoever on the previous number (the current state).

Moving from one state to another is called transition and its probability is called a transition probability. The transition probabilities can be formulated in the form of a table, as shown next, and it is called a Markov table. It shows, given the current state, what the probability of moving to the next state is:

Current state | Next state | Transition probability

You can also represent the Markov chain in the form of a state diagram that shows the transition probabilities:


The preceding state diagram shows the probability of moving from one state to another. Still don't understand the Markov chain? Okay, let’s talk.

Me: "What are you doing?"

You: "I'm reading about the Markov chain."

Me: "What is your plan after reading?"

You: "I'm going to sleep."

Me: "Are you sure you're going to sleep?"

You: "Probably. I'll watch TV if I'm not sleepy."

Me: "Cool; this is also a Markov chain."

You: "Eh?"

The above conversation can be formulated into a Markov chain. The state diagram will be as follows:


The core concept of the Markov chain is that the future depends only on the present and not on the past. A stochastic process that follows the Markov property is called a Markov process.
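To make the transition table concrete, here is a small weather chain in Python. The states and transition probabilities below are invented for illustration; they are not values from the book:

```python
import random

# Hypothetical transition table: current state -> {next state: probability}
transitions = {
    'sunny':  {'sunny': 0.6, 'cloudy': 0.3, 'rainy': 0.1},
    'cloudy': {'sunny': 0.3, 'cloudy': 0.3, 'rainy': 0.4},
    'rainy':  {'sunny': 0.2, 'cloudy': 0.4, 'rainy': 0.4},
}

def next_state(current):
    """Sample the next state using only the current state (Markov property)."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return random.choices(states, weights=weights)[0]

# Each row of the table must be a valid probability distribution
for row in transitions.values():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```

Note that next_state never looks at how the chain arrived at the current state; that is exactly the conditional independence the Markov property describes.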

MDP is an extension of the Markov chain and provides a mathematical framework for modeling decision-making situations. An MDP is represented by five important elements:

  • A set of states (S) the agent can actually be in
  • A set of actions (A) that can be performed by an agent, for moving from one state to another
  • A transition probability, P(s' | s, a), which is the probability of moving from one state s to another state s' by performing some action a
  • A reward probability, R(s, a, s'), which is the probability of a reward acquired by the agent for moving from one state s to another state s' by performing some action a
  • A discount factor, γ, which controls the importance of immediate and future rewards; this is discussed in detail in the next section
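Putting the five elements together, a toy MDP can be written down directly in Python. The two-state example below is made up for illustration, and is not an example from the book:

```python
# A made-up two-state MDP: states, actions, transition and reward tables,
# and a discount factor gamma.
states = ['s0', 's1']
actions = ['stay', 'move']

# P[(s, a)] -> {next state: probability}
P = {
    ('s0', 'stay'): {'s0': 1.0},
    ('s0', 'move'): {'s1': 0.9, 's0': 0.1},
    ('s1', 'stay'): {'s1': 1.0},
    ('s1', 'move'): {'s0': 0.9, 's1': 0.1},
}

# R[(s, a, s')] -> reward; transitions not listed yield zero reward
R = {
    ('s0', 'move', 's1'): 1.0,  # reward for successfully reaching s1
}

gamma = 0.9  # discount factor

def expected_reward(s, a):
    """Average the reward over the possible next states."""
    return sum(p * R.get((s, a, s2), 0.0) for s2, p in P[(s, a)].items())
```

Here, taking 'move' in 's0' yields an expected reward of 0.9, since the move succeeds with probability 0.9 and earns nothing otherwise.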

Rewards and returns

In an RL environment, an agent interacts with the environment by performing an action and moves from one state to another. Based on the action it performs, it receives a reward. A reward is nothing but a numerical value, say, +1 for a good action and -1 for a bad action.

How do you decide if an action is good or bad? In a maze game, a good action is when the agent makes a move such that it doesn't hit a maze wall; a bad action is when the agent moves and hits the maze wall. 

An agent tries to maximize the total amount of rewards (cumulative rewards) it receives from the environment, rather than immediate rewards. The total amount of rewards the agent receives from the environment is called the return. So, you can formulate the return received by the agent as follows:

Rt = rt+1 + rt+2 + ... + rT

Here, rt+1 is the reward received by the agent at time step t+1 while performing an action to move from one state to another, rt+2 is the reward received by the agent at time step t+2 while performing an action to move from one state to another and, similarly, rT is the reward received by the agent at the final time step T while performing an action to move from one state to another.

Episodic and continuous tasks

Episodic tasks are the tasks that have a terminal state (end). In RL, episodes are considered agent-environment interactions from initial to final states.

For example, in a car racing video game, you start the game (initial state) and play the game until it is over (final state). This is called an episode. Once the game is over, you start the next episode by restarting the game, and you will begin from the initial state irrespective of the position you were in the previous game. So, each episode is independent of the other.

In a continuous task, there is no terminal state. Continuous tasks will never end. For example, a personal assistance robot does not have a terminal state.

Discount factor

You have seen that an agent’s goal is to maximize the return. For an episodic task, you can define the return as Rt = rt+1 + rt+2 + ... + rT, where T is the final time step of the episode.

Since there is no final state for a continuous task, you can define the return for continuous tasks as Rt = rt+1 + rt+2 + ..., which sums up to infinity. But how do you maximize the return if it never stops?

That's why the notion of a discount factor is introduced. You can redefine the return with a discount factor γ, as follows:

Rt = rt+1 + γ rt+2 + γ² rt+3 + ...          ... (1)

Rt = Σ (k=0 to ∞) γᵏ rt+k+1                 ... (2)

The discount factor decides how much importance is given to future rewards relative to immediate rewards. The value of the discount factor lies between 0 and 1. A discount factor of 0 means that immediate rewards are more important, while a discount factor of 1 means that future rewards are more important than immediate rewards.

An agent with a discount factor of 0 will never learn, as it considers only the immediate rewards; similarly, an agent with a discount factor of 1 will keep looking for future rewards forever, which may lead to infinity. So, in practice, the optimal value of the discount factor lies between 0.2 and 0.8.
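The effect of the discount factor can be seen with a few lines of Python. The reward sequence here is arbitrary, chosen only to illustrate equation (1):

```python
def discounted_return(rewards, gamma):
    """Compute Rt = rt+1 + gamma*rt+2 + gamma^2*rt+3 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]

# gamma = 0 keeps only the immediate reward; larger gammas weight the future
print(discounted_return(rewards, 0.0))   # 1.0
print(discounted_return(rewards, 0.5))   # 1.875 = 1 + 0.5 + 0.25 + 0.125
print(discounted_return(rewards, 1.0))   # 4.0 (every reward counts equally)
```

The same four rewards are worth anywhere between 1.0 and 4.0 depending on γ, which is exactly the immediate-versus-future trade-off described above.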

You give importance to immediate and future rewards depending on the use case. In some cases, future rewards are more desirable than immediate rewards and vice versa. In a chess game, the goal is to defeat the opponent's king.

If you give importance to the immediate reward, which is acquired by actions like a pawn capturing an opponent piece, the agent will learn to perform this sub-goal instead of learning to reach the actual goal. In such a case, you must give importance to future rewards.

In some other cases, immediate rewards are preferred over future rewards. (Say, would you prefer chocolates today or 13 months later?)

The policy function

The policy function can be represented as follows:

π : S → A

This indicates a mapping from states to actions. So, basically, a policy function tells us the action to be performed in each state. The ultimate goal lies in finding the optimal policy, which specifies the correct action to perform in each state and thereby maximizes the reward.

State value function

A state value function is also simply called a value function. It specifies how good it is for an agent to be in a particular state with a policy π. A value function is often denoted by V(s). It denotes the value of a state following a policy.

The state value function can be defined as follows: