Let's understand the two objects that make up an RL problem: the agent and the environment. The agent, often what we colloquially call the AI, performs decision-making, i.e., takes actions, so it is our main interest in a learning problem. In robotics, the agent is the robot itself, including the hardware, so it does not just mean a computer program. The environment provides state information that the agent uses to make decisions, and then provides reward information about the results of the agent's actions. In the figure below, the environment first provides the game screen as state information, and once the agent decides how to play, it shows the score as a reward.
Reinforcement learning assumes sequential interaction between the agent and the environment. The term interaction here refers to the exchange of information: a decision made by the agent results in a reaction from the environment, and the environment in turn provides information that the agent can use when making its next decision. Let's see in detail how this interaction occurs.
The agent and the environment interact at each of a sequence of time steps, $t=0,1,2,\ldots, T$. We set $T<\infty$ when a termination condition exists and $T=\infty$ otherwise.
State (or observation) refers to the information that the environment outputs. We use the notation $S_{t}\in\mathcal{S}$ to denote the state information at time $t$; at time step $t=0$, an initial state $S_{0}$ is given. Here $\mathcal{S}$ denotes the state space, the set from which states are drawn at every time step. For example, suppose you consider a game screen as a state. In that case, the state space is the set of all possible images the game simulator can generate. The state space $\mathcal{S}$ may seem important, but it is not actually crucial in practice: it is only needed to specify the probability space on which we define the joint distribution over states and the transition $S_{t} \rightarrow S_{t+1}$, which describes how the state changes between time steps.
Action refers to the information that the agent outputs and the environment receives. We use the notation $A_{t}\in\mathcal{A}$ to denote the action information at time $t$. Analogous to the state space, $\mathcal{A}$ denotes the action space, the set of possible actions the agent can choose from. Let’s go back to the game. A player presses buttons on the controller while looking at the screen (state $S_{t}$). Pressing a button corresponds to an action $A_{t}$, and the set of button combinations that can be pressed on the controller becomes the action space $\mathcal{A}$. In contrast to the state space, it is important to define the action space explicitly and to eliminate infeasible actions, since our agent searches for the best action by exploring the action space during training.
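To make state and action spaces concrete, here is a minimal sketch that inspects them for one environment from the OpenAI Gym library (introduced at the end of this section); the environment id `CartPole-v1` and the classic Gym API are assumptions made only for illustration.

```python
import gym

env = gym.make("CartPole-v1")

# The state space S: a 4-dimensional Box (cart position, cart velocity,
# pole angle, pole angular velocity).
print(env.observation_space)

# The action space A: Discrete(2), i.e., push the cart left or right.
print(env.action_space)

# The agent may only choose actions from A; sample() draws a random feasible action.
print(env.action_space.sample())
```

Note that `sample()` only returns feasible actions, which is one practical benefit of defining $\mathcal{A}$ explicitly.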
Reward is additional information the environment provides to the agent as a consequence of an action. When the agent decides on an action $A_{t}$ based on the current state $S_{t}$, the environment responds with the next state $S_{t+1}$ and the reward $R_{t+1}\in\mathbb{R}$, which is always a numeric value. Overall, the sequential interaction between the agent and the environment proceeds in the repeating order of state, action, reward, and this sequence is called a trajectory $\tau$:
$\tau=\{S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, A_{2},\ldots\}$
Remember the order as SARSA (not to be confused with SALSA!).
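The SARSA ordering of a trajectory can be seen directly in code. The sketch below collects one trajectory with a purely random policy, assuming the classic OpenAI Gym interface (introduced in the next paragraph), where `reset()` returns the initial state and `step()` returns a 4-tuple; newer Gymnasium releases return slightly different tuples.

```python
import gym

env = gym.make("CartPole-v1")

state = env.reset()                 # S_0
trajectory = [state]

done = False
while not done:
    action = env.action_space.sample()               # A_t: random policy, for illustration only
    next_state, reward, done, _ = env.step(action)   # environment returns S_{t+1} and R_{t+1}
    trajectory.extend([action, reward, next_state])  # ..., A_t, R_{t+1}, S_{t+1}, ...
    state = next_state

# trajectory now reads S_0, A_0, R_1, S_1, A_1, R_2, ... in SARSA order.
print(len(trajectory))
```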
Yes, we are still missing a very important component of the framework: how does the environment determine the next state $S_{t+1}$ and the reward $R_{t+1}$ to transmit to the agent at time step $t+1$, given the action $A_{t}$ made by the agent? In the real world, this is determined by physical laws and perceived through sensors before being transmitted to the computer program. In a simulated world, the rules must be embedded in advance. These rules are called the model or the dynamics. In the case of board games, e.g., Go or chess, the dynamics are easy to implement since the result of an action is deterministic and the next state comes from the opponent's move, which determines the transition. Similarly, computer games come with a built-in game engine, so we can use it as the dynamics. Therefore, do not worry too much when working with games or ready-made simulators. However, if you want to apply reinforcement learning to a real-world problem, most of your time will be spent building the dynamics into the environment! Fortunately, AI researchers have developed many open-source environments, so in this class we can focus only on the learning algorithms. In particular, we are going to use OpenAI Gym.
Examples of OpenAI Gym Environments
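Even when an open-source environment such as Gym supplies the dynamics for us, it helps to see how small they can be. Below is a minimal sketch of a toy environment whose transition and reward rules are embedded in its `step` method; the corridor setting and every name in it are illustrative and not part of Gym.

```python
# A toy hand-written environment: a 1-D corridor where the agent moves left or
# right and is rewarded for reaching the rightmost cell. The transition rule and
# the reward rule together form the dynamics (model) of this environment.
class CorridorEnv:
    def __init__(self, length=5):
        self.length = length    # cells 0, 1, ..., length-1; the goal is the last cell
        self.state = 0

    def reset(self):
        self.state = 0          # S_0: always start at the left end
        return self.state

    def step(self, action):
        # Transition rule: action 1 moves right, any other action moves left.
        if action == 1:
            self.state = min(self.state + 1, self.length - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = (self.state == self.length - 1)
        reward = 1.0 if done else 0.0   # reward rule: +1 only upon reaching the goal
        return self.state, reward, done, {}
```

A simulator like Gym hides far more elaborate dynamics (a physics engine or a game engine) behind the same `reset()`/`step()` interface.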