
Amitnikhade

June 14, 2023


“The Markov chain captures the essence of independence between the past and the future, highlighting that the present state holds the key to predicting what lies ahead.”

**The Markov property says that the future is influenced only by the present, not the past. A Markov chain (or Markov process) is a sequence of states that obeys this property. It is a probabilistic model that predicts the next state based solely on the current state, ignoring all previous states: given the present, the future is independent of the past.**

Let’s imagine we are trying to predict the weather. If we observe that it is currently cloudy, we can make an educated guess that it might rain next. We reached this conclusion by only looking at the current weather (cloudy) and not considering what the weather was like before, such as sunny or windy conditions.

Now contrast this with flipping a coin. Each flip is independent and random: whether the previous flip was heads or tails has no impact on the likelihood of getting heads or tails next time. Here the current state tells us nothing about the next one, so there is no state dependence for a Markov model to capture. (Strictly speaking, an independent sequence satisfies the Markov property trivially, but it is a degenerate case: the next outcome depends on neither the past nor the present.)

When we move from one state to another, it is called a transition, and the likelihood of this transition happening is known as a transition probability. We represent this probability as P(s′|s). It tells us the chance of going from the current state s to the next state s′.
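To make this concrete, the weather example above can be sketched as a transition table and sampled with nothing but the standard library. The probabilities below are made up for illustration, not taken from any real weather data:

```python
import random

# Hypothetical transition probabilities P(s'|s) for the weather example.
# Each row sums to 1.
transitions = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.2, "cloudy": 0.4, "rainy": 0.4},
    "rainy":  {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
}

def step(state):
    """Sample the next state using only the current state (Markov property)."""
    next_states = list(transitions[state])
    probs = [transitions[state][s] for s in next_states]
    return random.choices(next_states, weights=probs, k=1)[0]

state = "cloudy"
chain = [state]
for _ in range(5):
    state = step(state)
    chain.append(state)
print(chain)
```

Notice that `step` looks only at `state`; the rest of `chain` is never consulted, which is exactly the Markov property.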

In the context of a Markov Reward Process (MRP), a collection of distinct states is present, where transitions between states occur with specific probabilities. Each state within the MRP is associated with a reward value that indicates its desirability. This reward function is typically represented as R(s).
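As a sketch, an MRP can be built by attaching a reward R(s) to each of the weather states from before. The reward values and the discount factor gamma (a common companion to the reward function, not discussed above) are illustrative assumptions:

```python
# Sketch of a Markov Reward Process on the weather states:
# each state gets a reward R(s). The numbers are made up for illustration.
rewards = {"sunny": 1.0, "cloudy": 0.0, "rainy": -1.0}

def discounted_return(states, gamma=0.9):
    """Discounted sum of rewards: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma**t * rewards[s] for t, s in enumerate(states))

print(discounted_return(["cloudy", "rainy", "sunny"]))
```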

A Markov Decision Process (MDP) extends the MRP with one more ingredient: it consists of states, transition probabilities, a reward function, and actions. Actions are the choices the agent can make in each state to reach a specific goal.
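One minimal way to sketch an MDP in code is a nested table mapping each state and action to its possible outcomes. The two states, two actions, and all numbers below are hypothetical:

```python
import random

# Minimal MDP sketch: P[s][a] is a list of (probability, next_state, reward).
# States, actions, and numbers are illustrative, not from the article.
P = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "s1", 2.0)],
    },
}

def step(state, action):
    """Sample (next_state, reward) from the transition model."""
    outcomes = P[state][action]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r

s_next, r = step("s0", "right")
print(s_next, r)
```

Compared with the Markov chain earlier, the only change is that the next-state distribution now depends on the chosen action as well as the current state.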

A policy guides an agent’s actions in an environment by specifying what action to take in each state.

When the agent interacts with the environment initially, a random policy is often used. This random policy instructs the agent to take random actions in each state. Through multiple iterations, the agent learns from the rewards it receives and gradually improves its actions. With enough iterations, the agent discovers a good policy that yields positive rewards.

This good policy is known as the optimal policy. It is the policy that leads the agent to achieve high rewards and successfully accomplish its goals.

Such a policy is often a **deterministic policy**: one that instructs the agent to take a specific action in a given state.

In a deterministic policy, the robot always chooses the same action for a given situation. It’s like following a fixed rule without any variations. For example, if the robot is at position (2, 3) on a grid, the deterministic policy would always tell it to move down.

In a **stochastic policy**, the robot has some flexibility and can make different choices even in the same situation. It assigns probabilities to different actions. For instance, in position (2, 3), the stochastic policy might say there’s a 40% chance of moving down, 30% chance of moving left, 20% chance of moving right, and 10% chance of moving up. So, each time the robot is at position (2, 3), it randomly selects an action based on these probabilities. It’s like having some randomness and exploring different options.

To put it simply, a deterministic policy is like always following the same fixed rule, while a stochastic policy allows for some randomness and flexibility in the robot’s choices.
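The two policy types can be sketched side by side. The grid position (2, 3) and the 40/30/20/10 split come from the example above; everything else (the default action, the function names) is an assumption for illustration:

```python
import random

# Deterministic policy: one fixed action per state.
def deterministic_policy(state):
    if state == (2, 3):
        return "down"
    return "up"  # arbitrary default for other states

# Stochastic policy: a probability distribution over actions per state
# (the 40/30/20/10 split from the example above, applied to every state
# here for simplicity).
def stochastic_policy(state):
    actions = ["down", "left", "right", "up"]
    probs = [0.4, 0.3, 0.2, 0.1]
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy((2, 3)))                       # always "down"
print({stochastic_policy((2, 3)) for _ in range(100)})    # varies between runs
```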

When the available options for actions are limited and distinct, a stochastic policy is referred to as a **categorical policy**. In simpler terms, a categorical policy uses a probability distribution to decide which action to take when there are specific choices available.

A categorical policy is like choosing from a set menu with specific options. It assigns a probability to each option and samples an action according to those probabilities.
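A common way to build a categorical policy is to turn raw per-action scores (logits) into probabilities with a softmax and then sample. The action names and scores below are made up:

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a categorical probability distribution."""
    m = max(logits.values())                       # subtract max for stability
    exp = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}

def sample_action(logits):
    """Sample one action from the categorical distribution over actions."""
    probs = softmax(logits)
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

logits = {"up": 0.5, "down": 2.0, "left": -1.0}    # illustrative scores
print(sample_action(logits))
```

"down" has the highest score, so it is sampled most often, but the other actions still get picked sometimes; that residual randomness is what distinguishes sampling from always taking the argmax.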

On the other hand, a Gaussian policy is like smoothly adjusting a volume knob or a slider. It generates actions based on a smooth and continuous range of possibilities, like turning the knob to find the right volume level.
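A Gaussian policy can be sketched the same way for a continuous action such as a volume level: the policy outputs a mean and standard deviation, and the action is sampled from that normal distribution. In a real agent these would be learned functions of the state; here they are fixed, made-up numbers:

```python
import random

# Sketch of a Gaussian policy for a continuous action (e.g. a volume level
# in [0, 1]). The mean and std are fixed here for illustration; normally
# they would be produced by a learned function of the state.
def gaussian_policy(state):
    mean, std = 0.5, 0.1
    return random.gauss(mean, std)

actions = [gaussian_policy(None) for _ in range(1000)]
print(sum(actions) / len(actions))  # hovers near the mean, 0.5
```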

Imagine playing an exciting puzzle game where your task is to guide a character through a series of rooms to find a hidden treasure. Each time you try to solve the puzzle, it’s like starting a new episode.

When you begin the game, your character starts in a specific room, which is the starting point or initial state. You have control over the character’s movements like walking, jumping, and interacting with objects in the rooms.

As you progress, you’ll encounter obstacles, traps, and challenging puzzles that you must overcome to reach the final room where the treasure is hidden. This final room represents the end goal or final state. When your character successfully finds the treasure, the episode comes to an exciting end.

With each new attempt or playthrough, you start from a different room, face new challenges, and strive to reach the final goal of discovering the treasure. These episodes give you the opportunity to improve your strategies, learn from mistakes, and track your progress as you solve the puzzle.

The horizon is the number of time steps over which the agent interacts with the environment. A finite horizon means the interaction stops after a fixed number of steps; an infinite horizon means it can continue indefinitely.
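A finite horizon simply caps how many steps contribute to the return. A minimal sketch, assuming a list of per-step rewards and an optional discount factor:

```python
# Sketch: the horizon H caps how many time steps the agent interacts with
# the environment, so only the first H rewards count toward the return.
def finite_horizon_return(step_rewards, horizon, gamma=1.0):
    """Discounted sum of rewards over the first `horizon` steps."""
    return sum(gamma**t * r for t, r in enumerate(step_rewards[:horizon]))

print(finite_horizon_return([1, 1, 1, 1, 1], horizon=3))  # 3
```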

**Reinforcement learning is a sophisticated branch of machine learning that empowers an intelligent agent to dynamically learn and optimize decision-making processes in complex environments. By iteratively interacting with the environment, receiving evaluative feedback in the form of rewards, and leveraging trial and error, the agent adapts its actions to learn optimal strategies aimed at maximizing cumulative rewards.**
