Madness of Randomness in the World of Markov Decision Processes!! #MDP state, action and reward. | by Abdurahman Hussain | May, 2023


A Markov decision process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are used in a wide range of fields, including artificial intelligence (AI), operations research, economics, game theory, and control engineering. In this article, we concentrate on the application of MDPs in AI.

Introduction to Bellman’s Equation

A key idea in dynamic programming and reinforcement learning is Bellman’s equation. It is named after Richard Bellman, who developed it in the 1950s.

It is used to compute the optimal policy for a Markov decision process (MDP).

The equation is a recursive expression that relates the value of a state to the values of its possible successor states. It can be written as:

V(s) = max_a [ r(s,a) + gamma * sum_s' [ P(s' | s,a) * V(s') ] ]

where:

  • V(s) is the value of being in state s
  • max_a [ ] means the maximum over all possible actions a
  • r(s,a) is the immediate reward obtained by taking action a in state s
  • gamma is the discount factor, which determines the importance of future rewards relative to immediate rewards
  • P(s' | s,a) is the probability of transitioning to state s' given that action a is taken in state s
  • sum_s' [ ] means the sum over all possible successor states s'

The equation essentially says that the value of a state is the maximum expected sum of discounted future rewards that can be obtained from that state, taking into account all possible actions and successor states. It is recursive because it depends on the values of successor states, which themselves depend on the values of their successor states, and so on.
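The recursion can be made concrete with value iteration, which repeatedly applies the right-hand side of the equation until the values stop changing. The following is a minimal sketch, not a reference implementation: the two-state MDP, its state and action names, the rewards, and the discount factor 0.9 are all made up for illustration.

```python
# Hypothetical two-state MDP, invented only for illustration.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is r(s, a).
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}
GAMMA = 0.9  # assumed discount factor


def bellman_backup(V, s):
    """Right-hand side of Bellman's equation for state s."""
    return max(
        R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])
        for a in P[s]
    )


def value_iteration(tol=1e-10, max_iters=10_000):
    """Apply the backup to every state until the values stop changing."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {s: bellman_backup(V, s) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V
    return V


V = value_iteration()
print(V)
```

In this toy problem, staying in B pays 1 at every step, so V(B) converges to 1 / (1 - 0.9) = 10, and V(A) ends up somewhat lower because reaching B takes time and can fail.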

Bellman’s equation is commonly used in reinforcement learning algorithms to iteratively update the values of states as the agent learns from experience. The equation can also be extended to include the value of taking a particular action in a state, leading to the Q-value function, which is another essential concept in reinforcement learning.
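For reference, the action-value form is usually written as follows. It uses the same symbols as the state-value equation above; the only addition is a', which is introduced here to denote an action available in the successor state s'.

```latex
Q(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, \max_{a'} Q(s',a'),
\qquad
V(s) = \max_{a} Q(s,a)
```

Read this way, Q(s,a) is the quantity inside the max of the state-value equation, and taking the maximum over actions recovers V(s).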

Solving Markov Decision Processes

The goal in solving an MDP is to find a policy π that maps each state s to an action a, such that the expected long-term reward of following the policy is maximized. In other words, the agent wants to find the best possible action to take in every state in order to maximize its reward over time.

Markov Decision Processes (MDPs) are widely used in the fields of Artificial Intelligence (AI) and Machine Learning (ML) to model decision-making problems in stochastic environments. In many real-world problems, the environment is inherently random and unpredictable, making it difficult to make optimal decisions. MDPs provide a mathematical framework for modeling such problems and finding optimal solutions.

In an MDP, an agent interacts with an environment that consists of a set of states, actions, and rewards. At every time step, the agent observes the current state of the environment, chooses an action to perform, and receives a reward based on the action taken and the resulting state. The goal of the agent is to find a policy, a mapping from states to actions, that maximizes its expected cumulative reward over time.
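The interaction loop described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the "cool"/"hot" environment, its probabilities and rewards, the helper names `step` and `run_episode`, and the discount factor.

```python
import random

# Hypothetical toy environment, invented for illustration.
# TRANSITIONS[state][action] -> list of (probability, next_state, reward).
TRANSITIONS = {
    "cool": {
        "work": [(0.7, "cool", 2.0), (0.3, "hot", 2.0)],
        "rest": [(1.0, "cool", 1.0)],
    },
    "hot": {
        "work": [(1.0, "hot", -1.0)],
        "rest": [(0.6, "cool", 0.0), (0.4, "hot", 0.0)],
    },
}


def step(state, action, rng):
    """Sample the next state and reward for taking `action` in `state`."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def run_episode(policy, start="cool", horizon=50, gamma=0.9, seed=0):
    """Follow `policy` for `horizon` steps and return the discounted reward sum."""
    rng = random.Random(seed)
    state, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy[state]  # the policy maps the observed state to an action
        state, reward = step(state, action, rng)
        total += discount * reward
        discount *= gamma
    return total


policy = {"cool": "work", "hot": "rest"}  # one possible policy, not necessarily optimal
returns = [run_episode(policy, seed=s) for s in range(1000)]
print(sum(returns) / len(returns))  # sample average of the discounted return
```

Averaging over many episodes gives an estimate of the expected cumulative reward of this particular policy, which is the quantity the agent is trying to maximize by choosing a better policy.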

In a random world, the environment is characterized by uncertainty. The transitions between states are not deterministic, and there is no way to predict with certainty what will happen next. This makes it difficult to design an optimal policy that takes into account all possible future outcomes.

One way to handle randomness in an MDP is to use a probabilistic transition function, which specifies the probability of moving from one state to another after taking a specific action. This function can be estimated from data or learned through experience. In a random world, the transition function may be more complex, with multiple possible outcomes for every action.
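One simple way to estimate such a transition function from data is to count how often each outcome was observed. The sketch below assumes the experience is available as (state, action, next state) triples; the sample data and the function name `estimate_transitions` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical logged experience: (state, action, next_state) triples.
experience = [
    ("s0", "right", "s1"),
    ("s0", "right", "s1"),
    ("s0", "right", "s0"),
    ("s0", "right", "s1"),
    ("s1", "left", "s0"),
    ("s1", "left", "s0"),
]


def estimate_transitions(samples):
    """Return P_hat[(s, a)][s'] = count(s, a, s') / count(s, a)."""
    counts = defaultdict(Counter)
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s_next: n / sum(c.values()) for s_next, n in c.items()}
        for sa, c in counts.items()
    }


P_hat = estimate_transitions(experience)
print(P_hat[("s0", "right")])  # {'s1': 0.75, 's0': 0.25}
```

With only a handful of samples the estimates are rough; they become more reliable as more transitions are observed for each state-action pair.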

Another way to handle randomness is to model it in the rewards. In a random world, rewards may be uncertain and variable, and the agent may not be able to accurately predict the reward associated with each action. For instance, in a game of poker, the reward associated with a specific action depends on the hidden cards held by the opponent, which are unknown to the agent.
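When the reward for an action is random, the agent can still estimate its expected value by averaging the rewards it has observed. The following sketch assumes a made-up reward that pays 1 with probability 0.3 and 0 otherwise; the function names are hypothetical.

```python
import random


def noisy_reward(rng):
    """Hypothetical reward: 1.0 with probability 0.3, otherwise 0.0."""
    return 1.0 if rng.random() < 0.3 else 0.0


def estimate_expected_reward(n_samples=10_000, seed=0):
    """Running average of sampled rewards, updated one sample at a time."""
    rng = random.Random(seed)
    estimate = 0.0
    for n in range(1, n_samples + 1):
        # Incremental mean: move the estimate toward the new sample by 1/n.
        estimate += (noisy_reward(rng) - estimate) / n
    return estimate


r_hat = estimate_expected_reward()
print(r_hat)  # should be close to the true mean of 0.3
```

The incremental form of the average matters in practice because the agent does not need to store past rewards, only the current estimate and a counter.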

Various algorithms have been developed to handle randomness in an MDP, such as Monte Carlo methods, Temporal Difference (TD) learning, and Q-learning. These algorithms use different techniques to estimate the values of states and actions, which are then used to derive an optimal policy.
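As one example of these algorithms, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The four-state corridor, the 10% chance of slipping, and all hyperparameters (learning rate, discount factor, exploration rate, number of episodes) are assumptions chosen for illustration, not values taken from any particular source.

```python
import random

# Hypothetical 4-state corridor: states 0..3; reaching state 3 ends the
# episode and pays +1. Each move "slips" (goes the opposite way) with
# probability 0.1.
N_STATES, GOAL = 4, 3
ACTIONS = (-1, +1)  # move left, move right


def step(state, action, rng):
    """Sample one transition; returns (next_state, reward, done)."""
    if rng.random() < 0.1:
        action = -action
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit: pick the best-valued action, breaking ties randomly.
                action = max(ACTIONS, key=lambda a: (Q[(state, a)], rng.random()))
            next_state, reward, done = step(state, action, rng)
            # The target uses the best action value in the next state (0 if terminal).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q


Q = q_learning()
# Greedy policy derived from the learned action values.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

Note that the agent never looks at the slip probability directly. It learns the action values purely from sampled transitions, which is what makes methods of this kind usable when the transition function is unknown.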

In conclusion, MDPs provide a powerful framework for modeling decision-making problems in a random world. By incorporating randomness into the model, MDPs can help AI and ML systems make optimal decisions even in uncertain and unpredictable environments.


