policy gradients

The path to optimal behavior is not found in a single leap, but through countless small steps guided by the gradient of experience.

In this post, we will derive the basic policy gradient, which serves as the foundation for many reinforcement learning algorithms, such as REINFORCE, PPO, and GRPO.

These algorithms have served as the backbone for training state-of-the-art reasoning models such as OpenAI’s o3, xAI’s Grok 4, and DeepSeek’s R1.

I have contributed this derivation to Nat Lambert’s RLHF Book, as a piece of the larger story of policy optimization in reinforcement learning. If you are interested in learning more about RLHF, I encourage you to check it out!

Deriving the Policy Gradient

In reinforcement learning, the main objective is to learn a policy $\pi_\theta$ that maximizes the expected reward over trajectories:

$$ \begin{align} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] \label{eq:policy_objective_expectation} \end{align} $$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory and $R(\tau) = \sum_{t=0}^\infty r_t$ is the total reward of the trajectory. Alternatively, we can write the expectation as an integral over all possible trajectories:

$$ \begin{align} J(\theta) = \int_\tau p_\theta (\tau) R(\tau) d\tau \label{eq:policy_objective_integral} \end{align} $$

Notice that we can express the trajectory probability as follows:

$$ \begin{align} p_\theta (\tau) = p(s_0) \prod_{t=0}^\infty \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t) \label{eq:trajectory_probability} \end{align} $$
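As a quick sanity check, the factored form above can be verified numerically for a toy trajectory. All probabilities below are made-up illustrative values:

```python
import numpy as np

# Toy factors for a length-3 trajectory; every probability here is
# a made-up illustrative value, not from any real environment.
p_s0 = 0.5                            # p(s_0)
pi = np.array([0.7, 0.4, 0.9])        # pi_theta(a_t | s_t)
p_trans = np.array([0.6, 0.8, 0.5])   # p(s_{t+1} | s_t, a_t)

# Trajectory probability as the product of its factors:
p_tau = p_s0 * np.prod(pi * p_trans)

# Its log decomposes into a sum of logs, a fact the derivation
# relies on below:
log_p_tau = np.log(p_s0) + np.log(pi).sum() + np.log(p_trans).sum()

print(np.isclose(np.log(p_tau), log_p_tau))
```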

If we take the gradient of the objective (Equation \ref{eq:policy_objective_expectation}) with respect to the policy parameters $\theta$, we can move the gradient inside the integral, since the integration is over trajectories and only $p_\theta$ depends on $\theta$:

$$ \begin{align} \nabla_\theta J(\theta) = \int_\tau \nabla_\theta p_\theta (\tau) R(\tau) d\tau \label{eq:policy_gradient_integral} \end{align} $$

Notice that we can use the log-derivative trick to rewrite this integral as an expectation:

$$ \begin{align} \nabla_\theta \log p_\theta(\tau) &= \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} &&\text{(from chain rule)} \nonumber \\ \implies \nabla_\theta p_\theta(\tau) &= p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) &&\text{(rearranging)} \label{eq:log_chain_rule} \end{align} $$

Using this log-derivative trick:

$$ \begin{align} \nabla_\theta J(\theta) &= \int_\tau \nabla_\theta p_\theta (\tau) R(\tau) d\tau \nonumber \\ &= \int_\tau p_\theta (\tau) \nabla_\theta \log p_\theta (\tau) R(\tau) d\tau \nonumber \\ &= \mathbb{E}_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p_\theta (\tau) R(\tau) \right] \label{eq:policy_gradient_expectation} \end{align} $$
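The identity above can be checked numerically. The sketch below uses an illustrative one-step setting, a softmax "policy" over three actions with a fixed reward per action, and compares the exact gradient of $J(\theta)$ against a Monte Carlo estimate built from $\nabla_\theta \log \pi_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-step setting: a softmax "policy" over 3 actions
# with logits theta, and a fixed reward for each action.
theta = np.array([0.2, -0.5, 0.1])
rewards = np.array([1.0, 0.0, 2.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) R(a), using the softmax
# Jacobian: dJ/dtheta_i = pi(i) * (R(i) - E[R]).
exact = p * (rewards - p @ rewards)

# Monte Carlo estimate from the log-derivative trick:
# grad J ~ mean over samples of grad_theta log pi(a) * R(a),
# where grad_theta log pi(a) = onehot(a) - pi for a softmax.
n = 200_000
actions = rng.choice(3, size=n, p=p)
grad_log_pi = np.eye(3)[actions] - p
estimate = (grad_log_pi * rewards[actions, None]).mean(axis=0)

print(np.allclose(exact, estimate, atol=1e-2))  # the two should agree
```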

Expanding the log probability of the trajectory:

$$ \begin{align} \log p_\theta (\tau) = \log p(s_0) + \sum_{t=0}^\infty \log \pi_\theta(a_t|s_t) + \sum_{t=0}^\infty \log p(s_{t+1}|s_t, a_t) \label{eq:log_trajectory_probability} \end{align} $$

Now, if we take the gradient of the above we get:

  • $\nabla_\theta \log p(s_0) = 0$ (initial state doesn’t depend on $\theta$)
  • $\nabla_\theta \log p(s_{t+1}|s_t, a_t) = 0$ (environment transition dynamics don’t depend on $\theta$)
  • only $\nabla_\theta \log \pi_\theta(a_t|s_t)$ survives

Therefore, the gradient of the log probability of the trajectory simplifies to:

$$ \begin{align} \nabla_\theta \log p_\theta (\tau) = \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \label{eq:simplified_log_gradient} \end{align} $$

Substituting this back into Equation \ref{eq:policy_gradient_expectation}, we get:

$$ \begin{align} \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] \label{eq:basic_policy_gradient} \end{align} $$
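Equation \ref{eq:basic_policy_gradient} is exactly the REINFORCE update rule. A minimal sketch of it, treating a 3-armed bandit as a one-step trajectory; the arm rewards and learning rate are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: a 3-armed bandit viewed as a one-step trajectory,
# so R(tau) is just the reward of the chosen arm. The arm rewards and
# learning rate below are made-up values.
true_rewards = np.array([0.1, 0.5, 0.9])
theta = np.zeros(3)  # policy logits
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(3, p=p)           # sample a_t ~ pi_theta
    R = true_rewards[a]              # R(tau) for this 1-step trajectory
    grad_log_pi = np.eye(3)[a] - p   # grad_theta log pi_theta(a)
    theta += lr * R * grad_log_pi    # ascend the policy gradient

# The learned policy should concentrate on the highest-reward arm,
# so expected reward should improve over the uniform policy's 0.5.
print(float(softmax(theta) @ true_rewards))
```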

General Policy Gradient Formulation

Quite often, people use a more general formulation of the policy gradient:

$$ \begin{align} g = \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \Psi_t \right] \label{eq:general_gradient} \end{align} $$

where $\Psi_t$ can be any of the following (the rewards can also often be discounted by $\gamma$), a taxonomy adapted from Schulman et al. 2015:

  1. $R(\tau) = \sum_{t=0}^{\infty} r_t$: total reward of the trajectory.
  2. $\sum_{t'=t}^{\infty} r_{t'}$: reward following action $a_t$, also known as the return, $G_t$.
  3. $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$: baselined version of previous formula.
  4. $Q^{\pi}(s_t, a_t)$: state-action value function.
  5. $A^{\pi}(s_t, a_t)$: advantage function, which yields the lowest possible theoretical variance if it can be computed accurately.
  6. $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$: TD residual.

where the baseline $b(s_t)$ is a value subtracted to reduce the variance of the policy updates; as long as it does not depend on the action, it leaves the gradient estimate unbiased.
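As a concrete sketch of options 2 and 3, the return-to-go and a baselined version of it can be computed from a reward sequence. The helper name, the sample rewards, and the choice of a simple mean baseline are all illustrative:

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Option 2: G_t = sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    G = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step reuses the suffix sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 0.0, 2.0]       # made-up reward sequence
G = returns_to_go(rewards)      # G = [3., 2., 2.] with gamma = 1
baseline = G.mean()             # a simple constant baseline b
advantages = G - baseline       # option 3: baselined returns
print(G, advantages)
```

In practice the baseline is usually a learned state-value estimate rather than a batch mean, but the variance-reduction idea is the same.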

This foundational derivation forms the mathematical backbone that enables agents to learn optimal behaviors through experience, powering the remarkable reasoning capabilities we see in today’s most advanced AI systems.