Reinforcement Learning Methods for LLM Alignment (PPO, DPO, GRPO)

LLM training can be broadly divided into three stages.

(1) Pre-training (2) Supervised Fine-Tuning (SFT) (3) Reinforcement Learning from Human Feedback (RLHF)

ref: https://cameronrwolfe.substack.com/p/understanding-and-using-supervised

Each stage has the following characteristics.

Pre-training

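The model learns relationships and patterns in text from a massive corpus.
Because the objective is language modeling, no separate labeling is required.
A language model that has gone through pre-training is called a Base Model.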

SFT

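The model is further trained, mainly to follow human instructions (instruction-following).
Because the objective is to maximize $P(\text{Response} | \text{Instruction})$, a high-quality instruction dataset is required.
A language model that has gone through SFT is called an Instruct Model.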

RLHF

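The model is further trained to produce responses that humans prefer.
There are reward-based methods (e.g., PPO) and reward-free methods (e.g., DPO); both require a dataset of preference pairs.
A language model that has gone through RLHF is called an Aligned Model.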

This post covers PPO, DPO, and GRPO, the most widely used methods for LLM alignment.

0. Reinforcement Learning from Human Feedback (RLHF)

Treating the LLM as a policy parameterized by $\theta$, the LLM policy $\pi_\theta(\mathbf{y} | \mathbf{x})$ generates a response $\mathbf{y} \in \mathcal{Y}$ to a given prompt $\mathbf{x} \in \mathcal{X}$ auto-regressively.

$$ \pi_\theta(\mathbf{y} | \mathbf{x}) = \prod_t \pi_\theta(y_t | \mathbf{x}, \mathbf{y}_{<t}) $$
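As a small illustration of the factorization above, the sketch below computes the sequence probability from per-token probabilities by summing their logs. The token probabilities are made-up numbers, and `sequence_log_prob` is a hypothetical helper name, not something from the original post.

```python
import math


def sequence_log_prob(token_probs):
    """Return log pi(y | x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical values of pi(y_t | x, y_<t) for a 3-token response.
token_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(token_probs)
seq_prob = math.exp(log_p)  # same value as the product of the three probabilities
print(log_p, seq_prob)
```

Working in log space is the usual choice because the product of many small probabilities underflows quickly for long responses.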

RLHF, which injects human preference into this LLM policy $\pi_\theta$, can be defined as maximizing the following objective.

$$J_r(\pi_\theta) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}, \mathbf{y} \sim \pi_\theta} \left[ r(\mathbf{x}, \mathbf{y}) - \beta \log \frac{\pi_\theta(\mathbf{y} | \mathbf{x})}{\pi_{\text{ref}}(\mathbf{y} | \mathbf{x})} \right].$$

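Here $r$ is the reward function, which takes a prompt and its corresponding response as input and returns a scalar value.

$\pi_{\text{ref}}$ is the reference model used to regularize $\pi_\theta$, and $\beta$ controls the strength of that regularization. In expectation over $\mathbf{y} \sim \pi_\theta$, the log-ratio term is the KL divergence between $\pi_\theta$ and $\pi_{\text{ref}}$.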
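The objective can be estimated by averaging the bracketed term over sampled responses. The following is a minimal sketch under that assumption; the sample values and the helper names `kl_regularized_reward` and `estimate_objective` are hypothetical.

```python
def kl_regularized_reward(reward, logp_policy, logp_ref, beta):
    """r(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)) for one sampled response."""
    return reward - beta * (logp_policy - logp_ref)


def estimate_objective(samples, beta):
    """Monte Carlo estimate of J_r from (reward, logp_policy, logp_ref) triples."""
    values = [kl_regularized_reward(r, lp, lr, beta) for r, lp, lr in samples]
    return sum(values) / len(values)


# Two hypothetical samples y ~ pi_theta with their rewards and sequence log-probs.
samples = [(1.0, -2.0, -2.5), (0.2, -3.0, -2.0)]
j_hat = estimate_objective(samples, beta=0.1)
print(j_hat)
```

A response that the policy makes more likely than the reference does (first sample) is penalized, while one it makes less likely (second sample) receives a bonus.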

ref: https://datasciencedojo.com/blog/rlhf-and-dpo-for-finetuning-llms/

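Unlike a typical reinforcement learning problem, where the reward is given by the environment, RLHF trains a Reward Model (RM) that estimates the reward from a preference dataset $\mathcal{D} = \{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l)\}$, where $\mathbf{y}_w$ is the preferred response and $\mathbf{y}_l$ is the less preferred one.

The RM is trained with the Bradley-Terry model.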

$$\mathbb{P}_{\phi}( \mathbf{y}_w \succ \mathbf{y}_l \mid \mathbf{x}) = \frac{\exp(r_{\phi}(\mathbf{x}, \mathbf{y}_w))}{\exp(r_{\phi}(\mathbf{x}, \mathbf{y}_w)) + \exp(r_{\phi}(\mathbf{x}, \mathbf{y}_l))} = \sigma(r_{\phi}(\mathbf{x}, \mathbf{y}_w) - r_{\phi}(\mathbf{x}, \mathbf{y}_l))$$

The RM is trained to maximize this probability that $\mathbf{y}_w$ is preferred.

The RM training loss is therefore the negative log-likelihood of this probability.

$$\mathcal{L}_{R}(r_{\phi}) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l) \sim \mathcal{D}} \left[\log \sigma(r_{\phi}(\mathbf{x}, \mathbf{y}_w) - r_{\phi}(\mathbf{x}, \mathbf{y}_l))\right]$$
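The loss above can be sketched in a few lines once the reward model's scalar outputs are available. In this sketch the reward scores are plain floats standing in for $r_\phi$ outputs, and `log_sigmoid` and `reward_model_loss` are hypothetical helper names.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(pairs):
    """Mean Bradley-Terry negative log-likelihood over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


# Hypothetical RM scores for (preferred, less preferred) responses.
print(reward_model_loss([(2.0, 0.0)]))  # preferred response scored higher: small loss
print(reward_model_loss([(0.0, 2.0)]))  # ranking is wrong: large loss
```

Only the difference between the two scores matters, so the loss is unchanged if a constant is added to every reward.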

Once the reward term is replaced with the learned reward model $r_\phi$, $J_r(\pi_\theta)$ can be maximized with an online RL algorithm.

1. Proximal Policy Optimization (PPO)

Proximal Policy Optimization Algorithms (arXiv, 2017)

ref: https://gangjeong22.tistory.com/197?category=1179037

PPO updates the policy by maximizing the following clipped objective.

$$L^{CLIP}(\theta) = \mathbb{E} \left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

$r_t(\theta) = \frac{\pi_\theta(y_t|\mathbf{x}, y_{<t})}{\pi_{\text{old}}(y_t|\mathbf{x}, y_{<t})}$: probability ratio between the new policy and the old policy (not to be confused with the reward $r$)
$\hat{A}_t$: advantage
$\epsilon$: clipping range

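In other words, the probability ratio $r_t(\theta)$, which measures how much more likely the new policy is than the old policy to generate a given token, is clipped to the range $[1-\epsilon, 1+\epsilon]$. Because the minimum with the unclipped term is taken, the policy gains nothing from pushing the ratio outside this range in the direction that would raise the objective.

This keeps policy updates stable.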
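To make the clipping behavior concrete, here is a minimal per-token sketch of the clipped term. The function name `clipped_surrogate` and the default `eps=0.2` are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: the gain is capped once the ratio drops below 1 - eps.
print(clipped_surrogate(0.5, -1.0))
# Negative advantage with a large ratio: not clipped, so the full penalty applies.
print(clipped_surrogate(1.5, -1.0))
```

The third case shows that clipping is one-sided: moves that make the objective worse are never hidden by the clip.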

When PPO is applied to RLHF in practice, training proceeds as follows.

for each iteration:
    1. Sampling: sample an instruction x and generate a response y with the current policy π_θ
    2. Reward: r(x,y) = r_φ(x,y) - β·log(π_θ(y|x)/π_ref(y|x))  ← the bracketed term in J_r(π_θ)
    3. Advantage: A(x,y) = r(x,y) - V(x)  (V is the value model)
    4. Policy update: update θ with the PPO clipping objective
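Steps 2 to 4 can be sketched at the sequence level as follows. This is a simplified illustration, not a full PPO implementation: it treats each response as a single action, uses precomputed log-probabilities and value estimates, and the dictionary keys and the name `ppo_batch_objective` are hypothetical.

```python
import math


def ppo_batch_objective(batch, beta=0.1, eps=0.2):
    """Mean clipped surrogate over a batch of sampled responses."""
    total = 0.0
    for s in batch:
        # Step 2: KL-penalized reward, using the policy that generated the sample.
        reward = s["rm_score"] - beta * (s["logp_old"] - s["logp_ref"])
        # Step 3: advantage against the value model's baseline V(x).
        advantage = reward - s["value"]
        # Step 4: clipped surrogate with the new-to-old probability ratio.
        ratio = math.exp(s["logp_new"] - s["logp_old"])
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(batch)


# One hypothetical sample; before any update logp_new equals logp_old.
batch = [{"rm_score": 1.0, "logp_old": -2.0, "logp_new": -2.0,
          "logp_ref": -2.5, "value": 0.5}]
objective = ppo_batch_objective(batch)
print(objective)
```

In an actual training loop the objective would be maximized with respect to the parameters behind `logp_new`, and the value model would be trained alongside the policy.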

2. Direct Preference Optimization (DPO)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS, 2023)

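PPO is still complex to implement and expensive to run.

(Four models must be kept around at every step: the policy model, the reference model, the reward model, and the value model.)

Instead of fitting a reward model to the preference data, DPO trains the final LM $\pi_\theta$ on it directly.

That is, it uses the closed-form solution for the optimal policy of $J_r(\pi_\theta)$ to remove both the separate RM and the reinforcement learning stage.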

First, the optimal policy for a given reward $r$ can be written as follows:

$$\pi^*(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$$

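Here $Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\text{ref}}(\mathbf{y}|\mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$ is the partition function.

Rearranging this expression for the reward gives: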

$$r(\mathbf{x}, \mathbf{y}) = \beta \log \frac{\pi^*(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})} + \beta \log Z(\mathbf{x})$$

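In other words, the key point is that the reward can be expressed, without a separate reward model, through the log-ratio of the optimal policy $\pi^*$ to the reference policy $\pi_{\text{ref}}$, up to the term $\beta \log Z(\mathbf{x})$, which depends only on the prompt.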
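The two identities above can be checked numerically on a toy response space. The sketch below assumes a prompt with only three possible responses and made-up values for $\pi_{\text{ref}}$, $r$, and $\beta$; `optimal_policy` is a hypothetical helper name.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta); also returns Z(x)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


# Hypothetical reference probabilities and rewards for three candidate responses.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)

# Recover the reward from the policy ratio plus the partition term.
recovered = [beta * math.log(ps / pr) + beta * math.log(z)
             for ps, pr in zip(pi_star, pi_ref)]
print(pi_star)
print(recovered)
```

The recovered values match `rewards` up to floating-point error, which is the rearrangement DPO relies on.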

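Substituting this into the reward model loss $\mathcal{L}_R$, the $\beta \log Z(\mathbf{x})$ terms cancel in the difference between the two rewards, and replacing $\pi^*$ with the trainable policy $\pi_\theta$ gives: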

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l) \sim \mathcal{D}} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(\mathbf{y}_w|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_w|\mathbf{x})} - \beta \log \frac{\pi_\theta(\mathbf{y}_l|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_l|\mathbf{x})}\right)\right]$$

When DPO is applied in practice, training proceeds as follows.

for each iteration:
    1. Batch sampling: sample a batch of preference pairs (x, y_w, y_l) from D
    
    2. Log-probabilities:
       - log π_θ(y_w|x): log-probability of the preferred response under the current policy
       - log π_θ(y_l|x): log-probability of the less preferred response under the current policy
       - log π_ref(y_w|x): log-probability of the preferred response under the reference
       - log π_ref(y_l|x): log-probability of the less preferred response under the reference
    
    3. Implicit rewards:
       - r_w = β · (log π_θ(y_w|x) - log π_ref(y_w|x))
       - r_l = β · (log π_θ(y_l|x) - log π_ref(y_l|x))
    
    4. DPO loss and θ update:
       - L_DPO = -log σ(r_w - r_l)
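Steps 3 and 4 can be sketched for a single preference pair as below. The sequence log-probabilities are assumed to be precomputed floats, `beta=0.1` is an arbitrary choice, and `dpo_loss` is a hypothetical function name.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(r_w - r_l) with implicit rewards from policy/reference log-ratios."""
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of the preferred response
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of the less preferred response
    margin = r_w - r_l
    # -log sigmoid(m) = log(1 + exp(-m)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Raising the policy's log-probability of the preferred response lowers the loss.
print(dpo_loss(-4.0, -6.0, -5.0, -6.0))
```

In training, the gradient of this loss flows only through `logp_w` and `logp_l`; the reference log-probabilities are constants.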

Because DPO needs no on-policy sampling, no RM, no value function, and no clipping, it is considerably simpler and more intuitive.

3. Group Relative Policy Optimization (GRPO)

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv, 2024)

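GRPO is an RL-based method that nevertheless aims to optimize the policy without a value function (critic model).

PPO needed a separate value function $V(x)$ to compute the advantage.

The value model is about the same size as the policy model, so it takes substantial resources. Moreover, in the LLM setting a reward is usually given only at the last token, which makes it very hard to train a value function that estimates a value for every token.

The core idea of GRPO is to replace this value function (baseline) with statistics of a group of $G$ responses $\{y_1, y_2, \dots, y_G\}$ generated for the same question $x$.

First, the mean reward of the responses in the group is taken as the baseline.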

$$\bar{R}(x) = \frac{1}{G} \sum_{i=1}^{G} r(x, y_i)$$

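The advantage is then computed relative to this baseline and normalized as a z-score. The $\epsilon$ in the denominator is a small constant that prevents division by zero, not the clipping range.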

$$\hat{A}(x, y_i) = \frac{r(x, y_i) - \bar{R}(x)}{\text{std}(\{r_1, ..., r_G\}) + \epsilon}$$

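In other words, the critic's role is recast as answering "how much better is this response relative to the other responses in the same group?",

so no separate value model is needed. (A reward model is still required.)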
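The group-relative advantage can be sketched with the standard library alone. The use of the population standard deviation and the value `eps=1e-8` are assumptions of this sketch, and `group_advantages` is a hypothetical function name.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for G = 4 responses to the same question.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every response gets the same reward, all advantages are zero.
print(group_advantages([0.5, 0.5]))
```

The second call shows a practical consequence: a group whose responses all receive the same reward contributes no learning signal through the advantage.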

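The final objective that reflects this is as follows.

$$J_{GRPO}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(y|x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \underbrace{\frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min \left( \rho_{i,t}(\theta) \hat{A}_{i}, \text{clip}(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_{i} \right)}_{\text{Clipped Surrogate Objective}} - \underbrace{\beta D_{KL}(\pi_\theta || \pi_{\text{ref}})}_{\text{KL Regularization}} \right) \right]$$

Here $\rho_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} | x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} | x, y_{i,<t})}$ is the token-level probability ratio, which plays the same role as $r_t(\theta)$ in PPO, and $\hat{A}_i = \hat{A}(x, y_i)$ is shared by every token of $y_i$.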

In practice, GRPO proceeds as follows.

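for each iteration:
    
    1. Reference update: π_ref ← π_θ (if needed)
    
    2. Sampling:
       sample a question x and generate G responses {y_1, ..., y_G} with the current policy π_θ
        
    3. Reward:
       for each response, r_i = r_φ(x, y_i)
       (the KL penalty β·D_KL(π_θ || π_ref) is not folded into the reward; it enters the objective in step 5)
        
    4. Group advantage (replaces the critic):
       - group mean: μ = mean({r_1, ..., r_G})
       - group standard deviation: σ = std({r_1, ..., r_G})
       - advantage: Â_i = (r_i - μ) / (σ + ε)
        
    5. Policy update:
       - compute the clipped PPO-style objective for each token of each response y_i
       - subtract the KL regularization term β·D_KL(π_θ || π_ref)
       - update π_θ with the resulting objective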
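Steps 3 to 5 can be put together for one group as in the sketch below. It is an illustration under several assumptions: per-token log-probabilities are precomputed lists of floats, the KL term is estimated per token as exp(d) - d - 1 with d = log π_ref - log π_θ (one common non-negative estimator, chosen here for illustration), and the dictionary keys, default hyperparameters, and the name `grpo_objective` are hypothetical.

```python
import math
import statistics


def grpo_objective(group, beta=0.04, clip_eps=0.2, std_eps=1e-8):
    """Clipped surrogate minus a KL penalty, averaged over tokens and responses."""
    rewards = [g["reward"] for g in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    total = 0.0
    for g in group:
        # Step 4: group-relative advantage, shared by every token of the response.
        adv = (g["reward"] - mean) / (std + std_eps)
        surrogate, kl = 0.0, 0.0
        tokens = list(zip(g["logp_new"], g["logp_old"], g["logp_ref"]))
        for lp_new, lp_old, lp_ref in tokens:
            # Step 5: token-level ratio and clipped surrogate term.
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
            surrogate += min(ratio * adv, clipped * adv)
            # Assumed per-token KL estimator (non-negative, zero when policies agree).
            d = lp_ref - lp_new
            kl += math.exp(d) - d - 1.0
        total += (surrogate - beta * kl) / len(tokens)
    return total / len(group)


# Two hypothetical 2-token responses; the policy has not moved from old or ref yet.
lp = [-1.0, -2.0]
group = [
    {"reward": 1.0, "logp_new": lp, "logp_old": lp, "logp_ref": lp},
    {"reward": 0.0, "logp_new": lp, "logp_old": lp, "logp_ref": lp},
]
print(grpo_objective(group))
```

With identical policies the ratios are 1 and the KL estimate is 0, so the objective reduces to the mean of the normalized advantages, which is zero for this group.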

Reference

[Easy Review] Properly Understanding Proximal Policy Optimization Algorithms (velog.io)

[Easy Review] Properly Understanding Direct Preference Optimization: Your Language Model is Secretly a Reward Model (velog.io)
https://velog.io/@bb1702/LLM-Direct-Preference-Optimization-Your-Language-Model-is-Secretly-a-Reward-Model

Training language models to follow instructions with human feedback (arxiv.org)
https://arxiv.org/abs/2203.02155

Proximal Policy Optimization Algorithms (arxiv.org)
https://arxiv.org/abs/1707.06347

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (proceedings.neurips.cc)
https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study (arxiv.org)
https://arxiv.org/abs/2404.10719

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arxiv.org)
https://arxiv.org/abs/2402.03300
