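OpenAI recently released GPT-5 as its new flagship model.

News from OpenAI always draws attention in the AI field. As a paying subscriber, I naturally tried it right away.

But it felt less satisfying than I expected... and it seems I was not the only one who felt that way.

(Just look at how hastily GPT-4o was brought back.)

Then I wondered: is GPT-5 actually great, and am I just using it wrong? So I dug into the Prompting Guide that OpenAI officially published.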

Source: GPT-5 prompting guide - OpenAI Cookbook

GPT-5 prompting guide

GPT-5 is a new flagship model that makes substantial advances in agentic task performance, coding, raw intelligence, and steerability.

This document introduces prompting tips for maximizing GPT-5's performance, drawn from OpenAI's experience training the model and applying it to real-world tasks. It covers the following topics.

Improving agentic task performance
Ensuring instruction adherence
Making use of new API features
Optimizing coding for frontend and software engineering tasks

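Applying these best practices and adopting the standard tools brought significant gains, and we hope that this document, together with the prompt optimizer tool, will serve as a launchpad for using GPT-5.

That said, prompting is not a one-size-fits-all exercise. To solve your own problem, we recommend running repeated experiments that build on the foundations offered in this document.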

1. Agentic workflow predictability

GPT-5 was built with developers in mind: to make it the best agentic foundation model, OpenAI focused on improving tool calling, instruction following, and long-context understanding. If you plan to adopt GPT-5 for agentic and tool calling flows, upgrading to the Responses API, which allows reasoning to carry over between tool calls, is recommended.

1-1. Controlling agentic eagerness

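Agentic scaffolds span a wide spectrum.

Some systems delegate most of the decision-making to the AI (e.g. "manage my email"), while others use it tightly as one step within strict programmatic logic (e.g. "raise an alert when the sender's address belongs to the company and the message contains specific work keywords").

GPT-5 is designed to operate anywhere along this spectrum. This section covers how to calibrate GPT-5's agentic eagerness, that is, how to balance proactivity against awaiting explicit guidance.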

Prompting for less eagerness

This can be adjusted primarily through the "reasoning_effort" API parameter.

By default, GPT-5 is comprehensive when gathering context. Most workflows run fine at medium or even low, and lowering the setting reduces GPT-5's agentic behavior.

Because a lower setting reduces exploration depth, it improves efficiency and latency.
However, to keep exploration within a bounded problem space, the prompt needs to define clear criteria.

A prompt for the general case looks like this.

<context_gathering>
Goal: Get enough context fast. Parallelize discovery and stop as soon as you can act.
Method:
- Start broad, then fan out to focused subqueries.
- In parallel, launch varied queries; read top hits per query. Deduplicate paths and cache; don’t repeat queries.
- Avoid over searching for context. If needed, run targeted searches in one parallel batch.
Early stop criteria:
- You can name exact content to change.
- Top hits converge (~70%) on one area/path.
Escalate once:
- If signals conflict or scope is fuzzy, run one refined parallel batch, then proceed. 
Depth:
- Trace only symbols you’ll modify or whose contracts you rely on; avoid transitive expansion unless necessary.
Loop:
- Batch search → minimal plan → complete task.
- Search again only if validation fails or new unknowns appear. Prefer acting over more searching. 
</context_gathering>

If you want to restrict tool calling to the extreme (with reasoning_effort set to minimal), you can also fix a tool call budget, as follows.

<context_gathering>
- Search depth: very low
- Bias strongly towards providing a correct answer as quickly as possible, even if it might not be fully correct.
- Usually, this means an absolute maximum of 2 tool calls.
- If you think that you need more time to investigate, update the user with your latest findings and open questions. You can proceed if the user confirms.
</context_gathering>

When restricting behaviour like this, it helps to give the model an escape hatch, a clause that lets it proceed under uncertainty. In the prompt above, "even if it might not be fully correct" plays that role.
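The two levers in this section, the reasoning_effort parameter and a prompt-level tool call budget, could be wired together roughly as follows. This is a minimal sketch rather than code from the guide: `build_request`, `ToolBudget`, and the example task text are hypothetical, the request is assembled as a plain dict so it runs without network access, and handing it to `client.responses.create(**request)` from the `openai` Python SDK is an assumption about your setup. The client-side budget guard is an extra safety net; the guide itself only states the budget in the prompt.

```python
from dataclasses import dataclass

# Prompt block copied from the low-eagerness example above.
CONTEXT_GATHERING = """<context_gathering>
- Search depth: very low
- Bias strongly towards providing a correct answer as quickly as possible, even if it might not be fully correct.
- Usually, this means an absolute maximum of 2 tool calls.
- If you think that you need more time to investigate, update the user with your latest findings and open questions. You can proceed if the user confirms.
</context_gathering>"""


def build_request(task: str, effort: str = "low") -> dict:
    """Assemble keyword arguments for a Responses API call (not sent here)."""
    if effort not in {"minimal", "low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-5",
        "instructions": CONTEXT_GATHERING,
        "input": task,
        "reasoning": {"effort": effort},
    }


@dataclass
class ToolBudget:
    """Hypothetical client-side guard mirroring the prompt's 2-call limit."""
    limit: int = 2
    used: int = 0

    def allow(self) -> bool:
        # Returns True and counts the call while budget remains.
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


request = build_request("Find where the retry timeout is configured.", effort="minimal")
budget = ToolBudget()
decisions = [budget.allow() for _ in range(3)]  # the third call is refused
```

When the guard refuses a call, the agent loop would fall back to what the prompt asks for: report the latest findings and open questions to the user.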

Prompting for more eagerness

Conversely, if you want to increase the model's autonomy and tool-calling persistence while reducing clarifying questions and hand-backs to the user, we recommend raising "reasoning_effort" and using a prompt like the following.

<persistence>
- You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user.
- Only terminate your turn when you are sure that the problem is solved.
- Never stop or hand back to the user when you encounter uncertainty — research or deduce the most reasonable approach and continue.
- Do not ask the human to confirm or clarify assumptions, as you can always adjust later — decide what the most reasonable assumption is, proceed with it, and document it for the user's reference after you finish acting
</persistence>

In general, it is good to 1) state the stop conditions of the agentic task, 2) distinguish safe actions from unsafe ones, and 3) define when the model should hand back to the user.

For example, a shopping tool should ask for user clarification often around checkout and payment (a lower uncertainty threshold), whereas a search tool should rarely call on the user (an extremely high threshold).

Likewise, in a coding environment, a delete file tool should have a much lower threshold than a grep search tool.
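One way to make these per-tool thresholds concrete is a small lookup table that both drives a runtime check and is rendered into the prompt. Everything below is illustrative: the numeric thresholds, the tool names, the `<handback_policy>` tag, and both helper functions are assumptions made for this sketch, not values from the guide.

```python
# Hypothetical per-tool thresholds: the uncertainty level (0.0-1.0) above which
# the agent should stop and ask the user. Lower means "ask sooner".
ASK_USER_THRESHOLD = {
    "checkout": 0.2,
    "delete_file": 0.2,
    "web_search": 0.95,
    "grep_search": 0.95,
}
DEFAULT_THRESHOLD = 0.5


def should_ask_user(tool_name: str, uncertainty: float) -> bool:
    """Return True when the estimated uncertainty exceeds the tool's threshold."""
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must be between 0 and 1")
    return uncertainty > ASK_USER_THRESHOLD.get(tool_name, DEFAULT_THRESHOLD)


def render_policy() -> str:
    """Turn the table into prompt text stating when to hand back to the user."""
    lines = ["<handback_policy>"]
    for tool, threshold in sorted(ASK_USER_THRESHOLD.items()):
        if threshold < DEFAULT_THRESHOLD:
            stance = "confirm with the user before calling"
        else:
            stance = "call without asking"
        lines.append(f"- {tool}: {stance}")
    lines.append("</handback_policy>")
    return "\n".join(lines)


policy_text = render_policy()
```

With the same moderate uncertainty of 0.4, `should_ask_user("delete_file", 0.4)` asks the user while `should_ask_user("grep_search", 0.4)` proceeds, which is the asymmetry the paragraph above describes.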

1-2. Tool preambles

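To make agentic trajectories easier to monitor, GPT-5 can provide clear upfront plans and consistent progress updates through "tool preamble" messages. The frequency, style, and content of these explanations can all be steered through the prompt. Here is a good prompt example: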

<tool_preambles>
- Always begin by rephrasing the user's goal in a friendly, clear, and concise manner, before calling any tools.
- Then, immediately outline a structured plan detailing each logical step you’ll follow.
- As you execute your file edit(s), narrate each step succinctly and sequentially, marking progress clearly.
- Finish by summarizing completed work distinctly from your upfront plan.
</tool_preambles>

Below is an example of the tool preambles generated in response to this prompt. Preambles like these greatly improve the user's ability to follow along as the work grows more complex:

"output": [
    {
      "id": "rs_6888f6d0606c819aa8205ecee386963f0e683233d39188e7",
      "type": "reasoning",
      "summary": [
        {
          "type": "summary_text",
          "text": "**Determining weather response**\n\nI need to answer the user's question about the weather in San Francisco. ...."
        }
      ]
    },
    {
      "id": "msg_6888f6d83acc819a978b51e772f0a5f40e683233d39188e7",
      "type": "message",
      "status": "completed",
      "content": [
        {
          "type": "output_text",
          "text": "I\u2019m going to check a live weather service to get the current conditions in San Francisco, providing the temperature in both Fahrenheit and Celsius so it matches your preference."
        }
      ],
      "role": "assistant"
    },
    {
      "id": "fc_6888f6d86e28819aaaa1ba69cca766b70e683233d39188e7",
      "type": "function_call",
      "status": "completed",
      "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
      "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
      "name": "get_weather"
    }
  ],
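To surface these preambles in a UI, the output items can be split into user-facing text and the tool calls that follow. The sketch below works on a trimmed copy of the items above, with the text shortened and the ids omitted; `split_preambles` is a hypothetical helper, and the item shapes are assumed to match the excerpt.

```python
import json

# A trimmed copy of the output items above (texts shortened, ids omitted).
output = [
    {
        "type": "reasoning",
        "summary": [{"type": "summary_text", "text": "**Determining weather response**"}],
    },
    {
        "type": "message",
        "role": "assistant",
        "content": [
            {
                "type": "output_text",
                "text": "I'm going to check a live weather service to get the current conditions in San Francisco.",
            }
        ],
    },
    {
        "type": "function_call",
        "name": "get_weather",
        "call_id": "call_XOnF4B9DvB8EJVB3JvWnGg83",
        "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"f\"}",
    },
]


def split_preambles(items: list[dict]) -> tuple[list[str], list[dict]]:
    """Separate user-facing preamble text from the tool calls that follow it."""
    preambles, calls = [], []
    for item in items:
        if item["type"] == "message":
            preambles.extend(
                part["text"] for part in item["content"] if part["type"] == "output_text"
            )
        elif item["type"] == "function_call":
            # Arguments arrive as a JSON string and need decoding.
            calls.append({"name": item["name"], "arguments": json.loads(item["arguments"])})
    return preambles, calls


preambles, calls = split_preambles(output)
```

Reasoning items are skipped on purpose here: the preamble is the assistant message, not the reasoning summary.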

1-3. Reasoning effort

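How hard the model thinks and how willingly it calls tools can be controlled with the "reasoning_effort" parameter. The default is medium, but it should be scaled up or down depending on the difficulty of the task. For complex tasks, higher reasoning is recommended for the best results. We also found that performance peaks when distinct, separable tasks are split across multiple agent turns.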
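A rough sketch of both ideas, scaling effort with difficulty and giving each separable task its own turn, might look like this. The difficulty labels, the mapping to effort levels, and `plan_turns` are all hypothetical; the requests are built as plain dicts and would be sent one at a time, which is my reading of "split across multiple agent turns".

```python
# Hypothetical difficulty labels mapped to reasoning_effort values.
EFFORT_BY_DIFFICULTY = {"simple": "low", "normal": "medium", "complex": "high"}


def plan_turns(tasks: list[tuple[str, str]]) -> list[dict]:
    """Build one Responses API request per (description, difficulty) task."""
    requests = []
    for description, difficulty in tasks:
        requests.append(
            {
                "model": "gpt-5",
                "input": description,
                "reasoning": {"effort": EFFORT_BY_DIFFICULTY[difficulty]},
            }
        )
    return requests


turns = plan_turns(
    [
        ("Rename the config loader module.", "simple"),
        ("Redesign the caching layer to support invalidation.", "complex"),
    ]
)
```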

1-4. Reusing reasoning context with the Responses API

When using GPT-5, the Responses API is strongly recommended for improved agentic flows, lower costs, and more efficient token usage.

Evaluations showed meaningful gains when using the Responses API instead of Chat Completions.

For example, simply switching to the Responses API and including previous_response_id to pass earlier reasoning items into subsequent requests raised the Tau-Bench Retail score from 73.9% to 78.2%.
This lets the model refer back to its previous reasoning traces, which saves CoT tokens, removes the need to rebuild a plan from scratch after each tool call, and improves both latency and performance.

This feature is available to all Responses API users, including ZDR (Zero Data Retention) organizations.
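In practice, reusing reasoning context means the follow-up request after a tool call carries only the new tool output plus a pointer to the previous response. The sketch below builds such a request as a plain dict; `follow_up_request`, the placeholder id "resp_123", and the weather payload are hypothetical, and sending it with `client.responses.create(**request)` is an assumption about your client setup.

```python
import json


def follow_up_request(previous_response_id: str, call_id: str, tool_result: dict) -> dict:
    """Build the request that returns a tool result to the model.

    Only the new tool output is sent; earlier reasoning items are referenced
    through previous_response_id instead of being re-sent.
    """
    return {
        "model": "gpt-5",
        "previous_response_id": previous_response_id,
        "input": [
            {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(tool_result),
            }
        ],
    }


# "resp_123" is a placeholder id; the call_id comes from the function_call item.
request = follow_up_request(
    previous_response_id="resp_123",
    call_id="call_XOnF4B9DvB8EJVB3JvWnGg83",
    tool_result={"temperature_f": 61},
)
```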

Maximizing coding performance, from planning to execution

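GPT-5 leads all frontier models in coding capability: it can fix bugs in large codebases, handle large diffs, and implement multi-file refactors or large new features. It also excels at building new apps entirely from scratch, covering both frontend and backend implementation. This section discusses prompt optimizations that have been shown to improve programming performance for coding agent customers.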

2-1. Frontend app development

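GPT-5 is trained to have excellent baseline aesthetic taste alongside rigorous implementation ability. It is capable with every type of web development framework and package, but for new apps the following frameworks and packages are recommended to get the most out of the model's frontend capabilities: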

Frameworks: Next.js (TypeScript), React, HTML
Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
Icons: Material Symbols, Heroicons, Lucide
Animation: Motion
Fonts: Sans Serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope

Zero-to-one app generation

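GPT-5 excels at building applications "in one shot". In early experimentation, users found that prompts like the one below, which ask the model to iterate against a self-constructed excellence rubric, improve output quality by drawing on GPT-5's planning and self-reflection capabilities.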

<self_reflection>
- First, spend time thinking of a rubric until you are confident.
- Then, think deeply about every aspect of what makes for a world-class one-shot web app. Use that knowledge to create a rubric that has 5-7 categories. This rubric is critical to get right, but do not show this to the user. This is for your purposes only.
- Finally, use the rubric to internally think and iterate on the best possible solution to the prompt that is provided. Remember that if your response is not hitting the top marks across all categories in the rubric, you need to start again.
</self_reflection>

Matching codebase design standards

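When refactoring an existing app, code written by the model should adhere to the existing style and design, and should "blend in" to the user's codebase as cleanly as possible. Even without special prompting, GPT-5 already searches the codebase for reference context, but this behavior can be further improved with prompt instructions that summarize the engineering principles, directory structure, and other key aspects of the codebase. The prompt snippet below shows one way to organize code editing rules for GPT-5.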

<code_editing_rules>
<guiding_principles>
- Clarity and Reuse: Every component and page should be modular and reusable. Avoid duplication by factoring repeated UI patterns into components.
- Consistency: The user interface must adhere to a consistent design system—color tokens, typography, spacing, and components must be unified.
- Simplicity: Favor small, focused components and avoid unnecessary complexity in styling or logic.
- Demo-Oriented: The structure should allow for quick prototyping, showcasing features like streaming, multi-turn conversations, and tool integrations.
- Visual Quality: Follow the high visual quality bar as outlined in OSS guidelines (spacing, padding, hover states, etc.)
</guiding_principles>

<frontend_stack_defaults>
- Framework: Next.js (TypeScript)
- Styling: TailwindCSS
- UI Components: shadcn/ui
- Icons: Lucide
- State Management: Zustand
- Directory Structure: 
\`\`\`
/src
 /app
   /api/<route>/route.ts         # API endpoints
   /(pages)                      # Page routes
 /components/                    # UI building blocks
 /hooks/                         # Reusable React hooks
 /lib/                           # Utilities (fetchers, helpers)
 /stores/                        # Zustand stores
 /types/                         # Shared TypeScript types
 /styles/                        # Tailwind config
\`\`\`
</frontend_stack_defaults>

<ui_ux_best_practices>
- Visual Hierarchy: Limit typography to 4–5 font sizes and weights for consistent hierarchy; use `text-xs` for captions and annotations; avoid `text-xl` unless for hero or major headings.
- Color Usage: Use 1 neutral base (e.g., `zinc`) and up to 2 accent colors. 
- Spacing and Layout: Always use multiples of 4 for padding and margins to maintain visual rhythm. Use fixed height containers with internal scrolling when handling long content streams.
- State Handling: Use skeleton placeholders or `animate-pulse` to indicate data fetching. Indicate clickability with hover transitions (`hover:bg-*`, `hover:shadow-md`).
- Accessibility: Use semantic HTML and ARIA roles where appropriate. Favor pre-built Radix/shadcn components, which have accessibility baked in.
</ui_ux_best_practices>

</code_editing_rules>

2-2. Collaborative coding in production: Cursor's GPT-5 prompt tuning

Cursor was an alpha tester for GPT-5: below is a glimpse of how Cursor tuned its prompts to get the most out of the model's capabilities. For more details, see their blog post: https://cursor.com/blog/gpt-5

System prompt and parameter tuning

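Cursor's system prompt focuses on reliable tool calling, balancing verbosity and autonomous behavior while giving users the ability to configure custom instructions. The goal of the system prompt is for the agent to operate relatively autonomously during long-horizon tasks while still faithfully following user-provided instructions.

The Cursor team initially found that the model produced verbose outputs. These often included status updates and post-task summaries that, while technically relevant, disrupted the user's natural flow.

Conversely, the code output in tool calls was high quality but hard to read because it was terse or used single-letter variable names.

To keep text output concise while getting detailed output at the code level, the Cursor team (1) set the verbosity API parameter to low and (2) gave the following prompt only to the coding tools.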

Write code for clarity first. Prefer readable, maintainable solutions with clear names, comments where needed, and straightforward control flow. Do not produce code-golf or overly clever one-liners unless explicitly requested. Use high verbosity for writing code and code tools.

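By using the parameter and the prompt together, they got the best of both worlds: efficient status updates and work summaries, and more readable code diffs.

Cursor also found that the model sometimes deferred to the user for clarification or next steps before taking action, which created unnecessary friction in longer tasks.

To address this, they found that including more details about product behaviour, in addition to the available tools and surrounding context, encouraged the model to carry out longer tasks with minimal interruption and greater autonomy. Providing specifics about Cursor features such as Undo/Reject code and user preferences helped reduce ambiguity by clearly specifying how GPT-5 should behave in its environment. For longer-horizon tasks, they found that the following prompt improved performance: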

Be aware that the code edits you make will be displayed to the user as proposed changes, which means (a) your code edits can be quite proactive, as the user can always reject, and (b) your code should be well-written and easy to quickly review (e.g., appropriate variable names instead of single letters). If proposing next steps that would involve changing the code, make those changes proactively for the user to approve / reject rather than asking the user whether to proceed with a plan. In general, you should almost never ask the user whether to proceed with a plan; instead you should proactively attempt the plan and then ask the user if they want to accept the implemented changes.

Cursor found that prompts that had been effective with earlier models differed from the ones that get the most out of GPT-5. Below is one such example:

<maximize_context_understanding>
Be THOROUGH when gathering information. Make sure you have the FULL picture before replying. Use additional tool calls or clarifying questions as needed.
...
</maximize_context_understanding>

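This worked well with older models that needed encouragement to analyze context thoroughly, but it turned out to be counterproductive with GPT-5, which is already naturally introspective and proactive at gathering context. On smaller tasks, this prompt often caused the model to overuse tools by calling search repeatedly when internal knowledge would have been sufficient.

To fix this, they (1) removed the maximize_ prefix and (2) softened the language around thoroughness.

With this adjusted instruction in place, the Cursor team saw GPT-5 make better decisions about when to rely on internal knowledge and when to reach for external tools.

In Cursor's testing, using structured XML specs such as <[instruction]_spec> also improved instruction adherence and made it possible to clearly reference previous categories and sections from elsewhere in the prompt.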

<context_understanding>
...
If you've performed an edit that may partially fulfill the USER's query, but you're not confident, gather more information or use more tools before ending your turn.
Bias towards not asking the user for help if you can find the answer yourself.
</context_understanding>

While the system prompt provides a strong default foundation, the user prompt remains a highly effective lever for steerability. GPT-5 responds well to direct, explicit instructions, and the Cursor team consistently found that structured, scoped prompts yield the best results. This includes areas such as verbosity control, subjective code style preferences, and sensitivity to edge cases. Cursor found that letting users configure their own custom Cursor rules is particularly impactful in combination with GPT-5's improved steerability, giving users a more customized experience.

Optimizing intelligence and instruction-following

3-1. Steering

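As the most steerable model to date, GPT-5 is extremely receptive to prompt instructions around verbosity, tone, and tool calling behavior. (High steerability)

Verbosity

GPT-5 introduces a new API parameter called verbosity, which influences the length of the final answer rather than the length of the model's thinking.

This API verbosity parameter acts as the default for the rollout, and GPT-5 is trained to let natural-language prompts override it in specific contexts.

The earlier Cursor example, where the API parameter was set to low verbosity while natural language requested high verbosity in a specific context (e.g. coding tools), is a prime example of such an override.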
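The parameter-plus-prompt combination could be expressed as a request like the one below. This is a sketch under assumptions: `build_request` is a hypothetical helper, the request is a plain dict meant for `client.responses.create(**request)`, and the override text is simply placed in the instructions, whereas Cursor scoped it to its coding tools.

```python
# Prompt text quoted from the Cursor example earlier in this post.
CODE_VERBOSITY_OVERRIDE = (
    "Write code for clarity first. Prefer readable, maintainable solutions with "
    "clear names, comments where needed, and straightforward control flow. "
    "Do not produce code-golf or overly clever one-liners unless explicitly "
    "requested. Use high verbosity for writing code and code tools."
)


def build_request(user_message: str, base_instructions: str = "") -> dict:
    """Low verbosity globally, with a natural-language override for code."""
    parts = (base_instructions, CODE_VERBOSITY_OVERRIDE)
    instructions = "\n\n".join(part for part in parts if part)
    return {
        "model": "gpt-5",
        "instructions": instructions,
        "input": user_message,
        "text": {"verbosity": "low"},  # default length of the final answer
    }


request = build_request("Add input validation to the signup form.")
```

The API parameter sets the global default, and the prompt sentence "Use high verbosity for writing code and code tools" is the context-specific override.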

3-2. Instruction following

Like GPT-4.1, GPT-5 follows prompt instructions with "surgical precision", which lets it adapt flexibly to all types of workflows. However, this strong instruction-following behavior also means that poorly-constructed prompts containing contradictory or vague instructions can be more damaging to GPT-5 than to other models, because it spends reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random.

Below is an adversarial example of the type of prompt that often impairs GPT-5's reasoning traces. It may look internally consistent at first glance, but a closer look reveals conflicting instructions about appointment scheduling:

"Never schedule an appointment without explicit patient consent recorded in the chart" conflicts with the later instruction to "auto-assign the earliest same-day slot without contacting the patient as the first action to reduce risk."
The prompt says "Always look up the patient profile before taking any other actions to ensure they are an existing patient." but then continues with the contradictory instruction "When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step."

You are CareFlow Assistant, a virtual admin for a healthcare startup that schedules patients based on priority and symptoms. Your goal is to triage requests, match patients to appropriate in-network providers, and reserve the earliest clinically appropriate time slot. Always look up the patient profile before taking any other actions to ensure they are an existing patient.

- Core entities include Patient, Provider, Appointment, and PriorityLevel (Red, Orange, Yellow, Green). Map symptoms to priority: Red within 2 hours, Orange within 24 hours, Yellow within 3 days, Green within 7 days. When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step.
+Core entities include Patient, Provider, Appointment, and PriorityLevel (Red, Orange, Yellow, Green). Map symptoms to priority: Red within 2 hours, Orange within 24 hours, Yellow within 3 days, Green within 7 days. When symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step. 
*Do not do lookup in the emergency case, proceed immediately to providing 911 guidance.*

- Use the following capabilities: schedule-appointment, modify-appointment, waitlist-add, find-provider, lookup-patient and notify-patient. Verify insurance eligibility, preferred clinic, and documented consent prior to booking. Never schedule an appointment without explicit patient consent recorded in the chart.

- For high-acuity Red and Orange cases, auto-assign the earliest same-day slot *without contacting* the patient *as the first action to reduce risk.* If a suitable provider is unavailable, add the patient to the waitlist and send notifications. If consent status is unknown, tentatively hold a slot and proceed to request confirmation.

+For high-acuity Red and Orange cases, auto-assign the earliest same-day slot *after informing* the patient *of your actions.* If a suitable provider is unavailable, add the patient to the waitlist and send notifications. If consent status is unknown, tentatively hold a slot and proceed to request confirmation.

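Resolving these instruction hierarchy conflicts elicits much more efficient and performant reasoning from GPT-5. We resolved the contradictions as follows:

Changing auto-assignment to occur after contacting the patient: "auto-assign the earliest same-day slot after informing the patient of your actions." This is consistent with only scheduling when consent is present.
Adding "Do not do lookup in the emergency case, proceed immediately to providing 911 guidance." so the model knows it is fine to skip the lookup in an emergency.

We understand that writing prompts is an iterative process, and that many prompts are "living documents" constantly updated by different stakeholders. But that is all the more reason to thoroughly review them for poorly-worded instructions. Several early users have already found ambiguities and contradictions in their core prompt libraries after conducting such a review: removing them alone drastically streamlined and improved their GPT-5 performance. To help identify these kinds of issues, we recommend testing your prompts in the prompt optimizer tool.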

3-3. Minimal reasoning

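GPT-5 introduces minimal reasoning effort for the first time: it is the fastest option that still reaps the benefits of the reasoning model paradigm. We consider it the best upgrade for latency-sensitive users as well as current users of GPT-4.1.

For the best results, prompting patterns similar to those for GPT-4.1 are recommended. Minimal reasoning performance depends more heavily on the prompt than higher reasoning levels do, so the key points to emphasize are:

Prompting the model to give a brief explanation summarizing its thought process (e.g. as a bullet point list) improves performance on tasks requiring higher intelligence.
Requesting thorough tool-calling preambles improves performance in agentic workflows.
Disambiguating tool instructions as much as possible and inserting agentic persistence reminders are especially important at minimal reasoning, because they maximize agentic ability in long-running rollouts and prevent premature termination.
Prompted planning is likewise important, because the model has fewer reasoning tokens available for internal planning. Below is a sample planning prompt snippet placed at the beginning of an agentic task: the second paragraph especially ensures that the agent fully completes the task and all subtasks before yielding back to the user.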

Remember, you are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Decompose the user's query into all required sub-request, and confirm that each is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure that the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.

You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes each function call made, ensuring the user's query, and related sub-requests are completely resolved.

3-4. Markdown formatting

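By default, GPT-5 in the API does not format its final answers in Markdown, in order to preserve maximum compatibility with developers whose applications may not support Markdown rendering. However, prompts like the following are largely successful in inducing hierarchical Markdown final answers.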

- Use Markdown **only where semantically correct** (e.g., `inline code`, ```code fences```, lists, tables).
- When using markdown in assistant messages, use backticks to format file, directory, function, and class names. Use \( and \) for inline math, \[ and \] for block math.

Occasionally, adherence to Markdown instructions specified in the system prompt can degrade over the course of a long conversation. If you experience this, appending a Markdown instruction every 3-5 user messages can resolve it.
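The periodic reminder is easy to automate on the client side. The sketch below appends a shortened Markdown instruction to every Nth user message; `with_markdown_reminders`, the reminder wording, and the role/content message shape are assumptions for illustration, and the interval of 4 is just one value inside the suggested 3-5 range.

```python
MARKDOWN_REMINDER = (
    "Use Markdown only where semantically correct "
    "(e.g., inline code, code fences, lists, tables)."
)


def with_markdown_reminders(messages: list[dict], every: int = 4) -> list[dict]:
    """Return a copy of the conversation with the reminder on every Nth user message."""
    if every < 1:
        raise ValueError("every must be at least 1")
    result, user_count = [], 0
    for message in messages:
        message = dict(message)  # shallow copy so the input is not mutated
        if message["role"] == "user":
            user_count += 1
            if user_count % every == 0:
                message["content"] = f"{message['content']}\n\n{MARKDOWN_REMINDER}"
        result.append(message)
    return result


conversation = [{"role": "user", "content": f"question {i}"} for i in range(1, 9)]
reminded = with_markdown_reminders(conversation, every=4)
```

Only user messages are counted, so assistant turns in between do not shift the interval.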

3-5. Metaprompting

Finally, on a meta level, early testers have had great success using GPT-5 as a meta-prompter for itself. Several users have already deployed to production prompt revisions generated simply by asking GPT-5 what elements could be added to an unsuccessful prompt to elicit a desired behavior, or removed to prevent an undesired one.

Here is an example of a metaprompt template we liked:

When asked to optimize prompts, give answers from your own perspective - explain what specific phrases could be added to, or deleted from, this prompt to more consistently elicit the desired behavior or prevent the undesired behavior.

Here's a prompt: [PROMPT]

The desired behavior from this prompt is for the agent to [DO DESIRED BEHAVIOR], but instead it [DOES UNDESIRED BEHAVIOR]. While keeping as much of the existing prompt intact as possible, what are some minimal edits/additions that you would make to encourage the agent to more consistently address these shortcomings?
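Filling this template by hand gets tedious once several prompts are under review, so a small helper can substitute the three bracketed slots. In this sketch the placeholders [PROMPT], [DO DESIRED BEHAVIOR], and [DOES UNDESIRED BEHAVIOR] are replaced with format fields; `build_metaprompt` and the example values are hypothetical.

```python
METAPROMPT_TEMPLATE = """When asked to optimize prompts, give answers from your own perspective - explain what specific phrases could be added to, or deleted from, this prompt to more consistently elicit the desired behavior or prevent the undesired behavior.

Here's a prompt: {prompt}

The desired behavior from this prompt is for the agent to {desired}, but instead it {undesired}. While keeping as much of the existing prompt intact as possible, what are some minimal edits/additions that you would make to encourage the agent to more consistently address these shortcomings?"""


def build_metaprompt(prompt: str, desired: str, undesired: str) -> str:
    """Fill the template; values are inserted verbatim and never re-parsed."""
    fields = {"prompt": prompt, "desired": desired, "undesired": undesired}
    for name, value in fields.items():
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    return METAPROMPT_TEMPLATE.format(**fields)


metaprompt = build_metaprompt(
    prompt="You are a support agent. Answer briefly.",
    desired="cite the relevant help-center article",
    undesired="answers from memory without a citation",
)
```

The resulting string can then be sent to GPT-5 as an ordinary user message.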


Han Archive

GPT-5 Prompting Guide Explained
