1. Background

2. Glossary

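| Term | Meaning | Why it matters |
| --- | --- | --- |
| tool_choice=required | Forces the model to produce a tool call | Triggers JSON-schema constrained decoding |
| tool_call_parser | Parses model output into OpenAI tool_calls | Determines the parsing format for tool calls |
| reasoning_parser | Splits output into reasoning_content and content | Affects both decode-time gating and response-time splitting |
| constrained decoding | Masks logits at sampling time | Once the constraint is active, non-JSON tokens are forbidden |
| json_schema | Strictly constrains the output structure with a JSON schema | required uses it by default |
| ReasonerGrammarBackend | Two-stage grammar wrapper | Lets the reasoning section pass through until think_end_id |
| think_end_id | Token id that ends the reasoning section | Determines when JSON enforcement begins |
| require_reasoning | Per-request boolean indicating that a reasoning section is needed | Determines the initial gating state |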
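
To make the "constrained decoding" row concrete, here is a minimal sketch of logit masking, assuming a toy four-token vocabulary and a greedy sampler. The names `mask_logits`, `greedy_pick`, and `VOCAB` are hypothetical and exist only for illustration; a real grammar backend computes the allowed set from the grammar state instead of hard-coding it.

```python
import math


def mask_logits(logits, allowed_token_ids):
    """Return a copy of logits where every disallowed token is set to -inf."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits):
    """Return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical toy vocabulary: a token id is its position in this list.
VOCAB = ["Sure", "{", '"name"', "}"]
logits = [5.0, 1.0, 0.5, 0.2]  # without a constraint, "Sure" has the top score
allowed = [1]                  # assumed grammar state: output must start with "{"

unconstrained = VOCAB[greedy_pick(logits)]
constrained = VOCAB[greedy_pick(mask_logits(logits, allowed))]
print(unconstrained, constrained)
```

With the mask applied, the free-text token can no longer be sampled even though it has the highest raw score, which is the behavior the table describes as "non-JSON tokens are forbidden".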

3. Call stack analysis

The sequence is split into two phases, which are discussed separately:

Decode-phase call stack

  1. The HTTP entry point hands the request to Serving
  2. ServingChat detects tool_choice=required and selects the json_schema constraint
# Pseudocode: "required" or a specific named function both use json_schema
if tool_choice == "required" or tool_choice names a specific function:
    tool_call_constraint = ('json_schema', schema)
  3. ServingChat also computes whether a reasoning section is needed
# require_reasoning is a boolean
require_reasoning = decide_from_request(enable_thinking or thinking)
  4. TokenizerManager sends the tokenized request to the Scheduler
  5. SchedulerOutputProcessorMixin then processes next_token_id:
if req.grammar is not None:
  # FIXME: this try-except block is for handling unexpected xgrammar issue.
  try:
      if batch.spec_algorithm.is_none():
          # Normal decode: single token
          req.grammar.accept_token(next_token_id)
      elif batch.is_spec_v2:
          # Speculative decode: next_token_id is a list of accepted tokens
          for token_id in next_token_id:
              req.grammar.accept_token(token_id)
  except ValueError as e:
      # Grammar accept_token can raise ValueError if the token is not in the grammar.
      # This can happen if the grammar is not set correctly or the token is invalid.
      logger.error(
          f"Grammar accept_token failed for req {req.rid} with token {next_token_id}: {e}"
      )
      self.abort_request(AbortReq(rid=req.rid))

<aside> ⚠️

  1. When reasoning_parser is enabled, the grammar backend is wrapped in two-stage gating </aside>
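
The two-stage gating described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual ReasonerGrammarBackend code: `ToyJsonGrammar` and `TwoStageGrammar` are hypothetical classes, and the token ids are made up. The only behavior borrowed from the notes is that tokens pass freely until think_end_id, that require_reasoning sets the initial state, and that accept_token raises ValueError on a token the grammar rejects.

```python
class ToyJsonGrammar:
    """Hypothetical stand-in for a real grammar object: accepts only ids in `allowed`."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.accepted = []

    def accept_token(self, token_id):
        if token_id not in self.allowed:
            raise ValueError(f"token {token_id} is not allowed by the grammar")
        self.accepted.append(token_id)


class TwoStageGrammar:
    """Lets reasoning tokens through, then enforces the inner grammar."""

    def __init__(self, inner, think_end_id, require_reasoning):
        self.inner = inner
        self.think_end_id = think_end_id
        # Without a reasoning section, the grammar applies from the first token.
        self.in_reasoning = require_reasoning

    def accept_token(self, token_id):
        if self.in_reasoning:
            # Stage 1: every token passes; think_end_id flips the gate.
            if token_id == self.think_end_id:
                self.in_reasoning = False
            return
        # Stage 2: the inner grammar must accept every remaining token.
        self.inner.accept_token(token_id)


THINK_END = 99  # made-up id for the end-of-reasoning token
g = TwoStageGrammar(ToyJsonGrammar({1, 2, 3}), THINK_END, require_reasoning=True)
for t in [7, 8, THINK_END, 1, 2]:
    g.accept_token(t)
print(g.inner.accepted)
```

In this sketch, tokens 7 and 8 never reach the inner grammar because they arrive before THINK_END. With require_reasoning=False the same token 7 would raise ValueError on the first call, which corresponds to the abort path in the scheduler code shown earlier.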