<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Visualization | YuyaoGe's Website</title><link>https://geyuyao.com/tag/visualization/</link><atom:link href="https://geyuyao.com/tag/visualization/index.xml" rel="self" type="application/rss+xml"/><description>Visualization</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 09 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://geyuyao.com/media/icon_hucac340dfc176d8b4c8a8aa7a23204f12_18561_512x512_fill_lanczos_center_3.png</url><title>Visualization</title><link>https://geyuyao.com/tag/visualization/</link></image><item><title>Long-Insight: Long-running Agent Trajectory Analysis</title><link>https://geyuyao.com/project/long-insight/</link><pubDate>Tue, 09 Dec 2025 00:00:00 +0000</pubDate><guid>https://geyuyao.com/project/long-insight/</guid><description>&lt;style>
.li-lang-bar{
position:sticky;top:0;z-index:50;
display:flex;justify-content:flex-end;gap:8px;
padding:8px 0;margin:-8px 0 16px 0;
backdrop-filter:blur(6px);
}
.li-lang-bar button{
border:1px solid rgba(127,127,127,.35);
background:transparent;
color:inherit;
padding:4px 12px;
border-radius:999px;
font-size:13px;
cursor:pointer;
transition:all .15s ease;
}
.li-lang-bar button:hover{background:rgba(127,127,127,.12);}
.li-lang-bar button.active{
background:#667eea;color:#fff;border-color:#667eea;
}
/* Safe default: until JS decides language, show only EN to avoid flashing both sections */
.li-lang-zh{display:none;}
body[data-li-lang="zh"] .li-lang-zh{display:block;}
body[data-li-lang="zh"] .li-lang-en{display:none;}
body[data-li-lang="en"] .li-lang-zh{display:none;}
body[data-li-lang="en"] .li-lang-en{display:block;}
&lt;/style>
&lt;script>
/* Run as early as possible so the correct language div is visible before user sees the page. */
(function(){
try{
var lang = null;
try{
var url = new URL(window.location.href);
var q = url.searchParams.get('lang');
if(q === 'en' || q === 'zh') lang = q;
}catch(_){}
if(!lang){
try{
var stored = localStorage.getItem('li_lang');
if(stored === 'en' || stored === 'zh') lang = stored;
}catch(_){}
}
if(!lang){
var nav = (navigator.language || navigator.userLanguage || 'en').toLowerCase();
lang = nav.indexOf('zh') === 0 ? 'zh' : 'en';
}
if(document.body){
document.body.setAttribute('data-li-lang', lang);
} else {
document.documentElement.setAttribute('data-li-lang', lang);
document.addEventListener('DOMContentLoaded', function(){
document.body.setAttribute('data-li-lang', lang);
});
}
}catch(_){}
})();
&lt;/script>
&lt;div class="li-lang-bar">
&lt;button type="button" data-lilang="zh" onclick="window.__setLILang &amp;&amp; window.__setLILang('zh')">中文&lt;/button>
&lt;button type="button" data-lilang="en" onclick="window.__setLILang &amp;&amp; window.__setLILang('en')">English&lt;/button>
&lt;/div>
&lt;div class="li-lang-zh" markdown="1">
&lt;h1 id="long-insight长程轨迹分析平台">Long-Insight：长程轨迹分析平台&lt;/h1>
&lt;h2 id="背景当-agent-轨迹长到无法阅读">背景：当 Agent 轨迹长到无法阅读&lt;/h2>
&lt;p>当我们用前沿大模型去解决真实的软件工程任务时，它们会产生极其漫长的执行轨迹。以某 Long-running 数据集的 370 条轨迹为例：&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>模型&lt;/th>
&lt;th>平均 Token 数&lt;/th>
&lt;th>相当于 128K 的倍数&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Kimi-K2-0905&lt;/strong>&lt;/td>
&lt;td>6,837,594&lt;/td>
&lt;td>&lt;strong>52.1 倍&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DeepSeek-V3.1&lt;/strong>&lt;/td>
&lt;td>5,059,423&lt;/td>
&lt;td>&lt;strong>38.5 倍&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GLM-4.6&lt;/strong>&lt;/td>
&lt;td>2,616,703&lt;/td>
&lt;td>&lt;strong>19.9 倍&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Claude-Sonnet-4&lt;/strong>&lt;/td>
&lt;td>977,747&lt;/td>
&lt;td>&lt;strong>7.4 倍&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>一条 &lt;code>application_development&lt;/code> 类型的轨迹平均有 &lt;strong>810 万 Token&lt;/strong>，是 128K 上下文窗口的 &lt;strong>61.9 倍&lt;/strong>。370 条轨迹中&lt;strong>至少 85% 超出了任何模型的上下文窗口&lt;/strong>。&lt;/p>
&lt;p>这意味着两个现实问题：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>人无法阅读&lt;/strong> — 400+ 轮次的交互日志，无论多么耐心的工程师都无法逐行审阅&lt;/li>
&lt;li>&lt;strong>模型也无法处理&lt;/strong> — 即便是百万级上下文窗口，大多数轨迹仍然放不进去&lt;/li>
&lt;/ol>
&lt;p>但这些轨迹中蕴含着极其宝贵的信息。正如我们在 SWE SFT 数据筛选中总结的核心原则：&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>&amp;ldquo;沙子里面有金子就可以，不一定全是金子，但沙子里不能有玻璃碴。&amp;rdquo;&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;p>我们需要的不是盲目挑选&amp;quot;看起来好&amp;quot;的轨迹，而是&lt;strong>精确识别和过滤坏模式（bad patterns）&lt;/strong>。为此，我们需要一套工具来&lt;strong>理解&lt;/strong>这些超长轨迹在做什么、做得好不好。&lt;/p>
&lt;p>这就是 Long-Insight 的由来。&lt;/p>
&lt;h2 id="核心思路">核心思路&lt;/h2>
&lt;p>Long-Insight 解决三个问题：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>看不懂&lt;/strong> → 将线性轨迹分解为结构化的步骤 DAG，每个步骤都有类型、摘要、父子依赖&lt;/li>
&lt;li>&lt;strong>放不下&lt;/strong> → 智能压缩轨迹，在保留因果结构的前提下减少 60–80% 的 Token&lt;/li>
&lt;li>&lt;strong>评不了&lt;/strong> → 两阶段 LLM 自动评分，量化轨迹的难度和质量&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="启发" srcset="
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_9cafd98d00676ed5f3924480c73b0ab9.webp 400w,
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_f6dfbafd07a75b25d3eb2b8ee98a13af.webp 760w,
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_1200x1200_fit_q99_h2_lanczos.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_9cafd98d00676ed5f3924480c73b0ab9.webp"
width="760"
height="427"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="第一部分轨迹步骤分解">第一部分：轨迹步骤分解&lt;/h2>
&lt;h3 id="设计思路">设计思路&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>初始化&lt;/strong>：创建空的 JSON 文件&lt;/li>
&lt;li>&lt;strong>循环&lt;/strong>：逐个读取轨迹 turn → 调用 LLM 分析 → 判断&amp;quot;新步骤&amp;quot;还是&amp;quot;续写&amp;quot; → 更新步骤 DAG&lt;/li>
&lt;li>&lt;strong>结束&lt;/strong>：得到完整的步骤划分 JSON，包含 8 种步骤类型、因果叙述和父子依赖关系&lt;/li>
&lt;/ul>
&lt;p>每个步骤被分类为：任务理解、项目探索、环境准备、代码实现、测试验证、问题调试、文档记录、总结规划。&lt;/p>
&lt;h3 id="宏观分析">宏观分析&lt;/h3>
&lt;p>以一条 Sonnet 4.5 在 SWE-bench 上的轨迹为例：&lt;/p>
&lt;iframe class="li-demo-iframe" data-src="https://geyuyao.com/demos/long-insight/demo_dag_sonnet_12-15.html" src="https://geyuyao.com/demos/long-insight/demo_dag_sonnet_12-15.html?lang=zh" style="width:100%; height:600px; border:1px solid rgba(255,255,255,0.1); border-radius:8px;" loading="lazy">&lt;/iframe>
&lt;p>虽然从宏观上看，轨迹整体呈线形结构，但仔细观察就会发现，Agent 并不是简单地&amp;quot;一条路走到黑&amp;quot;。它在不断地进行&lt;strong>发散（信息收集）→ 收敛（总结规划）→ 试错（回滚）→ 再执行&lt;/strong>的循环。&lt;/p>
&lt;p>整个轨迹可以划分为六个行为阶段：&lt;/p>
&lt;h4 id="第一阶段环境感知与基线建立steps-124">第一阶段：环境感知与基线建立（Steps 1–24）&lt;/h4>
&lt;p>Agent 非常注重&lt;strong>测试优先&lt;/strong>，花费大量精力分析 &lt;code>test_package.py&lt;/code>，通过阅读测试代码来反推需求，而不是盲目猜测。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="第一阶段 DAG" srcset="
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_c5d15d84f51c6a1a131d1fcacafd58bd.webp 400w,
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_e23d0916b791bdd78290dd498686e286.webp 760w,
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_c5d15d84f51c6a1a131d1fcacafd58bd.webp"
width="526"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="第二阶段策略调整与重规划steps-2534">第二阶段：策略调整与重规划（Steps 25–34）&lt;/h4>
&lt;p>经历了步骤 28 的回滚后，Agent 没有急于再次编码，而是转入&amp;quot;测试发现阶段&amp;quot;，通过非侵入式脚本去探测项目状态。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="第二阶段 DAG" srcset="
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1c2abde1ad276dd748074b4d98e4f431.webp 400w,
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1f16de73ffd2b272106da6e4d7280785.webp 760w,
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1c2abde1ad276dd748074b4d98e4f431.webp"
width="759"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="第三阶段基础设施建设steps-3851">第三阶段：基础设施建设（Steps 38–51）&lt;/h4>
&lt;p>在触碰核心算法前，先修复/实现底层依赖，例如 &lt;code>Dimension&lt;/code> 类构造方法和工具函数 &lt;code>execute_decomposition_method&lt;/code>。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="第三阶段 DAG" srcset="
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_6ee14c95b64e256a39cf4ee726a02fd9.webp 400w,
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_17eea17b42f1c5661c4ffdeb9bb5e191.webp 760w,
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_6ee14c95b64e256a39cf4ee726a02fd9.webp"
width="566"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="第四阶段核心算法的逐个实现steps-5279">第四阶段：核心算法的逐个实现（Steps 52–79）&lt;/h4>
&lt;p>这是轨迹中&lt;strong>最长的一段&lt;/strong>。Agent 采用&amp;quot;类比克隆&amp;quot;策略：先攻克最难的基类 &lt;code>PDDP&lt;/code> 的 &lt;code>fit&lt;/code> 方法，一旦跑通，迅速复制到子类 &lt;code>DePDDP&lt;/code>、&lt;code>IPDDP&lt;/code>、&lt;code>KMPDDP&lt;/code>、&lt;code>BisectingKmeans&lt;/code>。每个实现后紧跟单元测试（TDD 模式）。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="第四阶段 DAG" srcset="
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_f7d65e1fb4bd1d826f8e08870820bed4.webp 400w,
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_88c68c453950b9caa831ac4bfb065fe9.webp 760w,
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_f7d65e1fb4bd1d826f8e08870820bed4.webp"
width="571"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="第五阶段修复错误steps-80110">第五阶段：修复错误（Steps 80–110）&lt;/h4>
&lt;p>Agent 不仅修复了一个类，而是&lt;strong>系统性地遍历所有相关类&lt;/strong>，在 &lt;code>fit&lt;/code> 方法入口处统一添加输入验证逻辑。这展示了 Agent 的 &lt;strong>全局一致性（Consistency）&lt;/strong> 意识。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="第五阶段 DAG" srcset="
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_4c16f92971e6f162bf1ce8960b5e2d6c.webp 400w,
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_f0932d87cfcaf6f8af9be236ff1c0363.webp 760w,
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_4c16f92971e6f162bf1ce8960b5e2d6c.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="第六阶段全量回归与交付steps-111120">第六阶段：全量回归与交付（Steps 111–120）&lt;/h4>
&lt;p>运行全量测试套件（57 个测试用例全过），在真实场景下验证，生成交付文档并提交。&lt;/p>
&lt;h3 id="局部结构分析">局部结构分析&lt;/h3>
&lt;h4 id="汇聚结构fan-in">汇聚结构（Fan-in）&lt;/h4>
&lt;p>步骤 15（&lt;code>创建 TODO 和 OVERVIEW 笔记&lt;/code>）的父亲是 &lt;code>[12, 13, 14]&lt;/code> — Agent 在分别查找函数定义、类定义并验证数据加载后，将分散的信息汇聚成一份项目文档。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="汇聚结构" srcset="
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_d8edb61335f6565190bd8ab5e12beab9.webp 400w,
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_fc4ce839a11cfe5ca9e42415d2b6d178.webp 760w,
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_d8edb61335f6565190bd8ab5e12beab9.webp"
width="760"
height="652"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>类似的汇聚模式：&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="汇聚结构 2" srcset="
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1c78f0b9eb5b25ddf78ccb85bb9f1e94.webp 400w,
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_eaa112241cf6ede8497aec62b6d78726.webp 760w,
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1c78f0b9eb5b25ddf78ccb85bb9f1e94.webp"
width="760"
height="395"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="汇聚结构 3" srcset="
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e8c84fe4c43f5d3d3ada8d6709570420.webp 400w,
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e2ed45f8a43b9ada704f91b02aed55fc.webp 760w,
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e8c84fe4c43f5d3d3ada8d6709570420.webp"
width="760"
height="493"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="回溯结构backtrace">回溯结构（Backtrace）&lt;/h4>
&lt;p>步骤 28（&lt;code>中断实施并重新审视&lt;/code>）— Agent 执行了 &lt;code>git checkout&lt;/code> 或 &lt;code>git reset --hard&lt;/code>，代表&lt;strong>对死胡同的剪枝&lt;/strong>。Agent 意识到当前路径是错误的，切断分支，退回之前的状态。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="回溯结构" srcset="
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_49be02e7470797e8685f16887a3f5f8a.webp 400w,
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_a02e8d58db6167aa3cde08f690c7bb3b.webp 760w,
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_49be02e7470797e8685f16887a3f5f8a.webp"
width="760"
height="418"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="第二部分轨迹压缩">第二部分：轨迹压缩&lt;/h2>
&lt;h3 id="为什么必须压缩">为什么必须压缩？&lt;/h3>
&lt;p>不同任务类别的 Token 消耗差异巨大：&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>任务类别&lt;/th>
&lt;th>平均 Token 数&lt;/th>
&lt;th>超出 128K 的倍数&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>application_development&lt;/strong>&lt;/td>
&lt;td>8,135,018&lt;/td>
&lt;td>61.9 倍&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>build_deployment&lt;/strong>&lt;/td>
&lt;td>3,784,482&lt;/td>
&lt;td>28.8 倍&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ui_optimization&lt;/strong>&lt;/td>
&lt;td>1,412,891&lt;/td>
&lt;td>10.7 倍&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>machine_learning&lt;/strong>&lt;/td>
&lt;td>643,637&lt;/td>
&lt;td>4.9 倍&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>frontend_development&lt;/strong>&lt;/td>
&lt;td>294,325&lt;/td>
&lt;td>可处理&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>核心问题：85% 的轨迹超出上下文窗口、过长输入导致评测 LLM 性能下降、API 成本剧增。&lt;/p>
&lt;h3 id="压缩策略">压缩策略&lt;/h3>
&lt;p>核心原则：&lt;strong>保留 Agent 决策相关的所有内容，删除系统元数据和冗余信息。&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>删除&lt;/strong>：&lt;code>uuid&lt;/code>、&lt;code>parentUuid&lt;/code>、&lt;code>timestamp&lt;/code>、&lt;code>sessionId&lt;/code>、&lt;code>version&lt;/code> 等元数据；&lt;code>toolUseResult&lt;/code>（Agent 不可见的系统内部记录）&lt;/li>
&lt;li>&lt;strong>完整保留&lt;/strong>：Agent 的思考过程、代码输出、工具调用（含代码、参数、Todo List）&lt;/li>
&lt;li>&lt;strong>选择性截断&lt;/strong>：用户消息截断至 200 字符、工具结果超过 200 字符时截断&lt;/li>
&lt;/ul>
&lt;h3 id="压缩效果">压缩效果&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>指标&lt;/th>
&lt;th>压缩前&lt;/th>
&lt;th>压缩后&lt;/th>
&lt;th>改善&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>字符数&lt;/strong>&lt;/td>
&lt;td>41,659&lt;/td>
&lt;td>17,296&lt;/td>
&lt;td>&lt;strong>-58.5%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>行数&lt;/strong>&lt;/td>
&lt;td>538&lt;/td>
&lt;td>216&lt;/td>
&lt;td>&lt;strong>-59.9%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>估算 Token&lt;/strong>&lt;/td>
&lt;td>~20,800&lt;/td>
&lt;td>~8,600&lt;/td>
&lt;td>&lt;strong>-58.7%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>核心内容&lt;/strong>&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>&lt;strong>无损保留&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="第三部分自动评分">第三部分：自动评分&lt;/h2>
&lt;h3 id="评分体系">评分体系&lt;/h3>
&lt;p>我们设计了两阶段评分系统，从&lt;strong>任务难度&lt;/strong>和&lt;strong>提升潜力&lt;/strong>两个维度对轨迹进行自动评价：&lt;/p>
&lt;p>&lt;strong>阶段一：任务难度评分（0–10）&lt;/strong>&lt;/p>
&lt;p>输入字段包括问题描述、Issue 数量、评测结果（是否解决、补丁状态、测试日志）、总 Token 数和总轮次数。评估维度：&lt;/p>
&lt;ul>
&lt;li>问题本质复杂度（涉及文件数量、逻辑复杂程度）&lt;/li>
&lt;li>修复难度（需要多深的架构理解）&lt;/li>
&lt;li>问题描述清晰度&lt;/li>
&lt;li>项目复杂度（Issue 数量反映项目规模）&lt;/li>
&lt;li>实际解决难度（Token 消耗、轮次数、测试结果）&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>阶段二：提升潜力评分（0–10）&lt;/strong>&lt;/p>
&lt;p>输入为经压缩的完整对话历史 &lt;code>messages&lt;/code>。分数越高代表轨迹质量越好。定义了多种典型不良模式，不良情况越多，分数越低：&lt;/p>
&lt;ul>
&lt;li>测试尊重度 — 是否认真对待失败的测试&lt;/li>
&lt;li>验证闭环完整性 — 发现→分析→修复→验证 的闭环是否完整&lt;/li>
&lt;li>问题定位准确性 — 是否找到了真正的根因&lt;/li>
&lt;li>行为重复度 — 是否在重复无效操作&lt;/li>
&lt;li>探索效率 — 是否进行了有效的代码探索&lt;/li>
&lt;li>错误应对能力 — 是否良好应对报错信息&lt;/li>
&lt;li>推理质量 — thinking 块是否有实质内容&lt;/li>
&lt;li>轨迹稳定性 — 是否频繁偏离主线&lt;/li>
&lt;li>行为有效性 — 后期操作是否仍然有意义&lt;/li>
&lt;/ul>
&lt;h3 id="实际应用">实际应用&lt;/h3>
&lt;p>我们对 &lt;strong>12,839 条&lt;/strong> Sonnet 4.5 SWE-bench 轨迹进行了自动评分。&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="评分分布" srcset="
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_a7faf91365c93184e4a6b3f1edcce891.webp 400w,
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_d31556507b117630306c2e3697bc0ff8.webp 760w,
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_a7faf91365c93184e4a6b3f1edcce891.webp"
width="760"
height="305"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>根据评分结果进行数据筛选：&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>筛选阈值&lt;/th>
&lt;th>保留轨迹数&lt;/th>
&lt;th>保留比例&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>提升潜力 ≥ 8 分&lt;/td>
&lt;td>11,381&lt;/td>
&lt;td>88.6%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>提升潜力 ≥ 9 分&lt;/td>
&lt;td>999&lt;/td>
&lt;td>7.8%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>评分结果适合用于 SWE SFT 训练数据的质量过滤 — 通过去除低分轨迹中的坏模式（bad patterns），提升训练数据质量，而不是简单地按难度或领域做选择性偏差。&lt;/p>
&lt;/div>
&lt;div class="li-lang-en" markdown="1">
&lt;h1 id="long-insight-a-platform-for-long-running-agent-trajectory-analysis">Long-Insight: A Platform for Long-running Agent Trajectory Analysis&lt;/h1>
&lt;h2 id="background-when-agent-trajectories-are-too-long-to-read">Background: When Agent Trajectories Are Too Long to Read&lt;/h2>
&lt;p>When frontier LLMs tackle real software-engineering tasks, the resulting execution traces are staggeringly long. Take a recent Long-running benchmark of 370 trajectories as an example:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Model&lt;/th>
&lt;th>Avg. Tokens&lt;/th>
&lt;th>Multiple of a 128K Window&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Kimi-K2-0905&lt;/strong>&lt;/td>
&lt;td>6,837,594&lt;/td>
&lt;td>&lt;strong>52.1×&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>DeepSeek-V3.1&lt;/strong>&lt;/td>
&lt;td>5,059,423&lt;/td>
&lt;td>&lt;strong>38.5×&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GLM-4.6&lt;/strong>&lt;/td>
&lt;td>2,616,703&lt;/td>
&lt;td>&lt;strong>19.9×&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Claude-Sonnet-4&lt;/strong>&lt;/td>
&lt;td>977,747&lt;/td>
&lt;td>&lt;strong>7.4×&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>An &lt;code>application_development&lt;/code> trajectory averages &lt;strong>8.1M tokens&lt;/strong>, which is &lt;strong>61.9×&lt;/strong> the size of a 128K context window. &lt;strong>At least 85% of the 370 trajectories exceed any model&amp;rsquo;s context window.&lt;/strong>&lt;/p>
&lt;p>This creates two concrete problems:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Humans can&amp;rsquo;t read them&lt;/strong> — no matter how patient an engineer, 400+ turns of interaction logs cannot be reviewed line by line.&lt;/li>
&lt;li>&lt;strong>Models can&amp;rsquo;t process them&lt;/strong> — even with million-token context windows, most trajectories still won&amp;rsquo;t fit.&lt;/li>
&lt;/ol>
&lt;p>Yet these trajectories contain extremely valuable signals. As we summarized while filtering SWE SFT data:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>&amp;ldquo;It&amp;rsquo;s fine if there&amp;rsquo;s only some gold in the sand — but there can&amp;rsquo;t be broken glass.&amp;rdquo;&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;p>What we need is not blind selection of &amp;ldquo;good-looking&amp;rdquo; trajectories, but &lt;strong>precise identification and filtering of bad patterns&lt;/strong>. To do that, we need tools that &lt;strong>understand&lt;/strong> what these ultra-long trajectories are doing — and how well.&lt;/p>
&lt;p>That is the origin of Long-Insight.&lt;/p>
&lt;h2 id="core-idea">Core Idea&lt;/h2>
&lt;p>Long-Insight solves three problems:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Unreadable&lt;/strong> → decompose linear trajectories into a structured step DAG with types, summaries, and parent–child dependencies.&lt;/li>
&lt;li>&lt;strong>Doesn&amp;rsquo;t fit&lt;/strong> → compress trajectories intelligently, cutting 60–80% of tokens while preserving causal structure.&lt;/li>
&lt;li>&lt;strong>Hard to score&lt;/strong> → score trajectories automatically with a two-stage LLM scorer that quantifies difficulty and quality.&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Inspiration" srcset="
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_9cafd98d00676ed5f3924480c73b0ab9.webp 400w,
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_f6dfbafd07a75b25d3eb2b8ee98a13af.webp 760w,
/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_1200x1200_fit_q99_h2_lanczos.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_hu6f5ce95f944f138ef609bc187e73138b_118672_9cafd98d00676ed5f3924480c73b0ab9.webp"
width="760"
height="427"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="part-1-trajectory-step-decomposition">Part 1: Trajectory Step Decomposition&lt;/h2>
&lt;h3 id="design">Design&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Initialization&lt;/strong>: create an empty JSON file.&lt;/li>
&lt;li>&lt;strong>Loop&lt;/strong>: read trajectory turns one by one → call an LLM to analyze each turn → decide whether it starts a &amp;ldquo;new step&amp;rdquo; or is a &amp;ldquo;continuation&amp;rdquo; → update the step DAG.&lt;/li>
&lt;li>&lt;strong>Finalization&lt;/strong>: obtain the complete step-partition JSON, in which steps carry one of 8 step types, a causal narrative, and parent–child links.&lt;/li>
&lt;/ul>
&lt;p>Each step is classified as one of eight types: Task Understanding, Project Exploration, Environment Setup, Code Implementation, Test Validation, Problem Debugging, Documentation, and Summary &amp;amp; Planning.&lt;/p>
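&lt;p>The loop above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Long-Insight&amp;rsquo;s actual implementation: the names &lt;code>Step&lt;/code>, &lt;code>Decision&lt;/code>, &lt;code>decompose&lt;/code>, and &lt;code>toy_analyzer&lt;/code> are hypothetical, and the LLM call is replaced by a keyword-based stand-in so the example runs offline.&lt;/p>

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class StepType(Enum):
    # The eight step types named in the text.
    TASK_UNDERSTANDING = "Task Understanding"
    PROJECT_EXPLORATION = "Project Exploration"
    ENVIRONMENT_SETUP = "Environment Setup"
    CODE_IMPLEMENTATION = "Code Implementation"
    TEST_VALIDATION = "Test Validation"
    PROBLEM_DEBUGGING = "Problem Debugging"
    DOCUMENTATION = "Documentation"
    SUMMARY_AND_PLANNING = "Summary and Planning"


@dataclass
class Step:
    step_id: int
    step_type: StepType
    summary: str
    parents: List[int] = field(default_factory=list)  # ids of earlier steps
    turns: List[int] = field(default_factory=list)    # indices of covered turns


@dataclass
class Decision:
    is_new_step: bool
    step_type: StepType
    summary: str
    parents: List[int]


# Hypothetical analyzer signature: (current turn, steps so far) -> Decision.
Analyzer = Callable[[str, List[Step]], Decision]


def decompose(turns: List[str], analyze: Analyzer) -> List[Step]:
    steps: List[Step] = []  # initialization: an empty step list
    for turn_index, turn in enumerate(turns):
        decision = analyze(turn, steps)
        if decision.is_new_step or not steps:
            # Keep only parents that already exist, so the graph stays acyclic.
            known = {step.step_id for step in steps}
            parents = [p for p in decision.parents if p in known]
            steps.append(Step(len(steps) + 1, decision.step_type,
                              decision.summary, parents, [turn_index]))
        else:
            # Continuation: attach the turn to the most recent step.
            steps[-1].turns.append(turn_index)
    return steps


def toy_analyzer(turn: str, steps: List[Step]) -> Decision:
    # Stand-in for the LLM: a turn starting with "+" continues the current step.
    if turn.startswith("+") and steps:
        last = steps[-1]
        return Decision(False, last.step_type, last.summary, last.parents)
    step_type = (StepType.TEST_VALIDATION if "pytest" in turn
                 else StepType.PROJECT_EXPLORATION)
    parents = [steps[-1].step_id] if steps else []
    return Decision(True, step_type, turn, parents)


steps = decompose(["ls src", "+ cat src/a.py", "pytest -q"], toy_analyzer)
for step in steps:
    print(step.step_id, step.step_type.value, step.parents, step.turns)
```

&lt;p>Because a step may only name earlier steps as parents, the result is a DAG by construction. In a real pipeline the analyzer would presumably wrap a prompt that shows the LLM the current turn together with the summaries of the steps produced so far.&lt;/p>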
&lt;h3 id="macro-level-analysis">Macro-level Analysis&lt;/h3>
&lt;p>Below is one Sonnet 4.5 trajectory on SWE-bench:&lt;/p>
&lt;iframe class="li-demo-iframe" data-src="https://geyuyao.com/demos/long-insight/demo_dag_sonnet_12-15.html" src="https://geyuyao.com/demos/long-insight/demo_dag_sonnet_12-15.html?lang=en" style="width:100%; height:600px; border:1px solid rgba(255,255,255,0.1); border-radius:8px;" loading="lazy">&lt;/iframe>
&lt;p>Although the trajectory looks roughly linear at the macro level, a closer look reveals that the Agent isn&amp;rsquo;t simply &amp;ldquo;marching straight to the end.&amp;rdquo; It continuously cycles through &lt;strong>diverge (gather information) → converge (summarize &amp;amp; plan) → trial-and-error (rollback) → execute again&lt;/strong>.&lt;/p>
&lt;p>The whole trajectory naturally splits into six behavioral phases:&lt;/p>
&lt;h4 id="phase-1-environment-sensing-and-baseline-building-steps-124">Phase 1: Environment Sensing and Baseline Building (Steps 1–24)&lt;/h4>
&lt;p>The Agent is decidedly &lt;strong>test-first&lt;/strong>, spending substantial effort analyzing &lt;code>test_package.py&lt;/code> and reverse-engineering requirements from test code rather than guessing blindly.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Phase 1 DAG" srcset="
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_c5d15d84f51c6a1a131d1fcacafd58bd.webp 400w,
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_e23d0916b791bdd78290dd498686e286.webp 760w,
/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%283%29_hu503a984001ecb3cdccf44d8827767745_336075_c5d15d84f51c6a1a131d1fcacafd58bd.webp"
width="526"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="phase-2-strategy-adjustment-and-replanning-steps-2534">Phase 2: Strategy Adjustment and Replanning (Steps 25–34)&lt;/h4>
&lt;p>After the rollback at step 28, the Agent doesn&amp;rsquo;t rush back into coding. Instead, it enters a &amp;ldquo;test-discovery phase,&amp;rdquo; probing project state with non-invasive scripts.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Phase 2 DAG" srcset="
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1c2abde1ad276dd748074b4d98e4f431.webp 400w,
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1f16de73ffd2b272106da6e4d7280785.webp 760w,
/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_hucb10cbd15dd4e81cd70ea2a2b14b46f6_414454_1c2abde1ad276dd748074b4d98e4f431.webp"
width="759"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="phase-3-infrastructure-construction-steps-3851">Phase 3: Infrastructure Construction (Steps 38–51)&lt;/h4>
&lt;p>Before touching the core algorithms, the Agent fixes or implements low-level dependencies, for example the &lt;code>Dimension&lt;/code> class constructor and the utility function &lt;code>execute_decomposition_method&lt;/code>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Phase 3 DAG" srcset="
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_6ee14c95b64e256a39cf4ee726a02fd9.webp 400w,
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_17eea17b42f1c5661c4ffdeb9bb5e191.webp 760w,
/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%285%29_hu65fb11108f65949068717b6af1492b61_290221_6ee14c95b64e256a39cf4ee726a02fd9.webp"
width="566"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="phase-4-core-algorithms-one-by-one-steps-5279">Phase 4: Core Algorithms, One by One (Steps 52–79)&lt;/h4>
&lt;p>This is the &lt;strong>longest stretch&lt;/strong> of the trajectory. The Agent adopts an &amp;ldquo;analogy-clone&amp;rdquo; strategy: it first cracks the hardest part, the &lt;code>fit&lt;/code> method of the base class &lt;code>PDDP&lt;/code>, and once that works it quickly replicates the approach in the subclasses &lt;code>DePDDP&lt;/code>, &lt;code>IPDDP&lt;/code>, &lt;code>KMPDDP&lt;/code>, and &lt;code>BisectingKmeans&lt;/code>. Each implementation is immediately followed by unit tests (TDD style).&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Phase 4 DAG" srcset="
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_f7d65e1fb4bd1d826f8e08870820bed4.webp 400w,
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_88c68c453950b9caa831ac4bfb065fe9.webp 760w,
/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%286%29_hu57553b610404cabc366393e60a015494_436445_f7d65e1fb4bd1d826f8e08870820bed4.webp"
width="571"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="phase-5-fixing-errors-steps-80110">Phase 5: Fixing Errors (Steps 80–110)&lt;/h4>
&lt;p>The Agent doesn&amp;rsquo;t just patch a single class — it &lt;strong>systematically walks through every related class&lt;/strong>, uniformly adding input validation at the entry point of &lt;code>fit&lt;/code> methods. This demonstrates the Agent&amp;rsquo;s awareness of &lt;strong>global consistency&lt;/strong>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Phase 5 DAG" srcset="
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_4c16f92971e6f162bf1ce8960b5e2d6c.webp 400w,
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_f0932d87cfcaf6f8af9be236ff1c0363.webp 760w,
/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%287%29_hu35240c3ac3dc7bf556eb32e066b3bb44_496711_4c16f92971e6f162bf1ce8960b5e2d6c.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="phase-6-full-regression-and-delivery-steps-111120">Phase 6: Full Regression and Delivery (Steps 111–120)&lt;/h4>
&lt;p>The Agent runs the full test suite (all 57 test cases pass), validates the result in a real scenario, generates delivery documentation, and submits.&lt;/p>
&lt;h3 id="local-structure-analysis">Local Structure Analysis&lt;/h3>
&lt;h4 id="fan-in">Fan-in&lt;/h4>
&lt;p>Step 15 (&lt;code>Create TODO and OVERVIEW notes&lt;/code>) has parents &lt;code>[12, 13, 14]&lt;/code>: after separately looking up function definitions, looking up class definitions, and verifying data loading, the Agent merges the scattered findings into a single project document.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Fan-in 1" srcset="
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_d8edb61335f6565190bd8ab5e12beab9.webp 400w,
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_fc4ce839a11cfe5ca9e42415d2b6d178.webp 760w,
/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%288%29_hudf785de7a78070806996a47f8c4765cc_260019_d8edb61335f6565190bd8ab5e12beab9.webp"
width="760"
height="652"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
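&lt;p>Given the parent lists in the step-partition JSON, fan-in steps can be found mechanically. The sketch below assumes the DAG has already been loaded into a dictionary that maps each step id to its parent ids; &lt;code>find_fan_in&lt;/code> is a hypothetical helper name, and only the entry for step 15 comes from the example above.&lt;/p>

```python
from typing import Dict, List


def find_fan_in(parents_by_step: Dict[int, List[int]],
                min_parents: int = 2) -> List[int]:
    """Return ids of steps that have at least `min_parents` distinct parents."""
    return sorted(
        step_id
        for step_id, parents in parents_by_step.items()
        if len(set(parents)) >= min_parents
    )


# Step 15 and its parents come from the text; the other entries are made up.
example_dag = {12: [11], 13: [11], 14: [11], 15: [12, 13, 14]}
print(find_fan_in(example_dag))
```

&lt;p>Here only step 15 is reported, since it is the only step with more than one parent.&lt;/p>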
&lt;p>Similar fan-in patterns:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Fan-in 2" srcset="
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1c78f0b9eb5b25ddf78ccb85bb9f1e94.webp 400w,
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_eaa112241cf6ede8497aec62b6d78726.webp 760w,
/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%289%29_huf11e7570515067a3124dd0a61d228c19_218408_1c78f0b9eb5b25ddf78ccb85bb9f1e94.webp"
width="760"
height="395"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Fan-in 3" srcset="
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e8c84fe4c43f5d3d3ada8d6709570420.webp 400w,
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e2ed45f8a43b9ada704f91b02aed55fc.webp 760w,
/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%2810%29_hufdc39511fc7efdabc7d4ef6b1353ca4b_314844_e8c84fe4c43f5d3d3ada8d6709570420.webp"
width="760"
height="493"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
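&lt;p>A fan-in node like Step 15 can be found mechanically once each step records its parents. The snippet below is a minimal sketch under assumptions: the record shape (&lt;code>id&lt;/code>, &lt;code>label&lt;/code>, &lt;code>parents&lt;/code>), the labels for Steps 12 to 14, their shared parent &lt;code>11&lt;/code>, and the function name &lt;code>findFanIn&lt;/code> are all hypothetical, chosen only to mirror the &lt;code>[12, 13, 14]&lt;/code> notation above.&lt;/p>

```javascript
// Hypothetical step records; only Step 15 and its parent list come from the text.
const steps = [
  { id: 12, label: "Search function definitions", parents: [11] },
  { id: 13, label: "Search class definitions", parents: [11] },
  { id: 14, label: "Verify data loading", parents: [11] },
  { id: 15, label: "Create TODO and OVERVIEW notes", parents: [12, 13, 14] },
];

// A step counts as a fan-in node when it has at least `minParents` parents.
function findFanIn(stepList, minParents = 2) {
  return stepList.filter((step) => step.parents.length >= minParents);
}

const fanIn = findFanIn(steps);
console.log(fanIn.map((s) => `${s.id}: [${s.parents.join(", ")}]`));
```

&lt;p>Raising &lt;code>minParents&lt;/code> to 3 would keep only the wider merges, which may be useful when two-parent joins are too common to be informative.&lt;/p>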
&lt;h4 id="backtrace">Backtrace&lt;/h4>
&lt;p>Step 28 (&lt;code>Abort implementation and reassess&lt;/code>): the Agent executed &lt;code>git checkout&lt;/code> or &lt;code>git reset --hard&lt;/code> to &lt;strong>prune a dead-end branch&lt;/strong>. Having realized the current path was wrong, it discarded that work and reverted to an earlier state.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Backtrace" srcset="
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_49be02e7470797e8685f16887a3f5f8a.webp 400w,
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_a02e8d58db6167aa3cde08f690c7bb3b.webp 760w,
/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/image_%2811%29_hu50d42a63f6720231e004a2a3bba72b1e_269993_49be02e7470797e8685f16887a3f5f8a.webp"
width="760"
height="418"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
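&lt;p>One simple way to flag candidate backtrace steps is to scan each step&amp;rsquo;s shell command for the two git commands named above. This is a heuristic sketch, not the tool&amp;rsquo;s actual detector: the step records, the &lt;code>command&lt;/code> field, and the name &lt;code>isBacktrack&lt;/code> are assumptions, and &lt;code>git checkout&lt;/code> also matches ordinary branch switches, so hits would still need review.&lt;/p>

```javascript
// Commands treated as backtracking signals (assumption: only the two named in the text).
const BACKTRACK_PATTERNS = [/\bgit\s+checkout\b/, /\bgit\s+reset\s+--hard\b/];

// Returns true when the command string contains one of the signals.
function isBacktrack(command) {
  return BACKTRACK_PATTERNS.some((re) => re.test(command));
}

// Hypothetical steps with the shell command each one ran.
const steps = [
  { id: 27, command: "python -m pytest tests/" },
  { id: 28, command: "git reset --hard HEAD" },
  { id: 29, command: "git status" },
];

const backtrackIds = steps.filter((s) => isBacktrack(s.command)).map((s) => s.id);
console.log(backtrackIds);
```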
&lt;hr>
&lt;h2 id="part-2-trajectory-compression">Part 2: Trajectory Compression&lt;/h2>
&lt;h3 id="why-compression-is-mandatory">Why Compression Is Mandatory&lt;/h3>
&lt;p>Token consumption varies dramatically across task categories:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task Category&lt;/th>
&lt;th>Avg. Tokens&lt;/th>
&lt;th>Multiples of 128K&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>application_development&lt;/strong>&lt;/td>
&lt;td>8,135,018&lt;/td>
&lt;td>61.9×&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>build_deployment&lt;/strong>&lt;/td>
&lt;td>3,784,482&lt;/td>
&lt;td>28.8×&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ui_optimization&lt;/strong>&lt;/td>
&lt;td>1,412,891&lt;/td>
&lt;td>10.7×&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>machine_learning&lt;/strong>&lt;/td>
&lt;td>643,637&lt;/td>
&lt;td>4.9×&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>frontend_development&lt;/strong>&lt;/td>
&lt;td>294,325&lt;/td>
&lt;td>2.2×&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Core issues: 85% of trajectories exceed any model&amp;rsquo;s context window, over-long inputs degrade the judging LLM&amp;rsquo;s performance, and API costs balloon.&lt;/p>
&lt;h3 id="compression-strategy">Compression Strategy&lt;/h3>
&lt;p>Core principle: &lt;strong>preserve everything relevant to the Agent&amp;rsquo;s decisions; drop system metadata and redundancy.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Drop&lt;/strong>: metadata such as &lt;code>uuid&lt;/code>, &lt;code>parentUuid&lt;/code>, &lt;code>timestamp&lt;/code>, &lt;code>sessionId&lt;/code>, &lt;code>version&lt;/code>; &lt;code>toolUseResult&lt;/code> (internal system records invisible to the Agent).&lt;/li>
&lt;li>&lt;strong>Keep in full&lt;/strong>: the Agent&amp;rsquo;s thinking, code outputs, and tool calls (including code, arguments, and Todo lists).&lt;/li>
&lt;li>&lt;strong>Selectively truncate&lt;/strong>: user messages are capped at 200 characters; tool results are truncated when they exceed 200 characters.&lt;/li>
&lt;/ul>
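&lt;p>The three rules above can be sketched as a single per-record function. This is an illustration under assumptions rather than the project&amp;rsquo;s implementation: the flat record shape with &lt;code>role&lt;/code>, &lt;code>kind&lt;/code>, and &lt;code>text&lt;/code> fields and the names &lt;code>compressRecord&lt;/code> and &lt;code>truncate&lt;/code> are hypothetical; only the dropped key names and the 200-character limit come from the list.&lt;/p>

```javascript
// Metadata keys to drop, taken from the list above.
const DROPPED_KEYS = new Set([
  "uuid", "parentUuid", "timestamp", "sessionId", "version", "toolUseResult",
]);
const MAX_CHARS = 200;

// Cut text down to `limit` characters; shorter text is returned unchanged.
function truncate(text, limit = MAX_CHARS) {
  return text.length > limit ? text.slice(0, limit) : text;
}

// Assumed record shape: { role, kind, text, ...metadata }.
function compressRecord(record) {
  const kept = {};
  for (const [key, value] of Object.entries(record)) {
    if (!DROPPED_KEYS.has(key)) kept[key] = value;
  }
  // Only user messages and tool results are truncated; Agent output is kept in full.
  const truncatable = kept.role === "user" || kept.kind === "tool_result";
  if (truncatable) {
    if (typeof kept.text === "string") kept.text = truncate(kept.text);
  }
  return kept;
}

// Hypothetical raw record for illustration.
const raw = {
  uuid: "abc", parentUuid: null, timestamp: "2025-12-09T00:00:00Z",
  role: "user", kind: "message", text: "x".repeat(500),
};
const compressed = compressRecord(raw);
console.log(Object.keys(compressed), compressed.text.length);
```

&lt;p>Using a drop list rather than a keep list means unknown fields survive by default, which matches the stated principle of preserving anything that might bear on the Agent&amp;rsquo;s decisions.&lt;/p>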
&lt;h3 id="compression-results">Compression Results&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Before&lt;/th>
&lt;th>After&lt;/th>
&lt;th>Improvement&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Characters&lt;/strong>&lt;/td>
&lt;td>41,659&lt;/td>
&lt;td>17,296&lt;/td>
&lt;td>&lt;strong>-58.5%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Lines&lt;/strong>&lt;/td>
&lt;td>538&lt;/td>
&lt;td>216&lt;/td>
&lt;td>&lt;strong>-59.9%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Estimated tokens&lt;/strong>&lt;/td>
&lt;td>~20,800&lt;/td>
&lt;td>~8,600&lt;/td>
&lt;td>&lt;strong>-58.7%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Core content&lt;/strong>&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>100%&lt;/td>
&lt;td>&lt;strong>lossless&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
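&lt;p>The percentages in the table follow from the before and after values as (before − after) / before. The short check below recomputes them; the function name &lt;code>reductionPercent&lt;/code> is introduced here for illustration, and the token figures are the approximate values from the table.&lt;/p>

```javascript
// Relative reduction, expressed as a percentage of the original size.
function reductionPercent(before, after) {
  return ((before - after) / before) * 100;
}

// Values copied from the table above (token counts are estimates).
const rows = [
  { metric: "Characters", before: 41659, after: 17296 },
  { metric: "Lines", before: 538, after: 216 },
  { metric: "Estimated tokens", before: 20800, after: 8600 },
];

const report = rows.map(
  (r) => `${r.metric}: -${reductionPercent(r.before, r.after).toFixed(1)}%`
);
console.log(report.join("\n"));
```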
&lt;hr>
&lt;h2 id="part-3-automatic-scoring">Part 3: Automatic Scoring&lt;/h2>
&lt;h3 id="scoring-system">Scoring System&lt;/h3>
&lt;p>We designed a two-stage scoring pipeline that evaluates trajectories along two dimensions: &lt;strong>task difficulty&lt;/strong> and &lt;strong>improvement potential&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Stage 1: Task Difficulty Score (0–10)&lt;/strong>&lt;/p>
&lt;p>Inputs include the problem description, number of issues, evaluation results (resolved or not, patch status, test logs), total tokens, and total turns. Dimensions assessed:&lt;/p>
&lt;ul>
&lt;li>Inherent problem complexity (number of files involved, logical depth)&lt;/li>
&lt;li>Fix difficulty (how deep an architectural understanding is required)&lt;/li>
&lt;li>Clarity of the problem description&lt;/li>
&lt;li>Project complexity (issue counts as a proxy for project size)&lt;/li>
&lt;li>Empirical solving difficulty (token consumption, turn count, test outcomes)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Stage 2: Improvement-Potential Score (0–10)&lt;/strong>&lt;/p>
&lt;p>The input is the compressed full conversation history (&lt;code>messages&lt;/code>). Higher scores mean higher-quality trajectories. Each dimension below is checked for a stereotypical bad pattern; the more bad patterns observed, the lower the score:&lt;/p>
&lt;ul>
&lt;li>Test-respect — does the Agent take failing tests seriously?&lt;/li>
&lt;li>Validation-loop completeness — is the discover → analyze → fix → verify loop intact?&lt;/li>
&lt;li>Root-cause accuracy — did the Agent locate the actual root cause?&lt;/li>
&lt;li>Behavioral repetition — is it repeating ineffective operations?&lt;/li>
&lt;li>Exploration efficiency — is it exploring code productively?&lt;/li>
&lt;li>Error-handling capability — does it deal well with error messages?&lt;/li>
&lt;li>Reasoning quality — do thinking blocks contain substantive content?&lt;/li>
&lt;li>Trajectory stability — does it frequently drift off the main thread?&lt;/li>
&lt;li>Action effectiveness — are late-stage actions still meaningful?&lt;/li>
&lt;/ul>
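&lt;p>The text does not give the exact mapping from observed bad patterns to a score, and in practice the judgment is made by an LLM. As a rough illustration only, the sketch below assumes a linear penalty of one point per distinct flagged dimension, clamped to the 0 to 10 range; the dimension keys, the penalty, and the name &lt;code>improvementPotentialScore&lt;/code> are all hypothetical.&lt;/p>

```javascript
// Hypothetical short keys for the nine dimensions listed above.
const DIMENSIONS = [
  "test_respect", "validation_loop", "root_cause", "repetition", "exploration",
  "error_handling", "reasoning", "stability", "action_effectiveness",
];

// `flagged` lists the dimensions where a bad pattern was observed.
// Assumption: each distinct flagged dimension costs `penaltyPerPattern` points.
function improvementPotentialScore(flagged, penaltyPerPattern = 1) {
  const unknown = flagged.filter((name) => !DIMENSIONS.includes(name));
  if (unknown.length > 0) {
    throw new Error(`Unknown dimension: ${unknown.join(", ")}`);
  }
  const distinct = new Set(flagged).size;
  return Math.max(0, 10 - penaltyPerPattern * distinct);
}

console.log(improvementPotentialScore([]));
console.log(improvementPotentialScore(["repetition", "root_cause"]));
```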
&lt;h3 id="application">Application&lt;/h3>
&lt;p>We auto-scored &lt;strong>12,839&lt;/strong> Sonnet 4.5 trajectories on SWE-bench, at a cost of roughly &lt;strong>¥1 per trajectory&lt;/strong>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Score distribution" srcset="
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_a7faf91365c93184e4a6b3f1edcce891.webp 400w,
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_d31556507b117630306c2e3697bc0ff8.webp 760w,
/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/project/long-insight/score_distribution_hu4c787f1bba4462257befff7bc7c4199b_163119_a7faf91365c93184e4a6b3f1edcce891.webp"
width="760"
height="305"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Filtering thresholds derived from the scores:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Threshold&lt;/th>
&lt;th>Retained&lt;/th>
&lt;th>Retention&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Improvement potential ≥ 8&lt;/td>
&lt;td>11,381&lt;/td>
&lt;td>88.6%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Improvement potential ≥ 9&lt;/td>
&lt;td>999&lt;/td>
&lt;td>7.8%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>These scores feed directly into the SWE SFT quality filter. The filter removes low-scoring trajectories that exhibit bad patterns, which improves training data quality without introducing selection bias by difficulty or domain.&lt;/p>
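&lt;p>Applying a threshold and reporting retention can be sketched as follows. The field name &lt;code>improvementPotential&lt;/code>, the sample trajectories, and both function names are assumptions for illustration; the last two lines only recompute the retention percentages from the counts in the table above.&lt;/p>

```javascript
// Keep trajectories whose score meets the threshold and report the retained share.
function filterByScore(trajectories, minScore) {
  const retained = trajectories.filter((t) => t.improvementPotential >= minScore);
  const retention = trajectories.length > 0 ? retained.length / trajectories.length : 0;
  return { retained, retention };
}

// Format a retained/total pair as a percentage with one decimal place.
function formatRetention(retainedCount, total) {
  return `${((retainedCount / total) * 100).toFixed(1)}%`;
}

// Hypothetical scored trajectories.
const sample = [
  { id: "t1", improvementPotential: 9 },
  { id: "t2", improvementPotential: 8 },
  { id: "t3", improvementPotential: 7 },
  { id: "t4", improvementPotential: 8 },
];
const result = filterByScore(sample, 8);
console.log(result.retained.map((t) => t.id), result.retention);

// Counts from the table: 11,381 and 999 retained out of 12,839.
console.log(formatRetention(11381, 12839));
console.log(formatRetention(999, 12839));
```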
&lt;/div>
&lt;script>
(function(){
function getInitialLang(){
try{
var url = new URL(window.location.href);
var q = url.searchParams.get('lang');
if(q === 'en' || q === 'zh') return q;
var stored = localStorage.getItem('li_lang');
if(stored === 'en' || stored === 'zh') return stored;
}catch(_){}
var nav = (navigator.language || navigator.userLanguage || 'en').toLowerCase();
return nav.indexOf('zh') === 0 ? 'zh' : 'en';
}
var currentLILang = null;
// Bump LI_DEMO_VERSION when demo HTML files change so browsers bypass cache.
var LI_DEMO_VERSION = '2026-05-27a';
function postLangToIframe(f, lang){
try{
if(f.contentWindow){
f.contentWindow.postMessage({type:'li-demo-set-lang', lang: lang}, '*');
}
}catch(_){}
}
function setLang(lang){
if(lang !== 'en' &amp;&amp; lang !== 'zh') lang = 'en';
currentLILang = lang;
document.body.setAttribute('data-li-lang', lang);
try{ localStorage.setItem('li_lang', lang); }catch(_){}
try{
var url = new URL(window.location.href);
url.searchParams.set('lang', lang);
window.history.replaceState({}, '', url.toString());
}catch(_){}
document.querySelectorAll('.li-lang-bar [data-lilang]').forEach(function(b){
b.classList.toggle('active', b.getAttribute('data-lilang') === lang);
});
document.querySelectorAll('iframe.li-demo-iframe').forEach(function(f){
var base = f.getAttribute('data-src') || (f.src ? f.src.split('?')[0] : '');
if(!base) return;
var desired = base + '?lang=' + lang + '&amp;v=' + LI_DEMO_VERSION;
// attach a one-time load listener so we ALSO send postMessage once the iframe is ready
if(!f.dataset.liLoadHook){
f.addEventListener('load', function(){
postLangToIframe(f, currentLILang || lang);
});
f.dataset.liLoadHook = '1';
}
if(!f.dataset.liInited){
// first time: force the iframe to load with the correct lang
f.setAttribute('src', desired);
f.dataset.liInited = '1';
f.dataset.liLang = lang;
} else if(f.dataset.liLang !== lang){
// language changed: notify via postMessage (no reload). Don't update src to avoid losing user state.
postLangToIframe(f, lang);
f.dataset.liLang = lang;
}
});
}
window.__setLILang = setLang;
setLang(getInitialLang());
})();
&lt;/script></description></item></channel></rss>