<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Coding Agent | YuyaoGe's Website</title><link>https://geyuyao.com/tag/coding-agent/</link><atom:link href="https://geyuyao.com/tag/coding-agent/index.xml" rel="self" type="application/rss+xml"/><description>Coding Agent</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 11 Jun 2026 00:00:00 +0000</lastBuildDate><image><url>https://geyuyao.com/media/icon_hucac340dfc176d8b4c8a8aa7a23204f12_18561_512x512_fill_lanczos_center_3.png</url><title>Coding Agent</title><link>https://geyuyao.com/tag/coding-agent/</link></image><item><title>更复杂的 Agent 能带来更好的性能吗？</title><link>https://geyuyao.com/post/swe-agent-vs-mini-swe-agent/</link><pubDate>Thu, 11 Jun 2026 00:00:00 +0000</pubDate><guid>https://geyuyao.com/post/swe-agent-vs-mini-swe-agent/</guid><description>&lt;style>
.li-lang-bar{display:flex;justify-content:flex-end;gap:8px;margin:0 0 20px 0;}
.li-lang-bar button{border:1px solid rgba(127,127,127,.35);background:transparent;color:inherit;padding:4px 14px;border-radius:0;font-size:13px;cursor:pointer;transition:all .15s ease;}
.li-lang-bar button:hover{background:rgba(127,127,127,.12);}
.li-lang-bar button.active{background:#2563eb;color:#fff;border-color:#2563eb;}
.li-lang-zh{display:none;}
body[data-li-lang="zh"] .li-lang-zh{display:block;}
body[data-li-lang="zh"] .li-lang-en{display:none;}
body[data-li-lang="en"] .li-lang-zh{display:none;}
body[data-li-lang="en"] .li-lang-en{display:block;}
&lt;/style>
&lt;script>
(function(){
try{
var lang=null;
try{var q=new URL(window.location.href).searchParams.get('lang');if(q==='en'||q==='zh')lang=q;}catch(_){}
if(!lang){try{var s=localStorage.getItem('li_lang');if(s==='en'||s==='zh')lang=s;}catch(_){}}
if(!lang){var n=(navigator.language||navigator.userLanguage||'en').toLowerCase();lang=n.indexOf('zh')===0?'zh':'en';}
if(document.body){document.body.setAttribute('data-li-lang',lang);}
else{document.documentElement.setAttribute('data-li-lang',lang);document.addEventListener('DOMContentLoaded',function(){document.body.setAttribute('data-li-lang',lang);});}
}catch(_){}
})();
&lt;/script>
&lt;div class="li-lang-bar">
&lt;button type="button" data-lilang="zh" onclick="window.__setLILang &amp;&amp; window.__setLILang('zh')">中文&lt;/button>
&lt;button type="button" data-lilang="en" onclick="window.__setLILang &amp;&amp; window.__setLILang('en')">English&lt;/button>
&lt;/div>
&lt;div class="li-lang-zh" markdown="1">
&lt;p>TLDR: 更复杂的SWE-Agent在SWE-pro bench上相比于mini-swe-agent表现更差且出现了实例卡死&lt;/p>
&lt;p>ps：本文AI率低于20%&lt;/p>
&lt;p>首先抛出一个常见的直觉：&lt;strong>agent 框架做得越完善，性能应该越强。&lt;/strong> 尽管直觉上大家都这么认为，但是目前没有人严谨地证明过。为此，我希望在SWE任务上验证这个直觉是否是正确的🤔。&lt;/p>
&lt;p>💡思路如下：&lt;/p>
&lt;p>在SWE Pro Bench上测试两个复杂度不同的Agent Framework，对比他们各自的得分。&lt;/p>
&lt;p>结果却发现&lt;strong>简单的Agent框架反而获得了更高的性能&lt;/strong>🤯。&lt;/p>
&lt;div class="alert alert-note">
&lt;div>
&lt;p>&lt;strong>背景知识&lt;/strong>&lt;/p>
&lt;p>&lt;strong>什么是 SWE 任务？&lt;/strong> SWE（Software Engineering）任务衡量的是 agent 的端到端真实开发能力：给定一个真实代码仓库和一个 GitHub issue，让 agent 自主地读代码、定位问题、跨文件修改、写出 patch（代码补丁），最后由测试判定是否&amp;quot;解决&amp;quot;。SWE-bench Pro 就是针对这一任务的 benchmark。&lt;/p>
&lt;p>&lt;strong>SWE-agent 与 mini-swe-agent&lt;/strong> 是两个面向 SWE 场景的 agent 框架：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>SWE-agent&lt;/strong> 的核心论点是 Agent-Computer Interface（ACI），即为 agent 精心设计一套专用工具：给它配多种自定义工具、每个工具各有接口；执行则交给独立的 &lt;a href="https://github.com/SWE-agent/SWE-ReX" target="_blank" rel="noopener">&lt;code>SWE-ReX&lt;/code>&lt;/a> 后端，用持久 pexpect 交互式 shell（工作目录、环境变量跨命令保留），且每条命令先过 bashlex 预解析（切分、语法校验、精确抠退出码）。&lt;/li>
&lt;li>&lt;strong>mini-swe-agent&lt;/strong> 是 SWE-agent 的最小实现（整个 agent 类约 100 行 Python）：只有 bash 一个&amp;quot;工具&amp;quot;，连模型的 tool-calling 接口都不用；用 &lt;code>subprocess.run&lt;/code> 执行每条命令，每个 action 完全独立。&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;p>介绍完背景，那么问题来了：&lt;strong>在同模型、同 benchmark 下，相比于mini-SWE-Agent，SWE-Agent这套&amp;quot;更完善&amp;quot;的工程，是否能带来更高的收益呢？？&lt;/strong>&lt;/p>
&lt;h2 id="更完善的-swe-agent-并没有更强">更完善的 SWE-agent 并没有更强&lt;/h2>
&lt;p>我使用 Claude Sonnet 4.5 分别在两个 Agent Framework 上，对 SWE-bench Pro 的全部 731 题进行全量测试，限制了最大调用次数为 50 次。&lt;/p>
&lt;p>结果如下：&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Agent Framework&lt;/th>
&lt;th>N&lt;/th>
&lt;th>resolved&lt;/th>
&lt;th>通过率&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>mini-swe-agent&lt;/strong>&lt;/td>
&lt;td>731&lt;/td>
&lt;td>322&lt;/td>
&lt;td>&lt;strong>44.0%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>SWE-agent&lt;/strong>&lt;/td>
&lt;td>731&lt;/td>
&lt;td>302&lt;/td>
&lt;td>&lt;strong>41.3%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>参照：官方对 Sonnet 4.5 的测试结果约为 43.6%&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>，mini 的 44.0% 与之吻合，说明了我们实验的可信性。&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="分语言通过率对比" srcset="
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_385de0308e1f839c65be1f56e7d58fcb.webp 400w,
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_240e4bf4e45fc54693a0874eb7a5ba71.webp 760w,
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_385de0308e1f839c65be1f56e7d58fcb.webp"
width="760"
height="380"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>极简的 mini-swe-agent 比 SWE-agent 还高了 2.7 个百分点。令人意外的是 &lt;strong>SWE-agent 全量跑到 722/731 后，最后 9 个实例直接卡死&lt;/strong>：容器 Up 5–12 小时、日志连续几小时无更新，只能 kill 掉。而 mini-swe-agent 跑同样这 9 道题却没有此问题。&lt;/p>
&lt;p>于我而言，比起结果上的意外，我更好奇为什么这九个实例会卡死🤔。&lt;/p>
&lt;h2 id="为什么会有-9-个被卡死的容器">为什么会有 9 个被卡死的容器&lt;/h2>
&lt;p>Kill掉容器前，我抓了每个卡死容器的 &lt;code>docker logs&lt;/code>，显示：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">INFO ... 200 OK POST /run_in_session
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">🦖 ERROR Bashlex fail: here-document at line 0 delimited by end-of-file (wanted &amp;#34;&amp;#39;EOF&amp;#39;&amp;#34;)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>容器没死、&lt;code>swerex-remote&lt;/code> 进程还在正常返回 &lt;code>200 OK&lt;/code>——&lt;strong>是 agent 在空转&lt;/strong>。顺着 agent 的 trace 看，它在用 &lt;strong>heredoc 写大文件&lt;/strong>：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">cat &amp;gt; some_file.go &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s">... 一大段 Go 代码 ...
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s">EOF&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>根因找到了：SWE-agent 的执行后端 &lt;code>swe-rex&lt;/code>，会先用 &lt;code>bashlex&lt;/code>（一个纯 Python 写的 bash 解析器）把每条命令解析一遍，再送进容器执行。&lt;/p>
&lt;p>麻烦正出在这一步。&lt;code>bashlex&lt;/code> 对 heredoc 的支持并不完整，碰到 &lt;code>cat &amp;lt;&amp;lt;'EOF' … 一大段代码 … EOF&lt;/code> 这种大块写文件，它会直接解析失败，抛出 &lt;code>Bashlex fail&lt;/code>。&lt;/p>
&lt;p>解析一旦垮掉，swe-rex 就判断不出这条命令到底有没有跑完、退出码是几；agent 收到一份残缺的反馈，又不会换种写法自救，只能把同一条命令一遍遍重试，容器就这样一卡就是 5 到 12 个小时。&lt;/p>
&lt;h2 id="swe-agent-为什么要多此一举地先解析命令">SWE-Agent 为什么要多此一举地先解析命令？&lt;/h2>
&lt;p>swe-rex 维持着一个长期存活的 shell 会话，让工作目录、环境变量、激活的虚拟环境这些状态能在多条命令之间延续。&lt;/p>
&lt;p>但代价是，当命令在一个不断流动的会话里执行时，&amp;ldquo;它到哪儿算结束、返回码是多少&amp;rdquo;就不再像跑完一个独立进程那样一目了然，只能靠 &lt;code>bashlex&lt;/code> 把命令切开、再注入哨兵字符串去输出流里把退出码捞回来。&lt;/p>
&lt;p>好处是：命令既然被解析成了结构，还能做安全检查、命令改写之类更精细的封装，也正是 SWE-agent 主打的 Agent-Computer Interface 思路。它用额外的复杂度，换来了更强的会话语义。&lt;/p>
&lt;p>而 mini-swe-agent 则相反，走的是极简路线。它根本不维持会话，每条命令都用一句 &lt;code>subprocess.run(shell=True)&lt;/code> 直接丢给系统真正的 shell：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># minisweagent/environments/local.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">run&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">command&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shell&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># 交给系统 shell（/bin/sh -c）&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">text&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cwd&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">cwd&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">timeout&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">timeout&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">stdout&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">PIPE&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">stderr&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">STDOUT&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>这么做丢掉了会话状态，每条命令都从头开始，agent 得自己把路径和环境写全；但也正因如此，它绕开了所有&amp;quot;自己解析 bash&amp;quot;的麻烦。heredoc 再大也是真 shell 的本职工作，命令跑完、进程一退出，退出码自然就有了。&lt;/p>
&lt;p>于是同一条写大文件的命令，在 SWE-agent 撞上 &lt;code>bashlex&lt;/code> 的短板、把容器拖死，在 mini-swe-agent 这边却平平无奇地跑了过去。这就是一组很典型的工程取舍——swe-rex 用更高的复杂度换更强的会话语义，也因此多背了一类失败面；mini 放弃了会话的便利，换来更小、更可控的出错空间。&lt;/p>
&lt;h2 id="结语">结语&lt;/h2>
&lt;p>从广义的意义来说，这次的发现证明了奥卡姆剃刀原则：越简单的东西反而是越有效的（也可能是第一性原理）。&lt;/p>
&lt;p>当然，我认为这只是一个非常简单的 toy experiment。这并不能说明更复杂、更精密的 agent 效果就不好，可能只是因为 SWE-agent 恰好有这么一个 bug。也许一个经过精细调教的、更复杂的 agent 可以比 mini agent 更好。&lt;/p>
&lt;p>当然，anyway，这只是一些猜测。接下来我会用更严谨的实验深挖开头提到的问题，感兴趣的朋友欢迎持续关注。&lt;/p>
&lt;h2 id="附录">附录&lt;/h2>
&lt;h3 id="分语言对比">分语言对比&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>语言&lt;/th>
&lt;th>mini&lt;/th>
&lt;th>SWE-agent&lt;/th>
&lt;th>谁优&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>go&lt;/td>
&lt;td>95/280 = &lt;strong>34%&lt;/strong>&lt;/td>
&lt;td>78/280 = 28%&lt;/td>
&lt;td>mini +6pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>python&lt;/td>
&lt;td>139/266 = 52%&lt;/td>
&lt;td>143/266 = &lt;strong>54%&lt;/strong>&lt;/td>
&lt;td>SWE-agent +2pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>js&lt;/td>
&lt;td>77/165 = &lt;strong>47%&lt;/strong>&lt;/td>
&lt;td>73/165 = 44%&lt;/td>
&lt;td>mini +3pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ts&lt;/td>
&lt;td>11/20 = &lt;strong>55%&lt;/strong>&lt;/td>
&lt;td>8/20 = 40%&lt;/td>
&lt;td>mini（N=20，小样本不稳）&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>mini 的优势几乎全部来自 Go，高出 6 个百分点、差 17 题，而这里面光 &lt;code>gravitational/teleport&lt;/code> 一个仓就占了大头（仅 mini 解出的有 16 道，仅 SWE-agent 解出的只有 5 道）。这其实并不意外：Go 题大多是体量大、改动多的重仓，正好是最容易触发 SWE-agent heredoc 卡死的地方。可一旦换到 Python，SWE-agent 反而还略高了 2 个百分点。&lt;/p>
&lt;h3 id="配对显著性">配对显著性&lt;/h3>
&lt;p>再看配对显著性。把 731 题按 instance_id 一一对齐，能分成四类：&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>数量&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>都过&lt;/td>
&lt;td>245&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>仅 mini 过&lt;/td>
&lt;td>77&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>仅 SWE-agent 过&lt;/td>
&lt;td>57&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>都没过&lt;/td>
&lt;td>352&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>真正分出胜负的是那 134 道一边过、一边不过的题，其中 mini 占 77、SWE-agent 占 57，确实偏向 mini，但 McNemar 精确检验给出的 p 值是 0.10，谈不上显著。更能说明问题的是另外两类：245 道两边都解得出、352 道两边都解不出，这说明它们能覆盖的题其实高度重叠。&lt;/p>
&lt;h3 id="卡死的公平性核算">卡死的公平性核算&lt;/h3>
&lt;p>最后补一笔公平账，对于那 9 个卡死的实例（flipt 5 个、teleport 2 个，加上 vuls 和 tutanota 各 1 个）逐题核对下来，真正算得上&amp;quot;不公平丢分&amp;quot;的其实只有 2 个，也就是 mini 能解、而 SWE-agent 仅仅因为卡死被记了 0 分的 &lt;code>vuls e4728e38&lt;/code> 和 &lt;code>teleport 47530e1f&lt;/code>。就算把这 2 分补回去，SWE-agent 也不过从 302 升到 304（41.6%），mini 仍是 44.0%，差距反而更小，&amp;quot;不显著&amp;quot;的结论丝毫没变。况且换个角度想，执行后端稳不稳本来就是一个 agent 端到端能力的一部分，在&amp;quot;衡量整套 scaffold&amp;quot;的口径下，记 0 并不算冤枉它😅。&lt;/p>
&lt;/div>
&lt;div class="li-lang-en" markdown="1">
&lt;p>&lt;strong>TL;DR.&lt;/strong> On SWE-bench Pro, the more elaborate SWE-agent underperforms the minimalist mini-swe-agent, and additionally suffers from instances that hang indefinitely.&lt;/p>
&lt;p>A widely held intuition is that &lt;strong>the more complete an agent framework is, the better it should perform.&lt;/strong> Although this assumption is rarely questioned, to my knowledge it has never been rigorously verified. I therefore set out to test it on a software-engineering (SWE) task.&lt;/p>
&lt;p>The design is simple: evaluate two agent frameworks of differing complexity on SWE-bench Pro and compare their scores. The result was counterintuitive: &lt;strong>the simpler framework achieved the higher score.&lt;/strong>&lt;/p>
&lt;div class="alert alert-note">
&lt;div>
&lt;p>&lt;strong>Background&lt;/strong>&lt;/p>
&lt;p>&lt;strong>What is an SWE task?&lt;/strong> A software-engineering (SWE) task measures an agent&amp;rsquo;s end-to-end, real-world development ability: given a real code repository and a GitHub issue, the agent must autonomously read the code, localize the problem, make cross-file edits, and produce a patch, which is then judged &amp;ldquo;resolved&amp;rdquo; by a suite of hidden tests. SWE-bench Pro is a benchmark targeting exactly this kind of task.&lt;/p>
&lt;p>&lt;strong>SWE-agent and mini-swe-agent&lt;/strong> are two agent frameworks built for the SWE setting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>SWE-agent&lt;/strong> is organized around the notion of an Agent-Computer Interface (ACI): a carefully designed set of dedicated tools, each with its own interface. Execution is delegated to a separate &lt;a href="https://github.com/SWE-agent/SWE-ReX" target="_blank" rel="noopener">&lt;code>SWE-ReX&lt;/code>&lt;/a> backend, which maintains a persistent &lt;code>pexpect&lt;/code> interactive shell (the working directory and environment variables persist across commands) and pre-parses every command with &lt;code>bashlex&lt;/code> (splitting, syntax validation, and precise exit-code extraction).&lt;/li>
&lt;li>&lt;strong>mini-swe-agent&lt;/strong> is a minimal reimplementation (the agent class is roughly 100 lines of Python): it exposes a single tool, bash, and does not even rely on the model&amp;rsquo;s tool-calling interface; each command is executed through &lt;code>subprocess.run&lt;/code>, with every action fully independent.&lt;/li>
&lt;/ul>
&lt;/div>
&lt;/div>
&lt;p>With this background in place, the central question becomes: &lt;strong>under the same model and the same benchmark, does SWE-agent&amp;rsquo;s heavier engineering actually translate into a higher payoff than mini-swe-agent?&lt;/strong>&lt;/p>
&lt;h2 id="the-more-elaborate-swe-agent-is-not-stronger">The more elaborate SWE-agent is not stronger&lt;/h2>
&lt;p>Using Claude Sonnet 4.5, I evaluated both frameworks on the full 731-problem SWE-bench Pro public set, capping the call budget at 50 per problem.&lt;/p>
&lt;p>The results are as follows:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Agent framework&lt;/th>
&lt;th>N&lt;/th>
&lt;th>resolved&lt;/th>
&lt;th>resolve rate&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>mini-swe-agent&lt;/strong>&lt;/td>
&lt;td>731&lt;/td>
&lt;td>322&lt;/td>
&lt;td>&lt;strong>44.0%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>SWE-agent&lt;/strong>&lt;/td>
&lt;td>731&lt;/td>
&lt;td>302&lt;/td>
&lt;td>&lt;strong>41.3%&lt;/strong>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>For reference, the officially reported figure for Sonnet 4.5 is around 43.6%&lt;sup id="fnref1:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>; mini&amp;rsquo;s 44.0% aligns closely with it, which lends credibility to the present setup.&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Per-language resolve rate" srcset="
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_385de0308e1f839c65be1f56e7d58fcb.webp 400w,
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_240e4bf4e45fc54693a0874eb7a5ba71.webp 760w,
/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/swe-agent-vs-mini-swe-agent/figures/lang_compare_hu04fac4dd4ff9f738eeed95b0b0ae5b98_87878_385de0308e1f839c65be1f56e7d58fcb.webp"
width="760"
height="380"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The minimalist mini-swe-agent scores 2.7 percentage points higher than SWE-agent. More surprisingly, &lt;strong>after SWE-agent had completed 722 of the 731 instances, its final 9 instances hung outright&lt;/strong>: the containers stayed up for 5 to 12 hours with no log activity for hours on end, and had to be killed manually (and were thus scored 0). On the very same 9 problems, mini-swe-agent exhibited no such behavior.&lt;/p>
&lt;p>For me, beyond the surprise in the numbers, the more intriguing question was why these 9 instances hung in the first place.&lt;/p>
&lt;h2 id="why-did-9-containers-hang">Why did 9 containers hang?&lt;/h2>
&lt;p>Before killing the containers, I captured the &lt;code>docker logs&lt;/code> of each one:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">INFO ... 200 OK POST /run_in_session
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">🦖 ERROR Bashlex fail: here-document at line 0 delimited by end-of-file (wanted &amp;#34;&amp;#39;EOF&amp;#39;&amp;#34;)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The containers were alive and &lt;code>swerex-remote&lt;/code> was still returning &lt;code>200 OK&lt;/code>; in other words, the agent was merely spinning in place. Tracing the agent back, it was writing a large file via a heredoc:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">cat &amp;gt; some_file.go &lt;span class="s">&amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s">... a large block of Go code ...
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s">EOF&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This pins down the root cause: SWE-agent&amp;rsquo;s execution backend, &lt;code>swe-rex&lt;/code>, first parses every command with &lt;code>bashlex&lt;/code> (a pure-Python bash parser) before dispatching it into the container.&lt;/p>
&lt;p>That is precisely where things break. &lt;code>bashlex&lt;/code>&amp;rsquo;s support for heredocs is incomplete; confronted with a large block-write such as &lt;code>cat &amp;lt;&amp;lt;'EOF' … a large block of code … EOF&lt;/code>, it fails to parse the command, and the backend logs &lt;code>Bashlex fail&lt;/code>.&lt;/p>
&lt;p>Once parsing collapses, swe-rex can no longer determine whether the command finished or what its exit code was. The agent receives a malformed observation, does not recover by attempting a different approach, and simply retries the same command over and over, leaving the container hung for 5 to 12 hours.&lt;/p>
&lt;h2 id="why-does-swe-agent-bother-parsing-commands-in-the-first-place">Why does SWE-agent bother parsing commands in the first place?&lt;/h2>
&lt;p>The answer follows from its design goals. swe-rex maintains a long-lived shell session, so that state such as the working directory, environment variables, and any activated virtual environment persists across commands.&lt;/p>
&lt;p>The cost, however, is that when commands run inside a continuously flowing session, &amp;ldquo;where a command ends and what its return code is&amp;rdquo; is no longer as self-evident as it is when a standalone process exits. swe-rex must therefore rely on &lt;code>bashlex&lt;/code> to split commands and inject sentinel strings in order to recover exit codes from the output stream.&lt;/p>
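&lt;p>The sentinel idea can be illustrated with a toy persistent shell. This is a minimal sketch under stated assumptions, not the SWE-ReX implementation: the class name &lt;code>PersistentShell&lt;/code>, the sentinel format, and the use of a plain pipe instead of &lt;code>pexpect&lt;/code> are all hypothetical, and it assumes &lt;code>bash&lt;/code> is on the &lt;code>PATH&lt;/code>.&lt;/p>

```python
import subprocess
import uuid


class PersistentShell:
    """Toy long-lived bash session (hypothetical; not the SWE-ReX code)."""

    def __init__(self):
        # One bash process stays alive across commands, so state such as
        # the working directory carries over from one call to the next.
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )

    def run(self, command):
        """Run one command; return (output, exit_code)."""
        # A fresh random marker that real output is unlikely to contain.
        sentinel = "__DONE_" + uuid.uuid4().hex + "__"
        # After the command, ask the shell to print the marker and $?.
        self.proc.stdin.write(command + "\n" + "echo " + sentinel + " $?\n")
        self.proc.stdin.flush()
        chunks = []
        while True:
            line = self.proc.stdout.readline()
            if line == "":
                raise RuntimeError("shell exited before the sentinel appeared")
            idx = line.find(sentinel)
            if idx >= 0:
                chunks.append(line[:idx])
                exit_code = int(line[idx + len(sentinel):].split()[0])
                return "".join(chunks), exit_code
            chunks.append(line)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()


shell = PersistentShell()
_, cd_code = shell.run("cd /")
pwd_out, pwd_code = shell.run("pwd")   # sees the earlier cd
_, false_code = shell.run("false")     # non-zero status recovered via sentinel
shell.close()
print(pwd_out.strip(), cd_code, pwd_code, false_code)
```

&lt;p>Note the fragility of this sketch: if a command leaves the shell waiting for more input (for example an unterminated heredoc), the &lt;code>echo&lt;/code> line is consumed as part of that input, the sentinel never appears, and &lt;code>run&lt;/code> blocks. Whether SWE-ReX fails in the same way is not something this sketch establishes.&lt;/p>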
&lt;p>The benefit is that, once a command has been parsed into a structured form, the backend can additionally perform safety checks, command rewriting, and other fine-grained wrapping. This is exactly the Agent-Computer Interface philosophy that SWE-agent champions; on its own terms, the design is sound, trading extra complexity for stronger session semantics.&lt;/p>
&lt;p>mini-swe-agent takes the opposite, minimalist route. It maintains no session at all; every command is handed directly to the system&amp;rsquo;s real shell through a single &lt;code>subprocess.run(shell=True)&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># minisweagent/environments/local.py&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">result&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">run&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">command&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">shell&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># hand it to the system shell (/bin/sh -c)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">text&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cwd&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">cwd&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">timeout&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">timeout&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">stdout&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">PIPE&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">stderr&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">subprocess&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">STDOUT&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This sacrifices session state (each command starts afresh, and the agent must spell out paths and environment itself), but for exactly that reason it sidesteps all the trouble of parsing bash in-process. A heredoc, however large, is the real shell&amp;rsquo;s native job; once the command finishes and the process exits, the exit code is simply there.&lt;/p>
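&lt;p>The trade-off described above can be observed directly. The following is a minimal sketch, assuming a POSIX system with &lt;code>/bin/sh&lt;/code>; the helper name &lt;code>run_stateless&lt;/code> and the scratch directory are illustrative and do not come from mini-swe-agent.&lt;/p>

```python
import os
import subprocess
import tempfile


def run_stateless(command, cwd, timeout=30):
    """Run one command in a fresh shell; return (exit_code, output)."""
    result = subprocess.run(
        command,
        shell=True,               # a new shell process for every call
        text=True,
        cwd=cwd,
        timeout=timeout,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return result.returncode, result.stdout


workdir = tempfile.mkdtemp()  # scratch directory used only for this demo

# The cd below changes directory only inside the shell that runs it.
first_code, first_out = run_stateless("cd / ; pwd", cwd=workdir)
# The next call starts from cwd again: no state carried over.
second_code, second_out = run_stateless("pwd", cwd=workdir)
# The exit status comes straight from the finished process.
third_code, _ = run_stateless("exit 3", cwd=workdir)

print(first_out.strip(), second_out.strip(), third_code)
os.rmdir(workdir)
```

&lt;p>No parsing or sentinel is needed here: the operating system reports the exit status when the child process ends, and the &lt;code>timeout&lt;/code> argument bounds how long any single command can run.&lt;/p>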
&lt;p>As a result, the same large-file-writing command that hangs the container under SWE-agent, by hitting &lt;code>bashlex&lt;/code>&amp;rsquo;s limitation, runs uneventfully under mini-swe-agent. This is a textbook engineering trade-off: swe-rex buys stronger session semantics at the price of an additional failure surface, whereas mini forgoes the convenience of a session in exchange for a smaller and more controllable space of errors.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Broadly speaking, this finding echoes Occam&amp;rsquo;s razor: when capability is comparable, the simpler solution is often the more effective, and arguably the more robust, one.&lt;/p>
&lt;p>That said, this is admittedly a small toy experiment. It does not establish that more complex, more sophisticated agents are necessarily worse; SWE-agent&amp;rsquo;s shortfall here may simply reflect one specific bug. A carefully tuned, more elaborate agent could well surpass mini.&lt;/p>
&lt;p>These remain, for now, preliminary conjectures. In follow-up work I intend to investigate the opening question more rigorously: whether a more complete agent is genuinely worth it. Stay tuned.&lt;/p>
&lt;h2 id="appendix">Appendix&lt;/h2>
&lt;h3 id="per-language-comparison">Per-language comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Language&lt;/th>
&lt;th>mini&lt;/th>
&lt;th>SWE-agent&lt;/th>
&lt;th>Winner&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>go&lt;/td>
&lt;td>95/280 = &lt;strong>34%&lt;/strong>&lt;/td>
&lt;td>78/280 = 28%&lt;/td>
&lt;td>mini +6pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>python&lt;/td>
&lt;td>139/266 = 52%&lt;/td>
&lt;td>143/266 = &lt;strong>54%&lt;/strong>&lt;/td>
&lt;td>SWE-agent +2pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>js&lt;/td>
&lt;td>77/165 = &lt;strong>47%&lt;/strong>&lt;/td>
&lt;td>73/165 = 44%&lt;/td>
&lt;td>mini +3pp&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>ts&lt;/td>
&lt;td>11/20 = &lt;strong>55%&lt;/strong>&lt;/td>
&lt;td>8/20 = 40%&lt;/td>
&lt;td>mini (N=20, small sample)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>mini&amp;rsquo;s advantage comes almost entirely from Go, where it leads by 6 percentage points (17 problems); within Go, a single repository, &lt;code>gravitational/teleport&lt;/code>, accounts for most of it (16 solved only by mini versus 5 only by SWE-agent). This is unsurprising: Go problems tend to involve large, heavily modified repositories, which is precisely where SWE-agent&amp;rsquo;s heredoc hang is most likely to be triggered. On Python, by contrast, SWE-agent is actually 2 points higher.&lt;/p>
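&lt;p>The per-language figures can be cross-checked against the headline totals with a few lines of arithmetic. This is a small sketch that only re-uses the counts from the table above; the variable names are illustrative.&lt;/p>

```python
# (language, N, resolved by mini, resolved by SWE-agent), from the table above
rows = [
    ("go", 280, 95, 78),
    ("python", 266, 139, 143),
    ("js", 165, 77, 73),
    ("ts", 20, 11, 8),
]


def pct(solved, total):
    """Resolve rate in percent."""
    return 100.0 * solved / total


total_n = sum(n for _, n, _, _ in rows)
mini_total = sum(m for _, _, m, _ in rows)
swe_total = sum(s for _, _, _, s in rows)

# Positive gap means mini is ahead, in percentage points.
gaps = {lang: pct(m, n) - pct(s, n) for lang, n, m, s in rows}

for lang, n, m, s in rows:
    print(f"{lang:7s} mini {pct(m, n):5.1f}%  SWE-agent {pct(s, n):5.1f}%  gap {gaps[lang]:+5.1f}pp")
print(f"overall mini {pct(mini_total, total_n):.1f}%  SWE-agent {pct(swe_total, total_n):.1f}%")
```

&lt;p>The per-language counts sum to the same 731 problems and the same 322 and 302 resolved instances reported in the main table.&lt;/p>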
&lt;h3 id="paired-significance">Paired significance&lt;/h3>
&lt;p>Aligning all 731 problems by &lt;code>instance_id&lt;/code> yields four categories:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>Count&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Both solved&lt;/td>
&lt;td>245&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Only mini&lt;/td>
&lt;td>77&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Only SWE-agent&lt;/td>
&lt;td>57&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Neither&lt;/td>
&lt;td>352&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>What actually separates the two are the 134 problems solved by exactly one side: 77 for mini and 57 for SWE-agent. The tilt favors mini, but McNemar&amp;rsquo;s exact test yields p = 0.10, which is not significant. More telling are the other two categories: 245 problems solved by both and 352 solved by neither, indicating that the two scaffolds cover a highly overlapping set of problems.&lt;/p>
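&lt;p>For readers who want to reproduce the p-value, the exact McNemar test reduces to a two-sided binomial test on the discordant pairs. The sketch below uses only the Python standard library; the function name &lt;code>mcnemar_exact&lt;/code> is my own, and the original analysis may have used a statistics package instead.&lt;/p>

```python
from math import comb


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    k = min(only_a, only_b)
    # Probability of a count at least as small as k, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(77, 57)  # only mini vs. only SWE-agent
print(f"p = {p_value:.3f}")
```

&lt;p>Only the 134 discordant problems enter the test; the 245 and 352 concordant problems carry no information about which scaffold is better.&lt;/p>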
&lt;h3 id="fairness-accounting-for-the-hangs">Fairness accounting for the hangs&lt;/h3>
&lt;p>Finally, a fairness check. Of the 9 hung instances (5 from flipt, 2 from teleport, and one each from vuls and tutanota), only 2 constitute genuinely &amp;ldquo;unfair&amp;rdquo; losses, i.e., problems that mini solved but on which SWE-agent was scored 0 purely because it hung: &lt;code>vuls e4728e38&lt;/code> and &lt;code>teleport 47530e1f&lt;/code>. Even crediting those 2 back, SWE-agent rises only from 302 to 304 (41.6%), while mini remains at 44.0%; the gap narrows but the &amp;ldquo;not significant&amp;rdquo; conclusion is unchanged. Moreover, the robustness of the execution backend is itself part of an agent&amp;rsquo;s end-to-end capability, so scoring 0 is not unfair under a &amp;ldquo;whole-scaffold&amp;rdquo; evaluation.&lt;/p>
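&lt;p>The credited-back scenario amounts to moving 2 problems from the &amp;ldquo;only mini&amp;rdquo; cell to the &amp;ldquo;both solved&amp;rdquo; cell of the paired table. A short sketch of that bookkeeping, using the counts reported above (the dictionary keys are illustrative names):&lt;/p>

```python
TOTAL = 731
table = {"both": 245, "only_mini": 77, "only_swe": 57, "neither": 352}
credited = 2  # hung instances that mini solved and SWE-agent was scored 0 on

# Move the credited problems from "only mini" to "both solved".
adjusted = dict(table)
adjusted["only_mini"] -= credited
adjusted["both"] += credited


def rate(solved):
    """Resolve rate in percent over the full benchmark."""
    return 100.0 * solved / TOTAL


mini_total = adjusted["both"] + adjusted["only_mini"]
swe_total = adjusted["both"] + adjusted["only_swe"]
gap_pp = rate(mini_total) - rate(swe_total)

print(f"mini {mini_total} ({rate(mini_total):.1f}%)  "
      f"SWE-agent {swe_total} ({rate(swe_total):.1f}%)  gap {gap_pp:.1f}pp")
```

&lt;p>mini&amp;rsquo;s total is unchanged because it had already solved both problems; only SWE-agent&amp;rsquo;s count moves.&lt;/p>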
&lt;/div>
&lt;script>
(function(){
function getInitialLang(){
try{
var q=new URL(window.location.href).searchParams.get('lang');
if(q==='en'||q==='zh')return q;
var s=localStorage.getItem('li_lang');
if(s==='en'||s==='zh')return s;
}catch(_){}
var n=(navigator.language||navigator.userLanguage||'en').toLowerCase();
return n.indexOf('zh')===0?'zh':'en';
}
function setLang(lang){
if(lang!=='en'&amp;&amp;lang!=='zh')lang='en';
document.body.setAttribute('data-li-lang',lang);
try{localStorage.setItem('li_lang',lang);}catch(_){}
try{var url=new URL(window.location.href);url.searchParams.set('lang',lang);window.history.replaceState({},'',url.toString());}catch(_){}
document.querySelectorAll('.li-lang-bar [data-lilang]').forEach(function(b){
b.classList.toggle('active',b.getAttribute('data-lilang')===lang);
});
}
window.__setLILang=setLang;
setLang(getInitialLang());
})();
&lt;/script>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>SWE-bench Pro, &lt;a href="https://arxiv.org/abs/2509.16941" target="_blank" rel="noopener">https://arxiv.org/abs/2509.16941&lt;/a>&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&amp;#160;&lt;a href="#fnref1:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>利用 SKILL让 Agent 自动筛选 &amp; 解读 Huggingface 的每日论文</title><link>https://geyuyao.com/post/auto_paper_reader/</link><pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate><guid>https://geyuyao.com/post/auto_paper_reader/</guid><description>&lt;p style="display:flex;gap:16px;flex-wrap:wrap;align-items:center;justify-content:center;margin:10px 0 30px;">
&lt;a href="https://github.com/YuyaoGe/Paper_Agent_Skill" target="_blank" rel="noopener">
&lt;img src="https://img.shields.io/badge/SKILL-2563EB?style=for-the-badge&amp;logo=github&amp;logoColor=white" alt="SKILL" style="height:56px;">
&lt;/a>
&lt;a href="https://yuyaoge.github.io/paper_reader/" target="_blank" rel="noopener">
&lt;img src="https://img.shields.io/badge/WebSite-16A34A?style=for-the-badge&amp;logo=googlechrome&amp;logoColor=white" alt="WebSite" style="height:56px;">
&lt;/a>
&lt;/p>
&lt;style>
.li-lang-bar{display:flex;justify-content:flex-end;gap:8px;margin:0 0 20px 0;}
.li-lang-bar button{border:1px solid rgba(127,127,127,.35);background:transparent;color:inherit;padding:4px 14px;border-radius:0;font-size:13px;cursor:pointer;transition:all .15s ease;}
.li-lang-bar button:hover{background:rgba(127,127,127,.12);}
.li-lang-bar button.active{background:#2563eb;color:#fff;border-color:#2563eb;}
/* 默认仅显示英文，待 JS 决定语言后再切换，避免双语闪烁 */
.li-lang-zh{display:none;}
body[data-li-lang="zh"] .li-lang-zh{display:block;}
body[data-li-lang="zh"] .li-lang-en{display:none;}
body[data-li-lang="en"] .li-lang-zh{display:none;}
body[data-li-lang="en"] .li-lang-en{display:block;}
&lt;/style>
&lt;script>
/* 尽早设置语言，避免用户看到双语闪烁 */
(function(){
try{
var lang=null;
try{var q=new URL(window.location.href).searchParams.get('lang');if(q==='en'||q==='zh')lang=q;}catch(_){}
if(!lang){try{var s=localStorage.getItem('li_lang');if(s==='en'||s==='zh')lang=s;}catch(_){}}
if(!lang){var n=(navigator.language||navigator.userLanguage||'en').toLowerCase();lang=n.indexOf('zh')===0?'zh':'en';}
if(document.body){document.body.setAttribute('data-li-lang',lang);}
else{document.documentElement.setAttribute('data-li-lang',lang);document.addEventListener('DOMContentLoaded',function(){document.body.setAttribute('data-li-lang',lang);});}
}catch(_){}
})();
&lt;/script>
&lt;div class="li-lang-bar">
&lt;button type="button" data-lilang="zh" onclick="window.__setLILang &amp;&amp; window.__setLILang('zh')">中文&lt;/button>
&lt;button type="button" data-lilang="en" onclick="window.__setLILang &amp;&amp; window.__setLILang('en')">English&lt;/button>
&lt;/div>
&lt;div class="li-lang-zh" markdown="1">
&lt;h2 id="起因">起因&lt;/h2>
&lt;p>作为研究生，我们每天都要读很多论文。但除了读论文本身，&lt;strong>挑选论文同样很耗费时间和精力&lt;/strong>。&lt;/p>
&lt;p>能帮忙分析论文的软件其实不少，比如：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://www.alphaxiv.org/" target="_blank" rel="noopener">AlphaXiv&lt;/a>&lt;/strong>：可以非常系统地精读某一篇论文。&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://papers.cool/" target="_blank" rel="noopener">papers.cool&lt;/a>&lt;/strong>：由&lt;a href="https://kexue.fm/" target="_blank" rel="noopener">苏神&lt;/a>开发，每天爬取 arXiv 新论文，用 Kimi 给出中文解读，并支持在 Kimi 里继续追问、深入分析。&lt;/li>
&lt;/ol>
&lt;p>尽管这些工具在论文解读上做得很深入，但它们有一个共同的缺点：&lt;strong>没有起到筛选的作用&lt;/strong>。它们能把某一篇论文读得很细，可“从每天几十篇里挑出值得读的”仍然得我自己来做，它们也不会帮我打标签。此外，更麻烦的是，即便发现一篇不错的论文、想 follow 它的工作，也常常会发现它的 GitHub repo 是空的，甚至根本没有链接，最后白忙活一场。&lt;/p>
&lt;p>这样的沉没成本其实很高：follow 一篇论文，可能写了很久的代码，最后才发现某个输入对不上，或者某一步根本复现不了。&lt;/p>
&lt;p>所以我想做一个真正有用、能投入日常使用的工具，把时间省下来花在有意义的工作上。&lt;/p>
&lt;h1 id="思路">思路&lt;/h1>
&lt;h2 id="需求">需求&lt;/h2>
&lt;p>具体来说，我希望它能帮我筛选每天 &lt;a href="https://huggingface.co/papers" target="_blank" rel="noopener">HuggingFace Daily Papers&lt;/a> 里的论文，并且能按分类整理好。&lt;/p>
&lt;blockquote>
&lt;p>HuggingFace Daily Papers 中的论文由作者自主提交，这种积极性使得它们相比 arXiv 上的一般论文，往往完善度、可信度和质量更高。&lt;/p>
&lt;/blockquote>
&lt;p>这个产品要做三件事：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>自动分类&lt;/strong>：为每篇论文打上标签。&lt;/li>
&lt;li>&lt;strong>验证论文的真实性，以及“容易 follow”的程度&lt;/strong>。&lt;/li>
&lt;li>&lt;strong>自动生成论文摘要&lt;/strong>（类似于&lt;a href="https://papers.cool/" target="_blank" rel="noopener">papers.cool&lt;/a>）&lt;/li>
&lt;/ol>
&lt;p>其中“容易 follow”，定义为：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>(a)&lt;/strong> 论文应该对应有 GitHub 的 repo；&lt;/li>
&lt;li>&lt;strong>(b)&lt;/strong> repo 里的代码必须是完善的——比如有些 repo 里只有一个 README，由于难以复现，因此应该被排除；&lt;/li>
&lt;li>&lt;strong>(c)&lt;/strong> 数据集等资源也应是开源的。&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="HuggingFace Daily Papers 每日列表" srcset="
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_e1e3e9c73e64b5b7c5c3a39df7b3933a.webp 400w,
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_5a6b0d89873c23f7049f285a2061e7f2.webp 760w,
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_e1e3e9c73e64b5b7c5c3a39df7b3933a.webp"
width="760"
height="464"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>HuggingFace Daily Papers：每天几十篇论文需要我们自己点进去逐个挑选&lt;/em>&lt;/p>
&lt;h2 id="tex源码-or-pdf">Tex源码 or PDF&lt;/h2>
&lt;p>关于论文分析的真实性和技术实现，我有两点考虑：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>真实性&lt;/strong>：论文发表的单位应当是frontier的高校/机构&lt;/li>
&lt;li>&lt;strong>流程&lt;/strong>：为了便于分析，应该让 Agent 直接读取论文的 &lt;strong>tex 源码&lt;/strong>，而不是PDF。&lt;/li>
&lt;/ol>
&lt;p>之所以不建议读 PDF，是因为 PDF 在解析时容易出现各种问题，而且也难以进行字符串匹配（比如要在正文里找 &lt;code>github.com/&lt;/code> 这样的链接）。&lt;/p>
&lt;hr>
&lt;h1 id="实现">实现&lt;/h1>
&lt;h2 id="先python初筛再交给-agent-自主执行">先Python初筛，再交给 Agent 自主执行&lt;/h2>
&lt;p>我希望它每天自动爬取，让我无感地拿到当天的论文。整体分两步：&lt;strong>先用 Python 脚本初筛，再交给 Agent 做判断和整理&lt;/strong>。&lt;/p>
&lt;p>对于初筛，HuggingFace 有公开接口，按日期就能取到当天列表，所以我写了个零依赖脚本 &lt;a href="https://github.com/YuyaoGe/Paper_Agent_Skill/blob/main/scripts/fetch_hf_papers.py" target="_blank" rel="noopener">&lt;code>fetch_hf_papers.py&lt;/code>&lt;/a>，用关键词规则粗筛一遍：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># scripts/fetch_hf_papers.py —— 直接调用 HF 公开接口，无需 API Key&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">url&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;https://huggingface.co/api/daily_papers?date=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">date_str&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># 标题命中这些关键词 → 排除&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">TITLE_EXCLUDE_KEYWORDS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;benchmark&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;benchmarking&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;bench&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;speech&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;audio&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;video&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;3d&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;compiler&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;kernel&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;triton&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;tpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;xla&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;quantization&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;quantisation&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;distillation&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># 摘要必须命中其一 → 确认是 LLM/VLM 领域&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ABSTRACT_REQUIRE_ANY&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;large language model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;llm&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;vision language model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;vlm&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;multimodal&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;reasoning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;reinforcement learning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;instruction tuning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;fine-tuning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;alignment&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;agent&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;chain-of-thought&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;in-context learning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>只需纯标准库（&lt;code>urllib&lt;/code> + &lt;code>json&lt;/code>）就够，且不需要 API key。这一步可以把几十篇筛到十几篇，剩下的论文留给 Agent 来处理。&lt;/p>
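&lt;p>下面是一个极简示意，演示这两组关键词如何组合成一次判断。这里假设采用不区分大小写的子串匹配；&lt;code>keep_paper&lt;/code> 是为了说明而虚构的函数名，关键词表也只是节选，实际脚本的匹配方式以仓库代码为准。&lt;/p>

```python
# 示意代码：关键词表为节选，keep_paper 是假设的函数名
TITLE_EXCLUDE_KEYWORDS = ["benchmark", "speech", "audio", "video", "cuda", "quantization"]
ABSTRACT_REQUIRE_ANY = ["large language model", "llm", "vlm", "reasoning", "agent"]


def keep_paper(title: str, abstract: str) -> bool:
    """标题不命中排除词，且摘要至少命中一个必需词时返回 True。"""
    title_l = title.lower()
    abstract_l = abstract.lower()
    # 第一层：标题命中任一排除词就直接丢弃
    if any(kw in title_l for kw in TITLE_EXCLUDE_KEYWORDS):
        return False
    # 第二层：摘要必须命中至少一个领域关键词
    return any(kw in abstract_l for kw in ABSTRACT_REQUIRE_ANY)


if __name__ == "__main__":
    print(keep_paper("A New Video Benchmark", "We evaluate LLM agents."))
    print(keep_paper("Steering Attention Heads", "We study large language models."))
```

&lt;p>子串匹配实现简单，但短关键词可能误伤更长的单词；如果误杀较多，可以考虑改成按词边界匹配。&lt;/p>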
&lt;p>其次是对论文的判断&amp;amp;分析。
考虑到目前的模型已经可以很好地平衡指令遵循能力与成本，因此决定不做过多的 harness，而是全权交给 Agent 自主判断。
为此，我将需求写为 &lt;a href="https://github.com/YuyaoGe/Paper_Agent_Skill/blob/main/SKILL.md" target="_blank" rel="noopener">Skill&lt;/a>。&lt;/p>
&lt;hr>
&lt;h2 id="skill-pipline-design">SKILL Pipeline Design&lt;/h2>
&lt;p>SKILL 对每篇候选论文的处理分为三个步骤：&lt;/p>
&lt;p>&lt;strong>第一步&lt;/strong>，提取 GitHub 链接：优先用 HF 接口的 &lt;code>githubRepo&lt;/code> 字段，如果此字段为空，就去 arXiv 的 tex 源码里检索 &lt;code>github.com/&lt;/code>。&lt;/p>
&lt;p>&lt;strong>第二步&lt;/strong>，调用 GitHub Contents API 验证仓库里有没有实质代码：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">API: https://api.github.com/repos/{owner}/{repo}/contents
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">保留（满足其一）：
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - 根目录有 .py / .sh / .ipynb 文件
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - 有 src / scripts / train / model / code 等目录
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">丢弃（命中其一）：
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - 只有 README.md / LICENSE / assets 这类非代码文件
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - API 返回 404（仓库不存在或为空）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - 仓库名 / 描述含 &amp;#34;coming-soon&amp;#34;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>第三步&lt;/strong>，写一段&lt;strong>一眼就能看懂&lt;/strong>的中文摘要，再从一套固定标签里打标签以便筛选：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">RL · 微调 · 无需训练 · 长文本 · VLM · MeM(Agent记忆) · API · 扩散模型
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>最后落成一条 JSON：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2026-03-12&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;title&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Prism-Δ: Differential Subspace Steering for Prompt Highlighting in Large Language Models&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;arxiv_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2603.10705&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;github&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://github.com/YuyaoGe/PRISM-DELTA&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;abstract&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;PRISM-Δ 是一种提示高亮方法，使 LLM 在生成时优先关注用户指定的文本片段。核心思路是分解正负交叉协方差矩阵的差值以最大化判别能量、消除共享方向，每个注意力头获得连续 softplus 重要性权重（弱但有用的头以降低强度贡献），并扩展到 Value 表示以捕获内容通道信号。在 4 个基准、5 个模型上，PRISM-Δ 在 20 个配置中的 19 个匹配或超越最佳现有方法，相对增益最高 +10.6%，流畅度损失减半，长文本检索场景相对增益最高 +4.8%。&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;tags&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;无需训练&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="harmness-engineering-design">Harness Engineering Design&lt;/h2>
&lt;p>有了 SKILL，接下来就是设计 Agent（或者 Subagent）与 SKILL 的调用关系。&lt;/p>
&lt;p>这里有两个选择：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>主从调度&lt;/strong>：一个主 Agent 去调度多个子 Agent&lt;/li>
&lt;li>&lt;strong>并行调度&lt;/strong>：每个日期各起一个独立 Agent&lt;/li>
&lt;/ul>
&lt;p>对于主从调度，我使用 OpenClaw 实现，最大 subagent 数设置为 5。&lt;/p>
&lt;p>然而经常出现超时的问题。具体而言，让 OpenClaw 调研某 10 天的论文时，尽管它会起多个 sub-agent，超时和超出上下文的问题仍频频出现，极不稳定。&lt;/p>
&lt;p>因此，后来选择了后一种方案：每个日期一个独立 Agent 并行跑，各自写一个 JSON，最后用主程序合并，每天生成一个论文列表。事实证明，越简单越稳定。&lt;/p>
&lt;p>并行用 &lt;code>xargs -P&lt;/code> 就够：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># backfill_papers.sh —— 批量补录历史日期，默认并发 6&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">printf&lt;/span> &lt;span class="s1">&amp;#39;%s\n&amp;#39;&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">MISSING&lt;/span>&lt;span class="p">[@]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="p">|&lt;/span> xargs -P &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$CONCURRENCY&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -I&lt;span class="o">{}&lt;/span> bash &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$SCRIPT_DIR&lt;/span>&lt;span class="s2">/run_kimi_one_day.sh&amp;#34;&lt;/span> &lt;span class="o">{}&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PAPER_READER_DIR&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>合并脚本 &lt;code>merge_batches.py&lt;/code> 也只做确定的事：扫 &lt;code>paper_batches/*.json&lt;/code>，跳过已有日期，按日期把缺的追加进去。&lt;/p>
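&lt;p>“跳过已有日期，只追加缺的”这一步可以用下面的示意代码来理解。&lt;code>collect_new_batches&lt;/code> 是为了说明而假设的函数名，并假设每个批次文件是一个 JSON 数组、文件名就是日期；它不是 &lt;code>merge_batches.py&lt;/code> 的原始实现。&lt;/p>

```python
# 示意代码：collect_new_batches 是假设的函数名，并假设每个批次文件是一个 JSON 数组
import json
import tempfile
from pathlib import Path


def collect_new_batches(batch_dir, existing_dates):
    """返回 [(日期, 论文列表)]，只包含尚未合并的日期，按日期升序。"""
    new_batches = []
    # ISO 日期的文件名按字典序排序，等价于按时间排序
    for path in sorted(Path(batch_dir).glob("*.json")):
        date = path.stem  # 文件名形如 2026-03-12.json
        if date in existing_dates:
            continue  # 已合并过的日期直接跳过，重复运行不会产生重复条目
        papers = json.loads(path.read_text(encoding="utf-8"))
        new_batches.append((date, papers))
    return new_batches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for day in ("2026-03-12", "2026-03-11"):
            Path(tmp, f"{day}.json").write_text(
                json.dumps([{"date": day, "title": "demo"}]), encoding="utf-8"
            )
        merged = collect_new_batches(tmp, existing_dates={"2026-03-11"})
        print([date for date, _ in merged])
```

&lt;p>因为合并只依赖“日期是否已存在”这一个确定条件，同一批文件无论跑多少次，结果都一样。&lt;/p>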
&lt;h2 id="关于agent框架的选择">关于Agent框架的选择&lt;/h2>
&lt;p>选 Agent 有个硬条件：&lt;strong>能在命令行非交互式启动&lt;/strong>，而不是必须进入终端手动操作。&lt;/p>
&lt;p>比如：&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cursor &amp;amp; Claude Code&lt;/strong>：默认需要进入 GUI 界面或交互式终端操作&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://moonshotai.github.io/kimi-cli/" target="_blank" rel="noopener">Kimi CLI&lt;/a>&lt;/strong>：可以把 prompt 作为启动命令的一个参数传入，调用很方便，而且 Kimi 价格便宜&lt;/li>
&lt;/ul>
&lt;p>所以对于每一天的处理就是一行命令：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run_kimi_one_day.sh —— 用 Kimi CLI 处理单天，结果写入 batch JSON&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kimi --print --quiet &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --work-dir &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PAPER_READER_DIR&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --add-dir /Users/yuyaoge/Project/Paper_Agent_Skill &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> -p &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PROMPT&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &amp;gt; &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$LOG_FILE&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> 2&amp;gt;&lt;span class="p">&amp;amp;&lt;/span>&lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>尽管如此，目前还是需要每天手动启动工作流，而我希望它对用户无感运行。因此，需要一个适配 macOS 的自动启动脚本。&lt;/p>
&lt;h2 id="在macos上的自动启动脚本">在 macOS 上的自动启动脚本&lt;/h2>
&lt;p>自动脚本用 macOS 的 &lt;code>launchd&lt;/code>，配置 &lt;code>com.yuyaoge.paper-daily-fetch.plist&lt;/code>：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-xml" data-lang="xml">&lt;span class="line">&lt;span class="cl">&lt;span class="c">&amp;lt;!-- 登录 / load 时立即跑一次 --&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;key&amp;gt;&lt;/span>RunAtLoad&lt;span class="nt">&amp;lt;/key&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;true/&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c">&amp;lt;!-- 之后每 2 小时再跑一次 --&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;key&amp;gt;&lt;/span>StartInterval&lt;span class="nt">&amp;lt;/key&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;integer&amp;gt;&lt;/span>7200&lt;span class="nt">&amp;lt;/integer&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>daily_fetch.sh&lt;/code> 里有两个Feature值得一提：&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>不处理今日论文&lt;/strong>：HuggingFace 在当天是会根据用户的上传而实时更新的，当天爬会漏掉当前时间点后面提交的论文。所以默认抓“昨天往前数 7 天”，而不包括当天的论文，顺便也补上关机那几天的论文。&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>幂等&lt;/strong>：不适合在每天固定的时间点启动，因为无法确定那个时间点电脑是否开机。因此设定为开机后每 2 小时跑一次：某天已有结果就跳过，结果为空则重新执行一次；如果 git 没有变化就不 push。&lt;/p>
&lt;/li>
&lt;/ul>
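&lt;p>上面两点可以合在一起写成一段示意代码：先算出“昨天往前数 7 天”的日期窗口，再过滤掉已有结果的日期。函数名、&lt;code>paper_batches&lt;/code> 的目录布局，以及把“空结果”理解为文件缺失、内容为空或空数组，都是这里为了说明所做的假设，并非 &lt;code>daily_fetch.sh&lt;/code> 的原始实现。&lt;/p>

```python
# 示意代码：函数名、paper_batches 目录布局以及“空结果”的判定方式均为假设
import json
from datetime import date, timedelta
from pathlib import Path


def target_dates(today, days=7):
    """返回从昨天起往前数 days 天的日期字符串（不含今天），从旧到新排列。"""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(days, 0, -1)]


def has_result(path):
    """文件存在且解析出非空 JSON 时才算已有结果。"""
    if not path.exists():
        return False
    try:
        return bool(json.loads(path.read_text(encoding="utf-8")))
    except json.JSONDecodeError:
        return False


def missing_dates(dates, batch_dir):
    """过滤掉已有结果的日期，剩下的才需要启动 Agent。"""
    return [day for day in dates if not has_result(Path(batch_dir) / f"{day}.json")]


if __name__ == "__main__":
    window = target_dates(date(2026, 6, 11))
    print(window[0], window[-1])
```

&lt;p>这样即使每 2 小时触发一次，真正启动 Agent 的也只有缺结果的那几天。&lt;/p>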
&lt;hr>
&lt;h1 id="quick-start">Quick Start&lt;/h1>
&lt;blockquote>
&lt;p>&lt;strong>环境要求&lt;/strong>：macOS、Python 3，以及已安装并登录的 &lt;a href="https://moonshotai.github.io/kimi-cli/" target="_blank" rel="noopener">Kimi CLI&lt;/a>。&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>1. 克隆仓库&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/YuyaoGe/Paper_Agent_Skill.git &lt;span class="c1"># Skill + 脚本&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone https://github.com/YuyaoGe/paper_reader.git &lt;span class="c1"># 数据 + 前端&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>2. 安装 Skill 到 Kimi&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> Paper_Agent_Skill
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">mkdir -p ~/.kimi/skills
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">ln -sfn &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PWD&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> ~/.kimi/skills/hf-paper-filter
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>3. 手动验证链路&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run_kimi_one_day.sh YYYY-MM-DD [paper_reader 路径]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./scripts/run_kimi_one_day.sh 2026-06-01 /path/to/paper_reader
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>4.（可选）批量补录历史区间&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># backfill_papers.sh 起始日期 结束日期 [并发数] [paper_reader 路径]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./scripts/backfill_papers.sh 2026-04-25 2026-05-26 &lt;span class="m">6&lt;/span> /path/to/paper_reader
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">python3 ./scripts/merge_batches.py /path/to/paper_reader
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>5. 安装定时任务，实现无人值守&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">cp scripts/com.yuyaoge.paper-daily-fetch.plist ~/Library/LaunchAgents/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">launchctl load -w ~/Library/LaunchAgents/com.yuyaoge.paper-daily-fetch.plist
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h1 id="前端展示">前端展示&lt;/h1>
&lt;p>最终列表汇总到 &lt;code>paper_list.md&lt;/code> 中，这样 Agent 可以很方便地在文件末尾追加，同时也易于前端解析。&lt;/p>
&lt;p>&lt;a href="https://yuyaoge.github.io/paper_reader/" target="_blank" rel="noopener">前端&lt;/a> 设计为纯静态页面，运行时把 Markdown 拉下来解析成卡片，支持按标签筛、按日期检索：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-js" data-lang="js">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// 前端运行时直接拉取 Markdown 数据源并解析
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="kr">const&lt;/span> &lt;span class="nx">resp&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kr">await&lt;/span> &lt;span class="nx">fetch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;paper_list.md&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// 每条格式：- **Title** `[Tag]` — [id](url) | [GitHub](url)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// &amp;gt; 中文摘要
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="nx">currentPapers&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">push&lt;/span>&lt;span class="p">({&lt;/span> &lt;span class="nx">title&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">tags&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">links&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">desc&lt;/span> &lt;span class="p">});&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>托管在 GitHub Pages，定时脚本每天把更新后的 &lt;code>paper_list.md&lt;/code> push 到云端，页面同步更新。&lt;/p>
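&lt;p>作为参考，下面用 Python 粗略示意一条 JSON 记录如何变成上面注释里的条目格式。&lt;code>render_entry&lt;/code> 是假设的函数名；多个标签的拼接方式和 arXiv 链接的形式也是假设，实际格式以仓库中的 &lt;code>paper_list.md&lt;/code> 为准。&lt;/p>

```python
# 示意代码：render_entry 是假设的函数名；多标签的写法和 arXiv 链接格式均为假设
def render_entry(paper):
    """把一条 JSON 记录渲染成 paper_list.md 中的两行文本。"""
    tags = " ".join(f"`[{tag}]`" for tag in paper["tags"])
    arxiv_url = f"https://arxiv.org/abs/{paper['arxiv_id']}"
    head = (
        f"- **{paper['title']}** {tags} — "
        f"[{paper['arxiv_id']}]({arxiv_url}) | [GitHub]({paper['github']})"
    )
    # 第二行以 "> " 开头，存放中文摘要
    return f"{head}\n> {paper['abstract']}"


if __name__ == "__main__":
    demo = {
        "title": "Demo Paper",
        "arxiv_id": "2603.10705",
        "github": "https://github.com/YuyaoGe/PRISM-DELTA",
        "abstract": "示例摘要",
        "tags": ["无需训练"],
    }
    print(render_entry(demo))
```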
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="paper_reader 界面" srcset="
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_37c8ed392cc707c0b52f7778a03fa896.webp 400w,
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_ca258df559b5167a6518656aa5b65f4c.webp 760w,
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_37c8ed392cc707c0b52f7778a03fa896.webp"
width="760"
height="612"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>paper_reader：筛选、打标签、生成中文摘要后的样子，可以按标签筛选、按日期排序。&lt;/em>&lt;/p>
&lt;h1 id="整体流程">整体流程&lt;/h1>
&lt;p>综上，整条流程如下：&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl"> macOS launchd ──▶ daily_fetch.sh （每 2h 触发，幂等）
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ 按天拆分，并行
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> run_kimi_one_day.sh × N (xargs -P 6)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └─ Kimi CLI 加载 hf-paper-filter Skill
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├─ fetch_hf_papers.py Python 初筛
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├─ GitHub Contents API 验证有无代码
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └─ 写中文摘要 + 打标签
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ 每天一个 JSON
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> paper_batches/YYYY-MM-DD.json
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ merge_batches.py 合并
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> paper_list.md ──git push──▶ GitHub Pages（前端 fetch 渲染）
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>用到的东西也都很常规：Python 标准库做初筛、Kimi CLI 做判断、&lt;code>xargs -P&lt;/code> 并行、&lt;code>launchd&lt;/code> 定时、一个 &lt;code>paper_list.md&lt;/code> 当数据源、一个静态页面做展示。&lt;/p>
&lt;!-- # 项目链接
- Agent Skill（核心逻辑）：[github.com/YuyaoGe/Paper_Agent_Skill](https://github.com/YuyaoGe/Paper_Agent_Skill)
- 前端展示源码：[github.com/YuyaoGe/paper_reader](https://github.com/YuyaoGe/paper_reader)
- 在线 Demo：[yuyaoge.github.io/paper_reader](https://yuyaoge.github.io/paper_reader/)
- 依赖工具：[Kimi CLI](https://moonshotai.github.io/kimi-cli/) · [HuggingFace Daily Papers](https://huggingface.co/papers) -->
&lt;/div>
&lt;div class="li-lang-en" markdown="1">
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>As graduate students, we read a lot of papers every day. But beyond reading the papers themselves, &lt;strong>deciding which papers to read is just as time- and energy-consuming&lt;/strong>.&lt;/p>
&lt;p>There is no shortage of tools for analyzing papers, for example:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://www.alphaxiv.org/" target="_blank" rel="noopener">AlphaXiv&lt;/a>&lt;/strong>: reads a single paper very systematically.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://papers.cool/" target="_blank" rel="noopener">papers.cool&lt;/a>&lt;/strong>: built by &lt;a href="https://kexue.fm/" target="_blank" rel="noopener">Su Jianlin&lt;/a>; it crawls new arXiv papers daily, uses Kimi to produce Chinese explanations, and lets you keep asking follow-up questions inside Kimi.&lt;/li>
&lt;/ol>
&lt;p>These tools go deep on &lt;em>reading&lt;/em> a paper, but they share one shortcoming: &lt;strong>they do not help with filtering&lt;/strong>. They can read a given paper in great detail, yet &amp;ldquo;picking the few worth reading out of the dozens each day&amp;rdquo; is still on me, and they will not tag papers either. Worse, even when I find a promising paper and want to follow up on its work, I often discover that its GitHub repo is empty, or that there is no link at all — and the effort is wasted.&lt;/p>
&lt;p>The sunk cost here is high: when following a paper, I might write code for a long time only to find that some input does not match, or that a reproduction step simply does not work.&lt;/p>
&lt;p>So I wanted to build something genuinely useful and actually usable in daily work, so that time goes to work that matters.&lt;/p>
&lt;h1 id="idea">Idea&lt;/h1>
&lt;h2 id="requirements">Requirements&lt;/h2>
&lt;p>Concretely, I want it to filter the papers in &lt;a href="https://huggingface.co/papers" target="_blank" rel="noopener">HuggingFace Daily Papers&lt;/a> every day, and organize them by category.&lt;/p>
&lt;blockquote>
&lt;p>Papers in HuggingFace Daily Papers are submitted by the authors themselves. That initiative tends to make daily-paper submissions more complete, more credible, and higher in quality than arXiv at large.&lt;/p>
&lt;/blockquote>
&lt;p>The product needs to do three things:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Auto-classification&lt;/strong>: tag every paper.&lt;/li>
&lt;li>&lt;strong>Verify a paper&amp;rsquo;s authenticity and how &amp;ldquo;easy to follow&amp;rdquo; it is&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Auto-generate a paper summary&lt;/strong> (similar to &lt;a href="https://papers.cool/" target="_blank" rel="noopener">papers.cool&lt;/a>).&lt;/li>
&lt;/ol>
&lt;p>Here &amp;ldquo;easy to follow&amp;rdquo; is defined as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>(a)&lt;/strong> the paper should have a corresponding GitHub repo;&lt;/li>
&lt;li>&lt;strong>(b)&lt;/strong> the code in that repo must be substantial — a repo with only a README, for instance, is hard to reproduce and should be excluded;&lt;/li>
&lt;li>&lt;strong>(c)&lt;/strong> datasets and other resources should be open-sourced as well.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="HuggingFace Daily Papers list" srcset="
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_e1e3e9c73e64b5b7c5c3a39df7b3933a.webp 400w,
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_5a6b0d89873c23f7049f285a2061e7f2.webp 760w,
/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/auto_paper_reader/figures/hf_huc3ea2232ca5d15f85d999e66ebff611e_1555544_e1e3e9c73e64b5b7c5c3a39df7b3933a.webp"
width="760"
height="464"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>HuggingFace Daily Papers: dozens of papers a day that you have to click into and sift through one by one.&lt;/em>&lt;/p>
&lt;h2 id="tex-source-vs-pdf">Tex Source vs. PDF&lt;/h2>
&lt;p>On authenticity and implementation, I had two considerations:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Authenticity&lt;/strong>: the publishing institution should be a frontier university / lab.&lt;/li>
&lt;li>&lt;strong>Pipeline&lt;/strong>: for ease of analysis, the Agent should read the paper&amp;rsquo;s &lt;strong>tex source&lt;/strong> directly rather than the PDF.&lt;/li>
&lt;/ol>
&lt;p>The reason to avoid PDFs is that PDF parsing breaks in all sorts of ways, and string matching is hard (e.g., searching the body for a &lt;code>github.com/&lt;/code> link).&lt;/p>
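&lt;p>With the tex source in hand, the same search reduces to a short regular expression. The sketch below is illustrative only: &lt;code>find_github_repos&lt;/code> is a hypothetical helper rather than part of the Skill, and the sample snippet and repo name are made up for the demo.&lt;/p>

```python
# Illustrative sketch: find_github_repos is a hypothetical helper, not part of the Skill
import re

# Capture "owner/repo" after github.com/, stopping at characters such as "}" or whitespace
GITHUB_REPO_RE = re.compile(r"github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)")


def find_github_repos(tex_source: str) -> list[str]:
    """Return unique owner/repo strings in order of first appearance."""
    repos = []
    for owner, repo in GITHUB_REPO_RE.findall(tex_source):
        repo = repo.rstrip(".")  # drop a sentence-ending period
        if repo.endswith(".git"):
            repo = repo[:-4]  # normalize clone URLs
        name = f"{owner}/{repo}"
        if name not in repos:
            repos.append(name)
    return repos


if __name__ == "__main__":
    sample = r"Code: \url{https://github.com/example-lab/demo-repo}. See also github.com/example-lab/demo-repo."
    print(find_github_repos(sample))
```

&lt;p>Because tex keeps URLs as plain text inside commands such as &lt;code>\url{...}&lt;/code>, a plain string search like this is usually enough, with no layout parsing involved.&lt;/p>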
&lt;hr>
&lt;h1 id="implementation">Implementation&lt;/h1>
&lt;h2 id="python-pre-filter-first-then-hand-off-to-the-agent">Python Pre-filter First, Then Hand Off to the Agent&lt;/h2>
&lt;p>I want it to crawl automatically every day, so I get the day&amp;rsquo;s papers effortlessly. The whole thing is two steps: &lt;strong>a Python script does a coarse pre-filter first, then an Agent handles judgment and organizing&lt;/strong>.&lt;/p>
&lt;p>For the pre-filter: HuggingFace has a public API, and you can fetch a given day&amp;rsquo;s list by date, so I wrote a zero-dependency script &lt;a href="https://github.com/YuyaoGe/Paper_Agent_Skill/blob/main/scripts/fetch_hf_papers.py" target="_blank" rel="noopener">&lt;code>fetch_hf_papers.py&lt;/code>&lt;/a> that filters coarsely by keyword rules:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># scripts/fetch_hf_papers.py — call the public HF API directly, no API key needed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">url&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;https://huggingface.co/api/daily_papers?date=&lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">date_str&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Titles matching these keywords → excluded&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">TITLE_EXCLUDE_KEYWORDS&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;benchmark&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;benchmarking&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;bench&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;speech&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;audio&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;video&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;3d&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;compiler&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;kernel&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;triton&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;tpu&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;xla&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;quantization&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;quantisation&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;distillation&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Abstract must match at least one → confirm it is in the LLM/VLM space&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ABSTRACT_REQUIRE_ANY&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">[&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;large language model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;llm&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;vision language model&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;vlm&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;multimodal&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;reasoning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;reinforcement learning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;instruction tuning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;fine-tuning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;alignment&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;agent&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;chain-of-thought&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s2">&amp;#34;in-context learning&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The standard library (&lt;code>urllib&lt;/code> + &lt;code>json&lt;/code>) is enough, and no API key is needed. This step narrows dozens of papers down to a dozen or so; the rest is left for the Agent to weigh.&lt;/p>
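&lt;p>As a rough illustration, the two keyword rules could be applied as below. This is a minimal sketch rather than the actual contents of &lt;code>fetch_hf_papers.py&lt;/code>: the helper names &lt;code>passes_prefilter&lt;/code> and &lt;code>prefilter&lt;/code>, the plain substring matching, the &lt;code>title&lt;/code>/&lt;code>abstract&lt;/code> dict keys, and the shortened keyword lists are all assumptions made for the example.&lt;/p>

```python
# Sketch of the coarse keyword pre-filter; the lists are shortened for brevity.
TITLE_EXCLUDE_KEYWORDS = ["benchmark", "speech", "audio", "video", "cuda", "quantization"]
ABSTRACT_REQUIRE_ANY = ["large language model", "llm", "vlm", "reasoning", "agent"]


def passes_prefilter(title: str, abstract: str) -> bool:
    """Return True when a paper survives both keyword rules."""
    title_lc = title.lower()
    abstract_lc = abstract.lower()
    # Rule 1: any excluded keyword in the title drops the paper.
    if any(kw in title_lc for kw in TITLE_EXCLUDE_KEYWORDS):
        return False
    # Rule 2: the abstract must mention at least one required keyword.
    return any(kw in abstract_lc for kw in ABSTRACT_REQUIRE_ANY)


def prefilter(papers: list[dict]) -> list[dict]:
    """Keep only the papers that pass; each dict needs 'title' and 'abstract' (assumed keys)."""
    return [p for p in papers if passes_prefilter(p["title"], p["abstract"])]
```

&lt;p>Substring matching keeps the sketch cheap but blunt: a short keyword also matches inside longer words, which is tolerable here because the Agent makes the final call.&lt;/p>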
&lt;p>Next comes judging and analyzing the papers. Since today&amp;rsquo;s models already follow instructions well at a reasonable cost, I decided not to build much of a harness around the Agent and instead hand it the full judgment. To that end, I wrote the requirements as a &lt;a href="https://github.com/YuyaoGe/Paper_Agent_Skill/blob/main/SKILL.md" target="_blank" rel="noopener">Skill&lt;/a>.&lt;/p>
&lt;hr>
&lt;h2 id="skill-pipeline-design">SKILL Pipeline Design&lt;/h2>
&lt;p>The SKILL processes each candidate paper in three steps:&lt;/p>
&lt;p>&lt;strong>Step 1&lt;/strong>, extract the GitHub link: prefer the &lt;code>githubRepo&lt;/code> field from the HF API; if it is empty, search the paper&amp;rsquo;s arXiv tex source for &lt;code>github.com/&lt;/code>.&lt;/p>
&lt;p>&lt;strong>Step 2&lt;/strong>, call the GitHub Contents API to verify whether the repo has substantial code:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">API: https://api.github.com/repos/{owner}/{repo}/contents
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Keep (any one of):
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - .py / .sh / .ipynb files in the repo root
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - directories such as src / scripts / train / model / code
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Drop (any one of):
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - only non-code files like README.md / LICENSE / assets
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - API returns 404 (repo missing or empty)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - repo name / description contains &amp;#34;coming-soon&amp;#34;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Step 3&lt;/strong>, write a concise summary that is &lt;strong>understandable at a glance&lt;/strong>, then assign tags from a fixed set (so the frontend can filter):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">RL · Fine-tuning · Training-free · Long-context · VLM · MeM (Agent Memory) · API · Diffusion
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Finally it lands as one JSON record:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-json" data-lang="json">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;date&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2026-03-12&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;title&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Prism-Δ: Differential Subspace Steering for Prompt Highlighting in Large Language Models&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;arxiv_id&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;2603.10705&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;github&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;https://github.com/YuyaoGe/PRISM-DELTA&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;abstract&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;PRISM-Δ is a prompt-highlighting method that makes an LLM prioritize user-specified text spans during generation. The core idea is to decompose the difference between the positive and negative cross-covariance matrices to maximize discriminative energy and eliminate shared directions; each attention head gets a continuous softplus importance weight (weak-but-useful heads contribute at reduced strength), and the method is extended to the Value representation to capture content-channel signals. Across 4 benchmarks and 5 models, PRISM-Δ matches or surpasses the best existing methods in 19 of 20 configurations, with relative gains up to +10.6%, fluency loss halved, and up to +4.8% relative gain in long-context retrieval.&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nt">&amp;#34;tags&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s2">&amp;#34;Training-free&amp;#34;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="agent-orchestration-design">Agent Orchestration Design&lt;/h2>
&lt;p>With the SKILL in place, the next question is the calling relationship between the Agent (or Subagent) and the SKILL.&lt;/p>
&lt;p>There are two choices:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Master–worker&lt;/strong>: one master Agent dispatches multiple sub-Agents.&lt;/li>
&lt;li>&lt;strong>Parallel&lt;/strong>: spin up an independent Agent for each day.&lt;/li>
&lt;/ul>
&lt;p>I implemented master–worker with OpenClaw, capping sub-agents at 5, but it proved very unstable: when I asked it to survey, say, 10 days of papers, it would launch several sub-agents, and timeouts and context overruns appeared constantly.&lt;/p>
&lt;p>So I went with the latter: one independent Agent per day, all running in parallel and each writing its own JSON, which a main program then merges. Each Agent handles one day&amp;rsquo;s paper list, and as it turns out, the simpler the design, the more stable it is.&lt;/p>
&lt;p>Parallelism is just &lt;code>xargs -P&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># backfill_papers.sh — batch-backfill historical dates, default concurrency 6&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">printf&lt;/span> &lt;span class="s1">&amp;#39;%s\n&amp;#39;&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="si">${&lt;/span>&lt;span class="nv">MISSING&lt;/span>&lt;span class="p">[@]&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &lt;span class="p">|&lt;/span> xargs -P &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$CONCURRENCY&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> -I&lt;span class="o">{}&lt;/span> bash &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$SCRIPT_DIR&lt;/span>&lt;span class="s2">/run_kimi_one_day.sh&amp;#34;&lt;/span> &lt;span class="o">{}&lt;/span> &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PAPER_READER_DIR&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The merge script &lt;code>merge_batches.py&lt;/code> likewise does only deterministic work: it scans &lt;code>paper_batches/*.json&lt;/code>, skips dates that are already present, and appends the missing ones in date order.&lt;/p>
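&lt;p>A minimal sketch of that merge logic follows, assuming each batch file is named &lt;code>YYYY-MM-DD.json&lt;/code> and that the merged state is a dict keyed by date. The function name &lt;code>merge_batches&lt;/code> and the in-memory dict are hypothetical stand-ins; the real script may store its state differently.&lt;/p>

```python
import json
from pathlib import Path


def merge_batches(batch_dir: Path, merged: dict[str, list]) -> list[str]:
    """Add batch files for dates not yet in `merged`; return the dates added."""
    added = []
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for path in sorted(batch_dir.glob("*.json")):
        day = path.stem  # assumed naming: the file stem is the date
        if day in merged:
            continue  # already merged, so a rerun must not duplicate entries
        merged[day] = json.loads(path.read_text(encoding="utf-8"))
        added.append(day)
    return added
```

&lt;p>Because dates that are already present get skipped, running it twice in a row adds nothing the second time.&lt;/p>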
&lt;h2 id="choosing-the-agent-framework">Choosing the Agent Framework&lt;/h2>
&lt;p>There is one hard requirement for the Agent: &lt;strong>it must launch non-interactively from the command line&lt;/strong>, without requiring manual operation in a terminal.&lt;/p>
&lt;p>For example:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cursor &amp;amp; Claude Code&lt;/strong>: require a GUI or terminal interaction.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://moonshotai.github.io/kimi-cli/" target="_blank" rel="noopener">Kimi CLI&lt;/a>&lt;/strong>: accepts the prompt as an argument to the launch command, so it is easy to script, and Kimi is cheap.&lt;/li>
&lt;/ul>
&lt;p>So processing a single day is one line:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run_kimi_one_day.sh — process a single day with Kimi CLI, write the batch JSON&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">kimi --print --quiet &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --work-dir &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PAPER_READER_DIR&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --add-dir /Users/yuyaoge/Project/Paper_Agent_Skill &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> -p &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PROMPT&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> &amp;gt; &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$LOG_FILE&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> 2&amp;gt;&lt;span class="p">&amp;amp;&lt;/span>&lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Even so, the workflow still has to be started by hand each day, whereas I want it to run unattended in the background. Hence an auto-start script tailored for macOS.&lt;/p>
&lt;h2 id="auto-start-script-on-macos">Auto-start Script on macOS&lt;/h2>
&lt;p>The auto-start uses macOS &lt;code>launchd&lt;/code>, configured via &lt;code>com.yuyaoge.paper-daily-fetch.plist&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-xml" data-lang="xml">&lt;span class="line">&lt;span class="cl">&lt;span class="c">&amp;lt;!-- Run once at login / load --&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;key&amp;gt;&lt;/span>RunAtLoad&lt;span class="nt">&amp;lt;/key&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;true/&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c">&amp;lt;!-- Then run again every 2 hours --&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;key&amp;gt;&lt;/span>StartInterval&lt;span class="nt">&amp;lt;/key&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nt">&amp;lt;integer&amp;gt;&lt;/span>7200&lt;span class="nt">&amp;lt;/integer&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Two features in &lt;code>daily_fetch.sh&lt;/code> are worth mentioning:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>It does not process &amp;ldquo;today&amp;rsquo;s&amp;rdquo; papers&lt;/strong>: HuggingFace updates the same-day list in real time as authors submit, so crawling today&amp;rsquo;s list would miss papers submitted later in the day. By default it covers a 7-day window counted back from yesterday instead, which also backfills any days the machine was off.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Idempotency&lt;/strong>: it cannot rely on a fixed daily trigger, since there is no guarantee the machine is on at that moment. So it runs every 2 hours after boot and makes each run safe to repeat: a day that already has results is skipped, a day whose result is empty is run once more, and nothing is pushed if git shows no changes.&lt;/p>
&lt;/li>
&lt;/ul>
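&lt;p>The date window and the skip-if-done check could look roughly like this. It is a sketch under assumptions: the window is read as the 7 days ending yesterday, &lt;code>done&lt;/code> is a hypothetical set of dates that already have results, and the function name &lt;code>dates_to_process&lt;/code> is invented for the example.&lt;/p>

```python
from datetime import date, timedelta


def dates_to_process(today: date, done: set[str], window: int = 7) -> list[str]:
    """List the dates in the window ending yesterday that still lack results."""
    # Start at yesterday (offset 1): today's list is still changing.
    candidates = [today - timedelta(days=offset) for offset in range(1, window + 1)]
    # Oldest first, skipping any date that already has results.
    return [d.isoformat() for d in reversed(candidates) if d.isoformat() not in done]
```

&lt;p>Calling it again after every date has been processed returns an empty list, which is what makes a 2-hour repeat harmless.&lt;/p>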
&lt;hr>
&lt;h1 id="quick-start-1">Quick Start&lt;/h1>
&lt;blockquote>
&lt;p>&lt;strong>Requirements&lt;/strong>: macOS, Python 3, and an installed &amp;amp; logged-in &lt;a href="https://moonshotai.github.io/kimi-cli/" target="_blank" rel="noopener">Kimi CLI&lt;/a>.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;strong>1. Clone the repos&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">git clone https://github.com/YuyaoGe/Paper_Agent_Skill.git &lt;span class="c1"># Skill + scripts&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone https://github.com/YuyaoGe/paper_reader.git &lt;span class="c1"># data + frontend&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>2. Install the Skill into Kimi&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> Paper_Agent_Skill
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">mkdir -p ~/.kimi/skills
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">ln -sfn &lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="nv">$PWD&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span> ~/.kimi/skills/hf-paper-filter
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>3. Verify the pipeline manually&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># run_kimi_one_day.sh YYYY-MM-DD [paper_reader path]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./scripts/run_kimi_one_day.sh 2026-06-01 /path/to/paper_reader
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>4. (Optional) Backfill a historical range&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># backfill_papers.sh START_DATE END_DATE [CONCURRENCY] [paper_reader path]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./scripts/backfill_papers.sh 2026-04-25 2026-05-26 &lt;span class="m">6&lt;/span> /path/to/paper_reader
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">python3 ./scripts/merge_batches.py /path/to/paper_reader
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>5. Install the scheduled job for unattended runs&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">cp scripts/com.yuyaoge.paper-daily-fetch.plist ~/Library/LaunchAgents/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">launchctl load -w ~/Library/LaunchAgents/com.yuyaoge.paper-daily-fetch.plist
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h1 id="frontend">Frontend&lt;/h1>
&lt;p>The final list is aggregated into &lt;code>paper_list.md&lt;/code>, so the Agent can easily append to the end of the file and the frontend can easily parse it.&lt;/p>
&lt;p>The &lt;a href="https://yuyaoge.github.io/paper_reader/" target="_blank" rel="noopener">frontend&lt;/a> is a pure static page: at runtime it pulls the Markdown down and parses it into cards, supporting tag filtering and date search:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-js" data-lang="js">&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// The frontend fetches the Markdown data source and parses it at runtime
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="kr">const&lt;/span> &lt;span class="nx">resp&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kr">await&lt;/span> &lt;span class="nx">fetch&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;paper_list.md&amp;#39;&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// Each entry: - **Title** `[Tag]` — [id](url) | [GitHub](url)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">// &amp;gt; Chinese summary
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span>&lt;span class="nx">currentPapers&lt;/span>&lt;span class="p">.&lt;/span>&lt;span class="nx">push&lt;/span>&lt;span class="p">({&lt;/span> &lt;span class="nx">title&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">tags&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">links&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nx">desc&lt;/span> &lt;span class="p">});&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>It is hosted on GitHub Pages; the scheduled script pushes the updated &lt;code>paper_list.md&lt;/code> to GitHub every day, and the page updates with it.&lt;/p>
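&lt;p>For reference, the entry format shown in the comment above can also be parsed in a few lines of Python, for example to sanity-check &lt;code>paper_list.md&lt;/code> before pushing. This is only a sketch: &lt;code>parse_entry&lt;/code> is a hypothetical helper, not the frontend&amp;rsquo;s actual JavaScript parser, and it assumes every tag is written as a backticked &lt;code>[Tag]&lt;/code> and every link as a standard Markdown link.&lt;/p>

```python
import re


def parse_entry(line: str) -> dict:
    """Parse one list line of paper_list.md into title, tags, and links."""
    title_match = re.match(r"- \*\*(.+?)\*\*", line)
    if title_match is None:
        raise ValueError(f"not a paper entry: {line!r}")
    # Tags are written as `[Tag]`, i.e. square brackets inside backticks.
    tags = re.findall(r"`\[([^\]]+)\]`", line)
    # Links are ordinary Markdown links: [label](url).
    links = dict(re.findall(r"\[([^\]]+)\]\(([^)\s]+)\)", line))
    return {"title": title_match.group(1), "tags": tags, "links": links}
```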
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="paper_reader UI" srcset="
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_37c8ed392cc707c0b52f7778a03fa896.webp 400w,
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_ca258df559b5167a6518656aa5b65f4c.webp 760w,
/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_1200x1200_fit_q99_h2_lanczos_3.webp 1200w"
src="https://geyuyao.com/post/auto_paper_reader/figures/paper_reader_hu9a8b7f4fe92c042fe5de7df5ae912fac_576522_37c8ed392cc707c0b52f7778a03fa896.webp"
width="760"
height="612"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>paper_reader: what it looks like after filtering, tagging, and Chinese-summary generation — filterable by tag and sortable by date.&lt;/em>&lt;/p>
&lt;h1 id="overall-pipeline">Overall Pipeline&lt;/h1>
&lt;p>Putting it all together, the full pipeline is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl"> macOS launchd ──▶ daily_fetch.sh (every 2h, idempotent)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ split by day, run in parallel
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> run_kimi_one_day.sh × N (xargs -P 6)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └─ Kimi CLI loads the hf-paper-filter Skill
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├─ fetch_hf_papers.py Python pre-filter
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├─ GitHub Contents API verify code presence
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └─ write Chinese summary + tags
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ one JSON per day
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> paper_batches/YYYY-MM-DD.json
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ merge_batches.py
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ▼
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> paper_list.md ──git push──▶ GitHub Pages (frontend fetch + render)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The pieces are all pretty ordinary: the Python standard library for pre-filtering, Kimi CLI for judgment, &lt;code>xargs -P&lt;/code> for parallelism, &lt;code>launchd&lt;/code> for scheduling, a single &lt;code>paper_list.md&lt;/code> as the data source, and a static page for display.&lt;/p>
&lt;/div>
&lt;script>
(function(){
function getInitialLang(){
try{
var q=new URL(window.location.href).searchParams.get('lang');
if(q==='en'||q==='zh')return q;
var s=localStorage.getItem('li_lang');
if(s==='en'||s==='zh')return s;
}catch(_){}
var n=(navigator.language||navigator.userLanguage||'en').toLowerCase();
return n.indexOf('zh')===0?'zh':'en';
}
function setLang(lang){
if(lang!=='en'&amp;&amp;lang!=='zh')lang='en';
document.body.setAttribute('data-li-lang',lang);
try{localStorage.setItem('li_lang',lang);}catch(_){}
try{var url=new URL(window.location.href);url.searchParams.set('lang',lang);window.history.replaceState({},'',url.toString());}catch(_){}
document.querySelectorAll('.li-lang-bar [data-lilang]').forEach(function(b){
b.classList.toggle('active',b.getAttribute('data-lilang')===lang);
});
}
window.__setLILang=setLang;
setLang(getInitialLang());
})();
&lt;/script></description></item></channel></rss>