Coding Agent | YuyaoGe's Website

Does a More Complex Agent Deliver Better Performance?

Thu, 11 Jun 2026 00:00:00 +0000

TLDR: 更复杂的SWE-Agent在SWE-pro bench上相比于mini-swe-agent表现更差且出现了实例卡死

P.S. Less than 20% of this post is AI-generated.

首先抛出一个常见的直觉：agent 框架做得越完善，性能应该越强。 尽管直觉上大家都这么认为，但是目前没有人严谨地证明过。为此，我希望在SWE任务上验证这个直觉是否是正确的🤔。

💡思路如下：

在SWE Pro Bench上测试两个复杂度不同的Agent Framework，对比他们各自的得分。

结果却发现简单的Agent框架反而获得了更高的性能🤯。

背景知识

什么是 SWE 任务？ SWE（Software Engineering）任务衡量的是 agent 的端到端真实开发能力：给定一个真实代码仓库和一个 GitHub issue，让 agent 自主地读代码、定位问题、跨文件修改、写出 patch（代码补丁），最后由测试判定是否"解决"。SWE-bench Pro 就是针对这一任务的 benchmark。

SWE-agent 与 mini-swe-agent 是两个面向 SWE 场景的 agent 框架：

SWE-agent 的核心论点是 Agent-Computer Interface（ACI），即为 agent 精心设计一套专用工具：给它配多种自定义工具、每个工具各有接口；执行则交给独立的 SWE-ReX 后端，用持久 pexpect 交互式 shell（工作目录、环境变量跨命令保留），且每条命令先过 bashlex 预解析（切分、语法校验、精确抠退出码）。
mini-swe-agent 是 SWE-agent 的最小实现（整个 agent 类约 100 行 Python）：只有 bash 一个"工具"，连模型的 tool-calling 接口都不用；用 subprocess.run 执行每条命令，每个 action 完全独立。

介绍完背景，那么问题来了：在同模型、同 benchmark 下，相比于mini-SWE-Agent，SWE-Agent这套"更完善"的工程，是否能带来更高的收益呢？？

更完善的 SWE-agent 并没有更强

我使用 Claude Sonnet 4.5 分别在两个 Agent Framework 上，对 SWE-bench Pro 的全部 731 题进行全量测试，限制了最大调用次数为 50 次。

结果如下：

Agent Framework	N	resolved	通过率
mini-swe-agent	731	322	44.0%
SWE-agent	731	302	41.3%

参照：官方对 Sonnet 4.5 的测试结果约为 43.6%¹，mini 的 44.0% 与之吻合，说明了我们实验的可信性。

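The minimalist mini-swe-agent scores 2.7 percentage points higher than SWE-agent. More surprisingly, after SWE-agent had finished 722 of the 731 instances, the last 9 hung outright: the containers stayed up for 5 to 12 hours with no log updates for hours on end, and had to be killed. mini-swe-agent ran the same 9 problems without this issue.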

于我而言，比起结果上的意外，我更好奇为什么这九个实例会卡死🤔。

为什么会有 9 个被卡死的容器

Kill掉容器前，我抓了每个卡死容器的 docker logs，显示：

INFO ... 200 OK POST /run_in_session
🦖 ERROR Bashlex fail: here-document at line 0 delimited by end-of-file (wanted "'EOF'")

容器没死、swerex-remote 进程还在正常返回 200 OK——是 agent 在空转。顺着 agent 的 trace 看，它在用 heredoc 写大文件：

cat > some_file.go <<'EOF'
... 一大段 Go 代码 ...
EOF

根因找到了：SWE-agent 的执行后端 swe-rex，会先用 bashlex（一个纯 Python 写的 bash 解析器）把每条命令解析一遍，再送进容器执行。

麻烦正出在这一步。bashlex 对 heredoc 的支持并不完整，碰到 cat <<'EOF' … 一大段代码 … EOF 这种大块写文件，它会直接解析失败，抛出 Bashlex fail。

解析一旦垮掉，swe-rex 就判断不出这条命令到底有没有跑完、退出码是几；agent 收到一份残缺的反馈，又不会换种写法自救，只能把同一条命令一遍遍重试，容器就这样一卡就是 5 到 12 个小时。

SWE-Agent 为什么要多此一举地先解析命令？

swe-rex 维持着一个长期存活的 shell 会话，让工作目录、环境变量、激活的虚拟环境这些状态能在多条命令之间延续。

但代价是，当命令在一个不断流动的会话里执行时，“它到哪儿算结束、返回码是多少"就不再像跑完一个独立进程那样一目了然，只能靠 bashlex 把命令切开、再注入哨兵字符串去输出流里把退出码捞回来。

好处是：命令既然被解析成了结构，还能做安全检查、命令改写之类更精细的封装，也正是 SWE-agent 主打的 Agent-Computer Interface 思路。它用额外的复杂度，换来了更强的会话语义。

而 mini-swe-agent 则相反，走的是极简路线。它根本不维持会话，每条命令都用一句 subprocess.run(shell=True) 直接丢给系统真正的 shell：

# minisweagent/environments/local.py
result = subprocess.run(
 command,
 shell=True, # 交给系统 shell（/bin/sh -c）
 text=True, cwd=cwd, timeout=timeout,
 stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
)

这么做丢掉了会话状态，每条命令都从头开始，agent 得自己把路径和环境写全；但也正因如此，它绕开了所有"自己解析 bash"的麻烦。heredoc 再大也是真 shell 的本职工作，命令跑完、进程一退出，退出码自然就有了。

于是同一条写大文件的命令，在 SWE-agent 撞上 bashlex 的短板、把容器拖死，在 mini-swe-agent 这边却平平无奇地跑了过去。这就是一组很典型的工程取舍——swe-rex 用更高的复杂度换更强的会话语义，也因此多背了一类失败面；mini 放弃了会话的便利，换来更小、更可控的出错空间。

结语

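Broadly speaking, this finding is consistent with Occam's razor: the simpler approach can turn out to be the more effective one (arguably a first-principles view as well). It does not prove the principle.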

当然，我认为这只是一个非常简单的 toy experiment。这并不能说明更复杂、更精密的 agent 效果就不好，可能只是因为 SWE-agent 恰好有这么一个 bug。也许一个经过精细调教的、更复杂的 agent 可以比 mini agent 更好。

当然，anyway，这只是一些猜测。接下来我会用更严谨的实验深挖开头提到的问题，感兴趣的朋友欢迎持续关注。

附录

分语言对比

语言	mini	SWE-agent	谁优
go	95/280 = 34%	78/280 = 28%	mini +6pp
python	139/266 = 52%	143/266 = 54%	SWE-agent +2pp
js	77/165 = 47%	73/165 = 44%	mini +3pp
ts	11/20 = 55%	8/20 = 40%	mini（N=20，小样本不稳）

mini 的优势几乎全部来自 Go，高出 6 个百分点、差 17 题，而这里面光 gravitational/teleport 一个仓就占了大头（仅 mini 解出的有 16 道，仅 SWE-agent 解出的只有 5 道）。这其实并不意外：Go 题大多是体量大、改动多的重仓，正好是最容易触发 SWE-agent heredoc 卡死的地方。可一旦换到 Python，SWE-agent 反而还略高了 2 个百分点。

配对显著性

再看配对显著性。把 731 题按 instance_id 一一对齐，能分成四类：

	数量
都过	245
仅 mini 过	77
仅 SWE-agent 过	57
都没过	352

真正分出胜负的是那 134 道一边过、一边不过的题，其中 mini 占 77、SWE-agent 占 57，确实偏向 mini，但 McNemar 精确检验给出的 p 值是 0.10，谈不上显著。更能说明问题的是另外两类：245 道两边都解得出、352 道两边都解不出，这说明它们能覆盖的题其实高度重叠。

卡死的公平性核算

最后补一笔公平账，对于那 9 个卡死的实例（flipt 5 个、teleport 2 个，加上 vuls 和 tutanota 各 1 个）逐题核对下来，真正算得上"不公平丢分"的其实只有 2 个，也就是 mini 能解、而 SWE-agent 仅仅因为卡死被记了 0 分的 vuls e4728e38 和 teleport 47530e1f。就算把这 2 分补回去，SWE-agent 也不过从 302 升到 304（41.6%），mini 仍是 44.0%，差距反而更小，“不显著"的结论丝毫没变。况且换个角度想，执行后端稳不稳本来就是一个 agent 端到端能力的一部分，在"衡量整套 scaffold"的口径下，记 0 并不算冤枉它😅。

TL;DR. On SWE-bench Pro, the more elaborate SWE-agent underperforms the minimalist mini-swe-agent, and additionally suffers from instances that hang indefinitely.

A common intuition holds that the more complete an agent framework is, the better it should perform. Although this assumption is rarely questioned, to my knowledge it has never been rigorously verified. I therefore set out to test it on a software-engineering (SWE) task.

The design is simple: evaluate two agent frameworks of differing complexity on SWE-bench Pro and compare their scores. The result was counterintuitive: the simpler framework achieved the higher score.

Background

What is an SWE task? A software-engineering (SWE) task measures an agent’s end-to-end, real-world development ability: given a real code repository and a GitHub issue, the agent must autonomously read the code, localize the problem, make cross-file edits, and produce a patch, which is then judged “resolved” by a suite of hidden tests. SWE-bench Pro is a benchmark targeting exactly this kind of task.

SWE-agent and mini-swe-agent are two agent frameworks built for the SWE setting:

SWE-agent is organized around the notion of an Agent-Computer Interface (ACI): a carefully designed set of dedicated tools, each with its own interface. Execution is delegated to a separate SWE-ReX backend, which maintains a persistent pexpect interactive shell (the working directory and environment variables persist across commands) and pre-parses every command with bashlex (splitting, syntax validation, and precise exit-code extraction).
mini-swe-agent is a minimal reimplementation (the agent class is roughly 100 lines of Python): it exposes a single tool, bash, and does not even rely on the model’s tool-calling interface; each command is executed through subprocess.run, with every action fully independent.

With this background in place, the central question becomes: under the same model and the same benchmark, does SWE-agent’s heavier engineering actually translate into a higher payoff than mini-swe-agent?

The more elaborate SWE-agent is not stronger

Using Claude Sonnet 4.5, I evaluated both frameworks on the full 731-problem SWE-bench Pro public set, capping the call budget at 50 per problem.

The results are as follows:

Agent framework	N	resolved	resolve rate
mini-swe-agent	731	322	44.0%
SWE-agent	731	302	41.3%

For reference, the officially reported figure for Sonnet 4.5 is around 43.6%¹; mini’s 44.0% aligns closely with it, which lends credibility to the present setup.

The minimalist mini-swe-agent is, in fact, 2.7 percentage points higher than SWE-agent. More surprisingly, after SWE-agent had completed 722 of the 731 instances, its final 9 instances hung outright: the containers stayed up for 5 to 12 hours with no log activity for hours on end, and had to be killed manually (and thus scored 0). Running the very same 9 problems, mini-swe-agent exhibited no such behavior.

For me, beyond the surprise in the numbers, the more intriguing question was why these 9 instances hung in the first place.

Why did 9 containers hang?

Before killing the containers, I captured the docker logs of each one:

INFO ... 200 OK POST /run_in_session
🦖 ERROR Bashlex fail: here-document at line 0 delimited by end-of-file (wanted "'EOF'")

The containers were alive and swerex-remote was still returning 200 OK; in other words, the agent was merely spinning in place. The agent’s trace shows that it was writing a large file via a heredoc:

cat > some_file.go <<'EOF'
... a large block of Go code ...
EOF

This pins down the root cause: SWE-agent’s execution backend, swe-rex, first parses every command with bashlex (a pure-Python bash parser) before dispatching it into the container.

That is precisely where things break. bashlex’s support for heredocs is incomplete; confronted with a large block-write such as cat <<'EOF' … a large block of code … EOF, it fails outright and raises Bashlex fail.

Once parsing collapses, swe-rex can no longer determine whether the command finished or what its exit code was. The agent receives a malformed observation, does not recover by attempting a different approach, and simply retries the same command over and over, leaving the container hung for 5 to 12 hours.

Why does SWE-agent bother parsing commands in the first place?

The answer follows from its design goals. swe-rex maintains a long-lived shell session, so that state such as the working directory, environment variables, and any activated virtual environment persists across commands.

The cost, however, is that when commands run inside a continuously flowing session, “where a command ends and what its return code is” is no longer as self-evident as it is when a standalone process exits. swe-rex must therefore rely on bashlex to split commands and inject sentinel strings in order to recover exit codes from the output stream.
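To make the sentinel idea concrete, here is a minimal sketch of a long-lived shell that recovers exit codes from the output stream. It is an illustration only and not swe-rex's actual implementation: the class name PersistentShell, the sentinel format, and the use of a plain bash subprocess instead of pexpect are all assumptions made for this example.

```python
import subprocess
import uuid


class PersistentShell:
    """Toy long-lived bash session (hypothetical; not SWE-ReX code)."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        )

    def run(self, command: str) -> tuple[int, str]:
        # A unique marker that carries $? lets us find both the end of the
        # command's output and its exit code inside one continuous stream.
        sentinel = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {sentinel}$?\n")
        self.proc.stdin.flush()
        output = []
        for line in self.proc.stdout:
            idx = line.find(sentinel)
            if idx >= 0:
                output.append(line[:idx])  # output that lacked a trailing newline
                exit_code = int(line[idx + len(sentinel):].strip())
                return exit_code, "".join(output)
            output.append(line)
        raise RuntimeError("shell exited before the sentinel appeared")

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait()
```

State persists between calls, so an `export` in one call is visible in the next. The sketch also hints at why a backend might want to inspect commands first: a command with an unterminated heredoc would swallow the sentinel line as heredoc text, and `run` would then wait forever.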

The benefit is that, once a command has been parsed into a structured form, the backend can additionally perform safety checks, command rewriting, and other fine-grained wrapping. This is exactly the Agent-Computer Interface philosophy that SWE-agent champions; on its own terms, the design is sound, trading extra complexity for stronger session semantics.

mini-swe-agent takes the opposite, minimalist route. It maintains no session at all; every command is handed directly to the system’s real shell through a single subprocess.run(shell=True):

# minisweagent/environments/local.py
result = subprocess.run(
 command,
 shell=True, # hand it to the system shell (/bin/sh -c)
 text=True, cwd=cwd, timeout=timeout,
 stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
)

This sacrifices session state (each command starts afresh, and the agent must spell out paths and environment itself), but for exactly that reason it sidesteps all the trouble of parsing bash in-process. A heredoc, however large, is the real shell’s native job; once the command finishes and the process exits, the exit code is simply there.
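The two properties described here, heredocs handled by the real shell and no state carried between commands, can be checked with a small sketch built on the same subprocess.run call. The helper names run_once and demo, the file name main.go, and the variable MINI_DEMO_VAR are hypothetical and exist only for this illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def run_once(command: str, cwd: str, timeout: float = 30.0) -> tuple[int, str]:
    """Run one command in a fresh shell and return (exit code, combined output)."""
    result = subprocess.run(
        command, shell=True, text=True, cwd=cwd, timeout=timeout,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return result.returncode, result.stdout


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        # The heredoc is parsed by the system shell, not by Python.
        heredoc = "cat > main.go <<'EOF'\npackage main\n\nfunc main() {}\nEOF"
        code, _ = run_once(heredoc, cwd=workdir)
        written = Path(workdir, "main.go").read_text()
        # Each call is a new process, so this export does not survive.
        run_once("export MINI_DEMO_VAR=bar", cwd=workdir)
        _, seen = run_once('echo "MINI_DEMO_VAR=${MINI_DEMO_VAR:-unset}"', cwd=workdir)
        return {"code": code, "written": written, "seen": seen.strip()}
```

The exit code comes straight from the finished process, so no parsing or sentinel is needed; the price is that the second command no longer sees the variable exported by the first.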

As a result, the same large-file-writing command that leaves the container hung under SWE-agent, by hitting bashlex’s limitation, runs uneventfully under mini-swe-agent. This is a textbook engineering trade-off: swe-rex buys stronger session semantics at the price of an additional failure surface, whereas mini forgoes the convenience of a session in exchange for a smaller and more controllable space of errors.

Conclusion

Broadly speaking, this finding echoes Occam’s razor: when capability is comparable, the simpler solution is often the more effective, and arguably the more robust, one.

That said, this is admittedly a small toy experiment. It does not establish that more complex, more sophisticated agents are necessarily worse; part of SWE-agent’s shortfall here may simply trace back to one specific bug. A carefully tuned, more elaborate agent could well surpass mini.

These remain, for now, preliminary conjectures. In follow-up work I intend to investigate the opening question more rigorously: whether a more complete agent is genuinely worth it. Stay tuned.

Appendix

Per-language comparison

Language	mini	SWE-agent	Winner
go	95/280 = 34%	78/280 = 28%	mini +6pp
python	139/266 = 52%	143/266 = 54%	SWE-agent +2pp
js	77/165 = 47%	73/165 = 44%	mini +3pp
ts	11/20 = 55%	8/20 = 40%	mini (N=20, small sample)

mini’s advantage comes almost entirely from Go, where it leads by 6 percentage points (17 problems); within Go, a single repository, gravitational/teleport, accounts for most of it (16 solved only by mini versus 5 only by SWE-agent). This is unsurprising: Go problems tend to involve large, heavily modified repositories, which is precisely where SWE-agent’s heredoc hang is most likely to be triggered. On Python, by contrast, SWE-agent is actually 2 points higher.
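As a quick consistency check, the per-language counts from the table can be summed and compared with the headline numbers. The counts below are copied from the table; the names PER_LANGUAGE, totals, and rate are introduced only for this sketch.

```python
# (solved by mini, solved by SWE-agent, number of problems), from the table above
PER_LANGUAGE = {
    "go": (95, 78, 280),
    "python": (139, 143, 266),
    "js": (77, 73, 165),
    "ts": (11, 8, 20),
}


def totals(table: dict) -> tuple[int, int, int]:
    """Sum the per-language counts into (mini, SWE-agent, N)."""
    mini = sum(row[0] for row in table.values())
    swe = sum(row[1] for row in table.values())
    n = sum(row[2] for row in table.values())
    return mini, swe, n


def rate(solved: int, n: int) -> float:
    """Resolve rate in percent, rounded to one decimal place."""
    return round(100 * solved / n, 1)


mini_total, swe_total, n_total = totals(PER_LANGUAGE)
go_gap = PER_LANGUAGE["go"][0] - PER_LANGUAGE["go"][1]
print(mini_total, swe_total, n_total, rate(mini_total, n_total), rate(swe_total, n_total), go_gap)
```

The sums reproduce 322 and 302 out of 731, and the Go gap of 17 problems accounts for most of the overall 20-problem difference.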

Paired significance

Aligning all 731 problems by instance_id yields four categories:

	Count
Both solved	245
Only mini	77
Only SWE-agent	57
Neither	352

What actually separates the two are the 134 problems solved by exactly one side: 77 for mini and 57 for SWE-agent. The tilt favors mini, but McNemar’s exact test yields p = 0.10, which is not significant. More telling are the other two categories: 245 problems solved by both and 352 solved by neither, indicating that the two scaffolds cover a highly overlapping set of problems.
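The p-value can be reproduced with a few lines of standard-library Python. This is a sketch of the two-sided exact McNemar test, which is a binomial test on the discordant pairs with success probability 0.5; the function name mcnemar_exact is chosen for this example.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for discordant counts b and c."""
    n = b + c
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction under
    # Binomial(n, 0.5), doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(77, 57)  # only-mini vs. only-SWE-agent
print(f"p = {p_value:.3f}")
```

Only the 134 discordant problems enter the test; the 245 problems solved by both and the 352 solved by neither carry no information about which scaffold is better.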

Fairness accounting for the hangs

Finally, a fairness check. Of the 9 hung instances (5 from flipt, 2 from teleport, and one each from vuls and tutanota), only 2 constitute genuinely “unfair” losses, i.e., problems that mini solved but on which SWE-agent was scored 0 purely because it hung: vuls e4728e38 and teleport 47530e1f. Even crediting those 2 back, SWE-agent rises only from 302 to 304 (41.6%), while mini remains at 44.0%; the gap narrows but the “not significant” conclusion is unchanged. Moreover, the robustness of the execution backend is itself part of an agent’s end-to-end capability, so scoring 0 is not unfair under a “whole-scaffold” evaluation.

¹ SWE-bench Pro, https://arxiv.org/abs/2509.16941

Using a SKILL to Let an Agent Automatically Filter and Interpret HuggingFace Daily Papers

Tue, 02 Jun 2026 00:00:00 +0000

起因

作为研究生，我们每天都要读很多论文。但除了读论文本身，挑选论文同样很耗费时间和精力。

能帮忙分析论文的软件其实不少，比如：

AlphaXiv：可以非常系统地精读某一篇论文。
papers.cool：由苏神开发，每天爬取 arXiv 新论文，用 Kimi 给出中文解读，并支持在 Kimi 里继续追问、深入分析。

尽管这些工具在论文解读上做得很深入，但它们有一个共同的缺点：没有起到筛选的作用。它们能把某一篇论文读得很细，可“从每天几十篇里挑出值得读的”仍然得我自己来做，它们也不会帮我打标签。此外，更麻烦的是，即便发现一篇不错的论文、想 follow 它的工作，也常常会发现它的 GitHub repo 是空的，甚至根本没有链接，最后白忙活一场。

这样的沉没成本其实很高：follow 一篇论文，可能写了很久的代码，最后才发现某个输入对不上，或者某一步根本复现不了。

所以我想把这个工具做得真正有用、能投入真正使用，可以把时间花在有意义的工作上。

思路

需求

具体来说，我希望它能帮我筛选每天 HuggingFace Daily Papers 里的论文，并且能按分类整理好。

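Papers in HuggingFace Daily Papers are submitted by the authors themselves. That initiative tends to make them more complete, more credible, and higher in quality than arXiv papers at large.

The product needs to do three things: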

自动分类：为每篇论文打上标签。
验证论文的真实性，以及“容易 follow”的程度。
自动生成论文摘要（类似于papers.cool）

其中“容易 follow”，定义为：

(a) 论文应该对应有 GitHub 的 repo；
(b) repo 里的代码必须是完善的——比如有些 repo 里只有一个 README，由于难以复现，因此应该被排除；
(c) 数据集等资源也应是开源的。

HuggingFace Daily Papers：每天几十篇论文需要我们自己点进去逐个挑选

TeX source or PDF

关于论文分析的真实性和技术实现，我有两点考虑：

真实性：论文发表的单位应当是frontier的高校/机构
流程：为了便于分析，应该让 Agent 直接读取论文的 tex 源码，而不是PDF。

之所以不建议读 PDF，是因为 PDF 在解析时容易出现各种问题，而且也难以进行字符串匹配（比如要在正文里找 github.com/ 这样的链接）。

实现

先Python初筛，再交给 Agent 自主执行

我希望它每天自动爬取，让我无感地拿到当天的论文。整体分两步：先用 Python 脚本初筛，再交给 Agent 做判断和整理。

对于初筛，HuggingFace 有公开接口，按日期就能取到当天列表，所以我写了个零依赖脚本 fetch_hf_papers.py，用关键词规则粗筛一遍：

# scripts/fetch_hf_papers.py —— 直接调用 HF 公开接口，无需 API Key
url = f"https://huggingface.co/api/daily_papers?date={date_str}"

# 标题命中这些关键词 → 排除
TITLE_EXCLUDE_KEYWORDS = [
 "benchmark", "benchmarking", "bench",
 "speech", "audio", "video", "3d",
 "compiler", "cuda", "kernel", "triton", "tpu", "xla",
 "quantization", "quantisation", "distillation",
]

# 摘要必须命中其一 → 确认是 LLM/VLM 领域
ABSTRACT_REQUIRE_ANY = [
 "large language model", "llm", "vision language model", "vlm",
 "multimodal", "reasoning", "reinforcement learning",
 "instruction tuning", "fine-tuning", "alignment", "agent",
 "chain-of-thought", "in-context learning",
]

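The standard library (urllib + json) is enough, and no API key is needed. This step narrows dozens of papers down to a dozen or so; the remaining papers are left for the Agent to assess.

Next comes judging and analyzing the papers. Since current models already balance instruction-following ability and cost well, I decided not to build much of a harness and to hand the judgment entirely to the Agent. To that end, I wrote the requirements as a Skill.

SKILL Pipeline Design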

SKILL将每篇候选论文分为三个步骤：

第一步，提取 GitHub 链接：优先用 HF 接口的 githubRepo 字段，如果此字段为空，就去 arXiv 的 tex 源码里检索 github.com/。

第二步，调用 GitHub Contents API 验证仓库里有没有实质代码：

API: https://api.github.com/repos/{owner}/{repo}/contents

保留（满足其一）：
 - 根目录有 .py / .sh / .ipynb 文件
 - 有 src / scripts / train / model / code 等目录

丢弃（命中其一）：
 - 只有 README.md / LICENSE / assets 这类非代码文件
 - API 返回 404（仓库不存在或为空）
 - 仓库名 / 描述含 "coming-soon"

第三步，写一段一眼就能看懂的中文摘要，再从一套固定标签里打标签以便筛选：

RL · 微调 · 无需训练 · 长文本 · VLM · MeM(Agent记忆) · API · 扩散模型

最后落成一条 JSON：

{
 "date": "2026-03-12",
 "title": "Prism-Δ: Differential Subspace Steering for Prompt Highlighting in Large Language Models",
 "arxiv_id": "2603.10705",
 "github": "https://github.com/YuyaoGe/PRISM-DELTA",
 "abstract": "PRISM-Δ 是一种提示高亮方法，使 LLM 在生成时优先关注用户指定的文本片段。核心思路是分解正负交叉协方差矩阵的差值以最大化判别能量、消除共享方向，每个注意力头获得连续 softplus 重要性权重（弱但有用的头以降低强度贡献），并扩展到 Value 表示以捕获内容通道信号。在 4 个基准、5 个模型上，PRISM-Δ 在 20 个配置中的 19 个匹配或超越最佳现有方法，相对增益最高 +10.6%，流畅度损失减半，长文本检索场景相对增益最高 +4.8%。",
 "tags": ["无需训练"]
}

Harness Engineering Design

有了SKILL接下来就是设计 Agent (或者是Subagent) 与 SKILL 的调用关系。

这里两个选择：

主从调度：一个主 Agent 去调度多个子 Agent
并行调度：每天各起一个独立 Agent

对于主从调度，我使用OpenClaw实现，最大subagent数设置为5。

然而经常出现超时的问题，具体而言，让claw调研某 10 天的论文，尽管它会起多个 sub-agent，但超时&超出上下文的问题频频出现，极不稳定。

因此，后来选择了后者的方案：每天一个独立 Agent 并行跑，各自写一个 JSON，最后用主程序合并。一天生成一个论文列表，事实证明，越简单越稳定。

并行用 xargs -P 就够：

# backfill_papers.sh —— 批量补录历史日期，默认并发 6
printf '%s\n' "${MISSING[@]}" \
 | xargs -P "$CONCURRENCY" -I{} bash "$SCRIPT_DIR/run_kimi_one_day.sh" {} "$PAPER_READER_DIR"

合并脚本 merge_batches.py 也只做确定的事：扫 paper_batches/*.json，跳过已有日期，按日期把缺的追加进去。

关于Agent框架的选择

选 Agent 有个硬条件：能在命令行非交互式启动，而不是用进入终端手动操作。

比如：

Cursor & Claude Code：需要进GUI界面或者终端操作
Kimi CLI：可以将 prompt 作为CLI启动命令中的一个形参，可以很方便的调用，而且KIMI价格便宜

所以对于每一天的处理就是一行命令：

# run_kimi_one_day.sh —— 用 Kimi CLI 处理单天，结果写入 batch JSON
kimi --print --quiet \
 --work-dir "$PAPER_READER_DIR" \
 --add-dir /Users/yuyaoge/Project/Paper_Agent_Skill \
 -p "$PROMPT" \
 > "$LOG_FILE" 2>&1

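Even so, the workflow still has to be started manually every day, whereas I want it to run without the user noticing. Hence an auto-start script tailored for macOS.

Auto-start script on macOS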

自动脚本用 macOS 的 launchd，配置 com.yuyaoge.paper-daily-fetch.plist：

<!-- 登录 / load 时立即跑一次 -->
<key>RunAtLoad</key>
<true/>

<!-- 之后每 2 小时再跑一次 -->
<key>StartInterval</key>
<integer>7200</integer>

daily_fetch.sh 里有两个Feature值得一提：

不处理今日论文：HuggingFace 在当天是会根据用户的上传而实时更新的，当天爬会漏掉当前时间点后面提交的论文。所以默认抓“昨天往前数 7 天”，而不包括当天的论文，顺便也补上关机那几天的论文。
幂等：不应该每日定时启动，因此不确定所定的时间点是否已经开机。因此，设定为自开机后，每 2 小时跑一次，如果已有结果就跳过，但若是空结果先执行一次；如果git没变化就不 push。

Quick Start

环境要求：macOS、Python 3，以及已安装并登录的 Kimi CLI。

1. 克隆仓库

git clone https://github.com/YuyaoGe/Paper_Agent_Skill.git # Skill + 脚本
git clone https://github.com/YuyaoGe/paper_reader.git # 数据 + 前端

2. 安装 Skill 到 Kimi

cd Paper_Agent_Skill
mkdir -p ~/.kimi/skills
ln -sfn "$PWD" ~/.kimi/skills/hf-paper-filter

3. 手动验证链路

# run_kimi_one_day.sh YYYY-MM-DD [paper_reader 路径]
./scripts/run_kimi_one_day.sh 2026-06-01 /path/to/paper_reader

4.（可选）批量补录历史区间

# backfill_papers.sh 起始日期 结束日期 [并发数] [paper_reader 路径]
./scripts/backfill_papers.sh 2026-04-25 2026-05-26 6 /path/to/paper_reader
python3 ./scripts/merge_batches.py /path/to/paper_reader

5. 安装定时任务，实现无人值守

cp scripts/com.yuyaoge.paper-daily-fetch.plist ~/Library/LaunchAgents/
launchctl load -w ~/Library/LaunchAgents/com.yuyaoge.paper-daily-fetch.plist

前端展示

最终列表汇总到 paper_list.md 中，以便于 Agent 可以很方便的在文件末尾追加，同时易于前端解析。

前端设计为纯静态页面，运行时把 Markdown 拉下来解析成卡片，支持按标签筛、按日期检索：

// 前端运行时直接拉取 Markdown 数据源并解析
const resp = await fetch('paper_list.md');
// 每条格式：- **Title** `[Tag]` — [id](url) | [GitHub](url)
// > 中文摘要
currentPapers.push({ title, tags, links, desc });

托管在 GitHub Pages，定时脚本每天把更新后的 paper_list.md push 到云端，页面同步更新。

paper_reader：筛选、打标签、生成中文摘要后的样子，可以按标签筛选、按日期排序。

整体流程

综上，整条流程如下：

 macOS launchd ──▶ daily_fetch.sh （每 2h 触发，幂等）
 │ 按天拆分，并行
 ▼
 run_kimi_one_day.sh × N (xargs -P 6)
 └─ Kimi CLI 加载 hf-paper-filter Skill
 ├─ fetch_hf_papers.py Python 初筛
 ├─ GitHub Contents API 验证有无代码
 └─ 写中文摘要 + 打标签
 │ 每天一个 JSON
 ▼
 paper_batches/YYYY-MM-DD.json
 │ merge_batches.py 合并
 ▼
 paper_list.md ──git push──▶ GitHub Pages（前端 fetch 渲染）

用到的东西也都很常规：Python 标准库做初筛、Kimi CLI 做判断、xargs -P 并行、launchd 定时、一个 paper_list.md 当数据源、一个静态页面做展示。

Motivation

As graduate students, we read a lot of papers every day. But beyond reading the papers themselves, deciding which papers to read is just as time- and energy-consuming.

There is no shortage of tools for analyzing papers, for example:

AlphaXiv: reads a single paper very systematically.
papers.cool: built by Su Jianlin; it crawls new arXiv papers daily, uses Kimi to produce Chinese explanations, and lets you keep asking follow-up questions inside Kimi.

These tools go deep on reading a paper, but they share one shortcoming: they do not help with filtering. They can read a given paper in great detail, yet “picking the few worth reading out of the dozens each day” is still on me, and they will not tag papers either. Worse, even when I find a promising paper and want to follow up on its work, I often discover that its GitHub repo is empty, or that there is no link at all — and the effort is wasted.

The sunk cost here is high: when following a paper, I might write code for a long time only to find that some input does not match, or that a reproduction step simply does not work.

So I wanted to build something genuinely useful and actually usable in daily work, so that time goes to work that matters.

Idea

Requirements

Concretely, I want it to filter the papers in HuggingFace Daily Papers every day, and organize them by category.

Papers in HuggingFace Daily Papers are submitted by the authors themselves. That initiative tends to make daily-paper submissions more complete, more credible, and higher in quality than arXiv at large.

The product needs to do three things:

Auto-classification: tag every paper.
Verify a paper’s authenticity and how “easy to follow” it is.
Auto-generate a paper summary (similar to papers.cool).

Here “easy to follow” is defined as:

(a) the paper should have a corresponding GitHub repo;
(b) the code in that repo must be substantial — a repo with only a README, for instance, is hard to reproduce and should be excluded;
(c) datasets and other resources should be open-sourced as well.

HuggingFace Daily Papers: dozens of papers a day that you have to click into and sift through one by one.

TeX Source vs. PDF

On authenticity and implementation, I had two considerations:

Authenticity: the publishing institution should be a frontier university / lab.
Pipeline: for ease of analysis, the Agent should read the paper’s tex source directly rather than the PDF.

The reason to avoid PDFs is that PDF parsing breaks in all sorts of ways, and string matching is hard (e.g., searching the body for a github.com/ link).

Implementation

Python Pre-filter First, Then Hand Off to the Agent

I want it to crawl automatically every day, so I get the day’s papers effortlessly. The whole thing is two steps: a Python script does a coarse pre-filter first, then an Agent handles judgment and organizing.

For the pre-filter: HuggingFace has a public API, and you can fetch a given day’s list by date, so I wrote a zero-dependency script fetch_hf_papers.py that filters coarsely by keyword rules:

# scripts/fetch_hf_papers.py — call the public HF API directly, no API key needed
url = f"https://huggingface.co/api/daily_papers?date={date_str}"

# Titles matching these keywords → excluded
TITLE_EXCLUDE_KEYWORDS = [
 "benchmark", "benchmarking", "bench",
 "speech", "audio", "video", "3d",
 "compiler", "cuda", "kernel", "triton", "tpu", "xla",
 "quantization", "quantisation", "distillation",
]

# Abstract must match at least one → confirm it is in the LLM/VLM space
ABSTRACT_REQUIRE_ANY = [
 "large language model", "llm", "vision language model", "vlm",
 "multimodal", "reasoning", "reinforcement learning",
 "instruction tuning", "fine-tuning", "alignment", "agent",
 "chain-of-thought", "in-context learning",
]

The standard library (urllib + json) is enough, and no API key is needed. This step narrows dozens of papers down to a dozen or so; the rest is left for the Agent to weigh.
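A minimal sketch of how the two keyword lists could be applied follows. The function name keep_paper and the plain lowercase substring matching are assumptions; the actual fetch_hf_papers.py may match differently.

```python
TITLE_EXCLUDE_KEYWORDS = [
    "benchmark", "benchmarking", "bench",
    "speech", "audio", "video", "3d",
    "compiler", "cuda", "kernel", "triton", "tpu", "xla",
    "quantization", "quantisation", "distillation",
]

ABSTRACT_REQUIRE_ANY = [
    "large language model", "llm", "vision language model", "vlm",
    "multimodal", "reasoning", "reinforcement learning",
    "instruction tuning", "fine-tuning", "alignment", "agent",
    "chain-of-thought", "in-context learning",
]


def keep_paper(title: str, abstract: str) -> bool:
    """Return True if the paper survives the coarse keyword pre-filter."""
    title_lc = title.lower()
    abstract_lc = abstract.lower()
    # Rule 1: any excluded keyword in the title drops the paper.
    if any(keyword in title_lc for keyword in TITLE_EXCLUDE_KEYWORDS):
        return False
    # Rule 2: the abstract must mention at least one required keyword.
    return any(keyword in abstract_lc for keyword in ABSTRACT_REQUIRE_ANY)


print(keep_paper("Steering LLMs with Subspace Edits", "We study large language model prompting."))
print(keep_paper("A New Benchmark for Agents", "LLM agents are evaluated."))
```

One caveat of plain substring matching, if the real script uses it too: short keywords also match inside longer words, so a title containing "Output" is excluded by "tpu". Matching on word boundaries would avoid this at the cost of a slightly longer script.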

Next comes judging and analyzing the papers. Since today’s models already balance instruction-following and cost well, I decided not to add much harness and instead hand full judgment to the Agent. To that end, I wrote the requirements as a Skill.

SKILL Pipeline Design

The SKILL splits each candidate paper into three steps:

Step 1, extract the GitHub link: prefer the githubRepo field from the HF API; if it is empty, search the paper’s arXiv tex source for github.com/.
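Step 1 can be sketched as a small pure function. The githubRepo field name comes from the description above; the function name pick_repo, the regular expression, and the assumption that the field already holds a full URL are illustrative choices and not taken from the Skill itself.

```python
import re
from typing import Optional

# owner/repo after "github.com/"; stops at characters such as "}" or whitespace
GITHUB_RE = re.compile(r"github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)", re.IGNORECASE)


def pick_repo(github_repo_field: Optional[str], tex_source: str) -> Optional[str]:
    """Prefer the HF-provided link; otherwise fall back to the first link in the TeX."""
    if github_repo_field:
        return github_repo_field
    match = GITHUB_RE.search(tex_source)
    if match is None:
        return None
    owner, repo = match.group(1), match.group(2)
    repo = repo.rstrip(".")  # drop sentence-ending punctuation
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://github.com/{owner}/{repo}"


tex = r"Code is available at \url{https://github.com/YuyaoGe/PRISM-DELTA}."
print(pick_repo(None, tex))
```

Working on the TeX source keeps this a plain string search, which is the reason given earlier for preferring TeX over PDF.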

Step 2, call the GitHub Contents API to verify whether the repo has substantial code:

API: https://api.github.com/repos/{owner}/{repo}/contents

Keep (any one of):
 - .py / .sh / .ipynb files in the repo root
 - directories such as src / scripts / train / model / code

Drop (any one of):
 - only non-code files like README.md / LICENSE / assets
 - API returns 404 (repo missing or empty)
 - repo name / description contains "coming-soon"
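The keep/drop rules above can be written as a pure function over the JSON list that the Contents API returns for a repository root, where each entry has a "name" and a "type" of "file" or "dir". The function name has_substantial_code, the use of None to represent a 404, and the exact directory set are assumptions for this sketch; the Skill lists the directories only as examples.

```python
from typing import Optional

CODE_SUFFIXES = (".py", ".sh", ".ipynb")
CODE_DIRS = {"src", "scripts", "train", "model", "code"}  # examples, not exhaustive


def has_substantial_code(
    entries: Optional[list], repo_name: str = "", description: str = ""
) -> bool:
    """Apply the keep/drop rules to a root-directory listing."""
    if entries is None:  # the API answered 404: repo missing or empty
        return False
    if "coming-soon" in f"{repo_name} {description}".lower():
        return False
    for entry in entries:
        name = entry.get("name", "").lower()
        if entry.get("type") == "file" and name.endswith(CODE_SUFFIXES):
            return True
        if entry.get("type") == "dir" and name in CODE_DIRS:
            return True
    return False  # only README / LICENSE / assets and similar


readme_only = [{"name": "README.md", "type": "file"}, {"name": "assets", "type": "dir"}]
with_code = readme_only + [{"name": "train.py", "type": "file"}]
print(has_substantial_code(readme_only), has_substantial_code(with_code))
```

Keeping the decision separate from the HTTP request makes the rules easy to test offline; the caller only has to map a 404 response to None.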

Step 3, write a concise summary that is understandable at a glance, then assign tags from a fixed set (so the frontend can filter):

RL · Fine-tuning · Training-free · Long-context · VLM · MeM (Agent Memory) · API · Diffusion

Finally it lands as one JSON record:

{
 "date": "2026-03-12",
 "title": "Prism-Δ: Differential Subspace Steering for Prompt Highlighting in Large Language Models",
 "arxiv_id": "2603.10705",
 "github": "https://github.com/YuyaoGe/PRISM-DELTA",
 "abstract": "PRISM-Δ is a prompt-highlighting method that makes an LLM prioritize user-specified text spans during generation. The core idea is to decompose the difference between the positive and negative cross-covariance matrices to maximize discriminative energy and eliminate shared directions; each attention head gets a continuous softplus importance weight (weak-but-useful heads contribute at reduced strength), and the method is extended to the Value representation to capture content-channel signals. Across 4 benchmarks and 5 models, PRISM-Δ matches or surpasses the best existing methods in 19 of 20 configurations, with relative gains up to +10.6%, fluency loss halved, and up to +4.8% relative gain in long-context retrieval.",
 "tags": ["Training-free"]
}

Agent Orchestration Design

With the SKILL in place, the next question is the calling relationship between the Agent (or Subagent) and the SKILL.

There are two choices:

Master–worker: one master Agent dispatches multiple sub-Agents.
Parallel: spin up an independent Agent for each day.

For master–worker, I implemented it with OpenClaw, capping sub-agents at 5. But timeouts kept happening: asking Claw to survey, say, 10 days of papers, it would launch several sub-agents, yet timeouts and context overruns appeared constantly — very unstable.

So I went with the latter: one independent Agent per day, running in parallel, each writing its own JSON, then a main program merges them. One paper list per day — and as it turns out, the simpler it is, the more stable.

Parallelism is just xargs -P:

# backfill_papers.sh — batch-backfill historical dates, default concurrency 6
printf '%s\n' "${MISSING[@]}" \
 | xargs -P "$CONCURRENCY" -I{} bash "$SCRIPT_DIR/run_kimi_one_day.sh" {} "$PAPER_READER_DIR"

The merge script merge_batches.py also only does deterministic work: scan paper_batches/*.json, skip dates already present, and append the missing ones in date order.
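The merge logic described here is small enough to sketch. This is not the real merge_batches.py: the function name, the in-memory dict keyed by date, and the assumption that each batch file is named YYYY-MM-DD.json are illustrative, and the real script writes paper_list.md instead of returning a list.

```python
import json
from pathlib import Path


def merge_batches(batch_dir: Path, merged: dict) -> list:
    """Add every batch whose date is not yet in `merged`; return the dates added."""
    added = []
    # ISO dates sort chronologically as plain strings, so sorted() gives date order.
    for path in sorted(batch_dir.glob("*.json")):
        day = path.stem
        if day in merged:
            continue  # already merged: running again changes nothing
        merged[day] = json.loads(path.read_text(encoding="utf-8"))
        added.append(day)
    return added
```

Because dates that are already present are skipped, running the merge twice in a row adds nothing the second time, which is what lets the scheduled job call it unconditionally.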

Choosing the Agent Framework

There is one hard requirement for the Agent: it must launch non-interactively from the command line, not require manual operation in a terminal.

For example:

Cursor & Claude Code: require a GUI or terminal interaction.
Kimi CLI: lets you pass the prompt as an argument to the launch command — easy to invoke, and Kimi is cheap.

So processing a single day is one line:

# run_kimi_one_day.sh — process a single day with Kimi CLI, write the batch JSON
kimi --print --quiet \
 --work-dir "$PAPER_READER_DIR" \
 --add-dir /Users/yuyaoge/Project/Paper_Agent_Skill \
 -p "$PROMPT" \
 > "$LOG_FILE" 2>&1

Even so, the workflow still has to be started manually each day, whereas I want it to run transparently. Hence an auto-start script tailored for macOS.

Auto-start Script on macOS

The auto-start uses macOS launchd, configured via com.yuyaoge.paper-daily-fetch.plist:

<!-- Run once at login / load -->
<key>RunAtLoad</key>
<true/>

<!-- Then run again every 2 hours -->
<key>StartInterval</key>
<integer>7200</integer>

Two features in daily_fetch.sh are worth mentioning:

It does not process “today’s” papers: HuggingFace updates the same-day list in real time as authors submit, so crawling today would miss papers submitted later in the day. By default it fetches the 7 days up to and including yesterday, which also backfills any days the machine was off.
Idempotency: a fixed daily trigger is unreliable, since there is no guarantee the machine is on at that moment. The job therefore runs every 2 hours after boot: a day that already has results is skipped, a day whose result is empty is run again, and nothing is pushed when git shows no changes.
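The date window and the skip rule can be sketched in Python, although daily_fetch.sh itself is a shell script. The function names, the paper_batches/YYYY-MM-DD.json layout as the marker of a finished day, and treating a zero-byte file as an empty result are assumptions made for this illustration.

```python
from datetime import date, timedelta
from pathlib import Path


def dates_to_process(today: date, days: int = 7) -> list:
    """The `days` dates ending yesterday, oldest first; today is never included."""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(days, 0, -1)]


def missing_dates(today: date, batch_dir: Path, days: int = 7) -> list:
    """Dates in the window that still need a run (no file yet, or an empty one)."""
    missing = []
    for day in dates_to_process(today, days):
        path = batch_dir / f"{day}.json"
        if not path.exists() or path.stat().st_size == 0:
            missing.append(day)
    return missing


print(dates_to_process(date(2026, 6, 2)))
```

With this rule, a trigger every 2 hours is harmless: once every day in the window has a non-empty result, the list of missing dates is empty and the run has nothing to do.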

Quick Start

Requirements: macOS, Python 3, and an installed & logged-in Kimi CLI.

1. Clone the repos

git clone https://github.com/YuyaoGe/Paper_Agent_Skill.git # Skill + scripts
git clone https://github.com/YuyaoGe/paper_reader.git # data + frontend

2. Install the Skill into Kimi

cd Paper_Agent_Skill
mkdir -p ~/.kimi/skills
ln -sfn "$PWD" ~/.kimi/skills/hf-paper-filter

3. Verify the pipeline manually

# run_kimi_one_day.sh YYYY-MM-DD [paper_reader path]
./scripts/run_kimi_one_day.sh 2026-06-01 /path/to/paper_reader

4. (Optional) Backfill a historical range

# backfill_papers.sh START_DATE END_DATE [CONCURRENCY] [paper_reader path]
./scripts/backfill_papers.sh 2026-04-25 2026-05-26 6 /path/to/paper_reader
python3 ./scripts/merge_batches.py /path/to/paper_reader

5. Install the scheduled job for unattended runs

cp scripts/com.yuyaoge.paper-daily-fetch.plist ~/Library/LaunchAgents/
launchctl load -w ~/Library/LaunchAgents/com.yuyaoge.paper-daily-fetch.plist

Frontend

The final list is aggregated into paper_list.md, so the Agent can easily append to the end of the file and the frontend can easily parse it.

The frontend is a pure static page: at runtime it pulls the Markdown down and parses it into cards, supporting tag filtering and date search:

// The frontend fetches the Markdown data source and parses it at runtime
const resp = await fetch('paper_list.md');
// Each entry: - **Title** `[Tag]` — [id](url) | [GitHub](url)
// > Chinese summary
currentPapers.push({ title, tags, links, desc });

It is hosted on GitHub Pages; the scheduled script pushes the updated paper_list.md to the cloud every day, and the page updates in sync.

paper_reader: what it looks like after filtering, tagging, and Chinese-summary generation — filterable by tag and sortable by date.

Overall Pipeline

Putting it all together, the full pipeline is:

 macOS launchd ──▶ daily_fetch.sh (every 2h, idempotent)
 │ split by day, run in parallel
 ▼
 run_kimi_one_day.sh × N (xargs -P 6)
 └─ Kimi CLI loads the hf-paper-filter Skill
 ├─ fetch_hf_papers.py Python pre-filter
 ├─ GitHub Contents API verify code presence
 └─ write Chinese summary + tags
 │ one JSON per day
 ▼
 paper_batches/YYYY-MM-DD.json
 │ merge_batches.py
 ▼
 paper_list.md ──git push──▶ GitHub Pages (frontend fetch + render)

The pieces are all pretty ordinary: the Python standard library for pre-filtering, Kimi CLI for judgment, xargs -P for parallelism, launchd for scheduling, a single paper_list.md as the data source, and a static page for display.