Memory | YuyaoGe's Website

Abstract: Long contexts challenge transformers as attention scores dilute across thousands of tokens and critical information is often lost in the middle. We reframe test-time adaptation as a budget-constrained memory consolidation problem and propose Gdwm (Gated Differentiable Working Memory), which introduces a write controller that estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, to allocate gradient steps efficiently.