A new paradigm for agentic LLMs

Skill Retrieval Augmentation for Agentic AI

RAG retrieves knowledge. SRA retrieves capabilities.

Modern LLM agents rely on external skills: reusable capability packages. Today's frameworks enumerate every available skill in context, which breaks down as skill libraries grow. SRA retrieves, incorporates, and applies skills from a large external corpus on demand; SRA-Bench is the first benchmark for decomposed evaluation of the full pipeline.

Weihang Su¹,²,* · Jianming Long¹ · Qingyao Ai¹,† · Qiaozhi He² · Yichen Tang¹ · Changyue Wang¹ · Yiteng Tu¹ · Yingbo Wang² · Yiqun Liu¹

¹Department of Computer Science and Technology, Tsinghua University · ²ByteDance Inc. · *Work done during internship at ByteDance · †Corresponding author


5,400 test instances

636 gold skills

26,262 total corpus

6 capability domains

8 LLMs evaluated

TL;DR

📐

New paradigm

We formulate Skill Retrieval Augmentation, a capability-centric counterpart to RAG. RAG retrieves knowledge; SRA retrieves capabilities.

🧪

SRA-Bench

5,400 capability-intensive instances spanning theorem application, formal logic, tool use, medical calculation, competition math, and code, paired with 636 gold skills mixed into a 26K-skill web-collected corpus.

🔬

Decomposed eval

Three coupled stages: retrieval, incorporation, application. Each is evaluated in isolation so we can pinpoint where SR-Agents actually fail.

💡

Surprising bottleneck

Better retrieval does not guarantee better answers. The real gap is controlled, need-aware skill utilization. Most LLMs load skills at similar rates whether the task actually needs external capability or not.

The Paradigm

From RAG to SRA

RAG retrieves knowledge; SRA retrieves capabilities. The pipeline has three tightly coupled stages: a skill must be retrieved from a large library, deliberately incorporated into the agent's active problem-solving state, and ultimately applied to solve the task.

**Figure 1.** The SRA paradigm: an agent retrieves candidate skills from a large skill corpus, selectively incorporates useful skills into context, and applies them for downstream reasoning and acting. Black arrows denote the standard workflow; blue arrows show iterative skill retrieval during reasoning and acting.

Skill Retrieval

Given a user query and a corpus of N skills, a retriever R returns a ranked list of k ≪ N candidates. This reduces a massive capability space to a manageable shortlist.
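To make the retrieval stage concrete, here is a minimal sketch of a lexical top-k retriever using a from-scratch BM25 score. The skill schema (an id mapped to a description string), the whitespace tokenizer, and the parameter values `k1` and `b` are illustrative assumptions, not the benchmark's actual retriever configuration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real retriever would normalize more carefully.
    return text.lower().split()


def bm25_top_k(query, skills, k=3, k1=1.5, b=0.75):
    """Return the ids of the k highest-scoring skills for `query`.

    `skills` maps a skill id to its description text (hypothetical schema).
    """
    docs = {sid: tokenize(text) for sid, text in skills.items()}
    n = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n

    # Document frequency: in how many skills does each term occur?
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))

    scores = {}
    for sid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[sid] = score

    # Rank all N skills and keep the shortlist of k candidates.
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpus for illustration only.
toy_skills = {
    "crcl": "estimate creatinine clearance with the cockcroft-gault equation",
    "bmi": "compute body mass index from height and weight",
    "modus_tollens": "apply modus tollens to a conditional statement",
}
shortlist = bm25_top_k("creatinine clearance for a 65 year old patient", toy_skills, k=2)
print(shortlist)
```

On this toy corpus the creatinine-clearance skill ranks first because it shares two query terms; at real scale the same function would shortlist k candidates out of tens of thousands.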

Skill Incorporation

The agent decides which retrieved skills, if any, should enter its problem-solving state, and in what form (rewritten, compressed, restructured, or model-adapted).

Skill Application

Conditioned on the incorporated skills, the agent solves the task. Successful incorporation does not guarantee successful application.
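The three stages can be read as a small pipeline in which each stage is a pluggable function. The sketch below is one way to express that structure; the `Skill` fields, the function names, and the toy stand-ins are all hypothetical and only illustrate the control flow, including the case where incorporation admits no skill at all.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Skill:
    skill_id: str
    description: str
    content: str


def run_sra(
    query: str,
    retrieve: Callable[[str, int], List[Skill]],
    incorporate: Callable[[str, List[Skill]], List[Skill]],
    apply_skills: Callable[[str, List[Skill]], str],
    k: int = 5,
) -> str:
    candidates = retrieve(query, k)          # stage 1: shortlist k candidates
    active = incorporate(query, candidates)  # stage 2: may return an empty list
    return apply_skills(query, active)       # stage 3: solve with whatever was admitted


# Toy stand-ins for each stage (not real retrievers or models).
library = [
    Skill("crcl", "creatinine clearance", "Cockcroft-Gault procedure"),
    Skill("bmi", "body mass index", "BMI = weight_kg / height_m**2"),
]


def toy_retrieve(query, k):
    return library[:k]


def toy_incorporate(query, candidates):
    # Admit a skill only if one of its description words appears in the query.
    q = query.lower()
    return [s for s in candidates if any(w in q for w in s.description.split())]


def toy_apply(query, skills):
    return "solved with " + ", ".join(s.skill_id for s in skills) if skills else "solved natively"


print(run_sra("Compute the body mass index", toy_retrieve, toy_incorporate, toy_apply))
print(run_sra("What is 2 + 2?", toy_retrieve, toy_incorporate, toy_apply))
```

Keeping the stages as separate callables mirrors the decomposed evaluation: any one stage can be swapped for an oracle while the others stay fixed.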

The Benchmark

SRA-Bench at a glance

SRA-Bench is the first benchmark for decomposed evaluation of the SRA pipeline. We curate capability-intensive instances from six existing benchmarks, manually construct one gold skill per annotation category, and embed them in a large web-collected corpus that approximates a real, open skill ecosystem.

Dataset	Capability Type	Instances	Skills	Mapping	Evaluation
TheoremQA	Theorem Application	747	320	Single	Rule-based
LogicBench	Logical Reasoning Patterns	760	19	Single	Rule-based
ToolQA	Tool-Use Workflows	1,430	14	Single	Rule-based
MedCalc-Bench	Medical Calculators	1,100	55	Single	Rule-based
CHAMP	Mathematical Concepts	223	89	Multi	Rule-based
BigCodeBench	Software Libraries	1,140	139	Multi	Execution
Total		5,400	636 gold

What does a gold skill look like?

A gold skill is a standalone Markdown artifact: a name, a one-line description, and procedural content covering applicability, the actual method, and worked examples. Many skills also ship an executable tool. The example below is one of 55 medical calculators in SRA-Bench.

medcalcbench_046

Creatinine Clearance (Cockcroft-Gault Equation)

Estimate creatinine clearance for renal drug dosing, with BMI-based body-weight adjustment and a sex-specific factor.

executable tools

The Cockcroft-Gault equation estimates creatinine clearance (CrCl) as a surrogate for glomerular filtration rate, widely used for drug-dose adjustment in renal impairment. The body weight that enters the formula depends on the patient's BMI category.

Required Inputs

Age: in years
Sex: male or female
Height: in cm (× 2.54 from inches if needed)
Weight: in kg (× 0.453592 from lbs if needed)
Serum creatinine: in mg/dL (÷ 88.42 from µmol/L if needed)

Computation

Step 1. Calculate BMI.

BMI = weight_kg / (height_m)²

Step 2. Pick the weight that enters the formula based on BMI category.

BMI < 18.5 (underweight) → use actual body weight
BMI 18.5–24.9 (normal) → use min(IBW, actual weight)
BMI ≥ 25 (overweight / obese) → use adjusted body weight (ABW)

IBW (male)   = 50   + 2.3 × (height_inches − 60)
IBW (female) = 45.5 + 2.3 × (height_inches − 60)
ABW          = IBW + 0.4 × (actual_weight − IBW)

Step 3. Apply the Cockcroft-Gault formula.

CrCl (mL/min) = ((140 − age) × adjusted_weight × gender_coef) / (serum_creatinine × 72)

where gender_coef = 1.0 for male and 0.85 for female.

Calculation Tools

compute_ibw(height_cm, is_female) → IBW in kg
compute_cg_crcl(age, weight_kg, height_cm, creatinine_mg_dl, is_female) → CrCl in mL/min

Example

65-year-old female, height 160 cm, weight 55 kg, serum creatinine 1.0 mg/dL.

Step 1. BMI = 55 / (1.60)² = 21.5 → normal-weight category.

Step 2. Compute IBW, then take min(IBW, actual weight).

TOOL_CALL: compute_ibw(height_cm=160, is_female=True)
TOOL_RESULT: 52.38197

IBW ≈ 52.38 kg < 55 kg, so adjusted weight = 52.38 kg.

Step 3. Apply the Cockcroft-Gault formula.

TOOL_CALL: compute_cg_crcl(age=65, weight_kg=55, height_cm=160, creatinine_mg_dl=1.0, is_female=True)
TOOL_RESULT: 46.37987

The patient's creatinine clearance is approximately 46.38 mL/min.
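The skill's two tools can be sketched in Python directly from Steps 1 to 3. The function names and argument names follow the signatures listed under Calculation Tools; the function bodies are a reconstruction from the formulas above, not the benchmark's shipped tool code. Two assumptions are made: inches are obtained as `height_cm / 2.54`, and a BMI between 24.9 and 25 is treated as normal weight.

```python
def compute_ibw(height_cm: float, is_female: bool) -> float:
    """Ideal body weight in kg, from the IBW formulas in Step 2."""
    height_inches = height_cm / 2.54
    base = 45.5 if is_female else 50.0
    return base + 2.3 * (height_inches - 60)


def compute_cg_crcl(age, weight_kg, height_cm, creatinine_mg_dl, is_female):
    """Creatinine clearance in mL/min via the Cockcroft-Gault equation."""
    # Step 1: BMI from weight in kg and height in metres.
    bmi = weight_kg / (height_cm / 100) ** 2

    # Step 2: choose the weight that enters the formula.
    ibw = compute_ibw(height_cm, is_female)
    if bmi < 18.5:
        weight = weight_kg                      # underweight: actual weight
    elif bmi < 25:
        weight = min(ibw, weight_kg)            # normal: min(IBW, actual)
    else:
        weight = ibw + 0.4 * (weight_kg - ibw)  # overweight/obese: ABW

    # Step 3: Cockcroft-Gault with the sex-specific coefficient.
    gender_coef = 0.85 if is_female else 1.0
    return (140 - age) * weight * gender_coef / (creatinine_mg_dl * 72)


# The worked example: 65-year-old female, 160 cm, 55 kg, creatinine 1.0 mg/dL.
ibw = compute_ibw(height_cm=160, is_female=True)
crcl = compute_cg_crcl(age=65, weight_kg=55, height_cm=160,
                       creatinine_mg_dl=1.0, is_female=True)
print(round(ibw, 2), round(crcl, 2))
```

Rounded to two decimals this agrees with the worked example (about 52.38 kg and 46.38 mL/min); trailing decimals may differ slightly from the tool results shown, depending on the inch-conversion constant used.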

Empirical Study

Decomposing the SRA pipeline

We structure the study around six research questions covering the three stages of SRA: retrieval, incorporation, and application. RQ1 / RQ2 ask whether skill augmentation helps and how it copes with noise; RQ3 / RQ4 isolate retrieval quality and its effect on end-task performance; RQ5 / RQ6 examine whether agents incorporate skills in a relevance-aware and need-aware way.

RQ1 Does the SRA paradigm improve agent performance over skill-free baselines?

Yes, sometimes by a wide margin.

Oracle skills lift end-task accuracy by +14 to +23 points over the no-skill baseline. Among practical methods, LLM Selection most reliably converts retrieved candidates into downstream gains, often closing a large fraction of the gap to the oracle upper bound. Yet every practical method still trails Oracle, leaving substantial headroom for better incorporation and application.

Accuracy on SRA-Bench (case-weighted, %). Each row is one base model.

RQ2 How robust are current SR-Agents to retrieval noise?

Brittle: accuracy decays as distractors accumulate.

With the gold skill guaranteed in context, adding 2 / 4 / 8 hard-negative distractors still degrades accuracy monotonically across every model. Progressive Disclosure, which withholds full content until an explicit reveal, is consistently more robust than Full-Skill Injection. Under heavy noise it matches or even surpasses Full-Skill Injection on roughly half the models.

End-task accuracy as the number of hard-negative distractors grows from 0 to 8, with the gold skill always present.

RQ3 How effective are existing retrieval methods at identifying relevant skills?

Feasible, but not solved.

Recall@1 swings from ~2% (TF-IDF on LogicBench) to ~92% (Rerank on MedCalc). Dense retrievers win where capabilities are expressed semantically; sparse wins where they share lexical or code-like surface patterns. LLM reranking on BM25's top-50 is the strongest overall, yet still under 32% on LogicBench, CHAMP, and BigCode. No first-stage retriever wins every domain.

Recall@1 (%) by retriever × dataset. Best per dataset framed. “Rerank ★” = best-of-six LLM rerankers on BM25 top-50.
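For readers unfamiliar with the metric, Recall@k can be sketched as follows. The per-instance definition used here (the fraction of an instance's gold skills found in the top-k list, averaged over instances) is an assumption; for single-skill datasets it reduces to a hit rate, and the exact treatment of multi-skill instances is specified in the paper rather than on this page.

```python
def recall_at_k(ranked_ids, gold_ids, k=1):
    """Fraction of gold skills that appear among the top-k ranked ids."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


def mean_recall_at_k(instances, k=1):
    """Average Recall@k over (ranked_ids, gold_ids) pairs, as a percentage."""
    total = sum(recall_at_k(ranked, gold, k) for ranked, gold in instances)
    return 100 * total / len(instances)


# Toy rankings: the first instance has its gold skill at rank 1, the second at rank 2.
toy_instances = [
    (["a", "b", "c"], ["a"]),
    (["c", "d", "e"], ["d"]),
]
print(mean_recall_at_k(toy_instances, k=1))
print(mean_recall_at_k(toy_instances, k=2))
```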

RQ4 To what extent does retrieval quality influence end-to-end performance?

Necessary, but far from sufficient.

Averaged across the six models, LLM rerank lifts BM25's Recall@1 by +60 pp on MedCalc, yet the downstream end-task gain is only +21 pp. On CHAMP, retrieval improves by +7 pp but end-task accuracy drops. Downstream success also depends on the agent selecting, loading, and applying the right candidate.

Average gain from replacing BM25 with BM25 + LLM Rerank, across 6 models. Indigo bars: Recall@1 gain. Amber bars: end-task accuracy gain.

RQ5 Can current LLMs tell whether a gold skill is among the retrieved candidates?

Most cannot; only frontier models do.

Open-source models load skills at nearly the same rate whether a gold skill is among the top-50 candidates or not. Only frontier models like GLM-5.1 (+35 pp) and GPT-5.4 (+33 pp) show meaningful relevance-aware loading: they noticeably hold back when no gold candidate is in the pool.

Δ Load Rate (percentage points): load rate when gold skill is in top-50 minus when it is not.
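The Δ Load Rate in the caption above is a difference of two conditional rates. A minimal sketch, assuming each logged instance records two booleans (the field names `loaded` and `gold_in_pool` are hypothetical):

```python
def load_rate(records):
    """Percentage of records in which the agent loaded at least one skill."""
    return 100 * sum(r["loaded"] for r in records) / len(records)


def delta_load_rate(records):
    """Load rate when gold is in the candidate pool minus load rate when it is not."""
    with_gold = [r for r in records if r["gold_in_pool"]]
    without_gold = [r for r in records if not r["gold_in_pool"]]
    return load_rate(with_gold) - load_rate(without_gold)


# Toy log: loads 3 of 4 times with gold present, 1 of 4 times without.
toy_log = (
    [{"gold_in_pool": True, "loaded": True}] * 3
    + [{"gold_in_pool": True, "loaded": False}]
    + [{"gold_in_pool": False, "loaded": True}]
    + [{"gold_in_pool": False, "loaded": False}] * 3
)
print(delta_load_rate(toy_log))
```

A value near zero means loading is insensitive to whether a relevant skill was retrieved; a large positive value indicates relevance-aware loading.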

RQ6 Do current LLMs load skills more often when the task actually needs external capability?

No: loading is largely unconditioned on need.

When instances are split by whether the model can solve them natively, skill-loading rates remain remarkably similar. For a need-aware agent the two bars would diverge sharply, with more loading on instances it cannot solve without skills. Instead, most gaps are within ±6 pp, and Llama-3.1-8B's 15-pp gap runs the wrong way (more loading on tasks it can already solve). Skill loading is not yet functioning as a targeted compensatory mechanism for missing capability.

Skill-loading rate (%) on instances the model solves natively without skills (orange) vs instances it cannot (purple). Connector lines emphasize the (lack of) gap.

Takeaway

SRA is real, but it is not just a retrieval problem. The next frontier is teaching agents to know what they don't know, expose retrieved skills with care, and incorporate them only when they actually help. Scalable SRA needs controlled exposure, need-aware incorporation, and reliable application.

Leaderboard

End-task performance on SRA-Bench

Six models × five skill-use strategies × six datasets, end-task accuracy (%). Best per (model, dataset) in bold; second-best underlined. Oracle is shown for reference.

Submit a new result

We welcome new entries. Email oneal2000@126.com with:

A short name and description of the method.
Per-instance outputs on SRA-Bench.
A reproducibility pointer: a code link, a paper, or a brief setup note.


End-task accuracy by model, skill-use method, and dataset
Model	Method	TheoremQA	LogicBench	ToolQA	CHAMP	MedCalc	BigCode	Average

Practical methods use BM25 as the underlying retriever. Full-Skill Injection prepends the top-1 candidate; LLM Selection lets the model pick one from the BM25 top-50; Progressive Disclosure exposes a compact catalog and lets the agent load on demand. See our paper for full protocol details.
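The difference between Full-Skill Injection and Progressive Disclosure comes down to what the agent sees up front. The sketch below contrasts the two under simplifying assumptions: the prompt wording, the dictionary schema for a skill, and the `load_skill` helper are hypothetical, and the actual protocol is the one described in the paper.

```python
def full_skill_injection(query, ranked_skills):
    """Prepend the full content of the top-1 candidate to the task."""
    top = ranked_skills[0]
    return f"{top['content']}\n\nTask: {query}"


def progressive_disclosure(query, ranked_skills):
    """Show only a compact catalog; full content stays hidden until loaded."""
    catalog = "\n".join(f"- {s['id']}: {s['description']}" for s in ranked_skills)
    return f"Skill catalog (request an id to load it):\n{catalog}\n\nTask: {query}"


def load_skill(skill_id, ranked_skills):
    """Reveal one skill's full content when the agent asks for it."""
    by_id = {s["id"]: s for s in ranked_skills}
    return by_id[skill_id]["content"]


# Toy candidates standing in for a BM25 ranking.
candidates = [
    {"id": "crcl", "description": "Creatinine clearance (Cockcroft-Gault)",
     "content": "Step 1. Calculate BMI. Step 2. Pick the weight. Step 3. Apply the formula."},
    {"id": "bmi", "description": "Body mass index",
     "content": "BMI = weight_kg / (height_m)^2"},
]
task = "Estimate creatinine clearance for this patient."
print(full_skill_injection(task, candidates))
print(progressive_disclosure(task, candidates))
print(load_skill("crcl", candidates))
```

With a catalog, a wrong top-1 candidate costs one line of context rather than a full procedure, which is one plausible reason the catalog-based strategy degrades more gracefully under distractors.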

Research Agenda

Where SRA goes next

SRA should not be understood merely as a new retrieval problem over skills, but as a broader research agenda for scalable capability augmentation in agent systems. We sketch four concrete directions below, complemented by the empirical challenges SRA-Bench surfaces.

Structured skill libraries

Treating a corpus as a flat list breaks down at scale. Skills have rich relationships (prerequisites, alternatives, specializations, compositions) that future systems may represent as graphs, hierarchies, clusters, or explicit dependency structures.

Early work: Graph of Skills (arXiv) · Group of Skills (arXiv)

Quality control & skill evolution

Open skill ecosystems are heterogeneous: many skills are incomplete, ambiguous, or outdated. Offline pipelines for validating, debugging, and refining skills, plus agentic self-improvement loops that revise skills from failure cases, turn the corpus into a continually maintained resource.

Early work: SkillRL (arXiv) · AutoSkill (arXiv)

Utility-aware skill retrieval

Skill retrievers should optimize for whether a skill can actually solve the task, not just topical similarity. Promising directions: training from end-task feedback, multi-skill retrieval for complementary capabilities, and retrieval conditioned on the agent's intermediate reasoning state.

Early work: SkillRouter (arXiv)

Parametric skill augmentation

Why re-inject the same skill as text on every call? Frequently used skills could be compressed offline into plug-in parameter modules that load into the model on demand, while rare, long-tail skills remain retrievable from the external corpus. The result would be a hybrid architecture for scalable SRA.

RAG precedents: Parametric RAG (arXiv) · DyPRAG (arXiv)

Need-aware incorporation

A good agent should load external skills only when its native parametric knowledge falls short. Today's models load almost regardless of necessity. A reliable need-detection mechanism is wide open.

No SRA work yet; RAG analogues: DRAGIN (arXiv) · Search-R1 (arXiv)

Controlled skill exposure

Full-Skill Injection is brittle under noisy candidate sets; Progressive Disclosure helps but is inconsistent. Exposure policies that match the cost and detail of each candidate to its expected relevance are essential for robust SR-Agents.

In the wild: AgentSkills.io (web) · Anthropic Agent Skills (repo)

Cite us

BibTeX

If SRA-Bench or any of our findings inform your work, we would be grateful for a citation, and a ⭐ on the repo.

@article{su2026skill,
  title={Skill Retrieval Augmentation for Agentic AI},
  author={Su, Weihang and Long, Jianming and Ai, Qingyao and He, Qiaozhi and
          Tang, Yichen and Wang, Changyue and Tu, Yiteng and Wang, Yingbo and Liu, Yiqun},
  journal={arXiv preprint arXiv:2604.24594},
  year={2026}
}