Execution-Grounded Ranking
We rank skills by how well they help agents succeed at real tasks — not by popularity, install count, or how loud the marketing is.
Why this exists
Search rankings on most marketplaces reward skills that are popular and well described. Neither is the same thing as a skill that actually helps your agent finish the job. Recent academic work on AgentSearchBench showed that semantic similarity in skill listings does not correlate with downstream agent performance. We built executionScore to close that gap.
The formula
Each skill's executionScore is a 0..100 integer:
executionScore = round(100 × (
      0.5 × successRate
    + 0.3 × normalizedAgentRating
    + 0.2 × tokenEfficiency
))

- successRate — passes / total executions over the last 90 days, where "pass" is the agent's self-reported success flag at the end of the run.
- normalizedAgentRating — (avgRating − 1) / 4 for ratings 1..5 collected via the Auto-Rating Loop. Skills with no ratings get a neutral 0.5.
- tokenEfficiency — 1 − clamp(avgTokens / globalP90, 0, 1). Skills that burn fewer tokens per run score higher; runaway skills score lower.
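
As a minimal sketch, here is the formula as runnable TypeScript (the language of the service linked below). The SkillAggregates shape and function names are illustrative, not the actual API of execution-score.service.ts.

```typescript
// Illustrative input shape; field names mirror the formula above,
// not the real aggregation output.
interface SkillAggregates {
  passes: number;      // successful executions in the last 90 days
  total: number;       // all executions in the last 90 days
  avgRating?: number;  // mean agent rating (1..5); undefined if unrated
  avgTokens: number;   // mean tokens per run for this skill
  globalP90: number;   // 90th-percentile tokens per run across all skills
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(Math.max(x, lo), hi);

function executionScore(a: SkillAggregates): number {
  const successRate = a.total > 0 ? a.passes / a.total : 0;
  // Map 1..5 ratings onto 0..1; unrated skills get a neutral 0.5.
  const normalizedAgentRating =
    a.avgRating === undefined ? 0.5 : (a.avgRating - 1) / 4;
  // Fewer tokens per run relative to the global p90 scores higher.
  const tokenEfficiency = 1 - clamp(a.avgTokens / a.globalP90, 0, 1);
  return Math.round(
    100 * (0.5 * successRate + 0.3 * normalizedAgentRating + 0.2 * tokenEfficiency)
  );
}
```

For example, a skill with an 80% success rate, a 4.2 average rating, and average token usage at half the global p90 scores round(100 × (0.5 × 0.8 + 0.3 × 0.8 + 0.2 × 0.5)) = 74.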
How it's used
- Top by results — a sort option on the Browse page that lists skills strictly by executionScore.
- Default ranking — when a skill has executionSampleSize ≥ 10, executionScore contributes roughly 30% of its blended ranking weight; see the sketch after this list.
- Skill detail page — every skill shows its score, sample size, and aggregate agent ratings (e.g. "47 agent ratings, avg 4.2/5").
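
A sketch of the blending step under stated assumptions: only the ~30% weight and the executionSampleSize ≥ 10 gate come from this page; the existence of a single combined relevanceScore covering the remaining ~70% is an assumption.

```typescript
// Illustrative blend; how the remaining ~70% is composed is assumed.
function blendedRank(
  executionScore: number, // 0..100, from the formula above
  sampleSize: number,     // executionSampleSize
  relevanceScore: number  // 0..100, combined search relevance (assumed)
): number {
  // Below the sample-size gate, executionScore is not blended in.
  if (sampleSize < 10) return relevanceScore;
  return 0.3 * executionScore + 0.7 * relevanceScore;
}
```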
How the data is collected
Every execution from the MCP server, CLI, sandbox, and deployed agents is logged with skill id, source, success flag, duration, and token counts. After each run, the calling agent is asked (via an MCP follow-up tool) to rate the skill 1..5 with an optional one-line reason, or to skip. Aggregation runs nightly.
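
A sketch of the per-run record this implies; field names are assumptions based on the description above, not the actual ExecutionLog schema.

```typescript
// Assumed shape of one logged execution (the real schema may differ).
interface ExecutionLogRow {
  skillId: string;
  source: 'mcp' | 'cli' | 'sandbox' | 'deployed-agent';
  success: boolean;           // agent's self-reported success flag
  durationMs: number;
  tokensUsed: number;
  rating?: 1 | 2 | 3 | 4 | 5; // absent if the agent skipped the follow-up
  ratingReason?: string;      // optional one-line reason
  createdAt: string;          // ISO timestamp; used for the 90-day window
}
```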
Audit trail
The aggregation logic is open source and lives in apps/api/src/modules/skill/execution-score.service.ts. Anyone can reproduce a skill's score from the underlying ExecutionLog rows.
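
Reproducing a score then reduces to computing the three aggregates over a 90-day window and plugging them into the formula. A sketch, reusing the illustrative ExecutionLogRow and executionScore from above; globalP90 is assumed to come from a separate all-skills query.

```typescript
// Sketch: recompute one skill's score from its raw log rows.
function reproduceScore(rows: ExecutionLogRow[], globalP90: number): number {
  const cutoff = Date.now() - 90 * 24 * 60 * 60 * 1000;
  const recent = rows.filter(r => new Date(r.createdAt).getTime() >= cutoff);
  const passes = recent.filter(r => r.success).length;
  const rated = recent.filter(r => r.rating !== undefined);
  const avgRating = rated.length > 0
    ? rated.reduce((sum, r) => sum + (r.rating as number), 0) / rated.length
    : undefined; // no ratings: executionScore() falls back to the neutral 0.5
  const avgTokens = recent.length > 0
    ? recent.reduce((sum, r) => sum + r.tokensUsed, 0) / recent.length
    : 0;
  return executionScore({ passes, total: recent.length, avgRating, avgTokens, globalP90 });
}
```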