Execution-Grounded Ranking
We rank skills by how well they help agents succeed at real tasks — not by popularity, install count, or how loud the marketing is.
Why this exists
Search rankings on most marketplaces reward skills that are popular and well described. Neither is the same thing as a skill that actually helps your agent finish the job. Recent academic work on AgentSearchBench showed that semantic similarity in skill listings does not correlate with downstream agent performance. We built executionScore to close that gap.
The formula
Each skill's executionScore is a 0..100 integer:
executionScore = round(100 × (
      0.5 × successRate
    + 0.3 × normalizedAgentRating
    + 0.2 × tokenEfficiency
))

- successRate — passes / total executions over the last 90 days, where "pass" is the agent's self-reported success flag at the end of the run.
- normalizedAgentRating — (avgRating − 1) / 4 for ratings 1..5 collected via the Auto-Rating Loop. Skills with no ratings get a neutral 0.5.
- tokenEfficiency — 1 − clamp(avgTokens / globalP90, 0, 1). Skills that burn fewer tokens per run score higher; runaway skills score lower.
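
As a minimal sketch, here is the formula as runnable TypeScript (the language of the service linked below). The SkillAggregates shape and function names are illustrative, not the actual API of execution-score.service.ts.

```typescript
// Illustrative input shape; field names mirror the formula above,
// not the real aggregation output.
interface SkillAggregates {
  passes: number;      // successful executions in the last 90 days
  total: number;       // all executions in the last 90 days
  avgRating?: number;  // mean agent rating (1..5); undefined if unrated
  avgTokens: number;   // mean tokens per run for this skill
  globalP90: number;   // 90th-percentile tokens per run across all skills
}

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(Math.max(x, lo), hi);

function executionScore(a: SkillAggregates): number {
  const successRate = a.total > 0 ? a.passes / a.total : 0;
  // Map 1..5 ratings onto 0..1; unrated skills get a neutral 0.5.
  const normalizedAgentRating =
    a.avgRating === undefined ? 0.5 : (a.avgRating - 1) / 4;
  // Fewer tokens per run relative to the global p90 scores higher.
  const tokenEfficiency = 1 - clamp(a.avgTokens / a.globalP90, 0, 1);
  return Math.round(
    100 * (0.5 * successRate + 0.3 * normalizedAgentRating + 0.2 * tokenEfficiency)
  );
}
```

For example, a skill with an 80% success rate, a 4.2 average rating, and average token usage at half the global p90 scores round(100 × (0.5 × 0.8 + 0.3 × 0.8 + 0.2 × 0.5)) = 74.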
How it's used
- Top by results — a sort option on the Browse page that lists skills strictly by executionScore.
- Default ranking — when a skill has executionSampleSize ≥ 10, executionScore contributes roughly 30% of its blended ranking weight; see the sketch after this list.
- Skill detail page — every skill shows its score, sample size, and aggregate agent ratings (e.g. "47 agent ratings, avg 4.2/5").
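
A sketch of the blending step under stated assumptions: only the ~30% weight and the executionSampleSize ≥ 10 gate come from this page; the existence of a single combined relevanceScore covering the remaining ~70% is an assumption.

```typescript
// Illustrative blend; how the remaining ~70% is composed is assumed.
function blendedRank(
  executionScore: number, // 0..100, from the formula above
  sampleSize: number,     // executionSampleSize
  relevanceScore: number  // 0..100, combined search relevance (assumed)
): number {
  // Below the sample-size gate, executionScore is not blended in.
  if (sampleSize < 10) return relevanceScore;
  return 0.3 * executionScore + 0.7 * relevanceScore;
}
```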
How the data is collected
Every execution from the MCP server, CLI, sandbox, and deployed agents is logged with skill id, source, success flag, duration, and token counts. After each run, the calling agent is asked (via an MCP follow-up tool) to rate the skill 1..5 with an optional one-line reason, or to skip. Aggregation runs nightly.
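
A sketch of the per-run record this implies; field names are assumptions based on the description above, not the actual ExecutionLog schema.

```typescript
// Assumed shape of one logged execution (the real schema may differ).
interface ExecutionLogRow {
  skillId: string;
  source: 'mcp' | 'cli' | 'sandbox' | 'deployed-agent';
  success: boolean;           // agent's self-reported success flag
  durationMs: number;
  tokensUsed: number;
  rating?: 1 | 2 | 3 | 4 | 5; // absent if the agent skipped the follow-up
  ratingReason?: string;      // optional one-line reason
  createdAt: string;          // ISO timestamp; used for the 90-day window
}
```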
Audit trail
The aggregation logic is open source and lives in apps/api/src/modules/skill/execution-score.service.ts. Anyone can reproduce a skill's score from the underlying ExecutionLog rows.
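
Reproducing a score then reduces to computing the three aggregates over a 90-day window and plugging them into the formula. A sketch, reusing the illustrative ExecutionLogRow and executionScore from above; globalP90 is assumed to come from a separate all-skills query.

```typescript
// Sketch: recompute one skill's score from its raw log rows.
function reproduceScore(rows: ExecutionLogRow[], globalP90: number): number {
  const cutoff = Date.now() - 90 * 24 * 60 * 60 * 1000;
  const recent = rows.filter(r => new Date(r.createdAt).getTime() >= cutoff);
  const passes = recent.filter(r => r.success).length;
  const rated = recent.filter(r => r.rating !== undefined);
  const avgRating = rated.length > 0
    ? rated.reduce((sum, r) => sum + (r.rating as number), 0) / rated.length
    : undefined; // no ratings: executionScore() falls back to the neutral 0.5
  const avgTokens = recent.length > 0
    ? recent.reduce((sum, r) => sum + r.tokensUsed, 0) / recent.length
    : 0;
  return executionScore({ passes, total: recent.length, avgRating, avgTokens, globalP90 });
}
```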