dragonsstash/.claude/skills/tdlib-telegram-workspace/iteration-1/benchmark.json

{
  "skill_name": "tdlib-telegram",
  "iteration": 1,
  "configs": [
    {
      "name": "with_skill",
      "pass_rate": {"mean": 1.0, "stddev": 0.0},
      "tokens": {"mean": 53200, "stddev": 14800},
      "time_seconds": {"mean": 123.5, "stddev": 16.7}
    },
    {
      "name": "without_skill",
      "pass_rate": {"mean": 0.857, "stddev": 0.134},
      "tokens": {"mean": 56467, "stddev": 12100},
      "time_seconds": {"mean": 156.4, "stddev": 39.7}
    }
  ],
  "delta": {
    "pass_rate": "+14.3%",
    "tokens": "-5.8%",
    "time": "-21.0%"
  },
  "evals": [
    {
      "name": "broadcast-to-all-users",
      "with_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 35365, "time_seconds": 107.6},
      "without_skill": {"pass_rate": 0.6, "passed": 3, "total": 5, "tokens": 69214, "time_seconds": 200.2}
    },
    {
      "name": "flood-wait-during-scan",
      "with_skill": {"pass_rate": 1.0, "passed": 4, "total": 4, "tokens": 63079, "time_seconds": 140.9},
      "without_skill": {"pass_rate": 1.0, "passed": 4, "total": 4, "tokens": 45601, "time_seconds": 122.3}
    },
    {
      "name": "download-and-reupload-file",
      "with_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 61157, "time_seconds": 122.1},
      "without_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 54587, "time_seconds": 146.7}
    }
  ],
  "analyst_notes": [
    "The skill's biggest impact was on Eval 1 (broadcast): the baseline MISSED both withFloodWait retry wrapping and inter-message delay — the two most critical patterns for avoiding rate limits during bulk sends. This is exactly the kind of bug the skill is designed to prevent.",
    "Eval 2 (FLOOD_WAIT debugging) was a near-tie. Both versions correctly diagnosed the problem and proposed adaptive backoff. The skill version was slightly more thorough: it added pagination-level retry with sleep(waitSec) instead of just re-throwing, meaning it can survive even after withFloodWait's retries are exhausted.",
    "Eval 3 (download/reupload) was also close. Both correctly composed existing primitives. The skill version was more explicit about WHY certain patterns matter (referencing the skill's documentation), which helps future maintainers understand the code.",
    "The skill version was faster on average (-21% time) and used fewer tokens (-5.8%), likely because the skill front-loaded the knowledge instead of requiring the agent to discover it by reading source files."
  ]
}