mirror of
https://github.com/xCyanGrizzly/DragonsStash.git
synced 2026-05-11 06:11:15 +00:00
47 lines
2.4 KiB
JSON
{
  "skill_name": "tdlib-telegram",
  "iteration": 1,
  "configs": [
    {
      "name": "with_skill",
      "pass_rate": {"mean": 1.0, "stddev": 0.0},
      "tokens": {"mean": 53200, "stddev": 14800},
      "time_seconds": {"mean": 123.5, "stddev": 16.7}
    },
    {
      "name": "without_skill",
      "pass_rate": {"mean": 0.857, "stddev": 0.134},
      "tokens": {"mean": 56467, "stddev": 12100},
      "time_seconds": {"mean": 156.4, "stddev": 39.7}
    }
  ],
  "delta": {
    "pass_rate": "+14.3%",
    "tokens": "-5.8%",
    "time": "-21.0%"
  },
  "evals": [
    {
      "name": "broadcast-to-all-users",
      "with_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 35365, "time_seconds": 107.6},
      "without_skill": {"pass_rate": 0.6, "passed": 3, "total": 5, "tokens": 69214, "time_seconds": 200.2}
    },
    {
      "name": "flood-wait-during-scan",
      "with_skill": {"pass_rate": 1.0, "passed": 4, "total": 4, "tokens": 63079, "time_seconds": 140.9},
      "without_skill": {"pass_rate": 1.0, "passed": 4, "total": 4, "tokens": 45601, "time_seconds": 122.3}
    },
    {
      "name": "download-and-reupload-file",
      "with_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 61157, "time_seconds": 122.1},
      "without_skill": {"pass_rate": 1.0, "passed": 5, "total": 5, "tokens": 54587, "time_seconds": 146.7}
    }
  ],
  "analyst_notes": [
    "The skill's biggest impact was on Eval 1 (broadcast): the baseline MISSED both withFloodWait retry wrapping and inter-message delay, the two most critical patterns for avoiding rate limits during bulk sends. This is exactly the kind of bug the skill is designed to prevent.",
    "Eval 2 (FLOOD_WAIT debugging) was a near-tie. Both versions correctly diagnosed the problem and proposed adaptive backoff. The skill version was slightly more thorough: it added pagination-level retry with sleep(waitSec) instead of just re-throwing, meaning it can survive even after withFloodWait's retries are exhausted.",
    "Eval 3 (download/reupload) was also close. Both correctly composed existing primitives. The skill version was more explicit about WHY certain patterns matter (referencing the skill's documentation), which helps future maintainers understand the code.",
    "The skill version was faster on average (-21% time) and used fewer tokens (-5.8%), likely because the skill front-loaded the knowledge instead of requiring the agent to discover it by reading source files."
  ]
}
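
The summary figures in "configs" and "delta" can be rederived from the "evals" array. The Python sketch below is one way to do that check; it is not part of the repository, and the helper names (`summarize`, `relative_change`) are hypothetical. It assumes the aggregate pass rate is pooled over all assertions (passed / total across evals), which reproduces 0.857 for the baseline, whereas averaging the three per-eval pass rates would give about 0.867. It also assumes the pass-rate delta is a difference in percentage points while the token and time deltas are relative changes. The stddev fields are not recomputed here.

```python
# Per-eval results copied from the "evals" array in the JSON above.
EVALS = [
    {
        "name": "broadcast-to-all-users",
        "with_skill": {"passed": 5, "total": 5, "tokens": 35365, "time_seconds": 107.6},
        "without_skill": {"passed": 3, "total": 5, "tokens": 69214, "time_seconds": 200.2},
    },
    {
        "name": "flood-wait-during-scan",
        "with_skill": {"passed": 4, "total": 4, "tokens": 63079, "time_seconds": 140.9},
        "without_skill": {"passed": 4, "total": 4, "tokens": 45601, "time_seconds": 122.3},
    },
    {
        "name": "download-and-reupload-file",
        "with_skill": {"passed": 5, "total": 5, "tokens": 61157, "time_seconds": 122.1},
        "without_skill": {"passed": 5, "total": 5, "tokens": 54587, "time_seconds": 146.7},
    },
]


def summarize(evals, config):
    """Aggregate one configuration ("with_skill" or "without_skill")."""
    runs = [e[config] for e in evals]
    return {
        # Assumption: pass rate is pooled over all assertions, not averaged per eval.
        "pass_rate": sum(r["passed"] for r in runs) / sum(r["total"] for r in runs),
        "tokens": sum(r["tokens"] for r in runs) / len(runs),
        "time_seconds": sum(r["time_seconds"] for r in runs) / len(runs),
    }


def relative_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


with_skill = summarize(EVALS, "with_skill")
without_skill = summarize(EVALS, "without_skill")

# Assumption: pass-rate delta is in percentage points; the others are relative.
delta = {
    "pass_rate_points": (with_skill["pass_rate"] - without_skill["pass_rate"]) * 100,
    "tokens_pct": relative_change(with_skill["tokens"], without_skill["tokens"]),
    "time_pct": relative_change(with_skill["time_seconds"], without_skill["time_seconds"]),
}

if __name__ == "__main__":
    for key, value in delta.items():
        print(f"{key}: {value:+.1f}")
```

Under these assumptions the recomputed means and deltas agree with the reported values after rounding.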