feat(worker): per-account safeguards for second-account upload failures

Driven by a real production case: the secondary account was attached to 17 source channels but was ingesting only ~2-3 archives per cycle. Log analysis showed five distinct issues that this commit addresses.

1. Auto-retry cap (WORKER_MAX_SKIP_ATTEMPTS, default 5)

processArchiveSets now filters out SkippedPackage rows whose attemptCount has reached the cap. Removing them from the working list means they are no longer tracked in minFailedId, so the watermark cap from d99a506 no longer pins progress below them. A bad file no longer blocks the rest of the channel forever; the user can manually retry via the UI to reset the count.

2. Account phone in error messages

Every SkippedPackage row and SystemNotification produced from a failure is now prefixed with [<phone>] in errorMessage / message, and the JSON context includes accountPhone. When two accounts share a source channel and only one is failing, the UI shows which one.

3. Explicit getChat for destination at run start

loadChats only loads the main/archive/folder chat lists. If an account archived or moved the destination chat, sendMessage failed silently for each archive. We now call getChat on the destination once per cycle; on failure we record a SystemNotification and skip the account's entire ingestion cycle (no point downloading what we can't upload).

4. Retry on transient Telegram server errors

The "Turnbase Delivery Folder.7z" failure on the secondary and "10. Kingdom of the Depth.part1.rar" on the main were both "Internal Server Error during file upload": a Telegram-side hiccup, not a stall or FLOOD_WAIT. These now retry up to MAX_UPLOAD_RETRIES with linear backoff (15s, 30s, 45s, plus up to 5s of jitter) before giving up.

5. Channel-access-lost notification

"Iridium 2 w/ Add-ons [Completed]" has been throwing "Can't access the chat" every cycle for the secondary. The worker now surfaces a CHANNEL_ACCESS_LOST notification (deduped to once per 24h per channel/account) so the admin sees it and can re-join or unlink the channel, instead of the failure repeating unnoticed every cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
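
The auto-retry cap described in the message can be illustrated with a minimal sketch. It is not the code from this commit: the `SkippedPackage` shape, the `messageId` field, and the helper names `filterRetryable` and `lowestFailedId` are assumptions made for illustration; only `attemptCount`, the cap default of 5, and the idea of a minimum failed ID come from the message.

```typescript
// Hypothetical row shape; the real SkippedPackage model likely has more fields.
interface SkippedPackage {
  messageId: number; // assumed identifier used for the watermark
  attemptCount: number;
}

// Assumed to mirror WORKER_MAX_SKIP_ATTEMPTS (default 5 per the commit message).
const MAX_SKIP_ATTEMPTS = 5;

// Keep only rows that are still eligible for automatic retry.
function filterRetryable(
  rows: SkippedPackage[],
  cap: number = MAX_SKIP_ATTEMPTS,
): SkippedPackage[] {
  return rows.filter((row) => row.attemptCount < cap);
}

// Lowest failed message ID among the given rows, or null when none remain.
// Rows dropped by the cap never reach this function, so they cannot pin
// the watermark below themselves.
function lowestFailedId(rows: SkippedPackage[]): number | null {
  if (rows.length === 0) return null;
  return Math.min(...rows.map((row) => row.messageId));
}
```

With rows at message IDs 10 (5 attempts) and 42 (1 attempt), the capped row is filtered out and the lowest tracked failure becomes 42 rather than 10.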
2026-06-13 12:41:16 +00:00 · 2026-05-22 23:07:57 +02:00
parent 7a79b52baf
commit 379bf246cd
4 changed files with 162 additions and 4 deletions
--- a/worker/src/upload/channel.ts
+++ b/worker/src/upload/channel.ts
@@ -136,6 +136,28 @@ async function sendWithRetry(
        );
      }

+      // Transient Telegram server-side error (HTTP 5xx returned via
+      // updateMessageSendFailed). These are NOT FLOOD_WAIT, NOT stalls — just
+      // TG having a bad moment. They typically resolve on a short backoff, so
+      // retry up to MAX_UPLOAD_RETRIES with linear backoff before giving up.
+      const lowerMsg = errMsg.toLowerCase();
+      const isTransientServerError =
+        lowerMsg.includes("internal server error") ||
+        lowerMsg.includes("internal error") ||
+        lowerMsg.includes("server error") ||
+        lowerMsg.includes("bad gateway") ||
+        lowerMsg.includes("service unavailable") ||
+        lowerMsg.includes("gateway timeout");
+      if (isTransientServerError && !isLastAttempt) {
+        const backoffMs = 15_000 * (attempt + 1) + Math.random() * 5_000;
+        log.warn(
+          { fileName, attempt: attempt + 1, maxRetries: MAX_UPLOAD_RETRIES, backoffMs: Math.round(backoffMs) },
+          `Transient Telegram server error — retrying after backoff`
+        );
+        await sleep(backoffMs);
+        continue;
+      }
+
      throw err;
    }
  }
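
The classification and backoff added in this hunk can be pulled out into standalone helpers to make the schedule easy to check. This is a sketch under assumptions, not part of the commit: the function names `isTransientServerError` and `transientBackoffMs` and the injectable `random` parameter are hypothetical; the substring list and the `15_000 * (attempt + 1)` formula are copied from the diff, with `attempt` taken to be zero-based.

```typescript
// Substrings treated as transient server-side failures (copied from the hunk).
const TRANSIENT_MARKERS: string[] = [
  "internal server error",
  "internal error",
  "server error",
  "bad gateway",
  "service unavailable",
  "gateway timeout",
];

// Case-insensitive substring match against the error message.
function isTransientServerError(errMsg: string): boolean {
  const lower = errMsg.toLowerCase();
  return TRANSIENT_MARKERS.some((marker) => lower.includes(marker));
}

// Linear backoff: 15s, 30s, 45s for attempts 0, 1, 2, plus up to 5s of jitter.
// `random` is injectable (an assumption for testability) and defaults to Math.random.
function transientBackoffMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return 15_000 * (attempt + 1) + random() * 5_000;
}
```

Because matching is by substring, any message containing "server error" qualifies, which makes the "internal server error" entry redundant but harmless.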