11 Commits

Author SHA1 Message Date
e8daabd28d fix(tdlib): handle 1.8.64 renames in searchChatMessages + message reply_to
Audit of every TDLib call site against the live 1.8.64 schema in
node_modules/@prebuilt-tdlib/types/tdlib-types.d.ts surfaced three
additional silent breakages beyond the getForumTopics fix in 106700b.

1. searchChatMessages parameter restructure
   The top-level `message_thread_id` and `saved_messages_topic_id`
   request fields were collapsed into a single tagged-union
   `topic_id: MessageTopic$Input`. Three call sites affected:

   - topics.ts getTopicMessages — was passing message_thread_id, now
     sends topic_id with the messageTopicForum variant carrying
     forum_topic_id. Without this the topic scan returns the whole
     channel (or nothing) instead of just the topic.
   - download.ts getChannelMessages — used to pass message_thread_id: 0;
     just omit the topic_id field entirely for a flat scan.
   - rebuild.ts — same treatment.
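
A minimal sketch of the two request shapes described above, assuming the
field and variant names quoted in this message (`topic_id`,
`messageTopicForum`, `forum_topic_id`). The `SearchRequest` type and the
`buildSearchRequest` helper are hypothetical and model only the fields
discussed here; the real request type lives in tdlib-types.d.ts and has
more fields.

```typescript
// Hypothetical, trimmed-down model of the searchChatMessages request.
// Only the fields discussed above are represented.
type MessageTopicInput = { _: 'messageTopicForum'; forum_topic_id: number };

interface SearchRequest {
  _: 'searchChatMessages';
  chat_id: number;
  from_message_id: number;
  limit: number;
  topic_id?: MessageTopicInput;
}

function buildSearchRequest(
  chatId: number,
  fromMessageId: number,
  limit: number,
  forumTopicId?: number,
): SearchRequest {
  const request: SearchRequest = {
    _: 'searchChatMessages',
    chat_id: chatId,
    from_message_id: fromMessageId,
    limit,
  };
  // Flat channel scan: leave topic_id out instead of sending a zero value.
  if (forumTopicId !== undefined) {
    request.topic_id = { _: 'messageTopicForum', forum_topic_id: forumTopicId };
  }
  return request;
}
```

Calling `buildSearchRequest(chatId, 0, 100)` yields a flat scan request with
no `topic_id` key; passing a fourth argument scopes the scan to that topic.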

2. message.reply_to_message_id replaced with reply_to tagged union
   On incoming messages, the flat `reply_to_message_id` field was
   replaced with `reply_to: MessageReplyTo` (messageReplyToMessage or
   messageReplyToStory). Our reply-chain grouping needs the message-ID
   case.

   Added extractReplyToMessageId() that reads both old and new shapes
   so a transition build or future downgrade still works.
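
One way such a dual-shape reader could look, as a sketch rather than the
code in this commit. The `MessageLike` and `ReplyTo` types are hypothetical
reductions of the real TDLib message type, and treating 0 in the old flat
field as "not a reply" is an assumption.

```typescript
// Hypothetical minimal message shape covering both the old flat field
// and the new tagged union; real TDLib messages carry many more fields.
type ReplyTo =
  | { _: 'messageReplyToMessage'; message_id: number }
  | { _: 'messageReplyToStory' };

interface MessageLike {
  reply_to_message_id?: number;
  reply_to?: ReplyTo | null;
}

function extractReplyToMessageId(message: MessageLike): number | undefined {
  // New shape: only the message variant carries a message ID.
  const replyTo = message.reply_to;
  if (replyTo && replyTo._ === 'messageReplyToMessage') {
    return replyTo.message_id;
  }
  // Old shape: a flat ID, assumed to be 0 when the message is not a reply.
  if (message.reply_to_message_id) {
    return message.reply_to_message_id;
  }
  return undefined;
}
```

Story replies fall through to `undefined`, so reply-chain grouping simply
ignores them.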

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:45:06 +02:00
106700b13f fix(topics): handle TDLib 1.8.64 renamed forum-topic fields
After the TDLib upgrade in 18a0efb, getForumTopicList returned 0 topics
for every forum channel. Confirmed in production logs:

  "title":"Model Printing Emporium","topicCount":0
  "title":"GB_Butler_Bot2","topicCount":0
  "title":"Darnascus 2 : Flamigos Miniatures","topicCount":0

Cycle results: messagesScanned=0, zipsFound=0 — main account's entire
ingestion pipeline was a no-op because all source channels are forums.

Root cause: TDLib 1.8.64 renamed three fields without any
breaking-change indicator that we would have noticed:

  Request  offset_message_thread_id           → offset_forum_topic_id
  Response next_offset_message_thread_id      → next_offset_forum_topic_id
  Response topics[].info.message_thread_id    → topics[].info.forum_topic_id

The old field names became no-ops in the new TDLib, so every request
came back with an empty topic list and the "stuck pagination" detection
correctly bailed out.

Fix: send the new field name on the request side, read both old and
new names on the response side (so a future TDLib version change in
either direction stays handled).
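
A sketch of the response-side fallback, assuming the old and new field
names listed above. The `ForumTopicInfoLike` and `ForumTopicsLike` types and
both helper names are hypothetical; they only model the renamed fields.

```typescript
// Hypothetical response fragments: both the pre- and post-1.8.64 names
// are optional so either TDLib version fits the same type.
interface ForumTopicInfoLike {
  forum_topic_id?: number;
  message_thread_id?: number;
}

interface ForumTopicsLike {
  topics: { info: ForumTopicInfoLike }[];
  next_offset_forum_topic_id?: number;
  next_offset_message_thread_id?: number;
}

// Prefer the new name, fall back to the old one.
function readTopicId(info: ForumTopicInfoLike): number | undefined {
  return info.forum_topic_id ?? info.message_thread_id;
}

// Same preference order for the pagination cursor; 0 when neither is set.
function readNextOffset(page: ForumTopicsLike): number {
  return page.next_offset_forum_topic_id ?? page.next_offset_message_thread_id ?? 0;
}
```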

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:18:08 +02:00
7d39a13310 feat(worker): use TDLib remote.unique_id as zero-false-positive dedup signal
The fileName + size repost detection from ff4e150 works but has a
theoretical false-positive: two unrelated files in the same channel
with identical names and identical total sizes get treated as duplicates.

TDLib's document.remote.unique_id is a stable identifier per file
content — every repost of the exact same file across messages keeps
the same unique_id. Using it as the first dedup check eliminates the
false-positive risk entirely.

Schema:
  - Package.remoteUniqueId (nullable, since existing rows lack it)
  - Index on (sourceChannelId, remoteUniqueId)

Pipeline:
  1. Capture remoteUniqueId in getChannelMessages + getTopicMessages
  2. Pass through TelegramMessage type
  3. processOneArchiveSet checks findPackageByRemoteUniqueId FIRST
     (before packageExistsBySourceMessage / findRepostedPackage)
  4. createPackageStub stores it on the new Package row

Existing 19,952 Packages have remoteUniqueId = NULL — they fall through
to the existing checks (source-msg-id, name+size, content-hash). New
ingestions populate it and benefit from the strong signal immediately.
Old Packages get backfilled organically when their content is
re-encountered and a new Package would otherwise be created.
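
The check order can be sketched against an in-memory list, assuming rows
shaped like the Package fields named above. `PackageRow`, `Incoming`, and
`findDuplicate` are hypothetical stand-ins for the real database lookups,
and the content-hash check is left out for brevity.

```typescript
interface PackageRow {
  id: number;
  sourceChannelId: number;
  sourceMessageId: number;
  fileName: string;
  size: number;
  remoteUniqueId: string | null; // NULL for rows created before this change
}

type Incoming = Omit<PackageRow, 'id'>;

function findDuplicate(rows: PackageRow[], incoming: Incoming): PackageRow | undefined {
  const sameChannel = rows.filter((r) => r.sourceChannelId === incoming.sourceChannelId);

  // 1. Strong signal first: identical remote unique ID (skipped when absent).
  if (incoming.remoteUniqueId !== null) {
    const byUniqueId = sameChannel.find((r) => r.remoteUniqueId === incoming.remoteUniqueId);
    if (byUniqueId) return byUniqueId;
  }

  // 2. The same source message was already ingested.
  const byMessage = sameChannel.find((r) => r.sourceMessageId === incoming.sourceMessageId);
  if (byMessage) return byMessage;

  // 3. Weaker heuristic: same file name and size.
  return sameChannel.find((r) => r.fileName === incoming.fileName && r.size === incoming.size);
}
```

Rows with a NULL `remoteUniqueId` never match in step 1, so they are still
reachable through steps 2 and 3.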

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:50:24 +02:00
77c26adb31 perf: set watermarks even when no archives found to prevent re-scanning
Previously, channels/topics with no new archives never had their
watermark updated. As a result, every cycle re-scanned all messages
from scratch just to discover nothing new, which was especially costly
for the 1079-topic Model Printing Emporium forum.

- Add maxScannedMessageId to ChannelScanResult (highest msg ID seen)
- Set channel watermark to scan boundary when no archives are found
- Set topic watermark to scan boundary when no archives are found
- Fall back to scan watermark when archive processing doesn't advance it

After one full cycle, subsequent cycles will skip already-scanned
messages via the early-exit boundary check, dramatically reducing
TDLib API calls on channels with mostly non-archive content.
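
A sketch of the fallback rule, under the assumption that a watermark is a
message ID that only moves forward. Apart from `maxScannedMessageId`, the
names here (`nextWatermark` and its parameters) are hypothetical.

```typescript
interface ChannelScanResult {
  // Highest message ID seen during the scan, or null if nothing was scanned.
  maxScannedMessageId: number | null;
}

// Picks the furthest-advanced of: the stored watermark, the watermark
// reached by archive processing, and the scan boundary.
function nextWatermark(
  current: number | null,
  archiveWatermark: number | null,
  scan: ChannelScanResult,
): number | null {
  const candidates = [current, archiveWatermark, scan.maxScannedMessageId].filter(
    (value): value is number => value !== null,
  );
  return candidates.length > 0 ? Math.max(...candidates) : null;
}
```

With no archives found, `archiveWatermark` stays null and the scan boundary
becomes the new watermark, so the next cycle can stop at it.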

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 20:37:42 +02:00
4baf5aad83 feat: capture media_album_id from TDLib messages during scanning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:23:47 +01:00
xCyanGrizzly
d7bbb7587e Update tg issues 2026-03-16 16:51:30 +01:00
copilot-swe-agent[bot]
9adbdb2a77 Fix worker getting stuck during sync: add timeouts, stuck detection, and safety limits
- Add invokeWithTimeout wrapper for TDLib API calls (2min timeout per call)
- Add stuck detection to getChannelMessages: break if from_message_id doesn't advance
- Add stuck detection to getTopicMessages: same protection for topic scanning
- Add stuck detection to getForumTopicList: break if pagination offsets don't advance
- Add max page limit (5000) to all scanning loops to prevent infinite pagination
- Add mutex wait timeout (30min) to prevent indefinite blocking when holder hangs
- Add cycle timeout (4h default, configurable via WORKER_CYCLE_TIMEOUT_MINUTES)
- Fix end-of-page detection to use actual limit value instead of hardcoded 100

Co-authored-by: xCyanGrizzly <53275238+xCyanGrizzly@users.noreply.github.com>
2026-03-05 13:14:53 +00:00
copilot-swe-agent[bot]
22da4dfad2 Fix messagesScanned consistency: use totalScanned from scan results, remove double-counting
Co-authored-by: xCyanGrizzly <53275238+xCyanGrizzly@users.noreply.github.com>
2026-03-05 08:36:40 +00:00
copilot-swe-agent[bot]
22bcacf3bd Add live message scanning progress, channel/topic counters to worker activity
- Add progress callbacks to getChannelMessages and getTopicMessages that
  fire after each page of messages is fetched
- Worker now shows channel progress (e.g. "[2/5] Channel Name") when
  processing multiple source channels
- Worker now shows topic progress (e.g. "topic 3/12") when scanning forums
- Worker now shows live message scanning count during channel/topic scans
  (e.g. "Scanning Channel — 300 messages scanned")
- UI stats line now always shows messagesScanned count
- messagesScanned counter now increments during the scanning phase, not
  just during archive processing

Co-authored-by: xCyanGrizzly <53275238+xCyanGrizzly@users.noreply.github.com>
2026-03-05 08:33:26 +00:00
xCyanGrizzly
4d0df6b1a4 add TG integration 2026-03-02 11:57:17 +01:00
xCyanGrizzly
b427193d17 feat: add Telegram integration with forum topic support and creator tracking
Adds full Telegram ZIP ingestion pipeline: TDLib worker service scans source
channels for archive files, deduplicates by content hash, extracts metadata,
uploads to archive channel, and indexes in Postgres. Forum supergroups are
scanned per-topic with topic names used as creator. Filename-based creator
extraction (e.g. "Mammoth Factory - 2026-01.zip") serves as fallback.

Includes admin UI for managing accounts/channels, simplified account setup
(API credentials via env vars), auth code/password submission dialog,
package browser with creator column, and live ingestion activity tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 16:02:06 +01:00