feat(worker): use TDLib remote.unique_id as zero-false-positive dedup signal

The fileName + size repost detection from ff4e150 works but has a
theoretical false positive: two unrelated files in the same channel
with identical names and identical total sizes are treated as duplicates.

TDLib's document.remote.unique_id is a stable identifier for a file on
Telegram's servers: every repost of the exact same file across messages
keeps the same unique_id. Using it as the first dedup check gives a match
signal with no false positives; packages without a unique_id still fall
through to the older checks.
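
To make the difference between the two signals concrete, here is a minimal sketch. The `ArchiveCandidate` shape, both helper functions, and the id values are hypothetical and exist only for illustration; they are not the worker's real types.

```typescript
// Hypothetical shape for illustration; not the worker's real TelegramMessage type.
interface ArchiveCandidate {
  fileName: string;
  size: number;
  // TDLib can leave unique_id empty, so "" and null both mean "no signal".
  remoteUniqueId: string | null;
}

// The name + size heuristic: same file name and same total size.
function sameNameAndSize(a: ArchiveCandidate, b: ArchiveCandidate): boolean {
  return a.fileName === b.fileName && a.size === b.size;
}

// The stronger signal: both sides carry the same non-empty unique_id.
function sameRemoteUniqueId(a: ArchiveCandidate, b: ArchiveCandidate): boolean {
  return !!a.remoteUniqueId && a.remoteUniqueId === b.remoteUniqueId;
}

// Made-up ids; real values are opaque strings issued by Telegram.
const firstPost: ArchiveCandidate = { fileName: "pack.zip", size: 1048576, remoteUniqueId: "uid-A" };
const repost: ArchiveCandidate = { ...firstPost };
const unrelated: ArchiveCandidate = { fileName: "pack.zip", size: 1048576, remoteUniqueId: "uid-B" };

console.log(sameNameAndSize(firstPost, unrelated));    // true  (the false positive)
console.log(sameRemoteUniqueId(firstPost, unrelated)); // false
console.log(sameRemoteUniqueId(firstPost, repost));    // true
```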

Schema:
  - Package.remoteUniqueId (nullable, since existing rows lack it)
  - Index on (sourceChannelId, remoteUniqueId)

Pipeline:
  1. Capture remoteUniqueId in getChannelMessages + getTopicMessages
  2. Pass through TelegramMessage type
  3. processOneArchiveSet checks findPackageByRemoteUniqueId FIRST
     (before packageExistsBySourceMessage / findRepostedPackage)
  4. createPackageStub stores it on the new Package row
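
The check order in step 3 can be sketched as follows. This is an in-memory approximation under stated assumptions: `StoredPackage`, `Incoming`, `DedupResult`, and `checkDuplicate` are hypothetical names, message ids are plain numbers here (the schema uses BigInt), and the real functions presumably query the database rather than filter an array.

```typescript
// Hypothetical in-memory stand-ins for Package rows and incoming messages.
interface StoredPackage {
  id: number;
  sourceChannelId: string;
  sourceMessageId: number; // BigInt in the real schema; number keeps the sketch simple
  fileName: string;
  size: number;
  remoteUniqueId: string | null;
}
type Incoming = Omit<StoredPackage, "id">;

type DedupResult =
  | { duplicate: true; via: "remoteUniqueId" | "sourceMessage" | "nameAndSize"; packageId: number }
  | { duplicate: false };

function checkDuplicate(rows: StoredPackage[], msg: Incoming): DedupResult {
  // All checks are scoped to the source channel, mirroring the composite index.
  const inChannel = rows.filter((r) => r.sourceChannelId === msg.sourceChannelId);

  // 1. Strongest signal first; skipped when no unique_id was captured.
  if (msg.remoteUniqueId) {
    const hit = inChannel.find((r) => r.remoteUniqueId === msg.remoteUniqueId);
    if (hit) return { duplicate: true, via: "remoteUniqueId", packageId: hit.id };
  }

  // 2. Same source message already ingested.
  const bySource = inChannel.find((r) => r.sourceMessageId === msg.sourceMessageId);
  if (bySource) return { duplicate: true, via: "sourceMessage", packageId: bySource.id };

  // 3. Older name + size repost heuristic.
  const byName = inChannel.find((r) => r.fileName === msg.fileName && r.size === msg.size);
  if (byName) return { duplicate: true, via: "nameAndSize", packageId: byName.id };

  return { duplicate: false };
}
```

The commit message does not say whether the name + size check is skipped when both sides carry differing unique ids; this sketch keeps the fallbacks unconditional.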

Existing 19,952 Packages have remoteUniqueId = NULL — they fall through
to the existing checks (source-msg-id, name+size, content-hash). New
ingestions populate it and benefit from the strong signal immediately.
Old Packages get backfilled organically when their content is
re-encountered and a new Package would otherwise be created.
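
One possible reading of the organic backfill, as a minimal sketch: when a fallback check matches a legacy row whose remoteUniqueId is NULL, the freshly captured id is copied onto that row. `PackageRecord` and `backfillRemoteUniqueId` are hypothetical names, and the actual worker may implement this differently.

```typescript
// Hypothetical subset of a Package row, for illustration only.
interface PackageRecord {
  id: number;
  fileName: string;
  size: number;
  remoteUniqueId: string | null;
}

// Returns the row to persist. The row is returned unchanged when it already
// has an id or when the incoming message carried none.
function backfillRemoteUniqueId(
  existing: PackageRecord,
  incomingUniqueId: string | null,
): PackageRecord {
  if (existing.remoteUniqueId !== null || !incomingUniqueId) {
    return existing;
  }
  return { ...existing, remoteUniqueId: incomingUniqueId };
}

// A legacy row matched by a fallback check picks up the id from the repost.
const legacy: PackageRecord = { id: 7, fileName: "pack.zip", size: 4096, remoteUniqueId: null };
const updated = backfillRemoteUniqueId(legacy, "uid-A");
console.log(updated.remoteUniqueId); // "uid-A"
```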

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:50:24 +02:00
parent 18a0efb3d4
commit 7d39a13310
7 changed files with 93 additions and 2 deletions


@@ -0,0 +1,10 @@
-- AlterTable: capture TDLib's stable per-content identifier for new packages.
-- Existing rows are NULL; they fall through to the other dedup checks until
-- they're re-encountered organically.
ALTER TABLE "packages" ADD COLUMN "remoteUniqueId" TEXT;
-- CreateIndex: scoped to source channel because we want to dedup
-- per-channel (the same file appearing in two different channels is still
-- worth indexing twice — they're different ingestion sources).
CREATE INDEX "packages_sourceChannelId_remoteUniqueId_idx"
ON "packages"("sourceChannelId", "remoteUniqueId");


@@ -472,6 +472,11 @@ model Package {
sourceChannelId String
sourceMessageId BigInt
sourceTopicId BigInt?
/// TDLib's `remote.unique_id` for the FIRST part's file. Stable across
/// reposts of identical content in the same channel — used as the
/// strongest pre-download dedup signal (no false positives unlike
/// fileName + size matching).
remoteUniqueId String?
destChannelId String?
destMessageId BigInt?
destMessageIds BigInt[] @default([])
@@ -503,6 +508,7 @@ model Package {
@@index([archiveType])
@@index([creator])
@@index([packageGroupId])
@@index([sourceChannelId, remoteUniqueId])
@@map("packages")
}