docs: add package grouping design spec and implementation plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:40:13 +01:00
parent d6386209be
commit 6eb7129637
4 changed files with 3410 additions and 0 deletions

File diff suppressed because it is too large


@@ -0,0 +1,241 @@
# Design: Search Match Indicators, Size Limit Increase, Skipped/Failed Files Overview
**Date:** 2026-03-24
**Status:** Approved
## Overview
Three related improvements to the STL packages system:
1. **Size limit increase** — Raise the ingestion limit from 4 GB to 200 GB so large multipart archives aren't skipped
2. **Search match indicators** — Show which internal files matched a search query, with highlighted files in the drawer
3. **Skipped/failed files overview** — Track and display archives that were skipped or failed, with retry capability
---
## Feature 1: Size Limit Increase
### Change
`worker/src/util/config.ts` line 6 — change default from `"4096"` to `"204800"`.
One-line change. The split/upload pipeline already handles arbitrary sizes. The 2 GB per-part Telegram API limit is a separate hard-coded constant and stays as-is.
### Impact
- Archives up to 200 GB will now be attempted
- Multipart archives where individual parts are under 2 GB (but total exceeds 4 GB) will no longer be skipped — these upload directly without any splitting
- Single files over 2 GB are automatically split into 2 GB parts (existing behavior)
- Temp disk usage during processing can now reach up to ~200 GB per archive
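The arithmetic behind the single-file case can be sketched as follows. This is a minimal illustration, not the worker's code: `planIngest`, `MAX_PART_BYTES`, and the exact byte values are assumptions, and the limit is assumed to be expressed in MB (as the `"4096"` and `"204800"` defaults suggest).

```typescript
// Hypothetical sketch: names and exact byte values are assumptions,
// not the worker's real constants.
const MIB = 1024 * 1024;
const MAX_PART_BYTES = 2 * 1024 * MIB; // assumed 2 GB per-part Telegram limit

interface IngestPlan {
  skipped: boolean; // true when the file exceeds the configured size limit
  partCount: number; // number of upload parts (0 when skipped)
}

// limitMb mirrors the config default, e.g. 204800 for 200 GB.
function planIngest(totalBytes: number, limitMb: number): IngestPlan {
  if (totalBytes > limitMb * MIB) {
    return { skipped: true, partCount: 0 };
  }
  // Files at or under 2 GB upload as one part; larger files are split.
  const partCount = Math.max(1, Math.ceil(totalBytes / MAX_PART_BYTES));
  return { skipped: false, partCount };
}

const fiveGbFile = 5 * 1024 * MIB;
console.log(planIngest(fiveGbFile, 4096)); // skipped under the old default
console.log(planIngest(fiveGbFile, 204800)); // attempted under the new default
```

Under these assumptions a 5 GB single file is rejected by the old limit and split into three parts under the new one.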
---
## Feature 2: Search Match Indicators
### Backend Changes
**File:** `src/lib/telegram/queries.ts` (`searchPackages()`)
When `searchIn` is `"files"` or `"both"`, change the PackageFile query from `distinct` to a **grouped count**:
```typescript
// Current: findMany with select: { packageId }, distinct: ["packageId"]
// New: groupBy packageId with _count
const fileMatches = await prisma.packageFile.groupBy({
by: ["packageId"],
where: {
OR: [
{ fileName: { contains: q, mode: "insensitive" } },
{ path: { contains: q, mode: "insensitive" } },
],
},
_count: { _all: true },
});
```
This returns `{ packageId: string, _count: { _all: number } }[]`.
**File:** `src/lib/telegram/types.ts` (`PackageListItem`)
Add two fields:
- `matchedFileCount: number` — how many files inside matched (0 if matched by package name only)
- `matchedByContent: boolean` — true if any files inside matched
Note: `PackageRow` in `package-columns.tsx` mirrors `PackageListItem` and must also receive these two fields.
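One way to fold the grouped counts into the package list is sketched below. `attachMatchCounts` and `PackageBase` are hypothetical names used only for illustration. The real `PackageListItem` carries more fields than shown here.

```typescript
// Shape returned by the groupBy query described above.
interface FileMatchRow {
  packageId: string;
  _count: { _all: number };
}

// Minimal stand-in for a package row (assumption: the real type has more fields).
interface PackageBase {
  id: string;
  fileName: string;
}

interface PackageWithMatches extends PackageBase {
  matchedFileCount: number;
  matchedByContent: boolean;
}

// Hypothetical helper: annotate each package with its file-match count.
function attachMatchCounts(
  packages: PackageBase[],
  fileMatches: FileMatchRow[],
): PackageWithMatches[] {
  const countById = new Map<string, number>();
  for (const row of fileMatches) {
    countById.set(row.packageId, row._count._all);
  }
  return packages.map((pkg) => {
    // Packages matched by name only have no entry and fall back to 0.
    const matchedFileCount = countById.get(pkg.id) ?? 0;
    return { ...pkg, matchedFileCount, matchedByContent: matchedFileCount > 0 };
  });
}
```

Building the lookup map once keeps the merge linear in the number of packages plus the number of match rows.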
### Frontend Changes
**File:** `src/app/(app)/stls/page.tsx`
Pass the search term to `StlTable` as a new prop.
**File:** `src/app/(app)/stls/_components/stl-table.tsx`
Pass search term to columns via TanStack Table column meta.
**File:** `src/app/(app)/stls/_components/package-columns.tsx`
When search is active and `matchedByContent` is true, render a clickable badge below the filename: e.g., "3 file matches". Clicking opens the `PackageFilesDrawer` with a `highlightTerm` prop set to the search term.
**File:** `src/app/(app)/stls/_components/package-files-drawer.tsx`
- Accept optional `highlightTerm: string` prop
- Render full file tree as normal (all files visible)
- Files whose `fileName` or `path` case-insensitively contains `highlightTerm` get a subtle highlight (amber/yellow background on the row)
- Auto-expand folders that contain highlighted files
- The drawer's own search input remains independent
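The highlight and auto-expand rules above could be computed roughly as follows. This is a sketch under assumptions: `DrawerFile`, `matchesTerm`, and `foldersToExpand` are illustrative names, and `path` is assumed to be a `/`-separated path that ends with the file name.

```typescript
interface DrawerFile {
  fileName: string;
  path: string; // assumed "/"-separated, including the file name
}

// Case-insensitive containment check against fileName or path.
function matchesTerm(file: DrawerFile, term: string): boolean {
  const needle = term.trim().toLowerCase();
  if (needle === "") return false; // no term means nothing is highlighted
  return (
    file.fileName.toLowerCase().includes(needle) ||
    file.path.toLowerCase().includes(needle)
  );
}

// Returns every ancestor folder path of each highlighted file.
function foldersToExpand(files: DrawerFile[], term: string): Set<string> {
  const expanded = new Set<string>();
  for (const file of files) {
    if (!matchesTerm(file, term)) continue;
    const segments = file.path.split("/").slice(0, -1); // drop the file name
    for (let i = 1; i <= segments.length; i++) {
      expanded.add(segments.slice(0, i).join("/"));
    }
  }
  return expanded;
}
```

Because the result only depends on `highlightTerm` and the file list, the drawer's own search input can stay independent of it.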
### Data Flow
1. User types search term in STL table search input
2. URL updates with `?search=value`, page reloads
3. `page.tsx` calls `searchPackages()` with `searchIn: "both"`
4. Query returns packages with `matchedFileCount` and `matchedByContent`
5. Table renders "N file matches" badge on content-matched rows
6. User clicks badge -> drawer opens with full tree, matching files highlighted
7. Folders containing matches auto-expanded
---
## Feature 3: Skipped/Failed Files Overview
### Database Schema
New model in `prisma/schema.prisma`:
```prisma
enum SkipReason {
SIZE_LIMIT
DOWNLOAD_FAILED
EXTRACT_FAILED
UPLOAD_FAILED
}
model SkippedPackage {
id String @id @default(cuid())
fileName String
fileSize BigInt
reason SkipReason
errorMessage String?
sourceChannelId String
sourceChannel TelegramChannel @relation(fields: [sourceChannelId], references: [id], onDelete: Cascade)
sourceMessageId BigInt
sourceTopicId BigInt?
isMultipart Boolean @default(false)
partCount Int @default(1)
accountId String
account TelegramAccount @relation(fields: [accountId], references: [id], onDelete: Cascade)
createdAt DateTime @default(now())
@@unique([sourceChannelId, sourceMessageId])
@@index([reason])
@@index([accountId])
@@map("skipped_packages")
}
```
Reverse relations must be added to `TelegramChannel` and `TelegramAccount` models:
```prisma
// In TelegramChannel:
skippedPackages SkippedPackage[]
// In TelegramAccount:
skippedPackages SkippedPackage[]
```
### Worker Changes
**File:** `worker/src/worker.ts`
Extend `PipelineContext` interface to include `accountId` (derived from the ingestion run's account).
At each skip/failure point, upsert a `SkippedPackage` record:
- **Size limit skip** (line 784): reason `SIZE_LIMIT`, no error message
- **Download failure** (catch in download loop): reason `DOWNLOAD_FAILED` + error text
- **Extract/metadata failure** (catch in extract): reason `EXTRACT_FAILED` + error text
- **Upload failure** (catch in upload): reason `UPLOAD_FAILED` + error text
On **successful ingestion** of a package, delete any existing `SkippedPackage` with the same `(sourceChannelId, sourceMessageId)` — so successful retries clean up after themselves.
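The mapping from a skip or failure point to the stored fields might look like the sketch below. `toSkipRecord` is a hypothetical helper, and the way error text is extracted is an assumption. Only the reason values and the "no error message for `SIZE_LIMIT`" rule come from this spec.

```typescript
type SkipReason =
  | "SIZE_LIMIT"
  | "DOWNLOAD_FAILED"
  | "EXTRACT_FAILED"
  | "UPLOAD_FAILED";

interface SkipRecordInput {
  reason: SkipReason;
  errorMessage: string | null;
}

// Hypothetical helper: build the reason/errorMessage pair for an upsert.
function toSkipRecord(reason: SkipReason, error?: unknown): SkipRecordInput {
  // Size-limit skips are not errors, so they carry no message.
  if (reason === "SIZE_LIMIT" || error === undefined) {
    return { reason, errorMessage: null };
  }
  const text = error instanceof Error ? error.message : String(error);
  return { reason, errorMessage: text };
}
```

The remaining columns (`fileName`, `sourceChannelId`, `sourceMessageId`, and so on) would come from the pipeline context at the call site.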
**File:** `worker/src/db/queries.ts`
Add functions:
- `upsertSkippedPackage(data)` — create or update skip record
- `deleteSkippedPackage(sourceChannelId, sourceMessageId)` — remove on success
### Retry Mechanism
Retrying a skipped package:
1. Delete the `SkippedPackage` record
2. Find the `AccountChannelMap` record using both `accountId` and `sourceChannelId`, then reset its `lastProcessedMessageId` to `sourceMessageId - 1` (only if less than current watermark)
3. If `sourceTopicId` is non-null, also reset the corresponding `TopicProgress.lastProcessedMessageId` for that topic
4. The next ingestion cycle picks up the message and re-attempts processing
For "Retry All" (e.g., all `SIZE_LIMIT` skips after raising the limit):
- Delete all matching `SkippedPackage` records
- For each affected (account, channel) pair, reset `AccountChannelMap` watermark to the minimum `sourceMessageId - 1` among deleted records
- For each affected (account, channel, topic) triple, reset `TopicProgress` watermark similarly
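The "minimum `sourceMessageId - 1` per pair" rule for Retry All can be sketched as below. `computeWatermarkResets` and `applyReset` are illustrative names, and the string key format is an assumption of this sketch. The same approach would extend to topic triples by adding the topic ID to the key.

```typescript
interface SkippedRecord {
  accountId: string;
  sourceChannelId: string;
  sourceMessageId: bigint;
}

// Hypothetical helper: for each (account, channel) pair, the lowest
// "sourceMessageId - 1" among the records being retried.
function computeWatermarkResets(records: SkippedRecord[]): Map<string, bigint> {
  const resets = new Map<string, bigint>();
  for (const rec of records) {
    const key = `${rec.accountId}:${rec.sourceChannelId}`;
    const target = rec.sourceMessageId - BigInt(1);
    const current = resets.get(key);
    if (current === undefined || target < current) {
      resets.set(key, target);
    }
  }
  return resets;
}

// A watermark only ever moves backwards on retry, never forwards.
function applyReset(currentWatermark: bigint, target: bigint): bigint {
  return target < currentWatermark ? target : currentWatermark;
}
```

`applyReset` reflects the "only if less than current watermark" condition from the single-retry steps.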
**Note on behavioral distinction:** `DOWNLOAD_FAILED`, `EXTRACT_FAILED`, and `UPLOAD_FAILED` archives already naturally retry because the worker does not advance the watermark past failed sets. The `SkippedPackage` record provides visibility into these failures. The explicit retry/watermark reset is only strictly needed for `SIZE_LIMIT` skips (where the watermark does advance past the skipped message). The UI should present both types but the retry button is most impactful for `SIZE_LIMIT` skips.
**Performance note:** "Retry All" can cause the worker to re-scan large message ranges. The existing dedup logic (`packageExistsBySourceMessage`) ensures already-ingested packages are skipped quickly, but there is a scanning cost proportional to the number of messages between the reset watermark and the current position.
### Frontend Changes
**File:** `src/app/(app)/stls/_components/stl-table.tsx`
Add a "Skipped / Failed" tab alongside the main packages table.
**New file:** `src/app/(app)/stls/_components/skipped-packages-tab.tsx`
Table columns:
- **fileName** — archive name
- **fileSize** — formatted size
- **reason** — color-coded badge: `SIZE_LIMIT` (yellow), `DOWNLOAD_FAILED` (red), `EXTRACT_FAILED` (red), `UPLOAD_FAILED` (red)
- **errorMessage** — truncated with expandable tooltip/popover for full text
- **channel** — source channel title
- **createdAt** — when the skip/failure was recorded
Actions:
- **Retry** button per row — server action that deletes record + resets watermark
- **Retry All** button in the header — bulk retry, filterable by reason
**File:** `src/app/(app)/stls/page.tsx`
Fetch skipped packages count (for tab badge) alongside existing queries.
**File:** `src/data/` or `src/lib/telegram/queries.ts`
Add query functions:
- `listSkippedPackages(options)` — paginated list with reason filter
- `countSkippedPackages()` — for tab badge
- `retrySkippedPackage(id)` — delete record + reset watermark
- `retryAllSkippedPackages(reason?)` — bulk retry
**File:** `src/app/(app)/stls/actions.ts`
Add server actions:
- `retrySkippedPackageAction(id)`
- `retryAllSkippedPackagesAction(reason?)`
---
## Files to Create/Modify
### Create
- `src/app/(app)/stls/_components/skipped-packages-tab.tsx` — skipped packages table UI
- Prisma migration for `SkippedPackage` model
### Modify
- `worker/src/util/config.ts` — raise default max size
- `worker/src/worker.ts` — record skips/failures, clean up on success
- `worker/src/db/queries.ts` — add skip record CRUD functions
- `prisma/schema.prisma` — add `SkippedPackage` model and `SkipReason` enum
- `src/lib/telegram/queries.ts` — modify `searchPackages()` for match counts, add skipped package queries
- `src/lib/telegram/types.ts` — add `matchedFileCount`/`matchedByContent` to `PackageListItem`, add skipped package types
- `src/app/(app)/stls/page.tsx` — pass search term, fetch skipped count, add tab
- `src/app/(app)/stls/_components/stl-table.tsx` — accept search prop, render tabs
- `src/app/(app)/stls/_components/package-columns.tsx` — render match badge
- `src/app/(app)/stls/_components/package-files-drawer.tsx` — accept highlightTerm, highlight matching files, auto-expand matched folders
- `src/app/(app)/stls/actions.ts` — add retry server actions

@@ -0,0 +1,246 @@
# Package Grouping Design
## Overview
Add the ability to group related packages that were posted together in a Telegram channel (e.g., "DUNGEON BLOCKS - Colossal Dungeon" with 6 separate archive files). Groups appear as collapsible rows in the STL files table, with support for both automatic detection via Telegram album IDs and manual grouping through the UI.
## Goals
- Automatically detect and group files posted together in Telegram (same `media_album_id`)
- Display groups as collapsed rows in the STL table with aggregated metadata
- Allow manual grouping/ungrouping of packages via the UI
- Support editable group names and preview images
- Enable "Send All" to deliver every package in a group via the bot
## Non-Goals
- Merging grouped packages into a single Package record (each stays independent)
- Time-proximity heuristics for grouping (too error-prone)
- Grouping across different source channels
---
## Data Model
### New `PackageGroup` Table
```prisma
model PackageGroup {
id String @id @default(cuid())
name String
mediaAlbumId String?
sourceChannelId String
previewData Bytes?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
packages Package[]
sourceChannel TelegramChannel @relation(fields: [sourceChannelId], references: [id], onDelete: Cascade)
@@unique([mediaAlbumId, sourceChannelId])
@@index([sourceChannelId])
@@map("package_groups")
}
```
### Package Model Changes
Add optional group membership:
```prisma
model Package {
// ... existing fields ...
packageGroupId String?
packageGroup PackageGroup? @relation(fields: [packageGroupId], references: [id], onDelete: SetNull)
@@index([packageGroupId])
}
```
### TelegramChannel Model Changes
Add back-relation for the new `PackageGroup` model:
```prisma
model TelegramChannel {
// ... existing fields and relations ...
packageGroups PackageGroup[]
}
```
### Key Decisions
- `mediaAlbumId` is `String?` (TDLib int64 stringified) — only used for dedup lookups, avoids BigInt complexity
- `@@unique([mediaAlbumId, sourceChannelId])` prevents duplicate album-derived groups when re-scanning. PostgreSQL treats NULLs as distinct in unique constraints, so manually-created groups (with `mediaAlbumId = null`) are not constrained by this — which is correct behavior
- Idempotency for album groups uses `findFirst({ where: { mediaAlbumId, sourceChannelId } })` + conditional `create`, not `upsert`, because Prisma does not support `upsert` on compound unique keys with nullable fields
- `onDelete: SetNull` on `Package.packageGroup` means dissolving a group automatically unlinks all members
- `onDelete: Cascade` on `PackageGroup.sourceChannel` means deleting a channel cleans up its groups
- `sourceTopicId` is omitted from `PackageGroup` — it can be inferred from member packages, and manual groups may span topics
- `@@map("package_groups")` follows the project's snake_case table naming convention
- `previewData` stores JPEG thumbnail bytes directly on the group (same pattern as Package)
---
## Worker Changes
### TelegramMessage Interface
Add optional `mediaAlbumId` field:
```typescript
export interface TelegramMessage {
id: bigint;
fileName: string;
fileId: string;
fileSize: bigint;
date: Date;
mediaAlbumId?: string; // Absent or "0" when not part of an album
}
```
The field is optional to minimize call-site changes. The grouping step treats `undefined` and `"0"` equivalently as "not part of an album."
### TelegramPhoto Interface
Add optional `mediaAlbumId` field:
```typescript
export interface TelegramPhoto {
id: bigint;
date: Date;
caption: string;
fileId: string;
fileSize: number;
mediaAlbumId?: string; // For album-to-preview correlation
}
```
### Channel Scanning
In `getChannelMessages()`, read `media_album_id` from the TDLib message object (already present in TDLib responses, just not captured today). Add `media_album_id?: string` to the `TdMessage` interface and pass through to both `TelegramMessage` and `TelegramPhoto`.
The document pass and photo pass already run as separate loops over `searchChatMessages`. Both loops capture `media_album_id` independently. Correlation happens at grouping time: album photos are matched to album documents by comparing their `mediaAlbumId` values, not at scan time.
### Group Creation (Post-Processing)
After each scan cycle's packages are individually processed (downloaded, hashed, uploaded, indexed), a post-processing step handles grouping:
1. Collect all packages from the current scan batch that share the same non-zero `mediaAlbumId`
2. For each distinct `mediaAlbumId`, check if a `PackageGroup` already exists via `findFirst({ where: { mediaAlbumId, sourceChannelId } })`
3. If no group exists, create one:
- **Name:** caption of the first message in the album (falls back to first file's base name)
- **Preview:** find a `TelegramPhoto` from the scan's `photos[]` array with the same `mediaAlbumId`. If found, download via `downloadPhotoThumbnail`. If not, the group starts with no preview (can be added in UI later)
4. Link all member packages via an idempotent `updateMany` — sets `packageGroupId` on all packages whose `sourceMessageId` is in the album's message set. This handles both newly-indexed packages and previously-indexed ones that were created in an earlier partial scan (e.g., if one package failed and was retried later)
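Step 1 and the name fallback in step 3 can be sketched as follows. `ScannedDocument`, `groupByAlbum`, and `fallbackGroupName` are hypothetical names, and "base name" is assumed here to mean the file name without its final extension.

```typescript
interface ScannedDocument {
  id: bigint;
  fileName: string;
  mediaAlbumId?: string; // absent or "0" when not part of an album
}

// Collect scanned documents by album, ignoring non-album messages.
function groupByAlbum(messages: ScannedDocument[]): Map<string, ScannedDocument[]> {
  const albums = new Map<string, ScannedDocument[]>();
  for (const msg of messages) {
    const albumId = msg.mediaAlbumId;
    if (albumId === undefined || albumId === "0") continue;
    const members = albums.get(albumId);
    if (members) {
      members.push(msg);
    } else {
      albums.set(albumId, [msg]);
    }
  }
  return albums;
}

// Fallback name when the album has no caption. Expects a non-empty list.
// Assumption: "base name" means the name without its final extension.
function fallbackGroupName(members: ScannedDocument[]): string {
  const first = members[0].fileName;
  const dot = first.lastIndexOf(".");
  return dot > 0 ? first.slice(0, dot) : first;
}
```

Each map entry then corresponds to one `findFirst` lookup followed by a conditional `create` and an `updateMany`.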
The per-package pipeline is unchanged — each file is still downloaded, hashed, deduped, split, uploaded, and indexed independently. Grouping is a layer on top.
---
## Query Layer
### Paginated Listing with Groups
The STL table shows "display items" — either a group (collapsed) or a standalone package. Pagination operates on display items so that a group occupies exactly one slot regardless of member count.
**Two-step query approach** (handles filters correctly):
**Step 1 — Find matching display item IDs:**
```sql
-- One row per display item: the group ID for grouped packages, otherwise the package's own ID.
-- A group qualifies when at least one member matches the filters.
SELECT COALESCE(p."packageGroupId", p.id) AS display_id,
CASE WHEN p."packageGroupId" IS NOT NULL THEN 'group' ELSE 'package' END AS display_type,
MAX(p."indexedAt") AS sort_date
FROM packages p
LEFT JOIN package_groups pg ON pg.id = p."packageGroupId" -- joined so search can match pg.name
WHERE 1=1
-- Optional filters applied here (creator, tags, search text, channelId)
GROUP BY COALESCE(p."packageGroupId", p.id),
CASE WHEN p."packageGroupId" IS NOT NULL THEN 'group' ELSE 'package' END
ORDER BY sort_date DESC, display_id -- tiebreaker keeps page boundaries stable
LIMIT $1 OFFSET $2
```
**Step 2 — Fetch full data:**
For groups on the current page, fetch all member packages (including those that didn't match filters — the group appears because at least one member matched, but the expanded view shows all members). For standalone packages, fetch the full package data.
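An in-memory equivalent of step 1 may make the display-item idea easier to follow. This is purely illustrative (the spec runs this in SQL). `toDisplayItems` and `paginate` are hypothetical names, and breaking ties by ID is an assumption of this sketch.

```typescript
interface PackageRowLite {
  id: string;
  packageGroupId: string | null;
  indexedAt: Date;
}

interface DisplayItem {
  displayId: string;
  displayType: "group" | "package";
  sortDate: Date;
}

// Collapse filter-matching packages into one entry per group or standalone package.
function toDisplayItems(matching: PackageRowLite[]): DisplayItem[] {
  const items = new Map<string, DisplayItem>();
  for (const pkg of matching) {
    const displayId = pkg.packageGroupId ?? pkg.id; // COALESCE(packageGroupId, id)
    const existing = items.get(displayId);
    if (!existing) {
      items.set(displayId, {
        displayId,
        displayType: pkg.packageGroupId !== null ? "group" : "package",
        sortDate: pkg.indexedAt,
      });
    } else if (pkg.indexedAt.getTime() > existing.sortDate.getTime()) {
      existing.sortDate = pkg.indexedAt; // MAX(indexedAt)
    }
  }
  // Newest first; ties broken by ID for a deterministic order (sketch assumption).
  return Array.from(items.values()).sort(
    (a, b) =>
      b.sortDate.getTime() - a.sortDate.getTime() ||
      a.displayId.localeCompare(b.displayId),
  );
}

// LIMIT / OFFSET over display items, so a group takes exactly one slot.
function paginate<T>(items: T[], limit: number, offset: number): T[] {
  return items.slice(offset, offset + limit);
}
```

The length of the `toDisplayItems` result corresponds to what the count query returns for the pagination total.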
**Count query** (for pagination total):
```sql
SELECT COUNT(*) FROM (
SELECT DISTINCT COALESCE(p."packageGroupId", p.id)
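FROM packages p
LEFT JOIN package_groups pg ON pg.id = p."packageGroupId" -- needed when the search filter matches pg.name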
WHERE 1=1
-- Same filters as step 1
) AS display_items
```
### Group Row Aggregates
Computed in the step 2 fetch: total file size (sum), total file count (sum), combined tags (array union), member package count per group. These populate the collapsed group row.
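The aggregates could be derived from the prefetched members as in this sketch. `MemberPackage` and its field names are assumptions, not the actual `Package` columns. The "Mixed" label follows the archive type badge described for collapsed group rows.

```typescript
// Assumed subset of package fields needed for the collapsed row.
interface MemberPackage {
  fileSize: bigint;
  fileCount: number;
  tags: string[];
  archiveType: string;
}

interface GroupAggregates {
  totalSize: bigint;
  totalFileCount: number;
  tags: string[]; // union of member tags, sorted for stable display
  memberCount: number;
  archiveType: string; // "Mixed" when members differ
}

// Expects at least one member; groups are assumed to be non-empty.
function aggregateGroup(members: MemberPackage[]): GroupAggregates {
  let totalSize = BigInt(0);
  let totalFileCount = 0;
  const tags = new Set<string>();
  const types = new Set<string>();
  for (const member of members) {
    totalSize += member.fileSize;
    totalFileCount += member.fileCount;
    member.tags.forEach((tag) => tags.add(tag));
    types.add(member.archiveType);
  }
  return {
    totalSize,
    totalFileCount,
    tags: Array.from(tags).sort(),
    memberCount: members.length,
    archiveType: types.size === 1 ? Array.from(types)[0] : "Mixed",
  };
}
```

Summing over all members, not only those that matched the filters, matches the rule that a group row shows aggregates for the whole group.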
### Search
`searchPackages` adds `PackageGroup.name` to search targets via a `LEFT JOIN` to `package_groups`. If any package in a group matches by name/file content, or the group name matches, the whole group appears.
### Filtering
Creator/tag filters apply to member packages. A group appears if any member matches the filter. The group row shows aggregates of all members (not just matching ones).
### New Query Functions
| Function | Purpose |
|----------|---------|
| `listDisplayItems(page, limit, filters)` | Two-step paginated query returning groups + standalone packages |
| `getDisplayItemCount(filters)` | Count of display items for pagination total |
| `getPackageGroup(groupId)` | Group metadata + all member packages |
| `updatePackageGroupName(groupId, name)` | Rename group |
| `updatePackageGroupPreview(groupId, previewData)` | Replace group preview |
| `addPackagesToGroup(packageIds, groupId)` | Manual grouping — add to existing group |
| `removePackageFromGroup(packageId)` | Ungroup single package |
| `createManualGroup(name, packageIds)` | Create new group from UI |
| `dissolveGroup(groupId)` | Ungroup all members, delete group record |
For manual grouping of packages that already belong to different groups: the selected packages are first linked to the target group, then any source group left with no members is dissolved. Non-selected members of source groups remain in their original group.
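Working out which source groups end up empty after such a move can be sketched as below. `findEmptiedGroups` is a hypothetical helper. It assumes the caller passes every current member of the affected source groups.

```typescript
interface GroupedPackage {
  id: string;
  packageGroupId: string | null;
}

// Given all members of the affected groups and the selected package IDs,
// return the source groups that have no members left after the move.
function findEmptiedGroups(
  allPackages: GroupedPackage[],
  selectedIds: string[],
  targetGroupId: string,
): string[] {
  const selected = new Set(selectedIds);
  const remaining = new Map<string, number>();
  for (const pkg of allPackages) {
    const groupId = pkg.packageGroupId;
    if (groupId === null || groupId === targetGroupId) continue;
    const stays = selected.has(pkg.id) ? 0 : 1;
    remaining.set(groupId, (remaining.get(groupId) ?? 0) + stays);
  }
  const emptied: string[] = [];
  remaining.forEach((count, groupId) => {
    if (count === 0) emptied.push(groupId);
  });
  return emptied;
}
```

Groups that keep at least one non-selected member are left untouched.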
---
## UI Changes
### STL Table — Group Rows
- **Collapsed (default):** Single row showing preview thumbnail, group name (editable inline), archive type badge ("Mixed" if heterogeneous), combined size, combined file count, combined tags (editable), source channel, latest `indexedAt`, actions
- **Expanded:** Chevron toggle reveals member packages as indented sub-rows with their existing columns and per-package actions
- Chevron icon on the left of the row toggles expand/collapse
**Loading strategy:** Member packages for all groups on the current page are prefetched in a single batched query during the step 2 fetch. This means expand/collapse is instant (no on-demand loading) and avoids per-row loading states.
### Group Row Actions
- **Send All** — Queues bot send requests for every package in the group. Checks for existing PENDING/SENDING requests per package to avoid duplicates.
- **View Files** — Opens file drawer showing all member packages' files, separated by package name headers
- **Dissolve Group** — Ungroups all members (confirmation required)
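The duplicate check for Send All might be sketched like this. `SendRequest` and `packagesToQueue` are hypothetical names. Only the `PENDING` and `SENDING` statuses are taken from this spec, and any other status values are assumptions.

```typescript
interface SendRequest {
  packageId: string;
  status: string; // e.g. "PENDING", "SENDING"; other values are assumed
}

const IN_FLIGHT_STATUSES = new Set(["PENDING", "SENDING"]);

// Return the member packages that still need a send request queued.
function packagesToQueue(memberIds: string[], existing: SendRequest[]): string[] {
  const busy = new Set<string>();
  for (const request of existing) {
    if (IN_FLIGHT_STATUSES.has(request.status)) busy.add(request.packageId);
  }
  return memberIds.filter((id) => !busy.has(id));
}
```

Packages whose earlier requests have already finished are queued again, since only in-flight requests count as duplicates.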
### Individual Package Actions (Within a Group)
- Existing: Send, View Files
- New: "Remove from group" in dropdown menu
### Manual Grouping
- Checkbox selection column on package rows
- When 2+ packages selected, a "Group Selected" button appears in the table toolbar
- Prompts for a group name, creates the group
- If selected packages belong to existing groups, those packages are moved to the new group. Source groups that become empty are automatically dissolved.
### Preview Editing
- Click the group's preview thumbnail to upload a replacement image
- Same upload flow as individual packages (existing component reuse)
### No Changes To
- Skipped/failed packages tab
- Package detail drawer internals
- Search UI (just broader matching behind the scenes)