feat: grouping phase 1 — schema, ungrouped tab, time-window grouping, hash verification

Schema:
- Add GroupingSource enum (ALBUM, MANUAL, AUTO_TIME, AUTO_PATTERN, etc.)
- Add groupingSource field to PackageGroup with backfill
- Add SystemNotification model for persistent alerts
- Add NotificationType and NotificationSeverity enums

Ungrouped staging tab:
- Add listUngroupedPackages/countUngroupedPackages queries
- Add "Ungrouped" tab to STL page showing packages without a group

Time-window auto-grouping:
- After album grouping, cluster ungrouped packages within configurable
  time window (default 5 min, AUTO_GROUP_TIME_WINDOW_MINUTES env var)
- Groups named from common filename prefix
- Groups created with groupingSource=AUTO_TIME

Hash verification after split:
- Re-hash split parts and compare to original contentHash
- Log error and create SystemNotification on mismatch
- Prevents silently corrupted split uploads

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 13:00:27 +02:00
parent 194c87a256
commit 9e78cc5d19
10 changed files with 415 additions and 5 deletions


@@ -0,0 +1,67 @@
# Grouping Phase 1: Foundation + Time-Window Grouping
> **For agentic workers:** Use superpowers:subagent-driven-development to implement this plan.
**Goal:** Add grouping infrastructure (schema, enums, notifications model), an ungrouped staging queue in the UI, and time-window auto-grouping as the first automatic signal beyond album grouping.
**Architecture:** Schema changes lay the foundation. Ungrouped tab is a query filter. Time-window grouping runs as a post-processing pass after album grouping in the worker pipeline.
**Tech Stack:** Prisma schema + migration, worker TypeScript, Next.js App Router.
---
## Task 1: Schema Migration
**Files:**
- Modify: `prisma/schema.prisma`
- Create: migration SQL
Add:
1. `GroupingSource` enum: `ALBUM`, `MANUAL`, `AUTO_TIME`, `AUTO_PATTERN`, `AUTO_REPLY`, `AUTO_ZIP`, `AUTO_CAPTION`
2. `groupingSource GroupingSource @default(MANUAL)` on `PackageGroup`
3. `SystemNotification` model with `type`, `severity`, `title`, `message`, `context` (Json), `isRead`
4. `NotificationType` enum: `HASH_MISMATCH`, `MISSING_PART`, `UPLOAD_FAILED`, `DOWNLOAD_FAILED`, `GROUPING_CONFLICT`, `INTEGRITY_AUDIT`
5. `NotificationSeverity` enum: `INFO`, `WARNING`, `ERROR`
Backfill: `UPDATE package_groups SET "groupingSource" = 'ALBUM' WHERE "mediaAlbumId" IS NOT NULL`
---
## Task 2: Ungrouped Staging Tab in STL Page
**Files:**
- Modify: `src/lib/telegram/queries.ts` — add `listUngroupedPackages()` query
- Modify: `src/app/(app)/stls/page.tsx` — add tab parameter support
- Modify: `src/app/(app)/stls/_components/stl-table.tsx` — add "Ungrouped" tab
Add a tab next to the existing "Skipped" tab that shows packages where `packageGroupId IS NULL`. Uses the existing `PackageListItem` type and table rendering. This gives users a clear view of files that need manual grouping.
---
## Task 3: Time-Window Auto-Grouping in Worker
**Files:**
- Create: `worker/src/grouping.ts` — add `processTimeWindowGroups()` after existing `processAlbumGroups()`
- Modify: `worker/src/worker.ts` — call time-window grouping after album grouping
- Modify: `worker/src/util/config.ts` — add `autoGroupTimeWindowMinutes` config
After album grouping completes, find remaining ungrouped packages from the same channel scan. Cluster packages whose `sourceMessageId` timestamps are within the configured window (default 5 minutes). Create groups for clusters of 2+ with `groupingSource = AUTO_TIME` and name derived from the common filename prefix or first file's base name.
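The clustering pass described above can be sketched as a pure function. This is an illustrative sketch only: the worker operates on Prisma rows, and the `TimestampedPackage` shape and the use of `indexedAt` as the timestamp are assumptions, not the committed implementation.

```typescript
interface TimestampedPackage {
  id: string;
  indexedAt: Date;
}

// Greedy single-pass clustering: sort packages by time, then split into
// clusters wherever the gap to the previous package exceeds the window.
// Only clusters of 2+ members qualify as groups.
function clusterByTimeWindow(
  packages: TimestampedPackage[],
  windowMs: number
): TimestampedPackage[][] {
  const sorted = [...packages].sort(
    (a, b) => a.indexedAt.getTime() - b.indexedAt.getTime()
  );
  const clusters: TimestampedPackage[][] = [];
  let current: TimestampedPackage[] = [];
  for (const pkg of sorted) {
    const prev = current[current.length - 1];
    if (
      prev !== undefined &&
      pkg.indexedAt.getTime() - prev.indexedAt.getTime() <= windowMs
    ) {
      current.push(pkg);
    } else {
      current = [pkg];
      clusters.push(current);
    }
  }
  return clusters.filter((c) => c.length >= 2);
}
```

With the default 5-minute window, two files posted 2 minutes apart land in one cluster, while a file posted 18 minutes later starts a new one.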
---
## Task 4: Hash Verification After Split
**Files:**
- Modify: `worker/src/worker.ts` — add hash re-check after concat+split
- Reference: `worker/src/archive/hash.ts` — no changes needed; reuse the existing `hashParts`
After `concatenateFiles()` + `byteLevelSplit()`, re-hash the split parts and compare to the original `contentHash`. If mismatch, log error and create a `SystemNotification` (once that table exists). This closes the integrity gap identified in the audit.
---
## Task 5: Build & Deploy
Rebuild worker and app images. Deploy. Verify:
- Worker logs show `maxPartSizeMB` and new `autoGroupTimeWindowMinutes` in config
- Ungrouped tab visible in STL page
- Previously-skipped large archives begin processing


@@ -0,0 +1,32 @@
-- CreateEnum GroupingSource
CREATE TYPE "GroupingSource" AS ENUM ('ALBUM', 'MANUAL', 'AUTO_TIME', 'AUTO_PATTERN', 'AUTO_REPLY', 'AUTO_ZIP', 'AUTO_CAPTION');
-- CreateEnum NotificationType
CREATE TYPE "NotificationType" AS ENUM ('HASH_MISMATCH', 'MISSING_PART', 'UPLOAD_FAILED', 'DOWNLOAD_FAILED', 'GROUPING_CONFLICT', 'INTEGRITY_AUDIT');
-- CreateEnum NotificationSeverity
CREATE TYPE "NotificationSeverity" AS ENUM ('INFO', 'WARNING', 'ERROR');
-- AlterTable: add groupingSource to package_groups
ALTER TABLE "package_groups" ADD COLUMN "groupingSource" "GroupingSource" NOT NULL DEFAULT 'MANUAL';
-- Backfill: mark album-based groups
UPDATE "package_groups" SET "groupingSource" = 'ALBUM' WHERE "mediaAlbumId" IS NOT NULL;
-- CreateTable: system_notifications
CREATE TABLE "system_notifications" (
"id" TEXT NOT NULL,
"type" "NotificationType" NOT NULL,
"severity" "NotificationSeverity" NOT NULL DEFAULT 'INFO',
"title" TEXT NOT NULL,
"message" TEXT NOT NULL,
"context" JSONB,
"isRead" BOOLEAN NOT NULL DEFAULT false,
"createdAt" TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP,
CONSTRAINT "system_notifications_pkey" PRIMARY KEY ("id")
);
-- CreateIndex
CREATE INDEX "system_notifications_isRead_createdAt_idx" ON "system_notifications"("isRead", "createdAt");
CREATE INDEX "system_notifications_type_idx" ON "system_notifications"("type");


@@ -522,6 +522,7 @@ model PackageGroup {
name String
mediaAlbumId String?
sourceChannelId String
groupingSource GroupingSource @default(MANUAL)
previewData Bytes?
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@ -802,3 +803,45 @@ model KickstarterPackage {
@@id([kickstarterId, packageId])
@@map("kickstarter_packages")
}
// ── Grouping & Notifications ──
enum GroupingSource {
ALBUM
MANUAL
AUTO_TIME
AUTO_PATTERN
AUTO_REPLY
AUTO_ZIP
AUTO_CAPTION
}
enum NotificationType {
HASH_MISMATCH
MISSING_PART
UPLOAD_FAILED
DOWNLOAD_FAILED
GROUPING_CONFLICT
INTEGRITY_AUDIT
}
enum NotificationSeverity {
INFO
WARNING
ERROR
}
model SystemNotification {
id String @id @default(cuid())
type NotificationType
severity NotificationSeverity @default(INFO)
title String
message String
context Json?
isRead Boolean @default(false)
createdAt DateTime @default(now())
@@index([isRead, createdAt])
@@index([type])
@@map("system_notifications")
}


@@ -38,7 +38,7 @@ import {
} from "@/components/ui/dialog";
import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs";
import { Badge } from "@/components/ui/badge";
import type { DisplayItem, IngestionAccountStatus } from "@/lib/telegram/types";
import type { DisplayItem, IngestionAccountStatus, PackageListItem } from "@/lib/telegram/types";
import type { SkippedRow } from "./skipped-columns";
import {
updatePackageCreator,
@@ -61,6 +61,9 @@ interface StlTableProps {
skippedData: SkippedRow[];
skippedPageCount: number;
skippedTotalCount: number;
ungroupedData: PackageListItem[];
ungroupedPageCount: number;
ungroupedTotalCount: number;
}
export function StlTable({
@@ -73,6 +76,9 @@ export function StlTable({
skippedData,
skippedPageCount,
skippedTotalCount,
ungroupedData,
ungroupedPageCount,
ungroupedTotalCount,
}: StlTableProps) {
const router = useRouter();
const pathname = usePathname();
@@ -379,6 +385,23 @@ export function StlTable({
const { table } = useDataTable({ data: tableRows, columns, pageCount });
const ungroupedRows: StlTableRow[] = useMemo(
() =>
ungroupedData.map((pkg) => ({
...pkg,
_rowType: "package" as const,
_groupId: null,
_isGroupMember: false,
})),
[ungroupedData]
);
const { table: ungroupedTable } = useDataTable({
data: ungroupedRows,
columns,
pageCount: ungroupedPageCount,
});
const activeTag = searchParams.get("tag") ?? "";
return (
@@ -401,6 +424,14 @@ export function StlTable({
</Badge>
)}
</TabsTrigger>
<TabsTrigger value="ungrouped" className="gap-1.5">
Ungrouped
{ungroupedTotalCount > 0 && (
<Badge variant="secondary" className="h-5 px-1.5 text-[10px]">
{ungroupedTotalCount}
</Badge>
)}
</TabsTrigger>
</TabsList>
<TabsContent value="packages" className="space-y-4">
@@ -472,6 +503,11 @@ export function StlTable({
totalCount={skippedTotalCount}
/>
</TabsContent>
<TabsContent value="ungrouped" className="space-y-4">
<DataTable table={ungroupedTable} emptyMessage="All packages are grouped!" />
<DataTablePagination table={ungroupedTable} totalCount={ungroupedTotalCount} />
</TabsContent>
</Tabs>
<PackageFilesDrawer


@@ -1,6 +1,6 @@
import { auth } from "@/lib/auth";
import { redirect } from "next/navigation";
import { listDisplayItems, searchPackages, getIngestionStatus, getAllPackageTags, listSkippedPackages, countSkippedPackages } from "@/lib/telegram/queries";
import { listDisplayItems, searchPackages, getIngestionStatus, getAllPackageTags, listSkippedPackages, countSkippedPackages, listUngroupedPackages, countUngroupedPackages } from "@/lib/telegram/queries";
import { StlTable } from "./_components/stl-table";
import type { DisplayItem, PackageListItem } from "@/lib/telegram/types";
@@ -24,7 +24,7 @@ export default async function StlFilesPage({ searchParams }: Props) {
const tab = (params.tab as string) ?? "packages";
// Fetch packages, ingestion status, tags, and skipped count in parallel
const [result, ingestionStatus, availableTags, skippedCount] = await Promise.all([
const [result, ingestionStatus, availableTags, skippedCount, ungroupedCount] = await Promise.all([
search
? searchPackages({
query: search,
@@ -43,6 +43,7 @@ export default async function StlFilesPage({ searchParams }: Props) {
getIngestionStatus(),
getAllPackageTags(),
countSkippedPackages(),
countUngroupedPackages(),
]);
// For search results, wrap as DisplayItem[]; for non-search, already DisplayItem[]
@@ -55,6 +56,11 @@ export default async function StlFilesPage({ searchParams }: Props) {
? await listSkippedPackages({ page, limit: perPage })
: null;
// Fetch ungrouped packages only if on that tab
const ungroupedResult = tab === "ungrouped"
? await listUngroupedPackages({ page, limit: perPage })
: null;
return (
<StlTable
data={displayItems}
@@ -66,6 +72,9 @@ export default async function StlFilesPage({ searchParams }: Props) {
skippedData={skippedResult?.items ?? []}
skippedPageCount={skippedResult?.pagination.totalPages ?? 0}
skippedTotalCount={skippedCount}
ungroupedData={ungroupedResult?.items ?? []}
ungroupedPageCount={ungroupedResult?.pagination.totalPages ?? 0}
ungroupedTotalCount={ungroupedCount}
/>
);
}


@@ -571,6 +571,72 @@ export async function countSkippedPackages(): Promise<number> {
return prisma.skippedPackage.count();
}
export async function listUngroupedPackages(options: {
page: number;
limit: number;
}) {
const { page, limit } = options;
const skip = (page - 1) * limit;
const where = { packageGroupId: null, destMessageId: { not: null } };
const [items, total] = await Promise.all([
prisma.package.findMany({
where,
orderBy: { indexedAt: "desc" },
skip,
take: limit,
select: {
id: true,
fileName: true,
fileSize: true,
archiveType: true,
creator: true,
fileCount: true,
isMultipart: true,
partCount: true,
tags: true,
indexedAt: true,
previewData: true,
sourceChannel: { select: { id: true, title: true } },
},
}),
prisma.package.count({ where }),
]);
return {
items: items.map((p) => ({
id: p.id,
fileName: p.fileName,
fileSize: p.fileSize.toString(),
contentHash: "",
archiveType: p.archiveType,
creator: p.creator,
fileCount: p.fileCount,
isMultipart: p.isMultipart,
partCount: p.partCount,
tags: p.tags,
indexedAt: p.indexedAt.toISOString(),
hasPreview: !!p.previewData,
sourceChannel: p.sourceChannel,
matchedFileCount: 0,
matchedByContent: false,
})),
pagination: {
total,
totalPages: Math.ceil(total / limit),
page,
limit,
},
};
}
export async function countUngroupedPackages(): Promise<number> {
return prisma.package.count({
where: { packageGroupId: null, destMessageId: { not: null } },
});
}
export async function getPackageGroup(groupId: string) {
return prisma.packageGroup.findUnique({
where: { id: groupId },


@@ -587,3 +587,24 @@ export async function linkPackagesToGroup(
data: { packageGroupId: groupId },
});
}
export async function createTimeWindowGroup(input: {
sourceChannelId: string;
name: string;
packageIds: string[];
}): Promise<string> {
const group = await db.packageGroup.create({
data: {
sourceChannelId: input.sourceChannelId,
name: input.name,
groupingSource: "AUTO_TIME",
},
});
await db.package.updateMany({
where: { id: { in: input.packageIds } },
data: { packageGroupId: group.id },
});
return group.id;
}


@@ -1,7 +1,8 @@
import type { Client } from "tdl";
import type { TelegramPhoto } from "./preview/match.js";
import { downloadPhotoThumbnail } from "./tdlib/download.js";
import { createOrFindPackageGroup, linkPackagesToGroup } from "./db/queries.js";
import { createOrFindPackageGroup, linkPackagesToGroup, createTimeWindowGroup } from "./db/queries.js";
import { config } from "./util/config.js";
import { childLogger } from "./util/logger.js";
import { db } from "./db/client.js";
@@ -77,3 +78,95 @@ export async function processAlbumGroups(
}
}
}
/**
* After album grouping, cluster remaining ungrouped packages from the same channel
* that were posted within a configurable time window.
* Only groups packages that were just indexed in this scan cycle (the `indexedPackages` list).
*/
export async function processTimeWindowGroups(
sourceChannelId: string,
indexedPackages: IndexedPackageRef[]
): Promise<void> {
if (config.autoGroupTimeWindowMinutes <= 0) return;
// Find which of the just-indexed packages are still ungrouped
const ungrouped = await db.package.findMany({
where: {
id: { in: indexedPackages.map((p) => p.packageId) },
packageGroupId: null,
},
orderBy: { sourceMessageId: "asc" },
select: {
id: true,
fileName: true,
sourceMessageId: true,
indexedAt: true,
},
});
if (ungrouped.length < 2) return;
const windowMs = config.autoGroupTimeWindowMinutes * 60 * 1000;
// Cluster by time proximity: walk the sorted list and start a new cluster
// when the gap exceeds the window (indexedAt serves as the timestamp here)
const clusters: (typeof ungrouped)[] = [];
let current: typeof ungrouped = [ungrouped[0]];
for (let i = 1; i < ungrouped.length; i++) {
const prev = current[current.length - 1];
const gap = Math.abs(ungrouped[i].indexedAt.getTime() - prev.indexedAt.getTime());
if (gap <= windowMs) {
current.push(ungrouped[i]);
} else {
clusters.push(current);
current = [ungrouped[i]];
}
}
clusters.push(current);
// Create groups for clusters with 2+ packages
for (const cluster of clusters) {
if (cluster.length < 2) continue;
// Derive group name from common filename prefix
const name = findCommonPrefix(cluster.map((p) => p.fileName)) || cluster[0].fileName;
try {
const groupId = await createTimeWindowGroup({
sourceChannelId,
name,
packageIds: cluster.map((p) => p.id),
});
log.info(
{ groupId, name, memberCount: cluster.length },
"Created time-window group"
);
} catch (err) {
log.warn({ err, clusterSize: cluster.length }, "Failed to create time-window group");
}
}
}
/**
* Find the longest common prefix among a list of filenames,
* trimming trailing separators and partial words.
*/
function findCommonPrefix(names: string[]): string {
if (names.length === 0) return "";
if (names.length === 1) return names[0];
let prefix = names[0];
for (let i = 1; i < names.length; i++) {
while (!names[i].startsWith(prefix)) {
prefix = prefix.slice(0, -1);
if (prefix.length === 0) return "";
}
}
// Trim trailing separators and partial words
const trimmed = prefix.replace(/[\s\-_.(]+$/, "");
return trimmed.length >= 3 ? trimmed : "";
}
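To make the prefix-trimming behavior concrete, here is the helper exercised standalone (same function body as above, reproduced for illustration with hypothetical filenames):

```typescript
// Same logic as findCommonPrefix in grouping.ts, copied for illustration.
function findCommonPrefix(names: string[]): string {
  if (names.length === 0) return "";
  if (names.length === 1) return names[0];
  let prefix = names[0];
  for (let i = 1; i < names.length; i++) {
    while (!names[i].startsWith(prefix)) {
      prefix = prefix.slice(0, -1);
      if (prefix.length === 0) return "";
    }
  }
  // Trim trailing separators and partial words; require at least 3 chars.
  const trimmed = prefix.replace(/[\s\-_.(]+$/, "");
  return trimmed.length >= 3 ? trimmed : "";
}

// findCommonPrefix(["Dragon_Knight_Part1.stl", "Dragon_Knight_Part2.stl"])
//   → "Dragon_Knight_Part"
// findCommonPrefix(["Castle - Part A.stl", "Castle - Part B.stl"])
//   → "Castle - Part"   (trailing " " stripped by the separator trim)
// findCommonPrefix(["alpha.zip", "beta.zip"])
//   → ""                (caller falls back to the first file's name)
```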


@@ -10,6 +10,8 @@ export const config = {
/** Maximum file part size for Telegram upload (in MiB). Default 1950 (under 2GB non-Premium limit).
* Set to 3900 for Premium accounts (under 4GB limit). */
maxPartSizeMB: parseInt(process.env.MAX_PART_SIZE_MB ?? "1950", 10),
/** Time window for auto-grouping ungrouped packages from the same channel (minutes). 0 = disabled. */
autoGroupTimeWindowMinutes: parseInt(process.env.AUTO_GROUP_TIME_WINDOW_MINUTES ?? "5", 10),
/** Maximum jitter added to scheduler interval (in minutes) */
jitterMinutes: 5,
/** Maximum time span for multipart archive parts (in hours). 0 = no limit. */


@@ -47,7 +47,8 @@ import { readRarContents } from "./archive/rar-reader.js";
import { read7zContents } from "./archive/sevenz-reader.js";
import { byteLevelSplit, concatenateFiles } from "./archive/split.js";
import { uploadToChannel } from "./upload/channel.js";
import { processAlbumGroups, type IndexedPackageRef } from "./grouping.js";
import { processAlbumGroups, processTimeWindowGroups, type IndexedPackageRef } from "./grouping.js";
import { db } from "./db/client.js";
import type { TelegramAccount, TelegramChannel } from "@prisma/client";
import type { Client } from "tdl";
@@ -790,6 +791,9 @@ async function processArchiveSets(
indexedPackageRefs,
scanResult.photos
);
// Time-window grouping for remaining ungrouped packages
await processTimeWindowGroups(channel.id, indexedPackageRefs);
}
return maxProcessedId;
@@ -1053,6 +1057,43 @@ async function processOneArchiveSet(
uploadPaths = splitPaths;
}
// ── Hash verification after split ──
// If we split/repacked, verify the split parts hash matches the original
if (splitPaths.length > 0) {
const splitHash = await hashParts(splitPaths);
if (splitHash !== contentHash) {
accountLog.error(
{ fileName: archiveName, originalHash: contentHash, splitHash, parts: splitPaths.length },
"Hash mismatch after split — file may be corrupted"
);
// Record notification for visibility
try {
await db.systemNotification.create({
data: {
type: "HASH_MISMATCH",
severity: "ERROR",
title: `Hash mismatch after splitting ${archiveName}`,
message: `Expected ${contentHash.slice(0, 16)}… but got ${splitHash.slice(0, 16)}… after splitting into ${splitPaths.length} parts`,
context: {
fileName: archiveName,
originalHash: contentHash,
splitHash,
partCount: splitPaths.length,
sourceChannelId: channel.id,
},
},
});
} catch {
// Best-effort notification
}
throw new Error(`Hash mismatch after split for ${archiveName}: expected ${contentHash}, got ${splitHash}`);
}
accountLog.debug(
{ fileName: archiveName, hash: contentHash.slice(0, 16), parts: splitPaths.length },
"Split hash verified — matches original"
);
}
// ── Uploading ──
// Check if a prior run already uploaded this file (orphaned upload scenario:
// file reached Telegram but DB write failed or worker crashed before indexing)