Step 7 of 7

Continuous Observatory

Real-time updates: the database is continuously fed by a network of collaborators' machines

                
Last updated: 2026-06-03 00:43

30,100 Videos in database

28,121 Videos transcribed

9.6M+ Comments

33,746 Metadata snapshots

2,305 Deleted videos detected

Input: Monitored YouTube & TikTok channels

Process: Scanner + open source distributed workers

Output: Continuously updated database

The scanner regularly checks all channels to detect new videos. Metadata is recorded at each pass, creating a longitudinal history of views, likes and subscribers.

Collaborators' machines connect to the central database via a secure SSH tunnel and atomically claim transcription tasks (SELECT ... FOR UPDATE SKIP LOCKED). Because rows locked by one worker are skipped rather than waited on, multiple machines work in parallel without claiming the same task.
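
The claim step described above can be sketched as follows. This is a minimal illustration, not the youpol-worker-node implementation: the `transcription_tasks` table, its columns (`id`, `video_id`, `status`, `claimed_by`, `created_at`) and the `claim_task` helper are all hypothetical, and the connection is assumed to be any DB-API 2.0 connection to PostgreSQL whose driver uses `%s` placeholders.

```python
from typing import Any, Optional

# Hypothetical table and column names, for illustration only.
CLAIM_SQL = """
UPDATE transcription_tasks
SET status = 'in_progress', claimed_by = %s
WHERE id = (
    SELECT id
    FROM transcription_tasks
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, video_id
"""


def claim_task(conn: Any, worker_name: str) -> Optional[tuple]:
    """Claim one pending task, or return None when nothing is claimable.

    The inner SELECT locks a single pending row; rows already locked by
    other workers are skipped instead of blocking this worker.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (worker_name,))
        row = cur.fetchone()  # None when no pending, unlocked row exists
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Selecting and updating in one statement keeps the claim inside a single short transaction, so a worker that crashes mid-claim leaves no lock behind.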

Comments are continuously extracted for newly transcribed videos and periodically for existing ones. Deleted comments are detected and documented.
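
One simple way to detect deleted comments is to compare the comment IDs seen in two successive extraction passes for the same video. The sketch below assumes exactly that; the `diff_passes` helper and the sample IDs are hypothetical, and the project's actual detection logic may differ.

```python
from typing import Iterable, List, Tuple


def diff_passes(
    previous_ids: Iterable[str], current_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Compare two extraction passes for one video.

    Returns (deleted, added): IDs present only in the previous pass,
    and IDs present only in the current pass, each sorted.
    """
    previous, current = set(previous_ids), set(current_ids)
    deleted = sorted(previous - current)
    added = sorted(current - previous)
    return deleted, added


# Hypothetical comment IDs from two passes over the same video.
first_pass = ["c1", "c2", "c3"]
second_pass = ["c1", "c3", "c4"]
deleted, added = diff_passes(first_pass, second_pass)
print("deleted:", deleted, "added:", added)
```

A comment missing from one pass may also reflect an incomplete extraction, so a real pipeline would probably confirm the absence before recording a deletion.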

Videos that have been deleted, made private, or lost to a channel termination are detected automatically during metadata scans and documented in the database with the reason for removal.

The YouPol corpus is not a static dataset: it is a living observatory of political content on YouTube and TikTok. The pipeline is designed to run continuously, automatically detecting new videos, extracting comments and recording metadata changes over time. The observatory currently tracks the francophone channels seeded in phase 1, and anglophone channels will be added as they are onboarded.

The processing infrastructure relies on a network of machines belonging to project collaborators, connected through an open source distributed worker system (youpol-worker-node). Each machine automatically detects its resources (CPU, memory, GPU) and atomically claims tasks from the central database. Collaborators can make their machine available at any time via a simple menu bar button.
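
Resource detection of the kind described above could look like the following sketch, which uses only the Python standard library. It is an assumption about how such a check might be written, not the youpol-worker-node code: the `Resources` type and `detect_resources` function are hypothetical, and the GPU check is a rough heuristic that only looks for the `nvidia-smi` executable.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resources:
    cpu_count: int
    memory_bytes: Optional[int]  # None when the platform does not expose it
    has_nvidia_gpu: bool


def detect_resources() -> Resources:
    """Report a rough picture of the local machine's resources."""
    cpu_count = os.cpu_count() or 1

    # os.sysconf exists on POSIX systems only; fall back to None elsewhere.
    memory_bytes: Optional[int] = None
    try:
        memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        pass

    # Heuristic: an NVIDIA driver install usually puts nvidia-smi on PATH.
    # Other GPU types (for example Apple Silicon) are not detected here.
    has_nvidia_gpu = shutil.which("nvidia-smi") is not None

    return Resources(cpu_count, memory_bytes, has_nvidia_gpu)


if __name__ == "__main__":
    print(detect_resources())
```

A worker could use such a report to decide which tasks to claim, for instance leaving GPU-heavy transcription to machines where a GPU was found.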

The system records a complete longitudinal history: each metadata scan produces a snapshot of each video's views, likes and subscribers. Deleted comments and privatized videos are automatically detected and documented. Because every snapshot is kept rather than overwritten, the corpus's temporal dynamics can be analyzed after the fact.
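
As an illustration of what the snapshot history makes possible, the sketch below computes the change in views and likes between consecutive snapshots of one video. The `Snapshot` type, its fields and the sample values are hypothetical and do not reflect the actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    views: int
    likes: int


def deltas(snapshots: List[Snapshot]) -> List[Dict[str, object]]:
    """Return the change in views and likes between consecutive snapshots."""
    ordered = sorted(snapshots, key=lambda s: s.taken_at)
    changes = []
    for prev, cur in zip(ordered, ordered[1:]):
        changes.append(
            {
                "from": prev.taken_at,
                "to": cur.taken_at,
                "views": cur.views - prev.views,
                "likes": cur.likes - prev.likes,
            }
        )
    return changes


# Hypothetical snapshots of a single video, given out of order on purpose.
history = [
    Snapshot(datetime(2026, 1, 2), views=150, likes=12),
    Snapshot(datetime(2026, 1, 1), views=100, likes=10),
]
for change in deltas(history):
    print(change)
```

Sorting by timestamp first means the function does not depend on the order in which snapshots are read from storage.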

youpol-worker-node (open source)

PostgreSQL

PostgREST

yt-dlp

#	Table	Description	Scale
1	videos	One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender.	26,396 rows
2	comments	All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column.	9.6M+ rows
3	video_transcripts	Full diarized transcripts with speaker labels and cleaned text versions.	28,121 rows
4	transcription_speakers	Individual speaker segments from diarization, ordered by position within each video.	1,021,611 rows
5	comments_processed	Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns.	15.3M+ rows
6	transcription_speakers_processed	Sentence-level speaker segments with NER extraction and full annotation suite.	4.8M+ rows
