Continuous Observatory
Real-time updates: the database is continuously fed by a network of collaborators' machines
Continuous Update Pipeline
How It Works
The scanner regularly checks all channels to detect new videos. Metadata is recorded at each pass, creating a longitudinal history of views, likes and subscribers.
Collaborators' machines connect to the central database via a secure SSH tunnel and atomically claim transcription tasks (SELECT FOR UPDATE SKIP LOCKED). Multiple machines work in parallel without conflicts.
Comments are continuously extracted for newly transcribed videos and periodically for existing ones. Deleted comments are detected and documented.
Videos that have been deleted, made private, or removed because their channel was terminated are automatically detected during metadata scans and documented in the database with the reason for removal.
The YouPol corpus is not a static dataset: it is a living observatory of political content on YouTube and TikTok. The pipeline is designed to run continuously, automatically detecting new videos, extracting comments and recording metadata changes over time. The observatory currently tracks the francophone channels seeded in phase 1; anglophone channels will be added as they are onboarded.
The processing infrastructure relies on a network of project collaborators' machines, connected through an open-source distributed worker system (youpol-worker-node). Each machine automatically detects its own resources (CPU, memory, GPU) and atomically claims tasks from the central database. Collaborators can make their machine available at any time with a single menu bar button.
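The atomic claim described above can be sketched as follows. This is a minimal illustration, not the youpol-worker-node implementation: the table `transcription_tasks`, its columns (`status`, `worker_id`, `claimed_at`) and the function `claim_next_task` are hypothetical names, and the connection is assumed to be a PostgreSQL DB-API connection (for example from psycopg) using the `%s` parameter style.

```python
from typing import Any, Optional

# Hypothetical queue table and columns: the real schema is not documented here.
# The inner SELECT locks one pending row; SKIP LOCKED makes concurrent workers
# pass over rows already locked by someone else instead of waiting on them.
CLAIM_SQL = """
UPDATE transcription_tasks
SET status = 'claimed', worker_id = %s, claimed_at = now()
WHERE id = (
    SELECT id
    FROM transcription_tasks
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_task(conn: Any, worker_id: str) -> Optional[int]:
    """Claim one pending task for this worker; return its id, or None if the queue is empty."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (worker_id,))
        row = cur.fetchone()
        conn.commit()  # committing releases the row lock and publishes the claim
    except Exception:
        conn.rollback()  # leave the task pending so another worker can take it
        raise
    finally:
        cur.close()
    return row[0] if row else None
```

Because the lock and the status update happen in a single statement, two machines polling at the same moment receive different task ids, and neither blocks the other.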
The system records a complete longitudinal history: each metadata scan produces a snapshot of each video's views and likes and of its channel's subscriber count. Deleted comments and videos that have become unavailable are automatically detected and documented. This architecture enables analysis of the corpus's temporal dynamics.
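One way to picture the snapshot history is an append-only list of observations per video, where a scan that can no longer reach a video appends a record carrying the reason. The sketch below is illustrative only: `Snapshot`, `record_scan` and the in-memory dictionary are assumptions standing in for the real database tables, whose layout is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One observation of a video at scan time (hypothetical structure)."""
    scanned_at: datetime
    views: Optional[int]
    likes: Optional[int]
    unavailable_reason: Optional[str] = None  # e.g. "deleted", "private"


def record_scan(
    history: Dict[str, List[Snapshot]],
    fetched: Dict[str, Tuple[int, int]],
    unavailable: Dict[str, str],
    scanned_at: datetime,
) -> List[str]:
    """Append one snapshot per scanned video; return ids that just became unavailable."""
    newly_unavailable = []
    for video_id, (views, likes) in fetched.items():
        history.setdefault(video_id, []).append(Snapshot(scanned_at, views, likes))
    for video_id, reason in unavailable.items():
        past = history.setdefault(video_id, [])
        # Report the transition only once, on the first scan that fails.
        if not past or past[-1].unavailable_reason is None:
            newly_unavailable.append(video_id)
        past.append(Snapshot(scanned_at, None, None, reason))
    return sorted(newly_unavailable)


history: Dict[str, List[Snapshot]] = {}
first = datetime(2024, 1, 1, tzinfo=timezone.utc)
second = datetime(2024, 1, 2, tzinfo=timezone.utc)
record_scan(history, {"a": (10, 1), "b": (5, 0)}, {}, first)
gone = record_scan(history, {"a": (15, 2)}, {"b": "private"}, second)
```

Earlier snapshots are never overwritten, so the view and like counts of a video that later disappears remain available for analysis.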
Tools Used
Database Schema
Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.
| # | Table | Description | Scale |
|---|---|---|---|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
Continuous Observatory
The database is continuously updated through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan adds to a longitudinal history accessible via the API.