
Continuous Observatory

Real-time updates: the database is continuously fed by a network of collaborators' machines.

Last updated: 2026-06-03 00:43
30,100 Videos in database
28,121 Videos transcribed
9.6M+ Comments
33,746 Metadata snapshots
2,305 Deleted videos detected
Data Flow

Continuous Update Pipeline

Input: Monitored YouTube & TikTok channels
Process: Scanner + open source distributed workers
Output: Continuously updated database
Methodology

How It Works

The scanner regularly checks all channels to detect new videos. Metadata is recorded at each pass, creating a longitudinal history of views, likes and subscribers.
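As a rough illustration of what such a longitudinal history makes possible, the sketch below computes the growth in views and likes between consecutive scans of one video. The `Snapshot` fields and the `deltas` helper are hypothetical names chosen for the example; the project's actual schema may differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """One metadata reading for a video (hypothetical field names)."""
    video_id: str
    scanned_at: datetime
    views: int
    likes: int


def deltas(snapshots):
    """Return (scanned_at, views gained, likes gained) between consecutive scans."""
    ordered = sorted(snapshots, key=lambda s: s.scanned_at)
    return [
        (cur.scanned_at, cur.views - prev.views, cur.likes - prev.likes)
        for prev, cur in zip(ordered, ordered[1:])
    ]


# Illustrative values only; snapshots may arrive out of order.
history = [
    Snapshot("abc", datetime(2026, 1, 2), 1500, 90),
    Snapshot("abc", datetime(2026, 1, 1), 1000, 50),
    Snapshot("abc", datetime(2026, 1, 3), 1800, 95),
]
growth = deltas(history)
```

Storing every pass as its own row, rather than overwriting the latest values, is what allows this kind of after-the-fact comparison.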

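Collaborators' machines connect to the central database via a secure SSH tunnel and atomically claim transcription tasks using PostgreSQL row locking (SELECT ... FOR UPDATE SKIP LOCKED). Because rows already locked by one worker are skipped by the others, multiple machines work in parallel without claiming the same task.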
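A minimal sketch of this claiming pattern is shown below. The table `transcription_tasks` and its columns (`status`, `claimed_by`, `claimed_at`, `video_id`) are assumptions made for illustration, not the project's actual schema; only the `FOR UPDATE SKIP LOCKED` clause comes from the description above.

```python
# Hypothetical table and column names; only the locking clause is from the text.
CLAIM_SQL = """
UPDATE transcription_tasks
   SET status = 'claimed',
       claimed_by = %(worker)s,
       claimed_at = now()
 WHERE id = (
        SELECT id
          FROM transcription_tasks
         WHERE status = 'pending'
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id, video_id;
"""


def claim_task(cursor, worker_id):
    """Claim one pending task and return its row, or None if none is free.

    `cursor` is any DB-API cursor that accepts named pyformat parameters
    (for example a psycopg cursor). The caller is expected to commit.
    """
    cursor.execute(CLAIM_SQL, {"worker": worker_id})
    return cursor.fetchone()
```

Two workers running this statement at the same moment each lock a different pending row: the second one skips the row the first has locked instead of waiting for it.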

Comments are continuously extracted for newly transcribed videos and periodically for existing ones. Deleted comments are detected and documented.
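One plausible way to detect deleted comments is to compare the comment IDs seen in two successive extractions of the same video. The sketch below assumes exactly that; the source does not describe the project's actual detection logic, and `diff_comment_ids` is a hypothetical helper.

```python
def diff_comment_ids(previous_ids, current_ids):
    """Compare two extractions of one video's comments by ID.

    Returns the IDs that disappeared since the previous pass and the IDs
    that are new in the current pass, each as a sorted list.
    """
    previous, current = set(previous_ids), set(current_ids)
    return {
        "deleted": sorted(previous - current),
        "new": sorted(current - previous),
    }


# Illustrative IDs only.
result = diff_comment_ids(["c1", "c2", "c3"], ["c2", "c3", "c4"])
```

A comment missing from a single extraction could also reflect an incomplete fetch, so a production system would probably confirm the absence on a later pass before recording a deletion.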

Videos that have been deleted, made private, or removed because their channel was terminated are automatically detected during metadata scans and documented in the database along with the reason they became unavailable.


The YouPol corpus is not a static dataset: it is a living observatory of political content on YouTube and TikTok. The pipeline is designed to run continuously, automatically detecting new videos, extracting comments and recording metadata changes over time. The observatory currently tracks the francophone channels seeded in phase 1 and will extend to anglophone channels once they are onboarded.

The processing infrastructure relies on a network of project collaborators' machines, connected via an open source distributed worker system (youpol-worker-node). Each machine automatically detects its resources (CPU, memory, GPU) and atomically claims tasks from the central database. Collaborators can make their machine available at any time with a single menu bar button.
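To give a sense of what automatic resource detection can look like, here is a standard-library sketch. It is not taken from youpol-worker-node: the function names are hypothetical, memory detection relies on `os.sysconf` (unavailable on some platforms), and the GPU check is only a heuristic that looks for the `nvidia-smi` executable.

```python
import os
import shutil


def detect_memory_bytes():
    """Total physical memory in bytes, or None where it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. platforms without os.sysconf
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count


def detect_resources():
    """Collect a coarse description of the local machine."""
    return {
        "cpu_count": os.cpu_count() or 1,
        "memory_bytes": detect_memory_bytes(),
        # Heuristic: presence of the NVIDIA CLI suggests an NVIDIA GPU driver.
        "has_nvidia_gpu": shutil.which("nvidia-smi") is not None,
    }


resources = detect_resources()
```

A worker could use such a dictionary to decide which kinds of task to claim, for instance leaving transcription to machines that report a GPU.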

The system records a complete longitudinal history: each metadata scan produces a snapshot of each video's views, likes and subscribers. Deleted comments and privatized videos are automatically detected and documented. This architecture makes it possible to analyze how the corpus changes over time.

Technology Stack

Tools Used

youpol-worker-node (open source)
PostgreSQL
PostgREST
yt-dlp
Data Architecture

Database Schema

Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.

| # | Table | Description | Scale |
|---|-------|-------------|-------|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |

Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.
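Assuming the API is served by the PostgREST layer listed in the technology stack, a request for one video's history might be built as follows. The base URL, the `metadata_snapshots` resource and the `video_id` and `scanned_at` columns are hypothetical; only the filter syntax (`eq.`, `order`, `limit`) is standard PostgREST.

```python
from urllib.parse import urlencode


def snapshot_history_url(base_url, video_id, limit=100):
    """Build a PostgREST-style URL listing a video's snapshots, oldest first."""
    params = {
        "video_id": f"eq.{video_id}",   # horizontal filter: column = value
        "order": "scanned_at.asc",      # chronological order
        "limit": str(limit),
    }
    # Resource and column names are placeholders for illustration.
    return f"{base_url.rstrip('/')}/metadata_snapshots?{urlencode(params)}"


url = snapshot_history_url("https://api.example.org", "abc123", limit=10)
```

The resulting URL can be fetched with any HTTP client; PostgREST returns the matching rows as a JSON array.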
