Step 1 of 7

Channel Selection

Francophone seed (France and Quebec) set to expand to the anglophone world

Last updated: 2026-06-03 00:43
69 Channels tracked
4 Political orientations
2 Countries covered
Data Flow

Continuous Update Pipeline

Input: YouTube & TikTok ecosystem
Process: Identification by political content + audience metrics
Output: Classified channels
Methodology

How It Works

Channel links are collected by identifying YouTube and TikTok channels recognized for their role in the ecosystem of political content creators. The current seed covers the francophone ecosystems (France and Quebec); the same procedure will be used to build the anglophone seed. Audience metrics (views and subscribers) are the criteria used to assess each channel's reach and influence.
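As an illustration of how audience metrics could be used to assess reach, here is a minimal sketch. The tier names, the subscriber cutoffs, and the dictionary keys are hypothetical, chosen only for the example; the page does not state any thresholds.

```python
# Hypothetical cutoffs: the source does not specify any numeric thresholds.
MAJOR_SUBSCRIBERS = 500_000
NICHE_SUBSCRIBERS = 20_000


def reach_tier(subscribers):
    """Assign an illustrative reach tier from a subscriber count."""
    if subscribers >= MAJOR_SUBSCRIBERS:
        return "major"
    if subscribers < NICHE_SUBSCRIBERS:
        return "niche"
    return "mid"


def rank_by_reach(channels):
    """Sort channels by subscribers, then by total views, largest first."""
    return sorted(
        channels,
        key=lambda c: (c["subscribers"], c["views"]),
        reverse=True,
    )


# Made-up candidates; names and numbers are placeholders.
candidates = [
    {"name": "channel_a", "subscribers": 12_000, "views": 900_000},
    {"name": "channel_b", "subscribers": 750_000, "views": 120_000_000},
    {"name": "channel_c", "subscribers": 80_000, "views": 9_500_000},
]

ranked = rank_by_reach(candidates)
tiers = {c["name"]: reach_tier(c["subscribers"]) for c in ranked}
```

Ranking rather than filtering keeps niche creators in the corpus, which matches the stated goal of covering both small and large channels.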

Each channel is classified along two dimensions: political orientation (far right, left, manosphere, conspiracy) and country of origin. The corpus covers the full political spectrum, from niche creators to major political influencers, enabling comparative analysis across the ecosystems included. The same classification scheme will be carried over to anglophone channels added during the corpus expansion.
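The two-dimensional classification can be pictured as a small record type. This is a sketch only: the four orientation labels and the two regions come from the text above, while the class names, field names, and example URLs are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    # The four categories listed in the classification scheme.
    FAR_RIGHT = "far right"
    LEFT = "left"
    MANOSPHERE = "manosphere"
    CONSPIRACY = "conspiracy"


class Country(Enum):
    # The two origins covered by the francophone seed.
    FRANCE = "France"
    QUEBEC = "Quebec"


@dataclass(frozen=True)
class Channel:
    url: str                  # hypothetical field names throughout
    platform: str             # "youtube" or "tiktok"
    orientation: Orientation
    country: Country


def group_by(channels, attribute):
    """Group channels by one classification dimension ("orientation" or "country")."""
    groups = {}
    for channel in channels:
        groups.setdefault(getattr(channel, attribute), []).append(channel)
    return groups


# Placeholder channels; the URLs are not real corpus entries.
corpus = [
    Channel("https://example.org/a", "youtube", Orientation.LEFT, Country.FRANCE),
    Channel("https://example.org/b", "tiktok", Orientation.FAR_RIGHT, Country.QUEBEC),
    Channel("https://example.org/c", "youtube", Orientation.LEFT, Country.QUEBEC),
]

by_orientation = group_by(corpus, "orientation")
by_country = group_by(corpus, "country")
```

Using enumerations rather than free-text labels means an anglophone expansion would only need new `Country` members, leaving the orientation scheme untouched.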


The first step of the pipeline collects videos from YouTube and TikTok channels identified for their role in the ecosystem of political content creators. The current corpus is a francophone seed: over 60 channels in France and Quebec covering the entire political spectrum, from the far left to the far right (Finlayson, 2022; Riedl et al., 2021). This seed is set to expand to the anglophone world, placing YOUPOL in an international, comparative perspective.

Channel selection relies on audience metrics such as the number of views and subscribers, which make it possible to assess each channel's role and influence in the ecosystem. The selected channels are classified by political orientation (far right, left, manosphere, conspiracy) and country of origin. The resulting corpus is distinctive for its scale, its granularity (including speaker diarization), and especially its capacity to support longitudinal and computational analysis of video content, whereas previous studies focused only on titles (Boursier, 2022, 2024). The methodology is replicable and will be applied as-is to the anglophone expansion.

Technology Stack

Tools Used

Python
YouTube Data API
TikTok API
Data Architecture

Database Schema

Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.

| # | Table | Description | Scale |
|---|-------|-------------|-------|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
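To make the relationship between `videos` and `transcription_speakers` concrete, the sketch below builds a reduced version of the two tables in an in-memory SQLite database. It is not the project's schema: the JSONB column suggests the real database is PostgreSQL, SQLite is used here only to keep the example self-contained, and every column name is a guess based on the descriptions in the table.

```python
import sqlite3

# Reduced, hypothetical columns; the real tables hold many more fields.
SCHEMA = """
CREATE TABLE videos (
    video_id TEXT PRIMARY KEY,
    views INTEGER,
    political_orientation TEXT,
    country TEXT
);
CREATE TABLE transcription_speakers (
    video_id TEXT REFERENCES videos(video_id),
    position INTEGER,
    speaker_label TEXT,
    text TEXT,
    PRIMARY KEY (video_id, position)
);
"""


def build_demo_db():
    """Create the two tables and insert one video with two speaker segments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO videos VALUES (?, ?, ?, ?)",
        ("vid1", 1000, "left", "France"),
    )
    # Inserted out of order on purpose, to show that position drives ordering.
    conn.executemany(
        "INSERT INTO transcription_speakers VALUES (?, ?, ?, ?)",
        [
            ("vid1", 1, "SPEAKER_01", "Second segment."),
            ("vid1", 0, "SPEAKER_00", "First segment."),
        ],
    )
    return conn


def ordered_segments(conn, video_id):
    """Return (speaker_label, text) pairs for a video in transcript order."""
    return conn.execute(
        "SELECT speaker_label, text FROM transcription_speakers "
        "WHERE video_id = ? ORDER BY position",
        (video_id,),
    ).fetchall()


conn = build_demo_db()
segments = ordered_segments(conn, "vid1")
```

Keying each segment on the video ID and its position is one way to preserve speaker order when a diarized transcript is reassembled from individual rows.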

Continuous Observatory

The database is continuously updated through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan adds to a longitudinal history that is accessible via the API.
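One way to picture the longitudinal history is as an append-only list of metadata snapshots per video, one per scan. The sketch below is an assumption about how such a history could be represented, not a description of the YOUPOL API; the `Snapshot` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    video_id: str
    scanned_at: datetime
    views: int
    likes: int


def record_scan(history, snapshot):
    """Append a snapshot to the video's history without overwriting earlier scans."""
    history.setdefault(snapshot.video_id, []).append(snapshot)


def view_growth(history, video_id):
    """Return (scan time, views gained since the previous scan) pairs."""
    snaps = sorted(history.get(video_id, []), key=lambda s: s.scanned_at)
    return [
        (later.scanned_at, later.views - earlier.views)
        for earlier, later in zip(snaps, snaps[1:])
    ]


# Made-up scans of a single placeholder video.
history = {}
record_scan(history, Snapshot("vid1", datetime(2026, 1, 1), views=1_000, likes=50))
record_scan(history, Snapshot("vid1", datetime(2026, 1, 2), views=1_400, likes=65))
record_scan(history, Snapshot("vid1", datetime(2026, 1, 3), views=1_450, likes=66))

growth = view_growth(history, "vid1")
```

Storing every scan rather than only the latest values is what makes it possible to compute growth between dates after the fact.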
