Channel Selection
Francophone seed (France and Quebec) set to expand to the anglophone world
Continuous Update Pipeline
How It Works
YouTube and TikTok channel links are collected by identifying channels recognized for their role in the ecosystem of political content creators. The current seed covers the francophone ecosystems of France and Quebec; the same procedure will be used to build the anglophone seed. Audience metrics (view and subscriber counts) are used to assess each channel's reach and influence.
Each channel is classified along two dimensions: political orientation (far right, left, manosphere, conspiracy) and country of origin. The corpus covers the full political spectrum, from niche creators to major political influencers, enabling comparative analysis across the ecosystems included. The same classification scheme will be carried over to anglophone channels added during the corpus expansion.
The first step of the pipeline collects videos from YouTube and TikTok channels identified for their role in the ecosystem of political content creators. The current corpus is a francophone seed of over 60 channels in France and Quebec, covering the entire political spectrum from the far left to the far right (Finlayson, 2022; Riedl et al., 2021). This seed is set to expand to the anglophone world, placing YOUPOL in an international, comparative perspective.
Channel selection relies on audience metrics such as view and subscriber counts, which make it possible to assess each channel's role and influence in the ecosystem. The selected channels are classified by political orientation (far right, left, manosphere, conspiracy) and country of origin. The resulting corpus is distinctive for its scale, its granularity (including speaker diarization), and especially its capacity to support longitudinal and computational analysis of video content, where previous studies focused only on titles (Boursier, 2022, 2024). The methodology is replicable and will be applied as-is to the anglophone expansion.
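The selection and classification steps described above can be sketched as a small data structure. This is an illustrative sketch only, not the project's code: the `Channel` fields, the `select_channels` and `group_by_dimensions` helpers, and the numeric thresholds are all assumptions, since the text does not state any cutoff values. The `Orientation` labels are the four named above and may not be exhaustive.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    # The four labels named in the text; the real scheme may contain more.
    FAR_RIGHT = "far right"
    LEFT = "left"
    MANOSPHERE = "manosphere"
    CONSPIRACY = "conspiracy"


@dataclass(frozen=True)
class Channel:
    # All field names are hypothetical.
    url: str                  # YouTube or TikTok channel link
    orientation: Orientation  # first classification dimension
    country: str              # second classification dimension
    subscribers: int          # audience metric
    total_views: int          # audience metric


def select_channels(candidates, min_subscribers, min_views):
    """Keep candidates whose audience metrics reach both thresholds.

    The thresholds are placeholders chosen by the caller; the source
    does not specify how the metrics are weighed.
    """
    return [
        c for c in candidates
        if c.subscribers >= min_subscribers and c.total_views >= min_views
    ]


def group_by_dimensions(channels):
    """Group channels by (country, orientation) for comparative analysis."""
    groups = {}
    for c in channels:
        groups.setdefault((c.country, c.orientation), []).append(c)
    return groups
```

Keeping the two classification dimensions as explicit fields means the same structure could hold anglophone channels later without changes, which matches the stated intent to reuse the scheme as-is.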
Database Schema
The database comprises six tables in a normalized relational schema, spanning raw metadata to sentence-level NLP annotations.
| # | Table | Description | Scale |
|---|---|---|---|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
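
To make the relationship between the tables concrete, here is a minimal sketch of how three of them might link from video to speaker segment to sentence. The table names come from the schema above, but every column name, key, and row is an assumption for illustration. The sketch uses Python's built-in `sqlite3` with an in-memory database so it runs anywhere; the JSONB column mentioned above suggests the real database is PostgreSQL.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE videos (
            video_id TEXT PRIMARY KEY,
            political_orientation TEXT,
            country TEXT
        );
        CREATE TABLE transcription_speakers (
            video_id TEXT REFERENCES videos(video_id),
            position INTEGER,          -- order of the segment within the video
            speaker_label TEXT,
            PRIMARY KEY (video_id, position)
        );
        CREATE TABLE transcription_speakers_processed (
            video_id TEXT,
            position INTEGER,
            sentence_index INTEGER,    -- order of the sentence within the segment
            sentence TEXT,
            FOREIGN KEY (video_id, position)
                REFERENCES transcription_speakers(video_id, position)
        );
    """)
    conn.executemany(
        "INSERT INTO videos VALUES (?, ?, ?)",
        [("v1", "left", "France"), ("v2", "far right", "Quebec")],
    )
    conn.executemany(
        "INSERT INTO transcription_speakers VALUES (?, ?, ?)",
        [("v1", 0, "SPEAKER_00"), ("v1", 1, "SPEAKER_01"), ("v2", 0, "SPEAKER_00")],
    )
    conn.executemany(
        "INSERT INTO transcription_speakers_processed VALUES (?, ?, ?, ?)",
        [
            ("v1", 0, 0, "First sentence."),
            ("v1", 0, 1, "Second sentence."),
            ("v1", 1, 0, "A reply."),
            ("v2", 0, 0, "Another video."),
        ],
    )
    return conn


def sentences_per_orientation(conn):
    """Count sentence-level rows per political orientation via a join."""
    query = """
        SELECT v.political_orientation, COUNT(*)
        FROM transcription_speakers_processed AS p
        JOIN videos AS v ON v.video_id = p.video_id
        GROUP BY v.political_orientation
        ORDER BY v.political_orientation
    """
    return conn.execute(query).fetchall()
```

A join of this shape is what would let sentence-level annotations be compared across orientations or countries, since those labels live on the `videos` table in the schema described above.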
Continuous Observatory
The database is continuously updated through channel scanning, video transcription and annotation, comment extraction, and metadata refreshes (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.
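One way to picture the longitudinal history is as an append-only series of metric snapshots, one per scan. The sketch below assumes that design; `MetricSnapshot`, `record_scan`, and `view_growth` are hypothetical names, and nothing here reflects the actual API or storage layout, which the text does not describe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricSnapshot:
    # Hypothetical record of one scan of one video.
    video_id: str
    scanned_on: date
    views: int
    likes: int


def record_scan(history, snapshot):
    """Append a snapshot instead of overwriting, so earlier values stay queryable."""
    history.setdefault(snapshot.video_id, []).append(snapshot)


def view_growth(history, video_id):
    """Return (scan date, views gained since the previous scan) pairs for one video."""
    snaps = sorted(history.get(video_id, []), key=lambda s: s.scanned_on)
    return [
        (later.scanned_on, later.views - earlier.views)
        for earlier, later in zip(snaps, snaps[1:])
    ]
```

Storing each scan as a new row, rather than updating counters in place, is what would make growth curves recoverable after the fact.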