Research Project

Welcome to the YOUPOL Database

A database of political influencers on YouTube and TikTok (2006–present), soon expanding to the anglophone world.

More than 71 channels are tracked, with thousands of transcribed videos and millions of comments and NLP annotations available for analyzing online political discourse. The database is continuously enriched by a distributed network of contributor machines. To join the network, deploy a worker on your machine, or contact us for access.

Continuous Observatory

The database is updated continuously through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.
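To illustrate how repeated scans can add up to a longitudinal history, here is a minimal sketch. The record layout (`Snapshot` and its fields) and the in-memory `History` store are assumptions made for illustration; they are not the YOUPOL schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """One metadata observation of a video (hypothetical layout)."""
    scanned_at: datetime
    views: int
    likes: int


class History:
    """Append-only store: each scan adds a snapshot instead of overwriting."""

    def __init__(self) -> None:
        self._by_video = defaultdict(list)

    def record(self, video_id: str, snap: Snapshot) -> None:
        self._by_video[video_id].append(snap)

    def series(self, video_id: str) -> list:
        """Snapshots for one video, ordered by scan time."""
        return sorted(self._by_video[video_id], key=lambda s: s.scanned_at)

    def view_growth(self, video_id: str) -> int:
        """Views gained between the first and the latest scan."""
        s = self.series(video_id)
        return s[-1].views - s[0].views if s else 0
```

Keeping every snapshot, rather than only the latest value, is what makes questions such as "how fast did this video grow in its first week?" answerable later.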

Last updated: 2026-07-19 11:06

The Project

Understanding Political YouTube & TikTok Through Its Content

Previous research on political YouTube and TikTok was limited to metadata (titles, tags, view counts). YOUPOL goes further by analyzing what creators actually say.

By transcribing and annotating a corpus of 30,878 videos (27,567 transcribed to date) from more than 71 political channels (2006 to today), we built the first database enabling computational analysis of political discourse at the content level, from far-right ideology to scientific rhetoric and from hate speech to audience engagement.

27,567 Videos transcribed

9.9M+ Comments extracted

71 Channels analyzed

20 Years of coverage

Collect & Transcribe

Videos and 9.9M+ comments are scraped. Audio is preprocessed with Demucs, transcribed with Whisper, and speaker-diarized with pyannote.audio.

yt-dlp · Whisper · pyannote
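The collection steps above could be chained roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: the flags, the Whisper model size (`small`), the Demucs model (`htdemucs`), the language code, and the directory layout are all illustrative choices, and `build_commands` is a hypothetical helper. Speaker diarization with pyannote.audio is driven from Python rather than a command line, so it is left out here.

```python
from pathlib import Path


def build_commands(url: str, video_id: str, workdir: Path) -> list:
    """Return the CLI invocations for one video, in execution order."""
    raw = workdir / "raw"
    stems = workdir / "stems"
    audio = raw / f"{video_id}.wav"
    # Assumed Demucs output layout: <out>/<model>/<track name>/vocals.wav
    vocals = stems / "htdemucs" / video_id / "vocals.wav"
    return [
        # 1. Download the audio track as WAV, plus metadata and comments.
        ["yt-dlp", "-x", "--audio-format", "wav",
         "--write-info-json", "--write-comments",
         "-o", str(raw / f"{video_id}.%(ext)s"), url],
        # 2. Separate the voice from music and background noise.
        ["demucs", "--two-stems", "vocals", "-n", "htdemucs",
         "-o", str(stems), str(audio)],
        # 3. Transcribe the isolated voice (model and language are assumptions).
        ["whisper", str(vocals), "--model", "small", "--language", "fr",
         "--output_format", "json",
         "--output_dir", str(workdir / "transcripts")],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building the commands separately from running them makes the plan easy to log and inspect before any download starts.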

Annotate & Classify

NLP classifiers detect far-right ideology, hate speech, scientific rhetoric, and political orientation at the sentence level. Annotation powered by LLM_Tool (technical paper).

Transformers · NER · LLM annotation
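Sentence-level classification means each transcript is split into sentences, each sentence gets a label, and the labels are then aggregated per video. The sketch below shows that aggregation step only. The regex splitter and the `keyword_stub` function are placeholders invented for this example; the project's trained classifiers would take the place of the stub.

```python
import re
from collections import Counter
from typing import Callable


def split_sentences(text: str) -> list:
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_shares(text: str, classify: Callable[[str], str]) -> dict:
    """Share of sentences assigned to each label within one transcript."""
    sentences = split_sentences(text)
    counts = Counter(classify(s) for s in sentences)
    return {label: n / len(sentences) for label, n in counts.items()}


def keyword_stub(sentence: str) -> str:
    """Placeholder standing in for a trained sentence classifier."""
    return "political" if "election" in sentence.lower() else "non-political"
```

For example, `label_shares("The election is near. I like tea!", keyword_stub)` assigns one sentence to each label, so both shares are 0.5.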

Analyze & Visualize

Entity networks, co-occurrence graphs, OLS regressions, and temporal trends reveal the evolution of political discourse across two decades.

NetworkX · OLS · ECharts
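A co-occurrence graph links two entities whenever they appear in the same unit of text, with the edge weight counting how often that happens. Here is a minimal standard-library sketch; the choice of unit (sentence, video, or something else) and the placeholder entity names are assumptions, since the page does not specify them.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(entity_lists) -> Counter:
    """Count how often each unordered pair of entities shares a unit."""
    edges = Counter()
    for entities in entity_lists:
        # Deduplicate within a unit and sort so (a, b) and (b, a) are one key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Placeholder entities, one list per unit of text.
units = [["A", "B"], ["B", "A", "C"], ["A"]]
edges = cooccurrence(units)
```

The resulting weighted edge list can be loaded into NetworkX with `Graph.add_weighted_edges_from((a, b, w) for (a, b), w in edges.items())` for layout and centrality analysis.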

Corpus Composition

Across the Political Spectrum

The initial seed comprises 71 channels spanning the entire francophone political spectrum (France and Quebec), categorized by political orientation. Expansion to the anglophone world is planned. The corpus deliberately oversamples far-right content to enable fine-grained analysis of radical discourse.

17,556 Far Right (FR) 57%

6,028 Left (FR) 20%

5,069 Far Right (QC) 16%

1,811 Masculinist 6%

414 Conspiracy (QC) 1%
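The percentages above can be recomputed from the counts. The sketch assumes the counts are videos per category, which is consistent with their sum matching the 30,878 total cited earlier.

```python
# Counts as listed above; treating them as videos is an assumption.
counts = {
    "Far Right (FR)": 17_556,
    "Left (FR)": 6_028,
    "Far Right (QC)": 5_069,
    "Masculinist": 1_811,
    "Conspiracy (QC)": 414,
}

total = sum(counts.values())
# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```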

Sentences: CamemBERTv2 classifier (FR) · comments: XLM-RoBERTa classifier (multilingual) — trained on LLM annotations validated by human coders

Annotation protocol & validation

The training corpus is pre-annotated by a large language model (LLM). The LLM's reliability is first measured on a random sample that is also coded independently by two human researchers. The metrics below compare the LLM's predictions against the consensus of the two annotators (agreement > n/2, which with n = 2 means both annotators agree). Once validated, the LLM labels the rest of the corpus, on which the final classifiers are trained (CamemBERTv2 for sentences, XLM-RoBERTa for comments).
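The consensus rule can be written as a strict-majority vote over n annotators. This is a sketch of the rule as stated; returning `None` when no label clears the threshold is an assumption, since the page does not say how disagreements are handled.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consensus(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Label chosen by strictly more than half of the annotators, else None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None
```

For instance, `consensus(["political", "political"])` returns `"political"`, while `consensus(["political", "non-political"])` returns `None`.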

Metric	Description	Sentences	Comments
Macro F1	Unweighted mean across classes	92.2%	93.5%
Weighted F1	Weighted by each class's support	92.0%	93.9%
Micro F1	Global, across all decisions	92.0%	93.7%
Accuracy	% of LLM decisions identical to the human consensus	88.1%	88.9%
Hamming loss	% of mislabelled bits (lower is better)	7.7%	6.0%

Validation sample: 1,000 sentences · 998 comments (random)
Human annotators: 2 researchers, independent coding
Reference: human consensus (agreement > n/2)
Schema: binary (political / non-political)
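For readers who want to see how the five metrics relate, here is a small pure-Python sketch. It assumes labels are encoded as rows of 0/1 indicators with one column per class, and that accuracy means an exact match on the whole row. That encoding is an assumption made for illustration; the sketch does not reproduce the figures reported above.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(y_true, y_pred) -> dict:
    """Compare predicted 0/1 indicator rows against reference rows."""
    n_rows, n_cols = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_cols, [0] * n_cols, [0] * n_cols
    wrong_bits = exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(list(t_row) == list(p_row))
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)
            wrong_bits += int(t != p)
    per_class = [f1(tp[j], fp[j], fn[j]) for j in range(n_cols)]
    support = [tp[j] + fn[j] for j in range(n_cols)]  # true positives per class
    total_support = sum(support)
    return {
        "macro_f1": sum(per_class) / n_cols,
        "weighted_f1": (sum(f * s for f, s in zip(per_class, support)) / total_support
                        if total_support else 0.0),
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        "accuracy": exact / n_rows,
        "hamming_loss": wrong_bits / (n_rows * n_cols),
    }
```

Under this encoding, accuracy is the strictest of the five because a single wrong bit makes the whole row count as an error, whereas Hamming loss only charges for that one bit.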

What these numbers measure: how well the LLM reproduces the humans' agreement. The training corpus is therefore built from LLM labels whose reliability is publicly auditable. The CamemBERTv2 and XLM-RoBERTa classifiers are then trained on this LLM-annotated corpus and served here.

Tools & references

Annotation and validation pipeline: LLM_Tool (technical paper, OSF).

Growth of Political YouTube & TikTok (2006–today)

Political Content Detection Over Time
