Welcome to the YOUPOL Database
A database of francophone political influencers on YouTube and TikTok (2006–present), soon expanding to the anglophone world.
Over 69 channels tracked, thousands of transcribed videos, millions of comments and NLP annotations to analyze online political discourse. The database is continuously enriched by a distributed network of contributor machines. Join the network by deploying a worker on your machine, or contact us for access.
Understanding Political YouTube & TikTok Through Its Content
Previous research on political YouTube and TikTok was limited to metadata (titles, tags, view counts). YOUPOL goes further by analyzing what creators actually say.
By transcribing and annotating over 30,100 videos from more than 69 political channels (2006 to today), we built the first database enabling computational analysis of political discourse at the content level — from far-right ideology to scientific rhetoric, from hate speech to audience engagement.
Collect & Transcribe
Videos and more than 9 million comments are scraped. The audio is preprocessed with Demucs (source separation), transcribed with Whisper, and speaker-diarized with pyannote.audio.
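To illustrate how a transcript and a diarization can be combined, here is a minimal sketch that gives each transcribed segment the speaker whose turn overlaps it the most. The `Segment` and `Turn` classes, the `assign_speakers` function, and the maximum-overlap rule are assumptions made for this example; they do not describe the actual YOUPOL pipeline or the Whisper and pyannote.audio APIs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of audio (hypothetical structure)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarized speaker turn (hypothetical structure)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it most.

    Segments that overlap no turn get the speaker None.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        labelled.append((best_speaker, seg.text))
    return labelled


# Toy input, invented for the example.
segments = [
    Segment(0.0, 4.0, "Hello everyone."),
    Segment(4.0, 9.0, "Thanks for having me."),
]
turns = [
    Turn(0.0, 4.5, "SPEAKER_00"),
    Turn(4.5, 9.0, "SPEAKER_01"),
]
labelled = assign_speakers(segments, turns)
```

The second segment overlaps both turns, but most of it falls inside the second turn, so it is attributed to `SPEAKER_01`.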
Annotate & Classify
NLP classifiers detect far-right ideology, hate speech, scientific rhetoric, and political orientation at the sentence level. Annotation powered by LLM_Tool (technical paper).
Analyze & Visualize
Entity networks, co-occurrence graphs, OLS regressions, and temporal trends reveal the evolution of political discourse across two decades.
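As an illustration of what a co-occurrence graph contains, the sketch below counts how often two entities appear in the same sentence and stores the counts as weighted edges. The function name, the placeholder entity names, and the choice of the sentence as the co-occurrence window are assumptions made for this example, not a description of the YOUPOL implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entity_lists):
    """Count unordered entity pairs that appear in the same unit.

    entity_lists: one list of entity names per sentence (or per video).
    Returns a Counter mapping (entity_a, entity_b), with names in sorted
    order, to the number of units in which both appear.
    """
    edges = Counter()
    for entities in entity_lists:
        # set() so a repeated mention within one unit counts once.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Placeholder entities, invented for the example.
sentences = [
    ["Party A", "Party B"],
    ["Party A", "Party B", "Party C"],
    ["Party A", "Party A"],
]
edges = cooccurrence_edges(sentences)
```

Each key of `edges` is an edge of the graph and each value is its weight; a sentence that mentions a single entity contributes no edge.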
Across the Political Spectrum
The initial seed comprises 69 channels spanning the entire francophone political spectrum (France and Quebec), categorized by political orientation; coverage will soon expand to the anglophone world. The corpus deliberately oversamples far-right content to enable fine-grained analysis of radical discourse.
Growth of Political YouTube & TikTok (2006–today)
Number of videos published per year, by political orientation
Political Content Detection Over Time
Monthly share of content classified as political — transcriptions (sentences) and comments
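A monthly share of this kind can be computed by grouping classified items by month and dividing the number of political items by the total. The sketch below assumes each item is a pair of an ISO date string (`YYYY-MM-DD`) and a boolean classifier output; the function name and input format are hypothetical.

```python
from collections import defaultdict


def monthly_political_share(rows):
    """Return {"YYYY-MM": share of items classified as political}.

    rows: iterable of (iso_date, is_political) pairs, where iso_date is
    a "YYYY-MM-DD" string and is_political is a boolean.
    """
    totals = defaultdict(int)
    political = defaultdict(int)
    for iso_date, is_political in rows:
        month = iso_date[:7]  # keep the "YYYY-MM" prefix
        totals[month] += 1
        political[month] += int(is_political)
    return {month: political[month] / totals[month] for month in sorted(totals)}


# Toy rows, invented for the example.
rows = [
    ("2020-01-03", True),
    ("2020-01-20", False),
    ("2020-02-01", True),
]
shares = monthly_political_share(rows)
```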
Annotation protocol & validation
The training corpus is pre-annotated by a large language model (LLM). The LLM's reliability is first measured on a random sample that is also coded independently by two human researchers. The metrics below compare the LLM's predictions against the consensus of the two annotators (agreement > n/2). Once validated, the same LLM labels the rest of the corpus, on which the final classifiers are trained (CamemBERTv2 for sentences, XLM-RoBERTa for comments).
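One way to read the consensus rule is as a strict majority per label: a label enters the consensus when more than n/2 of the n annotators assigned it. The sketch below implements that reading. The function name, the label names, and the per-label interpretation itself are assumptions made for illustration.

```python
def consensus_labels(annotations, classes):
    """Keep each class assigned by strictly more than half the annotators.

    annotations: one set of labels per annotator, for a single item.
    classes: all possible labels.
    """
    n = len(annotations)
    return {
        label
        for label in classes
        if sum(label in labels for labels in annotations) > n / 2
    }


# Hypothetical label set and annotations for one sentence.
classes = ["political", "hate_speech", "scientific_rhetoric"]
annotator_1 = {"political", "hate_speech"}
annotator_2 = {"political"}
consensus = consensus_labels([annotator_1, annotator_2], classes)
```

With two annotators, n/2 equals 1, so under this reading a label is retained only when both annotators chose it.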
| Metric | Description | Sentences | Comments |
|---|---|---|---|
| Macro F1 | Unweighted mean across classes | 92.2% | 93.5% |
| Weighted F1 | Weighted by each class's support | 92.0% | 93.9% |
| Micro F1 | Global, across all decisions | 92.0% | 93.7% |
| Accuracy | % of LLM decisions identical to the human consensus | 88.1% | 88.9% |
| Hamming loss | % of mislabelled bits (lower is better) | 7.7% | 6.0% |
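For readers who want to see how these five metrics relate to one another, here is a minimal pure-Python sketch for multi-label predictions encoded as 0/1 vectors. It assumes that "Accuracy" means exact match of the whole label vector and that a "bit" is a single (item, class) decision; both readings, like the function name and the toy data, are assumptions for the example and may differ from the validation code actually used.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute F1 variants, exact-match accuracy and Hamming loss.

    y_true, y_pred: lists of equal-length 0/1 lists, one column per class.
    """
    n_samples = len(y_true)
    n_classes = len(y_true[0])
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for true_row, pred_row in zip(y_true, y_pred):
        for k, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[k] += 1
            elif p:
                fp[k] += 1
            elif t:
                fn[k] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = [f1(tp[k], fp[k], fn[k]) for k in range(n_classes)]
    support = [tp[k] + fn[k] for k in range(n_classes)]  # true positives per class
    total_support = sum(support)
    return {
        "macro_f1": sum(per_class) / n_classes,
        "weighted_f1": (
            sum(f * s for f, s in zip(per_class, support)) / total_support
            if total_support else 0.0
        ),
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n_samples,
        "hamming_loss": (sum(fp) + sum(fn)) / (n_samples * n_classes),
    }


# Toy data: 4 items, 2 classes (invented for the example).
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 1]]
metrics = multilabel_metrics(y_true, y_pred)
```

Under these assumptions, exact-match accuracy is the strictest of the five: an item counts as correct only when every one of its labels matches, which would be consistent with accuracy sitting below the F1 scores in the table.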
What these numbers measure: how well the LLM reproduces the humans' agreement. The training corpus is therefore built from LLM labels whose reliability is publicly auditable. The CamemBERTv2 and XLM-RoBERTa classifiers are then trained on this LLM-annotated corpus and served here.
Tools & references
Annotation and validation pipeline: LLM_Tool (technical paper, OSF).