Database | Political YouTube & TikTok Database

Videos in DB: 31,054
On platforms*: 39,880
Channels: (value missing)
Views covered: 1.79B
Comments: 9,870,075
Transcripts: 27,745
Speaker segments: 701,597
Period covered: 2006–2026

* Videos detected on YouTube and TikTok, including those awaiting processing.

Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.
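
As an illustration of what a longitudinal history makes possible, the sketch below computes the views gained between successive scans of one video. The `Scan` record and its fields are hypothetical: the shape of the API response is not described on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Scan:
    """One metadata snapshot of a video (hypothetical shape)."""
    scanned_on: date
    view_count: int


def view_deltas(scans: list[Scan]) -> list[tuple[date, int]]:
    """Return (scan date, views gained since the previous scan)."""
    ordered = sorted(scans, key=lambda s: s.scanned_on)
    return [
        (later.scanned_on, later.view_count - earlier.view_count)
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Illustrative values only, deliberately out of order.
history = [
    Scan(date(2026, 1, 10), 1_200),
    Scan(date(2026, 1, 3), 1_000),
    Scan(date(2026, 1, 17), 1_250),
]
deltas = view_deltas(history)
```

Sorting by scan date first means the function does not depend on the order in which snapshots are returned.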

Last updated: 2026-07-20 02:55

Today: videos transcribed, comments extracted, new comments, deleted comments, metadata updated, channels scanned.

Since January: videos transcribed, comments extracted, videos detected, new comments, metadata updated, channels scanned.

Videos by Year: total number of videos uploaded per year, 2006 to today.

Videos by Political Orientation: distribution across the four orientations.

Videos by Country: France vs. Quebec.

Views by Political Orientation Over Time: stacked area chart of total views per year, grouped by orientation.

Comments by Year: total comments on videos uploaded each year.

Average Duration by Orientation: mean video length in minutes.

Political Content Detection Over Time: monthly share of content classified as political, for transcriptions (sentences) and comments.

Sentences: CamemBERTv2 classifier (FR) · comments: XLM-RoBERTa classifier (multilingual) — trained on LLM annotations validated by human coders
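
A monthly share of this kind can be derived from per-item classifier predictions. The sketch below is one possible way to compute it, assuming each row carries a date and a boolean prediction; the function name and input shape are illustrative and not taken from the project's code.

```python
from collections import defaultdict
from datetime import date


def monthly_political_share(rows):
    """rows: iterable of (date, is_political) pairs -> {"YYYY-MM": share}."""
    totals = defaultdict(int)
    political = defaultdict(int)
    for day, is_political in rows:
        month = day.strftime("%Y-%m")
        totals[month] += 1
        political[month] += int(bool(is_political))
    return {month: political[month] / totals[month] for month in sorted(totals)}


# Illustrative rows only.
sample = [
    (date(2024, 3, 1), True),
    (date(2024, 3, 9), False),
    (date(2024, 3, 20), True),
    (date(2024, 3, 28), True),
    (date(2024, 4, 2), False),
]
shares = monthly_political_share(sample)
```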

Annotation protocol & validation

The training corpus is pre-annotated by a large language model (LLM). The LLM's reliability is first measured on a random sample that is also coded independently by two human researchers. The metrics below compare the LLM's predictions against the consensus of the two annotators: a label counts as consensus when more than n/2 of the n annotators agree, which with n = 2 means both. Once validated, the LLM labels the rest of the corpus, on which the final classifiers are trained (CamemBERTv2 for sentences, XLM-RoBERTa for comments).
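
The consensus rule (agreement > n/2) can be sketched as a strict-majority vote. This is a minimal illustration, not the project's code; in particular, returning `None` when no strict majority exists is an assumption, since the page does not say how disagreements are handled.

```python
from collections import Counter
from typing import Optional


def consensus(labels: list[bool]) -> Optional[bool]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


both_agree = consensus([True, True])    # strict majority: both annotators
split_vote = consensus([True, False])   # 1 vote is not more than 2 / 2
```

With two annotators the rule is equivalent to requiring full agreement; with three or more it becomes an ordinary majority vote.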

Metric	Description	Sentences	Comments
Macro F1	Unweighted mean across classes	92.2%	93.5%
Weighted F1	Mean weighted by each class's support	92.0%	93.9%
Micro F1	Global, across all decisions	92.0%	93.7%
Accuracy	Share of LLM decisions identical to the human consensus	88.1%	88.9%
Hamming loss	Share of mislabelled bits (lower is better)	7.7%	6.0%

Validation sample: 1,000 sentences · 998 comments (random)
Human annotators: 2 researchers, independent coding
Reference: human consensus (agreement > n/2)
Schema: binary (political / non-political)

What these numbers measure: how well the LLM reproduces the human consensus on the validation sample. The training corpus is therefore built from LLM labels whose reliability is publicly auditable. The CamemBERTv2 and XLM-RoBERTa classifiers are then trained on this LLM-annotated corpus and served here.
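
For readers who want to recompute metrics of this kind, here is a minimal sketch using the standard definitions. It treats each item as a tuple of 0/1 bits and averages F1 over bit positions, which is an assumption: the page lists the schema as binary and does not state how the figures were computed. With a single bit per item, Hamming loss reduces to 1 minus accuracy.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(y_true, y_pred):
    """Compare predictions with the reference; each item is a tuple of 0/1 bits."""
    n_items, n_bits = len(y_true), len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_bits)]  # per bit: tp, fp, fn
    exact = wrong_bits = 0
    for truth, pred in zip(y_true, y_pred):
        exact += int(truth == pred)
        for j, (t, p) in enumerate(zip(truth, pred)):
            wrong_bits += int(t != p)
            counts[j][0] += int(t == 1 and p == 1)
            counts[j][1] += int(t == 0 and p == 1)
            counts[j][2] += int(t == 1 and p == 0)
    per_bit = [f1(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]
    tp, fp, fn = (sum(col) for col in zip(*counts))
    total_support = sum(support)
    weighted = sum(f * s for f, s in zip(per_bit, support))
    return {
        "macro_f1": sum(per_bit) / n_bits,
        "weighted_f1": weighted / total_support if total_support else 0.0,
        "micro_f1": f1(tp, fp, fn),
        "accuracy": exact / n_items,          # exact match on all bits
        "hamming_loss": wrong_bits / (n_items * n_bits),
    }


# Illustrative labels only: human consensus vs. LLM predictions.
truth = [(1, 0), (1, 1), (0, 1), (0, 0)]
pred = [(1, 0), (1, 0), (0, 1), (1, 0)]
metrics = evaluate(truth, pred)
```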

Tools & references

Annotation and validation pipeline: LLM_Tool (technical paper, OSF).

The database is organized into six relational tables stored in PostgreSQL. Raw data (videos, comments, transcripts) are linked to sentence-level NLP annotations through the two processed tables, comments_processed and transcription_speakers_processed.
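
The links between tables can be summarized from the (FK) columns listed in the table descriptions on this page. The sketch below encodes them as a dictionary and walks from any table back to `videos`; the `FOREIGN_KEYS` mapping and `path_to_videos` helper are illustrative names, not part of the database.

```python
# child table -> (foreign-key column, parent table), as listed in the schema
FOREIGN_KEYS = {
    "comments": ("video_id", "videos"),
    "video_transcripts": ("video_id", "videos"),
    "transcription_speakers": ("video_id", "videos"),
    "comments_processed": ("comment_id", "comments"),
    "transcription_speakers_processed": ("segment_id", "transcription_speakers"),
}


def path_to_videos(table):
    """Follow foreign keys upward until a table with no parent is reached."""
    path = [table]
    while path[-1] in FOREIGN_KEYS:
        path.append(FOREIGN_KEYS[path[-1]][1])
    return path


join_path = path_to_videos("comments_processed")
```

Each step in the returned path corresponds to one join, so sentence-level annotations on comments reach video metadata in two joins.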

1. videos

Primary table; one row per video. Contains channel metadata, engagement metrics, upload date, political orientation, country and gender of the creator. 31,054 rows.

Column	Type	Description
video_id	VARCHAR (PK)	Video identifier
channel_name	VARCHAR	Channel name
title	TEXT	Video title
upload_date	DATE	Date the video was published
duration	INTEGER	Video length in seconds
view_count	BIGINT	Total number of views
like_count	INTEGER	Total number of likes
comment_count	INTEGER	Total number of comments
tags	JSONB	Video tags as JSON array
ideas	VARCHAR	Political orientation (Far_right, Left, Masc, Comp)
country	VARCHAR	Country of origin (FR, QC)
gender	VARCHAR	Gender of channel creator (H = male, F = female, Mixte = mixed)
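
A typical aggregation over this table is the number of videos and total views per orientation. The sketch below runs such a query against an in-memory SQLite database used as a stand-in for the PostgreSQL instance; only a subset of the columns is created, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL instance
conn.execute(
    "CREATE TABLE videos (video_id TEXT PRIMARY KEY, channel_name TEXT,"
    " upload_date TEXT, view_count INTEGER, ideas TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("v1", "chan_a", "2021-05-01", 1000, "Left", "FR"),
        ("v2", "chan_b", "2021-08-12", 4000, "Far_right", "FR"),
        ("v3", "chan_a", "2022-01-30", 500, "Left", "QC"),
    ],
)

# Video count and total views per political orientation.
rows = conn.execute(
    "SELECT ideas, COUNT(*), SUM(view_count) FROM videos"
    " GROUP BY ideas ORDER BY ideas"
).fetchall()
conn.close()
```

The SQL statement itself uses only standard syntax, so it should carry over to PostgreSQL unchanged.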

2. comments

All comments with author information, like counts, timestamps and nested reply structure. 9,870,075 rows.

Column	Type	Description
comment_id	VARCHAR (PK)	Unique comment identifier
video_id	VARCHAR (FK)	Reference to the video
author	VARCHAR	Comment author name
text	TEXT	Comment text
like_count	INTEGER	Number of likes on the comment
published_at	TIMESTAMP	When the comment was posted
parent_id	VARCHAR	Parent comment ID (for replies)
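
The `parent_id` column is enough to rebuild reply threads. The sketch below groups replies under their top-level comment, assuming that top-level comments store `None` in `parent_id` and that replies are nested one level deep; neither detail is stated on this page.

```python
from collections import defaultdict


def build_threads(comments):
    """Map each top-level comment_id to the list of its reply ids.

    comments: iterable of dicts with 'comment_id' and 'parent_id'
    (parent_id is assumed to be None for top-level comments).
    """
    replies = defaultdict(list)
    roots = []
    for comment in comments:
        if comment["parent_id"] is None:
            roots.append(comment["comment_id"])
        else:
            replies[comment["parent_id"]].append(comment["comment_id"])
    return {root: replies.get(root, []) for root in roots}


# Illustrative rows only.
sample_comments = [
    {"comment_id": "c1", "parent_id": None},
    {"comment_id": "c2", "parent_id": "c1"},
    {"comment_id": "c3", "parent_id": None},
    {"comment_id": "c4", "parent_id": "c1"},
]
threads = build_threads(sample_comments)
```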

3. video_transcripts

Full diarized transcripts with speaker labels and cleaned text versions. 27,745 rows.

Column	Type	Description
transcript_id	SERIAL (PK)	Auto-incremented identifier
video_id	VARCHAR (FK)	Reference to the video
raw_transcript	TEXT	Full diarized transcript with speaker labels
cleaned_transcript	TEXT	Cleaned version without speaker tags
language	VARCHAR	Detected language code
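
To show how a cleaned transcript relates to a raw one, the sketch below strips speaker tags with a regular expression. The exact layout of `raw_transcript` is not documented here, so the `SPEAKER_00: text` format (optionally bracketed) is an assumption, and the cleaning actually applied by the pipeline may differ.

```python
import re

# Assumed tag layout: "SPEAKER_00:" or "[SPEAKER_00]" before each turn.
SPEAKER_TAG = re.compile(r"\[?SPEAKER_\d+\]?:?\s*")


def strip_speaker_tags(raw: str) -> str:
    """Remove speaker tags and collapse whitespace into single spaces."""
    text = SPEAKER_TAG.sub("", raw)
    return " ".join(text.split())


raw = "SPEAKER_00: Bonjour à tous.\nSPEAKER_01: Merci."
cleaned = strip_speaker_tags(raw)
```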

4. transcription_speakers

Individual speaker segments from diarization, ordered by position within each video. 701,597 rows.

Column	Type	Description
segment_id	SERIAL (PK)	Auto-incremented identifier
video_id	VARCHAR (FK)	Reference to the video
speaker	VARCHAR	Speaker label (SPEAKER_00, SPEAKER_01, etc.)
text	TEXT	Transcribed speech segment
segment_order	INTEGER	Position in the video sequence
start_time	FLOAT	Segment start time in seconds
end_time	FLOAT	Segment end time in seconds
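
Because each segment carries a speaker label and start and end times, speaking time per speaker follows directly. A minimal sketch, with invented segments:

```python
from collections import defaultdict


def speaking_time(segments):
    """segments: iterable of (speaker, start_time, end_time) in seconds."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        # Guard against malformed rows where end precedes start.
        totals[speaker] += max(0.0, end - start)
    return dict(totals)


# Illustrative rows only, in segment_order.
segments = [
    ("SPEAKER_00", 0.0, 12.5),
    ("SPEAKER_01", 12.5, 20.0),
    ("SPEAKER_00", 20.0, 31.0),
]
totals = speaking_time(segments)
```

Note that speaker labels come from diarization and are assigned per video, so totals are only meaningful within a single `video_id`.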

5. comments_processed

Sentence-level tokenized comments with named-entity recognition (PER, ORG, LOC) stored as JSONB, plus CamemBERT model prediction columns. 9,870,075+ rows.

Column	Type	Description
id	SERIAL (PK)	Auto-incremented identifier
comment_id	VARCHAR (FK)	Reference to the comment
sentence	TEXT	Individual sentence from the comment
sentence_order	INTEGER	Position of the sentence in the comment
entities	JSONB	NER output: {PER: [], ORG: [], LOC: []}
[model]_prediction	BOOLEAN	Binary prediction from each classifier
[model]_confidence	FLOAT	Confidence score for each prediction
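
The prediction, confidence and entity columns can be combined, for example to count the people most often named in sentences a classifier flags with high confidence. In the sketch below, `political_prediction` and `political_confidence` are hypothetical instances of the `[model]_prediction` and `[model]_confidence` pattern, and the 0.9 threshold is arbitrary.

```python
import json
from collections import Counter


def top_entities(rows, kind="PER", min_confidence=0.9):
    """Count entities of one kind in confidently flagged sentences."""
    counts = Counter()
    for row in rows:
        flagged = row["political_prediction"]            # hypothetical column
        confident = row["political_confidence"] >= min_confidence
        if flagged and confident:
            entities = row["entities"]
            if isinstance(entities, str):  # JSONB may arrive as text
                entities = json.loads(entities)
            counts.update(entities.get(kind, []))
    return counts.most_common()


# Illustrative rows only.
sample_rows = [
    {"political_prediction": True, "political_confidence": 0.97,
     "entities": {"PER": ["Macron"], "ORG": [], "LOC": ["Paris"]}},
    {"political_prediction": True, "political_confidence": 0.95,
     "entities": '{"PER": ["Macron", "Le Pen"], "ORG": [], "LOC": []}'},
    {"political_prediction": True, "political_confidence": 0.55,
     "entities": {"PER": ["Legault"], "ORG": [], "LOC": []}},
    {"political_prediction": False, "political_confidence": 0.99,
     "entities": {"PER": ["Macron"], "ORG": [], "LOC": []}},
]
people = top_entities(sample_rows)
```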

6. transcription_speakers_processed

Sentence-level tokenized speaker segments with NER and hate-speech / scientific-rhetoric annotations from the CamemBERT classifiers. 701,597+ rows.

Column	Type	Description
id	SERIAL (PK)	Auto-incremented identifier
segment_id	VARCHAR (FK)	Reference to speaker segment
sentence	TEXT	Individual sentence from the segment
sentence_order	INTEGER	Position of the sentence in the segment
entities	JSONB	NER output: {PER: [], ORG: [], LOC: []}
[model]_prediction	BOOLEAN	Binary prediction from each classifier
[model]_confidence	FLOAT	Confidence score for each prediction

The YOUPOL Database