The Data

The YOUPOL Database

YOUPOL is a PostgreSQL database of 30,100 videos drawn from a seed of 69 channels that cover political content on YouTube and TikTok in France and Quebec. It includes transcripts, metadata, comments and NLP annotations, and is set to expand to the anglophone world.

Videos in DB: 30,100
On platforms*: 37,700
Channels: 69
Views Covered: 1.65B
Comments: 9,554,876
Transcripts: 28,121
Speaker Segments: 1,021,611
Period Covered: 2006–2026

* Videos detected on YouTube and TikTok, including those awaiting processing.

Continuous Observatory

The database is updated continuously through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan adds to a longitudinal history that is accessible via the API.

Last updated: 2026-06-03 00:43
Activity counters: today (videos transcribed, comments extracted); since January (videos transcribed, comments extracted, videos detected, metadata updated, channels scanned).

Videos by Year

Total number of videos uploaded per year, 2006 to today

Videos by Political Orientation

Distribution across the four orientations

Videos by Country

France vs. Quebec

Views by Political Orientation Over Time

Stacked area chart of total views per year, grouped by orientation

Comments by Year

Total comments on videos uploaded each year

Average Duration by Orientation

Mean video length in minutes

Political Content Detection Over Time

Monthly share of content classified as political — transcriptions (sentences) and comments

Sentences: CamemBERTv2 classifier (FR) · comments: XLM-RoBERTa classifier (multilingual) — trained on LLM annotations validated by human coders
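
A monthly share of this kind can be computed from per-item classifier outputs. The sketch below is a minimal illustration, not the project's pipeline: the (date, prediction) pairs are invented, and the share is simply the number of items predicted political divided by the number of items in that month.

```python
from collections import defaultdict
from datetime import date

# Invented (date, predicted political?) pairs, for illustration only.
predictions = [
    (date(2024, 1, 5), True),
    (date(2024, 1, 20), False),
    (date(2024, 1, 28), True),
    (date(2024, 2, 3), False),
    (date(2024, 2, 14), False),
    (date(2024, 2, 25), True),
    (date(2024, 2, 27), False),
]


def monthly_political_share(items):
    """Return {"YYYY-MM": share of items predicted political}, sorted by month."""
    counts = defaultdict(lambda: [0, 0])  # month -> [political, total]
    for day, is_political in items:
        month = day.strftime("%Y-%m")
        counts[month][0] += int(is_political)
        counts[month][1] += 1
    return {month: pol / total for month, (pol, total) in sorted(counts.items())}


shares = monthly_political_share(predictions)
for month, share in shares.items():
    print(month, round(share, 3))
```

The same function applies to transcript sentences and to comments, as long as each item carries a date and a binary prediction.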
Search: e.g. "immigration" in Content finds all videos mentioning it. Filters: Orientation, Country, Gender, Views, Duration, Political rate, Period.

The database is organized into 6 relational tables stored in PostgreSQL. Raw data (videos, comments, transcripts) is linked to sentence-level NLP annotations via processed tables.

1. videos

Primary table; one row per video. Contains channel metadata, engagement metrics, upload date, political orientation, country and gender of the creator. 30,100 rows.

Column | Type | Description
video_id | VARCHAR (PK) | Video identifier
channel_name | VARCHAR | Channel name
title | TEXT | Video title
upload_date | DATE | Date the video was published
duration | INTEGER | Video length in seconds
view_count | BIGINT | Total number of views
like_count | INTEGER | Total number of likes
comment_count | INTEGER | Total number of comments
tags | JSONB | Video tags as JSON array
ideas | VARCHAR | Political orientation (Far_right, Left, Masc, Comp)
country | VARCHAR | Country of origin (FR, QC)
gender | VARCHAR | Gender of channel creator (H, F, Mixte)
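
The per-year and per-orientation summaries charted above (video counts, total views, mean duration) can be derived from this table alone. The sketch below shows one way to do it and is not the project's own code: it uses an in-memory SQLite table as a stand-in for PostgreSQL, keeps only a subset of the documented columns, and the rows are invented for illustration. In PostgreSQL the year would more naturally come from EXTRACT(YEAR FROM upload_date).

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL `videos` table (subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE videos (
        video_id     TEXT PRIMARY KEY,
        channel_name TEXT,
        upload_date  TEXT,     -- ISO date string here; DATE in PostgreSQL
        duration     INTEGER,  -- seconds
        view_count   INTEGER,
        ideas        TEXT,
        country      TEXT
    )
    """
)

# Invented rows, for illustration only.
rows = [
    ("v1", "chan_a", "2021-03-01", 600, 1000, "Left", "FR"),
    ("v2", "chan_a", "2021-07-15", 1200, 3000, "Left", "FR"),
    ("v3", "chan_b", "2021-09-09", 300, 500, "Far_right", "QC"),
    ("v4", "chan_b", "2022-01-20", 900, 2500, "Far_right", "QC"),
]
conn.executemany("INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def summary_by_year_and_orientation(conn):
    """Video count, total views and mean duration (minutes) per year and orientation."""
    query = """
        SELECT substr(upload_date, 1, 4) AS year,
               ideas,
               COUNT(*)                  AS n_videos,
               SUM(view_count)           AS total_views,
               AVG(duration) / 60.0      AS mean_minutes
        FROM videos
        GROUP BY year, ideas
        ORDER BY year, ideas
    """
    return conn.execute(query).fetchall()


summary = summary_by_year_and_orientation(conn)
for row in summary:
    print(row)
```

Because duration is stored in seconds, the division by 60 is what turns the average into minutes.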

2. comments

All comments with author information, like counts, timestamps and nested reply structure. 9,554,876 rows.

Column | Type | Description
comment_id | VARCHAR (PK) | Unique comment identifier
video_id | VARCHAR (FK) | Reference to the video
author | VARCHAR | Comment author name
text | TEXT | Comment text
like_count | INTEGER | Number of likes on the comment
published_at | TIMESTAMP | When the comment was posted
parent_id | VARCHAR | Parent comment ID (for replies)
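
The parent_id column is what makes the reply structure recoverable. As a minimal sketch, assuming that top-level comments carry a NULL parent_id (an assumption, since the table description does not say so), replies can be grouped under their parent as follows. The rows are invented for illustration.

```python
from collections import defaultdict

# Invented comment rows using a subset of the documented columns.
comments = [
    {"comment_id": "c1", "video_id": "v1", "text": "Top-level", "parent_id": None},
    {"comment_id": "c2", "video_id": "v1", "text": "Reply to c1", "parent_id": "c1"},
    {"comment_id": "c3", "video_id": "v1", "text": "Another reply", "parent_id": "c1"},
    {"comment_id": "c4", "video_id": "v1", "text": "Second thread", "parent_id": None},
]


def group_threads(rows):
    """Map each parent comment_id to the list of comment_ids replying to it."""
    threads = {}
    replies = defaultdict(list)
    for row in rows:
        if row["parent_id"] is None:
            # Assumed convention: no parent means a top-level comment.
            threads[row["comment_id"]] = []
        else:
            replies[row["parent_id"]].append(row["comment_id"])
    for parent, children in replies.items():
        # A reply whose parent is absent from `rows` still gets an entry.
        threads.setdefault(parent, []).extend(children)
    return threads


threads = group_threads(comments)
print(threads)
```

Keeping orphaned replies under their parent id, rather than dropping them, makes partial extractions visible instead of silently shrinking the thread count.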

3. video_transcripts

Full diarized transcripts with speaker labels and cleaned text versions. 28,121 rows.

Column | Type | Description
transcript_id | SERIAL (PK) | Auto-incremented identifier
video_id | VARCHAR (FK) | Reference to the video
raw_transcript | TEXT | Full diarized transcript with speaker labels
cleaned_transcript | TEXT | Cleaned version without speaker tags
language | VARCHAR | Detected language code

4. transcription_speakers

Individual speaker segments from diarization, ordered by position within each video. 1,021,611 rows.

Column | Type | Description
segment_id | SERIAL (PK) | Auto-incremented identifier
video_id | VARCHAR (FK) | Reference to the video
speaker | VARCHAR | Speaker label (SPEAKER_00, SPEAKER_01, etc.)
text | TEXT | Transcribed speech segment
segment_order | INTEGER | Position in the video sequence
start_time | FLOAT | Segment start time in seconds
end_time | FLOAT | Segment end time in seconds
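
Since every segment carries a speaker label and start and end times, speaking time per speaker follows directly from the columns above. The sketch below is illustrative only and uses invented segments.

```python
from collections import defaultdict

# Invented segments using a subset of the documented columns.
segments = [
    {"video_id": "v1", "speaker": "SPEAKER_00", "segment_order": 0,
     "start_time": 0.0, "end_time": 12.5},
    {"video_id": "v1", "speaker": "SPEAKER_01", "segment_order": 1,
     "start_time": 12.5, "end_time": 20.0},
    {"video_id": "v1", "speaker": "SPEAKER_00", "segment_order": 2,
     "start_time": 20.0, "end_time": 30.0},
]


def speaking_time(rows):
    """Total seconds spoken per (video_id, speaker)."""
    totals = defaultdict(float)
    for seg in rows:
        totals[(seg["video_id"], seg["speaker"])] += seg["end_time"] - seg["start_time"]
    return dict(totals)


totals = speaking_time(segments)
print(totals)
```

Speaker labels such as SPEAKER_00 come from diarization and are only meaningful within one video, which is why the totals are keyed by video_id as well as speaker.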

5. comments_processed

Comments split into sentences, with named-entity recognition output (PER, ORG, LOC) stored as JSONB and prediction columns from the CamemBERT models. One row per sentence, so at least 9,554,876 rows.

Column | Type | Description
id | SERIAL (PK) | Auto-incremented identifier
comment_id | VARCHAR (FK) | Reference to the comment
sentence | TEXT | Individual sentence from the comment
sentence_order | INTEGER | Position of the sentence in the comment
entities | JSONB | NER output: {PER: [], ORG: [], LOC: []}
[model]_prediction | BOOLEAN | Binary prediction from each classifier
[model]_confidence | FLOAT | Confidence score for each prediction
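
The entities column holds one list per entity type, so mention counts can be obtained by walking the rows. The sketch below is a minimal example with invented rows and placeholder entity names. It accepts entities either as JSON text or as an already decoded dict, because some PostgreSQL drivers decode JSONB into Python objects automatically.

```python
import json
from collections import Counter

# Invented rows; entity names are placeholders, not real data.
rows = [
    {"comment_id": "c1", "sentence_order": 0,
     "entities": '{"PER": ["Person A"], "ORG": [], "LOC": ["Paris"]}'},
    {"comment_id": "c1", "sentence_order": 1,
     "entities": '{"PER": ["Person A", "Person B"], "ORG": ["Org X"], "LOC": []}'},
    {"comment_id": "c2", "sentence_order": 0,
     "entities": {"PER": [], "ORG": [], "LOC": ["Paris"]}},
]


def count_entities(rows, label):
    """Count mentions of each entity of type `label` ("PER", "ORG" or "LOC")."""
    counter = Counter()
    for row in rows:
        entities = row["entities"]
        if isinstance(entities, str):
            # Raw JSON text: decode it first.
            entities = json.loads(entities)
        counter.update(entities.get(label, []))
    return counter


print(count_entities(rows, "PER").most_common())
print(count_entities(rows, "LOC").most_common())
```

Counts here are per mention, so an entity named twice in one comment is counted twice.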

6. transcription_speakers_processed

Speaker segments split into sentences, with NER output and hate-speech and scientific-rhetoric annotations from the CamemBERT classifiers. One row per sentence, so at least 1,021,611 rows.

Column | Type | Description
id | SERIAL (PK) | Auto-incremented identifier
segment_id | INTEGER (FK) | Reference to speaker segment
sentence | TEXT | Individual sentence from the segment
sentence_order | INTEGER | Position of the sentence in the segment
entities | JSONB | NER output: {PER: [], ORG: [], LOC: []}
[model]_prediction | BOOLEAN | Binary prediction from each classifier
[model]_confidence | FLOAT | Confidence score for each prediction
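
Sentence-level annotations reach video-level attributes through two joins: sentence to speaker segment, then segment to video. The sketch below illustrates that path under stated assumptions and is not the project's code. It uses SQLite as a stand-in for PostgreSQL, invented rows, and the hypothetical column names political_prediction and political_confidence in place of the [model]_prediction and [model]_confidence placeholders. segment_id is stored as an integer here so that it matches the SERIAL key it references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE videos (video_id TEXT PRIMARY KEY, ideas TEXT);
    CREATE TABLE transcription_speakers (
        segment_id    INTEGER PRIMARY KEY,
        video_id      TEXT REFERENCES videos(video_id),
        speaker       TEXT,
        segment_order INTEGER
    );
    CREATE TABLE transcription_speakers_processed (
        id                   INTEGER PRIMARY KEY,
        segment_id           INTEGER REFERENCES transcription_speakers(segment_id),
        sentence             TEXT,
        sentence_order       INTEGER,
        political_prediction INTEGER,  -- hypothetical [model]_prediction column
        political_confidence REAL      -- hypothetical [model]_confidence column
    );
    -- Invented rows, for illustration only.
    INSERT INTO videos VALUES ('v1', 'Left'), ('v2', 'Far_right');
    INSERT INTO transcription_speakers VALUES
        (1, 'v1', 'SPEAKER_00', 0),
        (2, 'v1', 'SPEAKER_01', 1),
        (3, 'v2', 'SPEAKER_00', 0);
    INSERT INTO transcription_speakers_processed VALUES
        (1, 1, 'Sentence a.', 0, 1, 0.95),
        (2, 1, 'Sentence b.', 1, 0, 0.90),
        (3, 2, 'Sentence c.', 0, 1, 0.55),
        (4, 3, 'Sentence d.', 0, 1, 0.80),
        (5, 3, 'Sentence e.', 1, 1, 0.99);
    """
)


def positive_share_by_orientation(conn, min_confidence=0.7):
    """Share of positively classified sentences per orientation,
    keeping only predictions at or above `min_confidence`."""
    query = """
        SELECT v.ideas,
               COUNT(*)                    AS n_sentences,
               SUM(p.political_prediction) AS n_positive
        FROM transcription_speakers_processed AS p
        JOIN transcription_speakers AS s ON s.segment_id = p.segment_id
        JOIN videos AS v ON v.video_id = s.video_id
        WHERE p.political_confidence >= ?
        GROUP BY v.ideas
        ORDER BY v.ideas
    """
    return [
        (ideas, n, positive / n)
        for ideas, n, positive in conn.execute(query, (min_confidence,))
    ]


result = positive_share_by_orientation(conn)
print(result)
```

The confidence threshold is a design choice for the example: raising it trades coverage for precision, and the 0.7 default is arbitrary.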
Contact the Team
Have a question about the data, the API, or the project? Send us a message.
Suggest a Channel or Feature
Help us improve the YOUPOL corpus. Suggest a political YouTube or TikTok channel we should track, or a feature you'd like to see.