The YOUPOL Database
A PostgreSQL database of 30,100 videos* from a seed of 69 channels covering political content on YouTube and TikTok (France and Quebec), with transcripts, metadata, comments and NLP annotations. Coverage will soon expand to the anglophone world.
* Videos detected on YouTube and TikTok, including those awaiting processing.
Continuous Observatory
The database is continuously updated through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan adds to a longitudinal history that is accessible via the API.
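As an illustration of what a longitudinal history allows, the sketch below computes the views gained between successive scans of one video. The snapshot shape `(scan_date, view_count, like_count)` and the function name `view_growth` are assumptions made for this example; the actual API response format is not described on this page.

```python
from datetime import date

# Hypothetical scan snapshots for one video: (scan_date, view_count, like_count).
# This shape is assumed for illustration; it is not the documented API format.
snapshots = [
    (date(2024, 1, 1), 1000, 50),
    (date(2024, 1, 8), 1600, 70),
    (date(2024, 1, 15), 1900, 75),
]


def view_growth(history):
    """Return (scan_date, views gained since the previous scan) for each scan after the first."""
    ordered = sorted(history)  # tuples sort by scan_date first, i.e. chronologically
    return [
        (curr[0], curr[1] - prev[1])
        for prev, curr in zip(ordered, ordered[1:])
    ]


growth = view_growth(snapshots)
for scan_date, gained in growth:
    print(scan_date.isoformat(), gained)
```

The same pairwise difference applies to likes or subscriber counts if those are tracked per scan.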
Videos by Year
Total number of videos uploaded per year, 2006 to today
Videos by Political Orientation
Distribution across the four orientations
Videos by Country
France vs. Quebec
Views by Political Orientation Over Time
Stacked area chart of total views per year, grouped by orientation
Comments by Year
Total comments on videos uploaded each year
Average Duration by Orientation
Mean video length in minutes
Political Content Detection Over Time
Monthly share of content classified as political — transcriptions (sentences) and comments
Annotation protocol & validation
The training corpus is pre-annotated by a large language model (LLM). The LLM's reliability is first measured on a random sample that two human researchers also code independently. The metrics below compare the LLM's predictions against the annotators' consensus (a label is retained when more than n/2 of the n annotators assign it). Once validated, the LLM labels the rest of the corpus, on which the final classifiers are trained (CamemBERTv2 for sentences, XLM-RoBERTa for comments).
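The consensus rule can be sketched as a simple vote count. This is a minimal illustration that assumes each annotator assigns a set of labels to an item and that a label enters the consensus when strictly more than n/2 annotators chose it; the label names and the function `consensus` are hypothetical.

```python
def consensus(votes):
    """Return the labels chosen by strictly more than half of the annotators.

    votes: list of label sets, one per annotator, for a single item.
    """
    n = len(votes)
    counts = {}
    for labels in votes:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label for label, c in counts.items() if c > n / 2}


# Two hypothetical annotators coding the same sentence (label names are made up).
annotator_a = {"political", "hostile"}
annotator_b = {"political"}
gold = consensus([annotator_a, annotator_b])
print(gold)  # {'political'}
```

With two annotators, the threshold means a label is kept only when both assign it.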
| Metric | Definition | Sentences | Comments |
|---|---|---|---|
| Macro F1 | Unweighted mean across classes | 92.2% | 93.5% |
| Weighted F1 | Weighted by each class's support | 92.0% | 93.9% |
| Micro F1 | Global, across all decisions | 92.0% | 93.7% |
| Accuracy | % of LLM decisions identical to the human consensus | 88.1% | 88.9% |
| Hamming loss | % of mislabelled bits (lower is better) | 7.7% | 6.0% |
What these numbers measure: how well the LLM reproduces the humans' agreement. The training corpus is therefore built from LLM labels whose reliability is publicly auditable. The CamemBERTv2 and XLM-RoBERTa classifiers are then trained on this LLM-annotated corpus and served here.
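For readers who want to reproduce this kind of comparison, here is a minimal pure-Python sketch of the five metrics on toy multi-label data. It assumes each item is coded as a vector of 0/1 bits (one per class) and reads "Accuracy" as exact-match accuracy over whole vectors; that reading, the toy data and the function names are assumptions, not the project's validation code.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Compute F1 variants, exact-match accuracy and Hamming loss for 0/1 label vectors."""
    n_items, n_labels = len(y_true), len(y_true[0])
    per_label = []  # (tp, fp, fn) for each label column
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append((tp, fp, fn))

    scores = [f1(*counts) for counts in per_label]
    supports = [tp + fn for tp, _, fn in per_label]  # positives in the reference
    total_support = sum(supports)
    tp_all = sum(c[0] for c in per_label)
    fp_all = sum(c[1] for c in per_label)
    fn_all = sum(c[2] for c in per_label)
    wrong_bits = sum(
        t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels)
    )
    return {
        "macro_f1": sum(scores) / n_labels,
        "weighted_f1": (
            sum(s * w for s, w in zip(scores, supports)) / total_support
            if total_support else 0.0
        ),
        "micro_f1": f1(tp_all, fp_all, fn_all),
        "subset_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n_items,
        "hamming_loss": wrong_bits / (n_items * n_labels),
    }


# Toy data: human consensus vs. LLM predictions for 4 items and 2 classes.
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 0], [0, 1]]
metrics = multilabel_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Exact-match accuracy penalises an item as soon as one bit differs, whereas Hamming loss counts individual bits, which is why the two can diverge noticeably on the same predictions.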
Tools & references
Annotation and validation pipeline: LLM_Tool (technical paper, OSF).
The database is organized into 6 relational tables stored in PostgreSQL. Raw data (videos, comments, transcripts) is linked to sentence-level NLP annotations via processed tables.
1. videos
Primary table; one row per video. Contains channel metadata, engagement metrics, upload date, political orientation, country and gender of the creator. 30,100 rows.
| Column | Type | Description |
|---|---|---|
| video_id | VARCHAR (PK) | Video identifier |
| channel_name | VARCHAR | Channel name |
| title | TEXT | Video title |
| upload_date | DATE | Date the video was published |
| duration | INTEGER | Video length in seconds |
| view_count | BIGINT | Total number of views |
| like_count | INTEGER | Total number of likes |
| comment_count | INTEGER | Total number of comments |
| tags | JSONB | Video tags as JSON array |
| ideas | VARCHAR | Political orientation (Far_right, Left, Masc, Comp) |
| country | VARCHAR | Country of origin (FR, QC) |
| gender | VARCHAR | Gender of channel creator (H = male, F = female, Mixte = mixed) |
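
The aggregate charts above (videos, views and mean duration by orientation) correspond to simple `GROUP BY` queries on this table. The sketch below runs one against an in-memory SQLite stand-in with three made-up rows; the column names follow the table above, but the data and the use of SQLite instead of PostgreSQL are assumptions made so the example runs anywhere.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL `videos` table (subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE videos (
        video_id TEXT PRIMARY KEY,
        channel_name TEXT,
        upload_date TEXT,
        duration INTEGER,      -- seconds
        view_count INTEGER,
        ideas TEXT,
        country TEXT
    )
    """
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO videos VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("v1", "chan_a", "2021-03-01", 600, 1000, "Left", "FR"),
        ("v2", "chan_b", "2021-07-15", 1200, 5000, "Far_right", "FR"),
        ("v3", "chan_c", "2022-01-10", 300, 200, "Left", "QC"),
    ],
)

query = """
SELECT ideas,
       COUNT(*)               AS n_videos,
       SUM(view_count)        AS total_views,
       AVG(duration) / 60.0   AS mean_minutes
FROM videos
GROUP BY ideas
ORDER BY ideas
"""
by_orientation = conn.execute(query).fetchall()
for row in by_orientation:
    print(row)
```

The query itself is standard SQL and should carry over to PostgreSQL unchanged; grouping by year there would typically use `EXTRACT(YEAR FROM upload_date)`.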
2. comments
All comments with author information, like counts, timestamps and nested reply structure. 9,554,876 rows.
| Column | Type | Description |
|---|---|---|
| comment_id | VARCHAR (PK) | Unique comment identifier |
| video_id | VARCHAR (FK) | Reference to the video |
| author | VARCHAR | Comment author name |
| text | TEXT | Comment text |
| like_count | INTEGER | Number of likes on the comment |
| published_at | TIMESTAMP | When the comment was posted |
| parent_id | VARCHAR | Parent comment ID (for replies) |
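
The `parent_id` column makes the reply structure recoverable with a self-join. The sketch below counts replies per top-level comment on an in-memory SQLite stand-in; it assumes that top-level comments have a NULL `parent_id`, which this page does not state explicitly, and the rows are made up.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL `comments` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE comments (
        comment_id TEXT PRIMARY KEY,
        video_id TEXT,
        author TEXT,
        text TEXT,
        like_count INTEGER,
        published_at TEXT,
        parent_id TEXT          -- assumed NULL for top-level comments
    )
    """
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("c1", "v1", "alice", "Top-level comment", 3, "2024-01-01 10:00:00", None),
        ("c2", "v1", "bob", "Reply to c1", 1, "2024-01-01 10:05:00", "c1"),
        ("c3", "v1", "carol", "Another reply to c1", 0, "2024-01-01 10:07:00", "c1"),
        ("c4", "v1", "dave", "Second top-level comment", 2, "2024-01-02 09:00:00", None),
    ],
)

# Self-join: each top-level comment paired with the comments that point to it.
query = """
SELECT parent.comment_id, COUNT(reply.comment_id) AS n_replies
FROM comments AS parent
LEFT JOIN comments AS reply ON reply.parent_id = parent.comment_id
WHERE parent.parent_id IS NULL
GROUP BY parent.comment_id
ORDER BY parent.comment_id
"""
reply_counts = conn.execute(query).fetchall()
print(reply_counts)
```

The `LEFT JOIN` keeps top-level comments that received no replies, so they appear with a count of zero.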
3. video_transcripts
Full diarized transcripts with speaker labels and cleaned text versions. 28,121 rows.
| Column | Type | Description |
|---|---|---|
| transcript_id | SERIAL (PK) | Auto-incremented identifier |
| video_id | VARCHAR (FK) | Reference to the video |
| raw_transcript | TEXT | Full diarized transcript with speaker labels |
| cleaned_transcript | TEXT | Cleaned version without speaker tags |
| language | VARCHAR | Detected language code |
4. transcription_speakers
Individual speaker segments from diarization, ordered by position within each video. 1,021,611 rows.
| Column | Type | Description |
|---|---|---|
| segment_id | SERIAL (PK) | Auto-incremented identifier |
| video_id | VARCHAR (FK) | Reference to the video |
| speaker | VARCHAR | Speaker label (SPEAKER_00, SPEAKER_01, etc.) |
| text | TEXT | Transcribed speech segment |
| segment_order | INTEGER | Position in the video sequence |
| start_time | FLOAT | Segment start time in seconds |
| end_time | FLOAT | Segment end time in seconds |
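
Because each segment carries `start_time` and `end_time`, speaking time per speaker is a direct aggregation. The following sketch uses an in-memory SQLite stand-in with made-up segments; only the column names come from the table above.

```python
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL `transcription_speakers` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transcription_speakers (
        segment_id INTEGER PRIMARY KEY,
        video_id TEXT,
        speaker TEXT,
        text TEXT,
        segment_order INTEGER,
        start_time REAL,        -- seconds
        end_time REAL           -- seconds
    )
    """
)
# Made-up segments for illustration only.
conn.executemany(
    "INSERT INTO transcription_speakers VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "v1", "SPEAKER_00", "Hello and welcome.", 0, 0.0, 4.5),
        (2, "v1", "SPEAKER_01", "Thanks for having me.", 1, 4.5, 6.0),
        (3, "v1", "SPEAKER_00", "Let's get started.", 2, 6.0, 12.0),
    ],
)

query = """
SELECT speaker,
       COUNT(*)                   AS n_segments,
       SUM(end_time - start_time) AS seconds_spoken
FROM transcription_speakers
WHERE video_id = ?
GROUP BY speaker
ORDER BY seconds_spoken DESC
"""
speaking_time = conn.execute(query, ("v1",)).fetchall()
for row in speaking_time:
    print(row)
```

Ordering by `segment_order` instead of aggregating would reconstruct the conversation in sequence.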
5. comments_processed
Comments split into sentences, with named-entity recognition output (PER, ORG, LOC) stored as JSONB, plus classifier prediction columns. 9,554,876+ rows.
| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Auto-incremented identifier |
| comment_id | VARCHAR (FK) | Reference to the comment |
| sentence | TEXT | Individual sentence from the comment |
| sentence_order | INTEGER | Position of the sentence in the comment |
| entities | JSONB | NER output: {PER: [], ORG: [], LOC: []} |
| [model]_prediction | BOOLEAN | Binary prediction from each classifier |
| [model]_confidence | FLOAT | Confidence score for each prediction |
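
Once rows are fetched, the `entities` JSONB value and the prediction columns can be combined client-side. The sketch below counts entity mentions in sentences that a classifier flagged, with an optional confidence floor. The column names `political_prediction` and `political_confidence` are hypothetical instances of the `[model]_prediction` / `[model]_confidence` pattern, and the rows and entity names are made up.

```python
import json
from collections import Counter

# Made-up rows shaped like `comments_processed`; `entities` is JSON text here,
# as a database driver might return it. The `political_*` names are hypothetical.
rows = [
    {
        "comment_id": "c1",
        "sentence_order": 0,
        "entities": '{"PER": ["Person A"], "ORG": [], "LOC": ["City Y"]}',
        "political_prediction": True,
        "political_confidence": 0.97,
    },
    {
        "comment_id": "c1",
        "sentence_order": 1,
        "entities": '{"PER": [], "ORG": [], "LOC": []}',
        "political_prediction": False,
        "political_confidence": 0.88,
    },
    {
        "comment_id": "c2",
        "sentence_order": 0,
        "entities": '{"PER": ["Person A"], "ORG": ["Org X"], "LOC": []}',
        "political_prediction": True,
        "political_confidence": 0.55,
    },
]


def entity_counts(rows, entity_type, min_confidence=0.0):
    """Count entities of one type in sentences predicted positive above a confidence floor."""
    counts = Counter()
    for row in rows:
        if not row["political_prediction"]:
            continue
        if row["political_confidence"] < min_confidence:
            continue
        entities = json.loads(row["entities"])
        counts.update(entities.get(entity_type, []))
    return counts


per_all = entity_counts(rows, "PER")
per_confident = entity_counts(rows, "PER", min_confidence=0.9)
print(per_all)
print(per_confident)
```

Raising `min_confidence` trades coverage for precision, which can matter when low-confidence predictions dominate a slice of the data.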
6. transcription_speakers_processed
Sentence-level tokenized speaker segments with NER and hate-speech / scientific-rhetoric annotations from the CamemBERT classifiers. 1,021,611+ rows.
| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Auto-incremented identifier |
| segment_id | VARCHAR (FK) | Reference to speaker segment |
| sentence | TEXT | Individual sentence from the segment |
| sentence_order | INTEGER | Position of the sentence in the segment |
| entities | JSONB | NER output: {PER: [], ORG: [], LOC: []} |
| [model]_prediction | BOOLEAN | Binary prediction from each classifier |
| [model]_confidence | FLOAT | Confidence score for each prediction |
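
A monthly share of flagged sentences, similar in spirit to the detection chart above, can be derived by joining the processed sentences back to their segments and videos. The sketch below does this on an in-memory SQLite stand-in. The column `political_prediction` is a hypothetical instance of `[model]_prediction`, the rows are made up, grouping by upload month is an assumption, and `segment_id` is declared as an integer in both toy tables for simplicity.

```python
import sqlite3

# In-memory SQLite stand-in for three of the PostgreSQL tables (subset of columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE videos (video_id TEXT PRIMARY KEY, upload_date TEXT);
    CREATE TABLE transcription_speakers (
        segment_id INTEGER PRIMARY KEY,
        video_id TEXT
    );
    CREATE TABLE transcription_speakers_processed (
        id INTEGER PRIMARY KEY,
        segment_id INTEGER,
        sentence TEXT,
        sentence_order INTEGER,
        political_prediction INTEGER,   -- hypothetical [model]_prediction, 0 or 1
        political_confidence REAL       -- hypothetical [model]_confidence
    );
    """
)
# Made-up rows for illustration only.
conn.executemany("INSERT INTO videos VALUES (?, ?)",
                 [("v1", "2023-01-10"), ("v2", "2023-02-05")])
conn.executemany("INSERT INTO transcription_speakers VALUES (?, ?)",
                 [(1, "v1"), (2, "v2")])
conn.executemany(
    "INSERT INTO transcription_speakers_processed VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "First sentence.", 0, 1, 0.90),
        (2, 1, "Second sentence.", 1, 0, 0.80),
        (3, 2, "Third sentence.", 0, 1, 0.95),
        (4, 2, "Fourth sentence.", 1, 1, 0.70),
    ],
)

# Share of sentences flagged by the classifier, per upload month.
query = """
SELECT strftime('%Y-%m', v.upload_date) AS month,
       AVG(p.political_prediction)      AS political_share,
       COUNT(*)                         AS n_sentences
FROM transcription_speakers_processed AS p
JOIN transcription_speakers AS s ON s.segment_id = p.segment_id
JOIN videos AS v ON v.video_id = s.video_id
GROUP BY month
ORDER BY month
"""
monthly_share = conn.execute(query).fetchall()
for row in monthly_share:
    print(row)
```

In PostgreSQL the month bucket would typically be written with `date_trunc('month', v.upload_date)` or `to_char(v.upload_date, 'YYYY-MM')`, and a BOOLEAN prediction column would need a cast before averaging.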