Step 2 of 7

Data Collection

Continuous metadata, comment and audio extraction via yt-dlp

Last updated: 2026-06-03 00:43
30,100 Videos collected
9.6M+ Comments extracted
28,121 Videos transcribed
Data Flow

Continuous Update Pipeline

Input: Monitored YouTube & TikTok channels
Process: yt-dlp + continuous scanner
Output: Audio + metadata + comments → PostgreSQL
Methodology

How It Works

The scanner regularly checks channels to detect newly published videos. Each video's metadata (views, likes, subscribers) is recorded at each scan, producing a longitudinal history.
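
The scan-and-record behaviour described above can be sketched as follows. This is a minimal in-memory illustration, not the project's scanner: the `ScanState` class, the `record_scan` function and the dictionary keys (`video_id`, `views`, `likes`, `subscribers`) are all hypothetical names chosen for the example, and a real deployment would write to PostgreSQL rather than to Python lists.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScanState:
    """In-memory stand-in for the tables a scanner would write to (hypothetical)."""
    known_video_ids: set = field(default_factory=set)
    history: list = field(default_factory=list)  # one row per (video, scan)


def record_scan(state, listing, scanned_at=None):
    """Append one snapshot per video and return IDs seen for the first time.

    `listing` is a list of dicts with assumed keys:
    video_id, views, likes, subscribers.
    """
    scanned_at = scanned_at or datetime.now(timezone.utc)
    new_ids = []
    for item in listing:
        vid = item["video_id"]
        if vid not in state.known_video_ids:
            state.known_video_ids.add(vid)
            new_ids.append(vid)
        # Snapshots are appended, never overwritten, so history accumulates.
        state.history.append({
            "video_id": vid,
            "scanned_at": scanned_at,
            "views": item["views"],
            "likes": item["likes"],
            "subscribers": item["subscribers"],
        })
    return new_ids
```

Calling `record_scan` twice on the same channel returns only the videos that appeared between the two scans, while `state.history` keeps one row per video per scan. That append-only layout is what makes a longitudinal view of the counters possible.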

For each detected video, audio is downloaded in WAV format and comments are extracted via yt-dlp. Automatic cookie rotation handles authentication and access restrictions.
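
As an illustration of the download step, the sketch below builds a yt-dlp options dictionary that requests WAV audio and comment extraction, then tries a list of cookie files in turn. The option keys (`format`, `outtmpl`, `getcomments`, `cookiefile`, `postprocessors`) are standard yt-dlp options; the function names, the output directory and the simple "try the next cookie file on failure" rotation are assumptions made for the example and may differ from the rotation logic the pipeline actually uses. Audio conversion also assumes ffmpeg is installed.

```python
def build_ydl_options(cookie_file, output_dir="audio"):
    """Return yt-dlp options for WAV audio plus comment extraction."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
        "getcomments": True,          # comments are returned in info["comments"]
        "cookiefile": cookie_file,    # Netscape-format cookie file
        "postprocessors": [
            # Convert the downloaded audio stream to WAV (requires ffmpeg).
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
        "quiet": True,
    }


def fetch_video(url, cookie_files):
    """Try each cookie file in turn until one download succeeds (assumed strategy)."""
    import yt_dlp  # imported lazily so the option builder works without it

    last_error = None
    for cookie_file in cookie_files:
        try:
            with yt_dlp.YoutubeDL(build_ydl_options(cookie_file)) as ydl:
                return ydl.extract_info(url, download=True)
        except yt_dlp.utils.DownloadError as exc:
            last_error = exc  # rotate to the next cookie file
    raise RuntimeError(f"all cookie files failed for {url}") from last_error
```

Keeping the option builder separate from the network call makes the configuration easy to inspect and test without contacting any platform.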

All data is stored in a normalized PostgreSQL database: video metadata, comments, transcripts, speaker segments and NLP annotations. The REST API (PostgREST) provides programmatic access to the data.
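
To give a sense of what programmatic access through PostgREST looks like, the sketch below builds a query URL against the `videos` table using PostgREST's standard filter syntax (`select`, `column=gte.value`, `order`, `limit`). The base URL is a placeholder and the column names `video_id`, `views` and `likes` are assumptions; the actual endpoint and column names should be taken from the project's API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.org"  # placeholder, not the project's real endpoint


def videos_query_url(min_views, limit=10, base_url=BASE_URL):
    """Build a PostgREST URL listing the most-viewed videos above a threshold."""
    params = {
        "select": "video_id,views,likes",   # assumed column names
        "views": f"gte.{min_views}",        # PostgREST filter: views >= min_views
        "order": "views.desc",
        "limit": limit,
    }
    # Keep commas unescaped so the select list stays readable.
    return f"{base_url}/videos?{urlencode(params, safe=',')}"
```

The resulting URL can be fetched with any HTTP client; PostgREST returns the matching rows as a JSON array.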


Metadata (channel name, views, comments, subscribers, etc.) and video audio files are extracted using the yt-dlp library. The pipeline regularly scans channels to detect new videos, extract comments and update metadata. Each detected video is automatically downloaded and integrated into the corpus.

The initial batch of over 15 TB of audio data was processed on Digital Research Alliance of Canada infrastructure. Processing has since moved to a network of collaborators' local machines, which allows continuous operation: new videos are processed within minutes of detection.

Technology Stack

Tools Used

yt-dlp
Python
PostgreSQL
PostgREST
Data Architecture

Database Schema

Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.

# | Table | Description | Scale
1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows
2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows
3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows
4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows
5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows
6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows
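
The nested reply structure mentioned for the comments table can be illustrated with a short sketch that groups a flat comment list into threads. It assumes each comment carries an `id` and a `parent` field, with the value `"root"` marking top-level comments, which is modeled on the flat comment list yt-dlp returns for YouTube. The field names, the one-level nesting and the `build_threads` helper are assumptions for the example and do not describe the actual column layout of the table.

```python
from collections import defaultdict


def build_threads(comments):
    """Group a flat comment list into top-level comments with their replies."""
    replies = defaultdict(list)
    top_level = []
    for comment in comments:
        if comment["parent"] == "root":
            top_level.append(comment)
        else:
            # Replies are keyed by the ID of the comment they answer.
            replies[comment["parent"]].append(comment)
    return [
        {"comment": comment, "replies": replies.get(comment["id"], [])}
        for comment in top_level
    ]
```

Storing the parent reference on each row, rather than nesting replies inside their parent, keeps the table flat while still allowing threads to be rebuilt on demand.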

Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.

Technical Paper
Contact the Team
Have a question about the data, the API, or the project? Send us a message.
Suggest a Channel or Feature
Help us improve the YOUPOL corpus. Suggest a political YouTube or TikTok channel we should track, or a feature you'd like to see.