Step 4 of 7

Transcription & Diarization

Speaker diarization with pyannote.audio and transcription with Whisper

Last updated: 2026-06-03 00:43
28,121 Videos transcribed
1.0M+ Speaker segments

Data Flow

Continuous Update Pipeline

Input: Isolated vocal tracks
Process: pyannote.audio 3.1 + Whisper (MLX)
Output: Structured transcripts → PostgreSQL

Methodology

How It Works

Speaker diarization is performed with pyannote.audio, which identifies the different speakers in each video and marks where each speaker's turn begins and ends.
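A minimal sketch of how diarization output could be flattened into speaker turns. The checkpoint name `pyannote/speaker-diarization-3.1` and the `use_auth_token` argument are assumptions based on pyannote.audio 3.1 and may differ in other versions; `Turn`, `turns_from_annotation`, and `diarize` are hypothetical names, not the project's own code.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # diarization label, e.g. "SPEAKER_00"


def turns_from_annotation(annotation) -> list:
    """Flatten a pyannote Annotation into a time-ordered list of turns."""
    turns = [
        Turn(segment.start, segment.end, label)
        for segment, _track, label in annotation.itertracks(yield_label=True)
    ]
    return sorted(turns, key=lambda t: (t.start, t.end))


def diarize(audio_path: str, hf_token: str) -> list:
    """Run diarization on one audio file (requires pyannote.audio and a model token)."""
    # Imported lazily so turns_from_annotation works without pyannote installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # assumed checkpoint name
        use_auth_token=hf_token,             # assumed argument name for 3.1
    )
    return turns_from_annotation(pipeline(audio_path))
```

Keeping the flattening step separate from the model call makes the turn list easy to inspect or test without loading the model.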

Whisper, OpenAI's open-source transcription model, generates transcripts of the audio tracks. Each transcribed segment is then assigned to the speaker identified for that time span during diarization.
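One common way to attach speakers to transcribed segments is to pick the speaker whose turns overlap each segment the most. The sketch below illustrates that rule; it is an assumption about the matching strategy, not a description of the project's actual code, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeakerTurn:
    start: float
    end: float
    speaker: str


@dataclass
class TextSegment:
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each text segment with the speaker sharing the most time with it."""
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # Segments that overlap no turn keep a placeholder label.
        seg.speaker = max(totals, key=totals.get) if totals else unknown
    return segments
```

Summing overlap per speaker, rather than taking the single longest turn, handles segments that span several short turns by the same speaker.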

The transcripts are structured with timestamps and speaker identifiers, which keeps text, diarization, and audio aligned. The finished transcripts are then loaded into the project's PostgreSQL database, hosted on a private server.
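As an illustration of the structuring step, the sketch below turns speaker-labelled segments into ordered rows carrying timestamps, a speaker identifier, and a position within the video. Merging consecutive segments from the same speaker is an assumption made for the example, and the `Row` fields and `build_rows` are hypothetical names rather than the project's schema.

```python
from typing import NamedTuple


class Row(NamedTuple):
    video_id: str
    position: int   # order of the speaker segment within the video
    speaker: str
    start: float    # seconds
    end: float      # seconds
    text: str


def build_rows(video_id, labelled):
    """Build ordered rows from (start, end, speaker, text) tuples.

    Consecutive segments by the same speaker are merged into one row.
    """
    rows = []
    for start, end, speaker, text in sorted(labelled):
        text = text.strip()
        if rows and rows[-1].speaker == speaker:
            prev = rows[-1]
            rows[-1] = prev._replace(
                end=max(prev.end, end),
                text=f"{prev.text} {text}",
            )
        else:
            rows.append(Row(video_id, len(rows), speaker, start, end, text))
    return rows
```

Each resulting row can be mapped directly to one record in a speaker-segment table.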

Diarization and transcription run continuously, with GPU acceleration, on a shared network of the project researchers' machines.



Technology Stack

Tools Used

pyannote.audio 3.1
Whisper (MLX)
WhisperX

Data Architecture

Database Schema

Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.

| # | Table | Description | Scale |
|---|-------|-------------|-------|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
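To make the relationship between `video_transcripts` and `transcription_speakers` concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. Only the two table names come from the schema above; every column name, the sample data, and the `segments_for` helper are hypothetical.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; all column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE video_transcripts (
    video_id   TEXT PRIMARY KEY,
    transcript TEXT NOT NULL
);
CREATE TABLE transcription_speakers (
    video_id TEXT    NOT NULL REFERENCES video_transcripts(video_id),
    position INTEGER NOT NULL,
    speaker  TEXT    NOT NULL,
    start_s  REAL    NOT NULL,
    end_s    REAL    NOT NULL,
    text     TEXT    NOT NULL,
    PRIMARY KEY (video_id, position)
);
""")

conn.execute(
    "INSERT INTO video_transcripts VALUES (?, ?)",
    ("vid1", "SPEAKER_00: Hello. SPEAKER_01: Hi."),
)
conn.executemany(
    "INSERT INTO transcription_speakers VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("vid1", 1, "SPEAKER_01", 2.0, 3.0, "Hi."),
        ("vid1", 0, "SPEAKER_00", 0.0, 2.0, "Hello."),
    ],
)


def segments_for(connection, video_id):
    """Return (position, speaker, text) for one video, in speaking order."""
    return connection.execute(
        "SELECT position, speaker, text FROM transcription_speakers "
        "WHERE video_id = ? ORDER BY position",
        (video_id,),
    ).fetchall()
```

Keying segments on the video identifier plus position keeps one row per speaker segment while preserving the order in which the segments occur.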

Continuous Observatory

The database is updated continuously through channel scanning, video transcription and annotation, comment extraction, and metadata updates (views, likes, subscribers). Each scan adds to a longitudinal history that is accessible via the API.
