Step 4 of 7

Transcription & Diarization

Speaker diarization with pyannote.audio and transcription with Whisper

                
Last updated: 2026-06-03 00:43

28,121 Videos transcribed

1.0M+ Speaker segments

Input: Isolated vocal tracks

Process: pyannote.audio 3.1 + Whisper (MLX)

Output: Structured transcripts → PostgreSQL

Speaker diarization is performed with pyannote.audio, which identifies the different speakers in each video and delimits the start and end of each of their turns.
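As an illustration only, the sketch below shows one way diarization output could be collected into speaker turns. The model name `pyannote/speaker-diarization-3.1` and the `use_auth_token` argument follow pyannote.audio 3.1 conventions; the `Turn` class, the `diarize` wrapper, and the `merge_adjacent` step with its `max_gap` threshold are hypothetical helpers, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One continuous stretch of speech by a single speaker (hypothetical type)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str  # e.g. "SPEAKER_00"


def diarize(audio_path: str, hf_token: str) -> list[Turn]:
    """Run pyannote.audio on one file and return its speaker turns."""
    # Imported lazily so the pure-Python helper below works without pyannote.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,  # argument name used by pyannote.audio 3.1
    )
    annotation = pipeline(audio_path)
    return [
        Turn(segment.start, segment.end, label)
        for segment, _, label in annotation.itertracks(yield_label=True)
    ]


def merge_adjacent(turns: list[Turn], max_gap: float = 0.5) -> list[Turn]:
    """Merge consecutive turns by the same speaker separated by a short pause.

    This post-processing step and the 0.5 s default are assumptions.
    """
    merged: list[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start):
        last = merged[-1] if merged else None
        if last and last.speaker == turn.speaker and turn.start - last.end <= max_gap:
            merged[-1] = Turn(last.start, max(last.end, turn.end), last.speaker)
        else:
            merged.append(turn)
    return merged
```

Merging short same-speaker gaps is optional; it simply yields longer, more readable turns before transcription is aligned to them.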

Whisper, OpenAI's open-source transcription model, is used to generate transcripts of the audio tracks. Each transcribed segment is then assigned to the speaker identified for that time span during diarization.
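The page does not say how segments are matched to speakers. A common approach, sketched here as an assumption, is to give each transcribed segment the speaker whose diarization turn overlaps it the most in time. The dictionary keys `start`, `end`, and `text` mirror the segment fields Whisper typically returns; the `speaker` key and both function names are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each transcribed segment with the speaker of the best-overlapping turn.

    Segments that overlap no turn at all get speaker None.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


segments = [
    {"start": 0.0, "end": 2.0, "text": "hello"},
    {"start": 2.0, "end": 5.0, "text": "hi there"},
]
turns = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 6.0, "speaker": "SPEAKER_01"},
]
labeled = assign_speakers(segments, turns)
```

In this example the second segment straddles both turns but shares 2.5 s with `SPEAKER_01` and only 0.5 s with `SPEAKER_00`, so it is attributed to `SPEAKER_01`.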

The transcripts are structured with timestamps and speaker identifiers, ensuring alignment between text, diarization, and audio. The finished transcripts are then loaded into the project's PostgreSQL database, which is hosted on a private server.
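To make the structuring step concrete, here is a minimal sketch of turning speaker-labeled segments into ordered rows and storing them. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the real pipeline targets PostgreSQL. The table name `transcription_speakers` comes from the schema on this page, but every column name and the `store_segments` function are assumptions.

```python
import sqlite3


def store_segments(conn: sqlite3.Connection, video_id: str, labeled: list[dict]) -> int:
    """Write one video's speaker segments as rows ordered by position."""
    # Hypothetical columns; the project's actual schema is not published here.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS transcription_speakers (
            video_id TEXT,
            position INTEGER,
            speaker  TEXT,
            start_s  REAL,
            end_s    REAL,
            text     TEXT,
            PRIMARY KEY (video_id, position)
        )
        """
    )
    ordered = sorted(labeled, key=lambda s: s["start"])
    rows = [
        (video_id, position, s["speaker"], s["start"], s["end"], s["text"].strip())
        for position, s in enumerate(ordered)
    ]
    # INSERT OR REPLACE keeps reruns idempotent in SQLite; on PostgreSQL the
    # equivalent would be INSERT ... ON CONFLICT ... DO UPDATE.
    conn.executemany(
        "INSERT OR REPLACE INTO transcription_speakers VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = store_segments(
    conn,
    "vid1",
    [
        {"speaker": "SPEAKER_01", "start": 2.0, "end": 5.0, "text": " hi there "},
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": "hello"},
    ],
)
```

Keying rows on the video and the segment position means reprocessing a video overwrites its earlier rows instead of duplicating them.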

Diarization and transcription run continuously on the project researchers' collaborative machine network, with GPU acceleration.



Tools Used

pyannote.audio 3.1

Whisper (MLX)

WhisperX

Database Schema

#	Table	Description	Scale
1	videos	One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender.	26,396 rows
2	comments	All comments with author info, like counts, timestamps, nested reply structure, and JSONB analysis column.	9.6M+ rows
3	video_transcripts	Full diarized transcripts with speaker labels and cleaned text versions.	28,121 rows
4	transcription_speakers	Individual speaker segments from diarization, ordered by position within each video.	1,021,611 rows
5	comments_processed	Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns.	15.3M+ rows
6	transcription_speakers_processed	Sentence-level speaker segments with NER extraction and full annotation suite.	4.8M+ rows


Continuous Observatory

The database is continuously updated: channel scanning, video transcription and annotation, comment extraction, metadata updates (views, likes, subscribers). Each scan produces a longitudinal history accessible via the API.

