Transcription & Diarization
Speaker diarization with pyannote.audio and transcription with Whisper
Continuous Update Pipeline
How It Works
Speaker diarization is performed with pyannote.audio, which identifies the different speakers in each video and marks the start and end of each speaking turn.
Whisper, OpenAI's open-source transcription model, is used to generate transcripts of the audio tracks. Each transcribed segment is then assigned to the corresponding speaker identified during diarization.
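The assignment step can be illustrated with a minimal sketch. It assumes that diarization yields speaker turns and that transcription yields timed text segments, both reduced here to plain tuples; the maximum-overlap rule and the names `Turn`, `Segment`, and `assign_speakers` are illustrative assumptions, not the project's actual code.

```python
from typing import List, NamedTuple, Optional, Tuple


class Turn(NamedTuple):
    """A diarization turn (hypothetical shape): who spoke, and when."""
    start: float
    end: float
    speaker: str


class Segment(NamedTuple):
    """A transcribed segment (hypothetical shape): what was said, and when."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the shared duration of two intervals in seconds (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Segment, Optional[str]]]:
    """Label each segment with the speaker whose turn overlaps it the most.

    Segments that overlap no turn at all are labeled None.
    """
    labeled = []
    for seg in segments:
        best_speaker = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn.speaker
        labeled.append((seg, best_speaker))
    return labeled


# Toy data standing in for real diarization and transcription output.
turns = [Turn(0.0, 4.0, "SPEAKER_00"), Turn(4.0, 9.0, "SPEAKER_01")]
segments = [
    Segment(0.5, 3.5, "Good evening."),
    Segment(3.0, 8.0, "Thank you for having me."),
    Segment(10.0, 11.0, "(music)"),
]
labeled = assign_speakers(segments, turns)
for seg, speaker in labeled:
    print(speaker, seg.text)
```

The second segment straddles both turns but shares more time with the second one, so it is attributed to that speaker. Other rules, such as matching on the segment midpoint, are equally plausible choices.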
The transcripts are structured with timestamps and speaker identifiers, ensuring alignment between text, diarization, and audio. The completed transcripts are then aggregated into the project's SQL database, which is stored on a private server.
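As a rough illustration of what a structured transcript line might look like, the sketch below pairs each piece of text with its timestamps and speaker identifier. The `TranscriptLine` fields and the rendered format are assumptions made for the example; the actual storage format is not described here.

```python
from dataclasses import dataclass


@dataclass
class TranscriptLine:
    """One speaker-attributed, time-stamped piece of transcript (hypothetical fields)."""
    video_id: str
    position: int   # order of the line within the video
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    start: float    # seconds from the start of the audio
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Convert seconds to an HH:MM:SS.mmm string."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render(line: TranscriptLine) -> str:
    """Render a line as '[start - end] SPEAKER: text'."""
    return (
        f"[{format_timestamp(line.start)} - {format_timestamp(line.end)}] "
        f"{line.speaker}: {line.text}"
    )


line = TranscriptLine("vid_001", 0, "SPEAKER_00", 0.5, 3.5, "Good evening.")
print(render(line))
```

Keeping the start and end times on every line is what allows the text to be traced back to both the diarization turn and the underlying audio.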
Diarization and transcription run continuously, with GPU acceleration, on a network of machines shared by the project's researchers.
Tools Used
Database Schema
Six tables in a normalized relational schema, from raw metadata to sentence-level NLP annotations.
| # | Table | Description | Scale |
|---|---|---|---|
| 1 | videos | One row per video: ID, channel metadata, views, likes, comments, tags, duration, upload date, political orientation, country, gender. | 26,396 rows |
| 2 | comments | All comments with author info, like counts, timestamps, nested reply structure, and a JSONB analysis column. | 9.6M+ rows |
| 3 | video_transcripts | Full diarized transcripts with speaker labels and cleaned text versions. | 28,121 rows |
| 4 | transcription_speakers | Individual speaker segments from diarization, ordered by position within each video. | 1,021,611 rows |
| 5 | comments_processed | Sentence-level tokenized comments with NER entities (PER, ORG, LOC) and ML prediction columns. | 15.3M+ rows |
| 6 | transcription_speakers_processed | Sentence-level speaker segments with NER extraction and full annotation suite. | 4.8M+ rows |
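
To make the relationship between `videos` and `transcription_speakers` concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in. The table names come from the schema above, but every column name, the key layout, and the choice of SQLite are assumptions for illustration; the JSONB column mentioned above suggests the production database is a different engine.

```python
import sqlite3

# In-memory stand-in database; column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE videos (
        video_id    TEXT PRIMARY KEY,
        upload_date TEXT
    );
    CREATE TABLE transcription_speakers (
        video_id TEXT    NOT NULL REFERENCES videos(video_id),
        position INTEGER NOT NULL,  -- order of the segment within the video
        speaker  TEXT    NOT NULL,
        text     TEXT    NOT NULL,
        PRIMARY KEY (video_id, position)
    );
    """
)

conn.execute("INSERT INTO videos VALUES (?, ?)", ("vid_001", "2024-01-15"))
conn.executemany(
    "INSERT INTO transcription_speakers VALUES (?, ?, ?, ?)",
    [
        # Inserted out of order on purpose; the query restores the order.
        ("vid_001", 1, "SPEAKER_01", "Thank you."),
        ("vid_001", 0, "SPEAKER_00", "Good evening."),
    ],
)

# Rebuild the diarized transcript of one video, ordered by segment position.
rows = conn.execute(
    """
    SELECT t.position, t.speaker, t.text
    FROM transcription_speakers AS t
    JOIN videos AS v ON v.video_id = t.video_id
    WHERE v.video_id = ?
    ORDER BY t.position
    """,
    ("vid_001",),
).fetchall()

for position, speaker, text in rows:
    print(position, speaker, text)
```

Storing one row per speaker segment with an explicit position is what lets a full transcript be reassembled with a single ordered query.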
Continuous Observatory
The database is continuously updated through channel scanning, video transcription and annotation, comment extraction, and metadata refreshes (views, likes, subscribers). Each scan adds to a longitudinal history that is accessible via the API.
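
One way to picture a longitudinal history is as an append-only list of snapshots, one per scan, rather than a single row overwritten in place. The sketch below follows that assumption; `Snapshot`, `record_scan`, and `view_growth` are hypothetical names, and the figures are invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Metadata observed for one video during one scan (hypothetical fields)."""
    video_id: str
    scan_date: date
    views: int
    likes: int


# Append-only history: earlier scans are never overwritten.
history: List[Snapshot] = []


def record_scan(video_id: str, scan_date: date, views: int, likes: int) -> None:
    """Store the metadata seen during a scan as a new snapshot."""
    history.append(Snapshot(video_id, scan_date, views, likes))


def view_growth(video_id: str) -> List[Tuple[date, int]]:
    """Return (scan_date, views gained since the previous scan) for one video."""
    snaps = sorted(
        (s for s in history if s.video_id == video_id),
        key=lambda s: s.scan_date,
    )
    return [
        (later.scan_date, later.views - earlier.views)
        for earlier, later in zip(snaps, snaps[1:])
    ]


# Invented figures, for illustration only.
record_scan("vid_001", date(2024, 1, 1), views=1000, likes=40)
record_scan("vid_001", date(2024, 2, 1), views=1500, likes=55)
record_scan("vid_001", date(2024, 3, 1), views=1750, likes=60)

growth = view_growth("vid_001")
print(growth)
```

Because each scan is kept, changes between any two dates can be computed after the fact instead of being lost at the next update.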