Nara Lens/Docs

Understanding Results

Learn how to interpret and use voice activity detection (VAD) analysis data.

Frame Data Structure

Each frame in the response contains an array of detected faces with speaker detection information:

JSON
{
  "frame": 42,
  "faces": [
    {
      "bbox": [120, 80, 100, 120],
      "person_id": 1,
      "speaking_score": 0.92,
      "active": true
    }
  ]
}
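For example, you can pull the currently active speakers out of a frame with a simple filter (the frame object below mirrors the structure shown above):

```javascript
// A single frame object, as returned in the response.
const frame = {
  frame: 42,
  faces: [
    { bbox: [120, 80, 100, 120], person_id: 1, speaking_score: 0.92, active: true },
    { bbox: [340, 95, 98, 115], person_id: 2, speaking_score: 0.12, active: false },
  ],
};

// Keep only the faces flagged as actively speaking.
const activeFaces = frame.faces.filter((face) => face.active);
console.log(activeFaces.map((face) => face.person_id)); // [1]
```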

Bounding Box

The bbox array contains [x, y, width, height] coordinates in pixels, relative to the original video dimensions:

  • x - Left edge of the face box
  • y - Top edge of the face box
  • width - Width of the face box
  • height - Height of the face box

Person Tracking

The person_id field tracks individuals across frames. The same person will have the same ID throughout the video, allowing you to:

  • Track speaking time per person
  • Build timeline visualizations
  • Calculate turn-taking metrics
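As a sketch of the timeline use case, the helper below groups per-frame activity by person_id. It assumes the frame structure shown earlier; buildTimelines is an illustrative name, not an API call:

```javascript
// Group each face's per-frame activity into a timeline keyed by person_id.
function buildTimelines(frames) {
  const timelines = {};
  for (const frame of frames) {
    for (const face of frame.faces) {
      (timelines[face.person_id] ||= []).push({
        frame: frame.frame,
        active: face.active,
      });
    }
  }
  return timelines;
}

// Two frames of hypothetical results: person 1 speaks, then person 2 takes over.
const timelines = buildTimelines([
  { frame: 0, faces: [{ person_id: 1, active: true }] },
  { frame: 1, faces: [{ person_id: 1, active: false }, { person_id: 2, active: true }] },
]);
console.log(timelines[1]); // [{ frame: 0, active: true }, { frame: 1, active: false }]
```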

Speaking Score

The speaking_score is a confidence value between 0 and 1:

  • Below 0.3 - Not speaking
  • 0.3 to 0.7 - Possibly speaking
  • Above 0.7 - Likely speaking

The active boolean is derived from this score using an optimized threshold.
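If you want the qualitative bands above as labels (for example, for debugging overlays), a minimal helper might look like this. Note that active is already computed for you, so you normally don't need to re-threshold the score yourself:

```javascript
// Map a speaking_score to the qualitative bands described above.
function classifyScore(score) {
  if (score < 0.3) return "not speaking";
  if (score < 0.7) return "possibly speaking";
  return "likely speaking";
}

console.log(classifyScore(0.92)); // "likely speaking"
console.log(classifyScore(0.45)); // "possibly speaking"
```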

Example: Calculate Speaking Time

JavaScript
const result = await getVadResult(taskId);
const fps = result.frames.length / result.video_duration;

// Count frames where each person is speaking
const speakingFrames = {};

for (const frame of result.frames) {
  for (const face of frame.faces) {
    if (face.active) {
      speakingFrames[face.person_id] =
        (speakingFrames[face.person_id] || 0) + 1;
    }
  }
}

// Convert to seconds
for (const personId in speakingFrames) {
  const seconds = speakingFrames[personId] / fps;
  console.log(`Person ${personId}: ${seconds.toFixed(1)}s speaking`);
}