Nara Lens/Docs

Understanding Results

Learn how to interpret and use voice activity detection (VAD) analysis data.

Frame Data Structure

Each frame in the response contains an array of detected faces with speaker detection information:

JSON
{
  "frame": 42,
  "faces": [
    {
      "bbox": [120, 80, 100, 120],
      "person_id": 1,
      "speaking_score": 0.92,
      "active": true
    }
  ]
}
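For example, you can pull the currently active speakers out of a frame with a simple filter (the frame object below mirrors the structure shown above):

```javascript
// A single frame object, as returned in the response.
const frame = {
  frame: 42,
  faces: [
    { bbox: [120, 80, 100, 120], person_id: 1, speaking_score: 0.92, active: true },
    { bbox: [340, 95, 98, 115], person_id: 2, speaking_score: 0.12, active: false },
  ],
};

// Keep only the faces flagged as actively speaking.
const activeFaces = frame.faces.filter((face) => face.active);
console.log(activeFaces.map((face) => face.person_id)); // [1]
```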

Bounding Box

The bbox array contains [x, y, width, height] coordinates in pixels, relative to the original video dimensions:

  • x - Left edge of the face box
  • y - Top edge of the face box
  • width - Width of the face box
  • height - Height of the face box

Person Tracking

The person_id field tracks individuals across frames. The same person will have the same ID throughout the video, allowing you to:

  • Track speaking time per person
  • Build timeline visualizations
  • Calculate turn-taking metrics
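As a sketch of the timeline use case, the helper below groups per-frame activity by person_id. It assumes the frame structure shown earlier; buildTimelines is an illustrative name, not an API call:

```javascript
// Group each face's per-frame activity into a timeline keyed by person_id.
function buildTimelines(frames) {
  const timelines = {};
  for (const frame of frames) {
    for (const face of frame.faces) {
      (timelines[face.person_id] ||= []).push({
        frame: frame.frame,
        active: face.active,
      });
    }
  }
  return timelines;
}

// Two frames of hypothetical results: person 1 speaks, then person 2 takes over.
const timelines = buildTimelines([
  { frame: 0, faces: [{ person_id: 1, active: true }] },
  { frame: 1, faces: [{ person_id: 1, active: false }, { person_id: 2, active: true }] },
]);
console.log(timelines[1]); // [{ frame: 0, active: true }, { frame: 1, active: false }]
```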

Speaking Score

The speaking_score is a confidence value between 0 and 1:

  • Below 0.3 - Not speaking
  • 0.3 to 0.7 - Possibly speaking
  • Above 0.7 - Likely speaking

The active boolean is derived from this score using an optimized threshold.
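If you want the qualitative bands above as labels (for example, for debugging overlays), a minimal helper might look like this. Note that active is already computed for you, so you normally don't need to re-threshold the score yourself:

```javascript
// Map a speaking_score to the qualitative bands described above.
function classifyScore(score) {
  if (score < 0.3) return "not speaking";
  if (score < 0.7) return "possibly speaking";
  return "likely speaking";
}

console.log(classifyScore(0.92)); // "likely speaking"
console.log(classifyScore(0.45)); // "possibly speaking"
```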

Example: Calculate Speaking Time

JavaScript
const result = await getVadResult(taskId);
const fps = result.frames.length / result.video_duration;

// Count frames where each person is speaking
const speakingFrames = {};

for (const frame of result.frames) {
  for (const face of frame.faces) {
    if (face.active) {
      speakingFrames[face.person_id] =
        (speakingFrames[face.person_id] || 0) + 1;
    }
  }
}

// Convert to seconds
for (const personId in speakingFrames) {
  const seconds = speakingFrames[personId] / fps;
  console.log(`Person ${personId}: ${seconds.toFixed(1)}s speaking`);
}