Understanding Results
Learn how to interpret and use VAD analysis data.
Frame Data Structure
Each frame in the response contains an array of detected faces with speaker detection information:
```json
{
  "frame": 42,
  "faces": [
    {
      "bbox": [120, 80, 100, 120],
      "person_id": 1,
      "speaking_score": 0.92,
      "active": true
    }
  ]
}
```

Bounding Box
The bbox array contains [x, y, width, height] coordinates in pixels, relative to the original video dimensions:
- x - Left edge of the face box
- y - Top edge of the face box
- width - Width of the face box
- height - Height of the face box
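When drawing overlays, it is often handier to work with corner coordinates than with width/height. A minimal sketch of the conversion; the `bboxToCorners` helper is illustrative, not part of the API:

```javascript
// Convert a bbox in [x, y, width, height] form to corner coordinates,
// e.g. for drawing a rectangle on a canvas or video overlay.
function bboxToCorners([x, y, width, height]) {
  return { left: x, top: y, right: x + width, bottom: y + height };
}

// Using the bbox from the example frame above:
const corners = bboxToCorners([120, 80, 100, 120]);
// corners is { left: 120, top: 80, right: 220, bottom: 200 }
```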
Person Tracking
The person_id field tracks individuals across frames. The same person will have the same ID throughout the video, allowing you to:
- Track speaking time per person
- Build timeline visualizations
- Calculate turn-taking metrics
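Because `person_id` is stable across frames, a speaking timeline can be built by grouping consecutive active frames into intervals. A sketch under the frame structure shown above (the `speakingIntervals` helper and the toy frame data are illustrative, not part of the API):

```javascript
// Group consecutive frames where a person is active into speaking
// intervals, expressed in seconds via the video's frame rate.
function speakingIntervals(frames, fps) {
  const open = {};      // person_id -> frame index where the current interval started
  const intervals = {}; // person_id -> array of { start, end } in seconds

  frames.forEach((frame, i) => {
    const activeIds = new Set(
      frame.faces.filter((f) => f.active).map((f) => f.person_id)
    );
    // Open an interval for anyone who just started speaking
    for (const id of activeIds) {
      if (!(id in open)) open[id] = i;
    }
    // Close intervals for anyone who just stopped
    for (const id of Object.keys(open).map(Number)) {
      if (!activeIds.has(id)) {
        (intervals[id] ||= []).push({ start: open[id] / fps, end: i / fps });
        delete open[id];
      }
    }
  });
  // Close any interval still open when the video ends
  for (const id of Object.keys(open).map(Number)) {
    (intervals[id] ||= []).push({
      start: open[id] / fps,
      end: frames.length / fps,
    });
  }
  return intervals;
}

// Toy example at 1 fps: person 1 speaks in frames 0-1, stops in frame 2
const frames = [
  { frame: 0, faces: [{ person_id: 1, active: true }] },
  { frame: 1, faces: [{ person_id: 1, active: true }] },
  { frame: 2, faces: [{ person_id: 1, active: false }] },
];
console.log(speakingIntervals(frames, 1)); // { '1': [ { start: 0, end: 2 } ] }
```

The same interval list feeds directly into turn-taking metrics (interval counts, gaps between speakers) or a timeline visualization.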
Speaking Score
The speaking_score is a confidence value between 0 and 1:
- 0.0 - 0.3 - Not speaking
- 0.3 - 0.7 - Possibly speaking
- 0.7 - 1.0 - Likely speaking
The active boolean is derived from this score using an optimized threshold.
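If you prefer your own labeling over the built-in `active` flag, the score bands above translate directly into code. Note that 0.3 and 0.7 here are the documented band edges, not the service's internal threshold:

```javascript
// Map a speaking_score (0 to 1) to the bands described above.
function speakingLabel(score) {
  if (score < 0.3) return "not speaking";
  if (score < 0.7) return "possibly speaking";
  return "likely speaking";
}

console.log(speakingLabel(0.92)); // "likely speaking"
```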
Example: Calculate Speaking Time
```javascript
const result = await getVadResult(taskId);

// Derive the effective frame rate from the analyzed frames
const fps = result.frames.length / result.video_duration;

// Count frames where each person is speaking
const speakingFrames = {};
for (const frame of result.frames) {
  for (const face of frame.faces) {
    if (face.active) {
      speakingFrames[face.person_id] =
        (speakingFrames[face.person_id] || 0) + 1;
    }
  }
}

// Convert frame counts to seconds
for (const personId in speakingFrames) {
  const seconds = speakingFrames[personId] / fps;
  console.log(`Person ${personId}: ${seconds.toFixed(1)}s speaking`);
}
```