Computer Vision · Active · ConfTrackNet

Research Project

Hybrid Architecture for Real-Time Query-Based Video Instance Segmentation

A real-time vision-language segmentation framework that combines heavy detection on key frames with lightweight mask propagation, achieving near-SOTA accuracy at 190+ FPS.

Vision Transformers · DistilBERT · MobileViT-S · Swin-B · Deformable Convolution · Cross-Modal Fusion

Project Overview

ConfTrackNet is a real-time vision-language segmentation framework designed to identify and segment a target object in video using a natural language query. Instead of performing expensive cross-modal reasoning on every frame, the framework intelligently combines accurate heavy detection on key frames with lightweight mask propagation on intermediate frames. This design enables strong segmentation accuracy while achieving real-time speed, making the model suitable for both desktop and edge deployment.

Research Problem

Existing query-based video segmentation methods usually apply the full transformer-based vision-language model to every frame. This gives strong accuracy, but it is too slow for real-time applications, often running well below 30 FPS. ConfTrackNet addresses this gap by introducing adaptive routing: it spends heavy computation only when necessary, and uses efficient tracking-style propagation for the rest of the video.

Pipeline

How the Model Works

1. Text Query Encoding

The input text query is encoded once using a frozen DistilBERT text encoder, so the language information can be reused across all frames — eliminating redundant NLP computation.
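Because the query embedding is frame-invariant, it can be computed once and cached. A minimal sketch of this encode-once pattern, where `encode_query` is a stand-in for the frozen DistilBERT forward pass and the toy "embedding" is purely illustrative:

```python
from functools import lru_cache

# Sketch of encode-once reuse. encode_query stands in for the frozen
# DistilBERT encoder; the placeholder embedding below is illustrative only.
@lru_cache(maxsize=None)
def encode_query(query: str) -> tuple:
    # deterministic toy "embedding" derived from the text
    return tuple(float(ord(c) % 7) for c in query[:8])

emb_a = encode_query("the red car on the left")
emb_b = encode_query("the red car on the left")
# cached: the encoder runs once per unique query; both calls
# return the same embedding object
```

In the real pipeline the cached embedding is passed to both the heavy and lightweight pathways, so no per-frame language computation occurs.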

2. Heavy Key-Frame Detection

The first frame is processed by a heavy detection pathway built on Swin-B with cross-modal fusion and a transformer decoder, producing a high-quality initial segmentation mask.

3. Lightweight Propagation + Confidence Gate

For subsequent frames, a lightweight MobileViT-S pathway propagates the mask using the previous mask, current-frame features, and text guidance. A confidence estimator decides whether to trust the result or to re-trigger full detection.
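The three steps above reduce to a single routing loop. In this sketch, `heavy_detect`, `propagate`, and `confidence` are placeholders for the Swin-B pathway, the MobileViT-S pathway, and the multi-signal estimator; the threshold value is an assumption, not a figure from the paper:

```python
TAU = 0.5  # re-detection threshold (assumed value)

def segment_video(frames, query_emb, heavy_detect, propagate, confidence):
    """Confidence-guided routing: propagate cheaply, re-detect on low confidence."""
    mask = heavy_detect(frames[0], query_emb)  # heavy key-frame detection
    masks = [mask]
    for frame in frames[1:]:
        mask = propagate(frame, mask, query_emb)      # lightweight path
        if confidence(frame, mask, query_emb) < TAU:  # confidence gate
            mask = heavy_detect(frame, query_emb)     # re-trigger full detection
        masks.append(mask)
    return masks
```

The loop makes the key property visible: heavy computation runs only on the first frame and on frames where the gate fires, so cost tracks video difficulty rather than a fixed schedule.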

Architecture Diagram

Model Architecture

ConfTrackNet — Full Pipeline Architecture

Full pipeline of ConfTrackNet: the Baseline (heavy) detection pathway (top), Key Frame Detection Module with confidence-guided routing (middle), and the Lightweight Tracking Pipeline for intermediate frames (bottom).

Architecture

Core Architecture Components

Five tightly integrated modules that together enable temporal consistency with minimal compute.

Text Encoder

Frozen DistilBERT that converts the natural language query into a shared embedding reused across all frames.

Heavy Detection Pathway

Swin-B backbone with cross-modal vision-language fusion and a transformer decoder for accurate key-frame segmentation.

Lightweight Propagation Pathway

Frozen MobileViT-S with a mask encoder, deformable mask warping module, and multi-scale decoder. Only 3.16M total parameters (0.64M trainable), running at 5.0 ms per frame.
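A drastically simplified sketch of the warping idea: each output pixel samples the previous mask at a per-pixel offset. The real module learns deformable offsets and samples bilinearly; this toy version uses integer offsets and nearest-neighbour lookup for brevity:

```python
def warp_mask(mask, offsets):
    """Warp a 2-D soft mask by integer per-pixel (dy, dx) offsets.

    Illustrative stand-in for a learned deformable warping module.
    """
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = offsets[y][x]
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:  # out-of-range samples stay 0
                out[y][x] = mask[sy][sx]
    return out
```

In the actual pathway the offset field is predicted from current-frame features, so the warp adapts to object motion between frames.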

Multi-Signal Confidence Estimator

Fuses three complementary signals — feature similarity, embedding consistency, and mask quality estimation — into a single reliability score.

Confidence-Guided Re-Detection Gate

Dynamically triggers full inference only when confidence falls below threshold, replacing expensive fixed-interval keyframe schedules.

What Makes It Different

The main novelty of ConfTrackNet is that it does not use a fixed key-frame schedule. Instead, it uses confidence-guided adaptive routing. The system dynamically decides whether to continue propagating the mask or to perform a fresh full detection. This is more effective than running heavy detection every N frames, because video difficulty changes over time. The paper demonstrates that adaptive routing achieves a significantly better balance between accuracy and speed than any fixed-interval keyframe selection strategy.

Propagation Design

The lightweight branch uses a frozen MobileViT-S backbone, a mask encoder, a learned deformable mask warping module, and a multi-scale decoder. It is the primary reason the model achieves real-time speed.

Total Parameters: 3.16M
Trainable Parameters: 0.64M
Inference Speed: 5.0 ms / frame
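The total/trainable split follows directly from freezing the backbone. The per-module counts below are illustrative assumptions chosen to sum to the reported figures; only the 3.16M total and 0.64M trainable numbers come from the project:

```python
# Illustrative parameter budget (per-module numbers are assumptions).
modules = [
    ("MobileViT-S backbone (frozen)", 2_520_000, False),
    ("mask encoder",                    340_000, True),
    ("deformable warping module",       180_000, True),
    ("multi-scale decoder",             120_000, True),
]

total = sum(n for _, n, _ in modules)
trainable = sum(n for _, n, t in modules if t)
print(f"total={total / 1e6:.2f}M trainable={trainable / 1e6:.2f}M")
```

Because the frozen backbone dominates the count, training touches only about a fifth of the parameters, which keeps fine-tuning cheap as well.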
Multi-Signal Fusion

Confidence Estimation Mechanism

These three signals are fused into a single confidence score. Frames with low confidence automatically trigger re-detection via the heavy pathway. The full three-signal configuration achieved the best results in ablation studies.

1. Feature Similarity

Checks whether the object appearance remains visually consistent between frames.

2. Embedding Consistency

Checks whether the instance identity (as encoded by the text query) is preserved across the propagated mask.

3. Mask Quality Estimation

Checks whether the predicted mask spatially aligns well with the current frame's features.
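A hedged sketch of the fusion step: the real estimator is learned end-to-end, whereas this toy version combines the three signals with fixed weights (the weight values are assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def confidence(feat_prev, feat_cur, emb_prev, emb_cur, mask_iou,
               weights=(0.4, 0.3, 0.3)):  # assumed fixed weights
    signals = (
        cosine(feat_prev, feat_cur),  # 1. feature similarity
        cosine(emb_prev, emb_cur),    # 2. embedding consistency
        mask_iou,                     # 3. mask quality proxy (e.g. predicted IoU)
    )
    return sum(w * s for w, s in zip(weights, signals))
```

Any single signal can fail silently (e.g. appearance stays stable while identity drifts), which is why the ablations favor fusing all three.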

Benchmarks

Performance Results

~15.8× throughput improvement over the heavy baseline while preserving nearly identical segmentation quality

Ref-YouTube-VOS

FPS: 190.7
J Score: 68.53
F Score: 69.45
J&F: 68.99
Latency: 5.0 ms

Ref-DAVIS 2017 (Zero-Shot)

FPS: 167.7
J Score: 67.26
F Score: 68.50
J&F: 67.88
Latency: 6.0 ms

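The reported FPS sits slightly below the 1000 / latency ceiling of the light path because occasional heavy re-detections raise the average per-frame cost. An illustrative model of that effect, where the heavy-path latency and re-detection rate are assumptions:

```python
def effective_fps(light_ms, heavy_ms, redetect_rate):
    """Average throughput when a fraction of frames takes the heavy path."""
    avg_ms = (1.0 - redetect_rate) * light_ms + redetect_rate * heavy_ms
    return 1000.0 / avg_ms

# The light path alone would cap throughput at 1000 / 5.0 = 200 FPS;
# a small re-detection rate on an assumed slower heavy path pulls the
# average toward the reported ~190 FPS.
```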
Visual Results

Qualitative Results

Side-by-side comparison of input frames, ground truth, and ConfTrackNet predictions on real benchmark videos.

ConfTrackNet — Segmentation Predictions vs Ground Truth

Qualitative segmentation results on Ref-YouTube-VOS. Each block shows (top to bottom): input frames, ground truth masks, and ConfTrackNet predictions. The model accurately tracks and segments the queried object across challenging motion and occlusion scenarios.

Edge AI Deployment

Edge Deployment

A major strength of ConfTrackNet is practical deployment. The model maintains strong segmentation quality on embedded hardware, making it suitable for real-world edge AI and robotics applications where both accuracy and power efficiency are critical.

Hardware: NVIDIA Jetson AGX Orin 64GB
Power Mode: 30W
Precision: FP16
FPS: ~100
Latency: 10.0 ms
J Score: 68.46
F Score: 69.36
J&F: 68.91

Real-Time Inference Demo

Live segmentation results running on NVIDIA Jetson AGX Orin at ~100 FPS under 30W. Each clip shows ConfTrackNet tracking and segmenting the queried object in real time.

Real-Time Demo — Clip 1
Real-Time Demo — Clip 2
Real-Time Demo — Clip 3

Technologies

Vision Transformers · DistilBERT · MobileViT-S · Swin-B · Deformable Convolution · Cross-Modal Fusion

Key Features

Confidence-guided adaptive routing (no fixed keyframe schedule)
Frozen MobileViT-S propagation branch at 5.0 ms per frame
Three-signal confidence estimator
~15.8× throughput improvement over heavy baseline
Edge deployment at ~100 FPS on NVIDIA Jetson AGX Orin