DCHM: Depth-Consistent Human Modeling for Multiview Detection

Australian National University, CSIRO's Data61
ICCV 2025

DCHM fuses sparse multiview images with depth-consistent, superpixel-wise Gaussian splatting to create accurate, label-free 3D pedestrian models that surpass prior methods.

Abstract

Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting to costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and to model humans accurately, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validation demonstrates that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Moreover, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting.

Method

Human Modeling. We represent pedestrians as collections of segmented Gaussian primitives to enable multiview detection. Our pipeline reconstructs and segments pedestrians in challenging sparse-view, large-scale, and occluded environments.
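
As a concrete illustration, the sketch below shows what one such segmented Gaussian primitive could look like. The field names (mean, scale, rotation, opacity, instance_id) and the covariance helper are our own assumptions for exposition, not the paper's actual data structures.

```python
# Hedged sketch of a per-pedestrian Gaussian primitive. All field names
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class SegmentedGaussian:
    mean: np.ndarray      # (3,) center in world coordinates
    scale: np.ndarray     # (3,) per-axis extent of the anisotropic Gaussian
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z) orienting it
    opacity: float        # blending weight in [0, 1]
    instance_id: int      # pedestrian ID assigned by 3D segmentation

def covariance(g: SegmentedGaussian) -> np.ndarray:
    """Recover the 3x3 covariance Sigma = R diag(s)^2 R^T."""
    w, x, y, z = g.rotation / np.linalg.norm(g.rotation)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(g.scale)
    return R @ S @ S @ R.T
```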

Overview of the framework. The proposed multiview detection pipeline consists of separate training (left) and inference (right) stages. During training, human-modeling optimization refines mono-depth estimation for multiview-consistent depth prediction via pseudo-depth generation, mono-depth fine-tuning, and detection compensation. In particular, we leverage superpixels to improve Gaussian optimization in the sparse-view setting. During inference, the optimized mono-depth model produces Gaussians that model humans; these are segmented and clustered to localize pedestrians, shown as blue points on the BEV plane.
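
To make the inference stage concrete, here is a minimal sketch that back-projects each view's predicted depth into a fused world-space point cloud and clusters it into BEV pedestrian locations. DBSCAN stands in for the paper's segmentation-and-clustering step, the world z-axis is assumed to point up so the ground plane is (x, y), and all parameter values are placeholders.

```python
# Hedged sketch of the inference stage: depth maps -> fused point cloud
# -> ground-plane clusters -> BEV detections. Parameter values and the
# choice of DBSCAN are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (N, 3) world-space point cloud."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)                       # camera frame
    pts_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    return (cam_to_world @ pts_h)[:3].T                         # world frame

def localize_bev(depths, Ks, cam_to_worlds, eps=0.3, min_pts=50):
    """Fuse all views; return one BEV (x, y) point per pedestrian cluster."""
    cloud = np.concatenate([backproject(d, K, T)
                            for d, K, T in zip(depths, Ks, cam_to_worlds)])
    # Cluster on ground-plane coordinates only (assumes world z-axis is up).
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(cloud[:, :2])
    return np.array([cloud[labels == i, :2].mean(0)
                     for i in range(labels.max() + 1)])
```

Clustering on the ground-plane coordinates alone is what makes the output directly usable as BEV detections, matching the blue points described above.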

Comparison

DCHM fuses monocular depth estimates into a single, globally consistent point cloud for the downstream localization task. Unlike stereo methods [1], which break down under heavy occlusion and low-resolution inputs, and monocular baselines [2], [3], [4], [5], which misalign across cameras, DCHM enforces cross-view depth consistency, yielding higher detection accuracy.
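
The cross-view consistency constraint can be illustrated with a simple reprojection test: a fused 3D point is kept only if another view's depth map agrees with it within a relative tolerance. This is a hedged sketch under pinhole-camera assumptions; the tolerance and helper names are ours, not the paper's exact criterion.

```python
# Hedged sketch of a cross-view depth consistency check. The 5% relative
# tolerance is an arbitrary illustrative choice.
import numpy as np

def reproject(p_world, K, world_to_cam):
    """Project a world point into a view; return pixel (u, v) and depth z."""
    p_cam = (world_to_cam @ np.append(p_world, 1.0))[:3]
    uvz = K @ p_cam
    return uvz[:2] / uvz[2], p_cam[2]

def is_consistent(p_world, depth_j, K_j, world_to_cam_j, tol=0.05):
    """True if view j's depth map agrees with the fused 3D point p_world."""
    (u, v), z = reproject(p_world, K_j, world_to_cam_j)
    H, W = depth_j.shape
    ui, vi = int(round(u)), int(round(v))
    if z <= 0 or not (0 <= ui < W and 0 <= vi < H):
        return False                               # behind camera or off-frame
    return abs(depth_j[vi, ui] - z) / z < tol      # relative depth agreement
```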

Results

Pedestrians, represented by segmented Gaussians, are assigned unique IDs indicated by distinct colors. Matching is performed in 3D and projected onto the 2D images of each camera view, with correspondences highlighted by colored circles.
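
A plausible rendering of this visualization, assuming pinhole cameras and OpenCV for drawing: each pedestrian keeps one global ID, and its 3D center is projected into every view with the same ID color. The color table, circle radius, and function names are arbitrary illustrative choices.

```python
# Hedged sketch: project 3D pedestrian centers into each camera and draw
# same-colored circles so correspondences are visible across views.
import cv2
import numpy as np

COLORS = [(0, 0, 255), (0, 255, 0), (255, 0, 0), (0, 255, 255)]  # BGR per ID

def draw_matches(images, centers_3d, Ks, world_to_cams, radius=12):
    """Mark each pedestrian's projected center with its ID color in all views."""
    for img, K, T in zip(images, Ks, world_to_cams):
        for pid, c in enumerate(centers_3d):
            p_cam = (T @ np.append(c, 1.0))[:3]
            if p_cam[2] <= 0:
                continue                                # behind this camera
            uvz = K @ p_cam
            u, v = int(uvz[0] / uvz[2]), int(uvz[1] / uvz[2])
            if 0 <= u < img.shape[1] and 0 <= v < img.shape[0]:
                cv2.circle(img, (u, v), radius, COLORS[pid % len(COLORS)], 2)
    return images
```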

BibTeX

Coming soon...