3D Head Position via VIO
Build's research focus is converting world-scale unlabeled egocentric data into downstream models that predict what people do. Body position is an obvious choice for a prediction target.
One approach is bootstrapping supervised body-pose labels with exocentric cameras (see egocentric.org/bodypose), but deploying extra cameras is expensive and adds operational complexity, which fundamentally limits scaling. Most physical action understanding reduces to where the head is (global pose) and where the hands are (end effectors), both recoverable from the egocentric device alone.
We optimize for relative pose error (RPE) over a 3-minute window, not absolute trajectory error (ATE) over a full 8-hour shift. The challenge, then, is to build a VIO pipeline that produces 6DoF camera poses from our egocentric video.
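To make the metric concrete, here is a minimal RPE sketch (a hypothetical helper, not Build's evaluation code): it compares pose *deltas* over a fixed frame window, so a constant offset between the estimated and reference trajectories contributes nothing, while drift accumulated inside the window does.

```python
import numpy as np

def rpe_translation(gt, est, delta):
    """RMSE of relative translation error over windows of `delta` frames.

    gt, est: lists of 4x4 homogeneous pose matrices (same length).
    At 30 fps, a 3-minute window corresponds to delta = 3 * 60 * 30 = 5400.
    """
    errs = []
    for i in range(len(gt) - delta):
        gt_rel = np.linalg.inv(gt[i]) @ gt[i + delta]    # reference motion
        est_rel = np.linalg.inv(est[i]) @ est[i + delta]  # estimated motion
        err = np.linalg.inv(gt_rel) @ est_rel             # residual motion
        errs.append(np.linalg.norm(err[:3, 3]))           # translation part
    return float(np.sqrt(np.mean(np.square(errs))))
```

Because only relative motion enters the error, a trajectory that is globally shifted or slowly drifting can still score well on RPE while scoring poorly on ATE, which is exactly the trade-off chosen above.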
Build Gen 4 devices have a 30fps camera at 1080p with 163.48° diagonal FOV and a 30Hz IMU.
Intrinsics
Model: pinhole-equi
Resolution: 1920 x 1080
Reprojection error: mean 0.540430 px, median 0.471873 px, std 0.360496 px
Calculated FOVs: H 149.904944 deg, V 88.231169 deg, D 163.48 deg
K =
[[699.19397931,   0.00000000, 976.75087121],
 [  0.00000000, 699.60395977, 565.79050329],
 [  0.00000000,   0.00000000,   1.00000000]]
distortion =
[-0.01803320, 0.06173989, -0.05266772, 0.01903308]
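Under the pinhole-equi model (pinhole intrinsics plus equidistant, i.e. fisheye, distortion with coefficients k1..k4), projection of a camera-frame 3D point can be sketched as follows; `project` is a hypothetical helper using the calibration values above:

```python
import numpy as np

# Calibrated intrinsics from above
K = np.array([[699.19397931, 0.0, 976.75087121],
              [0.0, 699.60395977, 565.79050329],
              [0.0, 0.0, 1.0]])
D = np.array([-0.01803320, 0.06173989, -0.05266772, 0.01903308])  # k1..k4

def project(p_cam):
    """Project a 3D point in the camera frame to pixel coordinates
    under the pinhole + equidistant (fisheye) model."""
    x, y, z = p_cam
    r = np.hypot(x, y)
    theta = np.arctan2(r, z)  # angle from the optical axis
    theta_d = theta * (1 + D[0] * theta**2 + D[1] * theta**4
                         + D[2] * theta**6 + D[3] * theta**8)
    scale = theta_d / r if r > 1e-12 else 1.0
    u = K[0, 0] * x * scale + K[0, 2]
    v = K[1, 1] * y * scale + K[1, 2]
    return u, v
```

The equidistant model is what makes the ~163° diagonal FOV workable: pixel radius grows with the angle theta itself rather than with tan(theta), so rays near 90° off-axis still map to finite coordinates.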
Extrinsics
description = CAD-derived mechanical estimate of IMU-to-camera transform
transform_direction = imu_to_camera
notation = ^cam T_imu
units = meters
T_cam_imu =
[[-1.0000000000, 0.0000000000,  0.0000000000,  0.0018216000],
 [ 0.0000000000, 0.9848000000, -0.1736000000,  0.0064001000],
 [ 0.0000000000, 0.1736000000,  0.9848000000, -0.0114679000],
 [ 0.0000000000, 0.0000000000,  0.0000000000,  1.0000000000]]
camera_frame: x right in image, y down in image, z forward along optical axis
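Using ^cam T_imu as given, mapping an IMU-frame point into the camera frame is one homogeneous multiply; the sketch below uses a hypothetical helper name:

```python
import numpy as np

# ^cam T_imu: CAD-derived IMU-to-camera transform from above (meters)
T_cam_imu = np.array([
    [-1.0,  0.0,     0.0,     0.0018216],
    [ 0.0,  0.9848, -0.1736,  0.0064001],
    [ 0.0,  0.1736,  0.9848, -0.0114679],
    [ 0.0,  0.0,     0.0,     1.0],
])

def imu_to_camera(p_imu):
    """Express a point given in the IMU frame in the camera frame."""
    p = np.append(np.asarray(p_imu, dtype=float), 1.0)  # homogeneous coords
    return (T_cam_imu @ p)[:3]

# The inverse maps camera-frame points back into the IMU frame
T_imu_cam = np.linalg.inv(T_cam_imu)
```

Note the direction convention: `transform_direction = imu_to_camera` means the matrix takes IMU-frame coordinates to camera-frame coordinates; a VIO backend that expects the opposite convention needs `T_imu_cam` instead.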
Sample data from one device is available for download:
gsutil -m cp -r gs://build-ai-egocentric-native-compression/worker_001 .
We're hiring! eddy@build.ai