3D Head Position via VIO
Build's research focus is converting world-scale unlabeled egocentric data into downstream models that predict what people do. Body position is an obvious choice for a prediction target.
One approach is bootstrapping supervised body-pose labels with exocentric cameras (see egocentric.org/bodypose), but deploying extra cameras is expensive and adds operational complexity, which fundamentally limits scaling. Most physical action understanding reduces to where the head is (global pose) and where the hands are (end effectors), both recoverable from the egocentric device alone.
We optimize for relative pose error (RPE) over a 3-minute window, not absolute trajectory error (ATE) over a full 8-hour shift. The challenge, then, is to build a VIO pipeline that produces 6DoF camera poses from our egocentric video.
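To make the metric concrete, here is a minimal RPE sketch (a hypothetical helper, not Build's evaluation code): it compares pose *deltas* over a fixed frame window, so a constant offset between the estimated and reference trajectories contributes nothing, while drift accumulated inside the window does.

```python
import numpy as np

def rpe_translation(gt, est, delta):
    """RMSE of relative translation error over windows of `delta` frames.

    gt, est: lists of 4x4 homogeneous pose matrices (same length).
    At 30 fps, a 3-minute window corresponds to delta = 3 * 60 * 30 = 5400.
    """
    errs = []
    for i in range(len(gt) - delta):
        gt_rel = np.linalg.inv(gt[i]) @ gt[i + delta]    # reference motion
        est_rel = np.linalg.inv(est[i]) @ est[i + delta]  # estimated motion
        err = np.linalg.inv(gt_rel) @ est_rel             # residual motion
        errs.append(np.linalg.norm(err[:3, 3]))           # translation part
    return float(np.sqrt(np.mean(np.square(errs))))
```

Because only relative motion enters the error, a trajectory that is globally shifted or slowly drifting can still score well on RPE while scoring poorly on ATE, which is exactly the trade-off chosen above.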
Build Gen 4 devices have a 30fps camera at 1080p with 163.48° diagonal FOV and a 30Hz IMU.
Intrinsics
Model: pinhole-equi
Resolution: 1920 x 1080
Reprojection error: mean 0.540430 px, median 0.471873 px, std 0.360496 px
Calculated FOVs: H 149.904944 deg, V 88.231169 deg, D 163.48 deg
K =
[[699.19397931,   0.00000000, 976.75087121],
 [  0.00000000, 699.60395977, 565.79050329],
 [  0.00000000,   0.00000000,   1.00000000]]
distortion =
[-0.01803320, 0.06173989, -0.05266772, 0.01903308]
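Under the pinhole-equi model (pinhole intrinsics plus equidistant, i.e. fisheye, distortion with coefficients k1..k4), projection of a camera-frame 3D point can be sketched as follows; `project` is a hypothetical helper using the calibration values above:

```python
import numpy as np

# Calibrated intrinsics from above
K = np.array([[699.19397931, 0.0, 976.75087121],
              [0.0, 699.60395977, 565.79050329],
              [0.0, 0.0, 1.0]])
D = np.array([-0.01803320, 0.06173989, -0.05266772, 0.01903308])  # k1..k4

def project(p_cam):
    """Project a 3D point in the camera frame to pixel coordinates
    under the pinhole + equidistant (fisheye) model."""
    x, y, z = p_cam
    r = np.hypot(x, y)
    theta = np.arctan2(r, z)  # angle from the optical axis
    theta_d = theta * (1 + D[0] * theta**2 + D[1] * theta**4
                         + D[2] * theta**6 + D[3] * theta**8)
    scale = theta_d / r if r > 1e-12 else 1.0
    u = K[0, 0] * x * scale + K[0, 2]
    v = K[1, 1] * y * scale + K[1, 2]
    return u, v
```

The equidistant model is what makes the ~163° diagonal FOV workable: pixel radius grows with the angle theta itself rather than with tan(theta), so rays near 90° off-axis still map to finite coordinates.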
Extrinsics
description = CAD-derived mechanical estimate of IMU-to-camera transform
transform_direction = imu_to_camera
notation = ^cam T_imu
units = meters
T_cam_imu =
[[-1.0000000000, 0.0000000000,  0.0000000000,  0.0018216000],
 [ 0.0000000000, 0.9848000000, -0.1736000000,  0.0064001000],
 [ 0.0000000000, 0.1736000000,  0.9848000000, -0.0114679000],
 [ 0.0000000000, 0.0000000000,  0.0000000000,  1.0000000000]]
camera_frame: x right in image, y down in image, z forward along optical axis
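Using ^cam T_imu as given, mapping an IMU-frame point into the camera frame is one homogeneous multiply; the sketch below uses a hypothetical helper name:

```python
import numpy as np

# ^cam T_imu: CAD-derived IMU-to-camera transform from above (meters)
T_cam_imu = np.array([
    [-1.0,  0.0,     0.0,     0.0018216],
    [ 0.0,  0.9848, -0.1736,  0.0064001],
    [ 0.0,  0.1736,  0.9848, -0.0114679],
    [ 0.0,  0.0,     0.0,     1.0],
])

def imu_to_camera(p_imu):
    """Express a point given in the IMU frame in the camera frame."""
    p = np.append(np.asarray(p_imu, dtype=float), 1.0)  # homogeneous coords
    return (T_cam_imu @ p)[:3]

# The inverse maps camera-frame points back into the IMU frame
T_imu_cam = np.linalg.inv(T_cam_imu)
```

Note the direction convention: `transform_direction = imu_to_camera` means the matrix takes IMU-frame coordinates to camera-frame coordinates; a VIO backend that expects the opposite convention needs `T_imu_cam` instead.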
Sample data from one device is available for download:
gsutil -m cp -r gs://build-ai-egocentric-native-compression/worker_001 .
We're hiring! eddy@build.ai