Eye & Gaze Tracking with Wearables: Technology, Methods, and the Pipeline End to End

Wearable eye trackers have quietly become one of the most capable sensors you can put on a human body. A pair of glasses that weighs less than a sandwich can tell you, sixty to two hundred times a second, where a person is looking in the world in front of them, typically to within about a degree. That single stream of data underpins foveated rendering in VR headsets, hands-free interaction for people with motor impairments, surgical training, sports coaching, reading research, consumer-attention studies, and a growing slice of clinical diagnostics.

But "where are you looking?" is a deceptively hard question to answer from a camera pointed at an eyeball. This essay walks the whole pipeline — the physics on the glasses, the computer vision that turns pixels into pupils, the geometry that turns pupils into gaze, the calibration that ties it to the world, and the analysis that turns a noisy gaze stream into fixations, saccades, and meaning. The goal is that by the end you understand not just that it works, but why each stage exists and where each one fails.

The shape of the problem

Start with the anatomy, because the hardware is designed around it. The eye is, optically, a roughly spherical camera with a lens system (the cornea plus the adjustable crystalline lens) and a sensor (the retina). The fovea, a tiny pit about 1.5 mm across at the center of the retina, is the only region with high enough cone density for sharp, color-rich vision. Its sharpest central portion subtends roughly 1–2 degrees of your visual field, about the width of your thumbnail at arm's length. Everything outside it is progressively blurrier.

This is the entire reason eye tracking is useful: because the fovea is so small, your brain must constantly aim it. You don't perceive a wide sharp scene; you perceive a narrow sharp spotlight that your eyes whip around two to three times a second, and your brain stitches the result into the illusion of a stable, detailed world. Eye tracking measures the aiming. When you know where the fovea is pointed, you know — to a first approximation — what the person is attending to.

Two more facts shape everything downstream:

The eye is never still. Even during a "fixation," it drifts, tremors, and makes microsaccades. Real measured gaze is jittery by nature, not just because of sensor noise.
The mapping from eye orientation to a point in the world depends on where the eye is, not just how it's rotated. On a wearable, the cameras move with the head, which is both a blessing (the geometry between eye and camera is roughly fixed) and a curse (any slippage of the glasses breaks that assumption).

Hardware: what's actually on the glasses

A modern wearable eye tracker is a small constellation of cameras and light sources mounted on a frame.

Eye cameras. One per eye, tucked into the lower rim or temple, pointed up at the eyeball. These are typically small global-shutter monochrome sensors running at 60–200 Hz, sometimes more in research rigs. Global shutter matters: a rolling shutter would smear a fast-moving pupil across the frame and corrupt its measured position.

Infrared illumination. Near-infrared LEDs (around 850–940 nm) light the eye. There are three reasons to use IR rather than visible light: it's invisible to the wearer, so it doesn't distract or constrict the pupil; the iris reflects IR fairly uniformly, so the dark pupil stands out with high contrast; and it works in the dark. The LEDs do double duty: they illuminate the eye and create the corneal reflections (glints) that several gaze methods depend on. The eye cameras have IR-pass filters, so they see mostly this controlled illumination and reject most ambient visible light.

Scene (world) camera. A forward-facing RGB camera capturing what the wearer sees, usually wide-angle, often 1080p at 30–60 Hz. Gaze on a wearable is ultimately reported in the scene camera's image — a pixel coordinate on the world view — which is what makes the data interpretable. ("They looked at the exit sign," not "their eye rotated 14° left.")

IMU and compute. An inertial measurement unit tracks head motion; an onboard or tethered processor runs the detection and estimation pipeline. The whole challenge of the hardware design is to fix the geometric relationship between the eye cameras, the LEDs, and the scene camera rigidly enough that calibration done once stays valid — while keeping the thing light enough to wear comfortably for hours.

Two families of measurement

Almost every eye tracker in use today is video-oculography (VOG): it infers gaze from camera images of the eye. Within VOG there are two dominant approaches, and the distinction drives everything about robustness and calibration.

Pupil-only (appearance / pupil-center) tracking. Find the pupil in the eye image; its position and shape encode eye orientation. Simple, works with a single light source, but the pupil center in the image moves both when the eye rotates and when the camera moves relative to the eye. So pupil-only methods are exquisitely sensitive to the glasses slipping on your nose. Modern appearance-based deep-learning trackers fall loosely in this family — they regress gaze directly from the eye image — and inherit the same slippage fragility unless trained or corrected for it.

Pupil-Corneal Reflection (P-CR). Add the glints — the bright specular reflections of the IR LEDs on the cornea. The cornea is a near-spherical mirror, so a glint's position in the image barely moves when the eye rotates but moves with the eye when the whole eye translates. The pupil center moves with both. Therefore the vector from glint to pupil center largely cancels out small translations of the camera-eye geometry and isolates rotation. This is why P-CR is the workhorse of robust eye tracking: that difference vector is far more stable against slippage than the pupil center alone. With multiple glints you can do better still and reconstruct the cornea's 3D position.
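
The cancellation argument can be checked with a toy linear model. This is only a sketch under loud assumptions: the gains `PUPIL_GAIN` and `GLINT_GAIN`, the image origin at (100, 80), and the idea that slippage is a pure image shift are all hypothetical simplifications, not real eye optics.

```python
# Toy model (an illustrative assumption, not real eye optics):
# eye rotation moves the pupil image a lot and the glint image only a little,
# while a camera shift (slippage) moves both by the same amount.
PUPIL_GAIN = 4.0  # hypothetical pixels of pupil motion per degree of rotation
GLINT_GAIN = 0.5  # hypothetical pixels of glint motion per degree of rotation


def observe(rotation_deg, slip_px):
    """Return (pupil_xy, glint_xy) in eye-camera pixels for the toy model."""
    rx, ry = rotation_deg
    sx, sy = slip_px
    pupil = (100.0 + PUPIL_GAIN * rx + sx, 80.0 + PUPIL_GAIN * ry + sy)
    glint = (100.0 + GLINT_GAIN * rx + sx, 80.0 + GLINT_GAIN * ry + sy)
    return pupil, glint


def pcr_vector(pupil, glint):
    """Pupil-minus-glint difference vector, the core P-CR feature."""
    return (pupil[0] - glint[0], pupil[1] - glint[1])


# Same eye rotation, observed before and after the glasses slip by (6, 3) px.
pupil_a, glint_a = observe((10.0, -5.0), (0.0, 0.0))
pupil_b, glint_b = observe((10.0, -5.0), (6.0, 3.0))

print("pupil center moved by:", (pupil_b[0] - pupil_a[0], pupil_b[1] - pupil_a[1]))
print("P-CR vector before:", pcr_vector(pupil_a, glint_a))
print("P-CR vector after: ", pcr_vector(pupil_b, glint_b))
```

In this idealized model the difference vector is exactly unchanged by the shift; on a real eye the cancellation is only approximate, which is why the text says "largely."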

There are older and niche methods worth knowing about. Electro-oculography (EOG) measures the corneo-retinal standing potential with skin electrodes; it is cheap, insensitive to lighting, and used in some sleep studies and electrode-equipped glasses, but it is drift-prone. Scleral search coils are the lab gold standard for accuracy, but they require a contact lens with an embedded coil and are far too invasive for wearables. For wearables, VOG with P-CR dominates.

Stage 1 — Eye-feature detection

Given a frame from the eye camera, the first job is to extract the geometric primitives: the pupil and the glints.

Pupil detection. Classic pipelines threshold the dark pupil, find connected components, reject blobs that aren't pupil-shaped, and fit an ellipse to the boundary. An ellipse, not a circle, because a circular pupil viewed off-axis projects to an ellipse — and that ellipse's orientation and eccentricity actually carry information about gaze direction. Robust fitters (e.g. RANSAC ellipse fitting, or algorithms in the lineage of Starburst, ExCuSe, ElSe, PuRe) cope with the things that wreck naive thresholding: eyelids and lashes occluding the pupil, mascara, droopy lids, the dark pupil merging with dark iris in some people, and — most painfully — glints sitting on top of the pupil and punching bright holes in it.
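
As a rough illustration of the threshold-then-ellipse idea, here is a minimal sketch that swaps in a simpler technique: instead of fitting an ellipse to the boundary (as RANSAC or the named algorithms do), it estimates the ellipse from the second moments of all dark pixels. The synthetic image, the threshold of 60, and the function names are assumptions made for the example, and nothing here handles eyelids, lashes, or glints.

```python
import math


def synthetic_eye(width=60, height=40, center=(30, 20), semi_axes=(12, 7)):
    """Hypothetical test image: dark elliptical 'pupil' (20) on a brighter field (150)."""
    cx, cy = center
    a, b = semi_axes
    return [[20 if ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1 else 150
             for x in range(width)] for y in range(height)]


def pupil_from_moments(image, threshold):
    """Estimate pupil center, semi-axes and orientation from dark-pixel moments."""
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value < threshold:
                xs.append(x)
                ys.append(y)
    n = len(xs)
    if n == 0:
        return None  # no dark pixels, e.g. during a blink
    cx, cy = sum(xs) / n, sum(ys) / n
    sxx = sum((x - cx) ** 2 for x in xs) / n
    syy = sum((y - cy) ** 2 for y in ys) / n
    sxy = sum((x - cx) * (y - cy) for x, y in zip(xs, ys)) / n
    # Eigenvalues of the 2x2 covariance matrix give the variance along each axis.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # For a uniformly filled ellipse, variance along an axis is (semi-axis)^2 / 4.
    major = 2 * math.sqrt(half_trace + root)
    minor = 2 * math.sqrt(max(half_trace - root, 0.0))
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return {"center": (cx, cy), "semi_axes": (major, minor), "angle_rad": angle}


estimate = pupil_from_moments(synthetic_eye(), threshold=60)
print(estimate)
```

The moment estimate is biased as soon as part of the pupil is occluded, which is one reason real pipelines fit the visible boundary instead.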

Today, learned segmentation has largely taken over the hard cases. A small convolutional network (architectures in the spirit of U-Net, or purpose-built ones like RITnet) labels each pixel as pupil / iris / sclera / background, and the ellipse is fit to the predicted pupil mask. This is dramatically more robust to occlusion and lighting, at the cost of needing representative training data and a bit more compute.

Glint detection. Glints are small, bright, near-saturated specular spots. They're found by looking for local intensity maxima of roughly the right size, then matched to the LEDs that produced them. Matching is its own puzzle: glints disappear when they slide off the cornea onto the sclera, they multiply off eyelashes or glasses, and with several LEDs you must figure out which glint came from which light. Getting this correspondence right is what lets you use glint geometry quantitatively rather than as a single fragile point.
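
A deliberately naive sketch of the two steps, detection and correspondence, might look like the following. The intensity cutoff of 240, the `expected` table of where each LED's glint usually lands, the LED names, and the greedy nearest-neighbour matching are all hypothetical choices for illustration; real systems use sub-pixel localization and geometric consistency checks.

```python
import math


def find_glints(image, min_intensity=240):
    """Return (x, y) of bright pixels that are strictly brighter than all 8 neighbours."""
    height, width = len(image), len(image[0])
    peaks = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            value = image[y][x]
            if value < min_intensity:
                continue
            neighbours = [image[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dx, dy) != (0, 0)]
            if all(value > n for n in neighbours):
                peaks.append((x, y))
    return peaks


def match_glints_to_leds(glints, expected, max_dist=3.0):
    """Greedily assign each LED the nearest unused glint, or None if none is close."""
    unused = list(glints)
    matches = {}
    for led, (ex, ey) in expected.items():
        best = min(unused, key=lambda g: math.hypot(g[0] - ex, g[1] - ey), default=None)
        if best is not None and math.hypot(best[0] - ex, best[1] - ey) <= max_dist:
            matches[led] = best
            unused.remove(best)
        else:
            matches[led] = None  # glint missing, e.g. it slid onto the sclera
    return matches


# Hypothetical 20x12 eye image with two single-pixel glints.
image = [[30] * 20 for _ in range(12)]
image[6][5] = 255
image[6][12] = 250

glints = find_glints(image)
expected = {"led_left": (4, 6), "led_right": (13, 7), "led_top": (9, 1)}
matches = match_glints_to_leds(glints, expected)
print(glints)
print(matches)
```

Greedy matching depends on the order in which LEDs are visited, so it can mislabel glints when spurious reflections appear; it is shown only to make the correspondence problem concrete.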

The output of Stage 1, per frame, is a compact feature set: pupil center and ellipse parameters, and one or more glint positions — all in eye-camera pixel coordinates.

Stage 2 — Gaze estimation

Now turn those 2D features into a gaze direction or a 3D line of sight. Two philosophies.

Regression (interpolation) models

Treat it as a black-box mapping: learn a function from eye features (pupil-glint vector, pupil ellipse) to a gaze point, using calibration samples as training data. The classic form is a low-order polynomial regression — a second-order polynomial mapping the 2D pupil-glint vector to scene-image coordinates is the textbook P-CR method and is remarkably effective. It needs no model of the eye's anatomy; it just fits the observed relationship. The price is that the fit is only valid near the conditions it was calibrated under — same headset position, similar distances — and it generalizes poorly if the geometry shifts.
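
To make the polynomial mapping concrete, here is a minimal sketch that fits a second-order polynomial from pupil-glint vectors to scene-image coordinates by ordinary least squares. The particular monomial set (1, x, y, xy, x², y²), the 9-point synthetic calibration data, and every function name are assumptions for the example; the normal equations are solved with a small hand-written Gaussian elimination to keep the sketch dependency-free.

```python
def poly_features(vx, vy):
    """Second-order monomials of the pupil-glint vector."""
    return [1.0, vx, vy, vx * vy, vx * vx, vy * vy]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_axis(features, targets):
    """Least-squares coefficients for one output axis via the normal equations."""
    n = len(features[0])
    ata = [[sum(f[i] * f[j] for f in features) for j in range(n)] for i in range(n)]
    atb = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(n)]
    return solve_linear(ata, atb)


def fit_gaze_map(pcr_vectors, scene_points):
    """Fit separate polynomials for the scene-image u and v coordinates."""
    feats = [poly_features(vx, vy) for vx, vy in pcr_vectors]
    return (fit_axis(feats, [p[0] for p in scene_points]),
            fit_axis(feats, [p[1] for p in scene_points]))


def predict(coefs, vx, vy):
    """Map one pupil-glint vector to a scene-image point."""
    f = poly_features(vx, vy)
    return tuple(sum(c * fi for c, fi in zip(coef, f)) for coef in coefs)


# Synthetic 9-point calibration generated from a made-up ground-truth mapping.
pcr = [(x, y) for y in (-10.0, 0.0, 10.0) for x in (-10.0, 0.0, 10.0)]
scene = [(640 + 30 * x + 0.2 * x * x, 360 + 28 * y + 0.1 * x * y) for x, y in pcr]

coefs = fit_gaze_map(pcr, scene)
print(predict(coefs, 5.0, -5.0))
```

With six coefficients per axis, a 9-point grid leaves only three spare constraints, which is part of why such fits extrapolate poorly outside the calibrated region.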

Appearance-based deep models are the modern extreme of this philosophy: a network maps the raw eye image (or crop) straight to a gaze vector, having learned the mapping from large datasets. They are powerful, and robust across people and slippage if trained well, but they are data-hungry and harder to reason about when they fail.

Model-based (geometric) estimation

Build an explicit 3D model of the eye and the camera setup, then solve for the eye orientation that explains the observed features. With known LED positions, known camera intrinsics, and multiple glints, you can reconstruct the center of corneal curvature in 3D (each glint constrains it via the law of reflection), locate the pupil center in 3D by refraction through the cornea, and define the optical axis as the line through those two points. This gives a true 3D line of sight in space, naturally handles distance, and degrades gracefully under slippage because it's grounded in physical geometry rather than a fitted curve. The cost is hardware complexity (multiple calibrated LEDs and cameras) and sensitivity to errors in those calibration parameters.

The kappa problem — optical vs. visual axis

Here's a subtlety that catches newcomers. The geometry above gives you the optical axis — the symmetry axis of the eye's optics, through cornea and pupil. But the fovea is not on the optical axis. It sits a few degrees off to the side, so the line you actually see along — the visual axis connecting fovea to fixation point — is offset from the optical axis by an angle called kappa, typically around 4–8 degrees and different for each person and each eye.

No amount of clever image processing recovers kappa, because it's an internal anatomical offset invisible from outside the eye. The only way to find it is to have the person look at a known target and measure the discrepancy. This is the deep reason every eye tracker — no matter how good its hardware — needs at least a brief per-person calibration. It's not fixing sensor error; it's measuring an offset that is fundamentally unobservable otherwise.
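
A minimal sketch of what "measure the discrepancy" can mean in practice: take the optical axis as the direction from the corneal center of curvature to the pupil center, compare it with the direction to a known target, and store the difference as a per-eye offset. The coordinate frame (x right, y up, z forward, meters), the example positions, and the treatment of kappa as a simple additive (yaw, pitch) offset are all simplifying assumptions; a full model would apply kappa as a 3D rotation in eye coordinates.

```python
import math


def direction_to_angles(d):
    """Convert a 3D direction (x right, y up, z forward) to (yaw, pitch) in degrees."""
    x, y, z = d
    yaw = math.degrees(math.atan2(x, z))
    pitch = math.degrees(math.atan2(y, math.hypot(x, z)))
    return yaw, pitch


def optical_axis(cornea_center, pupil_center):
    """Direction of the line through the corneal center and the pupil center."""
    return tuple(p - c for p, c in zip(pupil_center, cornea_center))


def estimate_kappa(axis, eye_position, target):
    """Angular offset (yaw, pitch) between the optical axis and the target direction."""
    to_target = tuple(t - e for t, e in zip(target, eye_position))
    target_yaw, target_pitch = direction_to_angles(to_target)
    axis_yaw, axis_pitch = direction_to_angles(axis)
    return target_yaw - axis_yaw, target_pitch - axis_pitch


def apply_kappa(axis, kappa):
    """Approximate visual-axis angles: optical-axis angles plus the stored offset."""
    axis_yaw, axis_pitch = direction_to_angles(axis)
    return axis_yaw + kappa[0], axis_pitch + kappa[1]


# Hypothetical single-point calibration: the wearer fixates a target 1 m straight
# ahead while the reconstructed optical axis points 5 degrees to the right.
cornea = (0.0, 0.0, 0.0)
pupil = (0.0042 * math.sin(math.radians(5)), 0.0, 0.0042 * math.cos(math.radians(5)))
axis = optical_axis(cornea, pupil)
kappa = estimate_kappa(axis, cornea, target=(0.0, 0.0, 1.0))
print("kappa (yaw, pitch) in degrees:", kappa)
print("corrected gaze angles:", apply_kappa(axis, kappa))
```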

Stage 3 — Calibration

Calibration ties the eye-feature-to-gaze mapping to reality, and on a wearable it also ties the eye cameras' frame of reference to the scene camera's frame so gaze can be reported as a point in the world view.

The standard procedure: the wearer fixates a sequence of known targets — a 5-, 9-, or 13-point grid, or a single moving dot — while the system records eye features at each. For a regression model, these pairs directly fit the polynomial coefficients. For a model-based system, they pin down kappa and any residual geometric offsets. More points and a wider spread improve accuracy across the field of view but cost the user time and patience; there's a real engineering trade-off between calibration burden and quality.

Several refinements matter in practice:

Single-point and implicit calibration. A model-based system with well-calibrated glint geometry leaves so little unknown that one point can suffice for kappa, which is why some consumer headsets feel "calibration-free." Research also exploits implicit calibration: smooth-pursuit targets, or using known salient points in the scene (people reliably look at faces, or at a cursor they're controlling) to calibrate without an explicit step.
Binocular calibration and convergence. With both eyes tracked, the two visual axes intersect at the fixation point, and that vergence yields a depth estimate. It also lets the system sanity-check the two eyes against each other.
Drift and slippage correction. Over a session the glasses slip, skin moves, and accuracy decays. Robust systems re-anchor opportunistically — recalibrating on known scene targets, or detecting slippage from changes in the eye model and compensating. This is exactly where P-CR's translation-invariance earns its keep.
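
The vergence idea from the binocular refinement above can be sketched geometrically. Measured gaze rays almost never intersect exactly, so a common workaround is to take the midpoint of the shortest segment between them. The eye positions (6 cm apart), the coordinate frame in meters, and the function names below are assumptions for the example.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vergence_point(o1, d1, o2, d2):
    """Midpoint of the shortest segment between two gaze rays, or None if near-parallel."""
    w = sub(o1, o2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # rays (nearly) parallel: gaze at optical infinity, depth undefined
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = add(o1, scale(d1, t1))
    p2 = add(o2, scale(d2, t2))
    return scale(add(p1, p2), 0.5)


# Hypothetical eyes 6 cm apart, both aimed at a point 0.5 m straight ahead.
left_eye, right_eye = (-0.03, 0.0, 0.0), (0.03, 0.0, 0.0)
target = (0.0, 0.0, 0.5)
fixation = vergence_point(left_eye, sub(target, left_eye),
                          right_eye, sub(target, right_eye))
print(fixation)
```

Because the two rays become nearly parallel for distant targets, small angular errors turn into large depth errors, so vergence depth is mostly useful at near range.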

Stage 4 — From a gaze stream to events

The raw output is now a time series: a gaze point in the scene image (plus often a 3D direction and confidence) at every frame. But a list of coordinates isn't yet behavior. The eye moves in a small vocabulary of motions, and the analysis layer's job is to segment the stream into them.

Fixations — periods (roughly 100–400 ms) where gaze is held nearly stationary on a target. This is when visual information is actually taken in. Fixation location and duration are the bread-and-butter measures of attention.

Saccades — the ballistic flicks between fixations, lasting 30–80 ms and reaching angular velocities up to 500–900°/s. Vision is largely suppressed during saccades, so you see almost nothing while your eyes are in flight, which is why the world doesn't blur as your gaze darts around.

Smooth pursuit — the slow, continuous tracking of a moving object, the one motion that can hold a moving target on the fovea. It only works with something to follow; you can't smoothly sweep your eyes across a static scene on command.

Blinks — eyelid closures that knock out the pupil signal entirely for 100–400 ms, leaving gaps that must be detected and bridged, not mistaken for movement.

The standard segmentation tools are I-VT (velocity-threshold: label samples above some angular-velocity threshold as saccades, the rest as fixations) and I-DT (dispersion-threshold: group consecutive samples that stay within a small spatial window and last long enough). I-VT is fast and natural for high-frame-rate data; I-DT is more robust on noisier, lower-rate streams. More sophisticated approaches add adaptive thresholds, microsaccade detection, or learned classifiers — but the velocity/dispersion intuition underlies almost all of them.
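
A bare-bones I-VT can be sketched in a few lines. This is an illustration under assumptions: gaze is given as (x, y) angles in degrees with timestamps in seconds, angular distance uses a flat small-angle approximation, the 30°/s threshold and 100 ms minimum duration are typical starting values rather than anything prescribed here, and blinks or missing samples are not handled.

```python
import math


def ivt_labels(times, xs, ys, velocity_threshold=30.0):
    """Label each sample 'fix' or 'sac' from point-to-point angular velocity (deg/s)."""
    labels = ["fix"]  # the first sample has no velocity; treated as fixation here
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        dist = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        velocity = dist / dt if dt > 0 else float("inf")
        labels.append("sac" if velocity > velocity_threshold else "fix")
    return labels


def group_fixations(times, xs, ys, labels, min_duration=0.1):
    """Merge runs of 'fix' samples into fixation events, dropping very short runs."""
    fixations = []
    start = None
    for i, label in enumerate(labels + ["sac"]):  # sentinel closes a trailing run
        if label == "fix" and start is None:
            start = i
        elif label != "fix" and start is not None:
            end = i - 1
            duration = times[end] - times[start]
            if duration >= min_duration:
                n = end - start + 1
                fixations.append({
                    "start": times[start],
                    "duration": duration,
                    "x": sum(xs[start:end + 1]) / n,
                    "y": sum(ys[start:end + 1]) / n,
                })
            start = None
    return fixations


# Synthetic 100 Hz stream: hold at 0 deg, a 10 deg saccade over a few samples, hold at 10 deg.
xs = [0.0] * 20 + [2.5, 5.0, 7.5] + [10.0] * 20
ys = [0.0] * len(xs)
times = [i * 0.01 for i in range(len(xs))]

labels = ivt_labels(times, xs, ys)
fixations = group_fixations(times, xs, ys, labels)
for f in fixations:
    print(f)
```

I-DT would replace the velocity test with a check that a sliding window of samples stays inside a small spatial extent for long enough; the grouping step is the same in spirit.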

From these events come the higher-level products: scan-paths (the ordered sequence of fixations, the "route" the eye took), heatmaps (fixation density aggregated over a stimulus, ideal for showing where a group looked), areas of interest (AOIs — regions you define, so you can measure time-to-first-fixation, dwell time, and revisits), and pupillometry (pupil diameter as a proxy for cognitive load and arousal — with the heavy caveat that the pupil reacts far more strongly to light than to thought, so luminance must be controlled before any cognitive interpretation is credible).
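
The AOI measures are simple bookkeeping once fixations exist. The sketch below assumes rectangular AOIs given as (x_min, y_min, x_max, y_max) in the same coordinates as the fixations, and fixation records with `start`, `duration`, `x`, and `y` fields; those field names, the AOI names, and the example values are all hypothetical.

```python
def aoi_metrics(fixations, aois):
    """Dwell time, time to first fixation, and visit count for each rectangular AOI."""
    result = {name: {"dwell": 0.0, "first": None, "visits": 0} for name in aois}
    previous = None
    for f in sorted(fixations, key=lambda f: f["start"]):
        hit = None
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= f["x"] <= x1 and y0 <= f["y"] <= y1:
                hit = name
                break  # overlapping AOIs: first match wins in this sketch
        if hit is not None:
            m = result[hit]
            m["dwell"] += f["duration"]
            if m["first"] is None:
                m["first"] = f["start"]
            if hit != previous:
                m["visits"] += 1  # a new visit starts when gaze enters from elsewhere
        previous = hit
    return result


aois = {"logo": (0, 0, 100, 100), "price": (200, 0, 300, 100)}
fixations = [
    {"start": 0.0, "duration": 0.20, "x": 50, "y": 50},    # logo
    {"start": 0.3, "duration": 0.30, "x": 250, "y": 40},   # price
    {"start": 0.7, "duration": 0.25, "x": 60, "y": 30},    # logo again (revisit)
    {"start": 1.0, "duration": 0.10, "x": 70, "y": 35},    # still logo, same visit
    {"start": 1.2, "duration": 0.20, "x": 500, "y": 500},  # outside every AOI
]

metrics = aoi_metrics(fixations, aois)
print(metrics)
```

Revisits are then `visits - 1` for any AOI that was looked at, and time-to-first-fixation is `first` measured from stimulus onset.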

The wearable's extra hard problem: mapping gaze onto a moving world

On a screen-based tracker the stimulus is fixed, so a gaze point maps trivially onto known content. On a wearable, the scene camera moves with the head, so the same real-world object lands at a different image pixel every frame. To say "they looked at the product on the shelf" rather than "they looked at pixel (840, 612) in frame 1180," you must register gaze against the world. That means computer vision on the scene video: detecting fiducial markers (AprilTags/ArUco) placed in the environment, or markerless SLAM / visual mapping that builds a model of the scene and localizes the camera within it, or object/face detection to label what was fixated. This world-registration step is unique to wearables and is often where most of the engineering effort — and error — actually lives.
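
For the fiducial-marker case with a planar surface, registration usually reduces to a homography per frame. The sketch below assumes that matrix is already known: in a real pipeline it would be estimated each frame from detected marker corners, which is not shown. The matrix values, the shelf AOI, and the function names are made up for illustration.

```python
def apply_homography(h, point):
    """Map a scene-camera pixel to reference-surface coordinates with a 3x3 homography."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return (u / w, v / w)


def inside(rect, point):
    """True if point lies in rect = (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = rect
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


# Hypothetical homography for one frame (a pure scale and shift, for readability).
H_frame = [
    [0.5, 0.0, -100.0],
    [0.0, 0.5, -50.0],
    [0.0, 0.0, 1.0],
]

gaze_px = (840, 612)                   # gaze in the scene-camera image
on_surface = apply_homography(H_frame, gaze_px)
product_aoi = (300, 200, 400, 300)     # made-up AOI in surface coordinates

print("gaze on surface:", on_surface)
print("looked at product:", inside(product_aoi, on_surface))
```

Because the homography changes every frame as the head moves, the same surface coordinate can be compared across frames and across wearers even though the raw pixel coordinates differ.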

Validation: accuracy, precision, and trust

You cannot interpret eye data without knowing its quality, and two metrics matter most.

Accuracy is the average angular error between reported gaze and the true target, in degrees of visual angle. Good wearable systems land around 0.5–1.5° under favorable conditions. One degree sounds tiny, but at arm's length it's roughly a centimeter, enough to confuse two adjacent words on a page, and across a room it grows to several centimeters, enough to confuse two neighboring objects or faces. Accuracy is not constant: it's typically best near where you calibrated (often screen/scene center) and worse toward the periphery and at distances far from calibration.

Precision is the consistency of measurement when gaze is genuinely still, usually reported as RMS sample-to-sample deviation or as the spatial spread of samples during a fixation. A tracker can be precise but inaccurate (tightly clustered around the wrong spot, as with a calibration offset) or accurate but imprecise (centered correctly but noisy). They're independent failure modes and you need both.
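
Both metrics are easy to compute from samples recorded while the wearer fixates a validation target. The sketch below assumes gaze and target are expressed as (x, y) angles in degrees and uses flat Euclidean distance as a small-angle approximation; accuracy is taken here as the mean of per-sample offsets (some tools use the offset of the mean gaze position instead). The sample values are invented.

```python
import math


def accuracy_deg(samples, target):
    """Mean angular offset between gaze samples and the known target, in degrees."""
    return sum(math.hypot(x - target[0], y - target[1]) for x, y in samples) / len(samples)


def precision_rms_s2s(samples):
    """RMS of successive sample-to-sample angular distances, in degrees."""
    squared = [(x2 - x1) ** 2 + (y2 - y1) ** 2
               for (x1, y1), (x2, y2) in zip(samples, samples[1:])]
    return math.sqrt(sum(squared) / len(squared))


def size_at_distance(angle_deg, distance):
    """Physical extent subtended by a visual angle at a viewing distance (same units)."""
    return 2 * distance * math.tan(math.radians(angle_deg) / 2)


# Invented validation samples: tightly clustered, but about 1 degree right of the target.
samples = [(1.0, 0.0), (1.1, 0.0), (0.9, 0.0), (1.0, 0.0)]
target = (0.0, 0.0)

print("accuracy (deg):", accuracy_deg(samples, target))
print("precision RMS-S2S (deg):", precision_rms_s2s(samples))
print("1 deg at 60 cm, in cm:", size_at_distance(1.0, 60.0))
print("1 deg at 5 m, in cm:", size_at_distance(1.0, 500.0))
```

The example data is the "precise but inaccurate" case: a small RMS value alongside a roughly one-degree offset.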

Real-world quality is degraded by a familiar rogues' gallery: glasses slippage during a session; pupil size artifacts (the pupil center shifts slightly as the pupil dilates, injecting gaze error that correlates with lighting and arousal); difficult eye physiologies (droopy lids, long lashes, heavy eye makeup, some contact lenses); bright sunlight whose broadband IR swamps the LEDs' controlled glints; and data loss from blinks and from glances beyond the range the eye cameras can track. A crucial discipline, especially in research, is reporting the proportion of valid samples and validating accuracy after the task, not just trusting the pre-task calibration screen. A beautiful scan-path computed from 60%-valid data is fiction.

Where this is going

Three trends are reshaping wearable gaze tracking. Deep learning is absorbing the brittle hand-tuned detection stages — end-to-end and segmentation-based methods now handle occlusion, makeup, and lighting that broke classical pipelines, and they shrink or remove explicit calibration by learning person-invariant features. Event cameras — sensors that report per-pixel brightness changes with microsecond latency instead of full frames — promise to track saccades at effectively kilohertz rates with tiny power and data budgets, which is exactly what always-on wearables need. And the entire field is being pulled forward by VR/AR: foveated rendering (render sharply only where the fovea points, save the GPU everywhere else) and gaze-as-input have turned eye tracking from a niche research instrument into a mass-market component, which in turn funds the sensors, datasets, and silicon that make the next generation of wearables better.

The throughline across all of it is the same chain this essay walked: light the eye, find the pupil and the glints, turn those features into a line of sight, anchor that line to a per-person offset and a moving world, and only then — carefully, with the error bars attached — call it attention. Every reliable result downstream depends on respecting every link in that chain.