Sleep Tracker Accuracy — Body Handbook

Your sleep tracker knows roughly when you slept and roughly for how long. It does not know how much deep sleep or REM you got — those numbers are educated guesses with error bars wider than most people's night-to-night swings. The data is useful as a long-run trend mirror and, on newer Apple and Samsung watches, as a screen for sleep apnea — a condition most people who have it don't know they have. But treated as a daily report card, it can also do the thing nobody buys it for: make sleep worse.

Know · As-needed · Evidence: Moderate · Chapter: Sleep

The watch or ring on your hand cannot see your brain. It reads movement, pulse, and a few autonomic side effects, then guesses at sleep architecture. It gets total sleep time about right. Sleep stages, especially deep sleep, it does not. The biggest real win in this category is the FDA-cleared sleep-apnea notification on recent Apple Watch models; the biggest real risk is reading the score as a verdict on a night that felt fine.

Polysomnography — the sleep-lab gold standard — sticks electrodes to your scalp to read your brain's electrical activity, electrodes near your eyes to catch the eye movements that mark REM, electrodes under your chin to check muscle tone, plus heart-rate and breathing channels. Stage names like deep sleep and REM are defined on what those traces show every 30 seconds. Consumer wearables have none of that.

A watch or ring sees your wrist or finger moving (or not), and reads your pulse through the skin with a green light. From those two signals — plus skin temperature and blood-oxygen on some devices — a proprietary algorithm guesses at what an EEG would have shown de Zambotti et al. 2024. The pulse signal does carry information about sleep stages, because your nervous system shifts the heart's beat-to-beat rhythm across light, deep, and REM sleep Altini & Kinnunen 2021. But that rhythm is a downstream echo, not the staging substrate itself. The device is reading the side effects of sleep on the body, not sleep itself — that's the whole reason the numbers are imperfect.

What it actually gets right

Sleep versus wake, the basic "were you asleep at all" question, is what modern wearables nail. The top rings and watches clear 95% sensitivity for sleep in healthy adults Chee et al. 2024. Because they lean toward calling quiet wakefulness "sleep," total time can run slightly long on an individual night; pooled across two dozen validation studies, though, the average bias runs the other way, about 17 minutes less than the lab measures Lee et al. 2024. Either way, you can trust the rough number.

Stage breakdown is where the picture gets ugly.

Different devices fail in different directions. The original Oura ring underestimated deep sleep by about 20 minutes per night and overestimated REM by 17 de Zambotti et al. 2019; the Fitbit Charge 4, tested in chronic-insomnia patients, underestimated deep sleep by 41 minutes Liang et al. 2022. The newest Oura algorithm, validated across more than 420,000 sleep epochs in healthy adults, is genuinely closer — above 90% sensitivity for sleep stages Svensson et al. 2024 — but that's the best published case, and it's a ring on healthy people sleeping normally.

For heart-rate variability measured overnight, the news is better. Validated against a medical chest-strap ECG across 536 nights, the latest Oura ring averaged about 6% error against the medical reference and Whoop about 8%; Garmin and Polar were noticeably worse, at roughly 11% and 16% Dial et al. 2025. The trend you see night by night on a good device is roughly what a clinical sensor would say.

What people read into the numbers that isn't there

The "deep sleep" line on the dashboard looks like a measurement. It's a guess with a 20-to-40-minute error bar per night. A drop from 1h45m to 55 minutes from Tuesday to Wednesday might mean nothing changed — the noise floor of the device swallows most of the variance you'll see week to week de Zambotti et al. 2019 Liang et al. 2022 Cho et al. 2023.

Sleep scores — Oura Readiness, Whoop Recovery, Fitbit Sleep Score — are proprietary blends of duration, stage guesses, heart rate, HRV, and movement. The weights are not published. None has been validated in a peer-reviewed trial against next-day cognition, mood, or athletic performance Goldstein et al. 2021. Treat the score as a personal trend signal, not a verdict on the night.

And the move that brought "orthosomnia" into the medical vocabulary: trusting the device over your own felt experience. In the original case series, patients came to sleep clinics convinced they barely slept. Sent for an actual lab study, the recordings showed normal sleep — and several patients continued to believe the tracker Baron et al. 2017. The wearable's specificity for wake is the weak point: when you're lying still awake it routinely calls it sleep, and the reverse happens too. Your felt experience is not noise.

What happens if you start chasing the numbers

At first it's interesting. You see your nights laid out for the first time and you make a few small changes — earlier bedtime, less wine on a weekday — and the score moves up. Real wins.

Then a bad number lands on a night that felt fine, and the question shifts from "how do I feel?" to "what did the device say?" A week in, you're spending an extra forty minutes in bed trying to hit a sleep-duration target. The number doesn't move; you lie there awake, which the device also doesn't see clearly. The bed starts to feel like somewhere you go to perform.

This is the failure mode behavioral sleep medicine named in 2017: orthosomnia, the unhealthy pursuit of perfect sleep, first described in patients who trusted their wearables over their clinicians Baron et al. 2017. It looks like classic insomnia from the outside, because extra time in bed chasing better sleep is exactly what produces insomnia. The first general-population estimate puts prevalence somewhere between 3% and 14% of adults surveyed, depending on how strictly the condition is defined, concentrated in people who already report insomnia symptoms Jahrami et al. 2024.

Six months in, the social signal: friends ask why you keep talking about your sleep score. A year in: the score is what you check before you check whether you feel rested. The people who land badly here end up with a worse relationship with sleep than before they put the device on. The people who land well are barely glancing at it.

What you actually get out of one, when it works

In the first week: a calibration. You find out your average is forty minutes shorter than you thought, and you notice you go to bed an hour later than you tell yourself. Tracking is a mirror; most of the value is in being seen.

Over the first month, if the numbers nudge a real change — moving bedtime earlier, cutting the second evening drink, holding a consistent wake time — modest gains land. The one randomized trial on wearable use itself (Whoop, one week, healthy adults) showed improved subjective sleep quality with the feedback turned on Berryhill et al. 2020. Better-rested days look like sharper meetings and fewer mid-afternoon crashes — the standard payoff of consistent sleep, not a wearable-specific gift. Don't expect dramatic. Expect the small, real version of consistent.

Over months, the trend view matters more than any single night. An Apple Watch with the FDA-cleared apnea-notification feature checks 30 nights of breathing patterns and tells you if you're showing consistent signs of moderate-to-severe sleep apnea; the algorithm was cleared on the strength of a 1,448-person trial Apple 2024. Roughly 30 million American adults have sleep apnea and most don't know it. For the fraction of wearers who get a notification and book the follow-up, the real payoff isn't a better sleep score. It's not having a heart attack at 55.

Years in, used loosely, a tracker becomes background information — checked on bad weeks, ignored on good ones, occasionally surfacing a pattern (jet-lag recovery time, training-load effect on HRV) that explains something you were already feeling. Useful, modest, durable. That's the right relationship.

How to use one without making things worse

Three rules cover most of it. Trust the long-run trend, ignore the single night. Trust the duration and timing numbers; ignore the deep-sleep and REM breakdown. Take any apnea notification seriously.

Cost varies. Mid-range watches and rings sit around $150–$400 up front; subscription devices like Whoop ($239/year) and Oura's premium tier (about $72/year on top of the ring) add ongoing cost. The work itself is small — wear it, charge it. Battery life runs from about a day on Apple Watch to about a week on Oura and Whoop, so devices that need nightly charging will miss more nights than ones that don't.

Who actually benefits, who should skip

Healthy adults curious about their patterns, athletes adjusting training around recovery trends, people whose partner has noticed snoring and gasping, shift workers who want objective data on their disrupted schedules — these are the groups for whom a wearable's numbers add real information.

The people who should think twice: anyone already in behavioral treatment for insomnia, anyone who notices their mood shift when they see a bad score, anyone with a clinical history of health anxiety or perfectionism. The data feeds the loop the treatment is trying to break Baron et al. 2017 Jahrami et al. 2024. The American Academy of Sleep Medicine's position is the same in clinician language: consumer trackers cannot diagnose or treat sleep disorders, but they can be useful for opening a conversation with a clinician Khosla et al. 2018.

Three adjacent topics worth knowing about. Sleep debt — what reliably short sleep actually costs you, separate from any device counting the hours. Cognitive behavioral therapy for insomnia (CBT-I), which is the strongest evidence-backed treatment for chronic insomnia and which most wearable users have never heard of. And sleep apnea itself — if you got a screening notification, the notification is the start of a clinical workup, not the end of one.

Substance + claimed effects

Consumer wearable sleep tracking covers wrist-worn watches and bands (Apple Watch, Fitbit, Garmin, Whoop), rings (Oura), and nearables (under-mattress mats, bedside radar) that estimate sleep parameters from a combination of triaxial accelerometry and photoplethysmography (PPG) — with some devices adding skin temperature, SpO₂, and consumer-grade ECG. From these inputs, devices report total sleep time (TST), sleep onset latency (SOL), wake after sleep onset (WASO), sleep efficiency, and four-stage estimates (wake, light/N1+N2, deep/N3, REM), plus nocturnal heart rate, heart-rate variability (HRV), respiratory rate, and aggregated "sleep scores" or "readiness scores." This entry covers (a) measurement accuracy versus the polysomnography (PSG) reference standard for TST, stages, and HRV; (b) the downstream behavioral consequences — orthosomnia and sleep anxiety on one side, beneficial behavior change and apnea screening on the other; and (c) the current clinical posture toward consumer devices, including the FDA-cleared sleep-apnea notification on Apple Watch Apple 2024 and the AASM's position that consumer trackers cannot diagnose or treat sleep disorders but may augment clinical conversation Khosla et al. 2018.

Evidence by addressing question

Mechanism — how a watch infers what an EEG measures

Science / mechanism. Clinical PSG scores sleep from synchronized EEG (cortical activity), EOG (eye movements), EMG (chin/leg muscle tone), and ECG/respiratory channels; the AASM staging rules (wake, N1, N2, N3, REM) are defined on these traces in 30-second epochs. Consumer wearables have none of these channels. Wrist and ring devices rely on a 3-axis accelerometer (movement and micro-movements) and a PPG sensor that derives pulse rate and beat-to-beat variability from skin reflectance; rings and some watches add skin temperature and SpO₂. Sleep is therefore inferred rather than observed: wake is detected from movement, and stages are estimated from autonomic surrogates (heart-rate-variability features that track parasympathetic drive in NREM and shift in REM), respiratory modulation, body-temperature drift, and motion patterns, all fed into proprietary machine-learning classifiers de Zambotti et al. 2024. Critically, because brain activity is never observed, distinguishing N1 from N2, or N3 from a quiet REM period with low movement, is intrinsically harder than it is with EEG: the classifier is reading a downstream signal, not the staging substrate itself Altini & Kinnunen 2021. Accelerometer-only devices can separate wake from sleep at reasonable accuracy but cannot resolve stages; PPG adds the autonomic features that make multi-stage estimation feasible at all.

Evidence — accuracy versus PSG

Total sleep time and efficiency. The largest synthesis to date is Lee et al. (2024), a meta-analysis of 24 validation studies (798 participants) covering Fitbit, Whoop, Garmin, Apple Watch, Xiaomi, and several research-grade wristbands. Pooled mean difference for TST versus PSG was −16.9 minutes (95% CI −26.3 to −7.4) — i.e., wearables systematically underestimate sleep duration on average — and sleep efficiency also differed significantly. Authors conclude that wrist-worn trackers "are not as reliable as polysomnography in measuring key sleep parameters such as total sleep time, sleep efficiency, and sleep latency" but remain useful for tracking general sleep patterns Lee et al. 2024. The error direction is not universal: in healthy adults, several devices have shown the opposite bias (small overestimation of TST due to confusing quiet wakefulness with sleep — a low-specificity failure mode actigraphy has shared for decades). In the largest single-night five-device comparison, every device except the Garmin Vivosmart estimated TST comparably to research-grade actigraphy Chinoy et al. 2024.
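
To make the idea of a mean difference concrete, here is a minimal Python sketch of how a bias and 95% limits of agreement can be computed for one paired sample. The function name and the nightly values are invented for illustration and are not taken from Lee et al.; a real meta-analysis also weights each study, which this sketch does not attempt.

```python
from statistics import mean, stdev

def bias_and_limits(device_min, psg_min):
    """Return (mean bias, lower limit, upper limit) in minutes, device minus PSG."""
    diffs = [d - p for d, p in zip(device_min, psg_min)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Invented total-sleep-time values for five nights, in minutes (assumption).
device = [400, 415, 380, 450, 395]
psg = [420, 430, 400, 460, 415]

bias, lower, upper = bias_and_limits(device, psg)
print(f"bias = {bias:.1f} min, limits of agreement = [{lower:.1f}, {upper:.1f}]")
```

A negative bias means the device reads short. The limits of agreement show how far a single night can stray from the reference even when the average bias looks small.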

Sleep/wake detection (the 2-class problem). This is where consumer wearables do best. A three-device study of Oura Gen3, Fitbit Sense 2, and Apple Watch Series 8 reported sensitivity for detecting sleep (vs. wake) of ≥95% for all three Chee et al. 2024. Specificity, the ability to detect wake, is the harder direction: when the participant lies awake but still, the device tends to call it sleep, so specificity is typically lower (~50–70% across devices). This sensitivity-specificity asymmetry is the long-standing actigraphy pattern and explains the small TST overestimation seen in low-WASO participants.
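
The asymmetry can be sketched in a few lines of Python. This is an illustrative calculation only: the function name, the label scheme ("S" for sleep, "W" for wake), and the example night are assumptions, not data from any cited study.

```python
def sleep_wake_metrics(psg, device):
    """Epoch-by-epoch agreement; PSG labels are the reference ('S' or 'W')."""
    pairs = list(zip(psg, device))
    tp = sum(1 for p, d in pairs if p == "S" and d == "S")  # sleep called sleep
    fn = sum(1 for p, d in pairs if p == "S" and d == "W")  # sleep called wake
    tn = sum(1 for p, d in pairs if p == "W" and d == "W")  # wake called wake
    fp = sum(1 for p, d in pairs if p == "W" and d == "S")  # wake called sleep
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "tst_bias_min": (fp - fn) * 0.5,  # each epoch is 30 s = 0.5 min
    }

# Invented night: 8 hours in bed = 960 epochs, 860 asleep and 100 awake per PSG.
psg = ["S"] * 860 + ["W"] * 100
# Hypothetical device: misses 20 sleep epochs and calls 40 wake epochs sleep.
device = ["S"] * 840 + ["W"] * 20 + ["S"] * 40 + ["W"] * 60

print(sleep_wake_metrics(psg, device))
```

With these made-up numbers the device looks excellent on sensitivity while still adding minutes to total sleep time, because the wake epochs it misses outnumber the sleep epochs it misses.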

Sleep stages (the 4-class problem). Performance drops sharply. The largest multicenter validation (Cho et al. 2023) ran 11 consumer trackers against PSG across 543 hours of recordings and 349,114 epochs. Macro F1 score for 4-stage classification ranged from 0.69 (best) down to 0.26 (worst) across devices — i.e., the difference between "usable as a coarse trend" and "near-random within stages." Wearables outperformed nearables for deep-sleep detection; nearables were better for REM and wake on some metrics Cho et al. 2023. Device-specific patterns: the original de Zambotti (2019) Oura validation showed underestimation of N3 (~20 min) and overestimation of REM (~17 min) versus PSG de Zambotti et al. 2019; with the Gen3 ring and Oura's OSSA 2.0 algorithm, accuracy improved markedly — Svensson et al. (2024) reported sleep-stage agreement exceeding 90% sensitivity and 70% specificity over 421,045 epochs across 96 participants, the strongest published performance for a ring-form device Svensson et al. 2024. Fitbit Charge 4 in chronic-insomnia patients underestimated deep sleep by ~41 minutes and overestimated light sleep by ~38 minutes versus PSG Liang et al. 2022. The summary picture: deep-sleep numbers from consumer devices should be treated as soft estimates with wide error bars, not as a measurement.
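
For readers unfamiliar with macro F1, the following Python sketch shows one way to compute it from paired epoch labels. The stage names, function name, and example labels are assumptions made for illustration; this is not the scoring code used by Cho et al.

```python
STAGES = ["wake", "light", "deep", "rem"]

def macro_f1(reference, predicted, labels=STAGES):
    """Unweighted mean of per-stage F1 scores over paired epoch labels."""
    scores = []
    for stage in labels:
        tp = sum(1 for r, p in zip(reference, predicted) if r == stage and p == stage)
        fp = sum(1 for r, p in zip(reference, predicted) if r != stage and p == stage)
        fn = sum(1 for r, p in zip(reference, predicted) if r == stage and p != stage)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented 100-epoch example: the device scores half of wake and half of deep as light.
reference = ["wake"] * 10 + ["light"] * 50 + ["deep"] * 20 + ["rem"] * 20
predicted = (["wake"] * 5 + ["light"] * 5 + ["light"] * 50
             + ["deep"] * 10 + ["light"] * 10 + ["rem"] * 20)

accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(reference, predicted):.2f}")
```

Because every stage counts equally, macro F1 drops faster than raw accuracy when a device mislabels the less common stages such as wake and deep sleep.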

Nocturnal HRV and resting heart rate. Dial et al. (2025) validated five devices (Garmin Fenix 6, Oura Gen3, Oura Gen4, Polar Grit X Pro, Whoop 4.0) against single-lead ECG over 536 nights. For RMSSD-derived HRV: Oura Gen4 (CCC = 0.99, MAPE 5.96%), Oura Gen3 (CCC = 0.97, MAPE 7.15%), Whoop (CCC = 0.94, MAPE 8.17%), Garmin (CCC = 0.87, MAPE 10.52%), Polar (CCC = 0.82, MAPE 16.32%). For resting heart rate, agreement was tighter — Oura devices both within ~2% of ECG, Whoop within 3% Dial et al. 2025. The nocturnal context (low motion, stable peripheral perfusion) is the best case for PPG-derived HRV; daytime values are substantially less reliable. Within-person trends across consecutive nights — the actual use case for most wearers — track ECG closely even on devices with worse absolute agreement, because measurement bias is fairly stable per individual.
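
The three statistics named above (RMSSD, MAPE, and the concordance correlation coefficient) can be sketched in Python as follows. The function names and sample values are invented for illustration, and the CCC uses the population-variance form of Lin's formula; the cited study may have used a different estimator.

```python
from statistics import mean

def rmssd(rr_ms):
    """Root mean square of successive differences between beat intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return (sum(d * d for d in diffs) / len(diffs)) ** 0.5

def mape(reference, device):
    """Mean absolute percentage error of device values against the reference."""
    return 100 * mean(abs(d - r) / r for r, d in zip(reference, device))

def ccc(reference, device):
    """Lin's concordance correlation coefficient (population-variance form)."""
    mx, my = mean(reference), mean(device)
    vx = mean((x - mx) ** 2 for x in reference)
    vy = mean((y - my) ** 2 for y in device)
    cov = mean((x - mx) * (y - my) for x, y in zip(reference, device))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Invented nightly HRV values in ms: ECG reference versus a hypothetical wearable.
ecg_hrv = [40, 50, 60, 70]
ring_hrv = [42, 50, 57, 70]

print(f"RMSSD of a short beat series: {rmssd([800, 810, 790, 805]):.1f} ms")
print(f"MAPE = {mape(ecg_hrv, ring_hrv):.1f}%, CCC = {ccc(ecg_hrv, ring_hrv):.3f}")
```

MAPE captures how far off the device is on a typical night; CCC additionally penalizes a consistent offset, which is why a device can track trends well and still score lower on absolute agreement.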

Practice / clinical use

Position statement. The AASM 2018 position statement is explicit: "Given the lack of validation and United States Food and Drug Administration (FDA) clearance, consumer sleep technologies cannot be utilized for the diagnosis and/or treatment of sleep disorders at this time. However, CSTs may be utilized to enhance the patient-clinician interaction when presented in the context of an appropriate clinical evaluation" Khosla et al. 2018. The 2021 update reiterated the position with elaboration on AI/ML algorithms and emphasized the need for raw-data access and external validation Goldstein et al. 2021.

Apnea detection. The boundary case is sleep-apnea screening. In September 2024 Apple received FDA 510(k) clearance for sleep-apnea notifications on Apple Watch Series 9, 10, and Ultra 2 — the algorithm uses wrist-worn accelerometry to detect breathing-disturbance patterns, accumulates 30 nights, and notifies users showing consistent moderate-to-severe signs. Clearance trial: N=1,448, AHI range 0 to ≥30 Apple 2024. Samsung Galaxy Watch received similar clearance earlier in 2024. This is genuinely consequential: ~30 million US adults have OSA and the majority are undiagnosed; a screening notification on a device most people already wear is a meaningful funnel into actual diagnostic testing (home sleep apnea test or in-lab PSG). The Apple/Samsung notifications are not diagnostic — a positive notification is meant to prompt a clinical evaluation, not to replace one.

State of the science. The 2024 consensus paper from de Zambotti and the Sleep Research Society working group lays out where the field has landed: wearables are a credible tool for population-scale and longitudinal sleep research (the multi-night, naturalistic data they generate is something PSG cannot produce), can co-record autonomic and circadian features that actigraphy never could, but require careful interpretation, ideally validated algorithms, and explicit framing of what each device can and cannot resolve de Zambotti et al. 2024.

Misconceptions

"Deep sleep" as reported is a precise measurement. It is not. Consumer devices estimate N3 from movement and autonomic features; the average device error versus PSG is on the order of 20–40 minutes per night, and night-to-night within-person noise is substantial de Zambotti et al. 2019 Liang et al. 2022 Cho et al. 2023. A reported drop from "1h45m of deep sleep" to "55 minutes" can sit inside the device's measurement noise and may correspond to no actual physiological change.

"Sleep scores" mean what the brand implies. Composite scores (Oura Readiness, Whoop Recovery, Fitbit Sleep Score) are proprietary blends of TST, stage estimates, HRV, RHR, and movement. None of the input weights are published, and none has been externally validated against an outcome (e.g., next-day cognitive performance, mood, athletic performance) in peer-reviewed trials. They are useful as personal trend signals, not as measurements of anything specific Goldstein et al. 2021.

"My tracker says I didn't sleep" overrides clinical evidence. Baron's original cases included patients who, after a clinical PSG demonstrated they had slept normally, continued to believe the device over the lab study Baron et al. 2017. Consumer devices do sometimes score true sleep as wake (a false negative for sleep, which is a sensitivity error rather than a specificity one); that is a known measurement artifact, not a clinical finding about the patient.

Failure modes — orthosomnia and sleep anxiety

The original case series. Baron et al. (2017) introduced the term "orthosomnia" to describe a perfectionistic quest for ideal sleep driven by tracker data — patients seeking treatment for self-diagnosed insufficient sleep or insomnia based on light/restless sleep observations from their wearable, often unresponsive to reassurance and resistant to cognitive behavioral therapy for insomnia (CBT-I) because the device's data outweighs clinician input Baron et al. 2017. The behavioral signature: spending more time in bed chasing higher sleep scores, which (as CBT-I has demonstrated for decades) is exactly what worsens insomnia by weakening the bed-sleep association.

Prevalence. The first general-population estimate (Jahrami et al. 2024, n=523) found that 35.8% of the sample regularly used a sleep-tracking device. Using a 4-criterion algorithm (device ownership + Athens Insomnia Scale ≥6 + GAD-7 ≤14 to exclude general anxiety + APSQ above varying thresholds for sleep-specific preoccupation), the prevalence of algorithm-identified orthosomnia was 3.0% (conservative), 8.6% (moderate), or 14.0% (lenient) of the full sample. Within tracker users specifically, orthosomnia cases consistently had higher insomnia-symptom scores Jahrami et al. 2024. The figure is one cross-section in one cohort, not a definitive population estimate, but it bounds the problem: a prevalence between a few percent and the low teens depending on the threshold, concentrated in people already trending insomniac.
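
The four-criterion screen can be expressed as a short Python sketch. The AIS and GAD-7 cutoffs come from the description above; the APSQ cutoffs used in the example (30 and 40) are placeholders chosen for illustration, since the study's actual thresholds are not given here, and the respondents are invented.

```python
from dataclasses import dataclass

@dataclass
class Respondent:
    owns_tracker: bool
    ais: int   # Athens Insomnia Scale total
    gad7: int  # GAD-7 total
    apsq: int  # sleep-specific anxiety and preoccupation score

def meets_criteria(r, apsq_cutoff):
    """Apply the four criteria as described in the text; apsq_cutoff is caller-supplied."""
    return r.owns_tracker and r.ais >= 6 and r.gad7 <= 14 and r.apsq > apsq_cutoff

def prevalence(sample, apsq_cutoff):
    """Percentage of the full sample (not only tracker owners) meeting all criteria."""
    return 100 * sum(meets_criteria(r, apsq_cutoff) for r in sample) / len(sample)

# Five invented respondents (assumption, not study data).
sample = [
    Respondent(True, 8, 5, 45),    # meets all four criteria at either cutoff
    Respondent(True, 7, 9, 35),    # meets them only at the lenient cutoff
    Respondent(True, 3, 4, 50),    # excluded: insomnia score too low
    Respondent(False, 9, 6, 48),   # excluded: no tracker
    Respondent(True, 10, 18, 47),  # excluded: general anxiety too high
]

print(prevalence(sample, apsq_cutoff=30), prevalence(sample, apsq_cutoff=40))
```

Two things drive the reported range: the strictness of the preoccupation cutoff and the choice of denominator. Dividing by tracker owners only, rather than by the full sample, would give a larger percentage from the same cases.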

The cross-cutting harm. Even outside formal orthosomnia, wearable use can sharpen the link between perceived short sleep and anxiety: in a Canadian nationally representative survey, sleep-wearable users reported longer sleep onset latency, shorter sleep, and more severe insomnia symptoms than non-users; ~45% reported a positive subjective effect and ~4.5% a negative one. The differential persisted partially after adjusting for diagnosed sleep disorders.

Audience / population variability

Accuracy is worst in the populations who most want answers. In chronic-insomnia patients, Fitbit Charge 4 had substantial bias for stages and efficiency relative to PSG Liang et al. 2022; in clinical populations more broadly (apnea, restless legs, parasomnias), validation is sparse and most devices' algorithms were trained on healthy controls. Performance is also weaker in older adults (more fragmented sleep, more quiet wake epochs) and in shift workers (irregular sleep timing breaks algorithmic priors).

Practicalities

Costs: Apple Watch SE ~$250, Apple Watch Series 10 ~$400+, Fitbit Charge 6 ~$160, Whoop $239/year subscription (no upfront), Oura Ring Gen4 $349 + $5.99/month subscription, Garmin watches $200–$1000+. All require nightly wear, a charged device, and a paired smartphone. None require any active effort once worn. Battery life ranges from ~18 hours (Apple Watch without low-power mode) to 7 days (Oura, Whoop) — devices that need daily charging often miss nights.
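
Because some devices charge up front and others by subscription, comparing them requires a time horizon. The Python sketch below uses the prices quoted above and a three-year horizon; the horizon and the function name are assumptions, and prices change often.

```python
def total_cost(upfront, monthly=0.0, yearly=0.0, years=3):
    """Total cost of ownership in dollars over the given number of years."""
    return upfront + years * (12 * monthly + yearly)

# Prices as quoted in the text above.
options = {
    "Fitbit Charge 6": total_cost(160),
    "Oura Ring Gen4": total_cost(349, monthly=5.99),
    "Whoop": total_cost(0, yearly=239),
}

for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:,.2f} over 3 years")
```

Over a long enough horizon the subscription, not the sticker price, dominates the cost.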

The credibility range

The optimist case

Modern consumer wearables are a genuine measurement technology, not a toy. The Oura Gen3 with OSSA 2.0 hits >90% sensitivity and >70% specificity for sleep stages across hundreds of thousands of PSG-validated epochs Svensson et al. 2024; nocturnal HRV from the top-end rings is within about 6–7% of ECG Dial et al. 2025; sleep/wake detection clears 95% sensitivity on the leading devices Chee et al. 2024. These are clinically interesting numbers, especially for tracking trends within a single person across weeks and months, a kind of record PSG cannot provide because it covers one to three nights in a lab. The FDA-cleared apnea-notification feature on Apple Watch Apple 2024 opens a screening funnel that didn't previously exist for the 80%+ of OSA cases that go undiagnosed, a clear public-health upside. Behaviorally, in a randomized crossover trial (n=32), participants who wore a Whoop with feedback reported improved subjective sleep quality, and the device measured sleep and cardiorespiratory variables accurately Berryhill et al. 2020. The 2024 SRS state-of-the-science paper frames wearables as essential infrastructure for the next generation of sleep and circadian research de Zambotti et al. 2024. The optimist call: imperfect but rapidly improving, useful for trends and screening, and a net positive for most users.

The skeptic case

The meta-analytic pooled bias for TST is real (a ~17-minute underestimate, 95% CI roughly 7 to 26 minutes) and accuracy varies dramatically by device, algorithm version, and population: the Cho et al. multicenter study found macro-F1 from 0.69 down to 0.26 across 11 commercial trackers Cho et al. 2023. Deep-sleep estimates are routinely off by 20–40+ minutes de Zambotti et al. 2019 Liang et al. 2022; composite "sleep scores" are unaudited proprietary blends with no peer-reviewed validation against outcomes Goldstein et al. 2021; performance drops in the very populations (insomnia, apnea, older adults, shift workers) that most need accurate data. The AASM is unambiguous that these devices cannot diagnose or treat sleep disorders Khosla et al. 2018. And the downstream behavioral harms are documented: orthosomnia is a recognized clinical phenomenon with measurable prevalence (3–14% of the sample in the first general-population study, depending on the threshold used) Jahrami et al. 2024, and Baron's case series shows it can override CBT-I, the most effective insomnia treatment Baron et al. 2017. The skeptic call: useful trend toy with real harm potential; consumer-grade numbers should not be treated as measurements, scores should not be treated as diagnoses, and clinicians should expect to manage a steady trickle of tracker-induced sleep anxiety.

The author's call

The truth sits between the two camps: closer to the optimist on infrastructure, closer to the skeptic on interpretation. The hardware and algorithms have improved enough that the leading rings and watches give credible trend data for an individual user: sleep/wake timing, total sleep accurate to within tens of minutes, nocturnal HRV trends within several percent of ECG, and a usable screening signal for apnea on FDA-cleared devices. They are not credible as moment-to-moment measurements of sleep architecture: deep-sleep and REM numbers carry error bars wide enough to swallow most night-to-night variation. The right use is longitudinal trend-watching plus apnea screening; the wrong use is reading any single night's stage breakdown as physiology. The behavioral side adds a real cost: a non-trivial minority of users develop sleep anxiety, and people prone to perfectionism, health anxiety, or pre-existing insomnia are at the highest risk and gain the least clinical benefit. Evidence is rated 4 (multiple validation studies, a meta-analysis, FDA-cleared apnea features, AASM position statements): a strong literature, with the caveat that "the literature" here mixes solid measurement studies with the still-emerging behavior literature. Controversy is rated 3: there is active debate among sleep clinicians about whether benefits exceed harms at the population level, with reasonable people on both sides. Net entry posture: know what you're actually getting before you wear one; for most people who don't have apnea risk and aren't anxious about sleep, the upside is modest and the downside is small but real.

Stakeholder + incentive map

Commercial — wearable manufacturers (Apple, Fitbit/Google, Whoop, Oura, Garmin, Samsung). Strong incentive to overstate measurement precision and "score" meaningfulness; growing incentive to seek FDA clearance for specific features (apnea notification, AFib) because medical-device positioning unlocks new markets and insurance pathways.
Professional — AASM, sleep medicine clinicians. Cautious-positive posture: the AASM's published position is that consumer technology cannot diagnose or treat sleep disorders but may aid patient engagement Khosla et al. 2018. Behavioral sleep clinicians (the people who run CBT-I) carry the orthosomnia burden in practice and have published on it most loudly Baron et al. 2017.
Research community — Sleep Research Society, academic sleep labs. Increasingly enthusiastic: wearables enable multi-night, naturalistic, large-cohort sleep and circadian studies that PSG cannot scale to; the 2024 SRS state-of-the-science working group is a measured endorsement with usage recommendations de Zambotti et al. 2024.
Regulators — FDA. Most devices ship as "lifestyle / wellness" products outside FDA jurisdiction. Specific features that step into medical territory (AFib detection, sleep-apnea notification, ECG) require 510(k) clearance and trigger meaningful validation requirements.
Counter — sleep psychologists, CBT-I community. Wary; the orthosomnia construct came from this corner, and the CBT-I evidence base is the strongest treatment we have for chronic insomnia — patients who trust their tracker over their clinician have lower treatment engagement.

Population variability

Healthy adults without sleep disorders. Best-case population for accuracy. Sleep/wake detection ≥95%; stage estimates noisy but trendable; HRV from top devices within ~6% of ECG.
Chronic insomnia. Worst case for the human and middling for the device. Fitbit Charge 4 in insomnia patients shows substantial stage bias Liang et al. 2022; behaviorally, this is the highest-risk group for orthosomnia and the group that gains least.
Suspected sleep apnea. The one population for which wearables now have a defined screening role on FDA-cleared devices Apple 2024. Notification is not a diagnosis; positive results should funnel to a clinical sleep evaluation.
Older adults (60+). Fragmented sleep, more quiet wake epochs, often more comorbidities — devices over-call sleep in quiet wake, under-detect arousals. Limited validation data.
Shift workers, irregular sleepers, frequent travelers. Algorithms trained on regular nocturnal sleep schedules degrade. Devices that infer sleep windows from time-of-day priors do worst here.
Adolescents. Most validation work has been done in healthy young adults; some Oura studies in adolescents show different stage-bias patterns than the adult literature.
Anxiety-prone, perfectionistic, or health-anxious users. Highest-risk population for orthosomnia regardless of device Jahrami et al. 2024 Baron et al. 2017. The data feeds the anxiety loop.

Knowledge gaps

No peer-reviewed validation of any major device's composite "sleep score" or "readiness score" against next-day outcomes (cognitive performance, mood, athletic performance, accident risk). Brands publish internal reports; independent trials don't exist at scale.
Long-term behavioral RCTs are scarce. Berryhill (n=32, 1 week) is among the only randomized trials on wearable use itself Berryhill et al. 2020; we have no good answer to "does wearing a tracker for a year improve or worsen sleep on average."
Orthosomnia prevalence has one general-population estimate Jahrami et al. 2024; replication across cultures and longer-duration studies don't yet exist. Incidence and trajectory (does it resolve when the device is removed?) are unknown.
The validity of apnea-notification features outside the populations enrolled in clearance trials (very obese, severe COPD overlap, complex parasomnias) is largely untested.
Algorithm transparency is the field-level gap: proprietary updates can change reported metrics overnight without notice, and a device validated in 2021 may behave differently in 2024. Continuous re-validation infrastructure is informal at best.
The downstream clinical impact of widespread apnea screening is unknown — will the screening funnel overwhelm sleep clinics, will follow-through to confirmatory testing happen, will treatment rates actually rise?

Scope vs the brief. The brief named total sleep, stages, HRV, accuracy vs PSG, sleep behavior, sleep anxiety, and clinical utility. All seven are covered: total sleep, stages, HRV, and accuracy vs PSG in evidence; behavior and anxiety in stakes and the protocol's warning callout; clinical utility in payoff (apnea screening) and audience (AASM position). No narrowing.

Scope choices made.

Named devices selectively (Oura, Apple Watch, Fitbit, Whoop, Garmin, Samsung) — the ones with the deepest validation literature and the FDA clearance — rather than surveying every brand on the shelf. Listing every smartwatch dates fast.
Treated "nearables" (under-mattress mats, bedside radar, Withings Sleep) only in passing in the dossier; the topic title is wearables and the audience for a non-wearable tracker is small enough to warrant its own entry if it ever matters.
Apple Watch is absent from the Dial et al. (2025) HRV-vs-ECG validation. Flagged in the dossier; the article makes the HRV claim only about devices that were actually tested.

Rating calls worth noting.

Action type. Considered know vs decide vs test. Landed on know: the entry's job is teaching the reader what the data actually means, not steering them toward or away from buying one. decide was close runner-up.
Sleep dimension (2). Bimodal effect: modest positive for trend-watchers and the apnea-screening subgroup, real negative for the orthosomnia-susceptible minority. Net 2 is defensible; could be argued down to 1 by someone weighting the harm side more heavily. The dossier's credibility-range section spells out the split.
Controversy (3). Genuine — AASM is cautious-positive, behavioral sleep medicine clinicians are more wary, manufacturer claims outrun validation. Not a 4 because the disagreement is on net benefit rather than on fundamentals (everyone agrees the stage estimates are imprecise).

Future-link candidates. Once they exist: sleep-debt, cbt-i (or insomnia-treatment), sleep-apnea / sleep-apnea-screening. The out-of-scope section already names them in prose so a future editor can wire them in.

Separate-entry candidates. Orthosomnia may warrant its own entry as the general-population evidence base grows beyond Jahrami et al. (2024). Consumer-device sleep apnea screening is currently a single feature on a single device class but is the kind of thing that could become its own entry as more clearances land.

Stale-by-date risk. Algorithm versions change without notice — the dossier flags this as a field-level gap. The Oura OSSA 2.0 numbers cited (Svensson 2024) and the Apple apnea clearance (2024) are current but a 2026/2027 re-read should expect new validation data.