How Accurate Are AI-Generated Clinical Notes? What We Measured and How We Think About It

When a physician asks us "how accurate is the note?", they're asking a deceptively simple question that contains at least four distinct sub-questions. Is the note factually faithful to what was said in the encounter? Is the SOAP structure complete? Are the ICD-10 codes correct? And, the one that matters most for downstream prior authorization (PA) workflows, does the note contain the clinical specificity that billing and authorization require?

We want to be honest about how we think about accuracy, what we measure, where Prioriq performs well, and where we still have gaps. The clinical stakes of an inaccurate note are too high for marketing-speak here.

What "Accuracy" Means in This Context

Clinical note accuracy isn't a single metric. In published literature on ambient scribing systems, researchers typically measure several dimensions independently because they have different clinical and operational implications.

Factual faithfulness is the most basic dimension: does the note contain statements that are inconsistent with what was said in the encounter? A note that says a patient denies chest pain when the physician never asked about chest pain is a hallucination. A note that records the patient as a 54-year-old when the physician said 45-year-old is a factual error. These are distinct failure types — hallucination fabricates something absent from the conversation; factual error misrepresents something present.

Completeness measures whether clinically significant content from the encounter made it into the note. A patient who mentioned knee pain while giving their history but whose note records only the primary complaint of lower back pain has an incomplete note. Completeness is harder to measure than faithfulness because it requires knowing what "should" be in the note, which is itself a judgment call that varies by encounter type and clinical context.

Structural correctness covers SOAP organization, appropriate section allocation, and documentation standards like E/M coding level support. A subjective section that includes physician examination findings is structurally incorrect. A plan section that lacks follow-up instructions may fail E/M level documentation requirements for the billed visit.

ICD-10 code precision is the downstream-facing dimension with the most direct revenue cycle impact. Suggesting M17.10 (unilateral primary osteoarthritis, unspecified knee) when the encounter documented the right knee only is an acceptable first-pass suggestion for physician review, but signing a note with the unspecified code instead of M17.11 (unilateral primary osteoarthritis, right knee) can affect claim acceptance and PA documentation.

How We Test Our Output

Our evaluation process uses a set of synthetic test encounters: audio transcripts paired with physician-authored "ground truth" notes written by clinicians who reviewed the same transcript. We don't use identifiable patient data for evaluation; all test cases are synthetic or de-identified under our IRB protocol.

Each test note is scored by a clinical reviewer against the ground truth on a structured rubric covering the four dimensions above. We run this evaluation on a rolling basis as we update the model, so we have a continuous record of how changes to the underlying system affect note quality across encounter types.

On our current internal benchmark set, factual faithfulness is our strongest dimension. Across 340 synthetic encounters covering internal medicine, orthopedics, cardiology, and neurology visit types, we measure a factual error rate of under 2% at the sentence level. The hallucination rate (statements in the generated note with no grounding in the transcript) is under 0.5% for trained encounter types.
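As a rough illustration of what "sentence level" means here, the sketch below computes both rates from one reviewer label per generated sentence. The label names and the `rates` function are hypothetical and chosen for illustration; they are not our actual rubric or tooling, and the data is made up.

```python
from collections import Counter

# Hypothetical reviewer labels, one per sentence in the generated note.
LABELS = {"faithful", "factual_error", "hallucination"}

def rates(sentence_labels):
    """Return (factual_error_rate, hallucination_rate) over all sentences."""
    if not sentence_labels:
        raise ValueError("no sentences to score")
    unknown = set(sentence_labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(sentence_labels)
    total = len(sentence_labels)
    return counts["factual_error"] / total, counts["hallucination"] / total

# Illustrative data only: 200 sentences, 3 factual errors, 1 hallucination.
labels = ["faithful"] * 196 + ["factual_error"] * 3 + ["hallucination"]
error_rate, hallucination_rate = rates(labels)
print(f"factual errors: {error_rate:.1%}, hallucinations: {hallucination_rate:.1%}")
```

Keeping the two labels separate matters because, as described earlier, a misstatement of something that was said and a fabrication of something that was not are different failure types.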

Completeness is more variable. For structured encounters with clear clinical dialogue, completeness against ground truth notes runs 88–92%. For less structured encounters — family medicine visits with multiple concurrent concerns, pediatric encounters with parent-provided history, visits conducted partly in a language other than English — completeness drops to 75–82%. We're explicit about this because completeness gaps tend to be invisible to the physician reviewing the note: you see a complete-looking note and don't know what the conversation contained that didn't make it in.
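One simple way to think about a completeness score is as recall against the ground truth note: of the clinically significant items the reviewing clinician recorded, what fraction also appears in the generated note? The sketch below assumes items have already been matched by a reviewer and reduced to comparable labels; the `completeness` function and the example items are hypothetical.

```python
def completeness(ground_truth_items, note_items):
    """Fraction of ground-truth items that also appear in the generated note."""
    truth = set(ground_truth_items)
    if not truth:
        raise ValueError("ground truth is empty")
    return len(truth & set(note_items)) / len(truth)

# Illustrative encounter: knee pain was mentioned but did not reach the note.
truth = {"lower back pain", "knee pain", "ibuprofen trial", "no red-flag symptoms"}
note = {"lower back pain", "ibuprofen trial", "no red-flag symptoms"}
score = completeness(truth, note)
print(f"completeness: {score:.0%}")
```

In practice the hard part is the matching step that this sketch takes as given, since deciding whether a note sentence covers a ground-truth item is a clinical judgment rather than a string comparison.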

ICD-10 Code Suggestion Accuracy

This is where we'll be the most direct about current limitations. ICD-10 code suggestion from note content is a task where first-pass accuracy varies significantly by code category and clinical context.

For primary diagnosis codes in common conditions (hypertension, type 2 diabetes, osteoarthritis, GERD, depression), our first-pass suggestion accuracy against physician-chosen final codes runs approximately 84–88%. This means roughly 1 in 6 to 1 in 8 primary diagnosis suggestions require physician correction before the note is signed.
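The conversion from an accuracy range to a correction frequency is simple arithmetic, sketched here with a hypothetical helper named `one_in_n`:

```python
def one_in_n(accuracy):
    """Convert an accuracy fraction into 'one correction per N suggestions'."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return 1 / (1 - accuracy)

# The two ends of the 84-88% range quoted above.
low, high = one_in_n(0.84), one_in_n(0.88)
print(f"one correction per {low:.2f} to {high:.2f} suggestions")
```

At 84% accuracy that is one correction in about every 6 suggestions, and at 88% about one in every 8.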

For secondary and chronic condition codes, accuracy is lower, approximately 68–73%, because secondary codes depend on what the physician decides is worth documenting as an active problem on a given visit, which is a clinical judgment that doesn't have a single right answer. The model errs in both directions: it tends to suggest fewer secondary codes than physicians document when they have time to be thorough, and it tends to suggest speculative codes when the conversation contained ambiguous references to comorbid conditions.

Laterality codes (right versus left, bilateral versus unilateral) have high accuracy when mentioned explicitly in conversation — 94%+ — but drop significantly when laterality was implied by context rather than stated. A physician who says "the knee" throughout an encounter without specifying right or left is not giving the model enough information, and the model will typically default to the unspecified code rather than guessing.
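The "default to unspecified rather than guess" behavior can be sketched with the knee osteoarthritis codes. This is a deliberately simplified keyword illustration, not how our model works; the function names and matching rules are assumptions made for the example. The codes themselves are the ICD-10-CM M17 primary osteoarthritis codes.

```python
import re

# ICD-10-CM codes for primary osteoarthritis of the knee.
KNEE_OA_CODES = {
    "right": "M17.11",
    "left": "M17.12",
    "bilateral": "M17.0",
    None: "M17.10",  # unilateral, unspecified knee
}

def explicit_laterality(transcript):
    """Return 'right', 'left', 'bilateral', or None if laterality is not stated."""
    text = transcript.lower()
    if re.search(r"\b(bilateral|both)\s+knees?\b", text):
        return "bilateral"
    sides = {s for s in ("right", "left") if re.search(rf"\b{s}\s+knee\b", text)}
    # Both sides mentioned separately is ambiguous here, so do not guess.
    return sides.pop() if len(sides) == 1 else None

def suggest_knee_oa_code(transcript):
    """Fall back to the unspecified code when laterality was never stated."""
    return KNEE_OA_CODES[explicit_laterality(transcript)]

print(suggest_knee_oa_code("Pain in the right knee for six months"))
print(suggest_knee_oa_code("The knee has been stiff in the morning"))
```

The second call returns the unspecified code because "the knee" carries no laterality, which is exactly the case where the physician needs to supply the side before signing.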

We're not saying our ICD-10 suggestion accuracy is insufficient to be useful — it is useful, and it saves coding time even when it requires review. What we're saying is that the physician review step before signoff is a genuine checkpoint, not a formality. The accuracy figures above are why the review step matters.

Where Ambient Notes Outperform Manual Notes

There's a dimension of note quality that the accuracy metrics above don't capture: the quality of the starting point relative to what a physician would produce under time pressure.

A manually drafted note written at end-of-day after 20 patients tends to be thin in the HPI and assessment sections — bullet points rather than narrative, minimal treatment history, absent functional limitation language. An ambient note drafted during the encounter, even with the completeness gaps we described above, contains more clinical narrative than the rushed manual alternative because it's capturing the conversation in real time rather than asking the physician to reconstruct it from memory 6 hours later.

This matters most for prior authorization. When we look at our test encounter set, ambient notes contain enough PA-relevant clinical detail — conservative treatment history, functional limitation description, radiographic finding references, prior medication trials — to support PA submission drafting in approximately 78% of cases without requiring the physician to write a supplemental addendum. Manual notes from the same encounter scenarios, written after a simulated time-pressured day, support PA drafting without supplemental documentation in approximately 52% of cases.

That 26-percentage-point gap (78% versus 52%) is where the PA workflow efficiency gain comes from. The ambient note isn't perfect, but it is substantially more useful for downstream clinical and administrative work than the manual alternative produced under realistic conditions.

Known Failure Modes and How We Mitigate Them

Clinicians and administrators considering ambient scribing should know the failure patterns most likely to cause problems, not just the headline accuracy numbers.

Multi-speaker confusion. When two physicians are in a room together, or when a medical assistant conducts rooming and the physician conducts the visit, speaker attribution errors increase. Currently, our system flags multi-speaker scenarios for additional physician review rather than silently attributing statements to the wrong speaker.

Medication name misrecognition. Phonetically similar medication names — metoprolol versus metformin, for instance — are a known risk in transcription-based systems. We apply a pharmacological entity recognition layer that catches most common confusions, but physicians reviewing notes that include medication changes should verify medication names explicitly.
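A minimal sketch of one way to surface this risk on review: look up each word in the note against a table of known confusable pairs and flag any hit for explicit verification. The pair list, function names, and lookup approach are illustrative assumptions and do not describe our entity recognition layer; a real list would come from a curated look-alike/sound-alike source.

```python
import re

# Illustrative pairs only; not a complete or authoritative list.
CONFUSABLE_PAIRS = [("metoprolol", "metformin"), ("hydroxyzine", "hydralazine")]

def build_lookup(pairs):
    """Map each drug name to the set of names it is easily confused with."""
    lookup = {}
    for first, second in pairs:
        lookup.setdefault(first, set()).add(second)
        lookup.setdefault(second, set()).add(first)
    return lookup

def medication_flags(note_text, lookup):
    """Return {drug_in_note: [confusable alternatives]} for reviewer attention."""
    words = set(re.findall(r"[a-z]+", note_text.lower()))
    return {word: sorted(lookup[word]) for word in words if word in lookup}

LOOKUP = build_lookup(CONFUSABLE_PAIRS)
flags = medication_flags("Increase metoprolol to 50 mg daily.", LOOKUP)
print(flags)
```

A flag here does not mean the name is wrong. It only tells the reviewer which names deserve a second look.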

Numeric value errors. Vital signs, lab values, and dosages mentioned in conversation can be transcribed with digit-level errors that are difficult to detect on review because the number looks plausible. We flag numeric values for review in the note editor interface rather than presenting them without attention cues.
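As an illustration of the flagging idea, the sketch below locates numeric values in note text and returns their character spans so an editor could highlight them. The pattern and the `numeric_spans` function are assumptions made for this example and are not our editor's implementation.

```python
import re

# Integers, decimals, and slash-separated pairs such as blood pressure readings.
# The leading \b keeps digits embedded in words (for example "A1c") from matching.
NUMERIC = re.compile(r"\b\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?")

def numeric_spans(note_text):
    """Return (start, end, value) for every numeric value found in the note."""
    return [(m.start(), m.end(), m.group()) for m in NUMERIC.finditer(note_text)]

spans = numeric_spans("BP 142/88, metoprolol 25 mg twice daily, A1c 7.4%.")
values = [value for _, _, value in spans]
print(values)
```

Highlighting every value is intentionally blunt: a transposed digit looks as plausible as the correct one, so the aim is to draw the reviewer's eye rather than to judge correctness.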

Negation handling. Clinical notes require accurate representation of negatives: "denies chest pain" versus "reports chest pain." Negation in casual speech is linguistically complex. Our system handles it well in standard clinical phrasing but has higher error rates on colloquial or indirect statements. This is an area of active improvement in our evaluation pipeline.
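To show why colloquial phrasing is harder, here is a deliberately naive cue-based detector. It is not our approach; the cue list and function name are hypothetical, and real negation handling also has to resolve scope and context.

```python
import re

# Hypothetical cue list covering standard clinical phrasing and contractions.
NEGATION_CUES = re.compile(r"\b(denies|denied|no|not|without|never)\b|n't\b")

def has_negation_cue(sentence):
    """True if the sentence contains an explicit negation cue."""
    return bool(NEGATION_CUES.search(sentence.lower()))

standard = has_negation_cue("Patient denies chest pain.")
affirmed = has_negation_cue("Patient reports chest pain.")
# The patient is denying the symptom, but no listed cue is present.
colloquial = has_negation_cue("Chest pain? Nah, I'm good on that front.")
print(standard, affirmed, colloquial)
```

The third sentence is a denial that the cue list misses entirely, which is the kind of indirect statement where error rates rise.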

The Honest Bottom Line

Ambient scribing with current technology is accurate enough to be a useful clinical documentation tool that saves time and improves note quality relative to the rushed manual alternative. It is not accurate enough to be signed without physician review. Those two statements are both true, and any vendor that presents only the first one is not giving you the full picture.

The right mental model for ambient scribing is not "automated note creation" — it's "high-quality first draft that requires physician review and correction." The review time is shorter than drafting time, the starting quality is higher than a rushed manual note, and the downstream utility for billing and PA is meaningfully better. That's the accurate value proposition, and it's substantial enough to justify the tool on its own terms.