Industry Thought

April 27, 2026

Accurate-ish: What's Good Enough AI for Emergency Management

Published by Justin Snair

The conversation about AI accuracy in emergency management is finally happening. After three years of demos, conference panels, and procurement decisions made largely on vibes, I'm starting to see real talk about AI product benchmarks, evaluation methodology, performance disclosure, and golden-answer tests. Practitioners are asking sharper questions. Researchers are working toward frameworks. Vendors are making public commitments to transparency.

This is overdue. It's also, mostly, marketing copy.

The words being used — "accurate," "validated," "high-performance," "AI-powered," "human in the loop" — don't have shared definitions in this field yet. Two vendors can both claim "validated accuracy" and mean wildly different things. A procurement officer can ask "is your AI accurate?" and receive a confident yes from systems with no published methodology, no benchmark, no eval, and no architectural review mechanism. The conversation is happening at a level of abstraction where the words sound technical without doing technical work.

This post is about what the words should mean, what the actual accuracy ceiling is for AI in emergency management, why that ceiling isn't uniform across the preparedness cycle, and what serious disclosure looks like before the marketing layer collapses under its own weight.

Here's my honest take: AI in emergency management is, at best, accurate-ish. The real question is where accurate-ish is good enough — and where it isn't.

The vocabulary problem

A few working definitions you should know, and how to use them to spot BS:

Accuracy. Useful only when paired with a task definition and an evaluation method. "90% accurate" tells you nothing without "on what task — drafting a wireless emergency alert, summarizing a section of a county emergency operations plan, identifying gaps in a continuity of operations plan — against what golden answers, scored how, with what inter-rater reliability." A system that's 95% accurate at retrieving the right section of an EOP can still be 70% accurate at summarizing what that section actually says. Accuracy without decomposition is a number that means whatever the marketer wants it to mean.

Benchmark. A published, reproducible evaluation dataset paired with a scoring methodology, used to compare systems. The defining feature is that it's external to any single vendor. A vendor running their own questions against their own demo plan library and reporting their own scores is not a benchmark — it's self-graded homework on a corpus only they have access to.

Golden answer test. A set of questions paired with reference answers authored by domain experts, used as ground truth for evaluation. The quality of a golden answer test is bounded by the inter-rater reliability of the experts who authored it. If three emergency managers or planners would give three different answers to the same question about evacuation timing, shelter activation, or resource ordering, the "golden" answer doesn't exist — and the test set should flag the question as contested rather than score against an arbitrary version. Emergency management is full of contested questions; a benchmark that doesn't acknowledge this is a benchmark that's been silently simplified.
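
To make the inter-rater point concrete, here is a minimal sketch of how a golden-answer set could flag contested questions before anything gets scored. The class names, the exact-match comparison, and the 0.7 threshold are illustrative assumptions; a real test set would use expert rubrics and a proper agreement statistic.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class GoldenItem:
    question: str
    expert_answers: list[str]  # one reference answer per expert

def agreement(item: GoldenItem) -> float:
    """Fraction of expert pairs whose answers match after normalization.
    Stand-in for a real agreement statistic such as Krippendorff's alpha."""
    normalized = [a.strip().lower() for a in item.expert_answers]
    pairs = list(combinations(normalized, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

def partition(items: list[GoldenItem], threshold: float = 0.7):
    """Split a candidate golden set into scoreable and contested questions.
    Contested items should be published as contested, not silently dropped."""
    scoreable = [i for i in items if agreement(i) >= threshold]
    contested = [i for i in items if agreement(i) < threshold]
    return scoreable, contested

items = [
    GoldenItem("When does the EOC activate to Level 2 for riverine flooding?",
               ["at NWS flood warning", "at NWS flood warning", "at flood watch"]),
    GoldenItem("Which annex governs shelter activation?",
               ["ESF-6", "ESF-6", "ESF-6"]),
]
scoreable, contested = partition(items)
print(f"{len(scoreable)} scoreable, {len(contested)} contested")
```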

Retrieval accuracy vs. generation accuracy. Retrieval Augmented Generation (RAG)-based AI systems do two things: find relevant content (retrieval — pulling the right sections from a plan library, doctrine corpus, or incident archive) and generate an answer using that content (generation — summarizing, comparing, recommending). Both can fail independently. A system can retrieve the correct section of an EOP and then generate a summary that contradicts what the section actually says. Reporting a single accuracy number conflates these two failure modes, and the conflation is convenient for vendors because it hides where their system actually breaks.
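
A minimal sketch of what decomposed reporting might look like, assuming expert faithfulness judgments are already recorded; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    gold_passage_id: str       # which plan section should be retrieved
    retrieved_passage_id: str  # what the system actually retrieved
    answer_faithful: bool      # expert judgment: does the answer match the retrieved text?

def report(cases: list[EvalCase]) -> None:
    # Retrieval: did the system pull the right section of the corpus?
    hits = [c for c in cases if c.retrieved_passage_id == c.gold_passage_id]
    # Generation, conditioned on correct retrieval: given the right section,
    # does the generated answer actually reflect what it says?
    faithful = [c for c in hits if c.answer_faithful]
    r = len(hits) / len(cases)
    g = len(faithful) / len(hits) if hits else 0.0
    print(f"retrieval accuracy:  {r:.0%}")
    print(f"generation accuracy: {g:.0%} (given correct retrieval)")
    print(f"end-to-end:          {r * g:.0%}  <- the single number that hides the break")
```

A system at 95% retrieval and 70% generation blends to roughly 66% end-to-end; the single number tells you neither where it breaks nor what to fix.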

Hallucination. A generation error where the model produces content not supported by the retrieved sources. In emergency management, this looks like an AI-drafted alert that includes an evacuation route the source plan never specified, a recommendation citing a doctrine reference that doesn't exist, or a generated AAR finding that misattributes a decision to the wrong section chief. Distinct from a retrieval failure (didn't find the right document) and from a refusal failure (should have said "I don't know" but answered anyway). Vendors who use the word loosely tend to also be vendors who don't track which type of failure is occurring. LLM hallucinations are mathematically inevitable, not just engineering flaws. Any vendor who tells you otherwise is uninformed or misleading you.

Grounded. A claim that generated output is supported by retrieved evidence. The meaningful version requires showing the retrieved passage and the relationship between it and the generated text — for instance, an AI-suggested resource decision tied to a specific section of the local resource ordering policy, with the section visible to the operator at the moment of approval. The marketing version requires nothing.

Closed circuit / closed system / closed loop. A phrase used by some AI vendors in emergency management to suggest data isolation, security, or controlled operation — and a phrase with no shared technical definition in this space. It might mean the AI doesn't send customer data to a third-party model provider. It might mean the model is hosted in the vendor's own cloud rather than a public API. It might mean the system operates only on a customer-uploaded corpus rather than the open web. It might mean a sandboxed deployment, an air-gapped instance, or simply that the marketing team liked how it sounded. Without specifying which one, "closed" is doing reassurance work, not technical work. The questions that produce a real answer: closed to what, closed from what, where does data flow, where does it stop, who has access at each point, and what happens when the model is updated or the vendor changes infrastructure providers. If a vendor uses the term, those are the follow-ups. If they can't answer them, the term is marketing.

Human in the loop. A phrase that does enormous work in marketing materials and very little work in practice if the loop is designed to rubber-stamp. The meaningful version is architectural: software gates, mandatory review steps, logged interactions, and what I call intentional friction — design choices that deliberately slow the user down to force thinking, rather than easing them through to acceptance. In emergency management the loop is a duty officer, planner, alerting authority, or exercise designer reviewing AI-drafted content under their actual cognitive conditions — often time-pressured, multi-tasking, sleep-deprived. The architecture has to be designed for that user in that environment, which is where most "human in the loop" claims fall apart. The marketing version is a sentence in a slide deck and a user copy/pasting content out of a chatbot.

Validated. By whom, against what doctrine or methodology, with what authority. Emergency management has its own validation traditions (and problems with those) — HSEEP for exercises, EMAP for accreditation, peer review for plans, Joint Commission standards for hospital EM. AI systems imported into the field shouldn't get to use the word "validated" unless they're submitting to comparable rigor. Same problem as accuracy, with extra confidence.

These aren't arcane distinctions. They're the vocabulary emergency management has to share before it can have a real conversation about what its AI tools actually do — and where they fail.

The accuracy ceiling

No large language model has achieved perfect accuracy. Not OpenAI's. Not Anthropic's. Not Google's. The labs that built these models, with billions of dollars and the best AI researchers on earth, publish accuracy numbers below 100% on every benchmark they release — and they publish those numbers openly, with methodology, because they understand that transparency is the precondition for trust.

So when an emergency management AI vendor claims "accurate" without qualification, one of two things is true. Either they've solved a problem the entire AI research community hasn't. Or they're using the word loosely.

It's the second one. It's always the second one.

The honest answer for AI in emergency management is that on a good day, with a clean corpus and well-formed questions, you might hit 90-95% accuracy. That's a reasonable target. State-of-the-art systems in adjacent fields — medical AI, legal AI — operate in that range and publish stratified numbers showing where they perform well and where they don't.

There's a second dynamic that distorts how accuracy gets evaluated. Any AI failure tends to be compared against the perfect alternative, not the expected one.

When an AI system gets a hard question wrong, the implicit benchmark becomes "what an expert with full information and unlimited time would have answered" — not "what a Google search would have returned, what the average duty officer would have produced under time pressure, what the existing documentation would have surfaced."

The AI gets graded against the ceiling of human performance: perfection. The status quo gets graded against itself. That asymmetry hides where AI is genuinely additive and exaggerates where it falls short. The honest comparison is against the realistic alternative the user actually had access to, not against the idealized version of the answer that exists in retrospect. Both numbers matter — absolute accuracy against ground truth, and relative accuracy against the available alternative — and neither alone tells you whether the tool is improving outcomes.

That's the ceiling, and that's the comparison problem. The question is what to do about the gap between the ceiling and 100% — and how to evaluate it honestly against what users actually had before.

The risk isn't uniform across the cycle

Here's where the current conversation goes wrong. The field is treating AI accuracy in emergency management as if it were a single bar — one accuracy number, one benchmark, one disclosure standard. But the consequence space is wildly different across the preparedness cycle, and the accuracy bar has to match the consequence space of the decision the AI is supporting.

Emergency managers already think this way about every other piece of equipment. A handheld HazMat detector advertised as 90% accurate is fine for initial perimeter screening, dangerous for clearance after a release. A weather radio with 90% reliability gives general situational awareness; the same accuracy is inadequate for life-safety decisions. Same device, same accuracy number, completely different risk profile based on the decision it's informing. AI is no different.

Consider:

A 90% accurate AI drafting an after-action report is a productivity win. Errors are caught in review, the document goes through edit cycles, the time horizon allows iteration. The cost of a missed correction is a slightly less precise lessons-learned document — recoverable, low-stakes.

A 90% accurate AI helping a planner identify gaps in a county emergency operations plan is fine. The planner reads the AI's output, weighs it against their own judgment, decides what to act on. The AI is a thought partner, not a decision-maker.

A 90% accurate AI generating a wireless emergency alert during an active incident is a liability event waiting to happen. One in ten alerts wrong. The alert reaches hundreds of thousands of phones in seconds. There is no recall. There is no correction window before the public acts. The same model, with the same accuracy number, becomes a fundamentally different product based on where in the cycle it sits.

This is the requisite-variety problem at the center of EM that I like to preach, stated in plain terms: the accuracy bar of an AI tool has to match the consequence variety of the decision it's informing. A flat field-wide benchmark — "EM AI is 92% accurate" — collapses that distinction and is itself a form of doctrinal malpractice. It tells emergency managers nothing about whether the tool is safe for their specific use.

The implication for benchmarks and golden answer tests is structural. The field doesn't need a benchmark. It needs a benchmark family — different golden answer sets, different accuracy targets, different failure tolerance, different evaluation rigor — matched to specific use cases across the cycle. Preparedness use cases need their own. Response use cases need their own, and the bar should be much higher. Active alerting needs its own again, with stratified evaluation against the specific failure modes that matter (wrong zone, wrong action verb, wrong contact information, wrong timing).
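
Sketched as a manifest, a benchmark family might look like the following; every dataset name, threshold, and field is a placeholder for illustration, not a proposed standard:

```python
# A sketch of a benchmark-family manifest. Every dataset name, threshold,
# and field below is a placeholder for illustration, not a proposed standard.
BENCHMARK_FAMILY = {
    "aar_generation": {
        "golden_set": "em-bench/aar-v1",
        "min_accuracy": 0.85,
        "review_window": "days",       # errors recoverable in edit cycles
        "failure_modes": ["misattributed_decision", "omitted_finding"],
    },
    "plan_gap_analysis": {
        "golden_set": "em-bench/plans-v1",
        "min_accuracy": 0.90,
        "review_window": "weeks",      # planner judgment sits between output and action
        "failure_modes": ["missed_gap", "phantom_gap"],
    },
    "wea_alert_drafting": {
        "golden_set": "em-bench/alerting-v1",
        "min_accuracy": 0.99,
        "review_window": "seconds",    # no recall after transmission
        "failure_modes": ["wrong_zone", "wrong_action_verb",
                          "wrong_contact_info", "wrong_timing"],
        "stratify_by": ["hazard_type", "jurisdiction_size"],
    },
}
```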

A single benchmark is the wrong shape because emergency management is the wrong shape for a single benchmark. The architecture of the evaluation has to match the architecture of the work.

What 90-95% means in active response

Run the math. One in ten outputs is wrong at 90% accuracy. If a vendor's tool generates 100 alerts in a real incident, ten are wrong in some material way — wrong zone, wrong timing, wrong action verb, wrong contact information. The Wireless Emergency Alert system reaches hundreds of thousands of phones in seconds. There is no recall. There is no correction window before the public acts. The error doesn't sit in an email inbox waiting to be cleaned up next Monday — it flies out the door and changes behavior.

Even 95% accuracy means one wrong alert in twenty. In a multi-day disaster with hundreds of public communications, that's not a rounding error. That's a patterned harm.

Software errors in most domains are recoverable. A wrong entry in a restaurant CRM means a guest gets too much salt on their pasta. A wrong line in a sales pipeline tool means someone follows up at the wrong time. These are correctable inconveniences.

Software errors in active emergency response are different in kind, not degree. A wrong evacuation zone in an alert doesn't get corrected — it gets acted on. A wrong shelter location sends people toward the hazard, not away from it. A wrong dosage in a mass casualty triage protocol kills patients. The accuracy bar can't be the same. And adding AI doesn't magically fix this. In fact, it can increase the scale and speed of harm that comes from wrong information.

What's missing in emergency management

Emergency management knows how to do this kind of work. HSEEP defines a shared methodology for exercise evaluation — published criteria, peer review, after-action improvement processes (though HSEEP itself remains unevaluated as a methodology, which is a gap to deal with another time). EMAP gives the field a third-party accreditation standard for emergency management agencies. NIMS provides shared incident command vocabulary, certification, and training. Joint Commission standards govern hospital EM programs. The pattern is the same in each: published methodology, external assessment, ongoing refinement. The field built that infrastructure because ad hoc evaluation didn't scale and didn't earn trust across jurisdictions.

The same pattern shows up in adjacent high-consequence domains that have adopted AI: published methodology, stratified disclosure, scope limitations, mandatory human oversight wrapped around the model output, audit trails capturing what the AI proposed and what the human did with it. The 5% failure rate isn't ignored. It's designed for.

Emergency management AI vendors operating in the response space have, by and large, not done this work. Accuracy claims are unqualified. Methodology is undisclosed. Failure modes are unpublished. Human review is described in marketing copy rather than enforced in the software. There's no stratification by corpus type, and published golden-answer test sets are scarce: academic work like Texas A&M's DisastQA evaluates LLM question-answering against disaster information, but the operational use cases that actually drive risk in this field — alert drafting, AAR generation, plan analysis, exercise design, resource decisions — sit outside any published benchmark, and even DisastQA hasn't been adopted into vendor evaluations or procurement requirements. There's no inter-rater reliability on the ground truth. There's a number — sometimes — and a confident assertion. The field has the playbook for what shared evaluation infrastructure looks like — it just hasn't applied it to AI yet.

What disclosure should actually look like

I've written elsewhere about the specific disclosure framework I think the field should adopt. Briefly: model and version; accuracy stratified by the type of corpus the system is operating over (jurisdictional plans, federal doctrine, historical incident data, SOPs); performance against a published golden-answer test set; inter-rater reliability on the ground truth itself; failure modes and known edge cases; human review mechanisms — not policies, mechanisms — enforced in the software.
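
Here is the same framework sketched as a machine-readable record, so buyers can diff vendors instead of parsing prose claims. The schema and all the numbers are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class AccuracyDisclosure:
    """One vendor disclosure, structured so buyers can compare vendors.
    Field names are illustrative, not an adopted schema."""
    model: str                        # base model and version
    corpus_strata: dict[str, float]   # accuracy by corpus type
    golden_set: str                   # published test set identifier
    inter_rater_reliability: float    # agreement on the ground truth itself
    known_failure_modes: list[str]
    review_mechanisms: list[str]      # enforced in software, not policy

disclosure = AccuracyDisclosure(
    model="vendor-model-2026.04",     # hypothetical
    corpus_strata={
        "jurisdictional_plans": 0.93,
        "federal_doctrine": 0.96,
        "historical_incident_data": 0.88,
        "sops": 0.91,
    },
    golden_set="em-bench/plans-v1",   # hypothetical
    inter_rater_reliability=0.81,
    known_failure_modes=["cross-annex references", "superseded plan versions"],
    review_mechanisms=["mandatory diff view", "hold period before transmission"],
)
```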

The collective version is more valuable than the per-vendor version. Software vendors and the emergency management community should come together on shared golden-answer test sets and benchmark datasets that all AI tools in this space could be evaluated against. Other fields have built shared evaluation infrastructure for exactly this reason: it's how trust gets built across vendors rather than vendor by vendor, and it's how a field stops grading its own homework.

The use-case-specific version is the part that's missing from most current discussions. We don't need one EM benchmark. We need a benchmark family — separate evaluation sets for AAR generation, for plan analysis, for exercise scenario design, for situational awareness, for alerting — with the bar calibrated to the consequence space of each use case.

Architecture is the liability shift hiding as a control feature

The accept button isn't a UX detail. It's the moment where responsibility for an AI-generated output transfers from the system to the human who approved it. Get the architecture right and you have a defensible chain of decisions, with timestamps, identities, and the diff between what the AI proposed and what the human signed off on. Get it wrong and you have a vendor and an operator both claiming the other one is responsible, with no shared record to settle the question.

That's why what I call intentional friction matters. The principle is straightforward: deliberately slow the time between AI generation and downstream consequence — between draft and approval, between approval and transmission, between recommendation and resource commitment — so that the human in the loop has time to actually be in the loop. Friction isn't a UX failure; it's a safety feature. Every additional second between an AI output and an irrevocable action is a second where a wrong evacuation zone gets caught, where a hallucinated doctrine reference gets challenged, where the duty officer's "wait, that's not right" can land before the alert hits hundreds of thousands of phones.

The mechanisms vary. A required diff view that shows what the AI drafted versus what the human edited. A mandatory free-text justification when the human accepts an AI recommendation that overrides standard policy. A second authentication on high-consequence actions. A hold period before transmission. A required acknowledgment that the human, not the platform, is the one approving. Software developers will recognize the pattern from GitHub's Danger Zone — the section of repository settings where actions like deleting a repository or transferring ownership are gated behind extra confirmation, often requiring you to type the repository name before the action proceeds. Irreversible actions get architectural friction by default in mature software. Emergency management AI is software where a substantial portion of outputs are functionally irreversible — a transmitted alert, a committed resource decision, a recommendation acted on by a duty officer with no time to second-guess. The Danger Zone pattern should be the floor, not the exception.
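
A minimal sketch of that floor applied to alert transmission, combining type-to-confirm with a hold period; the duration and function names are illustrative, not recommendations:

```python
import time

HOLD_SECONDS = 30  # illustrative hold period, not a recommended value

def send_alert(draft: str, target_zone: str) -> bool:
    """Gate an irreversible transmission behind typed confirmation and a hold,
    mirroring GitHub's type-the-repo-name pattern for destructive actions."""
    print(f"--- DRAFT ALERT for zone {target_zone} ---\n{draft}\n")
    typed = input(f"Type the zone name ({target_zone}) to confirm: ")
    if typed != target_zone:
        print("Zone name mismatch. Transmission cancelled.")
        return False
    print(f"Holding {HOLD_SECONDS}s before transmission. Ctrl-C aborts.")
    try:
        time.sleep(HOLD_SECONDS)
    except KeyboardInterrupt:
        print("Aborted during hold period.")
        return False
    # transmit(draft, target_zone) would run here in a real system
    return True
```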

Intentional friction also reshapes the legal record. When the software requires a human to click accept, approve, or send — and logs that interaction with timestamp, identity, the AI's draft, the human's edits, and the time elapsed at each step — the record establishes who decided what and when. That record is the liability shift. The vendor without this architecture is tangled together with the operator on the question of who decided what. The vendor with it has the receipt: the AI offered, the human reviewed, the system logged the deliberation, the human accepted. Skipping the architecture isn't just skipping a safety mechanism. It's keeping the liability on the vendor by default — and depriving the operator of the documentation they'd need to defend their own decisions in any post-incident review.
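
The record itself can be small. A sketch of one approval log entry, assuming a write-once store sits behind it:

```python
import difflib
import json
from datetime import datetime, timezone

def log_approval(operator_id: str, ai_draft: str, approved_text: str,
                 seconds_in_review: float) -> str:
    """Build the record that settles 'who decided what' after the fact."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": operator_id,
        "ai_draft": ai_draft,
        "approved_text": approved_text,
        "diff": list(difflib.unified_diff(ai_draft.splitlines(),
                                          approved_text.splitlines(),
                                          lineterm="")),
        "seconds_in_review": seconds_in_review,
    }
    return json.dumps(record)  # append to a write-once log store in practice
```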

This matters across the cycle, not just in active response. Preparedness tools with AI-generated content benefit from the same logging discipline; the audit trail is just less immediately consequential because the time horizons allow correction. In active response, intentional friction is load-bearing — it's what stands between an AI-drafted alert and the public acting on a wrong one.

When the human isn't in the loop by design

There are AI applications, including in Preppr, where a human can't be in the loop in any meaningful way. Scale-driven applications process volumes no team could review in real time. Agentic systems take actions across multi-step workflows where pausing for review at every step defeats the purpose. In emergency management, this might look like an agentic system ingesting sensor feeds and producing a continuously updated situational picture, or a tool processing thousands of intake records and tagging them for downstream attention. In both cases, intentional friction at every decision point isn't an option.

The architectural mechanism shifts. When the human isn't there to be the audit trail's witness, the audit trail itself has to carry more weight — and it has to be one the vendor can't quietly tamper with after the fact.

This is where tamper-evident records, like what we use with Preppr Collaborate, become load-bearing infrastructure. The version that works uses a third-party notary: an external service that writes cryptographic hashes of the agentic system's interactions to independent storage in real time. The vendor keeps the raw data; the notary holds the hashes. Any change to the vendor's records after the fact would mismatch the notary's hashes, making tampering detectable. When something goes wrong, the inquiry has a source of ground truth that isn't the vendor's word against the user's.
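
Mechanically, the notary pattern is simple: hash the interaction, ship only the hash to independent storage, keep the raw data in-house. A minimal sketch, with the notary's submission API as a placeholder:

```python
import hashlib
import json
from datetime import datetime, timezone

def notarize(interaction: dict, notary_submit) -> str:
    """Hash an agentic interaction and hand only the hash to a third party.
    `notary_submit` is a placeholder for the external notary's API."""
    payload = json.dumps(interaction, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    notary_submit({"sha256": digest,
                   "at": datetime.now(timezone.utc).isoformat()})
    return digest  # the vendor keeps the payload; the notary keeps only the hash

def verify(interaction: dict, notarized_digest: str) -> bool:
    """Any after-the-fact edit to the vendor's record changes the hash."""
    payload = json.dumps(interaction, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest() == notarized_digest
```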

This is the architectural equivalent of intentional friction for unmanned workflows. Both serve the same underlying purpose: making the system's behavior verifiable and the responsibility for that behavior assignable. Intentional friction puts the human in a position to be accountable, with the time and the record to make that accountability real. Tamper-evident records put the system itself in a position to be accountable when no human can be.

Vendors who skip both — no friction in the human-reviewed path, no notarized records in the unmanned path — aren't operating in a neutral position. They're operating in a position where, when a harm event occurs, no one can prove what actually happened.

The liability bet

What concerns me most about the current state of the field is what unqualified accuracy claims imply about the operating assumptions of the vendors making them. A vendor putting AI in active response without published evaluation infrastructure is implicitly betting that the failure modes won't matter, won't be discovered, won't result in attributable harm. That's not a technology bet. It's a liability bet.

Software errors in life-safety domains have a different legal and regulatory shape than software errors in commercial domains. Products liability theory applies to software marketed for emergency response in ways it doesn't apply to a CRM. Discovery in a future case is brutal: marketing claims about accuracy, internal communications about known failure modes, the absence of published evaluation methodology — all of it becomes evidence. Industry standard of care is forming right now in real time, and vendors who don't engage with disclosure norms aren't just outside the standard. They're defining the gap a future plaintiff's expert will point to.

Insurance markets are catching up to this. They always do. AI and technology errors-and-omissions carriers in adjacent high-stakes industries already ask whether vendors publish evaluation methodology, whether they have refusal mechanisms, whether they limit scope to validated use cases. When carriers in the EM AI space start asking the same questions — and they will, because actuaries respond to attributable harm — vendors without disclosure infrastructure will see premiums rise or coverage withdrawn.

Procurement will follow the same curve. CISA, FEMA, and state emergency management offices will formalize AI procurement standards, and the bar will be the same disclosure framework practitioners and researchers are already articulating. Vendors who built that infrastructure early will pass the screen. Vendors who didn't will be locked out of the contracts that matter. Knowing what to ask during procurement is a challenge, but AI product evaluation tools are available.

This isn't speculative. It's the pattern from adjacent high-consequence industries over the past decade: disclosure infrastructure moves from competitive differentiator to procurement floor to insurance prerequisite within a few years of significant claims. The window where unqualified accuracy claims work as marketing is always shorter than vendors think.

Where this lands

The honest position on AI in emergency management is the one many other high-consequence fields have already taken: high accuracy is achievable, perfect accuracy is not, and the gap between accurate and accurate-ish is where the engineering, the disclosure, the architectural review mechanisms, and the human oversight have to do their work. The accuracy bar isn't uniform across the cycle and shouldn't be. The benchmarks shouldn't be either.

The vendors who say so out loud — who publish their numbers, stratify by use case, document their failure modes, build review mechanisms into the software, and limit scope to what they can defend — are the ones who'll still be standing when the procurement and insurance environment catches up.

The vendors claiming accuracy in response use cases without qualification are making a bet. Not a technology bet. A liability bet. And it's a bet against the way every other high-consequence industry has gone before them.

A call to convene

The path forward isn't a vendor publishing a unilateral standard. It's a convening. The field has done this work before — HSEEP didn't emerge from a single vendor; NIMS didn't come from a single agency. Both were built when practitioners, researchers, and federal partners came together to establish shared methodology and apply it across the field. AI evaluation is the next iteration of that same work.

Software vendors building AI for emergency management. Researchers studying alerts, decision-making, and crisis communications. AI developers and infrastructure providers. Practitioners — emergency managers, planners, alerting authorities, exercise designers — across jurisdiction sizes and hazard profiles. Procurement officers from federal, state, and local agencies who'll have to operationalize whatever standard emerges. Insurance carriers and legal experts who'll define how it's enforced.

The work has three phases. Framing: defining what accuracy means across the cycle, which use cases need their own benchmarks, what disclosures should be required at procurement and during operation, which architectural mechanisms — intentional friction, tamper-evident records, refusal infrastructure — should be expected in which contexts. Building: creating the artifacts — a benchmark family with separate golden answer test sets for distinct use cases, disclosure templates, procurement language, audit methodologies, governance structure that prevents capture by any single vendor. Applying: vendors agreeing to be evaluated against the standard, procurement officers writing it into contracts, insurance carriers using it as a basis for underwriting, researchers running independent evaluations and publishing the results.

Without the convening, every vendor publishes their own standard and grades their own homework. With it, the field gets a shared foundation buyers can rely on, regulators can reference, and harm victims can use as a yardstick when something goes wrong.

I'm not asking the field to agree with everything I've written here. I'm asking the people who'd have to be at the table for this to work to actually come to the table. Reach out. Let's see what we can build.

One disclosure. Preppr — the company I run — does not currently build AI for active emergency response. Our AI works across the preparedness cycle, where time horizons allow human review, iteration, and correction. The intentional friction principle described above is one we apply in our exercise design and delivery tools — review steps designed to slow both the AI and the user down and force thinking before acceptance. For features where humans can't be in the loop, we use AgentSystems Notary, a third-party service that writes cryptographic hashes of our agentic system's interactions to independent storage in real time — we control the raw data, but any modification would mismatch the notary's hashes and be detectable. Both are deliberate architectural choices rooted in exactly this argument. The disclosure standard I'm proposing isn't a positioning move; it's the bar I'd want any vendor in active response to meet, including Preppr if we ever chose to build there.

