Data Quality Before AI: Why Bad Master Data Slows Every Automation
AI does not fix bad data — it launders it into convincing-looking results. Why the bottleneck is rarely the model.

Most failed AI projects in SMEs did not fail on models. They failed on data: on duplicates, unclear sources, missing ownership, and an Excel shadow process that nobody officially acknowledges.
The most dangerous sentence about AI is: "garbage in, garbage out." It is only half true. In reality: garbage in, convincing-looking garbage out.
Why AI worsens the data problem
A classic analysis with bad data delivers an obviously bad result. An AI with bad data delivers a fluent, confident, professionally phrased wrong result. AI does not fix data quality — it disguises it.
DORA's 2024 Accelerate State of DevOps Report shows the pattern here too: speed and stability come from clean, reliable foundations — not from more tooling on a shaky base. The NIST AI Risk Management Framework explicitly names data quality as a core risk.
The real bottleneck is rarely the model
When an AI project stalls, it is almost never the model. It is four things:
1. No source of truth
Three systems know "the customer", each slightly differently. Without a defined leading source, AI automates the conflict, not the solution.
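What "automating the conflict" looks like can be made concrete. A minimal sketch, with invented system names and fields: three systems hold the same customer, one is declared the leading source, and the check reports every field where another system disagrees.

```python
# Minimal sketch: surface conflicts between systems for one customer.
# "crm", "billing", "shop" and the fields are illustrative assumptions.

LEADING_SOURCE = "crm"  # the declared source of truth for customer data

records = {
    "crm":     {"customer_id": "C-1001", "email": "info@example.com",    "status": "active"},
    "billing": {"customer_id": "C-1001", "email": "billing@example.com", "status": "active"},
    "shop":    {"customer_id": "C-1001", "email": "info@example.com",    "status": "inactive"},
}

def find_conflicts(records, leading):
    """Return fields where any system disagrees with the leading source."""
    truth = records[leading]
    conflicts = {}
    for system, rec in records.items():
        if system == leading:
            continue
        for field, value in rec.items():
            if truth.get(field) != value:
                conflicts.setdefault(field, {leading: truth.get(field)})[system] = value
    return conflicts

conflicts = find_conflicts(records, LEADING_SOURCE)
```

Without the `LEADING_SOURCE` declaration the check cannot even be written: there is no side to compare against, only three equally plausible versions.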
2. Unclear definitions
What is an "active customer"? A "completed order"? If five departments have five answers, the AI has no chance — and still gives one.
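The fix is to pin the definition down in exactly one place that every report and every automation calls. A minimal sketch, where the 365-day window and the field names are assumptions the departments would have to agree on, not a standard:

```python
from datetime import date, timedelta

# Minimal sketch: one shared definition of "active customer" instead of
# five departmental ones. The 365-day window is an illustrative assumption.

def is_active_customer(last_order_date: date, has_open_contract: bool,
                       today: date = None) -> bool:
    """Active = open contract, or an order within the last 365 days."""
    today = today or date.today()
    return has_open_contract or (today - last_order_date) <= timedelta(days=365)
```

The point is not the threshold itself but that there is exactly one, versioned and named, so a change to the definition changes every downstream number at once.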
3. Duplicates and gaps
The same supplier four times, half-maintained fields, historically grown special cases. Humans compensate intuitively; an automatic process does not.
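Both problems are cheap to measure before automating anything. A minimal sketch using only the standard library: the 0.85 similarity threshold is an illustrative assumption, and real matching needs normalization (legal forms, addresses) plus human review of every flagged pair.

```python
import difflib

# Minimal sketch: flag likely supplier duplicates and missing required fields.
# Supplier data and the 0.85 threshold are illustrative assumptions.

suppliers = [
    {"id": 1, "name": "Müller GmbH",  "vat_id": "DE123456789"},
    {"id": 2, "name": "Mueller GmbH", "vat_id": None},
    {"id": 3, "name": "Schmidt AG",   "vat_id": "DE987654321"},
]

def likely_duplicates(rows, threshold=0.85):
    """Pairs of ids whose names are suspiciously similar; review by hand."""
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            ratio = difflib.SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if ratio >= threshold:
                pairs.append((a["id"], b["id"], round(ratio, 2)))
    return pairs

def gaps(rows, required=("vat_id",)):
    """Which required fields are empty, per record."""
    return [(r["id"], f) for r in rows for f in required if not r[f]]
```

Note the limits: "Müller GmbH & Co. KG" would slip under a pure string-similarity threshold, which is exactly why the flagged list is an input to review, not an automatic merge.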
4. The shadow Excel process
The actual flow often does not live in the system but in an Excel file on a drive. Whoever ignores it automates the wrong model of reality.
Data quality is not an IT task alone
The most common mistake is treating data quality as a technical cleanup. It is above all a question of ownership: who owns a data type, who decides definitions, who maintains it? Without that clarification, every cleanup is just a snapshot that decays immediately.
The pragmatic path: not everything, but what's needed
Data quality does not mean "clean up for three years first, then AI". It means: make exactly the data fields reliable that the first concrete use case needs — the same narrow, measurable cut as any good AI pilot (see AI automation: the 90-day pilot).
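"Only the fields the use case needs" can be operationalized as a completeness report restricted to those fields. A minimal sketch, where the field list and the 98% target are assumptions to be set per pilot:

```python
# Minimal sketch: measure reliability only for the fields the pilot needs.
# REQUIRED_FIELDS and the 98% target are illustrative assumptions.

REQUIRED_FIELDS = ["email", "status"]  # what the first use case actually reads
TARGET = 0.98                          # agreed completeness target per field

customers = [
    {"email": "a@example.com", "status": "active", "fax": None},
    {"email": None,            "status": "active", "fax": None},
    {"email": "c@example.com", "status": None,     "fax": "123"},
]

def completeness(rows, fields):
    """Share of non-empty values per field; everything else is ignored."""
    return {f: sum(1 for r in rows if r.get(f)) / len(rows) for f in fields}

report = completeness(customers, REQUIRED_FIELDS)
todo = [f for f, rate in report.items() if rate < TARGET]
# "fax" is deliberately not measured: the use case never reads it
```

The half-empty `fax` column never appears in the report, which is the whole point: cleanup effort goes where the pilot reads, not where the schema is ugliest.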
Checklist before AI automation
- Is there a source of truth per data type?
- Are the core terms unambiguously defined (e.g. "active customer")?
- Are duplicates and gaps known in the relevant slice?
- Is the shadow Excel process captured instead of ignored?
- Is ownership clarified per data type (who owns it, who maintains it)?
- Do we make only the necessary fields reliable, not everything?
- Is data quality an ongoing process, not a one-off action?
Frequently asked questions
Do we have to clean all data first? No. Only the slice the first use case needs. "Everything first" is just as much a mistake as "ignore data".
Can't AI clean the data itself? It can help on subtasks, yes — but controlled and reviewed. AI as an unsupervised data cleaner creates convincing new errors.
How do we spot bad data quality early? By contradictory numbers between systems, by the recurring question of which list is currently the valid one, and by Excel files circulating via email.
Isn't this expensive? More expensive is an automated wrong decision at scale. Data quality is the cheapest phase of an AI project — if it comes first.
Conclusion
AI makes good data usable faster and bad data more dangerous. Whoever defines a source of truth, defines terms, takes the shadow Excel seriously and makes only what's needed reliable automates substance instead of convincing-looking nonsense.
Further reading
- AI Automation for SMEs: the 90-day pilot — a narrow, measurable cut instead of a mega-project.
- Automating Document Workflows with Controlled AI — data quality at the input of the process.
Next step
Is your AI initiative stuck on unclear data? Start with a short assessment of your requirements. We clarify source of truth and definitions for exactly the first use case — not for everything.
Sources
- DORA, Accelerate State of DevOps Report 2024 — dora.dev
- NIST, AI Risk Management Framework — nist.gov
- Destatis, Enterprises using AI — destatis.de