Data cleaning - Zelix Glossary

What it means

Data cleaning is the work of finding and fixing malformed, inconsistent, and duplicated records before an AI system relies on them. It is the unglamorous middle 60 percent of an AI deployment. Phone numbers in five different formats. Customer names with leading whitespace. Product SKUs spelled three different ways. Date fields with mixed timezones. Duplicate records across two systems with no clear primary key.
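A minimal Python sketch of what fixing a few of these looks like. The function names and rules are illustrative assumptions, not a Zelix tool: names are trimmed, phone numbers are reduced to bare digits, and SKUs are assumed to differ only by case, spaces, hyphens, and underscores. Real data usually needs more careful rules, such as country codes for phone numbers.

```python
import re


def clean_name(raw: str) -> str:
    # Trim both ends and collapse internal runs of whitespace to one space.
    return " ".join(raw.split())


def clean_phone(raw: str) -> str:
    # Assumption: numbers are compared as bare digit strings, so every
    # non-digit character (spaces, dots, dashes, parentheses) is dropped.
    return re.sub(r"\D", "", raw)


def clean_sku(raw: str) -> str:
    # Assumption: spelling variants differ only by case and separators.
    return re.sub(r"[\s\-_]", "", raw).upper()
```

With rules like these, "(555) 010-2030" and "555.010.2030" reduce to the same digit string, and "ab-123", "AB 123", and "ab_123" reduce to the same SKU, so the records can be matched.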

An AI model trained on dirty data learns the dirt. It learns to confuse two customers because their names look similar, or to misroute a message because the SKU was typed wrong. Cleaning the data once, properly, removes that risk for every downstream use case.

Why it matters

The cost of dirty data compounds. Every AI workflow that reads the same dataset inherits the same dirt. Cleaning at the source pays back across every future deployment, not just the current one.

It also matters for trust. The first time an AI agent confidently quotes the wrong price because the catalog had two SKUs for the same product, the team stops trusting it. Clean data is what keeps trust intact during the early weeks of a deployment.

Example

A furniture retailer has 4,200 product records. The cleaning pass finds 380 duplicates, 110 SKUs with embedded HTML, 60 records missing a price, and 25 with prices in the wrong currency. After cleaning, the AI assistant that helps customers find products stops giving wrong answers and starts converting at twice the previous rate.
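A cleaning pass like this one often starts with an audit that only counts problems before anything is changed. The sketch below is a hypothetical Python version, not the retailer's actual process. It assumes each record is a dict with "sku", "price", and "currency" keys, that duplicates can be detected by a normalized SKU, and that the catalog should be priced in a single expected currency.

```python
import re
from collections import Counter

HTML_TAG = re.compile(r"<[^>]+>")


def audit(records, expected_currency="USD"):
    """Count the four problem types from the example without modifying records."""
    counts = Counter()
    seen = set()
    for rec in records:
        sku = rec.get("sku") or ""
        if HTML_TAG.search(sku):
            counts["html_in_sku"] += 1
        # Assumption: two records are duplicates when their SKUs match after
        # stripping tags, separators, and case differences.
        key = re.sub(r"[\s\-_]", "", HTML_TAG.sub("", sku)).upper()
        if key:
            if key in seen:
                counts["duplicate"] += 1
            seen.add(key)
        if rec.get("price") is None:
            counts["missing_price"] += 1
        elif rec.get("currency") != expected_currency:
            counts["wrong_currency"] += 1
    return counts
```

Counting first keeps the audit separate from the fix, so the team can review what was found before any record is merged or rewritten.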

Where this comes up

Workflow engineering
