Most Agentic AI failures I've debugged turned out to be ingestion drift

Over the last few months, we’ve been building an autonomous Agentic AI, and something unexpected kept showing up. I went in assuming the issues were with embeddings or the retriever, but the root cause was usually ingestion drift upstream.

Some patterns that kept repeating:
• PDFs extracting differently after a small template or export tool change
• headings collapsing or shifting levels
• hidden characters creeping into tokens (see the sketch below)
• tables losing their structure
• documents updated without being re-ingested
• different converters producing slightly different text layouts
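For the hidden-character case, a minimal sketch of the kind of scan that surfaces them — the suspect character set, the `extractions/current` directory, and one plain-text dump per document are illustrative assumptions, not part of any particular toolchain:

```python
# Scan extracted text for invisible / zero-width characters.
# The SUSPECT set and directory layout are placeholders.
import unicodedata
from pathlib import Path

SUSPECT = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00a0", "\u2060"}

def find_hidden_chars(text: str) -> list[tuple[int, str]]:
    hits = []
    for i, ch in enumerate(text):
        # Catch explicit suspects plus anything in the Unicode "Format" category.
        if ch in SUSPECT or unicodedata.category(ch) == "Cf":
            hits.append((i, unicodedata.name(ch, repr(ch))))
    return hits

for path in Path("extractions/current").glob("*.txt"):
    hits = find_hidden_chars(path.read_text())
    if hits:
        print(path.name, hits[:5], f"({len(hits)} total)")
```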

We only noticed the drift once we started diffing extraction output week-to-week and tracking token count variance. Running two extractors on the same file also revealed inconsistencies that weren’t obvious from looking at the text.
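A minimal sketch of that kind of drift check, assuming one extracted .txt file per document per run; the directory names, thresholds, and whitespace tokenizer are placeholder assumptions:

```python
# Diff this run's extraction against the previous one and flag documents
# whose token count or raw-text similarity moves more than a tolerance.
import difflib
from pathlib import Path

TOKEN_DRIFT_PCT = 0.02    # flag >2% change in token count
SIMILARITY_FLOOR = 0.98   # flag if text similarity drops below 98%

def tokens(text: str) -> list[str]:
    return text.split()   # stand-in for whatever tokenizer you actually use

def check_drift(prev_dir: Path, curr_dir: Path) -> list[str]:
    flagged = []
    for prev_file in prev_dir.glob("*.txt"):
        curr_file = curr_dir / prev_file.name
        if not curr_file.exists():
            flagged.append(f"{prev_file.name}: missing in current run")
            continue
        old, new = prev_file.read_text(), curr_file.read_text()
        old_n, new_n = len(tokens(old)), len(tokens(new))
        token_delta = abs(new_n - old_n) / max(old_n, 1)
        similarity = difflib.SequenceMatcher(None, old, new).ratio()
        if token_delta > TOKEN_DRIFT_PCT or similarity < SIMILARITY_FLOOR:
            flagged.append(
                f"{prev_file.name}: tokens {old_n}->{new_n} "
                f"({token_delta:.1%}), similarity {similarity:.3f}"
            )
    return flagged

if __name__ == "__main__":
    for line in check_drift(Path("extractions/prev"), Path("extractions/curr")):
        print(line)
```

Nothing fancy — the part that matters is persisting each run's raw extraction output so there is something to diff against.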

Even with pinned extractor versions, mixed-format sources (Google Docs, Word, Confluence exports, scanned PDFs) still drifted subtly over time. The retriever was doing exactly what it was told; the input data just wasn’t consistent anymore.

Curious if others have seen this. How do you keep ingestion stable in production RAG/Agentic AI systems?

2 points | by wehadit 1 hour ago

1 comment

  • chasing0entropy 43 minutes ago
    This is by design. AI that has consistent, reliable, accurate output is boring