Gate News message, April 20 — Top AI models excel at complex problems like Olympiad mathematics but struggle with routine enterprise work, according to David Meyer of Databricks. Some models may silently correct an incorrect invoice number rather than flag it as an error, and coding assistants such as Claude can likewise underperform on data engineering tasks.
The gap stems from fundamental differences between enterprise data and the public web text used to train large models. Enterprise data often features vague column labels, numerous blank fields, and codes stored as plain text. In one academic study, an AI model's F1 score, a metric that balances precision and recall, dropped from 0.94 on public data to 0.07 on enterprise data for a data engineering task. Large models also tend to default to familiar patterns from their training data: some reverted to Structured Query Language (SQL) even after receiving instructions and documentation for a company's proprietary query language.
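The study's underlying precision and recall figures aren't reported here, but a minimal sketch shows why F1 collapses so sharply: it is the harmonic mean of the two, so one weak component drags the combined score down. The input values below are hypothetical, chosen only to reproduce scores near 0.94 and 0.07.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values for a model that performs well on public data...
print(f1_score(precision=0.95, recall=0.93))  # ~0.94
# ...versus one retrieving almost nothing correctly on enterprise data.
print(f1_score(precision=0.10, recall=0.05))  # ~0.07
```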
Smaller open-source models tuned with reinforcement learning can handle specific jobs more efficiently, and at significantly lower training cost, than large general-purpose models. Databricks is building smaller AI agents for specific workflows, such as KARL, which uses reinforcement learning for multi-step reasoning over company documents. The industry is shifting from reliance on giant models toward hybrid architectures in which small, efficient models handle the routine volume and escalate only unclear or complex cases to larger, costlier systems.
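A minimal sketch of that escalation pattern, assuming a self-reported confidence score and a fixed threshold. The functions small_model and large_model and the 0.8 cutoff are hypothetical stand-ins, not Databricks APIs.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # model's self-reported confidence in [0, 1]

def small_model(query: str) -> Answer:
    # Stub: a real system would call a small fine-tuned model here.
    return Answer(text="routine answer", confidence=0.92)

def large_model(query: str) -> Answer:
    # Stub: a real system would call a larger, costlier model here.
    return Answer(text="escalated answer", confidence=0.99)

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tuned per workload in practice

def route(query: str) -> Answer:
    """Hybrid routing: the small model handles routine volume,
    escalating only low-confidence (unclear or complex) cases."""
    answer = small_model(query)
    if answer.confidence >= CONFIDENCE_THRESHOLD:
        return answer
    return large_model(query)  # escalate the hard minority of cases

print(route("flag invoices whose numbers don't match the ledger").text)
```

The design keeps per-query cost proportional to difficulty: the large model is invoked only for the minority of cases the small model cannot resolve confidently.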
Databricks recently acquired Quotient AI to help large enterprises run AI agents more reliably. Competition in the AI business now centers on running the full AI lifecycle, including feedback systems that track errors and continuously improve models over time; as a result, evaluation and tuning tools are increasingly valuable after deployment.