Why the AI Industry Is Scaling in the Wrong Direction, and What It Will Take to Build AI With Real Intelligence
- Mar 5
The diplomatic version of this essay would open by acknowledging what the AI industry has achieved in the last five years and then carefully suggest that some of the underlying assumptions deserve a second look. But the mistake I'm describing isn't at the margins — it's at the foundation, and it needs to be said that way.
The AI industry is making a fundamental mistake. Not a tactical one, not a resource allocation error, not a messaging problem. A foundational architectural mistake that is baked into the dominant paradigm, celebrated in earnings calls, and defended by some of the most well-funded organizations in human history. And the longer it goes unaddressed, the more expensive it becomes to fix.
So let me lay out what that mistake is, why it's compounding, and what a better path forward actually looks like.
The Scaling Myth
The working assumption of the AI mainstream is simple: more is more. More parameters produce more capable models. More data produces better reasoning. More compute produces better outputs. If a model hallucinates, train it on more data. If it fails at reasoning, make it bigger. If it's unreliable, give it more power.
This assumption has produced genuinely impressive demonstrations. It has also produced systems that confidently fabricate court citations, invent scientific references, misstate facts with total fluency, and fail catastrophically at tasks that any moderately experienced human would find trivial.
The working defense is that these are edge cases. Bugs to be patched. Problems that the next version will solve. But they are not edge cases. They are structural outputs of an architecture that has no real grounding in the world. They are what happens when you build a system that is extraordinarily good at predicting what language should come next, and then mistake that capability for understanding.
The human brain runs on roughly 20 watts. Twenty watts. The most sophisticated cognitive system ever observed operates on less energy than a light bulb.
Meanwhile, the largest AI data centers now consume as much electricity as small cities. Nuclear reactors that had been decommissioned are being brought back online to keep the servers running. The energy cost of a single large language model query is orders of magnitude beyond what the human brain expends on equivalent tasks.
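To make the gap concrete, here is a rough back-of-envelope comparison. Every figure in it is an illustrative assumption rather than a measurement - published per-query energy estimates vary widely - but the shape of the result survives any reasonable choice of numbers.

```python
# Back-of-envelope comparison of energy per "task": human brain vs. one
# large-model query. Every figure here is an illustrative assumption.

BRAIN_POWER_W = 20.0        # approximate continuous power draw of the brain
TASK_SECONDS = 10.0         # assume ~10 s of focused human thought per task
LLM_QUERY_WH = 3.0          # assumed watt-hours per large-model query

brain_joules = BRAIN_POWER_W * TASK_SECONDS       # 20 W * 10 s = 200 J
llm_joules = LLM_QUERY_WH * 3600.0                # 3 Wh = 10,800 J

print(f"brain: {brain_joules:.0f} J, query: {llm_joules:.0f} J, "
      f"ratio: ~{llm_joules / brain_joules:.0f}x")   # ~54x under these assumptions
```

Even under assumptions generous to the model, the disparity is stark, and it compounds across billions of queries a day.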
This is not a sign that we're on the right track. It is a sign that we have optimized for scale at the expense of efficiency, and we are paying for it in ways that haven't fully surfaced yet, economically or environmentally.
Pattern Matching Is Not Understanding
Let me be direct about what current large language models actually are. They are extraordinarily sophisticated pattern matchers. Given a sequence of tokens, they predict what tokens should follow with remarkable accuracy across an enormous range of domains. They have been trained on so much text that they can simulate the surface characteristics of expertise in almost any field.
What they cannot do is understand the world the text describes.
This distinction matters enormously. When a language model produces a confident, well-structured, completely fabricated answer, it is not malfunctioning. It is doing exactly what it was built to do: producing the most statistically probable continuation of the text. There is no mechanism inside the model that distinguishes between a true claim and a plausible-sounding false one. There is no grounding in reality that would cause it to pause, notice an inconsistency, or admit uncertainty when certainty is unwarranted.
Hallucination is not a bug. It is the predictable output of a system that has learned the shape of knowledge without the substance of it.
Scaling this architecture does not fix the problem. It makes the hallucinations more fluent, more confident, and harder to detect. A bigger pattern matcher produces more convincing wrong answers, not fewer wrong answers.
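To see why, look at what the decoding loop of such a system actually does. The sketch below is schematic - `model` and `tokenizer` are stand-in objects, not any particular library's API - but the structure is the point: tokens are selected by probability, and nothing in the loop consults reality.

```python
# Schematic decoding loop. `model` and `tokenizer` are stand-ins, not any
# particular library's API. Note what is absent: no fact store, no world
# model, no check of any kind against reality.

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)            # distribution over the vocabulary
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(best)                               # most probable token wins, true or not
    return tokenizer.decode(tokens)

# A fabricated citation and a real one are indistinguishable inside this loop:
# both are just high-probability token sequences.
```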
The Brain Does Not Work This Way
The human brain is not a large language model. Understanding this difference is the most important insight available to anyone trying to move AI toward general intelligence.
The brain's primary interface with the world is not language. It is sensory perception. Before any human ever learned to speak or read, the brain was developing sophisticated representations of the environment: spatial relationships, object permanence, causal chains, threat assessment, social dynamics. All of this happened through direct sensory engagement with reality, not through exposure to descriptions of it.
Language arrived late in evolutionary history, and it serves a specific purpose: communication between individuals who already have independent understandings of the world. It is the transmission layer, not the substrate of understanding itself. You do not think in words and then translate them into understanding. You understand, and then - sometimes, imperfectly - you translate that understanding into words.
This is why you can know how to ride a bicycle without being able to explain how. It is why expert musicians can play passages they cannot verbally describe. It is why the most capable human performers in any physical domain - martial artists, athletes, pilots, combat operators - describe a state where language-based deliberation disappears and understanding becomes direct and automatic.
The Japanese have a word for this: Mushin, which translates literally as 'no mind.' The state of action without the intermediary of conscious linguistic deliberation. It is not the absence of intelligence. It is intelligence operating at its most efficient - subconscious, environmental, immediate.
Current AI architecture is attempting to reconstruct the entire edifice of human intelligence from the top layer down. It starts with language because language is the most abundant training data available, but it ignores everything that language is built upon: embodied experience, sensory grounding, environmental understanding.
The Efficiency Argument
There is also a straightforward engineering argument against the current paradigm, independent of the philosophical one.
Generalist models are wasteful by design. A single model that can discuss philosophy, write code, analyze medical images, and compose poetry is not specialized for any of these tasks. It carries the entire weight of all its training for every query it processes. The overhead is immense. The precision is limited by the breadth. And the failure mode is unpredictable, because the model has no reliable internal mechanism for knowing which domain it is operating in and how confident it should be.
Specialized systems do not have that weakness. A system trained specifically to detect objects in fog does not need to know anything about poetry. It can be smaller, faster, more accurate in its domain, and far less prone to confident errors outside its boundaries. Because the domain is known, you can engineer around its actual constraints. Spillover beyond that domain is minimized, and the result is a system that is reliable within its scope instead of unreliable across everything.
The architecture we are building at Absentia reflects this principle. Rather than a single enormous generalist model, we develop specialized agents - each purpose-built for a specific environmental challenge, each small enough to operate at the edge, each capable of being validated against clear domain-specific benchmarks. These agents are linked within an orchestration framework that routes queries to the appropriate specialist, aggregates outputs intelligently, and maintains coherence across the system.
You get the capability of a large system without the overhead. You get specialization without isolation. And critically, you get a system where the failure modes are bounded and predictable, because each component knows exactly what it is and is not designed to do.
The future of AI is not one enormous model that knows everything imperfectly. It is a network of specialists that know their domain precisely, coordinated by an architecture that understands how to deploy them.
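A deliberately simplified sketch of that routing pattern follows. It is not our production code - the domains and interfaces are hypothetical - but it illustrates the property that matters: a query outside every specialist's declared scope fails explicitly instead of being answered badly.

```python
# Simplified sketch of specialist routing (not production code; the
# domains and interfaces are hypothetical). The point is the failure mode:
# a query no specialist covers is rejected explicitly, never improvised.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Agent:
    domain: str
    handler: Callable[[bytes], dict]     # raw sensor payload -> structured output

class Orchestrator:
    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, agent: Agent) -> None:
        self._agents[agent.domain] = agent

    def dispatch(self, domain: str, payload: bytes) -> dict:
        agent = self._agents.get(domain)
        if agent is None:
            # Bounded, predictable failure: out of scope means no answer,
            # not a confident wrong one.
            raise ValueError(f"no specialist registered for domain '{domain}'")
        return agent.handler(payload)

orchestrator = Orchestrator()
orchestrator.register(Agent("fog-detection", handler=lambda frame: {"objects": []}))
result = orchestrator.dispatch("fog-detection", payload=b"\x00")   # routed to the specialist
```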
Vision First
Within this framework, we have made a specific foundational choice: we start with vision.
This is not arbitrary. Vision is the primary signal through which intelligent systems - biological or artificial - first make contact with their environment. It is the highest bandwidth sensory channel available. It is the input that provides the richest and most immediate representation of spatial reality. And it is the domain where the gap between current AI capability and real-world operational need is most acute and most consequential.
Starting with vision also imposes a discipline that generalist language-based systems can sidestep altogether. Visual output is either faithful to the scene or it is not. You cannot hallucinate a fog-cleared image the way you can hallucinate a plausible-sounding fact. The gap between the input and the output is visible. The errors are measurable. The standard of accuracy is externally verifiable.
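That verifiability can be reduced to a number. The sketch below scores a restored image against ground truth using PSNR, one standard fidelity metric among several; the arrays and the acceptance threshold are illustrative.

```python
# Scoring a restored image against ground truth with PSNR, one standard
# image-fidelity metric among several. Arrays and threshold are illustrative.

import numpy as np

def psnr(reference: np.ndarray, output: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to ground truth."""
    diff = reference.astype(np.float64) - output.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                       # pixel-identical images
    return 10.0 * np.log10((255.0 ** 2) / mse)    # assumes 8-bit pixel values

# Accept a defogged frame only if it clears a measured bar, e.g.:
# assert psnr(ground_truth, defogged) > 30.0
```

Accuracy becomes a measured quantity rather than a rhetorical one.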
This grounds the system in reality in a way that language-based training cannot. And from this foundation - visual grounding in the actual environment - we can layer additional sensory modalities: thermal, infrared, acoustic. Each additional layer increases environmental understanding. The result is a system that perceives its environment more richly than any individual human operator, processes that perception more quickly than any biological system can, and acts on it without the latency of conscious deliberation.
This is what genuine AI-augmented capability looks like. Not a system that reads about the world and predicts what words should follow. A system that perceives the world and truly understands it in a way comparable to a human being.
The Missing Infrastructure Layer
There is a gap in current AI infrastructure that has not been adequately named. It exists between raw sensor input and actionable intelligence, and it is the gap that determines whether a system is genuinely useful in the field or merely impressive in a demo.
Human operators fill this gap with expertise, experience, and cognitive load they can ill afford. They take degraded, noisy, incomplete sensor data and translate it - in real time, under pressure - into situational awareness and decisions. They are extraordinarily good at this, but they are also limited by fatigue, bandwidth, and the fundamental constraints of biological attention.
MYSTIC is designed to occupy this translation layer. Not to replace the human operator - the judgment, the contextual understanding, the accountability, the mission awareness that no current AI system can replicate. But to expand the range and reliability of what the operator perceives. To handle the low-level translation from raw sensor data to meaningful environmental understanding, so the operator can focus on the high-level decisions that require human judgment.
This is the appropriate division of cognitive labor between human intelligence and artificial intelligence. AI handling the subconscious layer - the automated, sensory, pattern-recognition layer - that currently taxes human operators without requiring their uniquely human capacities. Humans handling the conscious layer - the judgment, the ethics, the mission context - that AI is not equipped to manage.
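In schematic form, the division looks something like this - hypothetical types and thresholds, not MYSTIC's actual interface: the AI layer filters and structures raw perception, and anything that requires a decision is surfaced to the operator.

```python
# Minimal sketch of the division of labor (hypothetical types and thresholds,
# not MYSTIC's actual interface). The AI layer filters and structures raw
# perception; anything decision-shaped goes to the human operator.

from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    confidence: float
    bearing_deg: float

def triage(detections: List[Detection], threshold: float = 0.6) -> List[Detection]:
    """AI layer: suppress sensor noise, surface only what merits attention."""
    return [d for d in detections if d.confidence >= threshold]

def operator_view(detections: List[Detection]) -> None:
    """Human layer: judgment, context, and accountability stay here."""
    for d in triage(detections):
        print(f"ALERT {d.label} at bearing {d.bearing_deg:.0f} deg "
              f"(confidence {d.confidence:.2f}) - operator decides")
```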
The result is not a replacement of human capability. It is a genuine augmentation of it.
Why This Matters Now
The stakes of this architectural question are not abstract. The systems being built today will shape the infrastructure of AI-augmented operations for the next decade. The assumptions baked into today's dominant paradigm will be expensive to excavate once they are load-bearing.
The industry is at an inflection point. The current scaling trajectory is hitting diminishing returns. The energy costs are becoming untenable. The hallucination problem is not resolving. And the gap between what AI systems are claimed to be capable of and what they are actually reliable enough to do in high-stakes environments is widening rather than narrowing.
Right now the industry is celebrating progress instead of questioning the direction. What is AI actually doing when it 'reasons'? What does it mean for a system to understand the environment it operates in? What architecture produces AI you can trust when failure is not an option?
We are asking these questions at Absentia Tech. We have reached specific conclusions about the answers. And we are building systems that reflect those conclusions rather than the comfortable assumptions of the mainstream.
The direction is not bigger. The direction is better. The direction is grounded in reality, specialized by domain, efficient by design, and trusted because it earns trust through performance rather than claiming it through scale.
That is the AI worth building.
Written by the Absentia Leadership Team
Andrew Ferguson is CEO and Co-Founder of Absentia Technologies, developing next-generation AI-powered visual intelligence systems for defense and security applications. To learn more about Absentia's approach, visit absentiatech.com.