
How We Scale to AGI, and What Writing About the Future of AI Has Shown Me

  • Mar 5
  • 10 min read

December 2008, New York. A friend handed me a copy of The Singularity Is Near by Ray Kurzweil. He already knew where my head was. That book didn't start the obsession, it just named it. I have read countless books and articles about it since, always thinking about AI and its role in our collective story.


The Singularity Is Near by Ray Kurzweil


That curiosity eventually found its way into my screenwriting and films. My first AI screenplay, Paradise Ranch, co-written with Kiril Sabev, is set in a far future where technology has long since collapsed back to something analog, roughly 1980s in character. AI is far-off history. Then a group of small-town kids stumble onto a doomsday shelter inside their local dam and find the last functioning AGI on earth still running inside it.


Paradise Ranch is a screenplay by Emanouil Angelov and Kiril Sabev about AGI in the far future.

My current screenplay is set decades after humanity's technological collapse and follows the emergence of Ray Kurzweil's theorized Singularity in that technological desert.


Surviving The Fall is a screenplay by Emanouil Angelov about AGI and the Singularity.

Writing these stories demanded a kind of honesty that most public discussions of AI avoid. You cannot fake the internal logic of a world and expect people to believe it. To write about AI with any truth, you have to follow the ideas to their inevitable ends, face what they reveal, and resist the temptation to settle for what sounds exciting when the real answers are harder to imagine. So it forced me to think like a futurist.


Eventually I began to see a gap between what the industry says it is building and what it is actually building. Hype has shaped AI development as much as science has. Scaling prediction is not the same as scaling understanding.


The mainstream bet on the large language model as the final destination of AI is, for all its momentum, a wrong turn. There is a bubble forming around it. But AI will survive that bubble the way the internet survived the dot-com crash: the infrastructure will remain, the hype will clear, and the companies still standing will be the ones that built something real underneath it.


I come to all of this from an unusual combination of angles. I am bilingual, trained in cinematography and cameras, and spent years teaching English across cultures while learning foreign languages myself. Vision and language are domains I have worked in professionally and studied closely. When writing about AI forced me to reckon honestly with what intelligence actually is, the mismatch with the industry's direction was not subtle.


The AI industry is not building intelligence. It is scaling a communication tool. To understand this fork in the road to AGI, we have to consider the role sensory data and language play in our own intelligence.


The Wrong Layer, Built from the Wrong Stuff


Large language models are not mimicking intelligence. They are mimicking language itself, the communication tool that sits at the top of the stack, furthest from where understanding actually originates.


Consider Two-Face from Batman. He surrenders every decision to a coin toss. Not because the coin is wise, but because he has abdicated agency. The coin does not understand the problem. It does not weigh consequences or generate meaning. It simply resolves the paralysis of choice with a random output. And crucially, it always lands somewhere. Every flip produces an answer. Sound familiar?


An LLM operates the same way. Not randomly, but statistically. Given any input, it will produce a confident, fluent, well-formed output. It has no agency, no understanding, no stake in whether the answer is true. It is navigating a chaotic world of language patterns with a very sophisticated coin, a token system if you like. The output lands somewhere every time. That is not intelligence. That is the absence of intelligence dressed in its clothing.
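To make the coin analogy concrete, here is a minimal Python sketch of next-token sampling, the mechanism at the core of LLM generation. The vocabulary and probabilities below are invented for illustration; real models sample from distributions over tens of thousands of tokens, but the principle is the same: a weighted flip that always lands somewhere.

```python
# Illustrative only: next-token generation as a weighted coin flip.
# The vocabulary and probabilities are invented for this example.
import random

vocab = ["true", "false", "plausible", "fabricated"]
probs = [0.40, 0.10, 0.35, 0.15]  # hypothetical model probabilities

def next_token() -> str:
    # Sampling always lands somewhere: an answer is produced
    # whether or not anything behind it is true.
    return random.choices(vocab, weights=probs, k=1)[0]

print([next_token() for _ in range(5)])
```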


Real order in a chaotic world is not statistical. It is biological. It emerges from organisms that have skin in the game, that live and die by the accuracy of their perception, that built their understanding of reality through direct contact with it over millions of years of evolutionary pressure. Language is one of the last things that emerged from that process, not the foundation of it. Building AI on language and calling it intelligence is building on the most abstracted, most removed, most human-specific layer of a process that runs incomparably deeper.


Companies like OpenAI, Google, X, and others are building that layer from the flimsiest possible stratum of human consciousness: language, which is, by definition, a compressed and distorted transmission of what the mind actually understood. Void of understanding, void of agency, yet the bet is that if you scale it far enough, the mimicry becomes close enough to the real thing that nobody checks what is underneath.


Think about how you learned to walk, to catch a ball, to read a room. Nobody described these things to you in sentences and you reconstructed the skill from the description. You engaged with the world directly. Your sensorimotor system built representations through iteration, failure, and feedback from physical reality. The understanding came first. Language, if it arrived at all, came after as a way to communicate what the body already knew.

Intelligence is not a product of language. Language is a product of intelligence. It is what the conscious actor produces when drawing on the subconscious processing layer where actual understanding lives. When we build AI on language data, we are training on outputs of outputs: compressed descriptions of understood experiences, produced by an imperfect communication tool, fed into a model that has never had the experience itself.


We are not building intelligence. We are scaling a communication tool built from other people's descriptions of experiences the model has never had.

The result is Two-Face's coin, supremely optimized. It lands on an answer every time, fluent and confident, yet with no understanding of what was actually asked. Sometimes correct, sometimes fabricated, because the model carries none of the lived context every human brings to language without thinking: embodied experience, cultural grounding, the felt sense of what words actually refer to in the world. That context is not in the text. It never was. It lives in the people who wrote the text, and it did not survive the translation into tokens.


The model has no way to distinguish a true answer from a plausible one because it has no ground truth to check against. It works only with statistical patterns in text that were already one step removed from reality, stripped of their larger sense-based context before the model ever saw them. We are then surprised when it hallucinates. But hallucination is not a malfunction. It is the coin doing exactly what it was built to do, landing somewhere every time, with reasoning layers now trying to limit where and how it lands to better mimic real intelligence.


Alphabets, Hieroglyphs, and the Shape of Representation


To understand what a better architecture looks like, it helps to understand what alphabetic language actually is as a representational system.

Western written language descended from administrative necessity: counting grain, recording debts, tracking inventory. It was optimized for discrete quantification, breaking reality into separate labeled units that could be stored and retrieved. The alphabet is the endpoint of this process. An arbitrary set of symbols, each representing a sound, combined into words, each word a label attached by convention to some slice of reality.

There is no resemblance between the word 'fire' and fire. The relationship is entirely symbolic, abstract, and culturally agreed upon. The word does not carry the heat, the light, the danger, the warmth. It points at them, imprecisely, for people who already know what fire is.


Hieroglyphs, Kanji, and Mayan script work on a different principle. They are graphic and visual. A hieroglyph for a bird looks like a bird. It does not just label the concept; it evokes it, positions it visually in relation to other elements, and communicates something of the thing itself rather than merely pointing at a convention. These scripts are more holistic because they are still partly image. The abstraction from physical reality is less complete.


Now consider the representational choice at the heart of modern AI. Language models tokenize text into discrete symbolic units and learn statistical relationships between them. Each token is the endpoint of alphabetic abstraction: a symbol stripped of any direct relationship to the physical reality it supposedly represents. The model learns which tokens tend to follow which other tokens across billions of examples. It does not learn the larger contexts behind the tokens. This is the symbol grounding problem in formal ML terms. Tokens lack grounded representations. They are not connected to their referents in the world in a way that translates real and inherent context.
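A toy sketch makes that arbitrariness visible. This is not any real tokenizer (production systems use schemes like byte-pair encoding), but the hypothetical mapping below shows the core property: the integer assigned to 'fire' is pure convention, determined by arrival order, carrying nothing of the thing itself.

```python
# A toy illustration of the symbol grounding problem: token ids are
# arbitrary integers assigned by convention. Nothing about the id
# for "fire" carries heat, light, or danger.
vocab: dict[str, int] = {}

def tokenize(text: str) -> list[int]:
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # next free integer, by arrival order
        ids.append(vocab[word])
    return ids

print(tokenize("fire burns"))        # [0, 1]
print(tokenize("water burns fire"))  # [2, 1, 0]
```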


Foundation vs Interface in Intelligence Architecture

Language models trained on text are working at maximum distance from the world they are supposed to understand.

What would a hieroglyphic approach to AI representation look like? It already exists in embryonic form. Vision-language models like CLIP learn joint embeddings across images and text, producing dense vector representations where concepts that are visually and semantically related occupy nearby regions of a shared high-dimensional latent space. Unlike discrete tokens, these embeddings are continuous and relational. The representation of 'fire' in a well-trained multimodal embedding space carries proximity to heat, light, danger, and warmth, because the model has processed images of fire alongside text about fire and learned the structural relationships between them.


This is closer to holistic representation. Not a label attached to a convention, but a position in a latent space shaped by both visual and semantic data about the thing. The vector is not the word and not the image. It is a learned compression of both. Still imperfect, still partially anchored to language through its training data, but pointing in the right direction: away from language as the primary modality and toward real-world sensory data as the foundation.
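A minimal sketch of what proximity in a latent space means, using hand-made vectors rather than real CLIP embeddings. The numbers are invented; only the geometry is the point: concepts that co-occur visually and semantically score high on cosine similarity, unrelated ones low.

```python
# Hand-made stand-ins for learned embeddings, invented for this example.
# Related concepts sit near each other; unrelated ones do not.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {
    "fire":    np.array([0.9, 0.8, 0.1]),
    "heat":    np.array([0.8, 0.9, 0.2]),
    "glacier": np.array([0.1, 0.05, 0.95]),
}

print(cosine(emb["fire"], emb["heat"]))     # high: nearby in the space
print(cosine(emb["fire"], emb["glacier"]))  # low: distant in the space
```

Unlike the integer ids in the tokenizer sketch above, these vectors carry relational structure: nearness in the space encodes something about the world, not just a convention.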


How a Child Actually Learns


The correct model for machine intelligence development is not the language model. It is the infant.


How Intelligence Forms

A child does not acquire understanding of the world by reading descriptions of it. They engage with it physically and sensorially: touching objects, tracking movement, learning causality through interaction, developing spatial understanding through navigation, building social models through face-to-face engagement long before language arrives. By the time a child can form sentences, they already have a rich internal model of physical and social reality built entirely from sensorimotor experience. In essence, they are building their own Large World Model and Large Cultural Model.


Language then arrives as a tool for communicating and refining that model. It is extraordinarily useful in this role. But it did not build the model. The model was already there, constructed from direct engagement with the world.

In ML terms, this is sensorimotor grounding: building internal representations through iterative feedback between action, perception, and environmental response. The representations that emerge are genuinely grounded because they are causally connected to the physical world that generated them. Error signals come from reality, not from human annotation of text.
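Here is a minimal sketch of that loop, with a toy one-parameter 'physics' standing in for the world. Everything in it is invented for illustration, but it shows the structure: act, predict, observe, and let reality itself supply the error signal.

```python
# A minimal sketch of sensorimotor grounding: the learner acts,
# predicts the consequence, and reality (here a toy physics rule)
# supplies the error signal directly. No text, no human labels.
import random

TRUE_GAIN = 2.0           # hidden "physics" of the environment
est_gain, lr = 0.0, 0.05  # learner's internal model and learning rate

def environment(action: float) -> float:
    return TRUE_GAIN * action  # reality's actual response

for step in range(200):
    action = random.uniform(-1.0, 1.0)  # act
    predicted = est_gain * action       # predict the consequence
    observed = environment(action)      # perceive what really happened
    error = observed - predicted        # error comes from reality itself
    est_gain += lr * error * action     # update the internal model

print(round(est_gain, 3))  # ~2.0: the learner converged on the world's rule
```

Nothing in the loop passes through human annotation. The model's accuracy is checked against the world directly, which is exactly the property language data cannot provide.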


Scaling this approach rather than scaling language produces a fundamentally different kind of system. The compute efficiency argument is real: a model that learns grounded representations from sensory data does not need to process the entire expressive output of human civilization to achieve competence in a domain. It needs sufficient environmental exposure to that domain. The signal is richer per data point because it is grounded. The model learns faster because reality provides immediate and unambiguous error correction in a way that language data cannot.


A sensorimotor learning system can also run at machine speed through simulated or real environments, generating orders of magnitude more learning events per unit time than any language-based training process. What takes a human years of embodied practice can be compressed, not because the system is smarter in some abstract sense, but because it is iterating against reality rather than against human descriptions of reality.


Paradigm Shift: Intelligence Is Our Subconscious Layer


I want to be precise about the claim I am making, because it is easy to misread.


I am not saying consciousness does not exist or does not matter. I am saying that what we call intelligence, the capacity for skilled understanding and effective action in the world, lives primarily in the subconscious processing layer. The conscious actor, the reflective, language-using, narrating self, the Freudian ego, is a real and important phenomenon. But it is not the source of intelligence. It is an interface to it.

The ego can reflect on what the subconscious has produced. It can use language to communicate it, refine it, share it with others. Language is a tool of the reflective actor, a powerful and distinctly human tool. But it is a communication device, not a cognition device. The cognition happens elsewhere.


This matters for AI because it means the question 'what is intelligence?' has a different answer than the language model paradigm assumes. Intelligence is not sophisticated language use. It is sophisticated world-modeling, grounded in sensory reality, operating at speeds and depths the conscious actor cannot access directly. Language is how that intelligence talks to other actors. It is not the thing itself.


Every human accesses this layer. The musician who has practiced until the music plays itself. The operator who reads a threat environment before consciously registering it. The surgeon whose hands know where to go. The writer whose story seems to just flow through him and not from him. None of this is happening at the language layer. Language arrives after, to describe what intelligence already did.


Language is how intelligence communicates. It is not intelligence itself. We have spent close to a decade scaling the communication tool and calling it the mind.

The AGI Worth Building Has No Conscious Will


The path to AGI is not through language. It is through the same process that produced intelligence in the first place: direct, iterative engagement with the physical world.


Every intelligent creature that has ever existed built its understanding from the ground up. Sensory contact with reality. Action and consequence. Pattern recognition earned through experience, not extracted from descriptions of it. Intelligence at this layer has no agenda. It does not develop preferences or pursue goals. Agency is not a property of intelligence. It is what the ego layers on top after the fact. The subconscious does not want anything. It just understands.


Real AGI will work the same way. Not an agent that acts in the world on its own behalf, but an intelligence layer we tap into, the same way we are trying to interact with LLMs today, except grounded in reality instead of language. A substrate of genuine world-modeled understanding that operators and systems can query, draw from, and build on. Deeper, faster, and more accurate than anything built on text, because it was trained on the world itself rather than on human descriptions of it.


The reflective actor, the system with something resembling agency or free will, is a separate and far harder problem. It may be worth solving eventually. But you cannot solve it honestly until you have first built the intelligence layer it would run on. Right now the industry is trying to simulate the actor without having built the substrate. That is why current AI systems feel simultaneously impressive and hollow. The performance is there. The foundation is not.


Build the foundation first. Crack genuine AGI as an intelligence layer. Then, if it still seems like a good idea, tackle the harder problem of what it would mean to give that intelligence a will of its own.


That is what we are building at Absentia. Vision first, because light bouncing off matter is the highest-bandwidth grounded signal available. Extending into multi-sensor fusion to avoid the ouroboros of text-based training data. No actor in the architecture, because operators do not need AI that performs intelligence. They need AI that provides it.


Written by the Absentia Leadership Team


Emanouil Angelov is Co-Founder of Absentia Technologies. His work at Absentia is informed by the intersection of visual perception, linguistic theory, AI architecture, and business strategy. To learn more, visit absentiatech.com.