The End of Training From Scratch: Our Research on Why All Vision Models Are Converging
- Kshitij Duraphe
- Nov 20, 2025
- 8 min read
One of our team members just had research accepted as a Spotlight at NeurIPS 2025—the top 1% of submissions to machine learning's most prestigious conference. The work tests a provocative idea that's been circulating in the AI community: that all sufficiently large neural networks, regardless of architecture or training data, are converging toward the same internal representation of reality.
If true, this has massive implications for anyone building computer vision systems. It means the age of training specialized models from scratch is ending. Not because it doesn't work, but because it's unnecessary.
What They Tested
The research compared representations across different vision architectures—transformers, convolutional networks, self-supervised models—at different scales. The question: as these models grow larger, do their internal representations become more similar, even when they're trained on completely different types of visual data?
The answer, measured across multiple datasets and model families: yes. Consistently. The larger the model, the more aligned the representations, regardless of whether you're using supervised learning, self-supervised approaches, or entirely different training paradigms.
The numbers were unambiguous. When comparing cross-modal representations, 28 out of 33 model pairs showed increased alignment with scale. For intra-modal comparisons within model families, 14 out of 18 pairs converged as capacity increased. These aren't marginal effects.
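To make "comparing representations" concrete, here's a minimal sketch of one standard way to measure alignment between two feature spaces: linear CKA computed on embeddings of the same probe images. The paper's own alignment metric may differ, and `embed`, `model_a`, `model_b`, and `images` are placeholders.

```python
# Minimal sketch: measuring how aligned two models' representations are
# on the same images, using linear CKA. This is an illustrative stand-in
# for the paper's metric, not a reproduction of it.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)          # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # HSIC-based formulation for linear kernels
    hsic_xy = np.linalg.norm(Y.T @ X, "fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, "fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

# Embed the same batch of probe images with two different models,
# then compare: higher CKA means more similar internal representations.
# feats_a = embed(model_a, images)   # shape (n, d_a), placeholder helper
# feats_b = embed(model_b, images)   # shape (n, d_b)
# print(linear_cka(feats_a, feats_b))
```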
Why This Matters For Real Systems
We've been building on pre-trained vision models since day one. Not because we lacked the resources to train from scratch—though that's expensive—but because it empirically worked better. Models pre-trained on millions of natural images transferred to our degraded video domain more effectively than anything we could train exclusively on surveillance footage.
This research explains why.
Neural networks aren't learning arbitrary representations specific to their training distribution. They're learning fundamental visual properties that generalize because they reflect actual regularities in how light, objects, and space work. Edges exist in both clean photos and foggy surveillance feeds. Objects compose into scenes whether you're looking at ImageNet or a smoke-filled warehouse. Motion follows the same physics in everyday videos and low-light drone footage.
As models scale, they capture these fundamentals more completely. And because the fundamentals are the same across domains, the representations converge. You're not getting "ImageNet features" or "surveillance features"—you're getting visual features, period.
The Practical Engineering Insight
This has a straightforward implication: if you're building a computer vision system for a specialized domain, start with the largest pre-trained model you can deploy, not the most domain-specific architecture.
A ViT-Large pre-trained on natural images will likely outperform a custom architecture trained from scratch on your specific use case, even if that custom architecture was carefully designed for your exact degradation conditions. The pre-trained model has seen more visual phenomena, learned richer feature hierarchies, and built more robust representations of fundamental visual concepts.
Your domain expertise matters—you still need to understand the specific physics of your problem. But you apply that expertise as adaptation layers on top of general-purpose foundations, not by reinventing visual feature learning.
For us, that means:
- SPECTER's low-light enhancement builds on pre-trained spatial understanding
- GHOST's fog penetration leverages learned object recognition
- SPIRIT's atmospheric compensation uses pre-trained motion and distortion models
We're not teaching networks what edges or textures or objects are. We're teaching them how to extract those already-learned concepts from extremely degraded signals.
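As an illustration of what adaptation layers on top of a frozen pre-trained backbone can look like in code, here's a minimal sketch using timm. The backbone choice, head design, and output size are placeholders, not the actual SPECTER, GHOST, or SPIRIT architectures.

```python
# Illustrative sketch only (not our actual pipeline): a frozen pre-trained
# ViT backbone with a small trainable adaptation head for a degraded-video
# task. Model choice and head design are placeholders.
import timm
import torch
import torch.nn as nn

class AdaptedBackbone(nn.Module):
    def __init__(self, num_outputs: int):
        super().__init__()
        # Large general-purpose backbone, pre-trained on natural images
        self.backbone = timm.create_model(
            "vit_large_patch16_224", pretrained=True, num_classes=0
        )
        for p in self.backbone.parameters():
            p.requires_grad = False            # keep learned visual features intact
        # Small domain-specific head: the only part trained on degraded footage
        self.adapter = nn.Sequential(
            nn.Linear(self.backbone.num_features, 512),
            nn.GELU(),
            nn.Linear(512, num_outputs),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)           # general visual representation
        return self.adapter(feats)             # domain-specific readout

model = AdaptedBackbone(num_outputs=10)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # only the adapter trains
```

Because only the adapter trains, the general visual features stay intact and the fine-tuning run fits on modest hardware.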
Why Convergence Changes The Calculus
Five years ago, the decision to train from scratch versus fine-tune pre-trained models involved real tradeoffs. Pre-trained models gave you a head start, but they came with the baggage of their original training distribution. If your domain was different enough, that baggage might hurt more than help.
The convergence finding changes this calculus. If models are learning fundamental representations rather than distribution-specific features, then "baggage" is actually "robust prior knowledge about how vision works." The bigger the model, the more complete that knowledge, and the less it matters that the training distribution was different from your deployment domain.
This is testable, and the research tested it. Models pre-trained on general images showed strong representational alignment with domain-specific models—and the alignment improved with scale. The astronomy-specific model in the study performed comparably to general vision models on the same data, suggesting both had converged toward similar underlying representations.
For specialized computer vision applications, this means the right strategy is usually: find the largest general-purpose foundation model that fits your latency and compute constraints, then fine-tune aggressively on your domain. Don't train from scratch unless you have truly exceptional requirements or massive specialized datasets.
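In practice, "fine-tune aggressively" often just means giving the new head a much larger learning rate than the pre-trained backbone. A rough sketch of that recipe, with illustrative model choice and hyperparameters:

```python
# One common fine-tuning recipe (illustrative, not prescriptive): a small
# learning rate on the pre-trained backbone so its general visual features
# shift gently, and a larger one on the new domain-specific head.
import timm
import torch

model = timm.create_model("convnext_base", pretrained=True, num_classes=5)

backbone_params = [p for n, p in model.named_parameters() if "head" not in n]
head_params = [p for n, p in model.named_parameters() if "head" in n]

optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},   # gentle updates to pre-trained weights
        {"params": head_params, "lr": 1e-3},       # faster learning for the new head
    ],
    weight_decay=0.05,
)

# One optimization step on a dummy batch, just to show the wiring
images = torch.randn(2, 3, 224, 224)
labels = torch.tensor([0, 3])
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```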
The Architecture Question
Does model architecture matter if everything's converging? Yes, but differently than you might think.
The research compared transformers, convolutional networks, and hybrid approaches. All showed convergence with scale, but they converged from different starting points and at different rates. Architecture determines how efficiently you reach good representations and what biases you have along the way.
Convolutional networks have inductive biases toward local spatial patterns—useful when processing video where adjacent pixels are strongly correlated. Transformers have no spatial priors but can model long-range dependencies through attention—useful when context from across the entire frame matters for understanding degraded regions.
We use both, depending on the use case. For real-time enhancement where latency is critical, convolutional architectures often win because they're faster for the same quality level. For post-mission analysis where we can spend more compute, transformer-based models often produce better results because they capture global context more effectively.
But in both cases, we start from pre-trained weights. The architecture choice determines deployment characteristics and fine-tuning efficiency, not whether we need domain-specific pre-training.
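When we say architecture determines deployment characteristics, latency is the first thing we measure. Here's a rough benchmarking sketch; the model names are illustrative, and real numbers depend heavily on hardware, batch size, resolution, and export format (TensorRT, ONNX, and so on).

```python
# Rough latency comparison sketch between a convolutional and a transformer
# backbone of broadly similar capacity. Treat the output as a starting
# point only; production measurements should use the deployed runtime.
import time
import timm
import torch

@torch.no_grad()
def mean_latency_ms(model: torch.nn.Module, runs: int = 50) -> float:
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    for _ in range(10):                     # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1000

for name in ("convnext_base", "vit_base_patch16_224"):
    model = timm.create_model(name, pretrained=False)
    print(f"{name}: {mean_latency_ms(model):.1f} ms per frame (CPU, batch=1)")
```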
What This Means For Video Understanding
The convergence principle has particularly strong implications for video analysis, where the challenge isn't just recognizing objects but tracking them through degraded conditions.
Object detectors trained on clean images work surprisingly well on enhanced low-light footage, even though they never saw anything resembling our inputs during training. Face recognition models transfer to IR imagery despite being trained on visible spectrum photos. Pose estimation works through smoke when the enhancement recovers enough structure.
This shouldn't work if models were learning dataset-specific features. It works because the pre-trained models learned fundamental representations of "person," "vehicle," "motion"—concepts that transcend the specifics of image quality or capture conditions.
Our enhancement pipeline's job is to recover enough signal that these pre-trained representations can activate reliably. We're not teaching detectors to recognize objects in fog. We're removing enough fog that their existing concept of "object" applies.
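A toy sketch of that division of labor: enhancement first, then an off-the-shelf detector that never saw degraded footage. The `enhance` function is a placeholder for whatever enhancement model you use, and the detector here is torchvision's pre-trained Faster R-CNN, chosen purely for illustration.

```python
# Sketch of the "translation" idea: recover signal first, then hand the
# result to a detector pre-trained on clean images.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()

def enhance(frame: torch.Tensor) -> torch.Tensor:
    """Placeholder for a learned enhancement model (denoise, dehaze, relight)."""
    return frame.clamp(0, 1)

@torch.no_grad()
def detect_on_degraded(frame: torch.Tensor, score_threshold: float = 0.5):
    restored = enhance(frame)                # recover signal first
    preds = detector([restored])[0]          # pre-trained concepts take over
    keep = preds["scores"] > score_threshold
    return preds["boxes"][keep], preds["labels"][keep]

boxes, labels = detect_on_degraded(torch.rand(3, 480, 640))
print(f"{len(boxes)} detections above threshold")
```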
The Limits of Convergence
The research shows convergence happening, but it doesn't mean all models are identical or that domain knowledge is irrelevant.
Convergence is asymptotic—models are moving toward shared representations but haven't reached perfect alignment. Smaller models show less convergence than larger ones, and even the largest models tested aren't completely aligned across all conditions.
Domain adaptation still matters. Pre-trained models give you the foundation, but you need to teach them the specifics of your input distribution. For us, that's the physics of how light propagates through obscurants, how sensors behave in extreme low light, and how atmospheric turbulence distorts images.
And some domains genuinely are different enough that specialized architectures help. If your inputs have radically different structure than natural images—think radio astronomy, seismic data, or medical imaging with exotic modalities—the inductive biases from pre-trained vision models may not transfer as cleanly.
But for video, even heavily degraded video, the convergence finding holds. Surveillance footage through fog, drone feeds through smoke, low-light recordings from security cameras—these are still images of the same 3D world with the same physics as clean photos. The degradation obscures information but doesn't change the fundamental nature of what's being observed.
Building For Post-Mission Analysis
The convergence principle aligns perfectly with our core mission: making degraded footage useful for post-incident investigation.
When something goes wrong—a security breach, an accident, an unauthorized access—investigators inherit whatever cameras were recording. The footage exists, but it's often too degraded to be useful. Too dark, too foggy, too motion-blurred, too compressed.
Traditional approaches fail because they assume the footage was reasonable to begin with. Enhance the contrast, sharpen the edges, interpolate missing frames. These techniques hit physics limits fast when the signal-to-noise ratio is terrible.
Our approach works because we're leveraging models that have learned rich representations of visual reality. We're not just applying filters—we're using networks that understand what scenes, objects, and motion look like in general, then inferring the most likely underlying reality given the degraded observations.
This only works if the pre-trained representations are robust and fundamental. If they were dataset-specific, they'd fail on our inputs. The fact that they transfer so effectively validates the convergence hypothesis in the most practical way possible.
The Compute Investment Question
Here's the uncomfortable truth about training large models from scratch: it's expensive enough that most organizations can't afford to do it properly.
The models tested in this research range from tens of millions to over a billion parameters. The largest ones were pre-trained on millions or billions of images using GPU-centuries of compute. Replicating that training for a specialized domain costs millions of dollars and carries a huge carbon footprint.
But if models are converging toward the same representations, you don't need to pay that cost. The ML community has already made the investment. The resulting models are open source and freely available. You can download ViT-Large, DINOv2, or ConvNeXt and start fine-tuning immediately.
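"Start fine-tuning immediately" is not an exaggeration. Loading a public foundation model and extracting features takes a few lines; the torch.hub entry point below is the one published by the DINOv2 repository, and everything beyond that is a placeholder for your own fine-tuning setup.

```python
# Minimal example of picking up a public foundation model and extracting
# features from it, using the torch.hub entry points published by the
# DINOv2 repository. From here, fine-tuning is a matter of attaching and
# training whatever head your task needs.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()

with torch.no_grad():
    # DINOv2 expects image sides divisible by the patch size (14)
    frames = torch.randn(4, 3, 224, 224)
    features = backbone(frames)          # one embedding per frame
print(features.shape)                    # e.g. (4, 1024) for the ViT-L/14 model
```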
This isn't freeloading—it's capitalizing on a genuine scientific finding. These models aren't learning "natural image features," they're learning visual features, and visual features generalize. Using them for specialized applications isn't a compromise; it's the optimal strategy given what we now know about representational convergence.
For Absentia, this means we can offer sophisticated video enhancement without requiring customers to fund massive pre-training runs. We invest our compute budget in fine-tuning and deployment optimization, not rediscovering basic visual feature hierarchies.
Where This Goes Next
The convergence finding opens several research directions we're exploring.
Multi-modal fusion: If vision models are converging and language models are converging, can we fuse them more effectively? We're experimenting with natural language queries over enhanced footage—"show me everyone who entered through the north entrance after midnight"—powered by aligned vision-language representations.
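A toy version of that retrieval step, using a public vision-language model (CLIP via Hugging Face transformers) to score frames against a text query. Our actual query system is more involved; the model choice, query, and stand-in frames here are illustrative only.

```python
# Toy sketch: score enhanced frames against a natural-language query with
# aligned vision-language representations. Only the basic retrieval step
# is shown; real queries involve detection, tracking, and time filtering.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(3)]   # stand-ins for enhanced frames
query = "a person walking through a doorway at night"

inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

scores = outputs.logits_per_image.squeeze(-1)   # one similarity score per frame
best = scores.argmax().item()
print(f"Frame {best} matches the query best (score {scores[best]:.2f})")
```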
Transfer across degradation types: If a model fine-tuned for low-light enhancement has learned robust visual representations, does it transfer to fog removal with minimal additional training? Early results suggest yes, though the degree of transfer depends on what specific physics you've encoded in your adaptation layers.
Efficient scaling: If convergence improves with model size, what's the smallest model that gives acceptable performance for specific deployment scenarios? We can't run billion-parameter models at 30fps on edge devices, but we might not need to if smaller models have converged far enough.
Real-time vs. post-mission tradeoffs: How much convergence do you sacrifice by using faster architectures for real-time processing? Can we quantify the quality degradation and make informed latency-quality tradeoffs based on mission requirements?
The Takeaway
The convergence finding isn't just academic. It's a practical guide for how to build computer vision systems in 2025.
Stop training from scratch unless you have exceptional reasons. Start with the largest pre-trained model that fits your deployment constraints. Invest your engineering effort in domain adaptation, not feature learning. Measure success by downstream task performance, not by how specialized your pre-training data was.
For video enhancement specifically, this means leveraging pre-trained vision models that have learned fundamental visual representations, then adapting them to extract those representations from degraded signals. The enhancement is a translation problem—converting degraded inputs into a form where pre-trained detectors, trackers, and classifiers can function reliably.
We've been building this way since the beginning, and it's gratifying to see research validating the approach. Not because we need validation, but because it clarifies why this works and suggests how to push it further.
The future of specialized computer vision isn't thousands of domain-specific models trained from scratch. It's a smaller number of massive foundation models, thoroughly pre-trained on diverse visual data, adapted intelligently to specific deployment requirements. The convergence finding tells us this isn't a cost-saving compromise—it's genuinely the right technical approach.
Absentia Tech builds AI systems for video enhancement and analysis, processing footage from security cameras, drones, and autonomous vehicles—both in real-time for live monitoring and post-mission for incident investigation. Everything runs on NVIDIA hardware.