A fascinating experiment with infants involves presenting them with a glass of water on a desk, then hiding it behind a wooden board. If the board then sweeps through the space where the glass should be, as though the glass weren’t there, do the infants show surprise? Most 6-month-olds react with surprise, and by age 1, nearly all children have grasped the concept of object permanence from their own observations. Remarkably, some artificial intelligence models have developed a similar understanding.
Researchers have built an AI system that learns about its environment from videos and shows a sense of “surprise” when events contradict what it has learned. The system, Meta’s Video Joint Embedding Predictive Architecture (V-JEPA), is trained without any built-in assumptions about the physical laws at play in the videos, yet it still begins to comprehend how the world works.
Advancements in AI Understanding
Building AI that can effectively analyze visual data, for applications such as self-driving cars, poses significant challenges. Many AI systems classify video content or recognize object shapes by working in what’s termed “pixel space,” where every pixel is treated as equally important. These pixel-based models face limitations, however: in a busy scene, they might fixate on inconsequential details and overlook critical information, such as the color of a traffic light.
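To make that limitation concrete, here is a minimal sketch, with hypothetical tensor shapes and no claim to match any real system, of a pixel-space reconstruction objective. The mean-squared error weights every pixel equally, so a patch of sky counts exactly as much as a traffic light.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of predicted and ground-truth video frames:
# (batch, time, channels, height, width)
pred_frames = torch.randn(2, 16, 3, 224, 224)
true_frames = torch.randn(2, 16, 3, 224, 224)

# A pixel-space objective treats every pixel identically: the error on a
# patch of sky counts exactly as much as the error on a traffic light.
pixel_loss = F.mse_loss(pred_frames, true_frames)
print(pixel_loss.item())
```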
The V-JEPA architecture, launched in 2024, addresses these challenges. While its neural networks are intricate, the fundamental concept is straightforward. Unlike traditional models that predict masked pixels one by one, V-JEPA models video content using higher-level abstractions called “latent representations,” focusing on essential details rather than minute pixel-level information.
The model masks out portions of the video frames and passes these masked frames through a first encoder to produce latent representations, while the unmasked frames pass through a second encoder. A predictor then uses the first encoder’s latents to forecast the output of the second encoder, pushing the model to capture the meaningful elements of a scene while disregarding unnecessary detail.
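The sketch below illustrates that idea; it is not Meta’s implementation. The toy encoders, the masking scheme, and all shapes are simplified assumptions; the point is that the loss is computed between predicted and actual latent representations, never between pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for V-JEPA's components; the real encoders are vision transformers.
context_encoder = nn.Linear(3 * 16 * 16, 128)   # sees only the masked video patches
target_encoder  = nn.Linear(3 * 16 * 16, 128)   # sees the full, unmasked patches
predictor       = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

# A batch of flattened video patches: (batch, num_patches, patch_dim).
patches = torch.randn(2, 196, 3 * 16 * 16)

# Hide a subset of patches from the context encoder only.
mask = torch.rand(2, 196, 1) < 0.75              # True = hidden from context
masked_patches = patches.masked_fill(mask, 0.0)

# Encode both views into latent space.
context_latents = context_encoder(masked_patches)
with torch.no_grad():                            # targets are not trained against here
    target_latents = target_encoder(patches)

# Predict the target latents from the context latents and compare in latent
# space -- no pixel is ever reconstructed.
predicted_latents = predictor(context_latents)
latent_loss = F.smooth_l1_loss(predicted_latents, target_latents)
print(latent_loss.item())
```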
Replicating Intuitive Understanding
In February, the V-JEPA team reported that the model had acquired a grasp of intuitive physical properties of the real world, achieving nearly 98% accuracy on a test designed to evaluate an AI’s understanding of physical plausibility. By contrast, an established pixel-based model performed barely better than random guessing. The team also quantified the “surprise” V-JEPA displays when its predictions deviate from what it actually observes, akin to the reaction of infants confronted with an unexpected event.
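One simple way to turn that idea into a number, sketched here under the assumption that “surprise” is just the prediction error in latent space (the team’s exact metric may differ), is to compare the latents the model predicted for an upcoming stretch of video with the latents it actually observes.

```python
import torch

def surprise_score(predicted_latents: torch.Tensor,
                   observed_latents: torch.Tensor) -> torch.Tensor:
    """Quantify 'surprise' as the per-frame prediction error in latent space.

    A physically implausible event (an object vanishing behind a screen, say)
    should yield latents the predictor did not anticipate, producing a spike
    in this score. This is an illustrative assumption, not the published metric.
    """
    # Mean absolute error per frame: (batch, time, dim) -> (batch, time)
    return (predicted_latents - observed_latents).abs().mean(dim=-1)

# Hypothetical latents for a 16-frame clip.
predicted = torch.randn(1, 16, 128)
observed = torch.randn(1, 16, 128)
print(surprise_score(predicted, observed))
```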
Building on this work, the V-JEPA team released a more advanced model, V-JEPA 2, in June. It has 1.2 billion parameters and was pretrained on a vast dataset of 22 million videos. This version aims to advance robotics: after fine-tuning on only a limited amount of data, it can be used to plan robot actions. However, V-JEPA 2 can take in only a few seconds of video at a time, which leaves clear room for improvement. As one team member noted, its memory is comparable to that of a goldfish, a sign of the challenges that lie ahead.