LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Authors: Randall Balestriero (Brown University), Yann LeCun (New York University)
Date: November 12th, 2025
Event: Meta-FAIR

Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA1, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs’ embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective–Sketched Isotropic Gaussian Regularization (SIGReg)–to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits:

  1. single trade-off hyperparameter,
  2. linear time and memory complexity,
  3. stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains,
  4. heuristics-free, e.g., no stop-gradient, no teacher–student, no hyper-parameter schedulers,
  5. distributed training-friendly implementation requiring only ≈50 lines of code.

Our empirical validation covers 10+ datasets and 60+ architectures of varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with a frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo).

1 https://arxiv.org/pdf/2511.08544

1 Introduction

Learning manipulable representations of the world and its dynamics is a long-standing question in AI, with roots dating back centuries [Von Helmholtz, 1867, Tolman, 1948, Gregory, 1980, Sutton, 1991, Friston, 2010]. Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is how to learn an organized and actionable high-dimensional embedding space from observations. Using Deep Networks–parameterized nonlinear operators 𝑓𝜽–to map observations to embeddings is a standard first piece of that puzzle [LeCun et al., 2015, Goodfellow et al., 2016].

The second, less standardized, piece of that puzzle is how to train 𝑓𝜽. Joint-Embedding Predictive Architectures (JEPAs) suggest training 𝑓𝜽 by maximizing predictive agreement between the embeddings of semantically related views [Bromley et al., 1993, LeCun, 2022, Balestriero et al., 2023]. Views can come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. The supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc [Tian et al., 2020]. In any case, views are expected to share some degree of semantic relationship to allow the prediction task to align 𝑓𝜽’s embeddings towards the underlying knowledge present in the data.

Alas, JEPA’s prediction task admits failure modes, such as representation collapse, where 𝑓𝜽 maps all inputs to nearly identical embeddings (complete collapse) or to a low-dimensional subspace (dimensional collapse) [Jing et al., 2021, Cosentino et al., 2022, Balestriero and LeCun, 2022]. To mitigate such shortcut solutions, state-of-the-art recipes rely on heuristics–stop-gradient [Chen et al., 2020a], asymmetric view generation [Wang et al., 2022], teacher–student networks with carefully tuned EMA schedules [Caron et al., 2021, Tian et al., 2021], explicit normalization and whitening layers [Ermolov et al., 2021, Chen et al., 2021]–and a delicate balance of hyperparameters. As a result, today’s JEPA training is brittle, and most research has shifted toward scaling data [Vo et al., 2024], models [Fan et al., 2025], and even post-training [Rodas et al., 2025], while leaving the theoretical foundations of JEPAs largely unexplored.

Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection starts by asking: what are the necessary conditions that JEPAs should abide by? Those minimal conditions then act as axioms for us to design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings (Section 3). While (i) follows standard practice [Balestriero and LeCun, 2022], we introduce in Section 4 a novel distribution matching objective–Sketched Isotropic Gaussian Regularization (SIGReg)–to enforce (ii). SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse, but also exhibits favorable scaling properties, as its memory and computational complexity is linear in dimension and sample size. Crucially, SIGReg’s isotropic Gaussian enforcement rules out collapsed shortcut solutions and provably minimizes the model’s expected risk over the space of downstream tasks to be encountered post-training. The resulting JEPA solution–coined Latent-Euclidean JEPA (LeJEPA)–is introduced in Section 5. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and, most importantly, (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures, and scales (see Section 6). We summarize our contributions below.

  • Contribution 1: We prove the optimal embedding distribution for foundation models. We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In Section 3, we derive this result rigorously for both linear (Section 3.1) and nonlinear probes (Section 3.2), providing the first principled answer to what distribution 𝑓𝜽’s embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration to targeted optimization.
  • Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale. We present Sketched Isotropic Gaussian Regularization (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching (Section 4 and Figure 2). SIGReg provides statistical guarantees (Sections 4.1 and 4.2) while achieving linear complexity and bounded gradients—a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality (Section 4.3), making it both theoretically sound and practically efficient for high-dimensional embeddings.
  • Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction. By combining JEPA’s predictive objective with SIGReg targeting the isotropic Gaussian, we introduce LeJEPA–Latent-Euclidean JEPA (Section 5). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity (a schematic sketch of this combined objective follows this list).
  • Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable. Our experiments (Section 6) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA outperforms DINOv2-based transfer learning when pretrained directly on target data. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining—previously considered impractical for small datasets.
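
As a concrete reading aid, the sketch below illustrates the kind of objective described in Contributions 2 and 3: embeddings are projected onto random unit directions, the empirical characteristic function of each 1-D projection is matched against that of a standard Gaussian, and the result is added to the predictive term with a single trade-off weight. This is an illustration written for this summary, not the authors' reference implementation; the names (sigreg_loss, num_projections, t_grid, lam, prediction_loss), the projection sampling, the evaluation grid, and the exact statistic are assumptions that may differ from the paper's.

```python
import torch
import torch.nn.functional as F

def sigreg_loss(z, num_projections=64, t_grid=None):
    """Schematic SIGReg sketch: push embeddings toward an isotropic Gaussian.

    Projects embeddings z of shape (batch, K) onto random unit directions
    (the "sketching" step) and penalizes the squared gap between the empirical
    characteristic function of each 1-D projection and exp(-t^2 / 2), the
    characteristic function of N(0, 1).
    """
    _, k = z.shape
    if t_grid is None:
        t_grid = torch.linspace(-3.0, 3.0, 17, device=z.device)
    # Random unit-norm projection directions (illustrative sampling scheme).
    directions = F.normalize(torch.randn(k, num_projections, device=z.device), dim=0)
    proj = z @ directions                           # (batch, num_projections)
    tx = proj.unsqueeze(-1) * t_grid                # (batch, P, T)
    ecf_real = torch.cos(tx).mean(dim=0)            # empirical E[cos(t x)]
    ecf_imag = torch.sin(tx).mean(dim=0)            # empirical E[sin(t x)]
    target = torch.exp(-0.5 * t_grid ** 2)          # CF of N(0, 1) is real-valued
    return ((ecf_real - target) ** 2 + ecf_imag ** 2).mean()

# The full objective then carries a single trade-off hyperparameter, e.g.:
#   total_loss = prediction_loss + lam * sigreg_loss(embeddings)
```

Because every operation above is a matrix product or an elementwise reduction, the cost grows linearly with batch size, embedding dimension, and number of projections, matching the scaling behavior highlighted in Contribution 2.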

2 Background and Notations

We start by introducing some of the notations we will be using throughout our manuscript (Section 2.1), followed by a review of JEPAs (Section 2.2), and existing literature studying their design (Section 2.3).

2.1 Notations and Definitions

Data. We are in possession of a dataset of shape (𝑁, 𝑉, 𝐷) ∈ (ℕ∗)³, where 𝑁 is the number of samples, 𝑉 is the number of views, and 𝐷 is the dimension. One entry of this dataset is accessed via 𝒙𝑛,𝑣,𝑑. Those dimensions are often interpreted as follows: (N) is the number of independent samples, e.g., different images or different videos, (V) is the number of views, e.g., data-augmentations for images, frames for videos, and (D) is the dimension of each 𝒙𝑛,𝑣, e.g., number of RGB pixels for images. In many cases the ordering over 𝑉 is given by time–but in some cases, e.g., data-augmentation of an image, ordering becomes irrelevant. Our study does not require any particular choice to organize one’s dataset into a (𝑁, 𝑉, 𝐷) tensor–and none of our theory and implementation assumes a particular design decision for that tensor. However, we will rely on the following two properties: (independence) the samples 𝒙𝑛, 𝒙𝑛′ have been obtained independently from each other ∀𝑛 ≠ 𝑛′, and (identically distributed) the sampling process was identical among 𝒙𝑛, ∀𝑛.
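
As a small illustration of the (𝑁, 𝑉, 𝐷) convention above, the sketch below stacks 𝑉 stochastically generated views of each of 𝑁 samples into a single tensor. The helper name make_view_tensor and the Gaussian-noise "augmentation" are placeholders invented for this example, not constructs from the paper.

```python
import torch

def make_view_tensor(samples, augment, num_views):
    """Stack `num_views` augmented views of each sample into an (N, V, D) tensor.

    `samples` is an (N, D) tensor of flattened inputs and `augment` is any
    stochastic view-generation function (cropping, masking, noise, ...).
    """
    views = [torch.stack([augment(x) for _ in range(num_views)]) for x in samples]
    return torch.stack(views)                        # shape: (N, V, D)

# Toy usage: a Gaussian-noise "augmentation" of random 32-dimensional samples.
data = torch.randn(8, 32)                            # N = 8 samples, D = 32
x = make_view_tensor(data, lambda s: s + 0.1 * torch.randn_like(s), num_views=4)
print(x.shape)                                       # torch.Size([8, 4, 32])
```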

Deep Networks. Today’s AI solutions rely on Deep (Neural) Networks (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN’s mapping as 𝑓𝜽 : ℝ^𝐷 → ℝ^𝐾, with 𝐾 the dimension of the embedding space. The internals of 𝑓𝜽 are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of 𝑓𝜽 are irrelevant to our study–as we will see, the proposed LeJEPA works out-of-the-box on any 𝑓𝜽. In any case, all the learnable parameters are gathered in the vector 𝜽 ∈ ℝ^𝑃, with 𝑃 counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that 𝜽 can be learned from gradient descent to ultimately produce a useful system, or foundation model, 𝑓𝜽.
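
For concreteness, a minimal 𝑓𝜽 can be as simple as the two-layer MLP below; the architecture and the dimensions D and K are purely illustrative stand-ins, since, as stated above, the internals of 𝑓𝜽 are irrelevant to the approach.

```python
import torch.nn as nn

D, K = 32, 16          # input and embedding dimensions (illustrative values)

# Any architecture works here; this MLP is just a stand-in for f_theta: R^D -> R^K.
f_theta = nn.Sequential(
    nn.Linear(D, 64),
    nn.ReLU(),
    nn.Linear(64, K),
)

# theta gathers every learnable parameter of f_theta; P is their total count.
P = sum(p.numel() for p in f_theta.parameters())
```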

JEPAs. A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters 𝜽. This is in sharp contrast with a supervised model that only considers its training task. JEPAs were formally introduced by LeCun [2022] as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks [Bromley et al., 1993] and predictive coding [Helmholtz et al., 1867, Bruner and Postman, 1949]. While the exact blueprint of JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view 𝒙𝑛,𝑣 from the embedding of another view 𝒙𝑛,𝑣′, 𝑣′ ≠ 𝑣, all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for 𝒙, is provided in Equation (1). The predictability criterion can be enforced by directly comparing the embeddings of the partial views 𝐸𝑛𝑐(𝒙𝑛,𝑣,.) and 𝐸𝑛𝑐(𝒙𝑛,𝑣′,.) with a metric, e.g., ℓ𝑝. In some cases, an additional DN, coined 𝑃𝑟𝑒𝑑, is employed to compare 𝑃𝑟𝑒𝑑(𝐸𝑛𝑐(𝒙𝑛,𝑣,.)) against 𝐸𝑛𝑐(𝒙𝑛,𝑣′,.)–which is only justified when there exists an asymmetry between the information content of the different views, e.g., by conditioning the predictions on observed actions from robotics data [Khazatsky et al., 2024].
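
The sketch below is one hedged reading of the predictive criterion just described (the paper's Equation (1) is not reproduced in this excerpt): every view is embedded with Enc, embeddings of different views of the same sample are compared with a mean-squared (ℓ2) metric, and an optional Pred module is applied to one branch in the asymmetric-information case. The function jepa_prediction_loss and its loop over all view pairs are illustrative choices; concrete JEPA instantiations differ in which pairs are compared and which metric is used.

```python
import torch.nn.functional as F

def jepa_prediction_loss(enc, x, pred=None):
    """Illustrative JEPA predictive criterion on an (N, V, D) batch x.

    Embeds every view with `enc` and measures the mean-squared disagreement
    between embeddings of different views of the same sample. When `pred` is
    given (asymmetric-information case), Pred(Enc(x_{n,v})) is compared
    against Enc(x_{n,v'}) instead.
    """
    n, v, d = x.shape
    z = enc(x.reshape(n * v, d)).reshape(n, v, -1)   # (N, V, K) embeddings
    loss, num_pairs = 0.0, 0
    for i in range(v):
        for j in range(v):
            if i == j:
                continue                             # only compare distinct views
            source = pred(z[:, i]) if pred is not None else z[:, i]
            loss = loss + F.mse_loss(source, z[:, j])
            num_pairs += 1
    return loss / num_pairs
```

Pairing this predictive term with an anti-collapse criterion, such as the SIGReg sketch shown after the contributions list, is the structure that the remainder of the paper formalizes.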

2.2 The Need for Reliable Pretraining

The JEPA’s prediction task is designed based on a priori knowledge of the data. Its design is often quite natural since it is relatively intuitive to form 𝒙 so that its views share the relevant information content one hopes to capture. On the other hand, the design of the “anti-collapse” criterion is much closer to a game of Whac-A-Mole. Today’s designs rely on many different under-specified safeguards which are carefully combined in the hope that degenerate shortcut solutions are avoided during training. Such mechanisms include (i) feature whitening [Ermolov et al., 2021, Bardes et al., 2021], (ii) negative samples [Chen et al., 2020a, He et al., 2020], and (iii) asymmetric views and teacher-student networks with stop-gradient [Caron et al., 2021, Assran et al., 2023]. Those mechanisms all suffer from at least two of the following limitations:

  1. under-specification, i.e., the criteria can be minimized while the embeddings are in a degenerate configuration,
  2. quadratic time and memory complexity with mini-batch size and/or embedding dimension,
  3. sensitivity to data distribution, hyperparameters, and architecture, and
  4. lack of theoretical understanding and guarantees.
2.3 The Need for Actionable Theory

For decades, the two major solutions for AI were supervised learning [LeCun et al., 2015] and learning by reconstruction [Rumelhart et al., 1986]–sometimes combined together, e.g., for semi-supervised learning [Kingma et al., 2014]. In supervised learning, the labels both ensure that semantically similar samples are close to each other in embedding space and prevent complete representation collapse. In particular, it is possible to measure the amount of collapse in supervised learning as a function of the number of classes [Papyan et al., 2020]. The reconstruction objective is similarly well suited to preventing representation collapse since the original input must be recovered from the embeddings, i.e., the embeddings must be as informative about the input as possible–up to some optional denoising tasks that users can set up as part of the training [Vincent et al., 2010].

Because supervised and reconstruction-based learning have been widely studied for decades, there exists a large body of work to explain and inform practical designs–as well as to study their limitations in producing foundation models [Balestriero and LeCun, 2024, Van Assel et al., 2025]. This is not the case for the more recent JEPAs, where empirical advances quickly outpace anyone hoping to delve into their inner workings. This dynamic led the community to focus on post-hoc theoretical justification of already found solutions [Liu et al., 2021, Shwartz Ziv and LeCun, 2024, Shwartz-Ziv et al., 2022, Zhang et al., 2023]. In most cases, those studies involve the Mutual Information (MI) [Shannon, 1948, Cover, 1999], whose different bounds recover established methods [Gutmann and Hyvärinen, 2010, Ma and Collins, 2018, Oord et al., 2018, Poole et al., 2019, Hjelm et al., 2018, McAllester and Stratos, 2020]. Because existing studies focus on explaining and interpreting already developed JEPAs, too little principled guidance and innovation has been brought forward. Instead, most of the recent empirical advances take the form of collecting larger datasets, scaling up pre-existing training recipes [Goyal et al., 2019, Chen et al., 2020b, Oquab et al., 2023, Fan et al., 2025], and deriving novel data curation processes [Vo et al., 2024, Kerdreux et al., 2025]. In contrast, our goal in the following Sections 3 to 5 will be to derive a novel JEPA solution from first principles, i.e., one whose design relies on proven necessary conditions for optimality, and with a pretraining recipe that can finally reconcile exploratory research, scalability, and state-of-the-art performance.
