The paper introduces LeWorldModel (LeWM), a stable Joint-Embedding Predictive Architecture (JEPA) trained end-to-end directly from raw pixels. Unlike existing methods that rely on complex losses, pre-trained encoders, or auxiliary supervision to prevent representation collapse, LeWM uses only two loss terms: next-embedding prediction and Gaussian latent regularization. This simplifies training considerably by reducing the number of tunable hyperparameters. The model is also efficient: with roughly 15 million parameters, it trains on a single GPU within hours and plans up to 48x faster than foundation-model-based world models, while remaining competitive on 2D and 3D control tasks. Additionally, the latent space encodes physical structure, allowing the model to detect physically implausible events via surprise evaluation.
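The two-term objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `lewm_loss` is hypothetical, and the regularizer shown (penalizing deviation of the batch latents from zero mean and unit variance) is one simple way to realize a Gaussian latent regularization; the paper's exact formulation may differ.

```python
import numpy as np

def lewm_loss(pred_next, target_next, z, reg_weight=1.0):
    """Two-term JEPA-style objective (illustrative sketch).

    pred_next:   predicted next-step embeddings, shape (batch, dim)
    target_next: target next-step embeddings, shape (batch, dim)
    z:           current latent embeddings, shape (batch, dim)
    """
    # Term 1: next-embedding prediction (mean squared error in latent space).
    pred_loss = np.mean((pred_next - target_next) ** 2)

    # Term 2: Gaussian latent regularization -- push the batch of latents
    # toward zero mean and unit variance per dimension, a simple stand-in
    # for matching N(0, I). (Assumed form, not necessarily the paper's.)
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    reg_loss = np.mean(mu ** 2) + np.mean((var - 1.0) ** 2)

    return pred_loss + reg_weight * reg_loss
```

With perfectly predicted embeddings and a standardized latent batch, both terms vanish; collapse of the latents to a constant is penalized because the variance term pulls away from zero variance.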