January 2025, Amini
Let {x_1^(m), …, x_L^(m)} represent a univariate time series for spectral channel m (e.g., the Red band in Sentinel-2). For biweekly composites, L≈26 data points per year, but L can grow to hundreds of time steps when spanning multiple years. Our goal is to learn an embedding function Φ mapping each pixel-level sequence to a latent representation that captures meaningful spatio-temporal semantics.
To reduce computational overhead while incorporating local temporal context, each univariate series is divided into patches of length P, typically with stride S≤P. For instance, P=16 and S=8 produce roughly L/S patch tokens for a length-L sequence. Next, instead of early channel fusion, each channel (Red, NIR, NDWI, etc.) is processed independently by the same Transformer encoder. This channel-independent design avoids mixing different noise profiles across channels, promotes parameter efficiency, and helps the model learn universal time-series representations more effectively.
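The patching step above can be sketched in a few lines. The `patchify` helper is illustrative (not the actual implementation); the patch length of 16 and stride of 8 follow the example values in the text, while the series length of L=104 (4 years of biweekly composites) is an assumption for the example:

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int = 16, stride: int = 8) -> np.ndarray:
    """Split a univariate series of length L into overlapping patches.

    Returns an array of shape (n_patches, patch_len), where
    n_patches = (L - patch_len) // stride + 1, i.e. roughly L / stride.
    """
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack(
        [series[i * stride : i * stride + patch_len] for i in range(n_patches)]
    )

# Example: 4 years of biweekly composites -> L = 104 time steps
x = np.arange(104, dtype=float)
patches = patchify(x, patch_len=16, stride=8)
print(patches.shape)  # (12, 16)
```

With S<P the patches overlap, so each time step contributes to several tokens and short-range context is preserved at patch boundaries.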
We randomly mask ~40% of the patch tokens in each channel by replacing them with mask tokens. A 600K-parameter Transformer encoder then processes both unmasked and masked patches, producing patch-level embeddings of shape (X, n_patches, embedding_size). This patch-wise representation enables the model to capture local sub-seasonal patterns within each patch as well as broader multi-season context across patches. Finally, a lightweight projection head reconstructs only the masked patches under a mean-squared error (MSE) loss, ensuring the model integrates global temporal information rather than relying on short-range interpolation alone.
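A minimal numpy sketch of the masking and masked-MSE objective. The zero mask token, the synthetic patch array, and the helper names are illustrative stand-ins: in the real model the mask token is a learned embedding and the reconstruction comes from the projection head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one channel's patch tokens: (n_patches, patch_len)
patches = rng.normal(size=(12, 16))

def random_mask(tokens: np.ndarray, mask_ratio: float = 0.4):
    """Replace ~mask_ratio of the patch tokens with a mask token
    (a zero vector here; the real model uses a learned embedding)."""
    n = tokens.shape[0]
    idx = rng.choice(n, size=int(round(mask_ratio * n)), replace=False)
    corrupted = tokens.copy()
    corrupted[idx] = 0.0  # stand-in for the [MASK] token
    return corrupted, idx

def masked_mse(recon: np.ndarray, target: np.ndarray, idx: np.ndarray) -> float:
    """MSE computed only over the masked patches, as in the training objective."""
    return float(np.mean((recon[idx] - target[idx]) ** 2))

corrupted, idx = random_mask(patches)
perfect_loss = masked_mse(patches, patches, idx)            # perfect reconstruction
naive_loss = masked_mse(np.zeros_like(patches), patches, idx)  # predict all zeros
```

Computing the loss only on masked positions forces the encoder to infer missing patches from the surrounding seasonal context rather than copying visible inputs.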
We tested this approach on Sentinel-2 biweekly composites over 4 years from Uasin Gishu County, Kenya, a region known for its diverse land-cover types and environmental conditions. The preprocessing steps included:
The ~600K-parameter model converges smoothly under the masked-autoencoder objective, demonstrating a strong ability to reconstruct missing patches. Over the course of training, the training loss steadily declines (from ~0.1715 to ~0.0232), while the validation loss falls from ~0.0864 to ~0.0211 and stabilizes, signifying limited overfitting.
To evaluate the effectiveness of the learned representations, we leverage the pretrained transformer’s frozen weights to extract patch-level embeddings. These embeddings are averaged across the patch dimension to obtain a single global representation for each pixel’s temporal sequence. The resulting fixed-dimensional vectors are then fed into a linear classifier trained to distinguish five land cover classes derived from ESRI LULC labels: water, trees, crops, built area, and rangeland.
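The linear-probing protocol can be sketched as follows. The random embeddings stand in for the frozen transformer outputs, and scikit-learn's LogisticRegression is one common choice of linear classifier; the text does not specify the exact probe implementation, so the shapes and hyperparameters here are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for frozen encoder outputs: (n_pixels, n_patches, embed_dim).
# In practice these come from the pretrained transformer with gradients disabled.
embeddings = rng.normal(size=(500, 12, 128))
labels = rng.integers(0, 5, size=500)  # water, trees, crops, built area, rangeland

# Average across the patch dimension -> one global vector per pixel sequence
pooled = embeddings.mean(axis=1)  # (500, 128)

# Train a linear probe on the fixed-dimensional vectors
X_tr, X_te, y_tr, y_te = train_test_split(pooled, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
```

Because the encoder stays frozen, probe accuracy directly measures how linearly separable the land-cover classes are in the learned embedding space.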
Despite the class imbalance in our test set and the label noise that may be inherent in the ESRI dataset, the diagonal dominance of the confusion matrix indicates that the model's learned representations effectively distinguish these major land-cover types.
To further evaluate the quality of the learned representations, we conducted an anomaly detection experiment focusing on maize-field pixels from this dataset. We first extracted 128-dimensional embeddings for each pixel by aggregating the patch-based representations from the pretrained transformer. These embeddings were then used to train an Isolation Forest, which labels anomalous patterns as “-1” and normal temporal behaviours as “1”. The figure below shows a 2D PCA visualization of the embeddings:
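A sketch of this experiment using scikit-learn's IsolationForest and PCA. The synthetic 128-d embeddings and the 5% contamination rate are assumptions for illustration; the actual maize-pixel embeddings and forest hyperparameters are not specified in the text:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical 128-d embeddings for maize-field pixels (from the frozen encoder)
maize_emb = rng.normal(size=(1000, 128))

# Fit an Isolation Forest; predict() returns -1 for anomalies, 1 for normal points
iso = IsolationForest(contamination=0.05, random_state=0).fit(maize_emb)
flags = iso.predict(maize_emb)

# Project to 2D with PCA for the scatter-plot visualization
coords = PCA(n_components=2).fit_transform(maize_emb)
```

Colouring the PCA scatter by `flags` then reproduces the kind of figure described above, with flagged anomalies standing apart from the main cluster.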
Overall, we’ve shown that patch-based masking is a flexible and powerful method for self-supervised representation learning on remote-sensing pixel time series, balancing computational efficiency with the local-scale dynamics that are critical for real-world Earth observation challenges. While our present experiment focuses on a single county, we are actively developing a foundation-scale training pipeline targeting a larger geographic scope and additional sensor modalities, and we will share our findings as they mature. In the meantime, we invite you to explore our pretrained model, trained on Uasin Gishu County, on Hugging Face and share any feedback you may have through our Community Channel.