Self-Supervised Representation Learning on Remote Sensing Pixel Time Series with Patch-Based Masking

January 2025, Amini

Method Overview

Figure 1: PatchTST architecture
Pixel-Time-Series Setup

Let {x_1(m), …, x_L(m)} denote a univariate time series for spectral channel m (e.g., the Red band in Sentinel-2). For biweekly composites, L ≈ 26 data points per year, but L can grow to hundreds of time steps when spanning multiple years. Our goal is to learn an embedding function Φ that maps each pixel-level sequence to a latent representation capturing meaningful temporal semantics.

Patch-Based Segmentation & Channel Independence

To reduce computational overhead while incorporating local temporal context, each univariate series is divided into patches of length P, typically with stride S ≤ P. For instance, P=16 and S=8 produce roughly L/S patch tokens for a length-L sequence. Next, instead of early channel fusion, each channel (Red, NIR, NDWI, etc.) is processed independently by the same Transformer encoder. This approach avoids mixing different noise profiles across channels, promotes parameter efficiency, and helps learn universal time-series representations more effectively.
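The patching step can be sketched in a few lines of NumPy. This is an illustrative implementation, not the post's actual code; the function name `patchify` and the example length L=104 (roughly 4 years of biweekly composites) are ours.

```python
import numpy as np

def patchify(series: np.ndarray, P: int = 16, S: int = 8) -> np.ndarray:
    """Split a univariate series of length L into overlapping patches
    of length P taken every S steps (stride S <= P)."""
    L = series.shape[0]
    n_patches = (L - P) // S + 1          # roughly L/S when S << L
    starts = np.arange(n_patches) * S
    return np.stack([series[s:s + P] for s in starts])

# One channel, 4 years of biweekly composites: L = 104
x = np.random.rand(104)
patches = patchify(x)                     # shape: (12, 16)
```

Under channel independence, the same function is simply applied to each of the 10 channels in turn, and the resulting patch tokens for each channel are fed through the shared encoder.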

Masked Autoencoding Objective

We randomly mask ~40% of the patch tokens in each channel by replacing them with mask tokens. A 600K-parameter Transformer encoder then processes both unmasked and masked patches, producing patch-level embeddings of shape (X, n_patches, embedding_size). This patch-wise representation enables the model to capture local sub-seasonal patterns within each patch as well as broader multi-season context across patches. Finally, a lightweight projection head reconstructs only the masked patches under a mean-squared error (MSE) loss, ensuring the model integrates global temporal information rather than relying on short-range interpolation.
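The masking and loss computation can be sketched as follows. This is a minimal NumPy illustration of the objective, not the training code; the helper names and the zero-valued mask token are assumptions (in practice the mask token is typically learned).

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.4, mask_token=0.0):
    """Replace ~mask_ratio of the patch tokens with a mask token;
    return the masked input and a boolean mask of hidden patches."""
    n = patches.shape[0]
    n_mask = int(round(mask_ratio * n))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = patches.copy()
    masked[mask] = mask_token
    return masked, mask

def masked_mse(pred, target, mask):
    """MSE computed only over the masked patches, as in pretraining."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

patches = rng.random((12, 16))            # 12 patch tokens of length 16
masked, mask = mask_patches(patches)
loss = masked_mse(patches, patches, mask) # perfect reconstruction -> 0.0
```

Computing the loss only on masked positions is what forces the encoder to infer missing seasons from the visible context instead of copying its input.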

Experiments

Dataset & Preprocessing

We tested this approach on Sentinel-2 biweekly composites over 4 years from Uasin Gishu County, Kenya, a region known for its diverse land-cover types and environmental conditions. The preprocessing steps included:

  • Composite Formation: For each biweekly interval, multiple Sentinel-2 acquisitions are merged. Cloud masks are applied, median reflectance computed, and data converted to a pixel time series format for temporal processing.
  • Spectral Channels: A set of 10 spectral bands/indices [‘Green’, ‘Blue’, ‘Red’, ‘NIR’, ‘SWIR1’, ‘SWIR2’, ‘NDMI’, ‘NDWI’, ‘CI’, ‘NDVI’] is used to form a comprehensive multi-spectral pixel time series used for pretraining.
  • Data Splits: We use spatial splits for training, validation, and test sets, minimizing overlap to ensure that the model generalizes to unseen regions.
  • Land-Cover Labels (Weak Labels): To evaluate the effectiveness of the learned embeddings, we leverage ESRI LULC data as “weak labels” in a downstream classification task. Although these labels are not a perfect ground truth, their broad coverage and easy availability make them practical for experimentation. By accepting some degree of label noise, we can still gauge how well the embeddings separate high-level land-cover categories (e.g., water vs. vegetation vs. built-up areas).
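The composite-formation and index-computation steps above can be sketched as follows. This is an illustrative simplification, assuming cloud masks are boolean arrays (True = cloudy) and using the standard NDVI and NDWI definitions; the actual pipeline may differ in detail.

```python
import numpy as np

def median_composite(acquisitions, cloud_masks):
    """Median reflectance across acquisitions in one biweekly window,
    ignoring cloud-masked observations (True = cloudy)."""
    stack = np.where(cloud_masks, np.nan, acquisitions)
    return np.nanmedian(stack, axis=0)

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-9)

def ndwi(green, nir):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + 1e-9)
```

Stacking the six reflectance bands with the four derived indices per biweekly step yields the 10-channel pixel time series used for pretraining.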

Results & Discussion

Pretraining Dynamics & Reconstruction Performance

The ~600K-parameter model converges smoothly under the masked-autoencoder objective, demonstrating a strong ability to reconstruct missing patches. Over the course of training, the training loss steadily declines (from ~0.1715 to ~0.0232), while the validation loss falls from ~0.0864 to ~0.0211, signifying limited overfitting.

Figure 2: Train & validation losses

Land Cover Classification & Confusion Matrix

To evaluate the effectiveness of the learned representations, we leverage the pretrained transformer’s frozen weights to extract patch-level embeddings. These embeddings are averaged across the patch dimension to obtain a single global representation for each pixel’s temporal sequence. The resulting fixed-dimensional vectors are then fed into a linear classifier trained to distinguish five land cover classes derived from ESRI LULC labels: water, trees, crops, built area, and rangeland.
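The linear-probe evaluation above can be sketched with scikit-learn. The random arrays below stand in for frozen-encoder outputs and ESRI labels; the shapes (12 patches, 128-dimensional embeddings, 5 classes) follow the text, but everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen-encoder output: (n_pixels, n_patches, embed_dim)
embeddings = rng.normal(size=(200, 12, 128))
labels = rng.integers(0, 5, size=200)   # water, trees, crops, built, rangeland

# Average across the patch dimension -> one global vector per pixel sequence
pooled = embeddings.mean(axis=1)        # (200, 128)

# Linear classifier on the frozen, pooled representations
clf = LogisticRegression(max_iter=1000).fit(pooled, labels)
preds = clf.predict(pooled)
```

Because only the linear head is trained, classification accuracy here directly measures how linearly separable the land-cover classes are in the learned embedding space.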

Figure 3: LULC Confusion Matrix

Despite the class imbalance in our test set and the label noise likely inherent in the ESRI dataset, the diagonal dominance of the confusion matrix indicates that the model’s learned representations effectively distinguish these major land-cover types.

Anomaly Detection in Maize Fields

To further evaluate the quality of the learned representations, we conducted an anomaly detection experiment focusing on maize field pixels from this dataset. We first extracted 128-dimensional embeddings for each pixel by aggregating the patch-based representations from the pretrained transformer. These embeddings were then used to train an Isolation Forest, which labels anomalous temporal patterns as “-1” and normal temporal behaviours as “1”. The figure below shows a 2D PCA visualization of the embeddings:

Figure 4: PCA Visualization of anomaly detection on maize fields
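The anomaly detection step can be sketched with scikit-learn's `IsolationForest` and `PCA`. The random array below stands in for the pooled maize-pixel embeddings, and the 5% contamination rate is our assumption, not a value from the experiment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
maize_embeddings = rng.normal(size=(500, 128))  # stand-in pooled embeddings

# Isolation Forest: -1 = anomalous temporal pattern, 1 = normal
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(maize_embeddings)

# Project to 2D for visualization, as in Figure 4
coords = PCA(n_components=2).fit_transform(maize_embeddings)
```

Pixels flagged as anomalous can then be inspected against the original time series, e.g. to surface fields with atypical phenology or missed plantings.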

Conclusion & Future Directions

Overall, we’ve shown that patch-based masking is a flexible and powerful method for self-supervised representation learning on remote-sensing pixel time series, balancing computational efficiency with the local-scale dynamics that are critical for real-world earth observation challenges. While our present experiments focus on a single county, we are actively developing a foundation-scale training pipeline targeting a larger geographic scope and additional sensor modalities, and we will share our findings as they become available. In the meantime, we invite you to explore our pretrained model, trained on Uasin Gishu County, on Hugging Face and share any feedback you may have through our Community Channel.
