MURD-ViT (Urban Retrofitting Detection with Vision Transformer) is a multimodal deep learning pipeline for detecting urban retrofitting interventions, that is, micro-scale upgrades to existing urban spaces. The ViT-based model takes temporal Google Street View (GSV) imagery together with demographic data (population density changes) as input and classifies urban changes.
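As a rough illustration of the demographic input, the sketch below derives a percentage change in population density between the two dates of an image pair. The function name `density_pct_change`, the units, and the handling of a zero baseline are assumptions made for this example and are not taken from the MURD-ViT code.

```python
from typing import Optional


def density_pct_change(density_before: float, density_after: float) -> Optional[float]:
    """Percentage change in population density between two dates.

    Hypothetical helper: returns None when the baseline density is zero,
    because the relative change is undefined in that case.
    """
    if density_before == 0:
        return None
    return (density_after - density_before) / density_before * 100.0


# Example: density rises from 4000 to 4500 people per square km (made-up values).
pct = density_pct_change(4000.0, 4500.0)

# A demographic feature vector that could accompany one image pair.
demographic_features = [4500.0, pct]
```

How these features are scaled and combined with the image embeddings is left open here, since the description above does not specify it.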

Key Features
  • Multimodal Fusion: Combines temporal image pairs with demographic features such as population density and its percentage change.
  • ViT-based Backbone: Leverages Vision Transformer architectures to capture global spatial dependencies in street view images.
  • Spatial Stratified Sampling: Uses K-Means clustering to ensure geographic diversity and balanced class distribution.
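  • Robust Evaluation: Reports Top-2 Accuracy, which counts a prediction as correct when the true class is among the two highest-scoring classes, to account for the rarity and class imbalance of urban retrofitting events.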
  • Geospatial Visualization: Includes interactive Folium maps to visualize sample distributions across study areas.
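
The Top-2 Accuracy metric mentioned in the feature list can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's evaluation code: the function name `top_k_accuracy`, the three-class setup, and the score values are all made up for the example.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 2) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, cut to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical per-class scores for four samples and three classes.
scores = [
    [0.70, 0.20, 0.10],  # true class 0: ranked first
    [0.50, 0.40, 0.10],  # true class 1: ranked second
    [0.60, 0.30, 0.10],  # true class 2: ranked third (a miss even at k=2)
    [0.10, 0.30, 0.60],  # true class 1: ranked second
]
labels = [0, 1, 2, 1]

top1 = top_k_accuracy(scores, labels, k=1)  # 0.25
top2 = top_k_accuracy(scores, labels, k=2)  # 0.75
```

In this toy data the two samples whose true class is ranked second count as errors under Top-1 but as hits under Top-2, which is why the metric is more forgiving when a rare class is easily confused with a more common one.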
Source Code