Intuitive Surgical

Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception

Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, and Omid Mohareri

Intuitive Surgical, Sunnyvale, CA 94086, USA

80K synthetic training pairs
5,817 real stereo pairs
52 FPS real-time inference
SOTA SSIM & PSNR

Desmoking Results

Our transformer-based model removes surgical smoke in real time, recovering scene content behind dense smoke with minimal color distortion.

Liver — Ours (Synthetic Only)

Liver — Ours (Fine-tuned)

Heart — Ours (Synthetic Only)

Heart — Ours (Fine-tuned)

Research Overview

Abstract

Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities.

We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts smoke-free images and corresponding smoke maps. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding approximately 80,000 paired samples for supervised training.

We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments demonstrate state-of-the-art performance in image reconstruction. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.

01

Transformer + Physics Head

ViT-Base backbone with DPT decoder and a physics-inspired desmoking head that jointly regresses K and B parameters to predict smoke-free images and smoke maps.

02

80K Synthetic Pairs

Large-scale synthetic data pipeline blending artificial smoke into real endoscopic images with domain randomization over smoke color, transparency, and blending coefficients.

03

Largest Paired Dataset

5,817 stereo image pairs (1024×1280) captured with the da Vinci system using bidirectional smoke capture, with depth maps and segmentation labels.
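The synthetic pipeline in item 02 above can be illustrated with a minimal sketch. It assumes a simple per-pixel alpha blend of a single randomized smoke color into a clean frame. The function name `blend_synthetic_smoke`, the sampling ranges, and the rectangular toy mask are hypothetical and not taken from the paper; the actual pipeline may differ.

```python
import numpy as np

def blend_synthetic_smoke(clean, smoke_mask, rng):
    """Blend a smoke density mask into a clean image with randomized appearance.

    clean:      float array (H, W, 3) in [0, 1], the smoke-free frame
    smoke_mask: float array (H, W) in [0, 1], 0 means no smoke
    Returns the smoky image and the per-pixel blending weight.
    """
    # Hypothetical randomization ranges; the source does not state them.
    color = rng.uniform(0.7, 1.0, size=3)   # greyish-white smoke color
    opacity = rng.uniform(0.3, 0.9)         # global transparency scale
    alpha = np.clip(opacity * smoke_mask, 0.0, 1.0)[..., None]
    # Convex combination of scene and smoke color at every pixel.
    smoky = (1.0 - alpha) * clean + alpha * color
    return smoky.astype(np.float32), alpha[..., 0].astype(np.float32)

# Toy example: smoke covers only the right half of the frame.
rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(64, 80, 3))
mask = np.zeros((64, 80))
mask[:, 40:] = 1.0
smoky, alpha = blend_synthetic_smoke(clean, mask, rng)
```

Each call yields one (smoky, clean) training pair. Drawing a new color and opacity per call is one simple way to realize the domain randomization the text describes.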

Approach

Methodology

Figure 1. Overview of the proposed surgical desmoking model.

Given an input smoky image, we use a ViT-Base backbone, initialized with weights from Masked Autoencoder pretraining on surgical images, to extract multi-scale features from intermediate blocks {3, 6, 9, 12}.
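As a rough illustration of tapping intermediate blocks, the sketch below runs tokens through 12 stand-in blocks and keeps the outputs after blocks 3, 6, 9, and 12. The toy residual blocks replace real transformer blocks, and `run_backbone` and `make_toy_block` are hypothetical names; only the tap indices come from the text (768 is the standard ViT-Base embedding width).

```python
import numpy as np

TAP_BLOCKS = (3, 6, 9, 12)  # 1-indexed block numbers named in the text

def run_backbone(tokens, blocks, taps=TAP_BLOCKS):
    """Apply blocks in order and collect the outputs of the tapped blocks."""
    features = {}
    x = tokens
    for index, block in enumerate(blocks, start=1):
        x = block(x)
        if index in taps:
            features[index] = x
    return features

def make_toy_block(rng, dim):
    # Stand-in for a transformer block: a residual layer with a tanh nonlinearity.
    weight = rng.normal(scale=0.02, size=(dim, dim))
    return lambda x: x + np.tanh(x @ weight)

rng = np.random.default_rng(0)
dim, num_tokens = 768, 16
blocks = [make_toy_block(rng, dim) for _ in range(12)]
tokens = rng.normal(size=(num_tokens, dim))
features = run_backbone(tokens, blocks)
```

The resulting dictionary holds one token tensor per tapped depth, which is the kind of input a DPT-style decoder fuses.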

Features are fused and decoded using DPT blocks, then a lightweight physics-inspired desmoking head jointly predicts the smoke-free image J and the smoke map S.

Drawing from the atmospheric scattering model (I = Jt + A(1−t)), the head regresses a spatially varying per-pixel 4D tensor O = [K, B] ∈ ℝ^(H×W×4), where K = 1/t − 1 and B = A(1/t − 1). Solving the scattering model for J gives J = (I − A(1−t))/t, which in terms of K and B becomes J = KI − B + I. The smoke map is derived from the airlight component as S = f(B / (K+1)), where B / (K+1) = A(1−t) and f is approximated by a convolutional layer. The smoke map is treated as an auxiliary output and is evaluated only qualitatively, as ground-truth smoke maps are unavailable for real surgical scenes.
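The recovery formula can be checked numerically. The sketch below synthesizes a smoky image from the scattering model, builds the ideal head output, and confirms that J = KI − B + I returns the clean image and that B / (K+1) equals the airlight term A(1−t). The channel layout (one channel for K, three for B) is an assumption consistent with the 4-channel output, and the learned convolution f is not modeled.

```python
import numpy as np

def split_head_output(head_out):
    """Split an (H, W, 4) head output into K (H, W, 1) and B (H, W, 3).

    The 1 + 3 channel layout is an assumption, not stated in the source.
    """
    return head_out[..., :1], head_out[..., 1:]

def recover_smoke_free(smoky, K, B):
    # J = K * I - B + I, the scattering model solved for J.
    return K * smoky - B + smoky

def airlight_term(K, B):
    # B / (K + 1) = A * (1 - t), the input to the smoke-map layer f.
    return B / (K + 1.0)

rng = np.random.default_rng(0)
H, W = 32, 40
J = rng.uniform(size=(H, W, 3))             # clean scene
t = rng.uniform(0.2, 1.0, size=(H, W, 1))   # transmission
A = np.array([0.90, 0.90, 0.95])            # airlight color (toy value)
I = J * t + A * (1.0 - t)                   # smoky observation

# Ideal head output built from the ground-truth t and A.
head_out = np.concatenate([1.0 / t - 1.0, A * (1.0 / t - 1.0)], axis=-1)
K, B = split_head_output(head_out)
J_hat = recover_smoke_free(I, K, B)
```

Regressing K and B rather than t and A keeps the recovery a multiply-add on the input image, with no division by a predicted transmission.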

Benchmark

Smoke Dataset

We collected data using an Intuitive Surgical da Vinci robotic surgical system. Porcine tissue specimens (liver, intestine, stomach, kidney) were placed in a sealed chamber. A bidirectional capture strategy — progressive occlusion followed by progressive clearance — yields paired observations of identical static scenes across a continuous spectrum of smoke densities. The resulting dataset is, to our knowledge, the largest publicly available endoscopic smoke dataset to date.

| Dataset | Smoke Type | Size | Image Type (Resolution) | Available |
|---|---|---|---|---|
| LSD3K | Synthetic | 3,000 pairs | Mono (480×480) | |
| LSVD | Real | 786 clips | Mono (1080×1920) | |
| De-Smoking | Real | 3,464 pairs | Mono (inconsistent) | |
| Ours (new) | Real | 5,817 pairs | Stereo (1024×1280) | |

Additional annotations: Depth maps derived from FoundationStereo on smoke-free images, and instrument segmentation labels generated by Segment Anything 3 (SAM3). The benchmark is split into a fine-tuning set (3,097 pairs) and test set (2,720 pairs) with disjoint anatomical structures.
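A split with disjoint anatomical structures can be sketched as a group-wise partition, where every sample of a given tissue type goes entirely to one side. The assignment of tissues to the test set below is purely illustrative; the source does not say which structures belong to which split, and `split_by_group` is a hypothetical helper.

```python
def split_by_group(samples, test_groups):
    """Partition (sample_id, group) records so that no group spans both splits."""
    finetune, test = [], []
    for sample_id, group in samples:
        (test if group in test_groups else finetune).append((sample_id, group))
    return finetune, test

def groups_are_disjoint(split_a, split_b):
    """True when the two splits share no group label."""
    return {g for _, g in split_a}.isdisjoint({g for _, g in split_b})

# Toy records; the tissue-to-split assignment is an illustrative assumption.
tissues = ["liver", "intestine", "stomach", "kidney"] * 3
samples = [(f"pair_{i:03d}", tissue) for i, tissue in enumerate(tissues)]
finetune_set, test_set = split_by_group(samples, test_groups={"liver", "stomach"})

# Reported split sizes add up to the full benchmark.
REPORTED_FINETUNE, REPORTED_TEST, REPORTED_TOTAL = 3097, 2720, 5817
```

Splitting by group rather than by frame avoids near-duplicate views of the same tissue appearing in both fine-tuning and test data.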

Results

Qualitative Comparisons

Desmoking on Our Test Set

Qualitative comparisons across representative scenes (liver and stomach). Our method recovers scene content behind dense smoke with minimal color distortion and fewer artifacts compared to existing dehazing and desmoking approaches.

Figure 3. Qualitative comparisons on our test set across representative scenes (top: liver; bottom: stomach)

More Results

Liver Scene — Desmoking Result
Stomach Scene — Desmoking Result
Predicted Smoke Maps

Comprehensive Results

Quantitative Comparison

Results on the De-Smoking Dataset

| Method | Cholecystectomy SSIM ↑ | Cholecystectomy PSNR ↑ | Prostatectomy SSIM ↑ | Prostatectomy PSNR ↑ |
|---|---|---|---|---|
| He et al. | 0.73 ± 0.09 | 18.44 ± 3.03 | 0.72 ± 0.12 | 19.39 ± 3.38 |
| Song et al. | 0.83 ± 0.10 | 23.25 ± 4.51 | 0.73 ± 0.15 | 19.22 ± 5.93 |
| Qin et al. | 0.84 ± 0.10 | 24.49 ± 5.88 | 0.72 ± 0.15 | 19.13 ± 6.68 |
| Salazar et al. | 0.75 ± 0.12 | 20.06 ± 4.00 | 0.72 ± 0.13 | 19.94 ± 4.10 |
| Liu et al. | 0.81 ± 0.09 | 22.21 ± 3.79 | 0.73 ± 0.13 | 20.08 ± 5.70 |
| Sidorov et al. | 0.76 ± 0.09 | 20.69 ± 3.89 | 0.66 ± 0.11 | 20.09 ± 2.82 |
| Jin et al. | 0.74 ± 0.13 | 21.53 ± 3.89 | 0.62 ± 0.14 | 18.92 ± 3.79 |
| Wu et al. | 0.83 ± 0.11 | 23.88 ± 6.89 | 0.67 ± 0.15 | 17.99 ± 6.13 |
| Ours (best) | 0.86 ± 0.09 | 25.51 ± 4.69 | 0.75 ± 0.13 | 21.71 ± 4.41 |

Results on Our Benchmark Dataset

Reconstruction quality (SSIM/PSNR) and downstream performance after desmoking: stereo depth estimation (MAE) and instrument segmentation (IoU).

| Method | SSIM ↑ | PSNR ↑ | MAE (mm) ↓ | % IoU ↑ |
|---|---|---|---|---|
| None | - | - | 47.22 ± 61.48 | 81.15 ± 17.01 |
| He et al. | 0.53 ± 0.07 | 12.83 ± 1.53 | 51.53 ± 66.71 | 80.81 ± 17.03 |
| Song et al. | 0.61 ± 0.09 | 14.45 ± 1.83 | 49.00 ± 66.45 | 80.90 ± 16.27 |
| Qin et al. | 0.72 ± 0.09 | 19.57 ± 3.20 | 49.03 ± 63.84 | 81.15 ± 17.15 |
| Salazar et al. | 0.54 ± 0.12 | 14.21 ± 2.05 | 53.73 ± 66.50 | 79.46 ± 16.85 |
| Liu et al. | 0.65 ± 0.11 | 16.43 ± 2.63 | 48.02 ± 63.62 | 81.31 ± 16.18 |
| Sidorov et al. | 0.59 ± 0.08 | 16.17 ± 1.86 | 59.44 ± 74.02 | 77.78 ± 16.70 |
| Jin et al. | 0.50 ± 0.15 | 17.41 ± 2.75 | 56.48 ± 69.13 | 74.98 ± 17.36 |
| Wu et al. | 0.70 ± 0.09 | 19.73 ± 4.26 | 53.03 ± 65.20 | 79.30 ± 17.13 |
| Ours | 0.73 ± 0.09 | 20.61 ± 3.12 | 49.42 ± 64.48 | 81.48 ± 16.85 |
| Ours-finetune (best) | 0.75 ± 0.09 | 23.57 ± 3.35 | 47.64 ± 62.03 | 81.81 ± 16.08 |

Note: The model is 111.36 MB and runs at 52 FPS on an NVIDIA RTX 6000 at 512×640 resolution. Ours-finetune uses an additional 50 epochs of fine-tuning on real smoke data and is excluded from direct comparison, as competing methods were not fine-tuned on real smoke images. The De-Smoking dataset comprises 1,063 cholecystectomy and 2,401 prostatectomy paired images; values may differ slightly from the original publication due to implementation details.
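For readers unfamiliar with the metrics in the tables above, the following sketch gives textbook definitions of PSNR, mean absolute error, and IoU. It is not the authors' evaluation code: the function names, the optional validity mask, and the empty-union convention are assumptions, and SSIM is omitted because it requires a windowed implementation.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mean_absolute_error(depth_ref, depth_est, valid=None):
    """Mean absolute depth error, optionally restricted to valid pixels."""
    err = np.abs(np.asarray(depth_ref, dtype=float) - np.asarray(depth_est, dtype=float))
    if valid is not None:
        err = err[np.asarray(valid, dtype=bool)]
    return float(err.mean())

def iou(mask_ref, mask_pred):
    """Intersection over union of two binary masks."""
    ref = np.asarray(mask_ref, dtype=bool)
    pred = np.asarray(mask_pred, dtype=bool)
    union = np.logical_or(ref, pred).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(ref, pred).sum() / union)

# Small worked examples.
psnr_value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
mae_value = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
iou_value = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

The mean ± standard deviation format in the tables suggests that scores are computed per image and then aggregated, though the exact protocol is not stated here.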

Downstream Impact

Impact on Computer Vision Tasks

Figure 4. Qualitative results on downstream stereo depth estimation (top) and instrument segmentation (bottom) after different desmoking methods.

Stereo Depth Estimation

MAE: 47.22 mm (no-desmoking baseline)

Interestingly, the no-desmoking baseline performs best for depth estimation (47.22 mm MAE, versus 47.64 mm for our fine-tuned model and 49.42 mm for our synthetic-only model). Foundation models like FoundationStereo appear relatively robust to smoke, likely due to large-scale training. Desmoking can alter image statistics and potentially disrupt the left–right feature consistency that stereo matching depends on.

Instrument Segmentation

IoU: 81.81% (+0.66 percentage points vs. no desmoking)

Our method improves IoU on smoky scenes from 81.15% without desmoking to 81.48% (synthetic only) and 81.81% (fine-tuned), which we attribute to enhanced local contrast and clearer instrument boundaries after desmoking. This highlights desmoking's potential to benefit downstream applications by improving visual perception in the presence of smoke.

Summary

Conclusion

We presented a transformer-based surgical desmoking framework that combines a ViT backbone with a physics-inspired head to jointly predict smoke-free images and corresponding smoke maps. To mitigate the scarcity of paired training data, we developed a large-scale synthetic data generation pipeline and curated a high-resolution paired dataset for benchmarking, enabling supervised training and evaluation at unprecedented scale.

Our method achieves state-of-the-art reconstruction performance on both the De-Smoking dataset and our benchmark, improving anatomical visibility under dense smoke while preserving visual fidelity. Analysis of downstream tasks shows that desmoking can improve instrument segmentation but does not consistently benefit stereo depth estimation, underscoring the nuanced interaction between image reconstruction and geometry-based algorithms.

Overall, these results position digital desmoking as a valuable component of surgical vision pipelines. Future work will focus on real-time deployment and tighter integration with computer-assisted interventions.

Publication

Citation

BibTeX

@article{lu2026seeing,
  title={Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception},
  author={Lu, Jingpei and Jiang, Fengyi and Zhang, Xiaorui and Jin, Lingbo and Mohareri, Omid},
  journal={arXiv preprint arXiv:2603.25867},
  year={2026}
}

About This Work

This research was conducted at Intuitive Surgical, Sunnyvale, CA.

Contact: jingpei.lu@intusurg.com