
Similarity Metrics for MR Image-to-Image Translation

The major conclusion of the paper Similarity Metrics for MR Image-to-Image Translation is that relying on the most commonly used metrics, specifically SSIM and PSNR, is insufficient for validating Magnetic Resonance (MR) image-to-image translation models due to their specific blind spots regarding blurring and normalization. Instead, the authors conclude that researchers must employ a carefully selected combination of reference and non-reference metrics to ensure comprehensive quality assessment.

The paper outlines the following specific recommendations for a robust validation framework:

1. Recommended Reference Metrics

To detect a wide range of distortions, the authors recommend a specific suite of four metrics (a minimal computation sketch follows the list):

  • SSIM (Structural Similarity Index Measure): Useful for structural coherence but must be normalized correctly using the actual intensity range.
  • LPIPS (Learned Perceptual Image Patch Similarity): Recommended for capturing perceptual quality, which traditional metrics often miss.
  • MSE (Mean Squared Error): Recommended primarily to maintain comparability with previous studies.
  • NMI (Normalized Mutual Information): Highly recommended because it is insensitive to linear intensity scaling, making it robust against intensity shifts that confuse other metrics.
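
A minimal sketch of how three of these reference metrics might be computed with NumPy and scikit-image; the NMI implementation follows the standard joint-histogram definition, and all function names here are illustrative, not from the paper. LPIPS requires a pretrained network (for example via the lpips PyPI package) and is only noted in a comment to keep the sketch dependency-light.

```python
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

def nmi(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    """Normalized mutual information, (H(A) + H(B)) / H(A, B).
    Insensitive to linear intensity scaling of either image."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return (entropy(px) + entropy(py)) / entropy(pxy)

def reference_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    # SSIM must see the *actual* intensity range of the data,
    # not an assumed 8-bit range of 255.
    L = max(pred.max(), target.max()) - min(pred.min(), target.min())
    return {
        "ssim": structural_similarity(target, pred, data_range=L),
        "mse": mean_squared_error(target, pred),
        "nmi": nmi(target, pred),  # computed on unnormalized images
        # "lpips": needs a pretrained network, e.g. lpips.LPIPS(net="alex"),
        #          applied to 3-channel tensors scaled to [-1, 1].
    }
```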

2. The Rejection of PSNR

A significant conclusion is that PSNR (Peak Signal-to-Noise Ratio) is not recommended for evaluating image synthesis.

  • Inconsistency: PSNR scores are heavily dependent on the specific normalization method used (e.g., Z-score vs MinMax), meaning scores cannot be reliably compared across different studies unless the methodology is identical (see the short demonstration after this list).
  • Lack of Correlation: It does not correlate well with human perception and struggles to measure the true degree of distortion in medical contexts.
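
A small demonstration of the inconsistency point on synthetic data: the same image pair, with the same underlying error, yields very different PSNR values under normalization and data-range conventions that all appear in practice. This is a hedged sketch, not taken from the paper's experiments.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

rng = np.random.default_rng(0)
target = rng.normal(1000.0, 200.0, size=(128, 128))        # synthetic "MR" intensities
pred = target + rng.normal(0.0, 20.0, size=target.shape)   # identical underlying error

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def zscore(x):
    return (x - x.mean()) / x.std()

# Three common (and mutually incompatible) conventions for the same pair:
print(peak_signal_noise_ratio(minmax(target), minmax(pred), data_range=1.0))
print(peak_signal_noise_ratio(zscore(target), zscore(pred), data_range=1.0))
z_t, z_p = zscore(target), zscore(pred)
dr = max(z_t.max(), z_p.max()) - min(z_t.min(), z_p.min())
print(peak_signal_noise_ratio(z_t, z_p, data_range=dr))
# The printed values differ by many dB although the images never changed,
# so PSNR is only comparable when the normalization recipe is identical.
```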

3. The Necessity of Non-Reference Metrics

Since reference metrics like SSIM can underestimate specific issues (such as blurring), the paper concludes that non-reference (blind) metrics are essential additions:

  • BLUR: Essential because SSIM tends to favour blurred images; this metric reliably detects blurriness that other metrics miss (a simplified sketch follows this list).
  • MSN (Mean Structural Noise): Recommended for detecting MRI-specific artefacts, such as ghosting or stripes, which reduce anatomical consistency.
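
What a BLUR-style metric can look like, sketched after the re-blurring idea of Crété-Roziel et al. (2007): blur the image again and measure how much neighbouring-pixel variation is lost. This is a simplified illustration under that assumption, not the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def blur_metric(img: np.ndarray, size: int = 9) -> float:
    """Simplified re-blur metric. Returns a value in [0, 1];
    higher means blurrier."""
    img = img.astype(float)

    def directional(axis: int) -> float:
        reblurred = uniform_filter1d(img, size=size, axis=axis)
        d_orig = np.abs(np.diff(img, axis=axis))
        d_blur = np.abs(np.diff(reblurred, axis=axis))
        # Variation that survives re-blurring: sharp images lose a lot,
        # already-blurred images barely change.
        d_kept = np.maximum(0.0, d_orig - d_blur)
        s_orig = d_orig.sum()
        return (s_orig - d_kept.sum()) / s_orig if s_orig > 0 else 1.0

    return max(directional(0), directional(1))
```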

4. Critical Role of Normalization

The authors conclude that metric scores are often meaningless without defined normalization standards. They recommend the following (a brief sketch follows the list):

  • Z-score normalization for SSIM, LPIPS, and MSE.
  • No normalization for NMI.
  • Detailed reporting of normalization parameters (such as the data range $L$ for SSIM) is mandatory for reproducibility.
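
A sketch of this preprocessing, assuming a simple global Z-score (a foreground mask, if available, keeps background air from dominating the statistics) and showing the data range $L$ that must then be reported alongside the SSIM score. The helper name is this article's, not the paper's.

```python
import numpy as np

def zscore(img: np.ndarray, mask: np.ndarray = None) -> np.ndarray:
    """Z-score normalization; pass a foreground mask so background air
    does not dominate the mean and standard deviation."""
    vals = img[mask] if mask is not None else img
    return (img - vals.mean()) / vals.std()

rng = np.random.default_rng(0)
pred, target = rng.normal(size=(128, 128)), rng.normal(size=(128, 128))
pred_z, target_z = zscore(pred), zscore(target)

# The data range L passed to SSIM must be reported for reproducibility:
L = max(pred_z.max(), target_z.max()) - min(pred_z.min(), target_z.min())
print(f"SSIM computed with data_range L = {L:.2f}")
```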

5. Downstream Task Evaluation

Finally, the paper concludes that evaluating images via a downstream task (e.g., tumour segmentation using the DICE score) is valuable because synthesis models can hallucinate or remove clinically relevant structures. However, the performance of the segmentation model itself must be verified first to ensure the evaluation is valid.
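
A minimal DICE overlap implementation for such a downstream check; `segmentation_model` in the usage comment is a hypothetical, separately validated model, not something the paper provides.

```python
import numpy as np

def dice_score(a: np.ndarray, b: np.ndarray) -> float:
    """DICE overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

# Downstream check (sketch): segment tumours on the real and the synthesized
# image with the *same* validated model, then compare the two masks:
#   dice_score(segmentation_model(synth) > 0.5, segmentation_model(real) > 0.5)
```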

Analogy: The paper concludes that validating medical AI with only SSIM and PSNR is like judging a chef solely by the portion size (MSE/PSNR) and the arrangement of food on the plate (SSIM). While these matter, you also need to taste the food (LPIPS/NMI) and check for foreign objects like hair or eggshells (MSN/Non-reference metrics) to truly know if the meal is safe and high-quality.


Which specific metrics best capture MRI-specific distortions compared to general image similarity measures?

To effectively capture MRI-specific distortions, non-reference quality metrics, specifically Mean Structural Noise (MSN) and Mean Nyquist Ghosting (MNG), are superior to general similarity measures, which often fail to detect specific artifacts like ghosting or stripes. While general metrics such as SSIM and PSNR are widely used, they exhibit significant shortcomings in the medical domain, particularly regarding sensitivity to intensity shifts and blurring.

MRI-Specific Metrics

These metrics are designed to detect artifacts inherent to MR acquisition and reconstruction without needing a reference image (blind metrics); a sketch of plausible implementations follows the list.

  • Mean Structural Noise (MSN): This metric measures the mean correlation between neighbouring lines of pixels in an image. It is highly effective at detecting stripe artifacts, ghosting, and Gaussian noise. A reduction in MSN scores indicates a loss of anatomical consistency caused by these distortions.
  • Mean Nyquist Ghosting (MNG): This metric calculates the correlation between image lines separated by half the image distance. It is specifically sensitive to stripe artifacts and ghosting artifacts, which are common in MR imaging due to frequency sampling errors. Experiments show that MNG scores drop significantly when stripe artifacts are present.
  • BLUR: While not exclusive to MRI, the BLUR metric is essential because general metrics like SSIM frequently underestimate blurriness. BLUR reliably detects blurry images by comparing the original image to a re-blurred version.
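
One plausible reading of the two line-correlation metrics in code, directly following the verbal definitions above (neighbouring lines for MSN, lines half the image apart for MNG); the paper's exact aggregation may differ, so treat this as an illustrative sketch.

```python
import numpy as np

def mean_line_correlation(img: np.ndarray, offset: int) -> float:
    """Mean Pearson correlation between each pixel row and the row
    `offset` lines below it."""
    corrs = []
    for i in range(img.shape[0] - offset):
        a, b = img[i], img[i + offset]
        if a.std() > 0 and b.std() > 0:
            corrs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(corrs))

def msn(img: np.ndarray) -> float:
    # Mean Structural Noise: correlation of directly neighbouring lines;
    # stripes, ghosting, and noise push this score down.
    return mean_line_correlation(img, offset=1)

def mng(img: np.ndarray) -> float:
    # Mean Nyquist Ghosting: correlation of lines half the image apart,
    # sensitive to Nyquist (N/2) ghosting from frequency-sampling errors.
    return mean_line_correlation(img, offset=img.shape[0] // 2)
```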

Limitations of General Image Similarity Measures

General metrics often struggle with the specific nature of medical image synthesis and MR data ranges.

  • Structural Similarity Index Measure (SSIM): Although used in 84% of medical image-to-image translation studies, SSIM underestimates blurring. It tends to favour blurred images over other distortions and is highly sensitive to constant intensity shifts unless the data is normalized.
  • Peak Signal-to-Noise Ratio (PSNR): This metric is generally not recommended for validating MR image synthesis. It does not correlate well with human perception and is extremely sensitive to the choice of normalization method (e.g., Z-score vs MinMax), making comparisons between studies difficult.
  • Error Metrics (MSE, MAE, NMSE): Like SSIM, these are sensitive to intensity shifts and translation misalignment rather than specific MR artifacts. However, MSE is recommended as part of a validation set for comparability with previous studies.

Recommendations for Comprehensive Validation

To capture both general image quality and MRI-specific distortions, a combination of metrics is required; a sketch assembling them follows the list.

  1. Reference Metrics: A combination of SSIM, LPIPS (Learned Perceptual Image Patch Similarity), MSE, and NMI (Normalized Mutual Information) is recommended to detect a broad set of distortions. NMI is particularly useful as it is insensitive to linear intensity scaling.
  2. Normalization: The ability of general metrics to capture distortions depends heavily on data normalization. Z-score normalization is recommended for SSIM, LPIPS, and MSE, while NMI should be used without normalization.
  3. Downstream Tasks: Using a segmentation metric (like the DICE score) on a downstream task can validate that the synthesized images preserve diagnostically relevant structures, such as tumours, even if visual artifacts are present.
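
Putting the recommendations together, a sketch of a validation report that reuses the helper functions from the sketches above (zscore, nmi, blur_metric, msn, mng); all names are this article's illustrative choices, not the paper's.

```python
from skimage.metrics import structural_similarity, mean_squared_error

def validation_report(pred, target):
    """Combine reference and non-reference metrics as recommended."""
    pred_z, target_z = zscore(pred), zscore(target)  # Z-score for SSIM/MSE
    L = max(pred_z.max(), target_z.max()) - min(pred_z.min(), target_z.min())
    return {
        "ssim": structural_similarity(target_z, pred_z, data_range=L),
        "mse": mean_squared_error(target_z, pred_z),
        "nmi": nmi(target, pred),        # reference metric, unnormalized input
        "blur": blur_metric(pred),       # non-reference, synthesis only
        "msn": msn(pred),                # non-reference, MR artifacts
        "mng": mng(pred),                # non-reference, Nyquist ghosting
        "data_range_L": L,               # report for reproducibility
    }
```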

Analogy: Relying solely on general metrics (like SSIM) to evaluate an MRI is like using a spell-checker to review a printed document; it will tell you if the words (structure) are correct, but it will completely miss if the printer has run out of ink or left streaks (MRI-specific artifacts) across the page. MRI-specific metrics act as the printer technician, specifically looking for those mechanical flaws.



