Joint Frequency and Image Space Learning for MRI Reconstruction and Analysis

We propose neural network layers that explicitly combine frequency and image feature representations and show that they can be used as a versatile building block for reconstruction from frequency space data. Our work is motivated by the challenges arising in MRI acquisition where the signal is a corrupted Fourier transform of the desired image. The proposed joint learning schemes enable both correction of artifacts native to the frequency space and manipulation of image space representations to reconstruct coherent image structures at every layer of the network. This is in contrast to most current deep learning approaches for image reconstruction that treat frequency and image space features separately and often operate exclusively in one of the two spaces. We demonstrate the advantages of joint convolutional learning for a variety of tasks, including motion correction, denoising, reconstruction from undersampled acquisitions, and combined undersampling and motion correction on simulated and real world multicoil MRI data. The joint models produce consistently high quality output images across all tasks and datasets. When integrated into a state of the art unrolled optimization network with physics-inspired data consistency constraints for undersampled reconstruction, the proposed architectures significantly improve the optimization landscape, which yields an order of magnitude reduction of training time. This result suggests that joint representations are particularly well suited for MRI signals in deep learning networks. Our code and pretrained models are publicly available at https://github.com/nalinimsingh/interlacer.


Introduction
Magnetic resonance imaging (MRI) (Lauterbur, 1973) acquires frequency space data and converts these measurements to images for visualization and downstream analysis.Practical imaging considerations often affect the data acquisition process.For example, motion occurs during acquisition (Andre et al., 2015), noise affects sensor readings (Macovski, 1996), and sub-Nyquist undersampling is routinely used to speed up data acquisition (Lustig et al., 2008).Traditionally, the acquired frequency space signals are converted to image space reconstructions via an inverse Fourier transform, with each individual frequency space measurement contributing to all output pixels in the image space.As a result, local changes in the acquired frequency space data induce global effects on the entire output image.To produce accurate image reconstructions, modeling tools for Fourier imaging must correct these global artifacts in addition to performing fine-scale image space processing.
Recently, neural networks have emerged as an alternative approach for MRI reconstruction (Aggarwal et al., 2018;Hammernik et al., 2018;Hyun et al., 2018;Lee et al., 2017;Putzky and Welling, 2019;Quan et al., 2018;Schlemper et al., 2017;Sun et al., 2016;Yang et al., 2017;Aggarwal et al., 2018;Hammernik et al., 2018;Cheng et al., 2018;Han et al., 2019;Zhu et al., 2018;Duffy et al., 2021;Haskell et al., 2019;Johnson and Drangova, 2019;Küstner et al., 2019;Pawar et al., 2018;Shaw et al., 2020;Oksuz et al., 2019;Usman et al., 2020;Benou et al., 2017;Jiang et al., 2018;Manjón and Coupe, 2018).Most existing architectures are based on purely frequency space representations or purely image space representations.Here, we propose and demonstrate joint frequency-image space representations that enable networks to learn a wide set of tasks including and beyond the extensively studied undersampled reconstruction.To motivate our approach, we examine the correlation structure for frequency and image space representations in Fig. 1.Local neighborhoods around a pixel exhibit strong correlations, suggesting that local convolution operations, which are widely successful on image space computer vision tasks, might also be useful when applied to frequency space data to capture this local structure.Convolutional operations in frequency space promise to enable direct correction of local frequency space artifacts corresponding to global image space effects, while convolutional image space processing facilitates complementary correction of artifacts that are best captured in the image domain.

Prior Work
We study joint representations in the context of three corruption processes that arise during the imaging process.

Motion.
Previous retrospective motion correction strategies (Batchelor et al., 2005;Haskell et al., 2018) are cast as large, non-convex optimization problems with iterative solutions that are slow to compute.Deep learning methods (Duffy et al., 2021;Haskell et al., 2019;Johnson and Drangova, 2019;Küstner et al., 2019;Pawar et al., 2018;Shaw et al., 2020;Usman et al., 2020) solve the motion correction problem with a neural network operating purely in the image space, even though motion artifacts are induced directly in the frequency space during data acquisition.An alternative approach has been demonstrated recently that detects motion directly on frequency space data, followed by motion correction via an image space network (Oksuz et al., 2019).
Undersampling.Classical undersampled reconstruction techniques either construct the output image as a least-squares estimate from the acquired frequency space data (Pruessmann et al., 1999) or combine convolutional filters in the frequency space with an inverse Fourier transform (Griswold et al., 2002;Lustig and Pauly, 2010).Many deep learning methods apply convolutions to image space reconstructions of the acquired undersampled frequency data (Aggarwal et al., 2018;Hammernik et al., 2018;Hyun et al., 2018;Lee et al., 2017;Putzky and Welling, 2019;Quan et al., 2018;Schlemper et al., 2017;Sun et al., 2016;Yang et al., 2017).To improve the quality and fidelity of the reconstruction, the convolutional layers can be combined into an architecture that emulates unrolled optimization, with a convolutional regularizer coupled with a physics-inspired data consistency constraint that is enforced after each iteration (Aggarwal et al., 2018;Hammernik et al., 2018).Alternatively, the convolutional architectures can act directly on the frequency space data (Akçakaya et al., 2019;Cheng et al., 2018;Han et al., 2019).The notably different AUTOMAP architecture uses fully-connected layers to convert frequency space data to the image space and then applies further image space convolutions (Zhu et al., 2018), incurring prohibitive memory complexity of O(N 4 ) for a N × N image.
More recently, solutions that combine frequency and image space convolutions have been demonstrated in the context of undersampled reconstruction.One approach is to combine separately trained pure frequency and pure image space networks into a common architecture (Eo et al., 2018;Souza and Frayne, 2019;Wang et al., 2019).The most closely related work to ours integrates frequency and image space blocks within the same network (Zhou and Zhou, 2020), effectively implementing one of the two variants we consider in this paper.
Here we propose an additional layer architecture that also tightly couples frequency and image space representations and evaluate both variants on a wide variety of tasks, well beyond the undersampled reconstruction scenario for which the previously combined architectures have been proposed.
In our experiments, a basic network that simply concatenates joint layers outperforms its pure frequency and image counterparts across a large set of artifacts and reconstruction quality metrics.To investigate how the joint layer architecture interacts with the data consistency constraints often used in undersampled reconstruction, we train the basic network with such a constraint and observe that it compares favorably with the state of the art task-specific undersampled reconstruction networks (Eo et al., 2018;Schlemper et al., 2017) that also incorporate a data consistency constraint.Moreover, we probe the relationship between the proposed joint layers and the widely used unrolled optimization architectures by replacing image convolutional layers with our joint layers in a state of the art unrolled optimization network, MoDL (Aggarwal et al., 2018).Using the proposed joint layers improves the training landscape and reduces training time by about an order of magnitude.
To summarize, our contributions are as follows: 1. We define two task-independent convolutional layer architectures that tightly couple frequency and image representations of an input image that can be used in conjunction with unrolled optimization, data consistency constraints, and other sophisticated strategies for building and training reconstruction neural networks.
2. We demonstrate in simulation experiments that joint networks outperform pure image or pure frequency space networks for reconstructing high quality images in the presence of (i) extreme motion, (ii) heavy noise, and (iii) combination of artifacts, such as motion and undersampling.
3. We demonstrate that the proposed joint learning strategy is compatible with a data consistency constraint and performs favorably relative to state-of-the-art networks specifically designed for the undersampled reconstruction task.
4. We demonstrate on complex-valued, multicoil, real world data that incorporating joint layers into unrolled optimization networks results in more effective training and an order of magnitude decrease of training time, suggesting that the proposed architectures are particularly well suited for image representation in MRI reconstruction networks.
This paper is organized as follows.In the next section, we define the proposed layer and network architectures.Section 3 provides the implementation details and describes our ablation studies.Section 4 reports experimental results, followed by the discussion of the proposed layers, their limitations, and conclusions in Section 5.

Joint Networks
MRI acquires Fourier transform measurements, referred to as k-space data.We assume a 2D multislice MRI acquisition.For each slice in this setup, the goal of image reconstruction is to generate an image I from the acquired Fourier transform measurements F = F {I}. Classically, this reconstruction is computed via a 2D inverse Fourier transform, producing an estimated image Î = F −1 {F }.In practice, corrupted and possibly undersampled measurements F are acquired instead of F , and the goal is to estimate the desired image I from the corrupted signal F .Many strategies exist for selecting which measurements to acquire in frequency space.Here we consider Cartesian sampling, where measurement coordinates k x and k y are evenly sampled across the 2D Fourier plane, but our method can be generalized to other acquisition schemes.In this section, we define two neural network layer variants that combine image and frequency space convolutional features, referred to as Interleaved and Alternating, specify the network architectures, and describe the learning procedure.

Joint Layer Structures
Fig. 2 illustrates the layer structures of the two joint networks.We use u n to denote the frequency space input and v n to denote the image space input of layer n.Thus, u 0 = F and v 0 = F −1 {u 0 } represent the frequency space and image space inputs to the network.
In the Interleaved setup, layer inputs are combined via learned, layer-specific mixing parameters α n and β n that parameterize the sigmoid function s(x) = (1+e −x ) −1 to constrain the mixing coefficients to (0,1): (1) Real and imaginary parts of inputs are represented as separate channels at each layer and are joined appropriately to form complex numbers when computing the Fourier transform F {•} or its inverse.Next, the layer applies batch normalization (BN), a convolution, and an activation function with a skip connection to produce the outputs: where (w n , b n ) are learned frequency space convolution weights and biases, (w n , b n ) are learned image space convolution weights and biases, and σ(•) and σ (•) are activation functions specific to the frequency space and image space network components, described later in this section.This layer architecture is a generalization of networks that operate purely in frequency space, obtained by choosing s(α n ) = 1 and s(β n ) = 0, and of networks that operate purely in image space, that arise when s(α n ) = 0 and s(β n ) = 1.When 0 < s(α n ) < 1 and 0 < s(β n ) < 1, this layer represents a function that cannot be expressed solely via pure image or frequency space convolutional layers that do not invoke the Fourier transform or its inverse.Note that the frequency output u n of layer n is not required to be the Fourier transform of the layer's image output v n , only that the mixing is applied to either two frequency space outputs or two image space outputs.This additional flexibility ensures that u n and v n are not entirely redundant and the network learns the right features to capture MRI structure based on the input data and the task at hand.
In the Alternating setup, each layer sequentially incorporates frequency and image space convolutions with the appropriate batch normalization and activation function: i.e., the reconstruction alternates between convolutions in the frequency and image space.A version of this architecture was previously introduced as part of a task-specific network for undersampled reconstruction (Zhou and Zhou, 2020).
For both joint architectures, the frequency space convolutions represent element-wise multiplications in the image space.Since the convolution kernels have limited width, the learned convolutions cannot represent all such element-wise multiplications, but instead parameterize the subset whose 2D Fourier transform is zero outside of a central region.Coupled with nonlinearities in the frequency space, these operations enable the network to use global, spatially varying operations not captured by image space convolutions.
Although both of these layers explicitly include the Fourier transform and its inverse, no parameters are associated with those transforms.Thus, we learn only convolutional weights, biases, and possibly mixing coefficients.Since our networks incorporate Fourier transforms, they have an overall O(N 2 log N ) space complexity for N × N images.

Activation Functions
Adopting the standard practice of using the ReLU nonlinearity for image data, we define σ (x) = ReLU(x) for all convolutions in the image space.This operation is applied separately to real and imaginary channels of each image space convolution output (Trabelsi et al, 2018).However, the zero-gradient of this nonlinearity for negative values is ill-suited for networks that operate on frequency space data, as individual inputs can take on a large range of positive and negative values.We introduce an alternative nonlinear activation function that we apply to both the real and imaginary channels of each frequency space convolution output: This nonlinearity's magnitude increases with that of the input everywhere, while preserving the distinction between positive and negative inputs.We found that networks using this nonlinearity consistently outperformed networks that employed ReLU activation functions on frequency space convolution outputs.

Learning
The networks evaluated in this paper can be trained with any differentiable loss function L.
In our experiments, we investigate a wide variety of loss functions.We train the joint network f (•; θ f , θ i ) for image reconstruction by optimizing a set of frequency space parameters θ f and a set of image space parameters θ i over the training dataset D = {( Fm , I m )} using stochastic gradient descent-based strategies to obtain where θ f and θ i depend on the setup of the joint layer.

Implementation Details and Ablation Architectures
We construct each joint network to contain 10 joint frequency and image space layers.We performed a hyperparameter sweep and observed that the accuracy of reconstruction on the validation set stopped improving for networks that included more than 10 joint layers.A single 2D convolutional layer acts on the frequency space output u 10 of the final joint layer to produce the final 2-channel complex output F .The estimated image Î is the inverse Fourier transform of the network's output, i.e., Î = F −1 { F }.All convolution blocks within both types of joint layers have kernel size 3x3 and 64 output features, resulting in a total of 670,622 parameters for the Interleaved network and 706,438 parameters for the Alternating network.
To evaluate the utility of combined frequency and image space layers as a network building block for manipulating Fourier imaging data, we compare performance of the Interleaved and Alternating architectures to two similarly structured baseline architectures with only frequency or only image space operations.
First, we create an architecture Frequency that performs convolutions only on frequency space data and train the network g(•; θ f ) to identify frequency space parameters The network contains 20 convolution layer to match the joint networks' 10 pairs of 2 convolution layers.As in the Interleaved and Alternating networks, each convolution layer has kernel size 3x3 and 64 output features, followed by the final, two-feature 2D convolutional layer, resulting in 706,438 parameters.This network captures the convolution strategy used in (Akçakaya et al., 2019;Han et al., 2019;Kim et al., 2019), which incorporate frequency space convolutions in the context of other task-specific architectures and loss choices.We also implement an image space network Image.The network g(•; θ i ) is trained by optimizing This network's architecture is identical to that of Frequency and also contains 706,438 parameters, but it operates on image space data.This network captures the convolution strategy used in prior work that incorporates image space convolutions with task-specific architectures and loss function choices, e.g., unrolled optimization and data consistency constraints (Aggarwal et al., 2018;Hammernik et al., 2018;Haskell et al., 2019;Hyun et al., 2018;Küstner et al., 2019;Lee et al., 2017;Manjón and Coupe, 2018;Pawar et al., 2018;Putzky and Welling, 2019;Quan et al., 2018;Schlemper et al., 2017;Sun et al., 2016;Yang et al., 2017).We initialize all convolution weights using the He normal initializer (He et al., 2015) and use the Adam optimizer (Kingma and Ba, 2014) (learning rate 0.001) until convergence.We initialize s(α) and s(β) to 0.5.Training each model requires one day on an NVIDIA RTX 2080 Ti GPU.Our code and pre-trained models for each of these networks is available at https://github.com/nalinimsingh/interlacer.

Experiments
In this section, we evaluate the proposed joint layers in a set of experiments that progress from simulated data and basic networks to real world complex-valued multicoil MRI measurements and unrolled optimization frameworks with physics-inspired data consistency constraints.The experiments in this section are performed on brain MRIs from multiple datasets.Additional experiments on FastMRI single coil knee MRI, including comparisons with the top methods on FastMRI leaderboard, are provided in Appendix A.

No Data Consistency
In this section, we present experiments where no data consistency contraint is employed in training our networks.These experiments directly compare the performance of the different layer types described in Sections 2 and 3.These experiments are particularly useful for understanding the relative performance of these methods in settings where direct data consistency may not be desirable because the acquired data is corrupted by an artifact.

Data.
In this experiment, we simulate artifacts of interest in a set of 6,276 T 1 -weighted brain MRI images from patients aged 55-90 collected as part of the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Mueller et al., 2005).We select the central 2D axial image of each volume for training and evaluation.To simulate acquired data, we apply the 2D Fourier transform to each image.After simulating the artifacts as described below, we normalize each input and output training pair by dividing by the maximum value in the corrupted image.The k-space data were zero-padded in this dataset during the original image reconstruction process, prior to our simulations.As a result, the quantitative results from these experiments do not represent model performance when deployed on raw, acquired k-space data (Shimron et al., 2022).Instead, these experiments probe the relative performance of competing methods on tasks for which large datasets of raw k-space are not readily available, such as motion correction and denoising.Subsequent experiments with raw, acquired frequency space data that have not been padded demonstrate that the proposed joint layers can also handle non-padded data.We split the dataset into 4,115 training images, 2,061 validation images, and 100 test images such that no subjects are shared across the training, validation, and test sets.Preliminary experiments and hyper-parameters are evaluated on the validation dataset; the test set is only used for computing the performance statistics.
Training Loss and Evaluation Metric.
We train Frequency, Image, Interleaved, and Alternating networks described in Section 3 using L1 loss on the real and imaginary components of the output and employ the SSIM scores (Wang et al., 2004) between the ground truth and reconstructed magnitude images to evaluate the quality of reconstruction on the test set.

Experimental Setup
Motion.Imaging subjects may move as measurements are being acquired at different points in the Fourier space.In practice, all points within a single line F (•, k y ) in frequency space are acquired rapidly together.Thus, it is commonly assumed that no motion occurs during acquisition of a single frequency space line.In this work, we use a rigid-body motion model for motion that occurs between acquisitions of successive lines.
If the imaged subject is affected by a rotation φ ky about the origin, a horizontal translation ∆x ky , and a vertical translation ∆y ky during acquisition of line k y , the acquired signal corresponds to the rigidly transformed image Ĩky Ĩky (x, y) = I x − ∆x ky cos φ ky − y − ∆y ky sin φ ky , x − ∆x ky sin φ ky + y − ∆y ky cos φ ky .
Eq. ( 8) forms a translated and rotated version of the desired image I.A pure translation without rotation in the image space corresponds to a phase shift in the frequency space: for a N × N image.A pure rotation about the center of the image space without translation corresponds to a rotation by the same angle in the frequency space: To simulate motion artifacts during image acquisition as described in Eq. ( 8), we sample three motion parameters at various lines in frequency space: a horizontal translation ∆x, vertical translation ∆y, and rotation φ.We report results for the case when the fraction γ m of the total number of lines at which motion occurs is 0.03, though the trends in our results hold for several different values of this parameter.We apply the sampled motion parameters to contiguous lines in frequency space between consecutive motion line samples.Translation parameter values are drawn uniformly from the range [−8px, 8px], corresponding to physical translations on the range [−8mm, 8mm].Rotation parameter values are drawn uniformly from the range [−11 • , 11 • ].These parameter ranges are chosen to include extreme motion at the upper limit of what might be expected in a typical MRI scan.For a Cartesian, fully-sampled acquisition, the resulting combined frequency space data represents the signal acquired when the imaging subject shifts according to the sampled motion parameters at each of the randomly sampled lines in frequency space.
Noise.Noisy MRI data can be modeled via an additive i.i.d.complex Gaussian distribution: where N (µ, Σ) represents the Gaussian distribution with mean µ and covariance Σ.This noise distribution gives rise to the standard Rician distribution on MRI image space pixel magnitudes (Cárdenas-Blanco et al., 2008).
To simulate noisy acquisitions as described in Eq. ( 11), we sample pixelwise independent noise from a zero-mean Gaussian distribution.We report results in the case where this noise has standard deviation γ n of 10,000, though our observed trends are consistent for both smaller and larger values of this parameter.This value was chosen because it visually results in an aggressive noise corruption on the magnitude image; the average resulting magnitude image has SNR≈1.5.
Undersampling.To speed up image acquisition, a common approach is to only acquire data at a subset S y of discrete "lines," i.e., values of k y ∈ S y : We simulate undersampling as described in Eq. ( 12) with sampling frequency γ s = 25% (equivalent to an acceleration factor of 4), where the selected line indices S y are sampled at random.These lines are selected without a bias toward the low-frequency lines at the center of the Fourier plane of each image, independently of the sampling pattern in all other images.This challenging undersampling pattern measures how well different layer architectures perform under non-traditional acquisition schemes, for example, when using scan-specific acquisition patterns (Bahadir et al., 2020).Our subsequent experiments evaluate the proposed layers with more conventional undersampling schemes.As an aside, the ground truth data in this experiment has conjugate symmetry in the frequency space, so in the hypothetical case of γ s =50% with our random sampling scheme it is possible that all of the data required to perfectly reconstruct the image is present in the input.This is impossible for the acceleration factor of γ s =25% in this study.
Undersampling with Motion.Undersampling reduces scan time and thus is commonly used to limit the time during which motion can occur.We analyze the setting where both motion corruption and undersampling occur simultaneously (Fig. 3), forcing the reconstruction algorithms to correct both types of artifacts.As in the pure motion experiments, for each slice, we set the fraction of lines γ m = 0.03.For each line affected by motion, we sample three parameters of motion: ∆x i , ∆y i , and φ i , corresponding respectively to a horizontal translation, vertical translation, and counterclockwise rotation about the slice origin.We simulate the corresponding motion-corrupted frequency space as described in Eq. ( 8).We then sample the full center 8% of k y -lines and sample the remainder of the line indices from a uniform distribution to achieve an overall 4x acceleration factor.in Figs.5-7.Qualitatively, for each task, the Frequency network provides a blurry version of the ground truth image.The Image network provides a reconstruction which effectively removes 'background' effects but has limited success in correcting these artifacts within the image.In contrast, the Interleaved and Alternating networks provide sharper, high-quality reconstructions across all tasks.Further, the frequency space reconstructions provided by those networks appear the most faithful to the ground truth frequency data.

Hard Data Consistency Constraint
Deep learning for undersampled reconstruction is an active area of research and several state of the art methods have emerged for this task.In this experiment, we compare Interleaved and Alternating networks to such methods on ADNI data introduced in Section 4.1.

Ground Truth
Input Frequency Image Alternating Interleaved Undersampling is fundamentally different from motion and noise corruption, because the acquired data for lines k y ∈ S y are the correct, desired outputs of the reconstruction algorithm at those frequency space locations.Data consistency can be enforced at test time and at intermediate layers of the network by substituting the appropriate k-space lines into the k-space representations of the image (final or intermediate) produced by the network.We enforce data consistency in Interleaved and Alternating networks by copying the acquired frequency space data into the network output.
We compare Interleaved and Alternating networks to a U-Net (Falk et al., 2019), the CascadeNet (Schlemper et al., 2017), which combines image space convolutions with forced data consistency at each layer of the network, and, most similar to our method, the KIKI network (Eo et al., 2018), which includes two separate image and frequency space networks.The KIKI-net architecture incorporates four networks operating in the frequency, image, frequency, and image spaces, respectively.This is in contrast to our networks, where every layer contains convolutions in both spaces and uses a custom nonlinearity for the frequency space layers.Moreover, the KIKI-net architecture imposes a data consistency constraint after each k-space subnetwork.For tasks other than undersampled image reconstruction, the data consistency constraints in CascadeNet and KIKI-net would incorrectly force the acquired k-space lines to be maintained in the final reconstruction; thus, we restrict comparisons with CascadeNet and KIKI-net to the undersampled reconstruction case.
We use implementations of the baseline methods available at https://github.com/zaccharieramzi/fastmri-reproducible-benchmark (Ramzi et al., 2020).We scale each network to have roughly 800,000 parameters for fair comparison with our joint architectures.We use an L1 loss function to train the networks and SSIM scores to evaluate their performance on the test set.

Undersampling patterns.
In addition to the random sampling scheme in Section 4.1, we simulate two traditional undersampling patterns: (i) the central 8% of lines are fully sampled while every fourth line of the outer regions of k-space is sampled and (ii) the central 4% of lines are fully sampled while every eighth line of the outer regions of k-space is sampled.dom (right).In all cases, simple networks composed of repeated copies of our joint layers perform at least as well as other state of the art networks, and in the difficult case of a random sampling pattern, outperform the baseline networks.

Results
Fig. 8 reports statistics for U-Net, CascadeNet, KIKI-net, Joint and Alternating networks.Fig. 9 provides sample image reconstructions.Interleaved and Alternating networks perform comparably to other state of the art methods on the simpler uniform undersampling tasks and outperform the state of the art methods on the more complex random undersampling task.

Unrolled Optimization
Finally, we evaluate the performance of the proposed joint layers in the setting of an unrolled optimization architecture on real world multicoil MRI data.In this experiment, we replace the image space convolutional layers with our Interleaved layers in the MoDL framework (Aggarwal et al., 2018) for unrolled optimization.We use the authors' publicly available implementation of MoDL at https://github.com/hkaggarwal/modl.Each iteration of the MoDL network first passes the input through convolutional layers that serve as a data-driven regularizer and then applies an analytical update based on the data consistency term.To keep the total number of convolutions comparable, we train the baseline MoDL network with 10 image convolutional layers in each iteration and the joint MoDL network with 5 Interleaved layers in each iteration.We set K = 5 iterations for both networks.The authors use the strategy of first training a one-iteration MoDL network and using its weights to initialize the training of a multi-iteration MoDL network.This process speeds up training of the larger unrolled optimization network and avoids instabilities.We found that pre-training of a one-iteration model was unnecessary when using the joint layers, and train both the one-iteration and the five-iteration joint MoDL networks using random initializations.For consistency with the original MoDL training approach, we train all networks using L2 loss.

Data.
We use the data from the original MoDL study (Aggarwal et al., 2018).This dataset contains raw k-space data from 3D T2 CUBE acquisitions with Cartesian readouts using a 12-channel head coil.The dataset contains 360 training slices from 4 training subjects and a single, separate test subject.We exclude some edge slices in this test volume and use the central 90 slices for our evaluations to match the training distribution.We train all networks using a variable density 6x undersampling mask as specified in the original paper.

Results
Figure 10 presents the training curves, validation SSIM, and sample reconstructions for all versions of the MoDL architecture.All networks attain similar validation SSIM values, but MoDL networks with joint layers achieve high reconstruction quality in roughly a third as many epochs as image space networks.Further, using our joint layers removes the need to pretrain a one-iteration network.The five-iteration network with joint layers trains successfully from random initializations.The resulting differences in wall clock training times are summarized in Table 1.Table 1: Training times for the full (K = 5) versions of the MoDL architecture to achieve validation SSIM ≥ 0.98.For stable training, MoDL with image space convolutions must be initialized using the weights learned for a K = 1 MoDL network.MoDL architectures trained with our joint layers require no pre-training.In total, using joint layers results in roughly an 8x speed-up over the pure image space approach.

Discussion and Conclusions
We demonstrate the advantages of joint image and frequency space learning strategies for correcting corrupted MRI data.For tasks where data consistency constraints cannot be readily applied, our joint networks produce sharper reconstructions than the more blurry, artifacted versions generated by single space networks.For the well-studied task of undersampled reconstruction, where data consistency constraints can be imposed easily, we show that networks comprising joint layers can be trained with such constraints and compare favorably to other strategies that incorporate data consistency constraints to improve the quality of single space network reconstructions.For unrolled architectures that iteratively perform the steps of an optimization procedure to produce high quality reconstructions, the joint layers can straightforwardly replace image convolutional layers to improve train-ing landscape and convergence.These results point to joint layers as a useful building block when designing neural network architectures for correcting frequency space artifacts.
While we demonstrate our method in a diverse set of acquisition scenarios, our analysis does not exhaustively cover all possible imaging artifacts.For example, we do not analyze the effects of interslice motion, which may occur in addition to the intraslice motion studied in this work and introduces new image content from an adjacent slice into the slice being imaged.Further, while we analyze extremely aggressive versions of motion, noise, and undersampling to demonstrate the effectiveness of our method in the most challenging scenarios, future versions of this method could tune these parameters to more closely match the statistics of the patient population being scanned.For example, empirically measured motion trajectories could be used to characterize the rate and severity of the induced motion artifacts.
In the future, we aim to develop additional strategies for applications where direct consistency with acquired data is not necessarily desirable, such as motion correction.We also plan to investigate local operations beyond convolutions that more directly capitalize on properties and symmetries of frequency space data for use in joint architectures.Local convolutions in the frequency space represent a subset of all possible element-wise multiplications in the image space.Thus, future work could perform these operations in the image space, saving the computational overhead of performing an FFT within each layer, or could take advantage of additional element-wise image space multiplications whose Fourier transforms are not bandlimited to the size of our filter kernels.The combination of these advances promises to significantly improve reconstruction and analysis of MRI data in the face of widely varying acquisition challenges and downstream applications.
Table 3 reports reconstruction quality measures for Interleaved network and the top singleslice methods on the FastMRI benchmark.Interleaved network achieves results that are close to the state of the art architectures specifically tuned for this task.We emphasize that our goal is not to attain state of the art performance on the FastMRI benchmark, but rather to show that simple layers comprised of both frequency and image space convolutions achieve reasonable performance on this benchmark while offering flexibility for correcting a wide range of other artifacts, and for correcting multiple artifacts present simultaneously.

Figure 1 :
Figure 1: Maps of correlation coefficients between a single pixel (center of circle) and all other pixels in image (left two panels) and frequency space (right two panels) representations of MNIST and a brain MRI dataset.All maps show strong local correlations useful for inferring missing or corrupted data in both spaces.Frequency space correlations also display conjugate symmetry characteristic of Fourier transforms of real images.

Figure 2 :
Figure 2: The Interleaved (left) and Alternating (right) layers, embedded within full network architectures.Each 'F-Conv' or 'I-Conv' block applies Batch Normalization (BN), a convolution, and an activation function in the frequency or image space, respectively.

Figure 3 :Figure 4 :Figure 5 :
Figure 3: Data generation procedure for undersampling in the presence of motion.At line L i in frequency space, the original image I is rotated and translated to form I M i .Lines from the corresponding Fourier transforms F and F M iare mixed and undersampled to generate motion-corrupted frequency space data F that would have been acquired under the illustrated motion pattern.A similar method is used to simulate pure motion corruption without undersampling, where all frequency space lines are maintained to generate F .

Figure 6 :
Figure 6: Example reconstructions from 4x undersampled, motion-corrupted data data, zoomed-in image patches, difference patches between reconstructions and ground truth images, and frequency space reconstructions.As in the motion corruption and undersampling examples, the Interleaved and Alternating architectures provide more accurate reconstructions of the ground truth images and reconstructing a more coherent k-space.

Figure 7 :
Figure 7: Example reconstructions with noise of standard deviation 10,000.The Interleaved and Alternating reconstructions remove the pixelated noise effect without over-smoothing, in contrast to the single-domain networks.

Figure 8 :
Figure8: SSIM comparison of the joint networks with the state of the art undersampled reconstruction approaches on ADNI data.Results are reported for three undersampling patterns: 4x uniform undersampling with a fully-sampled central region (left), 8x uniform undersampling with a fully-sampled central region (middle), and 4x undersampling at random (right).In all cases, simple networks composed of repeated copies of our joint layers perform at least as well as other state of the art networks, and in the difficult case of a random sampling pattern, outperform the baseline networks.

Figure 9 :
Figure 9: Example reconstructions from 4x undersampled data, with lines selected at random.The Interleaved and Alternating architectures provide more accurate reconstructions of the ground truth images, better eliminating 'ringing' and blurring artifacts.

Figure 10 :
Figure 10: Example training loss and validation SSIM curves (left) and sample reconstructions and patches for MoDL networks with K = 1, 5 iterations trained with image convolutional layers and with the proposed joint (Interleaved) layers.MoDL networks with image convolutional layers do not converge if trained directly with K = 5.Instead, a K = 1 MoDL network must be trained and used to initialize the weights of a K = 5 MoDL network.MoDL networks trained with joint layers do not require pre-training and achieve the same loss and validation SSIM values as networks trained with image convolutions in significantly less time.

Table 2 :
Image reconstruction evaluation metrics (columns) for networks trained with varying loss functions (rows) on images acquired without fat suppression.Similar trends hold for images with fat suppression.MS SSIM stands for multiscale SSIM.For metrics labeled ↓, smaller values are better; for metrics labeled ↑, larger values are better.Across nearly every training loss function and metric, the Interleaved network performs best.In almost every case, the Alternating network architecture performs similarly or only slightly worse than the Interleaved network.This is particularly true in the case of SSIM-based loss functions, which provide the best overall quantitative results across all evaluation metrics.