Parallel Fast Fourier Transforms

Parallel FFTs are computed through a combination of global redistributions and serial transforms. In mpi4py-fft the interface for performing such parallel transforms is the mpifft.PFFT class. The class is highly configurable and best explained through a few examples.

Slab decomposition

With slab decompositions we use only one group of processors and distribute only one index of a multidimensional array at a time.

Consider the complete transform of a three-dimensional array of random numbers of shape (128, 128, 128). We can plan the transform of such an array with the following code snippet:

import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray
N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,))

Here the arguments N, axes=(0, 1, 2), dtype=float, grid=(-1,) tell us that the created fft instance is planned so as to slab-distribute (along the first axis) and transform any 3D array of shape N and type float. Furthermore, we plan to transform axis 2 first, and then 1 and 0, which is exactly the reverse order of axes=(0, 1, 2). Mathematically, the planned transform corresponds to

\[\begin{split}\tilde{u}_{j_0/P,k_1,k_2} &= \mathcal{F}_1( \mathcal{F}_{2}(u_{j_0/P, j_1, j_2})), \\ \tilde{u}_{j_0, k_1/P, k_2} &\xleftarrow[P]{1\rightarrow 0} \tilde{u}_{j_0/P, k_1, k_2}, \\ \hat{u}_{k_0,k_1/P,k_2} &= \mathcal{F}_0(\tilde{u}_{j_0, k_1/P, k_2}).\end{split}\]

Note that axis 0 is distributed on the input array and axis 1 on the output array. In the first step above we compute the transforms along axes 2 and 1 (in that order), but we cannot compute the serial transform along axis 0 since the global array is distributed in that direction. We need to perform a global redistribution, the middle step, that realigns the global data such that it is aligned in axis 0. With data aligned in axis 0, we can perform the final transform \(\mathcal{F}_{0}\) and be done with it.

Assume now that all the code in this section is stored in a file named pfft_example.py, and add to the above code:

u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True) # Note that normalize=True is default and can be omitted
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)

Running this code with two processors (mpirun -np 2 python pfft_example.py) should raise no exception, and the output should be:

1 (64, 128, 128)
0 (64, 128, 128)

This shows that the first index has been shared equally between the two processors. The array u thus corresponds to \(u_{j_0/P,j_1,j_2}\). Note that the newDistArray() function returns a DistArray object, which is a subclass of the NumPy ndarray. The newDistArray() function uses fft to determine the size and type of the created distributed array, i.e., (64, 128, 128) and float for both processors. The False argument indicates that the shape and type should be those of the input array, as opposed to those of the output array (\(\hat{u}_{k_0,k_1/P,k_2}\)), which one gets with True.
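
For illustration, a couple of quick checks along those lines can be appended to pfft_example.py (a small sketch that merely restates the properties described above):

assert isinstance(u, np.ndarray)   # DistArray subclasses the NumPy ndarray
assert u.dtype == np.float64       # the real input type planned for fft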

Note that because the input array is of real, and not complex, type, a real-to-complex transform is used along the first transformed axis (axis 2), and the output array will be of global shape:

(128, 128, 65)

The output array will be distributed in axis 1, so its local shape should be (128, 64, 65) on each processor. We check this by adding the following code and rerunning:

u_hat = newDistArray(fft, True)
print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)

leading to an additional print of:

1 (128, 64, 65)
0 (128, 64, 65)

Distributing the first axis of the input array is the default, and it is the most efficient choice for row-major C arrays. However, we can easily configure the fft instance by modifying the axes keyword. Changing for example to:

fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)

means that axis 1 will be transformed first, and the global output array will be of shape (128, 65, 128). The distributed input and output arrays will now have the local shapes:

0 (64, 128, 128)
1 (64, 128, 128)

0 (128, 33, 128)
1 (128, 32, 128)
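
These shapes can be reproduced by creating distributed arrays from the reconfigured instance and printing their local shapes (a sketch, run on two processors):

u = newDistArray(fft, False)
u_hat = newDistArray(fft, True)
print(MPI.COMM_WORLD.Get_rank(), u.shape)
print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)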

Note that the input array will still be distributed in axis 0 and the output in axis 1. This order of distribution can be tweaked using the grid keyword. Setting grid=(1, 1, -1) will force the last axis to be distributed on the input array.
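
As a sketch, with the same N as above this could look like:

# Force the input array to be distributed along the last axis (axis 2)
fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float, grid=(1, 1, -1))
u = newDistArray(fft, False)
print(MPI.COMM_WORLD.Get_rank(), u.shape)   # expected (128, 128, 64) on two processors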

Another way to tweak the distribution is to use the Subcomm class directly:

from mpi4py_fft.pencil import Subcomm
subcomms = Subcomm(MPI.COMM_WORLD, [1, 0, 1])
fft = PFFT(subcomms, N, axes=(0, 1, 2), dtype=float)

Here the subcomms tuple will decide that axis 1 should be distributed, because the only zero in the list [1, 0, 1] is along axis 1. The ones determine that axes 0 and 2 should use one processor each, i.e., they should be non-distributed.
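
A quick way to confirm the distribution is to print the local shape of an input array created from this instance (a sketch):

u = newDistArray(fft, False)
print(MPI.COMM_WORLD.Get_rank(), u.shape)   # expected (128, 64, 128) on two processors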

The PFFT class has a few additional keyword arguments that one should be aware of. The default behaviour of PFFT is to use one transform object for each axis, and then apply these sequentially. Setting collapse=True will attempt to minimize the number of transform objects by combining whenever possible. In our example, the array \(u_{j_0/P,j_1,j_2}\) can be transformed along both axes 1 and 2 in a single step, without any intermediate global redistributions. With collapse=True only one transform object, rfftn(u, axes=(1, 2)), will be used instead of two (i.e., fftn(rfftn(u, axes=(2,)), axes=(1,))). Note that a collapse can also be configured through the axes keyword:

fft = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), dtype=float)

This will collapse axes 1 and 2, just like one would obtain with collapse=True.
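
Either form can be planned and verified with a forward/backward round trip, for example (a sketch, using the same N as before):

fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,), collapse=True)
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)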

Serial transforms other than fftn()/rfftn() and ifftn()/irfftn() can be chosen using the transforms keyword and a dictionary mapping axes to the type of transform. We can for example combine real-to-real with real-to-complex transforms like this:

from mpi4py_fft.fftw import rfftn, irfftn, dctn, idctn
import functools
dct = functools.partial(dctn, type=3)
idct = functools.partial(idctn, type=3)
transforms = {(0,): (rfftn, irfftn), (1, 2): (dct, idct)}
r2c = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), transforms=transforms)
u = newDistArray(r2c, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = r2c.forward(u)
uj = np.zeros_like(u)
uj = r2c.backward(u_hat, uj)
assert np.allclose(uj, u)

As a more complex example consider a 5-dimensional array where for some reason you need to perform discrete cosine transforms in axes 1 and 2, discrete sine transforms in axes 3 and 4, and a regular Fourier transform in the first axis. Here it makes sense to collapse the (1, 2) and (3, 4) axes, which leaves only the first axis uncollapsed. Hence we can only use one processor group and a slab decomposition, whereas without collapsing we could have used four groups. A parallel transform object can be created and tested as:

from mpi4py_fft import fftw

N = (5, 6, 7, 8, 9)
dctn = functools.partial(fftw.dctn, type=3)
idctn = functools.partial(fftw.idctn, type=3)
dstn = functools.partial(fftw.dstn, type=3)
idstn = functools.partial(fftw.idstn, type=3)
fft = PFFT(MPI.COMM_WORLD, N, ((0,), (1, 2), (3, 4)), grid=(-1,),
           transforms={(1, 2): (dctn, idctn), (3, 4): (dstn, idstn)})

A = newDistArray(fft, False)
A[:] = np.random.random(A.shape)
C = fftw.aligned_like(A)
B = fft.forward(A)
C = fft.backward(B, C)
assert np.allclose(A, C)

Pencil decomposition

A pencil decomposition uses two groups of processors. Each group is then responsible for distributing one index set of a multidimensional array. We can perform a pencil decomposition simply by running the first example from the previous section, but now with 4 processors. To remind you, we put this in pfft_example.py, where grid=(-1,) has now been removed from the PFFT call:

import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float)
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)

The output of running mpirun -np 4 python pfft_example.py will then be:

0 (64, 64, 128)
2 (64, 64, 128)
3 (64, 64, 128)
1 (64, 64, 128)

Note that now the two first index sets are both shared, so we have a pencil decomposition. The shared input array is now denoted as \(u_{j_0/P_0,j_1/P_1,j_2}\) and the complete forward transform performs the following 5 steps:

\[\begin{split}\tilde{u}_{j_0/P_0,j_1/P_1,k_2} &= \mathcal{F}_{2}(u_{j_0/P_0, j_1/P_1, j_2}), \\ \tilde{u}_{j_0/P_0, j_1, k_2/P_1} &\xleftarrow[P_1]{2\rightarrow 1} \tilde{u}_{j_0/P_0, j_1/P_1, k_2}, \\ \tilde{u}_{j_0/P_0,k_1,k_2/P_1} &= \mathcal{F}_1(\tilde{u}_{j_0/P_0, j_1, k_2/P_1}), \\ \tilde{u}_{j_0, k_1/P_0, k_2/P_1} &\xleftarrow[P_0]{1\rightarrow 0} \tilde{u}_{j_0/P_0, k_1, k_2/P_1}, \\ \hat{u}_{k_0,k_1/P_0,k_2/P_1} &= \mathcal{F}_0(\tilde{u}_{j_0, k_1/P_0, k_2/P_1}).\end{split}\]
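
The spectral output is now shared over both processor groups as well, which can be confirmed by also printing the shape of u_hat (a sketch):

print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)   # expected (128, 64, 33) or (128, 64, 32) with 4 processors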

Like for the slab decomposition, the order of the different steps is configurable. Simply change the value of axes, e.g., as:

fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)

and the input and output arrays will be of shape:

3 (64, 128, 64)
2 (64, 128, 64)
1 (64, 128, 64)
0 (64, 128, 64)

3 (64, 32, 128)
2 (64, 32, 128)
1 (64, 33, 128)
0 (64, 33, 128)

We see that the input array is aligned in axis 1, because this is the direction transformed first.

Convolution

Working with Fourier transforms one sometimes needs to transform the product of two or more functions, as in

(9)\[\widehat{ab}_k = \int_{0}^{2\pi} a b e^{-i k x} dx, \quad \forall k \in [-N/2, \ldots, N/2-1]\]

computed with DFT as

(10)\[\widehat{ab}_k = \frac{1}{N}\sum_{j=0}^{N-1}a_j b_j e^{-2\pi i j k / N}, \quad \forall \, k\in [-N/2, \ldots, N/2-1].\]

Note

We are here assuming an even number \(N\) and use wavenumbers centered around zero.

If \(a\) and \(b\) are two Fourier series with their own coefficients:

(11)\[\begin{split}a &= \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x}, \\ b &= \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x},\end{split}\]

then we can insert for the two sums from (11) in (9) and get

(12)\[\begin{split}\widehat{ab}_k &= \int_{0}^{2\pi} \left( \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x} \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x} \right) e^{-i k x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1] \\ \widehat{ab}_k &= \sum_{p=-N/2}^{N/2-1} \sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_q \int_{0}^{2\pi} e^{-i (p+q-k) x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1]\end{split}\]

The final integral is \(2\pi\) for \(p+q=k\) and zero otherwise. Consequently, we get

(13)\[\widehat{ab}_k = 2\pi \sum_{p=-N/2}^{N/2-1}\sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_{q} \delta_{p+q, k} , \quad \forall \, k \in [-N/2, \ldots, N/2-1]\]

Unfortunately, the convolution sum (13) is very expensive to compute, and the direct application of (10) leads to aliasing errors. Luckily there is a fast approach that eliminates aliasing as well.

The fast, alias-free, approach makes use of the FFT and zero-padded coefficient vectors. The idea is to zero-pad \(\hat{a}\) and \(\hat{b}\) in spectral space such that we get the extended sums

\[\begin{split}A_j &= \sum_{p=-M/2}^{M/2-1} \hat{\hat{a}}_p e^{2 \pi i p j/M}, \\ B_j &= \sum_{q=-M/2}^{M/2-1} \hat{\hat{b}}_q e^{2 \pi i q j/M},\end{split}\]

where \(M>N\) and where the coefficients have been zero-padded such that

\[\begin{split}\hat{\hat{a}}_p = \begin{cases} \hat{a}_p, &\forall |p| \le N/2 \\ 0, &\forall |p| \gt N/2 \end{cases}\end{split}\]

Now compute the product in the larger physical space and obtain the convolution as

(14)\[\widehat{ab}_k = \frac{1}{M} \sum_{j=0}^{M-1} A_j B_j e^{- 2 \pi i k j/M}, \quad \forall \, k \in [-M/2, \ldots, M/2-1]\]

Finally, truncate the vector \(\widehat{ab}_k\) to the original range \(k\in[-N/2, \ldots, N/2-1]\), simply by discarding all coefficients with wavenumbers outside this range.
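
The procedure is perhaps easiest to appreciate in one dimension. The following sketch, using plain NumPy and the 3/2-rule \(M=3N/2\), checks that the padded product reproduces the exact convolution sum \(\sum_{p+q=k}\hat{a}_p\hat{b}_q\) over the retained wavenumbers; the helper pad and the chosen N are for illustration only:

import numpy as np

N = 16              # number of retained modes (assumed even)
M = 3*N//2          # padded size from the 3/2-rule

# Wavenumbers in numpy.fft ordering: 0, 1, ..., N/2-1, -N/2, ..., -1
k = np.fft.fftfreq(N, 1./N).astype(int)

rng = np.random.default_rng(1)
a_hat = rng.standard_normal(N) + 1j*rng.standard_normal(N)
b_hat = rng.standard_normal(N) + 1j*rng.standard_normal(N)

def pad(c_hat):
    """Zero-pad spectral coefficients from N to M modes (numpy.fft ordering)."""
    c_pad = np.zeros(M, dtype=complex)
    c_pad[:N//2] = c_hat[:N//2]       # k = 0, ..., N/2-1
    c_pad[-N//2:] = c_hat[-N//2:]     # k = -N/2, ..., -1
    return c_pad

# Physical space on the fine grid: A_j = sum_p a_p exp(2*pi*i*p*j/M)
A = np.fft.ifft(pad(a_hat))*M
B = np.fft.ifft(pad(b_hat))*M

# Forward transform of the product with 1/M normalization, then truncate to N modes
ab_pad = np.fft.fft(A*B)/M
ab_hat = np.concatenate((ab_pad[:N//2], ab_pad[-N//2:]))

# Reference: the convolution sum over p + q = k, restricted to the retained range
ab_ref = np.zeros(N, dtype=complex)
for i, p in enumerate(k):
    for j, q in enumerate(k):
        if -N//2 <= p + q < N//2:
            ab_ref[(p + q) % N] += a_hat[i]*b_hat[j]

assert np.allclose(ab_hat, ab_ref)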

With mpi4py-fft we can compute this convolution using the padding keyword of the PFFT class:

import numpy as np
from mpi4py_fft import PFFT, newDistArray
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = (128, 128)   # Global shape of the non-padded arrays
fft = PFFT(comm, N, padding=[1.5, 1.5], dtype=complex)

# Create arrays in normal spectral space
a_hat = newDistArray(fft, True)
b_hat = newDistArray(fft, True)
a_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j
b_hat[:] = np.random.random(b_hat.shape) + np.random.random(b_hat.shape)*1j

# Transform to real space with padding
a = newDistArray(fft, False)
b = newDistArray(fft, False)
assert a.shape == (192//comm.Get_size(), 192)
a = fft.backward(a_hat, a)
b = fft.backward(b_hat, b)

# Do forward transform with truncation
ab_hat = fft.forward(a*b)

Note

The padded instance of the PFFT class is often used in addition to a regular, non-padded instance. The padded version is then used to handle nonlinearities, whereas the non-padded instance takes care of the rest, see demo.
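
A minimal sketch of that pattern is given below. It assumes, as the note describes, that the padded and the non-padded instances share the same (unpadded) spectral shape, and all variable names are illustrative only:

import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray

comm = MPI.COMM_WORLD
N = (128, 128)
fft = PFFT(comm, N, dtype=complex)                          # non-padded, for linear terms
fft_pad = PFFT(comm, N, padding=[1.5, 1.5], dtype=complex)  # padded, for the nonlinearity

# Spectral coefficients live in the normal (unpadded) spectral space
u_hat = newDistArray(fft, True)
u_hat[:] = np.random.random(u_hat.shape) + np.random.random(u_hat.shape)*1j

# Nonlinear term u*u: transform to the padded grid, multiply, transform back with truncation
u_pad = newDistArray(fft_pad, False)
u_pad = fft_pad.backward(u_hat, u_pad)
uu_hat = fft_pad.forward(u_pad*u_pad)

# Anything linear uses the regular instance on the original grid
u = newDistArray(fft, False)
u = fft.backward(u_hat, u)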