# Parallel Fast Fourier Transforms

Parallel FFTs are computed through a combination of global redistributions
and serial transforms. In mpi4py-fft the interface to performing such
parallel transforms is the `mpifft.PFFT`

class. The class is highly
configurable and best explained through a few examples.

## Slab decomposition

With slab decompositions we use only one group of processors and distribute only one index of a multidimensional array at the time.

Consider the complete transform of a three-dimensional array of random numbers, and of shape (128, 128, 128). We can plan the transform of such an array with the following code snippet:

```
import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray
N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=np.float, grid=(-1,))
```

Here the signature `N, axes=(0, 1, 2), dtype=np.float, grid=(-1,)`

tells us
that the created `fft`

instance is *planned* such as to slab distribute
(along first axis) and
transform any 3D array of shape `N`

and type `np.float`

. Furthermore, we
plan to transform axis 2 first, and then 1 and 0, which is exactly the reverse
order of `axes=(0, 1, 2)`

. Mathematically, the planned transform corresponds
to

Note that axis 0 is distributed on the input array and axis 1 on the output array. In the first step above we compute the transforms along axes 2 and 1 (in that order), but we cannot compute the serial transform along axis 0 since the global array is distributed in that direction. We need to perform a global redistribution, the middle step, that realigns the global data such that it is aligned in axes 0. With data aligned in axis 0, we can perform the final transform \(\mathcal{F}_{0}\) and be done with it.

Assume now that all the code in this section is stored to a file named
`pfft_example.py`

, and add to the above code:

```
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u, normalize=True) # Note that normalize=True is default and can be omitted
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)
```

Running this code with two processors (`mpirun -np 2 python pfft_example.py`

)
should raise no exception, and the output should be:

```
1 (64, 128, 128)
0 (64, 128, 128)
```

This shows that the first index has been shared between the two processors
equally. The array `u`

thus corresponds to \(u_{j_0/P,j_1,j_2}\). Note
that the `newDistArray()`

function returns a `DistArray`

object, which in turn is a subclassed Numpy ndarray. The `newDistArray()`

function uses `fft`

to determine the size and type of the created distributed
array, i.e., (64, 128, 128) and `np.float`

for both processors.
The `False`

argument indicates that the shape
and type should be that of the input array, as opposed to the output
array type (\(\hat{u}_{k_0,k_1/P,k_2}\) that one gets with `True`

).

Note that because the input array is of real type, and not complex, the output array will be of global shape:

```
128, 128, 65
```

The output array will be distributed in axis 1, so the output array shape should be (128, 64, 65) on each processor. We check this by adding the following code and rerunning:

```
u_hat = newDistArray(fft, True)
print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)
```

leading to an additional print of:

```
1 (128, 64, 65)
0 (128, 64, 65)
```

To distribute in the first axis first is default and most efficient for
row-major C arrays. However, we can easily configure the `fft`

instance
by modifying the axes keyword. Changing for example to:

```
fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=np.float)
```

and axis 1 will be transformed first, such that the global output array will be of shape (128, 65, 128). The distributed input and output arrays will now have shape:

```
0 (64, 128, 128)
1 (64, 128, 128)
0 (128, 33, 128)
1 (128, 32, 128)
```

Note that the input array will still be distributed in axis 0 and the output in axis 1. This order of distribution can be tweaked using the grid keyword. Setting grid=(1, 1, -1) will force the last axis to be distributed on the input array.

Another way to tweak the distribution is to use the `Subcomm`

class directly:

```
from mpi4py_fft.pencil import Subcomm
subcomms = Subcomm(MPI.COMM_WORLD, [1, 0, 1])
fft = PFFT(subcomms, N, axes=(0, 1, 2), dtype=np.float)
```

Here the `subcomms`

tuple will decide that axis 1 should be distributed,
because the only zero in the list `[1, 0, 1]`

is along axis 1. The ones
determine that axes 0 and 2 should use one processor each, i.e., they should
be non-distributed.

The `PFFT`

class has a few additional keyword arguments that one
should be aware of. The default behaviour of `PFFT`

is to use
one transform object for each axis, and then use these sequentially.
Setting `collapse=True`

will attempt to minimize the number of transform
objects by combining whenever possible. Take our example, the array
\(u_{j_0/P,j_1,j_2}\) can transform along both axes 1 and 2 simultaneously,
without any intermediate global redistributions. By setting
`collapse=True`

only one object of `rfftn(u, axes=(1, 2))`

will be
used instead of two (like `fftn(rfftn(u, axes=2), axes=1)`

).
Note that a collapse can also be configured through the `axes`

keyword,
using:

```
fft = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), dtype=np.float)
```

will collapse axes 1 and 2, just like one would obtain with `collapse=True`

.

If serial transforms other than `fftn()`

/`rfftn()`

and
`ifftn()`

/`irfftn()`

are required, then this can be achieved
using the `transforms`

keyword and a dictionary pointing from axes to
the type of transform. We can for example combine real-to-real
with real-to-complex transforms like this:

```
from mpi4py_fft.fftw import rfftn, irfftn, dctn, idctn
import functools
dct = functools.partial(dctn, type=3)
idct = functools.partial(idctn, type=3)
transforms = {(0,): (rfftn, irfftn), (1, 2): (dct, idct)}
r2c = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), transforms=transforms)
u = newDistArray(r2c, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = r2c.forward(u)
uj = np.zeros_like(u)
uj = r2c.backward(u_hat, uj)
assert np.allclose(uj, u)
```

As a more complex example consider a 5-dimensional array where for some reason you need to perform discrete cosine transforms in axes 1 and 2, discrete sine transforms in axes 3 and 4, and a regular Fourier transform in the first axis. Here it makes sense to collapse the (1, 2) and (3, 4) axes, which leaves only the first axis uncollapsed. Hence we can then only use one processor group and a slab decomposition, whereas without collapsing we could have used four groups. A parallel transform object can be created and tested as:

```
N = (5, 6, 7, 8, 9)
dctn = functools.partial(fftw.dctn, type=3)
idctn = functools.partial(fftw.idctn, type=3)
dstn = functools.partial(fftw.dstn, type=3)
idstn = functools.partial(fftw.idstn, type=3)
fft = PFFT(MPI.COMM_WORLD, N, ((0,), (1, 2), (3, 4)), grid=(-1,),
transforms={(1, 2): (dctn, idctn), (3, 4): (dstn, idstn)})
A = newDistArray(fft, False)
A[:] = np.random.random(A.shape)
C = fftw.aligned_like(A)
B = fft.forward(A)
C = fft.backward(B, C)
assert np.allclose(A, C)
```

## Pencil decomposition

A pencil decomposition uses two groups of processors. Each group then is
responsible for distributing one index set each of a multidimensional array.
We can perform a pencil decomposition simply by running the first example
from the previous section, but now with 4 processors. To remind you, we
put this in `pfft_example.py`

, where now `grid=(-1,)`

has been removed
in the PFFT calling:

```
import numpy as np
from mpi4py import MPI
from mpi4py_fft import PFFT, newDistArray
N = np.array([128, 128, 128], dtype=int)
fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=np.float)
u = newDistArray(fft, False)
u[:] = np.random.random(u.shape).astype(u.dtype)
u_hat = fft.forward(u)
uj = np.zeros_like(u)
uj = fft.backward(u_hat, uj)
assert np.allclose(uj, u)
print(MPI.COMM_WORLD.Get_rank(), u.shape)
```

The output of running `mpirun -np 4 python pfft_example.py`

will then be:

```
0 (64, 64, 128)
2 (64, 64, 128)
3 (64, 64, 128)
1 (64, 64, 128)
```

Note that now both the two first index sets are shared, so we have a pencil decomposition. The shared input array is now denoted as \(u_{j_0/P_0,j_1/P_1,j2}\) and the complete forward transform performs the following 5 steps:

Like for the slab decomposition, the order of the different steps is
configurable. Simply change the value of `axes`

, e.g., as:

```
fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=np.float)
```

and the input and output arrays will be of shape:

```
3 (64, 128, 64)
2 (64, 128, 64)
1 (64, 128, 64)
0 (64, 128, 64)
3 (64, 32, 128)
2 (64, 32, 128)
1 (64, 33, 128)
0 (64, 33, 128)
```

We see that the input array is aligned in axis 1, because this is the direction transformed first.

## Convolution

Working with Fourier one sometimes need to transform the product of two or more functions, like

computed with DFT as

Note

We are here assuming an even number \(N\) and use wavenumbers centered around zero.

If \(a\) and \(b\) are two Fourier series with their own coefficients:

then we can insert for the two sums from (11) in (9) and get

The final integral is \(2\pi\) for \(p+q=k\) and zero otherwise. Consequently, we get

Unfortunately, the convolution sum (13) is very expensive to compute, and the direct application of (10) leads to aliasing errors. Luckily there is a fast approach that eliminates aliasing as well.

The fast, alias-free, approach makes use of the FFT and zero-padded coefficient vectors. The idea is to zero-pad \(\hat{a}\) and \(\hat{b}\) in spectral space such that we get the extended sums

where \(M>N\) and where the coefficients have been zero-padded such that

Now compute the nonlinear term in the larger physical space and compute the convolution as

Finally, truncate the vector \(\widehat{ab}_k\) to the original range \(k\in[-N/2, \ldots, N/2-1]\), simply by eliminating all the wavenumbers higher than \(|N/2|\).

With mpi4py-fft we can compute this convolution using the `padding`

keyword
of the `PFFT`

class:

```
import numpy as np
from mpi4py_fft import PFFT, newDistArray
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (128, 128) # Global shape in physical space
fft = PFFT(comm, N, padding=[1.5, 1.5], dtype=np.complex)
# Create arrays in normal spectral space
a_hat = newDistArray(fft, True)
b_hat = newDistArray(fft, True)
a_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j
b_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j
# Transform to real space with padding
a = newDistArray(fft, False)
b = newDistArray(fft, False)
assert a.shape == (192//comm.Get_size(), 192)
a = fft.backward(a_hat, a)
b = fft.backward(b_hat, b)
# Do forward transform with truncation
ab_hat = fft.forward(a*b)
```

Note

The padded instance of the `PFFT`

class is often used in addition
to a regular non-padded class. The padded version is then used to handle
non-linearities, whereas the non-padded takes care of the rest, see demo.