Data format for particle trajectories

On my site, I host data on particle motion, including the motion of molecules in gases or liquids, as well as the motion of Lagrangian particles in turbulence. These raw datasets contain trajectories for millions of particles and are intended for further statistical analysis. Understandably, the size of each dataset would be quite large, on the order of hundreds of gigabytes, were it not for the use of a specialized compressed data format. Below I describe this format and provide code samples for reading it in Julia.

Overview

In terms of the file format, the data is stored in HDF5, however the organization of the data needs to be explained. Consider a trajectory x(t)\bm x(t) of a single particle as a function of time tt, t[tstart,tend]t \in [t_\text{start}, t_\text{end}]. One can store such a trajectory simply as the values x(tstart)\bm x(t_\text{start}), x(tstart+Δt)\bm x(t_\text{start} + \Delta t), x(tstart+2Δt)\bm x(t_\text{start} + 2\Delta t) and so on, using a fixed time interval Δt\Delta t. Moreover, this is how x(t)\bm x(t) is typically obtained in a numerical simulation. But once the simulation is finished and the whole trajectory is known, we can expand x(t)\bm x(t) into a Chebyshev series, i.e. using Chebyshev polynomials as a basis set. Storing the coefficients of the Chebyshev series requires significantly less space than storing the values of x(t)\bm x(t) at fixed time intervals. In practice, this approach achieves an order-of-magnitude reduction in the size of stored data.

The Chebyshev polynomial of the first kind TnT_n of order nn is defined as

Tn(θ)=cos(narccosθ),θ[1,1].T_n(\theta) = \cos(n \arccos \theta), \qquad \theta \in [-1, 1].

Thus, our trajectory x(t)\bm x(t) can be represented as a Chebyshev series truncated at order nmaxn_\text{max}:

x(t)=n=0nmaxxnTn(2ttendtstarttendtstart),\bm x(t) = \sum_{n=0}^{n_\text{max}} \bm x_n \, T_n\biggl( \frac{2t - t_\text{end} - t_\text{start}}{t_\text{end} - t_\text{start}} \biggr),

where xn\bm x_n is the nnth coefficient of the Chebyshev expansion. The argument of TnT_n accounts for the scaling of the interval [tstart,tend][t_\text{start}, t_\text{end}] to [1,1][-1, 1].

With that brief overview, we can proceed to how the Chebyshev coefficients are stored in the HDF5 file along with the caveats regarding particular types of data.

Storing actual data

The root of the HDF5 file contains arrays with the coefficients of the Chebyshev expansion with respect to time of various physical quantities. The time dependence of said physical quantities usually reflects their evolution along the trajectories of an ensemble of moving particles. In addition, the root of the HDF5 file contains t_start and t_end variables for tstartt_\text{start} and tendt_\text{end}. If the conventional name for the time-dependent quantity is f(t)f(t), then the corresponding array of the Chebyshev-series coefficients is also named ff. Unless otherwise stated, all the datasets use the Cartesian system of coordinates for spatial variables.

Array dimensions

Let us have a concrete task of storing trajectories of NN particles moving in a three-dimensional space. For this, we store array xx of size (nmax+1)×N×3(n_\text{max} + 1) \times N \times 3, so that, for a given particle kk, its iith Cartesian coordinate component xi(k)x_i^{(k)} at time tt is

xi(k)(t)=n=0nmaxx[n,k,i]Tn(2ttendtstarttendtstart).x_i^{(k)}(t) = \sum_{n=0}^{n_\text{max}} x[n,k,i] \, T_n\biggl( \frac{2t - t_\text{end} - t_\text{start}}{t_\text{end} - t_\text{start}} \biggr).

Most of the data stored is that of scalar- or vector-valued functions along particle trajectories. Respectively, these are stored as (nmax+1)×N(n_\text{max} + 1) \times N or (nmax+1)×N×3(n_\text{max} + 1) \times N \times 3 arrays. Higher-order tensors are stored similarly, for example the velocity gradient tensor γij=uiuj\gamma_{ij} = \frac{\partial u_i}{\partial u_j} will be stored as an array γ\gamma of dimensions (nmax+1)×N×3×3(n_\text{max} + 1) \times N \times 3 \times 3. The value of the velocity gradient for particle kk at time tt can be computed as

γij(k)(t)=n=0nmaxγ[n,k,i,j]Tn(2ttendtstarttendtstart).\gamma_{ij}^{(k)}(t) = \sum_{n=0}^{n_\text{max}} \gamma[n,k,i,j] \, T_n\biggl( \frac{2t - t_\text{end} - t_\text{start}}{t_\text{end} - t_\text{start}} \biggr).

If you read the HDF5 in Julia with its HDF5 package, the array will be loaded with dimensions reordered in reverse as 3×3×N×nmax3 \times 3 \times N \times n_\text{max}. This ensures fast operations with Julia’s column-major order representation of multidimensional arrays, but requires extra care when dealing with tensors of order higher than 1.

Periodic boundary conditions

Many datasets that I host were obtained from numerical simulations with periodic boundary conditions. That means that the actual particle coordinates are discontinuous functions of time, and storing them with Chebyshev polynomials is not optimal. So, instead of storing the actual trajectory of a particle xtrue(t)\bm x_\text{true}(t), I store the time integral of its velocity u(t)\bm u(t), i.e.

x(t)=x(tstart)+tstarttendu(s)ds.\bm x(t) = \bm x(t_\text{start}) + \int_{t_\text{start}}^{t_\text{end}} \bm u(s) \, \mathrm{d} s.

The actual coordinate can be obtained by applying the periodic boundary conditions to x(t)\bm x(t).

Time derivatives

Different time derivatives of the same physical quantity are not stored separately. Instead, one can calculate higher-order derivatives from the quantity itself by differentiating its Chebyshev series with respect to time. For example, only particle coordinates are stored, from which once can calculate particle velocities and accelerations. In practice time differentiation should be carried out by the software library you are using to work with Chebyshev polynomials. Note however that cannot reliably calculate arbitrary high time derivatives from the data provided. The highest order of the time derivative reliably computable from the provided dataset is explicitly specified.

Additional information

Apart of the time-dependent variables stored as Chebyshev-series coefficients, each HDF5 file contains /parameters/ group which lists time-independent physical quanitites for the given dataset. In addition, the group /info/ contains description of the dataset and information how to read it, similar to this page.

Reading data in Julia

To read HDF5 in Julia, you need to use the HDF5 package, while working with the Chebyshev polynomials is most conveniently done with the ApproxFun package. Below is the example code for the dataset. It demonstrates how to read

using HDF5
using ApproxFun
using ForwardDiff
using Statistics

function read_data(file_name)
  data = Dict()
  h5open(file_name, "r") do f
    data = read(f)
  end
  return data
end