Data format

On my site, I host data on particle motion, including the motion of molecules in gases or liquids, as well as the motion of Lagrangian particles in turbulence. These raw datasets contain trajectories for millions of particles and are intended for further statistical analysis. Understandably, the size of each dataset would be quite large, on the order of hundreds of gigabytes, were it not for the use of a specialized compressed data format. Below I describe this format and provide code samples for reading it in Julia.

Overview

In terms of the file format, the data is stored in HDF5, however the organization of the data needs to be explained. Consider a trajectory

\bm x(t)

of a single particle as a function of time

t

t \in [t_\text{start}, t_\text{end}]

. One can store such a trajectory simply as the values

\bm x(t_\text{start})

\bm x(t_\text{start} + \Delta t)

\bm x(t_\text{start} + 2\Delta t)

and so on, using a fixed time interval

\Delta t

. Moreover, this is how

\bm x(t)

is typically obtained in a numerical simulation. But once the simulation is finished and the whole trajectory is known, we can expand

\bm x(t)

into a Chebyshev series, i.e. using Chebyshev polynomials as a basis set. Storing the coefficients of the Chebyshev series requires significantly less space than storing the values of

\bm x(t)

at fixed time intervals. In practice, this approach achieves an order-of-magnitude reduction in the size of stored data.

Chebyshev series

Thus, our trajectory

\bm x(t)

can be represented as a Chebyshev series truncated at order

n_\text{max}

where

\bm x_n

is the

n

th coefficient of the Chebyshev expansion. The form of

\theta

accounts for the scaling of the interval

[t_\text{start}, t_\text{end}]

[-1, 1]

Clenshaw algorithm

Although one can naively sum up the truncated Chebyshev series, this is unstable at the ends of the interval,

t_\text{start}

and

t_\text{end}

, especially if one also wants to compute the time derivatives of

\bm x(t)

. This problem is solved with theClenshaw algorithm.

To compute time derivatives of

\bm x(t)

, it is enough to just differentiate the Clenshaw recurrence relations. Namely, the velocity

\bm u(t) = \mathrm{d} \bm x / \mathrm{d} t

is computed via another recursive relation with new coefficients

\dot{\bm b}_n

Storing actual data

With the brief overview of chebyshev Series done, we can proceed to how the Chebyshev coefficients are stored in the HDF5 file along with the caveats regarding particular types of data.

The root of the HDF5 file contains arrays with the coefficients of the Chebyshev expansion with respect to time of various physical quantities. The time dependence of said physical quantities usually reflects their evolution along the trajectories of an ensemble of moving particles. In addition, the root of the HDF5 file contains t_start and t_end variables for

t_\text{start}

and

t_\text{end}

. If the conventional name for the time-dependent quantity is

f(t)

, then the corresponding array of the Chebyshev-series coefficients is also named

f

. Unless otherwise stated, all the datasets use the Cartesian system of coordinates for spatial variables.

Array dimensions

Let us have a concrete task of storing trajectories of

N

particles moving in a three-dimensional space. For this, we store array

x

of size

(n_\text{max} + 1) \times N \times 3

, so that, for a given particle

k

, its

i

th Cartesian coordinate component

x_i^{(k)}

at time

t

Most of the data stored is that of scalar- or vector-valued functions along particle trajectories. Respectively, these are stored as

(n_\text{max} + 1) \times N

(n_\text{max} + 1) \times N \times 3

arrays. Higher-order tensors are stored similarly, for example the velocity gradient tensor

\gamma_{ij} = \frac{\partial u_i}{\partial u_j}

will be stored as an array

\gamma

of dimensions

(n_\text{max} + 1) \times N \times 3 \times 3

. The value of the velocity gradient for particle

k

at time

t

can be computed as

If you read the HDF5 in Julia with its HDF5 package, the array will be loaded with dimensions reordered in reverse as

3 \times 3 \times N \times n_\text{max}

. This ensures fast operations with Julia’s column-major order representation of multidimensional arrays, but requires extra care when dealing with tensors of order higher than 1.

Periodic boundary conditions

Many datasets that I host were obtained from numerical simulations with periodic boundary conditions. That means that the actual particle coordinates are discontinuous functions of time, and storing them with Chebyshev polynomials is not optimal. So, instead of storing the actual trajectory of a particle

\bm x_\text{true}(t)

, I store the time integral of its velocity

\bm u(t)

, i.e.

The actual coordinate can be obtained by applying the periodic boundary conditions to

\bm x(t)

Time derivatives

Different time derivatives of the same physical quantity are not stored separately. Instead, one can calculate higher-order derivatives from the quantity itself by differentiating its Chebyshev series with respect to time. For example, only particle coordinates are stored, from which once can calculate particle velocities and accelerations. In practice time differentiation should be carried out by the software library you are using to work with Chebyshev polynomials. Note however that cannot reliably calculate arbitrary high time derivatives from the data provided. The highest order of the time derivative reliably computable from the provided dataset is explicitly specified.

Additional information

Apart of the time-dependent variables stored as Chebyshev-series coefficients, each HDF5 file contains /parameters/ group which lists time-independent physical quantities for the given dataset. In addition, the group /info/ contains description of the dataset and information how to read it, similar to this page.

Reading data in Julia

To read HDF5 in Julia, you need to use the HDF5 package, while working with the Chebyshev polynomials is most conveniently done with the ApproxFun package. Below is the example code for the dataset. It demonstrates how to read

using HDF5
using ApproxFun
using ForwardDiff
using Statistics

function read_data(file_name)
  data = Dict()
  h5open(file_name, "r") do f
    data = read(f)
  end
  return data
end

Data format for smooth particle trajectories