On my site, I host data on particle motion, including the motion of molecules in gases or liquids, as well as the motion of Lagrangian particles in turbulence. These raw datasets contain trajectories for millions of particles and are intended for further statistical analysis. Understandably, the size of each dataset would be quite large, on the order of hundreds of gigabytes, were it not for the use of a specialized compressed data format. Below I describe this format and provide code samples for reading it in Julia.
In terms of the file format, the data is stored in HDF5, however the organization of the data needs to be explained. Consider a trajectory of a single particle as a function of time , . One can store such a trajectory simply as the values , , and so on, using a fixed time interval . Moreover, this is how is typically obtained in a numerical simulation. But once the simulation is finished and the whole trajectory is known, we can expand into a Chebyshev series, i.e. using Chebyshev polynomials as a basis set. Storing the coefficients of the Chebyshev series requires significantly less space than storing the values of at fixed time intervals. In practice, this approach achieves an order-of-magnitude reduction in the size of stored data.
The Chebyshev polynomial of the first kind of order is defined as
Thus, our trajectory can be represented as a Chebyshev series truncated at order :
where is the th coefficient of the Chebyshev expansion. The argument of accounts for the scaling of the interval to .
With that brief overview, we can proceed to how the Chebyshev coefficients are stored in the HDF5 file along with the caveats regarding particular types of data.
The root of the HDF5 file contains arrays with the coefficients of the Chebyshev expansion with respect to time of various physical quantities. The time dependence of said physical quantities usually reflects their evolution along the trajectories of an ensemble of moving particles. In addition, the root of the HDF5 file contains t_start
and t_end
variables for and . If the conventional name for the time-dependent quantity is , then the corresponding array of the Chebyshev-series coefficients is also named . Unless otherwise stated, all the datasets use the Cartesian system of coordinates for spatial variables.
Let us have a concrete task of storing trajectories of particles moving in a three-dimensional space. For this, we store array of size , so that, for a given particle , its th Cartesian coordinate component at time is
Most of the data stored is that of scalar- or vector-valued functions along particle trajectories. Respectively, these are stored as or arrays. Higher-order tensors are stored similarly, for example the velocity gradient tensor will be stored as an array of dimensions . The value of the velocity gradient for particle at time can be computed as
If you read the HDF5 in Julia with its HDF5 package, the array will be loaded with dimensions reordered in reverse as . This ensures fast operations with Julia’s column-major order representation of multidimensional arrays, but requires extra care when dealing with tensors of order higher than 1.
Many datasets that I host were obtained from numerical simulations with periodic boundary conditions. That means that the actual particle coordinates are discontinuous functions of time, and storing them with Chebyshev polynomials is not optimal. So, instead of storing the actual trajectory of a particle , I store the time integral of its velocity , i.e.
The actual coordinate can be obtained by applying the periodic boundary conditions to .
Different time derivatives of the same physical quantity are not stored separately. Instead, one can calculate higher-order derivatives from the quantity itself by differentiating its Chebyshev series with respect to time. For example, only particle coordinates are stored, from which once can calculate particle velocities and accelerations. In practice time differentiation should be carried out by the software library you are using to work with Chebyshev polynomials. Note however that cannot reliably calculate arbitrary high time derivatives from the data provided. The highest order of the time derivative reliably computable from the provided dataset is explicitly specified.
Apart of the time-dependent variables stored as Chebyshev-series coefficients, each HDF5 file contains /parameters/
group which lists time-independent physical quanitites for the given dataset. In addition, the group /info/
contains description of the dataset and information how to read it, similar to this page.
To read HDF5 in Julia, you need to use the HDF5 package, while working with the Chebyshev polynomials is most conveniently done with the ApproxFun package. Below is the example code for the dataset. It demonstrates how to read
using HDF5
using ApproxFun
using ForwardDiff
using Statistics
function read_data(file_name)
data = Dict()
h5open(file_name, "r") do f
data = read(f)
end
return data
end