Leveraging LINDI for efficient and non-redundant NWB access on DANDI

INCF Neuroinformatics Assembly, September 26, 2024

Jeremy Magland, Center for Computational Mathematics, Flatiron Institute

With Ryan Ly and Oliver Ruebel

Thanks: NWB team, DANDI team, CatalystNeuro

LINDI Objectives

LINDI (LINked Data Interface) sets out to solve two challenges for NWB data on DANDI:

What is a LINDI file?

Get the details here.

LINDI

Deep dive into the JSON format

LINDI JSON contains:

It does not contain:

Essentially it contains as much information as possible without the large binary data.

Advantages of JSON over HDF5:

Advantages of leaving out large binary data:

Here’s an example .nwb.lindi.json file

If you are familiar with Zarr, you’ll recognize the .zattrs, .zgroup, .zarray, and chunk files. That’s because LINDI is a valid Zarr store!

Here’s an excerpt from this .nwb.lindi.json file:

{
    "acquisition/ElectricalSeriesEEG/.zattrs": {
      "comments": "no comments",
      "description": "voltage measured over time and associated timestamps from EEG array",
      "namespace": "core",
      "neurodata_type": "ElectricalSeries",
      "object_id": "7e71efad-e026-459c-8a01-5a6e61fa4214"
    }
}

You’ll see that the content of the .zattrs Zarr file is embedded in the JSON file.

Similarly for .zarray files:

{
    "acquisition/ElectricalSeriesEEG/data/.zarray": {
      "chunks": [ 83835, 1 ],
      "compressor": null,
      "dtype": "<i2",
      "fill_value": 0,
      "filters": [ { "id": "zlib", "level": 4 } ],
      "order": "C",
      "shape": [ 21461760, 30 ],
      "zarr_format": 2
    }
}
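
As a rough illustration of what this .zarray metadata implies, the sketch below works out the chunk grid, the uncompressed chunk size, and the key of the chunk that holds a given element. The helper names are made up for this example, the dtype parsing is a simplification that only handles simple numeric dtypes such as "<i2", and the key layout assumes the default Zarr v2 "." separator between chunk coordinates.

```python
import math

# Values copied from the .zarray excerpt above.
zarray = {
    "chunks": [83835, 1],
    "dtype": "<i2",
    "shape": [21461760, 30],
}


def chunk_grid(shape, chunks):
    """Number of chunks along each axis (the last chunk may be partial)."""
    return [-(-s // c) for s, c in zip(shape, chunks)]  # integer ceiling division


def uncompressed_chunk_bytes(chunks, dtype):
    """Size of one full chunk before filters/compression are applied."""
    # Simplification: for dtypes like "<i2" or "<f8" the trailing digits
    # give the item size in bytes.
    itemsize = int(dtype[2:])
    return math.prod(chunks) * itemsize


def chunk_key(index, chunks, prefix):
    """Key of the chunk holding the element at `index` (assumes '.' separator)."""
    coords = [i // c for i, c in zip(index, chunks)]
    return prefix + "/" + ".".join(str(c) for c in coords)


grid = chunk_grid(zarray["shape"], zarray["chunks"])
print(grid)                                   # chunks per axis
print(math.prod(grid))                        # total number of chunks
print(uncompressed_chunk_bytes(zarray["chunks"], zarray["dtype"]))
print(chunk_key((100000, 5), zarray["chunks"],
                "acquisition/ElectricalSeriesEEG/data"))
```

Each of those chunk keys is what a reader looks up in the JSON file to find where the bytes actually live.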

The tricky part is the data chunks (this is inspired by Kerchunk and the reference file system of fsspec).

{
    "acquisition/ElectricalSeriesEEG/timestamps/1015": [
      "https://api.dandiarchive.org/api/dandisets/000458/versions/0.230317.0039/assets/db2372af-f041-42c8-a5f1-594be5a83c9e/download/",
      960839955,
      51593
    ]
}

This is an external reference to a chunk of data in a remote file. In this case we are looking at chunk 1015 of the timestamps dataset in the ElectricalSeriesEEG group, which refers to 51,593 bytes of data starting at byte 960,839,955 in the remote HDF5 file.
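
To make the [url, offset, length] triple concrete, here is a minimal sketch of how such a reference could be turned into an HTTP range request using only the standard library. The function names are invented for this example and this is not how the lindi package itself is implemented. The bytes that come back are the stored chunk, so any filters or compression listed in the matching .zarray would still need to be undone.

```python
import urllib.request


def parse_chunk_ref(ref):
    """Split a [url, offset, length] reference into typed parts."""
    url, offset, length = ref
    return url, int(offset), int(length)


def range_header(offset, length):
    """HTTP Range header value; byte ranges are inclusive at both ends."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_chunk_bytes(ref):
    """Download the raw stored bytes of one chunk (requires network access)."""
    url, offset, length = parse_chunk_ref(ref)
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# The reference from the excerpt above.
ref = [
    "https://api.dandiarchive.org/api/dandisets/000458/versions/0.230317.0039/"
    "assets/db2372af-f041-42c8-a5f1-594be5a83c9e/download/",
    960839955,
    51593,
]
print(range_header(ref[1], ref[2]))  # bytes=960839955-960891547
```

Only the requested 51,593 bytes cross the network, which is why the small JSON file is enough to navigate a very large remote HDF5 file.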

If LINDI is Zarr, how does it represent HDF5?

Zarr and HDF5 are both hierarchical data formats, but they differ in some important ways:

| HDF5 | Zarr | Compatibility notes |
|---|---|---|
| Groups | Groups | Compatible |
| Datasets | Arrays | Compatible |
| Scalar datasets | No direct equivalent | Not compatible |
| Attributes | Attributes | Largely compatible with some caveats |
| Chunking | Chunking | Largely compatible, default chunking is different |
| Compression | Compression | Compatible for the most usual cases, but Zarr allows custom compressors via numcodecs |
| Datatypes | Datatypes | Compatible for the most common cases (int16, uint16, int32, uint32, float32, float64, etc.) |
| References | No direct equivalent | Not compatible |
| Soft links | No direct equivalent | Not compatible |
| Compound datatypes | No direct equivalent | Not compatible |

LINDI contains special conventions for representing some HDF5 features used by NWB that are not directly compatible with Zarr.

For example, in the above .nwb.lindi.json file:

{
    ".zattrs": {
      ".specloc": {
        "_REFERENCE": {
          "object_id": null,
          "path": "/specifications",
          "source": ".",
          "source_object_id": "9adc2c77-b588-4966"
        }
      },
      "namespace": "core",
      "neurodata_type": "NWBFile",
      "nwb_version": "2.5.0",
      "object_id": "9adc2c77-b588-4966-b630-0d12c3059156"
    }
}
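
In this excerpt the HDF5 object reference stored in the .specloc attribute is encoded as a JSON object under a "_REFERENCE" key. As a sketch of how a reader might detect that convention, the hypothetical helper below scans an attributes dictionary and collects the target path of every such reference. It relies only on the fields visible in the excerpt and makes no claim about other conventions LINDI may define.

```python
def find_references(attrs):
    """Map attribute name -> target path for values using the _REFERENCE form."""
    found = {}
    for name, value in attrs.items():
        if isinstance(value, dict) and "_REFERENCE" in value:
            found[name] = value["_REFERENCE"]["path"]
    return found


# Root attributes abbreviated from the excerpt above.
root_attrs = {
    ".specloc": {
        "_REFERENCE": {
            "object_id": None,
            "path": "/specifications",
            "source": ".",
            "source_object_id": "9adc2c77-b588-4966",
        }
    },
    "namespace": "core",
    "neurodata_type": "NWBFile",
}

print(find_references(root_attrs))  # {'.specloc': '/specifications'}
```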

Precomputation of LINDI files for public Dandisets

Neurosift uses precomputed LINDI files to efficiently stream NWB files from public Dandisets.

diagram1

Neurosift load times with and without LINDI

Compare Neurosift load speeds with and without precomputed LINDI files for Dandiset 000458:

Python streaming with and without LINDI

There are a number of ways to stream NWB files in Python. Here are four examples:

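# Imports needed by the four examples below
import lindi
import remfile
from fsspec import filesystem
from h5py import File as H5pyFile
from pynwb import NWBHDF5IO

# Precomputed LINDI (fast)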
def stream_nwbfile_lindi_precomputed(nwb_url: str):
    file = lindi.LindiH5pyFile.from_lindi_file(nwb_url)
    io = NWBHDF5IO(file=file, load_namespaces=True)
    nwbfile = io.read()
    return nwbfile, io

# HDF5 via lindi (medium)
def stream_nwbfile_lindi(nwb_url: str):
    file = lindi.LindiH5pyFile.from_hdf5_file(nwb_url)
    io = NWBHDF5IO(file=file, load_namespaces=True)
    nwbfile = io.read()
    return nwbfile, io

# HDF5 via remfile (medium)
def stream_nwbfile_remfile(nwb_url: str):
    file = remfile.File(nwb_url)
    h5f = H5pyFile(file, mode="r")
    io = NWBHDF5IO(file=h5f, load_namespaces=True)
    nwbfile = io.read()
    return nwbfile, io

# HDF5 via fsspec (slow)
def stream_nwbfile_fsspec(nwb_url: str):
    fs = filesystem("http")
    file_system = fs.open(nwb_url, "rb")
    file = H5pyFile(file_system, mode="r")
    io = NWBHDF5IO(file=file, load_namespaces=True)
    nwbfile = io.read()
    return nwbfile, io

The timings are compared for the Dandiset 000485 example using this script.

Here are the results produced by a GitHub Actions workflow.

On 9/23/2024 for the 000485 example, the average elapsed times were:

| Method | Average elapsed time (s) |
|---|---|
| lindi precomputed | 1.23 |
| remfile | 2.79 |
| lindi | 2.81 |
| fsspec | 14.96 |
| ros3 | Not tested |

Here, the precomputed LINDI method is the fastest (1.23 s), the remfile and lindi methods are roughly tied (about 2.8 s each), and the fsspec method is by far the slowest (14.96 s).

Note that this is only for the initial load of the NWB file. Subsequent access to data chunks may be more comparable between the methods.
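
For readers who want to run a similar comparison themselves, here is a minimal timing sketch. It is not the benchmark script referred to above: the helper name, the repeat count, and the stand-in loader are assumptions made for illustration. It expects a loader that returns an (nwbfile, io) pair, like the four functions shown earlier, and closes the io object after each run.

```python
import time
from statistics import mean


def average_elapsed(load, url, repeats=3):
    """Average wall-clock seconds taken by `load(url)` over `repeats` runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        _nwbfile, io = load(url)          # initial load only, no data access
        elapsed.append(time.perf_counter() - start)
        io.close()                        # release the file between runs
    return mean(elapsed)


# Stand-in loader so the sketch runs without network access or pynwb.
class _FakeIO:
    def close(self):
        pass


def fake_loader(url):
    time.sleep(0.01)                      # pretend to do some work
    return object(), _FakeIO()


print(f"{average_elapsed(fake_loader, 'unused-url', repeats=2):.3f} s")
```

Swapping fake_loader for one of the real streaming functions and passing an asset URL would time the initial load for that method.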

Derivative NWB files with LINDI

graph TD
    A[Raw NWB file on DANDI] -->|Dendro Job| B[Derived LINDI NWB file]
    B -->|Dendro Job| C[Second derived LINDI NWB file]
    C -->|Dendro Job| D[Upload to DANDI]

CEBRA example, Dandiset 000140

CEBRA is a machine-learning method that can be used to compress time series in a way that reveals otherwise hidden structures in the variability of the data.

Let’s take a look at

Dandiset 000140 – MC_Maze_Small: macaque primary motor and dorsal premotor cortex spiking activity during delayed reaching

Open one of the sessions in Neurosift

image

We’ve got a trials table (100 trials), three SpatialSeries objects (cursor_pos, eye_pos, hand_pos), and 142 neural Units.

Click on the “Units” link and then the “CEBRA” tab. Here you can queue up a job to compute a CEBRA embedding for the neural activity.

image

This produces a new NWB file with the CEBRA embedding added on as a new TimeSeries object. Click on the “View output in Neurosift” link to view the output file.

image

Notice there is a new object at processing/CEBRA/embedding.

Tick the checkboxes for “trials” and “embedding” and then click “View 2 items” in the left panel to get a synchronized view of the trials and the CEBRA embedding.

image

You can see that the embedding has periodic structure that matches the trial structure! This is significant because in this case we did not provide the trial structure or the behavioral data to the CEBRA process. It was able to infer the trial structure from the neural data alone.

LINDI binary (tar) format

That previous example was a .nwb.lindi.tar file, instead of a .nwb.lindi.json file. This is needed when we are writing new large data blobs into the derived file, as the JSON format is not suitable for large binary data.
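
One way to see why JSON is a poor container for large blobs: JSON has no binary type, so bytes have to be encoded as text. The sketch below assumes base64 as the encoding, which is a common choice but is only an assumption here, and measures how much larger the JSON document is than the binary payload it carries.

```python
import base64
import json
import os

# 3 MB of random bytes standing in for a compressed data chunk.
blob = os.urandom(3_000_000)

# base64 maps every 3 input bytes to 4 ASCII characters.
encoded = base64.b64encode(blob).decode("ascii")
as_json = json.dumps({"data": encoded})

ratio = len(as_json) / len(blob)
print(len(blob), len(encoded), f"{ratio:.3f}")
```

On top of the roughly one-third size overhead, the whole JSON document has to be parsed before any single value can be read, which works against chunk-wise access to large arrays.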

Read more about the LINDI tar format here with the technical specification here.

Derivative NWB files on DANDI

Right now DANDI does not support the LINDI format, but I am hoping that they can move toward supporting it. I hope this presentation will begin to make the case.

For now I am uploading test LINDI files to the staging DANDI Archive.

Let’s return to the above example. Click on the DENDRO tab and you can see a button for uploading to the DANDI staging archive.

image

You can see the uploaded file on DANDI here.

image

Spike sorting example

Here’s an example of a derivative NWB file produced by a spike sorting process.

image

Notice that the file size is only 11 MB even though it contains references to the full raw data with a size of 69 GB.

Open in Neurosift and you can see the spike sorting results in processing/ecephys -> units_kilosort4. This is in addition to the original spike trains in the units table at the default location.

image

You can view the raster plot, average waveforms, autocorrelograms, and unit locations. Here’s a synchronized view of average waveforms and autocorrelograms.

image

Concerns and considerations

Keeping LINDI references alive

Derivative NWB files in LINDI format may reference other DANDI assets. We need a mechanism that ensures published files reference only published assets: drafts can depend on drafts, but published versions should depend only on published versions. We still need to think about how this will work.
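
As a starting point for that discussion, here is a hypothetical sketch of one possible check. It scans the chunk references in a LINDI-style mapping and reports those whose URL points at a draft version. It assumes the URLs follow the /dandisets/<id>/versions/<version>/ pattern seen in the chunk reference earlier and that draft versions are literally named "draft". References that use other URL forms would not be caught, so this is not a complete policy.

```python
import re

# Matches the Dandiset id and version segment of a DANDI API URL.
_VERSION_RE = re.compile(r"/dandisets/(\d+)/versions/([^/]+)/")


def draft_references(refs):
    """Return the keys of chunk references that point at a draft version."""
    drafts = []
    for key, value in refs.items():
        # Chunk references are [url, offset, length]; skip inline metadata.
        is_ref = isinstance(value, list) and len(value) == 3
        if not is_ref or not isinstance(value[0], str):
            continue
        match = _VERSION_RE.search(value[0])
        if match and match.group(2) == "draft":
            drafts.append(key)
    return drafts


# Made-up example entries; the second URL is illustrative only.
example_refs = {
    "a/data/0": [
        "https://api.dandiarchive.org/api/dandisets/000458/versions/"
        "0.230317.0039/assets/db2372af-f041-42c8-a5f1-594be5a83c9e/download/",
        0,
        100,
    ],
    "a/data/1": [
        "https://api.dandiarchive.org/api/dandisets/000458/versions/draft/"
        "assets/example-asset-id/download/",
        100,
        100,
    ],
    "a/.zattrs": {"neurodata_type": "TimeSeries"},
}

print(draft_references(example_refs))  # ['a/data/1']
```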

Should we really use a new file format?

Keep in mind that LINDI is compatible with:

No existing tool enables what we are trying to do.