Repository design

See also Repository API and Can NIPY get something interesting from BrainVISA databases?

For the NIPY system, there seems to be interest in the following:

  • Easy distributed computing

  • Easy scripting, replicating the same analysis on different data

  • Flexibility - ease of interoperation with other brain imaging systems

At a minimum, this seems to entail the following requirements for the NIPY repository system:

  • Unique identifiers of data, which can be abstracted from the most local or convenient data storage

  • A mechanism for mapping the canonical data model(s) from NIPY to an arbitrary, and potentially even inconsistent, repository structure

  • A set of semantic primitives / metadata slots, enabling for example:
    • “all scans from this subject”

    • “the first scan from every subject in the control group”

    • “V1 localizer scans from all subjects”

    • “Extract the average timecourse for each subject from the ROI defined by all voxels with t > 0.005 in the V1 localizer scan for that subject”
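The simpler selections above can be sketched in a few lines of Python. This is only an illustration of what metadata slots would make possible, not a proposed NIPY API: the record layout, the slot names (subject, group, task, order) and the helper functions are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical metadata records; the slot names are illustrative only.
SCANS = [
    {"subject": "s01", "group": "control", "task": "v1_localizer", "order": 1},
    {"subject": "s01", "group": "control", "task": "rest", "order": 2},
    {"subject": "s02", "group": "patient", "task": "v1_localizer", "order": 1},
    {"subject": "s03", "group": "control", "task": "rest", "order": 1},
    {"subject": "s03", "group": "control", "task": "v1_localizer", "order": 2},
]


def select(scans, **criteria):
    """Return the scans whose metadata match every keyword criterion."""
    return [s for s in scans if all(s.get(k) == v for k, v in criteria.items())]


def first_per_subject(scans):
    """Return the lowest-order scan for each subject."""
    ordered = sorted(scans, key=itemgetter("subject", "order"))
    return [next(group) for _, group in groupby(ordered, key=itemgetter("subject"))]


# "all scans from this subject"
all_from_s01 = select(SCANS, subject="s01")

# "the first scan from every subject in the control group"
first_control = first_per_subject(select(SCANS, group="control"))

# "V1 localizer scans from all subjects"
v1_localizers = select(SCANS, task="v1_localizer")
```

The last query in the list (an ROI-based timecourse) would additionally need access to the image data itself, so it is left out of this sketch.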

These problems are not unique to brain imaging data, and in many cases they have been addressed in the domains of database design, geospatial and space telescope data, and the semantic web. Technologies of particular interest include:

  • HDF5 - the basis of MINC 2.0 (and potentially NIfTI-2), and the most recent development in the more general CDF / HDF series (and very highly regarded). There are excellent Python bindings available in PyTables.

  • Relational database design - it would be nice to efficiently select data based on any arbitrary subset of attributes associated with that data.

  • The notion of a URI, developed under the guidance of the W3C. Briefly, a URI consists of:

    • An authority (i.e. a domain name controlled by a particular entity)

    • A path - a particular resource specified by that authority

    • Abstraction from storage (as opposed to a URL) - a URI does not necessarily include the information necessary for retrieving the data referred to, though it may.

  • Ways of dealing with hierarchical data as developed in the XML field (though these strategies could be implemented potentially in other hierarchical data formats - even filesystems).
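To make the URI idea concrete, here is a minimal sketch that splits an identifier into its authority and path with the standard library, and then resolves it to a storage location through a separate lookup table. The identifier, the domain repository.example.org, the file path and the resolve helper are all made up for illustration; the point is only that the identifier stays the same wherever the data happens to live.

```python
from urllib.parse import urlsplit

# Hypothetical identifier; it names the data but need not locate it.
SCAN_URI = "http://repository.example.org/study1/s01/scan1"

# Hypothetical table mapping (authority, path) to the most convenient copy.
LOCAL_STORAGE = {
    ("repository.example.org", "/study1/s01/scan1"): "/data/cache/study1_s01_scan1.h5",
}


def resolve(uri, storage=LOCAL_STORAGE):
    """Map a URI to a local file path, or None if no local copy is known."""
    parts = urlsplit(uri)
    return storage.get((parts.netloc, parts.path))


parts = urlsplit(SCAN_URI)
authority = parts.netloc   # the domain controlled by a particular entity
path = parts.path          # the resource named by that authority
local_copy = resolve(SCAN_URI)
```

Moving the data would then mean updating only the lookup table, while scripts that refer to the URI stay unchanged.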

Note that incorporating any of the above ideas does not require the use of the actual technology referenced. For example, relational-style queries can in many cases be made more efficiently in PyTables than in a relational database, by storing everything in a single denormalized table. This data structure tends to be more efficient than the equivalent normalized relational database format in cases where a single data field is much larger than the others (as is the case with the data array in brain imaging data). That said, adherence to standards allows us to leverage existing code which may be tuned to a degree that would be beyond the scope of this project (for example, fast XPath query libraries, as made available via lxml in Python).
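As a rough illustration of the XPath point, the sketch below queries a small hierarchical description of subjects and scans. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, rather than lxml, so that it runs without extra dependencies. The element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchical metadata: subjects contain scans.
DOC = """
<repository>
  <subject id="s01" group="control">
    <scan task="v1_localizer" order="1"/>
    <scan task="rest" order="2"/>
  </subject>
  <subject id="s02" group="patient">
    <scan task="v1_localizer" order="1"/>
  </subject>
</repository>
"""

root = ET.fromstring(DOC)

# All scans belonging to subjects in the control group.
control_scans = root.findall("./subject[@group='control']/scan")

# Identifiers of subjects that have at least one V1 localizer scan.
localizer_subjects = [
    subject.get("id")
    for subject in root.findall("./subject")
    if subject.find("scan[@task='v1_localizer']") is not None
]
```

The same path expressions should carry over to lxml, which also offers a fuller XPath implementation for more complex selections.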