Can NIPY get something interesting from BrainVISA databases?

I wrote this document to give the NIPY developers more information about the present and future of the BrainVISA database system. I hope it will serve the discussion opened by Jarrod Millman about a possible collaboration between the two projects on this topic. Unfortunately, I do not know enough about other projects providing similar features (such as BIRN), so I will focus only on BrainVISA.

Yann Cointepas

2006-11-21

Introduction

In BrainVISA, the whole database system is home-made and written in Python. It is based on the file system and allows requests both for reading and for writing (i.e. obtaining the names of files that do not exist yet). We plan to change this in the future by defining an API (such as the one introduced below) and by providing at least two implementations: one relying on a relational database system and one compatible with the current database system. Having a single API will make it possible, for instance, to work both on huge databases located on servers and on smaller databases located in a laptop directory (with some synchronization features). This system will be independent from the BrainVISA application and could be packaged separately. Unfortunately, we cannot say when this work will be done (our developments are slowed down because our whole lab will move to a new institute in January 2007). Here is a summary describing the current BrainVISA database system and some thoughts about what it may become.

What is a database in BrainVISA today?

A directory is a BrainVISA database if the structure of its sub-directories and the file names in this directory respect a set of rules. These rules make it possible for BrainVISA to scan the whole directory contents and to identify the database elements without ambiguity. These elements are composed of the following information (a small illustrative sketch follows the list):

  • Data type: identifies the content of a data item (image, mesh, functional image, anatomical MR, etc.). The data types are organized in a hierarchy, making it possible to derive several specialized types from a generic type. For example, there is a 4D Image type which is specialized into 3D Image; 3D Image is itself specialized into several types, among which T1 MRI and Brain mask.

  • File format: represents the format of the files used to record a data item. BrainVISA is able to recognize several file formats (for example DICOM, Analyze/SPM, GIS, etc.). It is easy to add new data formats and to provide converters so that existing processes can use these new formats.

  • Files: contains the names of the files (and/or directories) used to record the data.

  • Attributes: an attribute is an association between a name and a value. A set of attributes is associated with each element of a BrainVISA database. This set represents all of the characteristics of a data item (such as the image size, the name of the protocol corresponding to the data or the acquisition parameters). Attribute values are set by BrainVISA during directory scanning (typically protocol, group, subject, etc.).
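To make this more concrete, here is a small sketch of what one such database element could contain once BrainVISA has scanned the directory. The values and the Python structure are purely illustrative; they are not the actual BrainVISA data structures:

# Illustrative only: one database element as a plain Python structure.
element = {
  'type': 'T1 MRI',                  # data type (a leaf of the type hierarchy)
  'format': 'Analyze/SPM',           # file format used to record the data
  'files': [ 'subject01_T1.img',     # files (and/or directories) holding the data
             'subject01_T1.hdr' ],
  'attributes': {                    # name/value pairs set during directory scanning
    'protocol': 'my_protocol',
    'group': 'patients',
    'subject': 'subject01',
  },
}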

It is possible to completely define the set of rules used to convert a directory into a BrainVISA database. This allows the use of BrainVISA without having to modify an existing file organization. However, writing such a system of rules requires very good knowledge of BrainVISA. This is why BrainVISA is provided with a default data organization system that can be used easily.
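As an illustration of what such a rule could look like, here is a simplified sketch that maps a relative path pattern to a data type and attributes. The pattern, the helper name and the rule syntax are mine; the actual BrainVISA rule system is richer than this:

import re

# Simplified, hypothetical rule: <protocol>/<subject>/t1mri/<subject>.img is a T1 MRI.
T1_RULE = re.compile( r'^(?P<protocol>[^/]+)/(?P<subject>[^/]+)/t1mri/(?P=subject)\.img$' )

def identifyElement( relativePath ):
  # Return ( dataType, attributes ) if the path matches the rule, None otherwise.
  match = T1_RULE.match( relativePath )
  if match is None:
    return None
  return 'T1 MRI', { 'protocol': match.group( 'protocol' ),
                     'subject': match.group( 'subject' ) }

# identifyElement( 'my_protocol/subject01/t1mri/subject01.img' )
# -> ( 'T1 MRI', { 'protocol': 'my_protocol', 'subject': 'subject01' } )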

A database can also be used to decide where to write data. The set of rules is used to generate the appropriate file name according to the data type, file format and attributes. This is a key feature that greatly helps the users and allows automation.
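The same kind of rule can be run in the opposite direction to build an output file name from the data type, file format and attributes. A minimal sketch, again with an invented template and function name:

# Hypothetical reverse rule: where should a new Brain mask in GIS format be written?
def generateFileName( dataType, fileFormat, attributes ):
  if dataType == 'Brain mask' and fileFormat == 'GIS':
    return '%(protocol)s/%(subject)s/segmentation/%(subject)s_brain_mask.ima' % attributes
  raise ValueError( 'no naming rule for ' + dataType + ' in ' + fileFormat )

# generateFileName( 'Brain mask', 'GIS', { 'protocol': 'my_protocol', 'subject': 'subject01' } )
# -> 'my_protocol/subject01/segmentation/subject01_brain_mask.ima'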

It is not mandatory to use a database to process data with BrainVISA. However, some important features are not available when the data are not in a database. For example, BrainVISA's ability to construct a default output file name when an input data item is selected in a process relies on the database system. Moreover, some processes use the database system to find data; for example, the brain mask viewer tries to find the T1 MRI that was used to build the brain mask, in order to superimpose both images in an Anatomist window.
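That last example boils down to a query on shared attributes. The function below assumes elements are represented as in the earlier sketch; it is only an illustration of the idea, not the code of the actual viewer:

# Illustrative only: find the T1 MRI sharing protocol and subject with a brain mask.
def findSourceT1( brainMask, databaseElements ):
  for element in databaseElements:
    if ( element[ 'type' ] == 'T1 MRI'
         and element[ 'attributes' ][ 'protocol' ] == brainMask[ 'attributes' ][ 'protocol' ]
         and element[ 'attributes' ][ 'subject' ] == brainMask[ 'attributes' ][ 'subject' ] ):
      return element
  return None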

A few thoughts about a possible API for repositories

I think the most important point for data repositories is to define a user API. This API should be independent of data storage and of data organization. Independence from data organization is important because it is very difficult to find a single organization that covers the needs of all users in the long term. In this API, each data item should have a unique identifier (let's call it a URL). The rest of the API could be divided into two parts (a rough sketch follows the list):

  1. An indexation system managing data organization. It defines properties attached to data items (for instance, “group” or “subject” can be seen as properties of an FMRI image) as well as possible user requests on the data. This indexation API could have several implementations (relational database, BIRN, BrainVISA, etc.).

  2. A data storage system managing the link between the URL of a data item and its representation on a local file system. This system should take into account various file formats and various file storage systems (e.g. on a local file system, on a remote FTP site, as blocks of bytes in a relational database).
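To fix ideas, here is a rough sketch of what these two parts could look like as Python interfaces. None of these names or signatures are settled; a repository (described below) would combine one implementation of each:

# Hypothetical interfaces; all names and signatures are open for discussion.
class IndexationSystem:
  # Manages data organization: properties attached to data items and requests on them.
  def select( self, request ):
    # Yield the URLs of the data items matching the request, with their properties.
    raise NotImplementedError
  def getProperties( self, url ):
    # Return the properties attached to one data item.
    raise NotImplementedError

class DataStorageSystem:
  # Manages the link between the URL of a data item and files on the local file system.
  def getLocalFiles( self, url, format=None ):
    # Return local file name(s), transferring and/or converting the data if necessary.
    raise NotImplementedError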

This separation between indexation and storage is important for the design of databases: it makes it possible, for instance, to use remote or local data storage, or to define several indexations (i.e. several data organizations) for the same data. However, indexation and data storage are not always independent. For example, they are independent if we use a relational database for indexation and URLs for storage, but they are not if file or directory names carry indexation information (as in the BrainVISA databases described above). At the user level, things can be simpler because the separation can be hidden in one object: the repository. A repository is composed of one indexation system and one data storage system and manages all the links between them. The user can send requests to the repository and receive a set of data items. Each data item contains indexation information (via the indexation system) and gives access to the data (via the storage system). Here is a sample of what user code could look like, to illustrate what I have in mind, followed by a few comments:

# Get an access to one repository
repository = openRepository( repositoryURL )
# Create a request selecting all the FMRI data items in the repository
request = 'SELECT * FROM FMRI'
# Iterate on the matching data items in the repository
for item in repository.select( request ):
  print item.url
  # A data item is a dictionary-like structure giving access to its properties
  for property in item:
    print property, '=', item[ property ]
  # Retrieve the file(s) (and directory(ies), if any) from the data storage system
  # and convert them to NIFTI format (if necessary)
  files = item.getLocalFiles( format='NIFTI' )
  niftiFileName = files[ 0 ]
  # Read the image and do something with it
  ...

  1. I do not yet have a good idea of how to represent requests. Here, I chose to use SQL since it is simple to understand.

  2. This code does not make any assumption about the properties that are associated with an FMRI image.

  3. The getLocalFiles method may do nothing more than return a file name if the data item corresponds to a local file in NIFTI format. But the same code can be used to access a DICOM image located on a remote FTP server. In this case, getLocalFiles will manage the transfer of the DICOM file, then the conversion to the required NIFTI format, and return the name(s) of the temporary file(s).

  4. getLocalFiles cannot always return just one file name because, in the long term, many data types (FMRI, diffusion MRI, EEG, MEG, etc.) are going to be stored in the repositories. These different data will use various file formats, and some of these formats use a combination of files and directories (for instance, CTF MEG raw data are stored in a directory (*.ds), the structural sulci format of BrainVISA is composed of a file (*.arg) and a directory (*.data), NIFTI images can be stored in one or two files, etc.).

  5. The same kind of API can be used for writing data items in a repository. One could build a data item, add properties and files to it, and call something like repository.update( item ), as sketched below.
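A possible shape for that write path, using the same hypothetical API (newItem and addLocalFiles are invented names):

# Hypothetical write-side usage; none of these names are settled.
repository = openRepository( repositoryURL )
item = repository.newItem( type='FMRI' )        # create an empty data item
item[ 'protocol' ] = 'my_protocol'              # attach properties
item[ 'subject' ] = 'subject01'
item.addLocalFiles( [ '/tmp/run1.nii' ], format='NIFTI' )  # attach the file(s) holding the data
repository.update( item )                       # index and store the new item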