.. _data-package-discuss: ########################## Principles of data package ########################## ******* Summary ******* This is a discussion of data packages, as they are currently implemented in nibabel / nipy. This API proved to be very uncomfortable, and we intend to replace it fairly soon. See ``data_packages.rst`` in the `nibabel wiki`_ for our current thinking, not yet implemented. ********** Motivation ********** When developing or using nipy, many data files can be useful. #. *small test data* - very small data files required for routine code testing. By small we mean less than 100K, and probably much less. They have to be small because we keep them in the main code repository, and you therefore always get them with any download. #. *large test data*. These files can be much larger, and don't come in the standard repository. We use them for tests, but we skip the tests if the data are not present. #. *template data* - data files required for some algorithms to function, such as templates or atlases #. *example data* - data files for running examples. We need some standard way to provide the larger data sets. To do this, we are here defining the idea of a *data package*. This document is a draft specification of what a *data package* looks like and how to use it. ******************* Separation of ideas ******************* This section needs some healthy beating to make the ideas clearer. However, in the interests of the 0SAGA_ software model, here are some ideas that may be separable. Package ======= This idea is rather difficult to define, but is a bit like a data project, that is a set of information that the packager believed had something in common. The package then is an abstract idea, and what is in the package could change completely over course of the life of the package. The package then is a little bit like a namespace, having itself no content other than a string (the package name) and the data it contains. Package name ============ This is a string that gives a name to the package. Package instantiation ===================== By *instantiation* we mean some particular actual set of data for a particular package. By actual, we mean stuff that can be read as bytes. As we add and remove data from the package, the *instantiation* changes. In version control, the instantiation would be the particular state of the working tree at any moment, whether this has been committed or not. It might not be enjoyable, but we'll call a package instantiation a *pinstance*. Pinstance revision ================== A revision is an instantiation of the working tree that has a unique label - the *revision id*. Pinstance revision id ===================== The *revision id* is a string that identifies a particular *pinstance*. This is the equivalent of the revision number in subversion_, or the commit hash in systems like git_ or mercurial_. There is only one pinstance for any given revision id, but there can be more than one revision id for a pinstance. For example, you might have a revision of id '200', delete a file, restore the file, call this revision id '201', but they might both refer to the same instantiation of the package. Or they might not, that's up to you, the author of the package. Pinstance tag ============= A *tag* is a memorable string that refers to a particular pinstance. It differs from a revision id only in that there is not likely to be a tag for every revision. It's possible to imagine pinstances without a revision id but with a tag, but perhaps it's reasonable to restrict tags to refer to revisions. A *tag* is equivalent to a tag name in git or mercurial - a memorable string that refers to a static state of the data. An example might be a numbered version. So, a package may have a revision uniquely identified by a revision id ``af5bd6``. We might decide to label this revision ``release-0.3`` (the equivalent of applying a git tag). ``release-0.3`` is the tag and ``af5bd6`` is the revision id. Different sources of the same package might possibly produce different tags [#tag-sources]_ Pinstance version ================= A *pinstance* might also have a version. A version is just a tag that can be compared using some algorithm. .. _prundle: Package provider bundle ======================= Maybe we could call this a "prundle". The *provider bundle* is something that can deliver the bytes of a particular pinstance. For example, if you have a package named "interesting-images", you might have a revision of that package identified by revision id "f745dc2" and tagged with "version-0.2". There might be a *provider bundle* of that instantiation that is a zipfile ``interesting-images-version-0.2.zip``. There might also be a directory on an http server with the same contents ``http://my.server.org/packages/interesting-images/version-9.2``. The zipfile and the http directory would both be *provider bundles* of the particular instantiation. When I unpack the zipfile onto my hard disk, I might have a directory ``/my/home/packages/interesting-images/version-0.2``. Now this path is a provider bundle. Provider bundle format ====================== In the example above, the zipfile, the http directory and the local path are three different provider bundle formats delivering the same package instantiation. Let's call those formats: * zipfile format * url-path format * local-path format Pinstance release ================= A release might be a package instantiation that one person has: #. tagged #. made available as one or more *provider bundles* .. _prundle-discovery: Prundle discovery ================= We *discover* a package bundle when we ask a system (local or remote) whether they have a package bundle at a given revision, tag, or bundle format. That implies two discoveries - *local discovery* (is the package bundle on my local system, if so where is it?); and *remote discovery* (is the package bundle on your expensive server and if so, how do I get it?). For the Debian distributions, the ``sources.list`` file identifies sources from which we can query for software packages. Those would be sources for *remote discovery* in our language. Prundle discovery source ======================== A *prundle discovery source* is somewhere that can answer prundle discovery queries. One such thing might be a prundle registry, where an element in the registry contains information about a particular prundle. At a first pass this might contain: * package name * bundle format * revision id (optional) * tag (optional) Maybe it should also contain information about where the information came from. Pinstance metadata query ======================== We query a pinstance when we know that a particular system (local or remote) has a package bundle of the pinstance we want. Then we get some information about that pinstance. By definition, different prundles relating to the same pinstance have the same metadata. Pinstance metadata query source =============================== A *pinstance metadata query source* is somewhere that can answer pinstance metadata queries. Obviously a source may well be both a *prundle discovery source* and a *pinstance metadata query source*. Pinstance installation ====================== We install a pinstance when we get some prundle containing the pinstance and place it on local storage, such that we can *discover* the prundle on our own (local) system. That is we take some prundle and convert it to a *local-path* format bundle *and* we register this local-path format bundle to a *discovery source*. Data and metadata ================= Pinstance data is the bytes as they are arranged in a particular pinstance. Pinstance metadata is data about the pinstance. It might include information about what data is in the package. Prundle metadata Information about the particular prundle format. *********************** Comparative terminology *********************** In which we compare the package terminology above to the terminology of Debian packaging. Compared to Debian packaging ============================ * A Debian distribution is a label - such as 'unstable' or 'lenny' - that refers to a set of package revisions that go together. We have no equivalent. * A Debian *repository* is a set of packages within a distribution that go together - e.g. 'main' or 'contrib'. We probably don't have an equivalent (unless we consider Debian's repository as being like a very large package in our language). * A Debian source is a URI giving a location from which you can collect one or more repositories. For example, the line: "http://www.example.com/packages stable main contrib" in a "sources.list" file refers to the *source* "http://www.example.com/packages" providing *distribution* "stable" and *repositories* (within stable) of "main" and "contrib". In our language the combination of URI, distribution and repository would refer to a *prundle discovery source* - that is - something that will answer queries about bundles. * package probably means the same for us as for Debian - a name - like "python-numpy" - that refers to a set of files that go together and should be installed together. * Debian packages have versions to reflect the different byte contents. For example there might be a .deb file (see below) "some-package-0.11_3-i386.deb" for one distribution, and another (with different contents) for another distribution - say "some-package-0.12_9-i386.deb". The "0.11_3" and "0.12_9" parts of the deb filename are what we would call *package instantiation tags*. * A Debian deb file is an archive in a particular format that unpacks to provide the files for a particular package version. We'd call the deb file a *package bundle*, that is in *bundle format* "deb-format". ********** Desiderata ********** We want to build a package system that is very simple ('S' in 0SAGA_). For the moment, the main problems we want to solve are: creation of a package instantiation, installation of package instantiations, local discovery of package instantiations. For now we are not going to try and solve queries. At least local discovery should be so simple that it can be implemented in any language, and should not require a particular tool to be installed. We hope we can write a spec that makes all of (creation, installation, local discovery) clearly defined, so that it would be simple to write an implementation. Obviously we're going to end up writing our own implementation, or adapting someone else's. datapkg_ looks like the best candidate at the moment. ****** Issues ****** From a brief scan of the `debian package management documentation `_. Dependency management ===================== (no plan at the moment) Authentication and validation ============================= * Authentication - using signatures to confirm that you made this package. * Verification - verify that the contents have not been corrupted or changed since the original instantiation. For dependency and validation, see the `Debian secure apt`_ page. One related proposal would be: .. _Debian secure apt: https://wiki.debian.org/SecureApt * Each package instantiation would carry a table of checksums for the files within. Someone using this instantiation would check the checksums to confirm that they had the intended content. * Authentication would involve some kind of signing of the table of checksums, as in the ``Release.gpg`` file in Debian distributions (`Debian secure apt`_ again). This involves taking a checksum of the table of checksums, then using our trusted private key to encrypt this checksum, generating a digital signature. The signature is the thing we provide to the user. The user then gets our public key or has it already; they use the key to decrypt the signature to get the checksum, and they check the resulting checksum against the actual checksum of the checksum table. The property of the public/private key pair is that it is very hard to go backwards. To explain, here's an example. Imagine someone we don't like has made a version of the package instantiation, but wants to persuade the world that we made it. Their contents will have different checksums, and therefore a different checksum for the checksum table. Let's say the checksum of the new checksum table is *X*. They know that you, the user, will use your own copy of our public key, and they can't get at that. Their job then, is to make a new encrypted checksum (the signature) that will decrypt with our real public key, to equal *X*. That's going backwards from the desired result *X* to the signature, and that is very hard, if they don't have our private key. ****************************** Differences from code packages ****************************** The obvious differences are: #. Data packages can be very large #. We have less need for full history tracking (probably) The size of data packages probably mean that using git_ itself will not work well. git_ stores (effectively) all previous versions of the files in the repository, as zlib compressed blobs. The working tree is an uncompressed instantiation of the current state. Thus, if we have, over time, had 4 different versions of a large file with little standard diff relationship to one another, the repository will have four zlib compressed versions of the file in the ``.git/objects`` database, and one uncompressed version in the working tree. The files in data packages may or may not compress well. In contrast to the full git_ model, we may want to avoid duplicates of the data. We probably won't by default want to keep all previous versions of the data together at least locally. We probably do want to be able to keep track of which files are the same across different instantiations of the package, in the case where we already have one instantiation on local disk, and we are asking for another, with some shared files. We might well want to avoid downloading duplicate data in that case. Maybe the way to think of it is of the different costs that become important as files get larger. So the cost for holding a full history becomes very large, whereas the benefit decreases a little bit (compared to code). ************* Some usecases ************* Discovery ========= :: from ourpkg import default_registry my_pkg_path = default_registry.pathfor('mypkg', '0.3') if mypkg_path is None: raise RuntimeError('It looks like mypkg version 0.3 is not installed') .. rubric:: Footnotes .. [#tag-sources] Revision ids could for example be hashes of the package instantiation (package contents), so they could be globally unique to the contents, wherever the contents was when the identifier was made. However, *tags* are just names that someone has attached to a particular revision id. If there is more than one person providing versions of a particular package, there may not be agreement on the revision that a particular tag is attached to. For example, I might think that ``release-0.3`` of ``some-package`` refers to package state identified by revision id ``af5bd6``, but you might think that ``release-0.3`` of ``some-package`` refers to some other package state. In this case you and are are both a *tag sources* for the package. The state that particular tag refers to can depend then on the source from which the tag came. .. include:: ../links_names.txt