The Data Library can be thought of as a cache of datasets. For most datasets, the dataset is analyzed to see how its structure can be presented in a standard way, and a custom script is created and run periodically to maintain the local copy of the original dataset.
Going forward, DODS allows one to formalize the entire process: the original server already has the dataset structured, and DODS can be used to transmit both the data and the metadata.
Example: mom-3 at GFDL
Caching is also useful for slow calculations or slow reads.
Example: Half-hourly IR
Example: NINO3
Currently Ingrid has a single last-modified time per dataset, which is not adequate for maintaining copies of datasets that are continually extended in time. However, if last-modified were changed from a single number to a function of the requested time range, the last-modified time on the final HTTP object could be calculated based only on the data actually requested. The standard HTTP mechanism could then be used for the most part: when the client finds that the dataset has changed, it could re-request the subset corresponding to the data it already holds to check whether that data is still valid.
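The idea can be sketched as follows. This is a minimal illustration, not Ingrid's actual implementation: the segment table, function names, and revalidation logic are all assumptions made up for the example.

```python
from datetime import datetime, timezone

# Hypothetical per-segment modification times for a dataset that is
# continually extended in time: older data is stable, recent appends
# are new. Keys are (start_year, end_year) segments; values are when
# that segment was last written.
SEGMENT_MTIMES = {
    (1980, 1999): datetime(2000, 1, 15, tzinfo=timezone.utc),
    (2000, 2009): datetime(2010, 2, 1, tzinfo=timezone.utc),
    (2010, 2024): datetime(2024, 6, 1, tzinfo=timezone.utc),
}

def last_modified(start, end):
    """Last-modified as a function of the requested time range:
    the newest mtime among segments overlapping [start, end]."""
    hits = [mtime for (s, e), mtime in SEGMENT_MTIMES.items()
            if s <= end and start <= e]
    return max(hits) if hits else None

def needs_refetch(start, end, cached_mtime):
    """HTTP-style revalidation: refetch only if the requested
    subset changed after the cached copy was made."""
    lm = last_modified(start, end)
    return lm is not None and lm > cached_mtime
```

With this scheme, a cache holding only the 1980-1999 portion, fetched in 2001, would not refetch anything even though the dataset as a whole has been appended to since; a request spanning the recent years would correctly trigger a refetch.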
It could be useful, however, to have additional queries that a cache could make of a server to see if any datasets have been changed.
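One such query might be a "what changed since T" request, letting a cache revalidate only the datasets that actually changed instead of polling every dataset it holds. A minimal sketch, where the catalog layout and function name are assumptions for illustration, not part of any existing DODS protocol:

```python
from datetime import datetime, timezone

# Hypothetical server-side catalog of per-dataset last-modified times.
CATALOG = {
    "NINO3": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "mom-3": datetime(1999, 3, 10, tzinfo=timezone.utc),
    "half-hourly-IR": datetime(2024, 5, 20, tzinfo=timezone.utc),
}

def changed_since(since):
    """Answer a cache's query: which datasets changed after `since`?
    Returns dataset names sorted for a stable response."""
    return sorted(name for name, mtime in CATALOG.items() if mtime > since)
```

A cache that last synchronized at the start of 2024 would learn that only two of the three datasets need revalidation.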