ACROSS THE DECADES - 40 years of data archiving
 

Processing and preservation

Some things never change...

"The greatest misconception about survey archives is the belief ... that when data ... arrive at the archive their transfer is complete".

This quote from Allen Potter, the Archive's first director, is taken from an article in New Society published in 1969, yet it is in many ways timeless. Preparing data both for use by others unrelated to the collection process and for long-term access through a managed preservation policy can be arduous, time-consuming and in some respects unrewarding, inasmuch as it is often taken for granted by the user community. The 'ingest process' - as it is now called - consists of several distinct but interlinked steps. A valiant attempt to capture these in diagrammatic form was made in 1976 and is reproduced here.

flow diagram

Whilst this flow diagram may seem somewhat politically incorrect thirty years on, what is striking is that its representation of the ingest process (essentially the left-hand side of the diagram) differs little, conceptually, from today's.

Open Archival Information System (OAIS) reference model

The internationally recommended standard for the process management and organisation of digital archives is now the Open Archival Information System (OAIS) reference model, which was first developed by the Consultative Committee for Space Data Systems and is now being applied to digital repositories of all kinds. It is heartening that the Archive measures up to the OAIS very well and is compliant in nearly all regards: a system of internal processes which evolved over time out of practical necessity fits well with the theoretical gold standard of the OAIS.



"As a new member of staff in 1978, my induction started with a lesson on the mysteries of the punched card, followed by a session at a floor-standing computer terminal which noisily emitted reams of paper charting its communication with the university's huge mainframe computer.

In the secretary's office, the desks were piled high with large magnetic tapes waiting to be packaged in protective containers to be shipped off to waiting researchers."

Kathy Sayer
Senior Census Registration Service Officer, UKDA

Others do...

Although the conceptual nature of the various ingest activities may not have changed significantly, technological change has had a major impact on the way in which the ingest tasks are undertaken. The preparation of data for users was highly complex in the 1970s, and was heavily reliant on access to computer time, a constraint which is inconceivable today. As Eric Roughley commented:

"At this time [1972] the Archive remained a small, and rather cosy, organisation. The eight or nine of us would meet each morning to decide which job would be run that afternoon on our two-hour dedicated time slot on the University's ICT 1909 mainframe".

The later switch to a DEC PDP-10 computer improved matters, but changing the electricity supply to this machine in the summer of 1976 meant a downtime of over a week, leading to considerable delays in the production of data and codebooks for users. Compare this to 2007 when the entire Archive moved building and relocated its server room with just one hour of downtime for users!

The multiplicity of data formats and file sizes also created problems in the 1970s, as illustrated by this quote from Bulletin No. 4 (June 1976) on the processing of the Family Expenditure Survey (FES):

"The FES is a huge data set that usually requires two or three magnetic tapes per survey... The data are stored ... in a format suitable for the Department of Employment computer programmes, but not, however, in a format that allows analysis with the kind of computer programmes available to our users. That is, one cannot apply SPSS or TSP or OSIRIS to these data without a considerable amount of pre-analytical processing. What this meant, in effect, is that we could not simply convert the data tapes to our own in-house computer code ... [as] ... it would render the data virtually inaccessible to all but the most skilled programmer....[w]e designed a system which allows the user to determine whether she/he is interested in household or individual data... Then given a set of variables requested by the user, we shall be able to supply a data set that can be immediately input into any of the popular statistical programmes. Thus all that users will have to do is complete a specially designed user request form, on which they describe the variables of interest. This form will serve as the input to our extraction programme"

Processing the JUVOS dataset onto punch cards in 1983 - note the tapes and drawers of punch cards
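
The request-driven extraction described in the Bulletin quote can be illustrated with a minimal modern sketch. This is not a reconstruction of the Archive's 1976 extraction programme: the fixed-width layout, the variable names (`hh_id`, `region`, `weekly_spend`, `person_no`, `age`) and the `Request` and `extract` names are all hypothetical, chosen only to show the idea of a user choosing household or individual data and listing the variables of interest.

```python
from dataclasses import dataclass

# Hypothetical fixed-width layouts: variable name -> (start, end) column offsets.
# The real FES record layout is not given in the source.
LAYOUT = {
    "household": {"hh_id": (0, 4), "region": (4, 6), "weekly_spend": (6, 11)},
    "individual": {"hh_id": (0, 4), "person_no": (4, 6), "age": (6, 9)},
}


@dataclass
class Request:
    """Stand-in for the 'user request form': a level and a list of variables."""
    level: str          # "household" or "individual"
    variables: list     # names of the variables the user wants


def extract(records, request):
    """Return one dict per record containing only the requested variables."""
    layout = LAYOUT[request.level]
    unknown = [v for v in request.variables if v not in layout]
    if unknown:
        raise KeyError(f"unknown variables for {request.level}: {unknown}")
    rows = []
    for line in records:
        # Slice each requested variable out of its fixed column range.
        rows.append(
            {v: line[layout[v][0]:layout[v][1]].strip() for v in request.variables}
        )
    return rows


# Two made-up household records in the hypothetical layout above.
raw_records = ["000107 1234", "000212  987"]
form = Request(level="household", variables=["hh_id", "weekly_spend"])
subset = extract(raw_records, form)
```

The resulting `subset` is a plain list of rows that could be written out in whatever format a given statistical package expects, which is the role the quote assigns to the extraction step.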

Where things have not changed is in the belief that, regardless of how data are supplied to the Archive, they should be made available to researchers in a form which suits their requirements. In this regard one size will never fit all, so now, as then, processing has to employ the latest technical solutions to continuously ensure that users can choose from a range of formats and media types when receiving the data.

This is your last chance!

"This is your last chance!", announced the Bulletin in 1990, noting that:

"The University of Essex will be replacing its mainframe computer this Summer and, in consequence, we shall no longer be able to handle seven-track magnetic tapes, paper tape or punched cards".

When the Archive was established in 1967, no-one really thought about the consequences of having to keep the data files that were being collected and stored accessible over time. As technological change in both hardware and software renders computer files unreadable, and as magnetic storage media deteriorate over time, digital preservation has become a major topic in its own right. In the past 40 years the Archive has witnessed a succession of changes in both storage media and file formats. As someone recently observed: "if digital preservation did not exist, the Data Archive would of [sic] had to invent it!".

"The Data Archive is repeating its 1984 special offer on data supplied on floppy disks. The special offer charge for any dataset ordered (by 30 September 1985) on floppy disks will be only £9.95, which will include the handling-charge, the cost of one disk and postage and packing."

Bulletin No. 31, May 1985

Over the years the Archive has had to respond and adapt to changes by developing a flexible yet pragmatic approach to preservation - for example, by holding data in non-proprietary, migratable formats across multiple storage media. There is no simple solution to the problem of digital preservation. It is more about minimising the risk of data loss or redundancy than trying to second-guess what the future holds. The comforting thing is that in its 40-year history the Archive has yet to 'lose' a dataset - let us hope that this track record can continue!

Postscript:

The last chance has not exactly passed. The Archive still maintains magnetic tape and punch card readers and endeavours wherever possible to 'rescue' important data collections held in old formats or on obsolete media.

But what does it mean?

Data, it might be claimed, are only as useful as the metadata which support them. Data cannot be properly analysed and interpreted without adequate accompanying documentation (or what is nowadays often termed 'metadata'). In terms of social science data this could take the form of codebooks, technical manuals describing sampling techniques, explanations of derived variables, application of weightings, and so on - in essence, all that is needed to explain the provenance of the data and the methodologies used in their creation.

From the 1960s and the days of punch cards onwards, through the development of statistical software packages and virtual data environments at the turn of the 21st century, the Archive has endeavoured to strike a balance between producing metadata and documentation files in user-friendly formats, and keeping them in a platform-independent state for preservation and long-term usability.

paper documentation

One major project, undertaken during the 1990s, involved the scanning and transfer of paper documentation for over 3,000 datasets (the vast bulk of the UKDA's collection at that time). It produced both files in a suitable image preservation format and easy-to-use, universally readable Adobe Portable Document Format (PDF) files. This enabled the easy distribution of documentation via the web-based catalogue, thereby maximising the amount of free-to-download information at users' fingertips.

A major advance witnessed over recent years, but with much potential still to be realised, is the development of internationally recognised metadata standards. These greatly facilitate the transfer of information between systems (for example, across countries) and likewise help in addressing the problem of preserving materials over time. One particularly important development has been the establishment of the Data Documentation Initiative (DDI), of which the Archive has been a long-standing member and contributor, and which has produced an interoperable standard based on eXtensible Markup Language (XML). One advantage of the DDI approach is that it retains both data and metadata within the same XML structure.
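
To give a flavour of what XML-based documentation of this kind looks like, the following is a minimal sketch that builds a small codebook-style document describing one survey variable. The element names (`codeBook`, `stdyDscr`, `dataDscr`, `var`, `labl`, `catgry`, `catValu`) follow the DDI Codebook vocabulary as an assumption on our part; the sketch omits namespaces, covers only the metadata side, is not validated against any DDI schema, and the study title and variable are invented for illustration.

```python
import xml.etree.ElementTree as ET


def build_codebook(title, variables):
    """Build a simplified, unvalidated DDI-style codebook element tree.

    `variables` is a list of (name, label, categories) tuples, where
    `categories` is a list of (code, label) pairs.
    """
    root = ET.Element("codeBook")

    # Study-level description: here just a title.
    study = ET.SubElement(root, "stdyDscr")
    citation = ET.SubElement(study, "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = title

    # Variable-level description: name, label and coded categories.
    data = ET.SubElement(root, "dataDscr")
    for name, label, categories in variables:
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label
        for code, cat_label in categories:
            cat = ET.SubElement(var, "catgry")
            ET.SubElement(cat, "catValu").text = code
            ET.SubElement(cat, "labl").text = cat_label
    return root


# Invented example content, for illustration only.
codebook = build_codebook(
    "Example Survey",
    [("sex", "Sex of respondent", [("1", "Male"), ("2", "Female")])],
)
xml_text = ET.tostring(codebook, encoding="unicode")
```

Because the result is plain, structured text rather than a proprietary binary file, it can be read by any XML-aware system, which is what makes this kind of standard useful both for exchange between archives and for preservation.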