Processing and preservation
Some things never change...
"The greatest misconception about survey archives is the belief ... that when data ... arrive at the archive their
transfer is complete".
This quote from Allen Potter, the Archive's first director, is taken from an article in New
Society published in 1969, yet it is in many ways timeless. Preparing data both for use by others unrelated to the
collection process and for long-term access through a managed preservation policy can be arduous, time-consuming and
in some respects unrewarding, inasmuch as it is often taken for granted by the user community. The 'ingest
process' - as it is now referred to - consists of several distinct but interlinked steps. A valiant
attempt to capture these in diagrammatic form was made in 1976 and is reproduced here.
Whilst this flow diagram may
seem somewhat politically incorrect thirty years on, what is striking is that in many respects the
representation of the ingest process (essentially the left-hand side of the diagram) is conceptually little
different from today's.
The internationally-recommended standard for the process management and organisation of
digital archives is now the Open Archival Information System (OAIS) reference model, which was first developed
by the Consultative Committee for Space Data Systems and is now being applied to digital repositories of all
kinds. What is heartening and reassuring is that the Archive measures up to the OAIS very well, being
compliant in nearly all regards: a system of internal processes which evolved over time out of practical
necessity fits well with the theoretical gold standard of the OAIS.
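The core of the OAIS flow can be sketched schematically: a Submission Information Package (SIP) from a depositor is turned at ingest into an Archival Information Package (AIP) for long-term storage, from which Dissemination Information Packages are later derived for users. The sketch below is an illustration of that idea only, not the formal specification; the fields chosen are assumptions.

```python
# Schematic sketch of the OAIS ingest flow (not the formal ISO 14721 spec):
# a depositor's SIP becomes an AIP wrapped with preservation metadata.
from dataclasses import dataclass, field

@dataclass
class SIP:                      # what a depositor submits
    data: bytes
    documentation: str

@dataclass
class AIP:                      # what the archive preserves
    data: bytes
    documentation: str
    preservation_metadata: dict = field(default_factory=dict)

def ingest(sip: SIP) -> AIP:
    """Schematic ingest step: wrap a submission with preservation metadata."""
    return AIP(sip.data, sip.documentation,
               {"format": "non-proprietary", "fixity_recorded": True})

aip = ingest(SIP(b"raw survey data", "codebook text"))
```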
"As a new member of staff in 1978, my induction started with a lesson on the mysteries
of the punched card, followed by a session at a floor-standing computer terminal which
noisily emitted realms of paper charting its communication with the university's huge
In the secretary's office, the desks were piled high with large magnetic tapes waiting
to be packaged in protective containers to be shipped off to waiting researchers."
Senior Census Registration Service Officer, UKDA
Although the conceptual nature of the various ingest activities may not have changed significantly, technological
change has had a major impact on the way in which the ingest tasks are undertaken. The preparation of data for users was
highly complex in the 1970s, and was heavily reliant on access to scarce computer time, a constraint that is inconceivable today.
As Eric Roughley commented:
"At this time  the Archive remained a small, and rather cosy, organisation. The eight or
nine of us would meet each morning to decide which job would be run that afternoon on our two-hour dedicated time slot on the
University's ICT 1909 mainframe".
The later switch to a DEC PDP-10 computer improved matters, but changing the electricity supply
to this machine in the summer of 1976 meant a downtime of over a week, leading to considerable delays in the production of data
and codebooks for users. Compare this to 2007 when the entire Archive moved building and relocated its server room with just one
hour of downtime for users!
The multiplicity of data formats and file size also created problems in the 1970s, as illustrated by this
quote from Bulletin No. 4, June 1976, relating
to processing of the Family Expenditure Survey (FES) in 1976:
"The FES is a huge data set that usually requires two or three magnetic tapes per survey... The data are
stored ... in a format suitable for the Department of Employment computer programmes, but not, however, in a format that allows
analysis with the kind of computer programmes available to our users. That is, one cannot apply SPSS or TSP or OSIRIS to these
data without a considerable amount of pre-analytical processing. What this meant, in effect, is that we could not simply convert
the data tapes to our own in-house computer code ... [as] ... it would render the data virtually inaccessible to all but the most
skilled programmer....[w]e designed a system which allows the user to determine whether she/he is interested in household or
individual data... Then given a set of variables requested by the user, we shall be able to supply a data set that can be immediately
input into any of the popular statistical programmes. Thus all that users will have to do is complete a specially designed user
request form, on which they describe the variables of interest. This form will serve as the input to our extraction programme"
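The extraction idea the Bulletin describes - a user names the variables of interest, and the system pulls just those columns from the raw fixed-width records into a file any popular statistical package can read - can be sketched in modern terms. The Archive's actual 1976 programme is not reproduced here; the variable layout and field names below are hypothetical.

```python
# Illustrative sketch only: the layout and field names are invented, but the
# idea follows the Bulletin's description - given a user's requested
# variables, extract just those columns from fixed-width records and emit
# rows a statistical package can read directly.

# Hypothetical variable layout: name -> (start, end) character positions
LAYOUT = {
    "household_id": (0, 6),
    "tenure": (6, 7),
    "weekly_expenditure": (7, 13),
    "region": (13, 15),
}

def extract(raw_records, requested):
    """Return CSV rows containing only the requested variables."""
    unknown = [v for v in requested if v not in LAYOUT]
    if unknown:
        raise KeyError(f"unknown variable(s): {unknown}")
    rows = [",".join(requested)]  # header row
    for record in raw_records:
        rows.append(",".join(record[s:e].strip()
                             for s, e in (LAYOUT[v] for v in requested)))
    return rows

# Two made-up fixed-width records, 15 characters each
raw = ["0000012003450SE", "0000023012999NW"]
rows = extract(raw, ["household_id", "weekly_expenditure"])
# rows -> ["household_id,weekly_expenditure", "000001,003450", "000002,012999"]
```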
Where things have not changed is in the belief that, regardless of how data are supplied to the Archive, they should be made
available to researchers in a form which suits their requirements. In this regard one size will never fit all, so now, as then,
processing has to employ the latest technical solutions to ensure that users can continue to choose from a range of formats and
media types when receiving the data.
This is your last chance!
"This is your last chance!", announced the Bulletin in 1990, noting that:
"The University of Essex will be replacing its mainframe computer this Summer
and, in consequence, we shall no longer be able to handle seven-track magnetic tapes, paper tape or punched cards".
In 1967, when the Archive was established, no one really thought about the consequences of having to keep the data files being
collected and stored accessible over time. As technological change, in both hardware and software, renders computer files unreadable,
and as magnetic-based storage media deteriorate over time, digital preservation has now become a major topic in its own right. In
the past 40 years the Archive has witnessed a succession of changes in both storage media and file formats. As someone
recently observed: "if digital preservation did not exist, the Data Archive would of [sic] had to invent it!".
"The Data Archive is repeating its 1984 special offer on data supplied on floppy disks. The special offer
charge for any dataset ordered (by 30 September 1985) on floppy disks will be only £9.95, which will
include the handling-charge, the cost of one disk and postage and packing."
Bulletin No. 31, May 1985
Over the years
the Archive has had to respond and adapt to changes by developing a flexible yet pragmatic policy to preservation - for example,
by holding data in non-proprietary migratable formats across multiple storage media. There is no simple solution to the problem
of digital preservation. It is more about minimising the risk of data loss or redundancy than about trying to second-guess what
the future holds. The comforting thing is that in its 40-year history the Archive has yet to 'lose' a dataset - let us hope that
this track record can continue!
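One concrete safeguard behind such a policy can be sketched minimally: record a checksum ('fixity' value) for each file, so that silent corruption on any one storage copy can be detected and the file repaired from another. The file names and contents below are invented for illustration.

```python
# Minimal sketch of fixity checking across multiple storage copies.
# Names and contents here are hypothetical.
import hashlib

def fixity(data: bytes) -> str:
    """SHA-256 digest of a file's contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def failed_copies(master_digest: str, copies: dict) -> list:
    """Names of storage copies whose contents no longer match the master."""
    return [name for name, data in copies.items()
            if fixity(data) != master_digest]

original = b"survey data, version of record"
master = fixity(original)
copies = {
    "disk_copy": original,
    "tape_copy": b"survey data, version of recorc",  # one corrupted byte
}
bad = failed_copies(master, copies)
# bad -> ["tape_copy"]
```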
The last chance has not exactly passed. The Archive still maintains magnetic tape and punched
card readers and endeavours wherever possible to 'rescue' important data collections held in old formats or on obsolete media.
But what does it mean?
Data, it might be claimed, are only as useful as the metadata which support them. Data cannot be properly analysed and
interpreted without adequate accompanying documentation (or what is nowadays often termed 'metadata'). For
social science data, these could take the form of codebooks, technical manuals describing sampling techniques, explanations of
derived variables, application of weightings, and so on - in essence, all that is needed to explain the provenance of the data
and the methodologies used in their creation.
From the 1960s and the days of punch cards onwards, through the development of statistical software packages and virtual
data environments at the turn of the 21st century, the Archive has endeavoured to strike a balance between producing metadata
and documentation files in user-friendly formats, and keeping them in a platform-independent state for preservation and long-term usability.
One major project, undertaken during the 1990s, involved the scanning and transfer of paper documentation for over 3,000
datasets (the vast bulk of the UKDA's collection at that time), producing both a suitable image preservation format and easy-to-use,
universally readable Adobe Portable Document Format (PDF) files. This enabled the easy distribution of documentation via the web-based
catalogue, thereby maximising the amount of free-to-download information at users' fingertips.
A major advance witnessed over recent years, but with much potential still to be realised, is the development of
internationally recognised metadata standards. These greatly facilitate the transfer of information between systems
(for example, across countries) and likewise help in addressing the problem of preserving materials over time. One
particular development of importance has been the establishment of the Data Documentation Initiative (DDI). The Archive has
been a long-standing member of and contributor to this organisation, which has produced an eXtensible Markup Language (XML)-based interoperable standard.
One advantage of the DDI approach is that it retains both data and metadata within the same XML structure.
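Schematically, the idea can be illustrated with standard XML tooling. The element names below are modelled loosely on the DDI Codebook vocabulary; the fragment is an illustration, not a validated instance of the DDI schema.

```python
# Illustrative only: element names are modelled loosely on the DDI Codebook
# vocabulary; this is not a validated DDI instance.
import xml.etree.ElementTree as ET

codebook = ET.Element("codeBook")

# Study-level description
study = ET.SubElement(codebook, "stdyDscr")
ET.SubElement(study, "titl").text = "Family Expenditure Survey (illustrative)"

# Variable-level description, held in the same XML structure
data_dscr = ET.SubElement(codebook, "dataDscr")
var = ET.SubElement(data_dscr, "var", name="tenure")
ET.SubElement(var, "labl").text = "Housing tenure of household"

xml_text = ET.tostring(codebook, encoding="unicode")
```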