A Study Group on Data Preservation and Long Term Analysis in HEP was initiated at the end of 2008, involving major running experiments and associated computing centres.
The first workshop of the Study Group took place on January 26-28, 2009,
at DESY. The general feeling is that a lot of information was
provided and that the meeting was very helpful in giving shape to
the issue of data preservation in HEP. The problem is substantial, and
past experience shows that early preparation is needed and that sufficient
resources should be allocated.
The synthesis work is still to be done and more input is needed in
many areas. The "raison d'etre" of data preservation should be clearly
and convincingly formulated, including a viable economic model. All
experiments have the capability to take some concrete actions now and
propose models for data preservation based on real-life HEP
computing. The technological survey also plays a very important role,
since one of the crucial factors may indeed be the hardware
evolution. The whole process must be supervised by well-defined
structures and steered by clear specifications, endorsed by major
laboratories and computing centres.
In order to sharpen our focus along these lines and to factorise the
discussions, four working groups were defined:
* WG1: Physics Case
* WG2: Preservation Models
* WG3: Governance
* WG4: Technologies
These are described in more detail below.
Regular meetings of the Study Group and parallel discussions will take
place during the next few months. The aim is to produce a draft
document for the next workshop, to be held at SLAC in mid-2009.
In the following, a short summary of the five sessions of the first
workshop is given. A longer description can be found in the minutes
taken by Andre Holzner and also in the slides attached to the workshop
agenda, which can be found here:
http://indico.cern.ch/conferenceDisplay.py?confId=42722
Short Summary of the Workshop
The workshop was organised into five sessions: Reports of Analysis
Models from HEP Experiments, Computing Centres and Technologies, Past
Experience with Data Preservation, General Initiatives and Funding
Programs, and Final Discussions.
The first session was devoted to a review of the analysis models used
by some representative high energy physics experiments: H1, ZEUS, CDF,
D0, BaBar, Belle, BES, CLEO. The presentations were prepared according
to a pre-defined template and allowed a rough comparison of the
various configurations. Not surprisingly, the data structures are
similar: multi-level, from RAW data to higher abstraction. The
analysis is almost everywhere based on C++ and ROOT, and is
performed mostly on local computing farms. The Monte Carlo simulation
also uses a farm-based approach, but it is striking to see how popular
the GRID is for mass production. The amount of data that should be
stored for analysis varies between 0.5 and 10 PB per experiment: not
huge, but respectable.
The most relevant question was the degree of preparation for long term
analysis. It is clear that the issue is quite fresh in the community,
in other words: not defined. The HERA community is particularly
concerned since no other data of this type will be collected in the
next two decades. BaBar has started preparing the software for long
term hibernation and also plans to explore some techniques that are
novel to HEP, such as virtualisation. The other communities are less
concerned in the near future since the data taking is still going on
(for example at Tevatron) or the transition to a new generation of
experiments is foreseen. In the latter case, the preserved data could
be used for further checks of the new data sets, although these will
not arrive immediately. The general feeling was that we should do
something now, since the collected data is in all cases unique and may
serve further purposes.
In the next session, on Monday afternoon, the computing centres were
invited to present what they can contribute in the area of data
preservation. It was quite clear that the HEP IT centres have no
specific data preservation mission so far; preservation is generally
perceived as a service to the HEP collaborations (which in fact never
defined it). The "data" could easily end up in a (well painted)
cupboard in a basement. No guarantee is given that certain media can
still be read within a practical time, and there is even less of a
guarantee that the meta-data will be available. On the positive side,
expert knowledge on long term storage exists, since the HEP centres
deal with complex experimental requests and operate the storage of
"live" data.
On Monday evening, sister experiments (ee, pp, ep) met for
specific discussions. The conclusions were presented in the Wednesday
plenary.
Past experiences of data preservation were presented on Tuesday
morning. The most striking example is JADE. The re-analysis of jet
production data, taking into account the improved theory and improved
simulation, led to a truly fundamental measurement, where the
original data was imprecise and inconclusive.
If there was ever an example to convince ourselves about the
fundamentals of data preservation, this is it. LEP experience was also
presented and shed some light on the difficulty of the issue. The
preservation was in general not planned early enough. Less than a
decade after the end of the running, the LEP data are not available
anymore. Some restricted, high-level subsets are available as a
legacy of the transverse working groups, but these subsets are
neither useful for a new analysis nor suited for improving the
precision of an existing analysis. Most of the LEP data is probably
lost or will be lost soon if no vigorous revival action is taken. The
OPAL collaboration is in a better position, with a defined structure of
data supervision. However, no resources are defined for this
activity, and technological evolution can lead to a dead end too.
Experience from astrophysics (Virtual Observatory) was presented. A
common framework was initiated in order to give access to the full
spectra of data taken by various observation missions. The effort to
converge to a common format is supported by a European
programme. The scientific output from such an initiative seems promising:
several papers were produced using the framework, and the stored code
was used for simulations. The general feeling was that many ideas can be
applied to our field.
The "HEP analyst friend" is nowadays ROOT, so we had an interesting
overview of the history and perspectives of this central HEP
software. We were (not) surprised to learn that ROOT can be used in a
simple way, with integrated documentation, schema evolution,
extensions for multi-core computing and many other goodies that may
fulfill the requirements of stability over long periods. Simplicity
was the main recommendation from the ROOT team.
A sociological survey of the community was presented by Salvatore
Mele, within the framework of the PARSE program. About 1200 physicists
answered a questionnaire, and the opinion in the community is sound:
strong support for preserving HEP data can be read across the
questions. The majority of people think that it takes 1-10% more
effort to preserve data properly and that the effort to preserve
data should start concurrently with data taking.
Various programs dedicated to more general initiatives of digital data
preservation were presented from EU/FP7 (Salvatore Mele), DOE (Amber
Boehnlein) and STFC (David Corney). The various projects already
contain rich knowledge, in particular on issues related to
data ownership and open access, but also on technology surveys and
related infrastructure. It seems very useful to scrutinise such
initiatives and try to identify project characteristics that are
applicable in our field, such that relevant aspects of the proposals
are in line with the common knowledge in other fields.
The other friend of the HEP physicist is SPIRES (don't say no, 50000
of you looked for a job there, it seems). A new phase, INSPIRE, will
greatly extend the capabilities to store information beyond the
classical PDFs. Indeed it may be possible to store more data, macros
to produce the plots, the analysis code and even data sets. It is
clear that even the simplest extension of the normal way to provide
the public information will be highly beneficial. This approach of
preservation would proceed from the more abstract to the more basic
and it is interesting to watch in parallel with the "in-house"
approaches, starting from RAW data and the full code.
The discussions on Wednesday started with a summary from the sister
experiments. There are obvious points of convergence, and even
initiatives to move towards a common data format were mentioned. It is
understood that the issue is not saving the bits, but creating the
complex environment needed to retrieve and use the information in
order to extract a physics message.
A level-based preservation model was presented by H1/ZEUS. The system
can be locked successively at any level, but has the ambition to
conserve the full analysis capability. An "encapsulated" model was
presented by BaBar: re-naturalise the data and the software to a
single machine, betting on the increase in hardware capabilities (all
data on one disk, a powerful CPU for single-machine analysis). There is
no model yet for pp, but past lessons and the present status of the
analysis (stable, understood, "easy") give good hope that the action
will be successful. It should be mentioned, however, that the CDF Run I data
are already lost. All experiments expressed worries about the
associated person power.
Stephen Wolbers nicely listed all critical issues debated during the
workshop. A few stand-out points are in particular the need for
a physics case for data preservation, and that documentation should
replace expertise in the long term. For running experiments (e.g.
Tevatron) more expertise is still around. The question of ownership
and public access should also be addressed. An agreement / long term
plan between experiments and computing centres would be useful,
including budgeting.
Homer Neal summarised the most promising working directions as:
* The expected outcome of this effort needs to be clarified.
* The justification for preservation needs to be documented.
* A common data format for ee/ep/pp experiments could serve as a
test case.
* We could provide some guidance in the early stages of
experiments.
* The question of how to enforce correct documentation, and
how to verify that it is sufficient, must be addressed.
* It is proposed that reports be presented
at the next DPLTA meeting; in the meantime,
progress could be monitored through EVO virtual meetings.
Finally, from the result of all discussions we agreed on four working
groups, reflecting the "main component" analysis of the data
preservation issue:
1) Physics Case for Data Preservation in HEP
* Survey of possible benefits from data preservation
* Including business models
* Including links with other research fields
2) Preservation Models
* ee,ep,pp input
* Priorities, costs and benefits, link to technologies
3) Collaborations, Governance and Data access policies
* Including contacts with general initiatives
4) Technologies and facilities
* Survey and assessment of existing infrastructures in HEP and their
adaptability to data preservation requirements
* Reflection on the impact of the new technologies on the data
preservation methods
Proceedings are to be written by March 3rd.