Report of the

High Data-Rate Macromolecular Crystallography Meeting

Brookhaven National Laboratory 26 -- 28 May 2016

Report Date: 7 June 2016

 

This is the first of a series of three meetings in spring and summer 2016 on the changes needed in existing major software packages to support very high data-rate macromolecular crystallography. It was held at Brookhaven National Laboratory, 26 -- 28 May 2016, and was organized by Herbert J. Bernstein of Rochester Institute of Technology, Nicholas K. Sauter of Lawrence Berkeley National Laboratory and Robert M. Sweet of Brookhaven National Laboratory.

 

This report is the collaborative result of the work of the participants in the meeting, many of whom have approved the wording of earlier drafts. The editor of the text is HJB (yayahjb at gmail dot com), to whom comments and corrections should be directed.

 

The meeting had several sponsors: funding from Dectris Ltd. of Baden, Switzerland to Rochester Institute of Technology; from the National Institute of General Medical Sciences of the National Institutes of Health under grant 3R01GM117126-01S1 to Lawrence Berkeley National Laboratory; from the Department of Energy Offices of Biological and Environmental Research and of Basic Energy Sciences under grants DE-AC02-98CH10886 and DE-SC0012704; and from NIH grants P41RR012408, P41GM103473, and P41GM111244 to Brookhaven National Laboratory. The opinions expressed in this report are those of the meeting participants and not necessarily those of the funding sources.

 

The attendees at the meeting were:

On-site Participants

Mark Hilgart, Argonne National Laboratory
Jun Aishima, Australian Synchrotron
Tom Caradoc-Davies, Australian Synchrotron
Kaden Badalian, Binghamton University
Frances C. Bernstein, Brookhaven National Laboratory (ret.)
Andreas Förster, DECTRIS Ltd.
Markus Mathes, DECTRIS Ltd.
Eugen Wintersberger, Deutsches Elektronen-Synchrotron
David Hall, Diamond Light Source
Graeme Winter, Diamond Light Source
Andrew Hammersley, European Synchrotron Radiation Facility
Gerard Bricogne, Global Phasing Ltd.
Clemens Vonrhein, Global Phasing Ltd.
Aaron Brewster, Lawrence Berkeley National Laboratory
Nicholas K. Sauter, Lawrence Berkeley National Laboratory
Jie Nan, MAX IV Lund University
Harry Powell, MRC Laboratory of Molecular Biology (ret.)
Matt Cowan, NSLS-II Brookhaven National Laboratory
Martin Fuchs, NSLS-II Brookhaven National Laboratory
Jean Jakoncic, NSLS-II Brookhaven National Laboratory
Robert Petkus, NSLS-II Brookhaven National Laboratory
Alexei Soares, NSLS-II Brookhaven National Laboratory
Dieter Schneider, NSLS-II Brookhaven National Laboratory
John Skinner, NSLS-II Brookhaven National Laboratory
Bob Sweet, NSLS-II Brookhaven National Laboratory
Kerstin Kleese van Dam, CSI Brookhaven National Laboratory
Xiaochun Yang, NY Structural Biology Consortium
Seetharaman Jayaraman, NY Structural Biology Consortium
Herbert J. Bernstein, Rochester Institute of Technology
Simon Ebner, SLS Paul Scherrer Institut
Ezequiel Panepucci, SLS Paul Scherrer Institut
Justyna Aleksandra Wojdyla, SLS Paul Scherrer Institut
Martin Savko, SOLEIL Synchrotron
Elena Pourmal, The HDF Group
James Holton, UCSF/LBNL/SLAC
Wladek Minor, University of Virginia

Electronic Participants

Nukri Sanishvili, APS Argonne National Laboratory, GMCA-CAT
Kevin Battaile, APS Argonne National Laboratory, IMCA-CAT
Joe Digilio, APS Argonne National Laboratory, IMCA-CAT
Erica Dugrid, APS Argonne National Laboratory, IMCA-CAT
Spencer Anderson, APS Argonne National Laboratory, LS-CAT
Joe Brunzelle, APS Argonne National Laboratory, LS-CAT
Keith Brister, APS Argonne National Laboratory, LS-CAT
Surajit Banerjee, APS Argonne National Laboratory, NE-CAT
David Neau, APS Argonne National Laboratory, NE-CAT
Frank Murphy, APS Argonne National Laboratory, NE-CAT
K. Rajasankar, APS Argonne National Laboratory, NE-CAT
Jon Schuermann, APS Argonne National Laboratory, NE-CAT
James P. Withrow, APS Argonne National Laboratory, NE-CAT
Steve Ginell, APS Argonne National Laboratory, SBC-CAT
Chris Lazarski, APS Argonne National Laboratory, SBC-CAT
John Chrzas, APS Argonne National Laboratory, SER-CAT
Albert Fu, APS Argonne National Laboratory, SER-CAT
Zhongmin Jin, APS Argonne National Laboratory, SER-CAT
Daniel Eriksson, Australian Synchrotron
Vesna Samardzic-Boban, Australian Synchrotron
Stefan Brandstetter, DECTRIS Ltd.
Gleb Bourenkov, EMBL
Alexander Popov, European Synchrotron Radiation Facility
Go Ueno, SPring-8
Kazuya Hasegawa, SPring-8
Keitaro Yamashita, SPring-8
Thomas Eriksson, SSRL SLAC National Accelerator Laboratory
Takanori Nakane, The University of Tokyo
Kay Diederichs, Universität Konstanz

 

35 participants attended on-site on the first day, 31 on the second day, and 16 on the third day for report-draft editing.

 

28 electronic participants attended the presentations on the first day. On the second day fewer attended electronically; eight connected, and at least two were active in the discussion. There were no electronic participants for the report-editing session on the third day.

 

The first day was primarily devoted to the presentations shown on the meeting web site: http://medsbio.org/meetings/BNL_May16_HDRMX_Meeting.html

 

Statement of the Problem and Charge to the Meeting

 

Macromolecular crystallography (MX) is the gold standard for determining the atomic-resolution three-dimensional structures of large biologically active molecules. MX is becoming a big-data science, straining the capabilities of computers and networks. New serial-crystallography techniques are enabling new science, but they are also increasing the heterogeneity of the data that must be handled.

 

There are issues in data handling:

 

•  We are dealing with much more data than in the past.

•  In the short term, we need to store a large amount of data in a form that is retrievable quickly.

•  In the medium term, we need to store some version of much of the same data for processing and for users to take home.

•  In the longer term, we may need to store the publishable data.

•  We need to consider issues of compression and background removal.

 

Eiger 16M detectors produce 2.4 gigapixels per second, or 76 gigabits per second of raw data. Even when compressed 4:1, this is often more than a 10 Gb/s network can handle. We face an increasingly daunting flood of image data. We should try to reduce movement of data, reduce transformations of data, and move data in large blocks. Compression could be improved, but that alone will not be sufficient: no single compression scheme is ideal, and none is sufficient by itself.
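
For concreteness, the arithmetic behind these figures can be checked with a short calculation. The sketch below assumes the published Eiger 16M frame dimensions and 133 Hz full-frame rate, together with the 4:1 compression ratio used above:

    # Back-of-the-envelope data rates for an Eiger 16M in 32-bit mode.
    FRAME_PIXELS = 4150 * 4371   # ~18.1 Mpixel per frame
    FRAME_RATE_HZ = 133          # maximum full-frame rate
    BITS_PER_PIXEL = 32

    pixels_per_s = FRAME_PIXELS * FRAME_RATE_HZ
    raw_gbps = pixels_per_s * BITS_PER_PIXEL / 1e9
    compressed_gbps = raw_gbps / 4.0          # assuming 4:1 compression

    print(f"{pixels_per_s / 1e9:.1f} Gpixel/s")   # ~2.4 Gpixel/s
    print(f"{raw_gbps:.0f} Gb/s raw")             # ~77 Gb/s
    print(f"{compressed_gbps:.0f} Gb/s at 4:1")   # ~19 Gb/s, still above 10 Gb/s

Even at 4:1, roughly twice the capacity of a 10 Gb/s link is needed, which is why compression alone cannot solve the problem.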

 

Why be concerned?

 

•  For any stochastic system, the delays and the lengths of queues rise sharply as the rate at which information arrives for processing approaches the rate at which it can be processed (illustrated in the sketch after this list).

•  For any information-processing system, the rate at which you can move information through the system is limited by the capacity of the narrowest bottleneck.

•  When you work close to the capacity of a system, you are dancing on the edge of a cliff.
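
The first point is elementary queueing theory: for the textbook M/M/1 queue, the mean time a job spends in the system is W = 1/(mu - lambda), which diverges as the arrival rate lambda approaches the service rate mu. The following minimal sketch (not from the meeting; just the standard formula) shows how sharply delay grows near full utilization:

    # Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    service_rate = 100.0  # frames/s the pipeline can process (mu)

    for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
        arrival_rate = utilization * service_rate        # lambda
        mean_wait_ms = 1000.0 / (service_rate - arrival_rate)
        print(f"utilization {utilization:.2f}: mean time in system {mean_wait_ms:7.1f} ms")

    # utilization 0.50 ->   20 ms; utilization 0.99 -> 1000 ms:
    # running at 99% of capacity instead of 50% costs a 50x increase in latency.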

 

This meeting was charged with finding answers to the following questions:

 

•  What stumbling blocks inhibit direct processing of HDF5 data?

•  Is there a way to have the data produced by the detector processed by all the major packages without conversion?

•  What are the best practices for processing Eiger images?

o   Using C, Fortran, or Python?

o   In large compute clusters at synchrotrons?

o   In users' home-lab computers?

 

Discussion Points

 

The second day started with parallel software and beamlines/controls breakout sessions, which then reconvened to discuss joint issues. The combined discussion produced a great deal of agreement, initially as follows:

 

It was agreed that we will set up, as a community resource, an HDRMX web site that provides pointers and useful information on open-source software for high data-rate MX as a one-stop shopping page (for details, see consensus recommendation #5 below).

 

The continued discussion in the joint meeting produced the following preliminary best practices recommendations:

 

•  Spot finding. For screening purposes, allocating one image per process is at present the most effective approach, and keeping up with an Eiger 16M at full rate requires approximately ten very capable nodes with ordinary cores; GPUs are not at present appropriate (a sketch of the one-image-per-process pattern follows this list). J. Holton, J. Jakoncic and G. Winter will carefully consider the evidence, with input from the LBNL group, and make a firm best-practices recommendation.

 

•  Metadata. It is agreed that what is needed is a way to integrate, simply and reliably, the full equivalent of the CBF metadata into master files. People need to be made aware of the NXmx definitions that were jointly developed by IUCr COMCIFS and the NIAC for exactly this purpose. Easier-to-follow information will be added to the web site by HJB in consultation with H. Powell, E. Wintersberger and other interested people.
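
To make the one-image-per-process pattern from the spot-finding item above concrete, here is a hedged multiprocessing sketch; count_spots is a hypothetical stand-in for a real per-image spot finder such as the one in DIALS, and chunksize=1 hands workers exactly one image at a time:

    import multiprocessing as mp

    def count_spots(frame_number):
        """Hypothetical stand-in for a real per-image spot finder."""
        # A real implementation would read frame `frame_number` from the
        # HDF5 data files and run spot finding on it.
        return 0

    if __name__ == "__main__":
        frames = range(1, 1001)
        # One image per task, matching the observation above that
        # per-image parallelism is currently the most effective scheme.
        with mp.Pool(processes=mp.cpu_count()) as pool:
            counts = pool.map(count_spots, frames, chunksize=1)
        print(sum(1 for c in counts if c > 10), "frames with more than 10 spots")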

 

The beamlines/controls group noted that Dectris has agreed to work on optimizing a parallel file writer and streamer on the DCU using the 2 × 10 Gb/s links. There were also requests to consider the same on a single 40 Gb/s link. Dectris will not guarantee failsafe performance when the two links are used in parallel.

 

After further discussion, the discussion points were refined into the consensus recommendations below, which were supported by almost all participants at the meeting. These recommendations are aimed at those for whom speed and efficiency in MX data collection are of great significance.

 

DIALS Workshops

 

There was a separate discussion focused on the dissemination of DIALS and its best use when working with Eigers. The principal request was for the DIALS developers to organize workshops in Europe and the US, and perhaps also in Japan, China or Australia, to help users learn the best use of DIALS. There was also interest in local, smaller facility-based presentations, perhaps lasting one day, for a greater number of local users.

 

Future HDRMX Meetings

 

The attendees were reminded of the currently planned HDRMX meetings in association with the ACA meeting in Denver on 23 July 2016 and as a satellite meeting to the ECM meeting in Basel on 2 September 2016. There was particularly strong interest in attendance at the ACA session.

 

Consensus recommendations of the meeting

 

The meeting notes that all major applications (DIALS, HKL, MOSFLM, XDS) have now worked out ways to read Eiger data, and most (DIALS, HKL, XDS) can read it directly from HDF5-formatted files, but improvements are needed both in the supporting documentation and software tools for creating appropriate HDF5 master files and in read performance.

1.    The meeting notes that Python wrappers have become very important in the development of MX workflow pipelines. There is concern that the use of Python rather than C, C++ or Fortran might reduce efficiency by introducing additional data copying. Comments at the meeting noted that numpy handles the data-copying issue well. The DIALS project has volunteered to profile its use of h5py to make sure it is as efficient as possible, and to check that data movement is indeed minimized, as has been suggested is the case with numpy arrays (see the sketch below).
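
As one example of the kind of check such profiling could cover, h5py can decompress a frame directly into a preallocated numpy array with Dataset.read_direct, so a reused buffer avoids a fresh allocation per frame. This is a sketch assuming a Dectris-style /entry/data/data_000001 layout; the file name is hypothetical, the linked data files must be present alongside the master file, and the bitshuffle/LZ4 HDF5 filter must be installed:

    import h5py
    import numpy as np

    with h5py.File("series_1_master.h5", "r") as f:   # hypothetical file name
        dset = f["/entry/data/data_000001"]           # Dectris-style data block
        frame = np.empty(dset.shape[1:], dtype=dset.dtype)
        # read_direct fills `frame` in place; reusing the buffer in a loop
        # avoids the per-read allocation that dset[i, :, :] would incur.
        for i in range(dset.shape[0]):
            dset.read_direct(frame, source_sel=np.s_[i, :, :])
            # ... hand `frame` to spot finding / integration here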

 

2.    The meeting notes that the DECTRIS/XDS plugin now available, as discussed in Markus Mathes' talk, appears likely to improve the speed of reading HDF5 images directly in a wide range of applications. The meeting recommends that application developers try the DECTRIS/XDS plugin in their applications.
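
For developers who want to experiment, the plugin is a shared library exporting a small C API (plugin_open, plugin_get_header, plugin_get_data and plugin_close, as documented for the XDS generic data plugin). Below is a hedged ctypes sketch of driving it from Python; the library file name is hypothetical and the exact signatures should be verified against the plugin's own documentation:

    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./dectris-xds-plugin.so")      # hypothetical file name
    info = (ctypes.c_int * 1024)()                    # plugin info/status array
    err = ctypes.c_int(0)

    lib.plugin_open(b"series_1_master.h5", info, ctypes.byref(err))

    nx, ny, nbytes, nframes = (ctypes.c_int() for _ in range(4))
    qx, qy = ctypes.c_float(), ctypes.c_float()
    lib.plugin_get_header(ctypes.byref(nx), ctypes.byref(ny), ctypes.byref(nbytes),
                          ctypes.byref(qx), ctypes.byref(qy), ctypes.byref(nframes),
                          info, ctypes.byref(err))

    frame = np.zeros((ny.value, nx.value), dtype=np.int32)
    frame_no = ctypes.c_int(1)                        # XDS frame numbers start at 1
    lib.plugin_get_data(ctypes.byref(frame_no), ctypes.byref(nx), ctypes.byref(ny),
                        frame.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
                        info, ctypes.byref(err))
    lib.plugin_close(ctypes.byref(err))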

 

3.    The meeting notes that, in an increasing number of cases, an effort is necessary to provide full CBF-based metadata in HDF5 master files. HJB and CV have volunteered to gather and curate the data from all beamlines willing to contribute. The meeting respectfully asks that beamline scientists, please, for the love of our science, provide full beamline metadata for posting on the HDRMX web site.

 

4.    The meeting implores all concerned to work toward full, unconditional NeXus compliance.

 

5.    The meeting notes that there is a critical need to improve dissemination of both software and best practices, and recommends that the HDRMX web site be established to provide one-stop shopping for the community to get the open-source resources they need to develop software for processing Eiger data. The meeting notes with gratitude that the necessary permissions have been granted by the owners of the intellectual property listed below, that Dectris has agreed to allow use and extension of relevant portions of their documentation, and that Global Phasing has agreed to work towards adding hdf2mini-cbf to the site. On that site we will include

o   an extended version of the Dectris documentation of the Eiger master file / data file structure, especially for multi-axis metadata, including programmer's reference material explaining clearly the relationships among the NXmx-based NeXus/HDF5 format, the imgCIF/CBF format, and the coordinate systems

o   links to the NeXus format documentation (including documentation of the NeXus NXmx application definition) and reference copies of the software, with guidance on the portions relevant to MX

o   links to the imgCIF/CBF documentation and reference copies of the software (CBFlib), with guidance on the portions relevant to MX

o   links to the HDF5 documentation and reference copies of the software (HDF Group version), with guidance on the portions relevant to MX

o   links to LZ4 compression documentation and software (NIAC version)[1]

o   links to BitShuffle compression documentation and software (NIAC version)[1]

o   links to eiger2cbf documentation and reference copies of the software, with guidance on the portions relevant to MX

o   the dectris-xds-plugin and the plugin API that permits direct reading of HDF5 images from applications and provides a framework to help insulate application design from image-format issues

o   useful scripts (starting with XDS fork scripts)

o   guidance for and examples of adding beamline and experiment metadata to an existing NeXus/HDF5 master file as written by Dectris, including both CBF and HDF5 metadata templates and the necessary software tools (a small illustrative sketch appears at the end of this item)

o   guidance for and examples of writing beamline and experiment metadata into a new master file

o   guidance for simulating the output of the Eiger streaming interface starting from existing HDF5 or CBF files (to be followed eventually by full software implementations)

o   tools to calibrate and verify beamline metadata, with links to beamline metadata examples from all beamlines willing to contribute

o   a repository of example Eiger datasets from different synchrotrons, including example raster scans in HDF5 format, preferably with both micro and macro cases. Inasmuch as most currently available raster scans are in CBF format, GW will provide CBFs and HJB will convert them to HDF5. EHP will provide some native Eiger raster scans.

HJB will be secretary for the web site, assisted by CV. As per WM, storage of up to 200 TB for data will be provided by the Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC, http://proteindiffraction.org). As additional storage sites for data are volunteered, access to them will be coherently integrated with access to the IRRMC storage.
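
As a taste of what the metadata guidance above might look like, the hedged h5py sketch below adds one NXmx-style item to an existing master file; the file name and the 0.150 m distance are placeholders, and the authoritative field names and attributes should be taken from the NeXus NXmx documentation linked on the site:

    import h5py

    with h5py.File("series_1_master.h5", "r+") as f:       # hypothetical file name
        detector = f.require_group("/entry/instrument/detector")
        detector.attrs["NX_class"] = "NXdetector"          # NeXus group class

        # Record the detector distance in metres, with the units stored
        # as an attribute, following NXmx conventions.
        if "detector_distance" not in detector:
            d = detector.create_dataset("detector_distance", data=0.150)
            d.attrs["units"] = "m"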

 

6.    The meeting accepts GW's generous offer to work with Pilatus users to try to increase the use of full CBF headers.

7.    The meeting endorses an effort to create a reference implementation of DIALS spot finder running as a script, not as a server.

8.    Inasmuch as the metadata effort discussed above will draw metadata from multiple sources, Dectris has agreed to endeavor to provide the data that appears in miniCBF headers. The coordinate system is the one defined in the NeXus standard and used in the NeXus NXmx application definition. A link to the default NeXus geometry definitions will be provided on the HDRMX web site. Beamline scientists remain responsible for additional geometry data, such as complex nesting of axes, non-NeXus-default rotation directions, or any vector/axis definitions not covered by the standard.

9.    The meeting recommends that, for use cases beyond the current Dectris assumptions, a working group of this meeting prepare descriptions of those use cases and forward them to the NIAC for appropriate action, and that the template merge and edit capabilities be extended to allow beamlines to provide the necessary new master files.

10. The meeting notes that accepting and processing these data in a timely manner is likely to require high-bandwidth networks (more than 10 Gb/s) and ten or more substantial processing nodes.

11. GW (chair), AB, NKS, WM, TCD, JMH, MS, JN, JJ, MCH, JAW are forming a benchmark committee that will define standard benchmarks, run them, and forward the results to HJB for the web site. This committee will also investigate the issues surrounding compression of the X-ray data, gather evidence on the effectiveness of the various schemes, and provide useful examples for the web site (a starting-point sketch follows). HJB will assist with necessary format conversions and re-bricking of data files.
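
As a starting point for those measurements, compression filters can be timed from Python with h5py plus the hdf5plugin package, which bundles the bitshuffle/LZ4 HDF5 filter. This is a sketch under that tooling assumption, not a committee-endorsed benchmark; real runs would use the example datasets on the web site rather than synthetic frames:

    import os
    import time
    import h5py
    import hdf5plugin   # bundles the bitshuffle/LZ4 HDF5 filter
    import numpy as np

    # Synthetic stand-in for a detector frame (Poisson counts).
    frame = np.random.poisson(5, size=(4371, 4150)).astype(np.uint32)

    t0 = time.time()
    with h5py.File("bench.h5", "w") as f:
        f.create_dataset("data", data=frame, chunks=frame.shape,
                         **hdf5plugin.Bitshuffle())
    elapsed = time.time() - t0

    ratio = frame.nbytes / os.path.getsize("bench.h5")
    print(f"write: {elapsed:.2f} s, compression ratio {ratio:.1f}:1")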

12. The meeting respectfully asks Dectris to investigate whether it would be possible to offer the option of a 40 Gb/s interface from the DCU.

 

We are pleased to note the following useful information provided by Takanori Nakane: "Keitaro Yamashita has adapted Cheetah's spot-finding routine (mostly used for SFX at XFELs) to receive frames from the EIGER ZeroMQ interface (https://github.com/keitaroyam/cheetah/tree/eiger-zmq/eiger-zmq). It is used at SPring-8 BL32XU with an EIGER 9M."
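
For readers who want to experiment with the same streaming route, a minimal pyzmq consumer is sketched below. It assumes the stream interface described in the Dectris documentation: a ZeroMQ PUSH socket on port 9999 sending multipart messages whose first part is a JSON header; the host name is a placeholder:

    import json
    import zmq

    context = zmq.Context()
    socket = context.socket(zmq.PULL)            # the DCU pushes, we pull
    socket.connect("tcp://dcu-hostname:9999")    # placeholder endpoint

    while True:
        parts = socket.recv_multipart()
        header = json.loads(parts[0])
        # Image messages carry an "htype" such as "dimage-1.0"; the later
        # parts hold frame metadata and the compressed pixel blob, which
        # would be handed to a decompressor / spot finder here.
        if header.get("htype", "").startswith("dimage"):
            print("received frame message:", header)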

 

Conclusions

 

This was a particularly collegial, collaborative and effective meeting that achieved its goals and laid the groundwork for future collaborative efforts that should help to improve the efficiency and effectiveness of high data-rate macromolecular crystallography. Having met in person makes it more likely that people will continue to collaborate efficiently in the future.

 



[1] Because these compression filters are heavily used, there are many copies held at many different sites with minor configuration differences that have caused difficulties in integration with various packages. Settling on the NIAC version as the reference copy for this community will hopefully allow for smoother integration.