Supplement to the manuscript in the journal

Bioinformatics

mspire: Mass spectrometry proteomics in Ruby

John T. Prince and Edward M. Marcotte

Contents

Data Models and Examples

MS::MSRun

An object model and additional examples of the use of the MS::MSRun object and associated lazy evaluation methods can be found at http://mspire.rubyforge.org/ms/msrun.html

SpecID

Usage of, and the rationale behind, SpecID and the SpecID::Pep and SpecID::Prot mixins can be found at http://mspire.rubyforge.org/spec_id/spec_id.html

SRF

The SRF object model is presented and discussed with accompanying usage examples: http://mspire.rubyforge.org/spec_id/srf.html

False Identification Rate Determination

Mspire can determine false identification rates using a variety of methods (e.g., searches against separate or concatenated DBs). Usage instructions for both the objects and the command-line interface are found at http://mspire.rubyforge.org/spec_id/fir/index.html
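
As a rough illustration of the two database strategies named above, the following minimal sketch shows how a false identification rate is commonly estimated from decoy hits. The method names and the exact formulas are assumptions chosen for illustration; they are not taken from mspire, which may compute these rates differently.

```ruby
# Hypothetical helpers (not part of mspire's API) illustrating two common
# decoy-based estimates of the false identification rate (FIR).

# Separate searches: the target and decoy databases are searched independently.
# Decoy hits passing the filter estimate the number of false target hits.
def fir_separate(num_target_hits, num_decoy_hits)
  return 0.0 if num_target_hits.zero?
  num_decoy_hits.to_f / num_target_hits
end

# Concatenated search: a single search against target + decoy sequences.
# Each decoy hit is assumed to imply one false target hit, so the estimated
# number of false hits among all passing hits is twice the decoy count.
def fir_concatenated(num_target_hits, num_decoy_hits)
  total = num_target_hits + num_decoy_hits
  return 0.0 if total.zero?
  (2.0 * num_decoy_hits) / total
end

# Illustrative counts only (not results from the manuscript).
puts fir_separate(950, 50)
puts fir_concatenated(950, 50)
```

Both estimates rest on the assumption that incorrect matches fall on target and decoy sequences with equal probability.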

OBI-Warp

Additional usage examples explaining how mspire can be used to generate rectilinear matrices for chromatographic alignment with OBI-Warp can be found at http://mspire.rubyforge.org/ms/obiwarp.html

Complete Documentation

Documentation for all modules and classes in mspire is here: http://mspire.rubyforge.org/rdoc/index.html

Additional Functionality

Mspire also has other functionality useful to developers and users. It has convenience wrappers for converting batches of .RAW files into mzXML and running multiple-run datasets with Percolator. Mspire can perform in silico protein digestion with different enzymes and missed cleavages. It can transform spectral data into the 'lmat' format for OBI-Warp alignment (see above) and extract chromatographic gradient programs from binary RAW or method files. These and other methods, executables, and scripts ease the burden of working with MS proteomics data.
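
The in silico digestion mentioned above can be sketched in a few lines of plain Ruby. This is an illustrative example only: the method name is hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the commonly used trypsin rule, assumed here rather than taken from mspire's implementation.

```ruby
# Hypothetical sketch (not mspire's API): tryptic digestion with missed cleavages.
# Assumed rule: cleave C-terminal to K or R unless the next residue is P.
def tryptic_peptides(sequence, missed_cleavages = 0)
  # The lookbehind keeps K/R on the left-hand piece; the lookahead blocks K/R-P sites.
  pieces = sequence.split(/(?<=[KR])(?!P)/)
  peptides = []
  pieces.each_index do |start|
    # Join each piece with up to `missed_cleavages` following pieces.
    0.upto(missed_cleavages) do |extra|
      stop = start + extra
      break if stop >= pieces.size
      peptides << pieces[start..stop].join
    end
  end
  peptides
end

# Made-up sequence for illustration.
p tryptic_peptides("MAKRPSTKLLR")
p tryptic_peptides("MAKRPSTKLLR", 1)
```

Supporting a different enzyme would amount to swapping the regular expression, which is why the rule is kept in one place.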

Behavior Driven Development

Mspire has been written following behavior-driven development (BDD) practice with RSpec (http://rspec.info). Specifications for desired behavior are tested with examples to ensure the software is operating correctly, allowing features to be added and implementation details to be altered rapidly while preserving correct behavior. These extensive examples (currently 269) also provide programmers with a comprehensive reference for package usage.

Licensing and Operating Systems

The open source Ruby package mspire is released under an MIT-style license, permitting commercial and non-commercial users alike nearly unrestricted usage (see the mspire package 'LICENSE' file for details). It should run under any Linux/Unix system and under Windows, either with Cygwin or natively. Mspire has been tested extensively on 32- and 64-bit Ubuntu systems and on Cygwin.

Methods

Memory and speed testing for Arrayclass

Memory testing (see Fig. 1B) was performed on an Intel dual processor 3.2 GHz Pentium IV with 2GB RAM running Ubuntu 7.04, Ruby 1.8.5, arrayfields 4.5.0, SuperStruct 1.01, and Arrayclass 0.0.1. Swap was flushed before each test, and garbage collection was either disabled (GC.disable) or left on. Objects were initialized with a fifteen-member array of floating point numbers (which are still objects in Ruby) passed in as a list (for Struct and SuperStruct objects) or as an array object (for all the remaining objects), depending on the type of initialization each class requires. Objects were created in this manner until the system began to write to swap (determined by querying /proc/meminfo [SwapCached] after the creation of every 100 objects). Total memory consumed was then calculated as initial memory available less final memory available plus memory written to swap (in meminfo this is initial 'MemFree' less final 'MemFree' plus 'SwapCached'). This test was performed four times for each class in random order.
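
The memory calculation described above can be written out as a short sketch. The helper names are hypothetical and the sample values are made up for illustration; this is not the script used to produce Fig. 1B.

```ruby
# Hypothetical helpers illustrating the calculation described in the text:
# consumed = initial MemFree - final MemFree + SwapCached (all in kB).

# Parse /proc/meminfo-style text into a Hash of field name => kB value.
def meminfo_kb(text)
  text.scan(/^(\w+):\s+(\d+)\s*kB/).each_with_object({}) do |(key, value), fields|
    fields[key] = value.to_i
  end
end

def memory_consumed_kb(initial, final)
  initial['MemFree'] - final['MemFree'] + final['SwapCached']
end

# Made-up snapshots; on Linux the text would come from File.read('/proc/meminfo').
before = meminfo_kb("MemTotal: 2074000 kB\nMemFree: 1800000 kB\nSwapCached: 0 kB\n")
after  = meminfo_kb("MemTotal: 2074000 kB\nMemFree: 20000 kB\nSwapCached: 1200 kB\n")
puts memory_consumed_kb(before, after)
```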

Time to read mzXML files from Peptide Atlas

Timing was performed on an Intel dual processor 3.8 GHz Pentium IV with 2GB RAM running Ubuntu 6.10 and Ruby 1.8.4. Because of the extensive time involved, the system was not dedicated exclusively to this task, so some small aberrations in timing appear.

Extended Figure Legends

Fig. 1A

Overview of mspire functionality. Mspire gives developers the ability to work with MS data across the mass spectrometry analysis pipeline and provides useful tools for end-users. Colored boxes represent software state or actions. Italics indicate a file format or, if boxed, software objects. Blue boxes represent third party software while magenta boxes are features of mspire. Black arrows represent input/output independent of mspire while red arrows are input/output performed by mspire. (Top left) fasta_shaker.rb offers alternative methods for fasta database shuffling for use in false identification rate determination. (Mid left) MS::MSRun is a unified data format for working with LC-MS/MS data sets, regardless of file format. (Top middle) Raw spectral data and a fasta database are fed into a database search engine, which produces peptide spectral matches (PSMs) in a .srf binary file. These may alternatively be exported from the Bioworks Multiconsensus View. Mspire extracts information from the srf file and presents PSMs via a simple interface (while retaining access to the full, underlying data structures). (Right) These objects can be converted into the .sqt or pepXML file formats for downstream processing, or they can be filtered with common SEQUEST filtering parameters. After filtering or analysis with Peptide/Protein Prophet or Percolator, false identification rates can be summarized and supplemented using SBV (sample bias validation).

Fig. 1C

The majority of PeptideAtlas data repository mzXML files (as of 2008-07-08) were read using the lazy evaluation options :string and :io (3650 and 5378). The :string option (which is the default) decodes base64 spectra into a more compressed string (which it splits into two arrays of characters corresponding to m/z and intensity arrays). Only when the spectral information is called are the values cast into Ruby Floats. The :io option reads byte indices and then only reads (from an open file handle) and decodes spectra when their information is accessed. The :string evaluation method can process files up to ~400 MB on a machine with 2GB of RAM while the :io method has essentially no file size limit.
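
The idea behind the :string strategy can be illustrated with a small, self-contained sketch. The class below is hypothetical and simplified (it keeps a single binary string rather than splitting it in two), and it assumes peaks stored as 32-bit network-order floats in interleaved (m/z, intensity) pairs, which is typical for mzXML but is an assumption here. It is not mspire's implementation.

```ruby
# Hypothetical class (not mspire's API) sketching lazy spectrum decoding:
# the compact decoded binary string is kept, and Ruby Floats are created
# only when the peak data are first requested.
class LazySpectrum
  def initialize(base64_peaks)
    # 'm' is Ruby's built-in base64 directive; the result is a binary string.
    @binary = base64_peaks.unpack1('m')
    @mzs = nil
    @intensities = nil
  end

  def mzs
    cast! unless @mzs
    @mzs
  end

  def intensities
    cast! unless @intensities
    @intensities
  end

  # True once the Float arrays have been built.
  def cast?
    !@mzs.nil?
  end

  private

  # Assumed layout: big-endian 32-bit floats, interleaved m/z and intensity.
  def cast!
    values = @binary.unpack('g*')
    @mzs, @intensities = values.each_slice(2).to_a.transpose
    @mzs ||= []
    @intensities ||= []
    @binary = nil # the binary form is no longer needed
  end
end

# Made-up peaks, encoded the same way for the demonstration.
encoded = [[100.0, 10.0, 200.5, 20.0].pack('g*')].pack('m0')
spectrum = LazySpectrum.new(encoded)
puts spectrum.cast?          # no Floats have been created yet
puts spectrum.mzs.inspect    # the first access triggers the cast
```

An :io-style variant would store only a byte offset and length per spectrum and read from the open file handle inside the casting step, which is what removes the file size limit at the cost of later disk reads.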

Fig. 1D

MSRun object model and 'use case'. The diamond represents composition. Lines represent associations. 0..* means that zero or more of these objects are contained in the associated object and 1..* means one or more. An MSRun is composed of scans, which are in turn composed of spectra. Each scan may be associated with 0 or 1 precursors. Each precursor is associated with one parent scan.
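
The relationships in this legend can be mirrored with a few Ruby Structs. All class and field names below are illustrative stand-ins chosen for this sketch, not mspire's actual classes; only the associations (a run holds scans, a scan has zero or one precursor, a precursor points to one parent scan) come from the legend.

```ruby
# Hypothetical model; names and fields are illustrative, not mspire's classes.
Precursor = Struct.new(:mz, :intensity, :parent)          # parent: the scan it was selected from
Scan = Struct.new(:num, :ms_level, :spectrum, :precursor) # precursor is nil for MS scans
Run = Struct.new(:scans) do
  # Count scans at a given MS level; 0 counts all scans.
  def scan_count(ms_level = 0)
    return scans.size if ms_level.zero?
    scans.count { |scan| scan.ms_level == ms_level }
  end
end

# Made-up data: one MS scan and one MS/MS scan derived from it.
ms1 = Scan.new(1, 1, [[400.0, 500.0], [10.0, 20.0]], nil)
ms2 = Scan.new(2, 2, [[150.0, 250.0], [5.0, 7.0]], Precursor.new(500.0, 20.0, ms1))
run = Run.new([ms1, ms2])

puts run.scan_count      # all scans
puts run.scan_count(2)   # MS/MS scans only
puts ms2.precursor.parent.num
```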

Fig. 1E

Line 3: An MSRun object can also be instantiated without parsing or storing spectra for faster and more memory efficient use. Line 4: Typical instantiation (includes spectra). Note that either mzXML or mzData files can be input. Line 6: accessing the total number of scans [equivalent to scan_count(0)]. Line 7: number of MS scans. Line 8: number of MS/MS scans. Line 9: retrieves the start and end m/z values for all MS/MS scans (this function will determine these values even if the underlying XML run information does not contain them). Line 11: a Ruby block that selects only MS/MS scans. Lines 13-16: the scans are mapped to intensities; the block (delimited by 'do' and 'end') receives the scan object and returns the value of its last line, and these values are collected as an array (list_of_intensities). Lines 14-15: object calls are chained to move easily between the precursor and a spectrum method call. The intensity determined here may also be obtained more succinctly with the prc.intensity method call. This call uses the intensity recorded as meta-information in the XML file if it is available; otherwise, it uses the procedure shown here. The lookup of the index of the m/z value is implemented with a fast binary search.
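
To make the binary search lookup concrete, here is a minimal sketch that finds the index of the m/z value nearest a target in an ascending array and then reads the intensity at that index. The method name and the nearest-value tie-breaking are assumptions for illustration and are not taken from mspire's source.

```ruby
# Hypothetical helper (not mspire's API): index of the value in an ascending
# array of m/z values that lies closest to `target`, found by binary search.
def nearest_mz_index(mzs, target)
  return nil if mzs.empty?
  lo = 0
  hi = mzs.size - 1
  # Narrow down to the first index whose value is >= target (or the last index).
  while lo < hi
    mid = (lo + hi) / 2
    if mzs[mid] < target
      lo = mid + 1
    else
      hi = mid
    end
  end
  # The left neighbor may be closer; ties go to the lower index.
  if lo > 0 && (target - mzs[lo - 1]).abs <= (mzs[lo] - target).abs
    lo - 1
  else
    lo
  end
end

# Made-up parent spectrum and precursor m/z.
mzs         = [100.0, 200.5, 300.25, 450.0]
intensities = [10.0, 20.0, 35.0, 5.0]
index = nearest_mz_index(mzs, 300.0)
puts intensities[index]
```

The search takes a number of steps proportional to the logarithm of the spectrum length, which matters when it is repeated for every MS/MS scan in a run.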

Mail comments/questions to Edward Marcotte at marcotte@icmb.utexas.edu.

Last modified: July 15, 2008

Copyright © 2008 John Prince and Edward Marcotte.