Supplementary Information: Date & Marcotte, Nature Biotechnology 21(9), September 2003.

Important note (added Aug 3, 2006): see the correction at the end of this page.


Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages

Shailesh V Date1 & Edward M Marcotte1,2

1Center for Computational Biology and Bioinformatics, Institute for Cellular and Molecular Biology, and
2Department of Chemistry and Biochemistry

1 University Station A4800, University of Texas, Austin, Texas 78712. Address all correspondence to EMM: marcotte AT icmb dot utexas dot edu

Abstract

We introduce a general computational method, applicable on a genome-wide scale, for the systematic discovery of uncharacterized cellular systems. Quantitative analysis of the coinheritance of pairs of genes among different organisms, calculated using phylogenetic profiles, allows the prediction of thousands of functional linkages between the corresponding proteins. A comparison of these functional linkages to known pathways reveals that calculated linkages are comparable in accuracy to genome-wide yeast two-hybrid screens or mass spectrometry interaction assays. In aggregate, these linkages describe the structure of large-scale networks, with the resulting yeast network composed of 3,875 linkages among 804 proteins, and the resulting pathogenic Escherichia coli network composed of 2,043 linkages among 828 proteins. Application of the method, which involves a systematic search of such networks for groups of uncharacterized, linked proteins, led to the identification of 27 novel cellular systems from one nonpathogenic and three pathogenic bacterial genomes.


Sections

1. Downloadable files containing phylogenetic profiles and content description
2. Details of calculation of mutual information
3. Downloadable files containing protein pairs with mutual information scores and content description
4. Supplementary Table 1 : Proteins predicted to function with the E. coli K12 flagellar biosynthesis protein FlgD
5. Supplementary Figure 1 : Complete list of clusters identified from 4 different organisms
6. Supplementary Figure 2 : Varying homology thresholds does not affect algorithm performance


Notes on compressed files:
- gzipped files (.gz) can be opened with Powerarchiver 6.1 or WinZip on Windows.
- bzip2 files (.bz2) can be opened with bzip for windows on Windows.
1. Files containing phylogenetic profiles:

  • Content

  • identifier 1st genome 2nd genome ...... 59th genome

    >ccrescentus|gi|13421087|gb|AAK21989.1| 1.00 1.00 ...... 1.00
    >ccrescentus|gi|13421088|gb|AAK21990.1| 1.00 1.00 ...... 1.00
    >ccrescentus|gi|13421089|gb|AAK21991.1| 0.03 0.03 ...... 0.07

    Values for each genome indicate -1/log(BLAST E-value) for the given gene
    (see Genome Key for information about genomes in the profiles).
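
As a rough illustration of the layout above, the following Python sketch splits one profile line into its identifier and per-genome values. It assumes whitespace-separated fields and that the GI number is the third `|`-delimited field of the identifier, as in the sample rows; `parse_profile_line` is a hypothetical helper name, not something shipped with the files.

```python
from typing import List, Tuple


def parse_profile_line(line: str) -> Tuple[str, str, List[float]]:
    """Split one profile line into (identifier, GI number, per-genome values)."""
    fields = line.split()
    identifier = fields[0].lstrip(">")
    # Assumption: identifiers look like 'organism|gi|<GI>|gb|<accession>|'.
    gi = identifier.split("|")[2]
    values = [float(x) for x in fields[1:]]
    return identifier, gi, values


# Shortened example row; a real row would carry one value per genome.
ident, gi, values = parse_profile_line(
    ">ccrescentus|gi|13421089|gb|AAK21991.1| 0.03 0.03 0.07"
)
print(gi, values)
```

Extracting the GI number at this stage may be convenient because the mutual information files identify proteins by GI only.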

  • Download

  • Organism: Phylogenetic profile set
    C. crescentus [ 1.3 MB, uncompressed | 116 KB, compressed, .gz ]
    E. coli K12 [ 1.4 MB, uncompressed | 128 KB, compressed, .gz ]
    E. coli O157H7 [ 1.8 MB, uncompressed | 172 KB, compressed, .gz ]
    P. aeruginosa [ 1.9 MB, uncompressed | 196 KB, compressed, .gz ]
    S. aureus [ 864 KB, uncompressed | 92 KB, compressed, .gz ]
    V. cholerae [ 1.3 MB, uncompressed | 120 KB, compressed, .gz ]
    S. cerevisiae [ 2.1 MB, uncompressed | 88 KB, compressed, .gz ]


2. Calculating mutual information values:
    
    We know that
    
    Intrinsic entropy H(A) = -Σ p(a) x log p(a), summed over all bins a of profile A......equation 1
    Joint entropy H(A,B) = -Σ p(a,b) x log p(a,b), summed over all joint bins (a,b) of profiles A and B......equation 2
    MI(A,B) = H(A)+H(B)-H(A,B)......equation 3
    
    Consider 4 different profiles, each with a different distribution of its components. 
    Values towards 1 indicate absence, and vice-versa. The distribution is plotted in Figure C1.
    
    
    Protein A  0.1 0.2 0.2 0.1 0.3 0.7
    Protein B  0.0 0.0 0.0 0.0 0.0 0.0
    Protein C  1.0 1.0 1.0 1.0 1.0 1.0
    Protein D  1.0 1.0 1.0 0.0 0.0 0.0
    
    
    
    
    Figure C1: Distributions of pij values for the four example profiles. 
    
    
    
    If the values in the profiles are binned in intervals of 0.1, we will get the following bins.
    
    ----------------------------------
    Bins 	A 	B 	C 	D 
    ----------------------------------
    0	0	6	0	3 
    0.1	2	0	0	0 
    0.2	2	0	0	0 
    0.3	1	0	0	0 
    0.4	0	0	0	0 
    0.5	0	0	0	0 
    0.6	0	0	0	0 
    0.7	1	0	0	0 
    0.8	0	0	0	0 
    0.9	0	0	0	0 
    1	0	0	6	3
    ----------------------------------
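
The binning step can be sketched in Python as follows. This is only an illustration: it assumes each value is assigned to the nearest multiple of 0.1 (the text does not say whether values are rounded or truncated), and `bin_counts` is a hypothetical helper name.

```python
from typing import Dict, List


def bin_counts(values: List[float], n_bins: int = 11) -> List[int]:
    """Count how many profile values fall in each bin 0, 0.1, ..., 1."""
    counts = [0] * n_bins
    for v in values:
        # Assumption: nearest-bin assignment, clamped to the range [0, 1].
        index = min(max(round(v * (n_bins - 1)), 0), n_bins - 1)
        counts[index] += 1
    return counts


profiles: Dict[str, List[float]] = {
    "A": [0.1, 0.2, 0.2, 0.1, 0.3, 0.7],
    "B": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "C": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "D": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
}

for name, values in profiles.items():
    print(name, bin_counts(values))
```

Each printed list corresponds to one column of the table above, read from bin 0 down to bin 1.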
    
    
    Let's calculate p(a) x log p(a) for the first non-zero bin in the profile of protein A, the 0.1 bin.
    
    Total number of elements in the profile = 6.
    p(a) = Total number of elements in bin / Total number of elements in profile
    
    p(a) = 2/6 = 1/3 = 0.3333
    log p(a) = ln (0.3333) = -1.0986
    p(a) x log p(a) = -0.3661
    
    Similarly, 
    p(a) x log p(a) for 0.2 bin is -0.3661 
    p(a) x log p(a) for 0.3 bin is -0.2986
    p(a) x log p(a) for 0.7 bin is -0.2986
    
    Substituting these values in eqn. 1, we get
    
    H(A) = -[-0.3661 + (-0.3661) + (-0.2986) + (-0.2986)]
         = -[-1.3294] 
         = 1.3294 
    
    Therefore, the intrinsic entropy of protein A is 1.3294.
    
    Similarly, entropies for other protein profiles are
    H(B) = 0
    H(C) = 0
    H(D) = 0.6931
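
The entropy calculation can be checked with a short Python sketch. It uses natural logarithms, as in the worked example, and assumes nearest-bin assignment in steps of 0.1; `entropy` is a hypothetical helper name. At full precision H(A) comes out at about 1.3297; the 1.3294 in the worked example reflects intermediate terms cut to four decimals.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(values: List[float], bins_per_unit: int = 10) -> float:
    """Intrinsic entropy (equation 1) of a binned profile, in nats."""
    # Assumption: each value goes to the nearest multiple of 1/bins_per_unit.
    bins = [round(v * bins_per_unit) for v in values]
    n = len(bins)
    # p * log(1/p) is the same as -p * log(p), with p = count / n.
    return sum((c / n) * math.log(n / c) for c in Counter(bins).values())


profiles: Dict[str, List[float]] = {
    "A": [0.1, 0.2, 0.2, 0.1, 0.3, 0.7],
    "B": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "C": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "D": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
}

for name, values in profiles.items():
    print(name, round(entropy(values), 4))
```

Profiles B and C have zero entropy because every value falls in a single bin, while D splits evenly between two bins and therefore has entropy ln 2.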
    
    For calculating joint entropies H(A,B) in equation 2, we perform similar calculations.  
    Each position contributes one count to the joint bin formed by the pair of values observed at that position in the two profiles. 
    
    Protein X  0.1 0.2 0.2 0.1 0.3 0.7
    Protein Y  0.1 0.3 0.3 0.1 0.5 0.7
    
    Bin (0.1,0.1) = 2
    Bin (0.2,0.3) = 2
    Bin (0.3,0.5) = 1
    Bin (0.7,0.7) = 1
    
    If the profile of protein A is compared with itself, the joint bins have the same counts as the bins of A alone, and the joint 
    entropy will be the same as the intrinsic entropy.
    
    H(A,A) = 1.3294
    
    Substituting the entropy values in equation 3, we get
    
    MI (A,A) = 1.3294 + 1.3294 - 1.3294 = 1.3294
    
    -----------------------------------------------------
    NOTE THAT IN ACTUAL CALCULATIONS, LOG BASE 2 IS USED.
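
Equations 1 to 3 can be combined into a single function, sketched below in Python. This is not the authors' posted pseudocode; it is a minimal illustration that assumes nearest-bin assignment in steps of 0.1, and the names `mutual_information` and `_entropy` are hypothetical. The `base` parameter defaults to 2 to match the note above; pass `math.e` to reproduce the natural-log worked example.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, List


def _entropy(symbols: Iterable[Hashable], base: float) -> float:
    """Entropy of a sequence of discrete symbols (equations 1 and 2)."""
    symbols = list(symbols)
    n = len(symbols)
    return sum((c / n) * math.log(n / c, base) for c in Counter(symbols).values())


def mutual_information(x: List[float], y: List[float],
                       bins_per_unit: int = 10, base: float = 2) -> float:
    """MI = H(X) + H(Y) - H(X,Y) for two profiles of equal length (equation 3)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of genomes")
    # Assumption: each value goes to the nearest multiple of 1/bins_per_unit.
    bx = [round(v * bins_per_unit) for v in x]
    by = [round(v * bins_per_unit) for v in y]
    # A joint bin is the pair of binned values seen at the same position.
    h_xy = _entropy(zip(bx, by), base)
    return _entropy(bx, base) + _entropy(by, base) - h_xy


a = [0.1, 0.2, 0.2, 0.1, 0.3, 0.7]
b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(mutual_information(a, a, base=math.e))  # natural log, as in the example
print(mutual_information(a, a))               # log base 2
print(mutual_information(a, b))               # constant profile carries no information
```

With natural logarithms MI(A,A) is about 1.33, matching the worked example up to rounding; in base 2 the same quantity is about 1.92.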
    
    

3. Files containing protein pairs with mutual information values:

  • Content
  • QUERY|SUBJECT|MU VALUE|
    ------------------------------------
    13699918|13699919|0.887417927068569|
    13699918|13699920|0.122577636489111|
    13699918|13699921|0.391519316462225|
    ...
    ...
    ...
    Both query and subject are identified by Genbank Identifiers (GIs).
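
A line of this format can be read with a few lines of Python. The sketch below assumes the pipe-delimited layout shown above, including the trailing `|`, and skips the header, separator and elided lines; `parse_pair_line` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def parse_pair_line(line: str) -> Optional[Tuple[str, str, float]]:
    """Return (query GI, subject GI, mutual information), or None for non-data lines."""
    fields = line.strip().split("|")
    if len(fields) < 3:
        return None  # separator or elided line
    try:
        return fields[0], fields[1], float(fields[2])
    except ValueError:
        return None  # header line: third field is not a number


print(parse_pair_line("13699918|13699919|0.887417927068569|"))
print(parse_pair_line("QUERY|SUBJECT|MU VALUE|"))
```

GIs are kept as strings here so that they can be matched directly against the identifiers in the profile files.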
    
  • Download

  • Organism       | Mutual Information Value Files | Filtered Files*                | Highest MI       | Lowest MI
    ---------------------------------------------------------------------------------------------------------------------
    C. crescentus  | [ 86 MB, compressed, .bz2 ]    | ccrescentus-pairs-above-0.7.gz | 1.19895127672827 | 5.55111512312578e-17
    E. coli K12    | [ 123 MB, compressed, .bz2 ]   | ecoli-K12-pairs-above-0.7.gz   | 1.28267007200588 | 9.89124478767422e-07
    E. coli O157H7 | [ 165 MB, compressed, .bz2 ]   | ecoli-O157-pairs-above-0.7.gz  | 1.27509781507474 | 1.11022302462516e-16
    P. aeruginosa  | [ 205 MB, compressed, .bz2 ]   | paeruginosa-pairs-above-0.7.gz | 1.26654892510335 | 5.55111512312578e-17
    S. aureus      | [ 45 MB, compressed, .bz2 ]    | saureus-pairs-above-0.7.gz     | 1.20788005888389 | 1.11022302462516e-16
    V. cholerae    | [ 91 MB, compressed, .bz2 ]    | vcholerae-pairs-above-0.7.gz   | 1.34176921052772 | 5.55111512312578e-17
    S. cerevisiae  | [ 187 MB, compressed, .bz2 ]   | scerevisiae-pairs-above-0.7.gz | 1.32867461118848 | 5.55111512312578e-17

    *Filtered Files: Filtered files contain protein pairs with mutual information scores above random (above 0.7).
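
For readers who want to apply a cutoff of their own to a full pairs file, the filtering step might look like the Python sketch below. It assumes the compressed files hold the pipe-delimited format described under Content, and that "above 0.7" means strictly greater than 0.7; `open_text` and `iter_pairs_above` are hypothetical names, and the file name in the usage comment is a placeholder.

```python
import bz2
import gzip
from typing import IO, Iterator, Tuple


def open_text(path: str) -> IO[str]:
    """Open a plain, .gz or .bz2 file in text mode, chosen by file extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_pairs_above(path: str, threshold: float = 0.7) -> Iterator[Tuple[str, str, float]]:
    """Stream (query, subject, MI) for pairs scoring above the threshold."""
    with open_text(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("|")
            try:
                score = float(fields[2])
            except (IndexError, ValueError):
                continue  # header, separator or malformed line
            if score > threshold:
                yield fields[0], fields[1], score


# Usage (placeholder file name):
# for query, subject, score in iter_pairs_above("pairs.bz2"):
#     print(query, subject, score)
```

Streaming line by line avoids decompressing files of this size into memory all at once.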

4. Supplementary Table 1 : Proteins predicted to function with the E. coli K12 flagellar biosynthesis protein FlgD.

    Protein | Mutual information | Protein function
    ------------------------------------------------------------------------------------------
    FlgB    | 0.82               | Flagellar biosynthesis, cell-proximal portion of basal-body rod
    FlgK    | 0.80               | Flagellar biosynthesis, hook-filament junction protein 1
    FlgL    | 0.78               | Flagellar biosynthesis; hook-filament junction protein
    FliF    | 0.75               | Flagellar biosynthesis; basal-body MS(membrane and supramembrane)-ring and collar protein
    FlgE    | 0.75               | Flagellar biosynthesis, hook protein
    FliN    | 0.75               | Flagellar biosynthesis, component of motor switch and energizing, enabling rotation and determining its direction
    FlgF    | 0.75               | Flagellar biosynthesis, cell-proximal portion of basal-body rod
    FliG    | 0.75               | Flagellar biosynthesis, component of motor switching and energizing, enabling rotation and determining its direction
    FlgG    | 0.75               | Flagellar biosynthesis, cell-distal portion of basal-body rod
    FlgC    | 0.75               | Flagellar biosynthesis, cell-proximal portion of basal-body rod
    MotA    | 0.69               | Proton conductor component of motor; no effect on switching
    FliQ    | 0.69               | Flagellar biosynthesis
    FliS    | 0.68               | Flagellar biosynthesis; repressor of class 3a and 3b operons (RflA activity)
    FliR    | 0.68               | Flagellar biosynthesis
    FliC    | 0.67               | Flagellar biosynthesis; flagellin, filament structural protein
    Rnk     | 0.67               | Regulator of nucleoside diphosphate kinase
    FliM    | 0.64               | Flagellar biosynthesis, component of motor switch and energizing, enabling rotation and determining its direction
    YedA    | 0.63               | Putative transmembrane subunit
    FliD    | 0.62               | Flagellar biosynthesis; filament capping protein; enables filament assembly
    CsrA    | 0.60               | Carbon storage regulator; controls glycogen synthesis, gluconeogenesis, cell size and surface properties

    Legend: Results are shown from a comparison of the phylogenetic profile of FlgD with the phylogenetic profiles of all other proteins in E. coli. All proteins in E. coli were ordered by decreasing mutual information; the 20 highest-scoring proteins are shown, nearly all of which belong to the same pathway, flagellar biosynthesis.
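
The ranking described in the legend (score every other protein against a query profile, then sort by decreasing mutual information) can be sketched in Python as below. The toy profiles and the names `top_linked`, `mutual_information` and `_entropy` are illustrative assumptions; the sketch uses natural logarithms and nearest-bin assignment in steps of 0.1, and it is not the code used to produce the table.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def _entropy(symbols: Iterable[Hashable]) -> float:
    symbols = list(symbols)
    n = len(symbols)
    return sum((c / n) * math.log(n / c) for c in Counter(symbols).values())


def mutual_information(x: List[float], y: List[float]) -> float:
    # Assumption: values are binned to the nearest multiple of 0.1.
    bx = [round(v * 10) for v in x]
    by = [round(v * 10) for v in y]
    return _entropy(bx) + _entropy(by) - _entropy(zip(bx, by))


def top_linked(query: str, profiles: Dict[str, List[float]],
               k: int = 20) -> List[Tuple[str, float]]:
    """Rank all other proteins by decreasing mutual information with the query."""
    scores = [(name, mutual_information(profiles[query], values))
              for name, values in profiles.items() if name != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy profiles with six genomes each, for illustration only.
toy = {
    "A": [0.1, 0.2, 0.2, 0.1, 0.3, 0.7],
    "B": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "D": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    "Y": [0.1, 0.3, 0.3, 0.1, 0.5, 0.7],
}
ranked = top_linked("A", toy, k=2)
for name, score in ranked:
    print(name, round(score, 4))
```

In this toy set, Y ranks first because its bins change in step with those of A, D ranks second, and the constant profile B shares no information with A.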

5. Supplementary Figure 1 : Complete list of clusters identified from 4 different organisms



    Legend: 27 cluster cores are described from 4 organisms. Each core contains at least 3 proteins, of which >50% are uncharacterized. Several cores have been extended as described in the paper. Colored boxes represent homologs; filled circles mark clusters that are described in the paper.

6. Supplementary Figure 2 : Varying homology thresholds does not affect algorithm performance


    Legend: Varying the homology threshold used to filter out linkages between homologs has little effect on algorithm performance. Shown here are results (comparable to Figure 2B of the paper) in which the homology threshold is varied from a BLAST E-value of 10^-3 to 10^-5 without degrading the ability to reconstruct networks.

Important note (added Aug 3, 2006)

  • Correction
    One of our careful readers recently pointed out that in several places in the paper we state that we used 57 genomes, when in fact we used 59. It should indeed be 59, and we apologize for this error. All results, and all profiles used and available for download, are based on 59 genomes (including Figure 6).

  • Pseudocode for calculating mutual information between profiles
    Due to popular demand, we are posting pseudocode for calculating mutual information between profiles. Click
    here for a copy.

  • Copyright © 2003, Shailesh V Date and Edward M Marcotte
    For Questions/Comments, please mail: marcotte AT icmb.utexas.edu