Iterative cluster analysis of protein interaction data


University of Valencia
Faculty of Biology
 

DESCRIPTION

 

 

1. Using UVCLUSTER

2. UVCLUSTER flow chart

3. Output files

4. Speed of execution

5. Analysis of a synthetic graph

 

             Using UVCLUSTER

             UVCLUSTER must be run from a command window using the following syntax:

                                                      uvcluster NN AC

             Where:   NN = Number of random solutions (1)

                            AC = Affinity coefficient (2)  [1-100]

                                       

                                        (1) Using at least 10 times the number of elements is recommended.

                                        (2) Affects the strictness of the analyses; see Arnau et al. 2004 for details.
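
As a small illustration of the two notes above, the following Python sketch derives NN from the number of elements using the recommended factor of 10, checks that AC lies in the documented range, and assembles the command line. The helper name build_uvcluster_command is hypothetical and is not part of UVCLUSTER; only the "uvcluster NN AC" syntax comes from this page.

```python
def build_uvcluster_command(n_elements: int, ac: int, factor: int = 10) -> str:
    """Assemble a 'uvcluster NN AC' command line (hypothetical helper).

    NN is set to factor * n_elements; this page recommends a factor
    of at least 10. AC must lie in the documented range 1-100.
    """
    if n_elements < 1:
        raise ValueError("the number of elements must be positive")
    if not 1 <= ac <= 100:
        raise ValueError("AC must lie in the range 1-100")
    nn = factor * n_elements
    return f"uvcluster {nn} {ac}"


# 34 elements, strictest affinity coefficient
print(build_uvcluster_command(34, 100))
```

For a dataset of 34 elements this yields "uvcluster 340 100", the minimum recommended number of random solutions.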

 

                UVCLUSTER Flow chart

 

 

 

1. UVCLUSTER analyses begin by importing a text file containing a dataset of direct protein-protein interactions.

2. First optional filter: use only interactions between pairs of proteins in a list, or exclude all interactions involving a protein in the list.

3. Second optional filter: set a cutoff value for the maximum/minimum number of interactions required for inclusion in the analysis.

4. Generation and saving of the matrix of primary distances, which can later be used as input for UVCLUSTER. The matrix is stored in two files, with extensions .pro and .tab.

5. Selection of the proteins of interest: from a list, or by selecting every protein within distance N of a chosen one.

6. Generation of the matrix of secondary distances and saving of the output files.
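
The first stages of the flow chart can be sketched in Python. Two assumptions are made here that this page does not confirm: the input file is taken to hold one interaction per line as two whitespace-separated protein names, and the primary distance between two proteins is taken to be the number of interactions on the shortest path connecting them. All function names are hypothetical; this is not UVCLUSTER's own code.

```python
from collections import deque


def parse_interactions(text):
    """Build an undirected graph from lines of the form 'proteinA proteinB'
    (assumed input format)."""
    graph = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        a, b = fields[0], fields[1]
        if a == b:
            continue  # ignore self-interactions
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distances_from(graph, start):
    """Breadth-first search: number of interaction steps from 'start'
    to every reachable protein (assumed meaning of primary distance)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def within_distance(graph, protein, n):
    """Select every protein within distance n of the chosen one."""
    reachable = distances_from(graph, protein)
    return sorted(p for p, d in reachable.items() if d <= n)


# Hypothetical four-protein chain A - B - C - D
graph = parse_interactions("A B\nB C\nC D\n")
print(distances_from(graph, "A"))
print(within_distance(graph, "A", 2))
```

In this toy chain, selecting every protein within distance 2 of A keeps A, B and C and leaves out D.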

               

              Output files

 

   The files are automatically named according to the following formula:

           (Selection)_(Name of matrix of primary distances employed)_NN_AC#

           Adding respectively: "_S1.txt", "_S2.txt", ".pgm" or "_pgmpro.txt"

          (Selection) can be:         (Selected protein)_(Distance)_

                                              (Name of the list of proteins)_

 

 

S1_Output File:

The first output file contains the tables of primary and secondary distances among the chosen elements, plus the values of several significant parameters used in the analyses. It also contains a table of secondary distances that can be copied to a text file and imported directly into MEGA 2.1.

S2_Output File:

The second output file shows the results of an agglomerative hierarchical clustering using UPGMA performed with the secondary distance data.
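
For readers unfamiliar with UPGMA, the following is a minimal textbook sketch in Python, not UVCLUSTER's implementation. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the size-weighted average of the two old distances. The function name and the example matrix are hypothetical.

```python
def upgma(names, dist):
    """Agglomerative clustering with average linkage (UPGMA).

    names: list of labels; dist: symmetric matrix as a list of lists.
    Returns the tree as nested tuples and the list of merges
    (left subtree, right subtree, distance at which they were joined).
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # min() keeps the first minimum, so ties are broken by storage order
        (a, b), distance = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # size-weighted average of the distances to the merged clusters
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append((tree_a, tree_b, distance))
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, merges


# Hypothetical distances: A-B and C-D are close pairs, far from each other
names = ["A", "B", "C", "D"]
dist = [
    [0, 1, 6, 6],
    [1, 0, 6, 6],
    [6, 6, 0, 2],
    [6, 6, 2, 0],
]
tree, merges = upgma(names, dist)
print(tree)
```

With this matrix the result is (('A', 'B'), ('C', 'D')). Note that the sketch resolves ties arbitrarily, by storage order, which is the behaviour that becomes a problem when many distances are equal.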

PGM_Output Files:

The third output file is a graphical representation of the data in PGM (Portable Graymap) format. It consists of a square made up of K² smaller gray-shaded squares, each indicating the degree of interaction between a pair of proteins. PGM files can be read with freeware programs such as IrfanView.

The order of the proteins in the matrix is optimized to highlight the clusters. This ordering is provided in the file *_pgmpro.txt and corresponds to a UPGMA clustering.
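
To make the image format concrete, here is a minimal Python sketch that renders a distance matrix as a plain-text (P2) PGM image, one shaded square per pair of proteins, with lighter shades for smaller distances. Which PGM variant UVCLUSTER writes, and how it scales distances to grey levels, is not stated on this page; the linear scaling, the function name and the cell parameter are assumptions for illustration only.

```python
def distances_to_pgm(dist, cell=4, maxval=255):
    """Render a square distance matrix as plain-text PGM (P2).

    Each matrix entry becomes a cell x cell block of pixels.
    Distance 0 maps to white (maxval), the largest distance to black (0).
    """
    k = len(dist)
    dmax = max(max(row) for row in dist) or 1  # avoid division by zero
    pixels = []  # flat, row-major list of grey values
    for i in range(k):
        pixel_row = []
        for j in range(k):
            grey = round(maxval * (1 - dist[i][j] / dmax))
            pixel_row.extend([grey] * cell)
        for _ in range(cell):
            pixels.extend(pixel_row)
    # plain PGM only needs whitespace-separated values; wrap at 16 per
    # line to stay under the 70-character line length the format asks for
    lines = [" ".join(map(str, pixels[i:i + 16]))
             for i in range(0, len(pixels), 16)]
    side = k * cell
    return f"P2\n{side} {side}\n{maxval}\n" + "\n".join(lines) + "\n"


# Hypothetical 3 x 3 matrix, already in the desired display order
example = [
    [0, 1, 4],
    [1, 0, 4],
    [4, 4, 0],
]
with open("example.pgm", "w") as handle:
    handle.write(distances_to_pgm(example))
```

The rows and columns are drawn in the order given, so reordering the matrix beforehand (for instance following a UPGMA tree) is what makes clusters appear as light blocks along the diagonal.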

 

              Speed of execution

All times were obtained on a standard PC (Intel Pentium IV 2.8 GHz processor with 512 MB of RAM).

 
The complete data available for S. cerevisiae in the January 2004 release of the DIP database (4721 proteins, 15210 interactions) can be converted into a primary distance table in 14 minutes. This matrix of primary distances was used as the initial input for the analyses.

Execution time increases linearly with the number of iterations and non-linearly as AC becomes smaller.

 

Elements in the dataset                      No. of iterations   AC = 100    AC = 50

34 elements and 561 primary distances        10,000              < 2 sec.    < 2 sec.
(see example for details)

34 elements and 561 primary distances        100,000             9 sec.      13 sec.
(see example for details)

150 randomly chosen proteins                 10,000              9 sec.      125 sec.
(11,175 primary distances)

500 randomly chosen proteins                 10,000              23 min.     160 min.
(124,750 primary distances)
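
The numbers of primary distances quoted in the table match one distance per unordered pair of elements, that is n(n-1)/2 for n elements. The short Python check below reproduces them; the function name is only illustrative.

```python
def primary_distance_count(n_elements: int) -> int:
    """Number of unordered pairs among n elements: n * (n - 1) / 2."""
    return n_elements * (n_elements - 1) // 2


for n in (34, 150, 500):
    print(n, primary_distance_count(n))
# prints:
# 34 561
# 150 11175
# 500 124750
```

Because the number of pairs grows quadratically, going from 150 to 500 proteins multiplies the number of distances by about 11, which is part of why the larger datasets take so much longer.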

 

  Synthetic example

 

The following graph illustrates the "ties in proximity" problem inherent to protein interaction networks, which can be tackled using the hierarchical clustering strategy of UVCLUSTER.

Two clusters (units 1-4 and 8-11) are obvious.

 

 

A UPGMA tree built from the primary distances clearly fails to detect the two clusters.

This error occurs whenever ties are resolved in such a way that, by chance, units 4 and 5 (or, alternatively, 7 and 8) are clustered together.

 

 

When the UPGMA algorithm is applied to the set of secondary distances obtained using UVCLUSTER with N=10000 and AC=100, the topology of the tree corresponds very closely to the graph.

Distances among units 1-3 or 9-11 are equal to zero, and units 4 and 8 are the most closely connected to them.

 

 

Download Input and Output files

PGM Output file.

The order of the proteins in the matrix is as follows: 1, 2, 3, 4, 11, 10, 9, 8, 6, 5 and 7.

The lighter the grey, the closer the two units.