Novel methods to protein mutation prediction based on Naïve Bayes classifier – An application in Influenza A virus
**Anh Tuan Tran¹, Ly Le², and Bao The Pham¹**
¹ VNUHCM – University of Science, 227 Nguyen Van Cu street, District 5, Ho Chi Minh City
² VNUHCM – International University, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City
Abstract
The high mutation rate of several viruses is one of the main causes of dangerous epidemics and pandemics. There is therefore a demand for accurate prediction of dangerous mutations that lead to new pathogenic strains resistant to current drugs and vaccines. In this research, we propose novel methods for mutation prediction based on the Naïve Bayes Classifier and conduct experiments on Neuraminidase and Hemagglutinin of influenza A virus. Our methods give considerable accuracy and matching scores. These results show that our predicted sequences have a high likelihood of sharing the structures of their protein families and provide useful information for antiviral drug design.
Citation: Tran AT, Le L, & Pham BT (2015) Novel methods to protein mutation prediction based on Naïve Bayes classifier – An application in Influenza A virus. Genomic Medicine 2015, eds Le L & Pham S (Ho Chi Minh City, Viet Nam).
VJS Editor: Phuc Le, Center for Value-based Care Research, Cleveland Clinic, USA
Introduction
The rapid evolution of several viruses, such as HIV and influenza, threatens health care systems worldwide (1-4). These viruses have high error (mutation) rates during reproduction (transcription or replication of the genome), which produce a wide variety of genotypes. High error rates give the viruses the ability to adapt to new environments by escaping the immune system and resisting current drugs. Mutations in protein sequences change the corresponding polypeptide chains and protein structures. Consequently, designing drugs or vaccines against these viruses is a challenge for biologists. Moreover, beyond their natural hosts, these viruses can infect and cause severe disease in other hosts. Since 1977, for example, there have been numerous reports of avian or swine influenza viruses infecting humans (5-8).
A mutated genotype can be synonymous with the original genome, in which case it appears not to affect the corresponding proteins; a non-synonymous mutation, by contrast, contributes significantly to the evolution of protein sequences and structures. Only mutated proteins that survive selection pressure allow a species to evolve and adapt to new environments.
Mutations occur randomly and unpredictably, but the mutations that survive selection may follow a predictable pattern. Protein sequences are known to evolve more rapidly than structures (9), possibly because protein structure determines protein function. A mutated sequence that changes the structure may impair the protein's function, so an individual carrying it cannot live normally. Thus, we can expect that surviving mutated proteins preserve the protein structure and function.
Several tools related to this subject have been developed, especially for correlated mutation prediction and protein residue-residue contact map prediction (10-16). Researchers have also analyzed the impact of mutations on proteins (17, 18). However, we could not find any related work that directly predicts protein mutations; these studies do not indicate which mutants will occur in the future. In this paper, we investigate how to predict, from the sequence data of a protein family, specific protein mutations that will survive natural selection.
The protein sequence sets that form the main data for this study are of a non-orderable discrete type. A statistical learning method such as the Naïve Bayes Classifier is therefore appropriate for this problem (19). These data sets are usually large, with thousands of instances and hundreds of attributes. While more complex learning methods take a long time to run, the Naïve Bayes Classifier can learn from the data in an acceptable time with reasonable accuracy. Hence, we proposed a method based on the Naïve Bayes Classifier for predicting protein mutations that survive natural selection pressure, and conducted experiments on the membrane glycoproteins of influenza A virus in Viet Nam.
Methodology
Our method comprised four main steps: preprocessing, multiple sequence alignment, target determination, and mutation prediction. The overall procedure is shown in Figure 1.

Figure 1. Overall methodology
**Preprocessing**
In the first step, we automatically eliminated sequences that contained unexpected characters (Algorithm 1). Several sequences were incomplete or contained uncommon amino acids, so preprocessing was necessary to reduce noise.
Algorithm 1. Preprocessing
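The listing for Algorithm 1 is not reproduced here, so the following is only a minimal Python sketch of the kind of filter described above. It assumes that "unexpected characters" means anything outside the 20 standard amino acid letters; the function names and the `min_length` parameter (a stand-in for some completeness check) are hypothetical and not taken from the original algorithm.

```python
# Assumption: a valid sequence uses only the 20 standard amino acid letters.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def preprocess(records, min_length=1):
    """Keep sequences made only of standard amino acids.

    min_length is a hypothetical knob for discarding incomplete sequences.
    """
    return [
        (header, seq)
        for header, seq in records
        if len(seq) >= min_length and set(seq) <= STANDARD_AMINO_ACIDS
    ]
```

For example, a record whose sequence contains `X` or `B` would be dropped, while a lower-case but otherwise valid sequence would be kept after upper-casing.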

**Multiple sequence alignment**
The protein sequences in the data set were not equal in length, so we had to normalize them to the same length. Multiple sequence alignment by dynamic programming is time consuming (20). Therefore, we used the progressive method with the neighbor-joining algorithm and the Jukes-Cantor evolutionary distance for this task (21-24).
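To make the distance measure concrete, here is a small Python sketch of a Jukes-Cantor distance between two aligned protein sequences. It assumes the common generalization of the Jukes-Cantor correction to 20 equally likely amino acid states and skips columns containing a gap; the exact variant and gap handling used in this work are not stated, so both choices are assumptions, as are the function names.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    return differing / compared


def jukes_cantor_distance(seq_a, seq_b, n_states=20):
    """Jukes-Cantor corrected distance: d = -b * ln(1 - p / b), with b = (n - 1) / n."""
    p = p_distance(seq_a, seq_b)
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf  # the correction is undefined once p reaches saturation
    return -b * math.log(1.0 - p / b)
```

The corrected distance is always at least as large as the raw fraction of differences, because it accounts for multiple substitutions at the same site.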
Target determination
Following Mitchell (19), the $i$-th protein sequence in data set $P$ is called an instance (sample) and denoted by $p_i$; the $j$-th position of $p_i$ is an attribute. In addition, we needed to determine the target $t_i$ for each instance $p_i$. We proposed three methods: (1) the first is based on the phylogenetic tree; (2) the second is based on the order of the protein sequence; (3) the third is a combination of the two.
Phylogenetic tree
We assumed that phylogenetic trees accurately reflect the evolutionary processes of protein families. From these processes and trees, we could then learn the rules of protein families and predict surviving sequences that will occur in the future. The target for each sequence is the sequence's child, equation (1)
![]() | (1) |
|---|
where
are the children of $p_i$.
We constructed binary guide trees using the neighbor-joining algorithm, the Jukes-Cantor evolutionary distance, and dynamic programming for pairwise alignment (20, 24, 25). From the guide trees, we constructed phylogenetic trees by determining the parent of each pair of sequences that share the same higher-level node (Algorithm 2).
Algorithm 2. Phylogenetic tree construction

Protein sequence order and composition
To maintain its function, the structure of a surviving protein evolves more slowly than its sequence. We therefore assumed that the order and composition of a sequence carry information about its structure, so that we could predict future sequences having the same structure as the protein family. The target for each sequence was determined by equation (2).
| $s_i = p_i$ | (2) |
|---|
Mutation prediction
For each position in the protein sequences, we built a learner based on the Naïve Bayes Classifier to predict the mutation at that residue. We then combined the predictions for all positions to obtain the final mutated sequence. Given a protein sequence $p_i$, the probability that its $j$-th residue mutates into a specific amino acid, namely the $k$-th element of the set of possible amino acids, was expressed by (3), based on Bayes' formula.
![]() | (3) |
|---|
We assumed that the attributes of each instance are independent, so the probability could be rewritten as (4). The terms appearing in (4) were given by (5) and (6), respectively
![]() | (4) | |
|---|---|---|
![]() | (5) | |
![]() | (6) |
where $|\cdot|$ is the number of members of a set, $L(p_i) = |p_i|$ is the length of the string, and $t$ is the target ($t = f$ for the first method and $t = s$ for the second method).
The mutated $j$-th residue was predicted by equation (7).
![]() | (7) |
|---|
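The per-position learner described above can be illustrated with a short Python sketch. It is not a transcription of equations (3) to (7): it estimates the prior and the conditional probabilities by simple counting, applies add-one smoothing to avoid zero probabilities, treats the gap symbol as an extra residue, and picks the residue with the highest log-posterior. The smoothing, the alphabet, and all names (`PositionLearner`, `predict_mutant`) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Assumption: 20 standard amino acids plus the alignment gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


class PositionLearner:
    """Naive Bayes learner for the target residue at one position j."""

    def __init__(self, j, alpha=1.0):
        self.j = j
        self.alpha = alpha                          # add-alpha smoothing (assumption)
        self.class_counts = Counter()               # target residue at j -> count
        self.feature_counts = defaultdict(Counter)  # (m, target residue) -> residue counts
        self.n = 0

    def fit(self, instances, targets):
        """Count how often each target residue co-occurs with each attribute value."""
        for p, t in zip(instances, targets):
            a = t[self.j]
            self.class_counts[a] += 1
            self.n += 1
            for m, b in enumerate(p):
                self.feature_counts[(m, a)][b] += 1
        return self

    def log_posterior(self, p, a):
        """Unnormalized log P(target_j = a | p) under attribute independence."""
        k = len(ALPHABET)
        score = math.log((self.class_counts[a] + self.alpha) / (self.n + self.alpha * k))
        for m, b in enumerate(p):
            numerator = self.feature_counts[(m, a)][b] + self.alpha
            denominator = self.class_counts[a] + self.alpha * k
            score += math.log(numerator / denominator)
        return score

    def predict(self, p):
        """Return the residue with the highest posterior for position j."""
        return max(ALPHABET, key=lambda a: self.log_posterior(p, a))


def predict_mutant(instances, targets, query):
    """Train one learner per position and join the per-position predictions."""
    learners = [PositionLearner(j).fit(instances, targets) for j in range(len(query))]
    return "".join(learner.predict(query) for learner in learners)
```

Here `instances` and `targets` are aligned sequences of equal length; under the first method each target would be a child of the instance in the phylogenetic tree, and under the second method the instance itself. Working in log space keeps the product over hundreds of attributes from underflowing.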
Finally, the predicted mutant of $p_i$ was obtained by combining the predicted residues of all positions.
For the third method, the probabilities given by the first and the second methods were combined, and the mutated $j$-th residue was predicted by equation (8)
![]() | (8) |
|---|
where
![]() | (9) |
|---|
Experiments and results
**Data collection and preprocessing**
We collected sequences of Neuraminidase (NA) and Hemagglutinin (HA) of influenza A virus from Viet Nam from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/), in FASTA format. We also collected sequences from Southeast Asia for testing purposes. Table 1 summarizes the numbers of sequences before and after preprocessing.
Table 1. Data selection and preprocessing
| Region | Type of protein | Raw | Preprocessed |
|---|---|---|---|
| Viet Nam | NA | 442 | 392 |
| Viet Nam | HA | 373 | 348 |
| Southeast Asia | NA | 2367 | 2294 |
| Southeast Asia | HA | 2337 | 2248 |
**Results**
First, we defined two evaluation criteria: accuracy and matching score. The accuracy of our method is the mean of the accuracy rates of the individual predicted positions, equation (10)
![]() | (10) |
|---|
where
![]() | (11) |
|---|
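As an illustration of the accuracy criterion, the following Python sketch computes a per-position accuracy rate and then averages it over all positions. It assumes that the accuracy rate of a position is the fraction of test sequences whose predicted residue at that position equals the residue in the corresponding target; this is one reading of the description above, not a transcription of equations (10) and (11), and the function names are hypothetical.

```python
def position_accuracy(predicted, actual, j):
    """Fraction of sequences whose predicted residue at position j matches the target."""
    hits = sum(1 for a, t in zip(predicted, actual) if a[j] == t[j])
    return hits / len(predicted)


def mean_accuracy(predicted, actual):
    """Mean of the per-position accuracy rates over all aligned positions."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal size")
    length = len(predicted[0])
    if any(len(s) != length for s in predicted + actual):
        raise ValueError("all sequences must be aligned to the same length")
    rates = [position_accuracy(predicted, actual, j) for j in range(length)]
    return sum(rates) / length
```

Keeping the per-position rates separate also makes it possible to compare them with the conservation of each alignment column.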
We also computed the matching score between predicted sequences and reference sequences. Gaps in the predicted sequences were removed; each sequence was then aligned with all reference sequences, and its Jukes-Cantor distance to each sequence in the reference set was computed. Denoting by $r_i$ the aligned reference sequence nearest to the aligned prediction $a_i$, the matching score was defined by equation (12)
![]() | (12) |
|---|
where
![]() | (13) |
|---|
In this research, we used the NA and HA data sets from Viet Nam for both training and testing because of the lack of data. Table 2 reports the accuracy of our methodology. For NA, the accuracies of the first and second methods are close, 67.87% and 65.42% respectively. For HA, the difference between the two methods is just 4.58%. We also investigated the correlation between the conservation of the protein family and the accuracy. As Figure 2 illustrates, most highly conserved positions (80% and above) gave high accuracy (80% and above).
Table 2. Accuracy
| Type of protein | First method | Second method |
|---|---|---|
| NA | 67.87% | 65.42% |
| HA | 73.59% | 78.17% |

Figure 2. Correlation between accuracy and conserved positions. (A) The first method for NA. (B) The second method for NA. (C) The first method for HA. (D) The second method for HA.
Countries in Southeast Asia are similar to Viet Nam in terms of environment, biological resources, etc., so we used the NA and HA data sets from Southeast Asia as a reference set to evaluate the matching score. As Table 3 shows, for NA all three methods give moderate matching scores: 69.98%, 66.35%, and 69.17% respectively. For HA, the matching scores of the three methods are considerable: 77.92%, 78.17%, and 78.49% respectively. In addition, we randomly picked one predicted sequence for each method and protein type and predicted its 3D structure (26-29). According to Table 4, the rates of modeled positions are higher than 80% in all six cases.
Table 3. Matching score
| Type of protein | First method | Second method | Third method |
|---|---|---|---|
| NA | 69.98% | 66.35% | 69.17% |
| HA | 77.92% | 78.17% | 78.49% |
Table 4. Percentage of residues modeled
| Type of protein | First method | Second method | Third method |
|---|---|---|---|
| NA | 83% (4b7qA) | 82% (3tiaA) | 83% (4gzoA) |
| HA | 91% (2wr0A) | 89% (2yp7A) | 89% (2yp7A) |
Conclusions
We proposed a novel method for predicting mutated proteins that survive selection pressure, with reasonable accuracies and matching scores, although we could find almost no related work for comparison. We did not expect perfect accuracy, because a perfectly accurate predictor would not suggest any novel mutated sequences. The results are therefore reasonable, even though both methods give accuracies under 80%. Moreover, our method gives high accuracy at conserved positions, which play an important role in the structure and function of the protein. The matching score is more important than the accuracy, since it indicates how similar our predicted sequences are to the sequences in the reference set: the more similar the predicted sequences are to the reference set, the more likely their structures resemble the structure of the protein family. In our research, matching scores reach only moderate levels, but these scores are expected to increase with larger reference sets. Additionally, the proportions of modeled positions are higher than 80%. These results show that our predicted sequences have a high likelihood of sharing the structures of their protein families. We suggest applying other effective machine learning methods to the same data sets to improve accuracy and matching score. The protein structures of the predicted sequences can then be constructed by homology modelling for further study.
References
1. Shankarappa R, et al (1999) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol 73(12): 10489-10502.
2. Buonagurio DA, et al (1986) Evolution of human influenza A viruses over 50 years: Rapid, uniform rate of change in NS gene. Science 232(4753): 980-982.
3. Parvin JD, Moscona A, Pan WT, Leider JM & Palese P (1986) Measurement of the mutation rates of animal viruses: Influenza A virus and poliovirus type 1. J Virol 59(2): 377-383.
4. Rambaut A, Posada D, Crandall KA & Holmes EC (2004) The causes and consequences of HIV evolution. Nat Rev Genet 5(1): 52-61.
5. Kimura K, Adlakha A & Simon PM (1998) Fatal case of swine influenza virus in an immunocompetent host. Mayo Clin Proc 73(3): 243-245.
6. Rota PA, et al (1989) Laboratory characterization of a swine influenza virus isolated from a fatal case of human influenza. J Clin Microbiol 27(6): 1413-1416.
7. Dowdle WR & Hattwick MA (1977) Swine influenza virus infections in humans. J Infect Dis 136 Suppl: S386-9.
8. Wells DL, et al (1991) Swine influenza virus infections: Transmission from ill pigs to humans at a Wisconsin agricultural fair and subsequent probable person-to-person transmission. JAMA 265(4): 478-481.
9. Chothia C & Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4): 823-826.
10. Fariselli P, Olmea O, Valencia A & Casadio R (2001) Prediction of contact maps with neural networks and correlated mutations. Protein Eng 14(11): 835-843.
11. Di Lena P, Nagata K & Baldi P (2012) Deep architectures for protein contact map prediction. Bioinformatics 28(19): 2449-2457.
12. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci U S A 91(1): 98-102.
13. Gobel U, Sander C, Schneider R & Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18(4): 309-317.
14. Olmea O & Valencia A (1997) Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des 2(3): S25-32.
15. Cheng J & Baldi P (2007) Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8: 113.
16. Ashkenazy H & Kliger Y (2010) Reducing phylogenetic bias in correlated mutation analysis. Protein Eng Des Sel 23(5): 321-326.
17. Bromberg Y, Yachdav G & Rost B (2008) SNAP predicts effect of mutations on protein function. Bioinformatics 24(20): 2397-2398.
18. Worth CL, Preissner R & Blundell TL (2011) SDM-a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res 39: W215-22.
19. Mitchell TM (1997) Machine Learning (McGraw-Hill, New York).
20. Lipman DJ, Altschul SF & Kececioglu JD (1989) A tool for multiple sequence alignment. Proc Natl Acad Sci U S A 86(12): 4412-4415.
21. Chenna R, et al (2003) Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res 31(13): 3497-3500.
22. Higgins DG & Sharp PM (1988) CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73(1): 237-244.
23. Thompson JD, Higgins DG & Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22): 4673-4680.
24. Jukes TH & Cantor CR (1969) in Mammalian Protein Metabolism, ed Munro HN (Academic Press, New York), pp 21-132.
25. Saitou N & Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-425.
26. Kallberg M, et al (2012) Template-based protein structure modeling using the RaptorX web server. Nat Protoc 7(8): 1511-1522.
27. Ma J, Wang S, Zhao F & Xu J (2013) Protein threading using context-specific alignment potential. Bioinformatics 29(13): i257-65.
28. Peng J & Xu J (2011) A multiple-template approach to protein threading. Proteins 79(6): 1930-1939.
29. Peng J & Xu J (2011) RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins 79 Suppl 10: 161-171.











