Fig. 4 | Genomics & Informatics

Fig. 4

From: A prediction of mutations in infectious viruses using artificial intelligence

Prediction of future mutations by each learning model after constructing optimal training data. A Performance of the models trained with the XGBoost method. The highest accuracy was observed for the models trained on datasets from just after the emergence of the Omicron variant (wave 5), namely training set 5 (accuracy: 0.999957, precision: 0.999995, recall: 0.999995, F1 score: 0.999995) and training set 6 (accuracy: 0.999977, precision: 0.999998, recall: 0.999997, F1 score: 0.999997). B The mutation positions of clade 24C (KP.3), for which Nextstrain had not collected data (no SARS-CoV-2 sequencing information), were predicted from the clade data of 24A and 24B, which emerged recently and were not used to train the existing models. This analysis identified the mutation positions in the receptor-binding motif (RBM) region that may appear in 24C and confirmed the amino acid substitutions. Positions 441, 444, 453, 475, 493, and 500 were predicted as sites of potential new mutations, and the 24C (KP.3) mutation was confirmed at position 493, marked by the blue box. The newly predicted mutations had very low predicted frequencies: position 441 had the highest value (0.00055), and position 493, where a mutation actually occurred, had a probability of 0.00024. Among amino acid substitutions, L441F and Q493E were predicted as the most likely at their respective positions, and Q493E did occur. However, given that the substitution Q493R occurred previously, Q493E may appear only temporarily and then disappear again because of its low predicted rate.
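The four scores reported for panel A follow the standard definitions based on true and false positives and negatives. The sketch below shows those definitions in plain Python. It assumes a binary labeling (1 = mutation at a position, 0 = no mutation), which the caption does not specify; the function name `classification_metrics` and the toy labels are hypothetical and are not taken from the article or its data.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Assumption (not stated in the caption): label 1 means a mutation
    is present at a position, label 0 means it is absent.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (not data from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is consistent with the values reported for training sets 5 and 6.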
