Authors:  A. KOUAME, M. LAHLOU KASSI, H. HANZOULI, Solution BI France


In a globally interconnected and densely populated world, pathogens can spread more easily than ever before. First identified in China in December 2019, the new coronavirus disease (COVID-19) has spread worldwide, killing thousands of people and affecting millions more.

Facing this global outbreak, researchers are working actively to find ways to slow down and eradicate the virus. As part of this global mobilization, artificial intelligence research labs all over the world are bringing their expertise to the scientific community to find potential solutions by enhancing ongoing research efforts, improving the efficiency and speed of existing approaches, and proposing original lines of research.

In this article, we present some of data science’s contributions to tackling many aspects of the COVID-19 crisis at different scales.



A. COVID knowledge base building


Because of the rapid growth of coronavirus-related literature, it has become difficult for the medical community, and more broadly the scientific research community, to keep up with everything happening across these multidisciplinary studies.

Thus, there is a growing urgency for innovative approaches to tackle this issue and bridge the gaps between researchers all over the world. In this context, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19) [1]. The dataset is a resource of over 51,000 documents in different formats (PDF, JSON, CSV …).

This freely available dataset is provided to the global research community so that recent advances in Natural Language Processing (NLP), such as TF-IDF, t-SNE and topic modelling, as well as other AI approaches, can be applied to generate new insights in support of the ongoing fight against this infectious disease.
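As an illustration of the simplest of these techniques, the following is a minimal TF-IDF sketch in plain Python. The three "abstracts" are toy strings invented for this example and are not taken from CORD-19; the whitespace tokenizer and the unsmoothed logarithmic weighting are simplifying assumptions, and real pipelines would typically rely on a dedicated NLP library.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy stand-ins for paper abstracts (hypothetical, not CORD-19 data).
abstracts = [
    "coronavirus spike protein structure",
    "coronavirus transmission model",
    "coronavirus vaccine spike epitope",
]
scores = tf_idf(abstracts)
```

A term that appears in every document, such as "coronavirus" here, receives a weight of zero, while terms specific to one document receive the highest weights. This is what makes TF-IDF useful for surfacing the distinguishing vocabulary of each paper.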



Figure 1 : Topic Modelling interactive visualization of a model trained on CORD-19 using a private kernel


Figure 1 shows insights obtained by applying NLP algorithms. The left panel displays a global view of the discovered topics. Each circle represents a topic, and the area of each circle indicates the overall prevalence of that topic among the research papers. The closer two topics are to each other, the more related they are.

The right panel lists the keywords that are most useful for interpreting the currently selected topic: selecting a topic in the left panel updates the keyword list on the right. It is also possible to click on an individual keyword to see in which topics it occurs.


B. Analysis of the structure of the virus

To find an effective vaccine, it is necessary to understand the structure of the virus. That’s why researchers study the virus’s behavior in the human body and try to characterize its similarity with other viruses of the same family (SARS and MERS).

In this context, researchers from the National Center for Biotechnology Information [2] identified key genomic features that differentiate SARS-CoV-2 and the viruses behind the two previous deadly coronavirus outbreaks, SARS-CoV and MERS-CoV, from less pathogenic coronaviruses. They applied a Support Vector Machine (SVM) to text data representing the sequence of the virus. Figure 2 shows the structure analysis made with the help of the SVM.



Figure 2 : COVID structure analysis with SVM


Another analysis aims to predict whether a virus can infect humans directly from next-generation sequencing reads. For this study, researchers at the Robert Koch Institute’s Methodology and Research Infrastructure Department [3] used a distributed orthographic representation.

In this representation, each nucleotide {A, C, G, T} in a sequence is represented by a one-hot encoded vector of length 4. An “unknown” nucleotide (N) can be represented as an all-zero vector.
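The encoding described above can be sketched in a few lines of Python. The function name and the A, C, G, T column order are choices made for this example, not details taken from [3]; any character outside {A, C, G, T}, including N, is mapped to the all-zero vector.

```python
NUCLEOTIDES = "ACGT"  # column order of the one-hot vector (an assumption)


def one_hot_encode(sequence):
    """Encode a nucleotide string as a list of length-4 one-hot vectors."""
    encoded = []
    for base in sequence.upper():
        vector = [0, 0, 0, 0]
        if base in NUCLEOTIDES:
            vector[NUCLEOTIDES.index(base)] = 1
        # Unknown bases such as N keep the all-zero vector.
        encoded.append(vector)
    return encoded


read = one_hot_encode("ACGTN")
```

The resulting matrix (one row per nucleotide, four columns) is the kind of input that convolutional and recurrent networks can consume directly.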

With these data, they applied a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. These models performed better than existing approaches: the authors reported about 50% better performance than traditional genome-based methods.


C. Search of a vaccine 

The most effective way to stop the spread of the coronavirus is to find a vaccine that improves immunity. To that end, scientists from the Laboratory of Innovative Drug Target Research [4] in Fujian implemented a deep Q-learning network with fragment-based drug design to generate potential lead compounds targeting COVID-19. The data used were text documents, mostly representing molecules and inhibitors. The reinforcement learning algorithm rewards three aspects of the discovered molecules: the drug-likeness score, the inclusion of pre-determined fragments, and the presence of known abstract design patterns believed to be correlated with a compound’s effectiveness. The data pipeline developed provides a framework to identify strong epitope-based vaccine candidates beyond 2019-nCoV and might be applied against any unknown pathogens.
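To make the idea of a three-part reward more concrete, here is a minimal sketch of how such signals could be combined into a single scalar. The weighted-sum form, the function name, the argument names and the default weights are all hypothetical choices for illustration; they are not the reward function used in [4].

```python
def composite_reward(drug_likeness, contains_required_fragment, matched_patterns,
                     weights=(1.0, 1.0, 0.5)):
    """Combine three reward aspects into one scalar (illustrative only).

    drug_likeness: score assumed to lie in [0, 1].
    contains_required_fragment: True if a pre-determined fragment is present.
    matched_patterns: number of known design patterns found in the molecule.
    weights: hypothetical weights, not taken from the paper.
    """
    w_drug, w_fragment, w_pattern = weights
    fragment_bonus = 1.0 if contains_required_fragment else 0.0
    return (w_drug * drug_likeness
            + w_fragment * fragment_bonus
            + w_pattern * matched_patterns)


# Hypothetical candidate: moderately drug-like, contains a required
# fragment, and matches two design patterns.
reward = composite_reward(0.5, True, 2)
```

In a reinforcement learning loop, a scalar of this kind would be returned to the agent after each generated molecule, so that candidates scoring well on all three aspects are favored over time.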

For the AI model to work well, the first step is to prepare the molecular fragment library, as shown in Figure 3. The researchers collected SARS-CoV inhibitors (284 molecules) as the initial set, then split these molecules into fragments with a molecular weight of no more than 200 daltons. Finally, they applied an advanced deep Q-learning network with fragment-based drug design to generate potential lead compounds.



Figure 3: Flowchart for lead compounds development


Moreover, in a Stanford Medicine paper [5], the authors used neural networks to find epitopes in the COVID-19 genome. Epitopes are the parts of an antigen that are recognized by the immune system (B-cells and T-cells). In summary, they analyzed the 2019-nCoV viral genome for epitope candidates and found 405 likely T-cell epitopes, with strong MHC-I and MHC-II presentation scores, and 2 potential neutralizing B-cell epitopes on the S protein. This is a promising step in the search for a vaccine.


Figure 4: Data pipeline to identify T-cells and B-cells epitopes in 2019 nCoV


In Figure 4, Spike (S), Envelope (E), Membrane (M) and Nucleocapsid (N) represent proteins encoded by the 2019-nCoV genome that are possible targets for antibodies. All these protein fragments have the potential to be presented by MHC-I or MHC-II and recognized by T-cells. Once the data had been cleaned and prepared, the authors applied NetMHCpan4 and MARIA, two artificial neural network algorithms, to predict antigen presentation and identify potential T-cell epitopes.

In addition, researchers from the University of Michigan [6] used several supervised machine learning techniques, such as K-Nearest Neighbors, Random Forest and XGBoost, to predict which viral proteins could serve as effective vaccine targets. The study is still ongoing.

To conclude, all these efforts from researchers around the world will facilitate the introduction of a vaccine.


D. Medical imaging for diagnosis

Until an effective vaccine is available, diagnosis is important to detect infected persons and isolate them to avoid contamination among the population. For this purpose, medical imaging techniques such as X-ray and computed tomography (CT) play an essential role in the global fight against COVID-19.

In this context, at the height of the COVID-19 epidemic in Wuhan, Chinese doctors used CT coupled with a series of X-ray images to diagnose COVID-19. However, this was not very accurate because lung infections can look very similar whether or not they are caused by COVID-19, so the test generated many false positive and false negative diagnoses [7].

To solve this issue, many research teams implemented data pipelines based on computer vision, using image recognition algorithms to help accurately identify COVID-19 in CT scans and chest X-rays.

For example, an AI research team from Brunel University London [8] proposed a Bayesian Convolutional Neural Network to estimate the diagnostic uncertainty in COVID-19 prediction. This improved the detection accuracy of the CNN applied (VGG-16) from 85.7% to 92.9%.
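One common way to obtain an uncertainty estimate from a Bayesian-style network is to run several stochastic forward passes on the same image and measure how much the predictions disagree. The sketch below assumes that setup; the softmax outputs are invented numbers, the function names are hypothetical, and predictive entropy is only one of several possible uncertainty measures, so this is not necessarily the procedure followed in [8].

```python
import math


def predictive_mean(samples):
    """Average class probabilities over several stochastic forward passes."""
    n_samples = len(samples)
    n_classes = len(samples[0])
    return [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]


def predictive_entropy(probabilities):
    """Shannon entropy in nats; higher values mean a less certain prediction."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# Hypothetical outputs [P(COVID-19), P(normal)] from 4 stochastic passes.
confident_passes = [[0.97, 0.03], [0.95, 0.05], [0.98, 0.02], [0.96, 0.04]]
uncertain_passes = [[0.90, 0.10], [0.20, 0.80], [0.60, 0.40], [0.30, 0.70]]

confident_entropy = predictive_entropy(predictive_mean(confident_passes))
uncertain_entropy = predictive_entropy(predictive_mean(uncertain_passes))
```

Cases with high entropy can be flagged for review by a radiologist instead of being classified automatically, which is one way an uncertainty estimate can translate into better overall accuracy.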

Another team from Zonguldak Bulent Ecevit University [9] proposed different CNN models (InceptionV3, ResNet50, Inception-ResNetV2) to detect infection from X-ray images. The classification showed performance above 90% in most cases.



Figure 5 : Schematic representation of pre-trained models for the prediction of normal (healthy), COVID-19, bacterial and viral pneumonia patients [8].


In this study, as shown in Figure 5, researchers built deep CNN-based ResNet50, ResNet101, ResNet152, InceptionV3 and Inception-ResNetV2 models to classify COVID-19 chest X-ray images into three different binary classes (Binary Class-1 = COVID-19 and normal (healthy), Binary Class-2 = COVID-19 and viral pneumonia, Binary Class-3 = COVID-19 and bacterial pneumonia). In addition, they applied a transfer learning technique based on ImageNet data to overcome the lack of data and reduce training time.


Furthermore, researchers from Tsinghua University used a 2D CNN to segment the lung and then identify slices of positive COVID-19 cases. The model achieved a sensitivity of 94.1%, a specificity of 95.5% and an AUC of 0.979.
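For readers less familiar with these metrics, sensitivity is the share of truly infected cases that the model detects, and specificity is the share of healthy cases that it correctly clears. The short sketch below computes both from confusion-matrix counts; the counts are arbitrary illustrative values and are not data from the study above.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positive cases that were detected."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of actual negative cases that were correctly cleared."""
    return true_negatives / (true_negatives + false_positives)


# Arbitrary illustrative counts (hypothetical, not from the study).
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=180, false_positives=20)
```

A diagnostic tool with high sensitivity produces few false negatives, which matters most when missed infections lead to further contamination.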

We can conclude that advances in computer vision, and in particular the latest deep learning algorithms, make it possible to detect the presence of COVID-19 quickly and efficiently.


E. Tracking & prediction of the spread of the virus

Faced with the risk posed by the coronavirus in terms of infections and deaths, governments across the world took various decisions to slow down the progression of the virus. The strategy differed from one country to another. For example, in northern European countries such as Sweden, the government took no restrictive measures and only issued recommendations to the population. This, coupled with people’s discipline, slowed down the progression of the virus. Meanwhile, in France, the government, with the support of the scientific council, decided to apply a general lockdown, make mask wearing compulsory, and recommend physical distancing, a curfew and remote work where possible. These measures, taken in March for three months and recently renewed with some modifications, contributed to stopping the progression of the virus. To follow the progression of the virus and support decision-making, data visualization can also be useful.

For this purpose, IHME (Institute for Health Metrics and Evaluation) implemented algorithms to predict the progression of the virus. They reviewed 384 published and unpublished COVID-19 forecasting models and evaluated 7 models for which publicly available, multi-country, and date-versioned mortality estimates could be downloaded. These were the models from DELPHI-MIT (Delphi), Youyang Gu (YYG), the Los Alamos National Laboratory (LANL) and Imperial College London (Imperial), and 3 models produced by IHME itself: a curve fit model (IHME-CF), a hybrid curve fit and epidemiological compartment model (IHME-CF SEIR), and a hybrid mortality spline and epidemiological compartment model (IHME-MS SEIR).
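The epidemiological compartment component named above, SEIR, divides the population into Susceptible, Exposed, Infectious and Recovered groups and moves people between them at fixed rates. The following is a minimal sketch of a basic SEIR simulation using forward Euler integration. All parameter values are illustrative assumptions, and the hybrid IHME models are considerably more elaborate than this.

```python
def simulate_seir(population, initial_infected, beta, sigma, gamma, days, dt=0.1):
    """Integrate a basic SEIR model with forward Euler steps.

    beta: transmission rate, sigma: 1 / incubation period,
    gamma: 1 / infectious period (all per day).
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _day in range(days):
        for _step in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative parameters only (assumed, not fitted to any real data).
trajectory = simulate_seir(population=1_000_000, initial_infected=10,
                           beta=0.5, sigma=1 / 5, gamma=1 / 10, days=120)
```

Lowering `beta` in such a model is a simple way to represent interventions such as mask wearing or physical distancing, which is the kind of scenario comparison that forecasting tools expose.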

In the IHME tool, you can select different regions of the world, each containing several countries. You can then select different kinds of estimates, such as total deaths since the outbreak of the pandemic, daily deaths, persons tested and estimated infections.

Furthermore, you can follow the use of hospital resources such as beds and equipment.



Figure 6: Evolution of deaths in France according to different scenarios


In Figure 6, it is possible to observe the predicted progression of the virus in France according to different scenarios (universal masks, mandates easing).

Focusing on France, Gaetan Lavenu, a geospatial evangelist at ESRI (a French company developing products based on geolocation), used data from Santé publique France to develop data visualization tools that show the progression of the virus in France. The example interface in Figure 7 shows the use of hospital resources (intensive care, returns home) and some information about infections (number of tests, positive tests).



Figure 7: Data visualization of the COVID data



Given the seriousness of the coronavirus, it was essential to put in place strong measures to stop the progression of the virus. Data science techniques, from NLP to CNNs to data visualization, helped professionals and governments make decisions.

Recently, BioNTech and Pfizer announced that their vaccine candidate had shown an efficacy of more than 90% in trials. The Russian vaccine Sputnik V seems to have a similar efficacy (near 92%).

Thanks to these results, and following the validation of these vaccines by several official bodies such as the European and American medicines agencies, vaccination campaigns have been launched widely throughout the world. As a result, France passed the one million vaccination milestone in almost a month. This rapid evolution of the situation gives hope for a normal resumption of activities by the end of 2021.





[2] Gussow, Ayal B., et al. “Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses.” Proceedings of the National Academy of Sciences (2020).
[3] Bartoszewicz, Jakub M., Anja Seidel, and Bernhard Y. Renard. “Interpretable detection of novel human viruses from genome sequencing data.” bioRxiv (2020).
[4] Tang, Bowen, et al. “AI-aided design of novel targeted covalent inhibitors against SARS-CoV-2.” bioRxiv (2020).
[5] Fast, Ethan, and Binbin Chen. “Potential T-cell and B-cell Epitopes of 2019-nCoV.” bioRxiv (2020).
[6] Ong, Edison, et al. “COVID-19 coronavirus vaccine design using reverse vaccinology and machine learning.” bioRxiv (2020).
[7] West, Colin P., Victor M. Montori, and Priya Sampathkumar. “COVID-19 testing: the threat of false-negative results.” Mayo Clinic Proceedings. Vol. 95. No. 6. Elsevier, 2020.
[8] “Estimating Uncertainty and Interpretability in Deep Learning for Coronavirus (COVID-19) Detection.”
[9] Ong, E., et al. “COVID-19 coronavirus vaccine design using reverse vaccinology and machine learning.” bioRxiv. Posted on March 23 (2020).