As noted in my previous post, South Africa is in the midst of a lockdown in which most of us are confined to our homes. What better way to pass the time than to learn some data science skills that will also help us better understand the COVID-19 pandemic? In this post we will learn how to make a bubble map representing COVID-19 cases by country. A bubble map is a commonly used infographic that shows geographic regions (in this case, countries) with a circle (bubble) over each one. The radius of the bubble varies to represent the magnitude of some variable (in this case, number of confirmed COVID-19 cases).
Downloading the Data
We need the same data set used in my previous post, from the European Centre for Disease Prevention and Control (ECDC). First, use setwd() to set the working directory to a folder of your choice. Then, the following code will download today’s version of the data.
todayurl <- paste0("https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-",Sys.Date(),".xlsx")
todayfilename <- paste0("COVID-19-geographic-disbtribution-worldwide-",Sys.Date(),".xlsx")
download.file(url = todayurl, destfile = todayfilename, mode = "wb")
The above code will probably throw an error if you run it just after midnight (say, at 00:02), because the daily spreadsheet is only published sometime in the morning, European time. Next, we need to import the Excel spreadsheet into R. For this, we can use the R package xlsx. If you don’t already have this package installed, make sure you’re connected to the Internet and run the command install.packages("xlsx"). Then, execute the following code:
# Import spreadsheet and save as data.frame object called dat
dat <- as.data.frame(xlsx::read.xlsx(file = todayfilename, sheetIndex = 1))
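Since the download step fails when today's file has not been published yet, one option is to fall back to the previous day's file. The following is only a minimal sketch and not part of the original workflow: the helper names ecdc_filename() and download_latest() are hypothetical, and it assumes the ECDC file naming pattern shown earlier stays unchanged.

```r
# Hypothetical helper: build the ECDC file name for a given date
# (the "disbtribution" spelling matches the URL used in this post)
ecdc_filename <- function(date = Sys.Date()) {
  paste0("COVID-19-geographic-disbtribution-worldwide-",
         format(as.Date(date), "%Y-%m-%d"), ".xlsx")
}

# Hypothetical helper: try today's file first, then yesterday's
download_latest <- function(base_url = "https://www.ecdc.europa.eu/sites/default/files/documents/") {
  for (days_back in 0:1) {
    fname <- ecdc_filename(Sys.Date() - days_back)
    # download.file() returns 0 on success; treat any error as a failure
    status <- tryCatch(
      download.file(url = paste0(base_url, fname), destfile = fname, mode = "wb"),
      error = function(e) 1L
    )
    if (status == 0) return(fname)
  }
  stop("Neither today's nor yesterday's file could be downloaded")
}

# Usage (requires an Internet connection, so it is left commented out):
# todayfilename <- download_latest()
```

If this approach is used, todayfilename holds whichever file was actually downloaded, so the import step works unchanged.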
Bubble Map of Total COVID-19 Cases by Country
As described above, a bubble map is a world map with a circle (‘bubble’) on each country, with the radius of the circle proportional to the quantity of interest (e.g., number of confirmed COVID-19 cases). It is not too difficult to create COVID-19 bubble maps; we just need to get some GIS data and do some data cleaning first. We need the coordinates (longitude and latitude) of the ‘centroid’ (central point) of each country so that R knows where to put the bubbles. A Google search reveals that there is an R package called CoordinateCleaner containing a data.frame object called countryref that contains centroids for countries and provinces around the world. We need to install the package CoordinateCleaner by running install.packages("CoordinateCleaner") and then proceed as follows:
# Subsetting by type == "country" ensures that we filter out the provinces and only keep country-level GPS data
centroid_dat <- CoordinateCleaner::countryref[CoordinateCleaner::countryref$type == "country", ]
# The data set has two sets of centroid coordinates for each country (corresponding to two different sources), so we only want the odd-numbered rows. The columns of interest are 4 (country name), 6 (centroid longitude) and 7 (centroid latitude)
country_centroid <- centroid_dat[(1:nrow(centroid_dat)) %% 2 == 1, c(4, 6, 7)]
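Selecting rows and columns by position depends on the exact layout of countryref, which may differ between package versions. As a hedged alternative, the sketch below does the same filtering by column name and keeps the first row per country. It runs on a small made-up table (toy_ref and its values are invented for illustration); the column names name, centroid.lon and centroid.lat are the ones used later in this post, and it assumes both sources list a country under the same name.

```r
# Made-up stand-in for CoordinateCleaner::countryref (values are illustrative only)
toy_ref <- data.frame(
  name = c("Alpha", "Alpha", "Beta", "Beta", "Beta Province"),
  type = c("country", "country", "country", "country", "province"),
  centroid.lon = c(10.0, 10.2, -50.0, -50.3, -51.0),
  centroid.lat = c(5.0, 5.1, 20.0, 20.2, 21.0),
  stringsAsFactors = FALSE
)

# Keep country-level rows only
toy_countries <- toy_ref[toy_ref$type == "country", ]

# Keep the first row for each country name and select columns by name
toy_centroid <- toy_countries[!duplicated(toy_countries$name),
                              c("name", "centroid.lon", "centroid.lat")]
print(toy_centroid)
```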
Next, we need to aggregate our COVID-19 case data (specifically, number of cases and number of deaths) by country so that we have totals instead of daily figures.
totcases <- aggregate(cbind(cases, deaths) ~ countriesAndTerritories, data = dat, FUN = sum)
# Remember, if the "countriesAndTerritories" column name is different in your file, you will need to amend the code accordingly. Check by running names(dat), or by typing dat$ in RStudio and scrolling through the column names.
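To see what aggregate() does with this formula, here is a small sketch on invented daily figures (toy_dat and its numbers are made up; only the column names mirror the ECDC file). Each country's daily rows collapse into a single row of totals.

```r
# Invented daily figures for two fictional countries
toy_dat <- data.frame(
  countriesAndTerritories = c("Alpha", "Alpha", "Beta", "Beta", "Beta"),
  cases = c(1, 4, 10, 20, 30),
  deaths = c(0, 1, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Sum cases and deaths within each country
toy_tot <- aggregate(cbind(cases, deaths) ~ countriesAndTerritories,
                     data = toy_dat, FUN = sum)
print(toy_tot)
```

Note that the formula interface of aggregate() drops rows containing NA in cases or deaths by default (na.action = na.omit), so it is worth checking the real data for missing values.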
Now for the messiest step. We need to merge our
totcases object with the
country_centroid object according to country name, so that we have a data table that matches GPS coordinates with COVID-19 case and death totals. The problem is, the country names must be exactly the same for R to make a match, and there are some differences in the country naming conventions between the two data sets. For instance, where a country name is multiple words, one separates the words with spaces and the other with underscores. The following code rectifies the issues except for 18 small countries that will therefore be omitted from the maps.
# Substitute spaces with underscores
country_centroid$name <- gsub(" ", "_", country_centroid$name)
# Substitute ampersands with the word 'and'
country_centroid$name <- gsub("&", "and", country_centroid$name)
# Country-by-country individual changes to make names match
country_centroid$name[country_centroid$name == "Congo_-_Brazzaville"] <- "Congo"
country_centroid$name[country_centroid$name == "Congo_-_Kinshasa"] <- "Democratic_Republic_of_the_Congo"
country_centroid$name[country_centroid$name == "United_States"] <- "United_States_of_America"
country_centroid$name[country_centroid$name == "Tanzania"] <- "United_Republic_of_Tanzania"
country_centroid$name[country_centroid$name == "Cote_d'Ivoire"] <- "Cote_dIvoire"
country_centroid$name[country_centroid$name == "Czechia"] <- "Czech_Republic"
country_centroid$name[country_centroid$name == "Swaziland"] <- "Eswatini"
country_centroid$name[country_centroid$name == "Palestinian_Territories"] <- "Palestine"
country_centroid$name[country_centroid$name == "Myanmar_(Burma)"] <- "Myanmar"
# List of countries in the COVID-19 data set with no matching entry in the centroid data (commented out)
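A list of unmatched countries like the one mentioned in the comment above can be produced with setdiff(). This is a minimal sketch on made-up name vectors (toy_cases_names and toy_centroid_names are invented); with the real data one would pass totcases$countriesAndTerritories and country_centroid$name instead.

```r
# Invented name vectors standing in for the two data sets
toy_cases_names <- c("Congo", "Eswatini", "San_Marino", "United_States_of_America")
toy_centroid_names <- c("Congo", "Eswatini", "United_States_of_America", "Antarctica")

# Names present in the case data but absent from the centroid data
unmatched <- setdiff(toy_cases_names, toy_centroid_names)
print(unmatched)
```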
Having done this, we can now merge the two data sets.
casemapdat <- merge(totcases, country_centroid, by.x = "countriesAndTerritories", by.y = "name")
# Again, change 'by.x = "countriesAndTerritories"' if this column is spelt differently in your file
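By default merge() performs an inner join, which is why countries without a matching name silently drop out of the result. The sketch below, on invented data (toy_tot and toy_centroid are made up), shows this behaviour and how all.x = TRUE would instead keep such countries with NA coordinates, which can help when checking what was lost.

```r
# Invented totals: "Gamma" has no centroid entry
toy_tot <- data.frame(
  countriesAndTerritories = c("Alpha", "Beta", "Gamma"),
  cases = c(5, 60, 7),
  stringsAsFactors = FALSE
)
toy_centroid <- data.frame(
  name = c("Alpha", "Beta"),
  centroid.lon = c(10, -50),
  centroid.lat = c(5, 20),
  stringsAsFactors = FALSE
)

# Inner join (the default): unmatched countries are dropped
inner <- merge(toy_tot, toy_centroid, by.x = "countriesAndTerritories", by.y = "name")

# Left join: unmatched countries are kept, with NA coordinates
left <- merge(toy_tot, toy_centroid, by.x = "countriesAndTerritories", by.y = "name",
              all.x = TRUE)

print(nrow(inner))
print(left[is.na(left$centroid.lon), "countriesAndTerritories"])
```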
Now we are ready to create our bubble map of number of cases. We just need to install the package called rworldmap by running install.packages("rworldmap"). It contains a function called mapBubbles that produces a bubble map. Here is the code for the bubble map of number of cases.
# This ensures the numbers in legend are not in scientific notation
options(scipen = 8)
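# The opening of this call is assumed: bubbles are sized by the "cases" column of the merged data
rworldmap::mapBubbles(dF = casemapdat, nameZSize = "cases",
  addLegend = TRUE, legendVals = c(10, 10000, 100000),
  legendTitle = NULL, legendPos = "bottomright",
  legendHoriz = TRUE,
  nameX = "centroid.lon", nameY = "centroid.lat",
  addColourLegend = TRUE)
# Add a dated title above the map
mtext(paste0("No. of Confirmed COVID-19 Cases by Country as of ", Sys.Date()), side = 3, cex = 0.75)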
There you have it! Without too much effort we have produced some cool infographics. The best part is, if you save your R script and then open it in a week and run it again, your infographics should automatically update to reflect the latest COVID-19 case data.