Web Scraping of South African National Lottery Draw Data in R

Thomas Farrar

04/05/2020

Web Scraping of Static HTML Content vs Dynamic Javascript-Rendered Content

In the previous post, we learned how to create a dataset of English Premier League football match data by scraping the HTML code of Match Centre webpages. In that case, the data we wanted was stored between HTML tags (nodes) in the HTML code itself. Thus, once we had downloaded the HTML document for each match using the read_html function from R’s xml2 package, we had the data. It only remained to locate and extract the exact bits of data we wanted. This required the use of functions from R’s rvest package that help us to interpret an HTML document.

The reality, however, is that many webpages today do not merely consist of static HTML content but also have Javascript-rendered content. This allows webpages to be more dynamic and interactive. A webpage with Javascript-rendered content requires a very different approach to web scraping. As an example, take the Historical Results section for the LOTTO game on the South African National Lottery website. When you open this page, you will see links corresponding to the past few draw numbers. If you click on one of these draw numbers, you will see results for that specific draw, such as which balls were drawn, the number of winners in each prize category, the amount of winnings in each prize category, and below that, more information such as the total sales and the (estimated) next jackpot. Observe, however, that the webpage URL has not changed; it is still ‘https://www.nationallottery.co.za/lotto-history’. We have not moved to a different webpage; we have merely changed the content on the same webpage. Moreover, if we scrape the HTML code for this URL, we will not find the data for any LOTTO draw.

This has two implications for the web scraper. Firstly, we cannot use the URL to cycle through all of the past LOTTO draws (as we did with Premier League matches in the previous post), because the URL remains the same. We must find another way to cycle through all of the past draws. Secondly, instead of merely downloading the HTML content of this webpage (which does not contain the data we want), we need to find out how the webpage uses Javascript to render the draw data and imitate that procedure in R. Before we dig into the details of doing this, a bit more background on our objective.

The National Lottery’s LOTTO Game

One of the games of the South African National Lottery is LOTTO. Draws take place twice per week, on Wednesdays and Saturdays. Each play consists of selecting six numbers between 1 and 52 without repeats. At the draw, a machine knocks around 52 numbered balls and six are drawn without replacement, plus a seventh called the bonus ball. To win the jackpot (Level 1 prize), your six numbers must exactly match the six balls drawn from the machine (order does not matter). The jackpot prize ranges from about R3 million up to over R100 million depending on the draw. Progressively smaller prizes can be won by matching five numbers plus the bonus ball, five numbers without bonus ball, four numbers plus bonus ball, and so on. The lowest prize-winning outcome (Level 8) corresponds to matching two numbers plus the bonus ball; this earns the player R20.

A student of probability may find it an interesting exercise to work out the probabilities of winning each of the prize levels using combinations. Interestingly, there were formerly only 49 balls in the game. The seemingly inconsequential increase from 49 balls to 52 with effect from Draw No. 1732 (August 2017) has actually reduced the probability of winning a jackpot by about 31%, from 1 in 13983816 to 1 in 20358520. Put another way, the odds against a jackpot have lengthened by over 45%.
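The exercise mentioned above can be sketched in a few lines of R. This is only an illustration: the function name prize_probs is hypothetical, and the definitions of Levels 5 to 7 are inferred from the "and so on" in the description of the prize levels rather than stated by the National Lottery.

```r
# Probability of each LOTTO prize level for a game with n_balls balls.
# A play picks 6 numbers; the draw yields 6 main balls plus 1 bonus ball.
prize_probs <- function(n_balls) {
  # Main-ball matches and bonus-ball requirement for Levels 1 to 8
  # (Levels 5 to 7 are an assumption inferred from the pattern)
  main  <- c(6, 5, 5, 4, 4, 3, 3, 2)
  bonus <- c(0, 1, 0, 1, 0, 1, 0, 1)
  others <- n_balls - 7  # balls that are neither main balls nor the bonus ball
  # Ways to pick the matched main balls, the bonus ball (if required),
  # and the remaining numbers from the unmatched balls
  ways <- choose(6, main) * choose(1, bonus) * choose(others, 6 - main - bonus)
  setNames(ways / choose(n_balls, 6), paste0("Level", 1:8))
}

# Odds expressed as "1 in ..." for the 49-ball and 52-ball versions of the game
odds49 <- 1 / prize_probs(49)
odds52 <- 1 / prize_probs(52)
round(odds52)
```

The first element of odds49 and odds52 reproduces the jackpot odds quoted above, since choose(49, 6) is 13983816 and choose(52, 6) is 20358520.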

What would scraping historical draw results allow us to do? One thing it would not allow us to do is improve our chances of winning a jackpot. Because the draws are statistically independent, knowing which numbers have been drawn more or less frequently in the past tells us nothing about which numbers might be drawn next. There are, however, two things we could do with the scraped data. (1) We could test whether the data fits the theoretical probability distributions, in terms of both the distribution of balls drawn and the number of jackpots won. A poor fit could suggest either that the machines are faulty or that some sort of fraud has occurred. (2) We could look at patterns of LOTTO sales over the years. To do either of these we need the data, and so back to our web scraping problem!

Web Scraping of Javascript-Rendered Content Using R

Certain R packages have functions that will make scraping this data very easy; actually much easier than it was to scrape the Premier League match data from HTML code. We will need to install the packages httr, jsonlite, and (if we have not already) rvest and xlsx:

install.packages(c("httr", "jsonlite", "rvest", "xlsx"))

Now, to harvest Javascript-rendered data from a webpage it is absolutely essential to use Developer tools in Google Chrome (or a similar tool in another browser). First open the Historical Results page and then press Ctrl+Shift+I to open Developer tools. Click on the ‘Network’ tab in Developer tools and then click on the most recent draw (No. 2013 as I write this). You will see two items appear in Developer tools; we are interested in the item of type ‘xhr’ circled in red in the screenshot below. This gives us a record of the dynamic content that was rendered when we opened the page.

Developer tools after loading Results of Draw No. 2013

Right-click on the text beginning with ‘index.php?’ and select ‘Copy’ and then ‘Copy link address.’ Paste this URL into your R script and save it as a character:

requesturl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"

Now left-click on the text beginning with ‘index.php?’ and you will see a tab called ‘Headers’ open with some information under ‘General’. Often, Javascript content is rendered with a ‘GET’ request, but in this case we can see that the Request Method is ‘POST’. (This is the same kind of request that is made when you submit a form on a webpage.) Since we have the URL and method of the request, we are nearly ready to run this request within R and thus scrape the data for this draw. But how does the webpage know which draw’s results to return? (If you open the results of another draw with Developer tools running, you will see that the Request URL is the same.)

Scroll down to the bottom of the Headers tab and you will see three variable names and values under ‘Form Data’:

Form Data

The variable drawNumber, with a value of 2013, is clearly what tells the script which draw’s results to return. Now we are ready to emulate the POST request within R using the POST function of the httr package. Among the arguments we pass to the function are the request URL and the three variables that we found under ‘Form Data’ in Developer tools.

response <- httr::POST(url = requesturl, body = list(gameName = "LOTTO", drawNumber = "2013", isAjax = "true"), encode = "form")
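To get a feel for what encode = "form" does with the three variables, the sketch below builds a URL-encoded key=value string by hand in base R. It only approximates the body that httr sends (for instance, form encoding and utils::URLencode can treat spaces differently), and the names form_fields and encoded_body are just for illustration.

```r
# The three form fields seen under 'Form Data' in Developer tools
form_fields <- list(gameName = "LOTTO", drawNumber = "2013", isAjax = "true")

# Percent-encode each value, then join the fields as key=value pairs with "&"
encoded_values <- vapply(form_fields,
                         function(v) utils::URLencode(v, reserved = TRUE),
                         character(1))
encoded_body <- paste0(names(form_fields), "=", encoded_values, collapse = "&")
encoded_body
```

Changing only the drawNumber field in this body is what lets the same request URL return the results of a different draw.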

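Now we have retrieved the data, but it is encoded in JSON (JavaScript Object Notation) format. Fortunately, the function parse_json in the R package jsonlite can parse it for us. Since parse_json expects the JSON as a character string, we first extract the body of the response as text using the content function of httr:

myjson <- jsonlite::parse_json(httr::content(response, as = "text", encoding = "UTF-8"))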

The object myjson is a list whose element data is itself a list with two elements, drawDetails and totalWinnerRecord. The drawDetails element has the data we need stored under various names like ball1 (the first ball drawn), ball2, etc. We can convert all of this to a named character vector in one simple step:

mydrawdata <- unlist(myjson$data$drawDetails)
mydrawdata
##        drawNumber          drawDate      nextDrawDate             ball1 
##            "2013"      "2020/04/15"      "2020/04/18"              "24" 
##             ball2             ball3             ball4             ball5 
##              "48"              "33"              "51"              "28" 
##             ball6         bonusBall       div1Winners        div1Payout 
##              "36"              "46"               "0"               "0" 
##       div2Winners        div2Payout       div3Winners        div3Payout 
##               "1"         "65122.8"              "20"          "5662.9" 
##       div4Winners        div4Payout       div5Winners        div5Payout 
##              "70"          "2022.5"            "1288"           "184.7" 
##       div6Winners        div6Payout       div7Winners        div7Payout 
##            "1822"           "113.5"           "25295"              "50" 
##       div8Winners        div8Payout    rolloverAmount    rolloverNumber 
##           "19247"              "20"      "2066940.84"               "1" 
##    totalPrizePool        totalSales  estimatedJackpot guaranteedJackpot 
##      "4481277.24"         "9958035"         "4500000"               "0" 
##       drawMachine           ballSet            status         nwwinners 
##            "RNG2"             "RNG"       "published"               "0" 
##        kznwinners         fswinners           winners       millionairs 
##               "0"               "0"           "47743"               "0" 
##         gpwinners         wcwinners         ncwinners         ecwinners 
##           "47743"               "0"               "0"               "0" 
##         mpwinners         lpwinners 
##               "0"               "0"
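The parsing step can be tried without touching the website. The sketch below uses a cut-down, hypothetical JSON string that imitates the nesting described above; the real response has many more fields, and the contents of totalWinnerRecord are not shown in this article, so it is left empty here.

```r
# A cut-down, hypothetical imitation of the JSON returned by the request
toyjson <- '{"data": {"drawDetails": {"drawNumber": "2013", "ball1": "24",
  "bonusBall": "46"}, "totalWinnerRecord": {}}}'

# parse_json turns the JSON text into nested R lists
toylist <- jsonlite::parse_json(toyjson)
names(toylist$data)

# unlist flattens the drawDetails list into a named character vector
toydraw <- unlist(toylist$data$drawDetails)
toydraw
```

Every value comes back as a character string, which is why the numeric columns need to be converted with as.numeric later on.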

We now have the data for Draw No. 2013 saved in a convenient format. All that remains is to cycle through all the past LOTTO draws and compile their data into a single spreadsheet. We can verify that the earliest draw for which results are available on the website is Draw No. 1506 from 6 June 2015. Thus we can proceed as below. (Advanced R programmers will probably prefer to use lapply and the pipe operator %>% from package magrittr rather than a for loop.)

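firstdrawno <- 1506
lastdrawno <- 2018 # Change to most recent draw number
ndraws <- length(firstdrawno:lastdrawno)
# One list element per draw, each holding that draw's named character vector
jsondat <- vector("list", ndraws)
for (d in firstdrawno:lastdrawno) {
  # Repeat the POST request, changing only the draw number
  response <- httr::POST(url = requesturl, body = list(gameName = "LOTTO", drawNumber = as.character(d), 
                    isAjax = "true"), encode = "form") 
  # Extract the response body as text, parse the JSON, and flatten drawDetails
  responsetext <- httr::content(response, as = "text", encoding = "UTF-8")
  jsondat[[d - firstdrawno + 1]] <- unlist(jsonlite::parse_json(responsetext)$data$drawDetails)
}
# Stack the per-draw vectors into one data frame (one row per draw)
lottotable <- as.data.frame(do.call(rbind, jsondat), stringsAsFactors = FALSE)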
## Warning in (function (..., deparse.level = 1) : number of columns of result is
## not a multiple of vector length (arg 1)
names(lottotable)
##  [1] "drawNumber"        "drawDate"          "nextDrawDate"     
##  [4] "ball1"             "ball2"             "ball3"            
##  [7] "ball4"             "ball5"             "ball6"            
## [10] "bonusBall"         "div1Winners"       "div1Payout"       
## [13] "div2Winners"       "div2Payout"        "div3Winners"      
## [16] "div3Payout"        "div4Winners"       "div4Payout"       
## [19] "div5Winners"       "div5Payout"        "div6Winners"      
## [22] "div6Payout"        "div7Winners"       "div7Payout"       
## [25] "div8Winners"       "div8Payout"        "rolloverAmount"   
## [28] "rolloverNumber"    "totalPrizePool"    "totalSales"       
## [31] "estimatedJackpot"  "guaranteedJackpot" "drawMachine"      
## [34] "ballSet"           "status"            "nwwinners"        
## [37] "kznwinners"        "fswinners"         "winners"          
## [40] "millionairs"       "gpwinners"         "wcwinners"        
## [43] "ncwinners"         "ecwinners"         "mpwinners"        
## [46] "lpwinners"
names(lottotable) <- names(jsondat[[length(jsondat)]])
# Change columns containing numbers from character to numeric
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- sapply(lottotable[numericcols], as.numeric)

# Write to Excel
xlsx::write.xlsx2(lottotable[1:37], file = "LOTTO_draw_results.xlsx", row.names = FALSE)

The Excel spreadsheet LOTTO_draw_results.xlsx in your working directory should now exist and contain the results of all LOTTO draws from 2015 to the present, as in the screenshot below. (If you are not sure of your working directory, just run the command getwd().)

Excel Spreadsheet
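The warning printed after the rbind call in the scraping code arises because rbind was given vectors of unequal length, which it fills by recycling values. One possible way to avoid the recycling is to align the draws by field name before binding them. The sketch below does this on toy data; the helper name bind_by_name is hypothetical, and which draws actually return fewer fields is not something this article establishes.

```r
# Toy stand-ins for two scraped draws; the second lacks one field
toydraws <- list(
  c(drawNumber = "1", ball1 = "5", gpwinners = "10"),
  c(drawNumber = "2", ball1 = "7")
)

# Align every draw to the union of field names, filling gaps with NA
bind_by_name <- function(draws) {
  allnames <- unique(unlist(lapply(draws, names)))
  rows <- lapply(draws, function(x) {
    out <- setNames(rep(NA_character_, length(allnames)), allnames)
    out[names(x)] <- x  # copy over the fields this draw actually has
    out
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

aligned <- bind_by_name(toydraws)
aligned
```

With the real data one would call bind_by_name(jsondat) in place of the do.call(rbind, jsondat) step, so that a missing field shows up as NA instead of a recycled value.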

Analysis

We can easily verify the total number of jackpot winners over all the draws covered by our data set:

attach(lottotable)
njackpotwinners <- sum(div1Winners)
njackpotwinners
## [1] 117
# Ratio of number of jackpot winners to number of draws
njackpotwinners / ndraws 
## [1] 0.2280702

This works out to roughly one jackpot winner for every four to five draws. But how many total plays have occurred in that time? The data contains the total sales for each draw, so if we know the price per play we can work out the number of plays. The price per play is now R5.00, but up until Draw No. 1607 (21 May 2016) it was R3.50. Thus,

# Plays per draw = total sales / price per play (R3.50 up to Draw No. 1607, R5.00 after)
totalplays <- ifelse(drawNumber <= 1607, totalSales / 3.5, totalSales / 5)

sum(totalplays)
## [1] 1879514323

The total number of plays is approaching two billion! So, does the observed frequency of jackpot winners fit the theoretical probability distribution? We will have to consider draws no. 1506-1731 (when there were 49 balls) separately from draws no. 1732-present (with 52 balls). Treating each play as an independent trial that wins the jackpot with probability 1 in choose(49, 6) or 1 in choose(52, 6), we can use a Pearson chi-squared goodness-of-fit test:

obs_jackpots49 <- sum(div1Winners[drawNumber <= 1731])
obs_nojackpot49 <- sum(totalplays[drawNumber <= 1731]) - obs_jackpots49
chisq.test(x = c(obs_jackpots49, obs_nojackpot49), p = c(1, choose(49, 6) - 1), rescale.p = TRUE)
## 
##  Chi-squared test for given probabilities
## 
## data:  c(obs_jackpots49, obs_nojackpot49)
## X-squared = 1.3928, df = 1, p-value = 0.2379
obs_jackpots52 <- sum(div1Winners[drawNumber >= 1732])
obs_nojackpot52 <- sum(totalplays[drawNumber >= 1732]) - obs_jackpots52
chisq.test(x = c(obs_jackpots52, obs_nojackpot52), p = c(1, choose(52, 6) - 1), rescale.p = TRUE)
## 
##  Chi-squared test for given probabilities
## 
## data:  c(obs_jackpots52, obs_nojackpot52)
## X-squared = 0.80276, df = 1, p-value = 0.3703

Both chi-squared goodness-of-fit tests have \(p\)-values well above 0.05, so we find no evidence that the observed number of jackpot wins in the LOTTO game departs from what the theoretical probabilities predict.
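To see what chisq.test is computing in the tests above, the statistic can be reproduced by hand. The counts in this sketch are made up purely for illustration and are not taken from the scraped data.

```r
# Hypothetical counts, for illustration only (not the scraped data)
plays <- 1e9
wins <- 55
p_win <- 1 / choose(52, 6)

observed <- c(wins, plays - wins)
expected <- plays * c(p_win, 1 - p_win)  # expected jackpots and non-jackpots

# Pearson statistic: sum of (observed - expected)^2 / expected
x2_manual <- sum((observed - expected)^2 / expected)
p_manual <- pchisq(x2_manual, df = 1, lower.tail = FALSE)

# The same test via chisq.test, with unnormalised weights as in the article
fit <- chisq.test(x = observed, p = c(1, choose(52, 6) - 1), rescale.p = TRUE)
c(manual = x2_manual, chisq.test = unname(fit$statistic))
```

Because the non-jackpot count is so large, almost the whole statistic comes from the gap between the observed and expected number of jackpots.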

Finally, let us look at how the number of LOTTO plays per draw has changed over time.

# Fix an erroneous year (2016 instead of 2015) in the `drawDate` of some early draws (within nos. 1514-1524)
drawDate[drawNumber %in% 1514:1524] <- gsub("2016", "2015", drawDate[drawNumber %in% 1514:1524])
par(mar = c(4, 4, 1, 1))
plot(as.POSIXct(drawDate), totalplays, type = "l", xlab = "Draw Date", ylab = "No. of Plays")
abline(v = as.POSIXct(drawDate[drawNumber == 1607]), lty = "dotted")

We can observe a sudden downward shift in mid-2016 at the point when the price per play increased from R3.50 to R5.00 (represented by the dotted line on the graph). We can also observe a huge spike in the number of plays for LOTTO Draw No. 1783 (27 Jan 2018). A glance at the data reveals that this draw had a guaranteed jackpot prize of R110 million, the largest guaranteed jackpot in the game’s history. There also seems to be a slight downward trend in plays per draw from 2018 to 2020, which could be due to the country’s economic struggles. There is also a sharp drop in plays in the most recent draws in March-May 2020, which is of course due to the national lockdown. During this period it was still possible to play LOTTO online but not in-store.

We can also observe some interesting patterns in when people tend to play LOTTO more.

dayofweek <- weekdays(as.POSIXct(drawDate))
# Median number of plays on Wednesday draws
median(totalplays[dayofweek == "Wednesday"])
## [1] 3051419
# Median number of plays on Saturday draws
median(totalplays[dayofweek == "Saturday"])
## [1] 3940625
# Median number of plays by day of month
dayofmonth <- as.integer(substr(drawDate, start = 9, stop = 10)) # integer, so days plot on a numeric axis
plot(aggregate(totalplays ~ dayofmonth, FUN = median), pch = 20, 
     xlab = "Day of Month", ylab = "Median Number of Plays")

We observe that the median number of plays for Saturday draws is nearly 900,000 higher than the median number of plays for Wednesday draws. We can also see that if we compute the median number of plays for each day of the month, the medians are generally higher at the end and beginning of the month than in the middle of the month. This makes sense given that most South Africans receive their salary near or at the end of the month.

Conclusion

Over the past two articles we have learned how to scrape or harvest data from webpages where the data is stored statically in the HTML code and webpages where the data is stored dynamically within Javascript-rendered content. Some R programming ability was required to extract exactly the data we need from the scraped objects, format it nicely into a data.frame, and analyse it. However, very little knowledge was required of HTML, Javascript, or generally how the Internet works. We just needed to get a little bit of information about the webpages we wanted to scrape from Developer tools in Google Chrome (or a similar tool in another browser). The bottom line: web scraping is a very powerful tool in the data scientist’s arsenal, and a surprisingly easy one to use.

2020-05-08 | Blog Articles