From the library's earliest issue (#45), Monday 4th of June, 1860. View on Trove →
Using n-grams has a number of potential limitations, and as an approach it is obviously an oversimplification, so it won't reveal all the changes that occurred; it can, however, provide an interesting overview of the period and highlight some of the more pronounced trends. For example, it relies on the text being correct, and given the gazette has been converted from the original documents via optical character recognition (OCR), like many OCR datasets (especially those of older documents with weathered paper) there are errors in the conversion ('side-lovers' for 'side-levers', 'dork' for 'dark', etc.) that will affect the final result to varying degrees.
As mentioned earlier, the gazette also repeats a number of entries, as well as including occasional supplements from other gazettes. Initially I was going to try to exclude all of that, but given the size, scale and variety of the article formats, in the end I was more worried that by attempting to exclude the repeats and supplements, unless I was 100% accurate, I might accidentally introduce more bias than if I just used the entire corpus as is. The same concern applied to the typos introduced by OCR: if I included some common OCR errors but missed others, this would weight the terms incorrectly. (When looking for terms I'd use Trove to get an idea of any different spellings over the period - 'grey' vs 'gray', 'color' vs 'colour', etc. - but I was mainly looking at how this related to hyphens and/or capital letters, such as 'Blucher boots' vs 'blucher-boots', rather than trying to capture all possible misspellings.)
There was a similar issue with homonyms, especially in the section on places. For example, was 'Albert' in the text referring to a person named Albert, the town called Albert, The Royal Albert Hotel, or an Albert chain? Obviously there are ways to disambiguate this on a case-by-case basis, and/or use some form of part-of-speech tagger to classify each occurrence, but in the end I just decided to work with the text as is.
The results should therefore still be taken with a grain of salt, as what they may appear to show is open to misinterpretation. For example, consider the trends for the two terms 'grey' and 'slaughter-house' below. As you can see, 'grey' drops to almost nothing after 1869, and 'slaughter-house' in 1885. However, this does not mean that there was no grey hair or grey trousers in NSW after 1869, or that slaughter-houses suddenly all closed in 1885. Rather, the gazette switched to using 'gray' instead of 'grey' in the case of the former, and no longer reported the names of those appointed as inspectors of slaughter-houses in the case of the latter.
Finally, given the size and scale of the corpus (almost 20 million words), and the number of terms analysed and counted, there are also likely to be a variety of errors in my code. Some will probably be to do with the regex I've used to extract the terms, so for the sake of transparency you can see what's been used by hovering over/clicking the text. Despite all these issues, however, the n-gram/bag-of-words approach is a good way of quickly getting a sense of the overall terms and trends in a large body of data.
This is just a quick overview of the main steps involved in the process.
- Download the gazette data from Trove (and thanks to @wragge for the Trove API Console for helping me work out the requests to make)
- Clean up the data (strip html tags, line-breaks, tabs, double spaces, stray characters etc)
- Count the words per year (by just counting the spaces) in order to work out the rate of terms per year
- I then used regex to count the number of times a word or phrase appeared per year. With certain terms, I'd also include the plural version (e.g. 'earring' and 'earrings'), and/or variations with hyphens (e.g. 'public house' and 'public-house'), and/or case-insensitive searches (e.g. SYDNEY and Sydney). To see what variations existed I'd use Trove, limited to the gazette from 1860-1900.
- I would then take the number of times the term appeared each year, and work out the rate based on the number of words in that year.
- The results were then graphed. As the rates differ greatly, from only a handful of mentions per year to thousands, they're scaled against the highest rate for that term, in order to best see the differences over time, so they are relative. The dotted line represents the absolute rate, i.e. every term on the page uses the same scale for the line, which is 400 mentions per 100,000 words in a year (for reference, 'gold' is mentioned ~800 times per 100,000 words).
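The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual code used: the sample corpus, function names and the `\bgrey\b` pattern are stand-ins, and the real cleaning, counting and scaling were presumably more involved.

```python
import re

# Hypothetical stand-in data: OCR'd gazette text keyed by year.
corpus = {
    1860: "the grey mare and the gray mare crossed the public house yard",
    1861: "grey trousers grey coat and a public-house licence",
    1862: "gray hair gray beard at the public house",
}

def clean(text):
    """Strip HTML tags and collapse whitespace (tabs, line-breaks, double spaces)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def term_rates(corpus, pattern, per=100_000):
    """Rate of a regex pattern per `per` words, for each year."""
    rates = {}
    for year, raw in corpus.items():
        text = clean(raw)
        n_words = text.count(" ") + 1          # words counted via spaces
        hits = len(re.findall(pattern, text))
        rates[year] = hits / n_words * per
    return rates

# Count 'grey' (case-insensitive) per 100,000 words, then scale each year
# against the highest rate so the series is relative, as in the graphs.
rates = term_rates(corpus, r"(?i)\bgrey\b")
peak = max(rates.values())
relative = {year: r / peak for year, r in rates.items()}
```

The same `term_rates` call works for plural or hyphen variations by widening the pattern, e.g. `r"(?i)\bpublic[ -]house\b"`.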
For the map section, the approach was slightly different:
- I ran a list of all the places in NSW through the gazette to see what was mentioned, and then ran that list through the gazette again to get the count. I used geonames.org for the names and locations, which lists its data providers as Geoscience Australia and the ABS among others. It (understandably) doesn't include historical suburbs, and the spelling of some place names has changed, so certain places may have been missed.
- Because this deals with the period when NSW was still a colony, the ACT obviously didn't exist, so I did the same process with ACT suburbs (but ignored all the ones that were named after politicians).
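The two-pass place matching might look something like the following sketch. The place list and text here are invented samples, assuming the real gazetteer names were matched as whole words:

```python
import re

# Hypothetical gazetteer entries (the real list came from geonames.org).
places = ["Albury", "Bathurst", "Albert"]

# Stand-in gazette text for one year.
text = "Stolen from Bathurst, a mare. At Albury, John Smith. Bathurst again."

# Pass 1: which places are mentioned at all.
mentioned = [p for p in places if re.search(rf"\b{re.escape(p)}\b", text)]

# Pass 2: how often each mentioned place appears.
counts = {p: len(re.findall(rf"\b{re.escape(p)}\b", text)) for p in mentioned}
```

`re.escape` guards against place names containing regex metacharacters, and the `\b` boundaries stop 'Albert' from also matching inside a longer word.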
For 'Selective Reporting':
- I got all articles from the gazette that mentioned the words 'murder' and 'Aboriginal' (or variations, including '(Ab.)' which is what the gazette often used as an abbreviation from 1877). There were 145 entries in the end.
- I then read the articles and, if they were reports of a murder, classified them according to the alleged suspect(s) and victim(s).