WEBVTT

00:00:00.000 --> 00:00:13.520
So, I would like to talk to you about the Digital Tolkien project. But before I do that,

00:00:13.520 --> 00:00:22.400
I want to talk about two things that happened in 1937. There were two big releases in the

00:00:22.400 --> 00:00:29.420
sort of cultural world. One was an animated film in America and one was a children's book

00:00:29.420 --> 00:00:38.300
in England. And both of them ended up radically changing their respective genres. The book

00:00:38.300 --> 00:00:46.100
was J.R.R. Tolkien's The Hobbit and the film was Snow White and the Seven Dwarfs. What's

00:00:46.100 --> 00:00:52.020
interesting about this is that both these stories took inspiration from Germanic stories

00:00:52.020 --> 00:00:59.540
and both heavily featured dwarfs. But one of the things you may not be aware of is that

00:00:59.540 --> 00:01:09.940
the word dwarfs is spelled differently in The Hobbit versus Snow White and the Seven Dwarfs.

00:01:09.940 --> 00:01:14.580
Now it's an interesting question why. The difference isn't just an American English thing, not

00:01:14.580 --> 00:01:26.100
at all. In fact, the dominant spelling of dwarfs was with an F, not a VE in 1937. And

00:01:26.100 --> 00:01:32.020
Michaela mentioned Google Ngram Viewer. So one of the things we can do is go to Google

00:01:32.020 --> 00:01:38.580
Books Ngram Viewer and look at the relative frequency of the two spellings of dwarfs.

00:01:38.580 --> 00:01:45.340
An interesting thing you notice though is that just after 1937, starting in the 1940s,

00:01:45.340 --> 00:01:52.900
the Tolkien spelling with the VE started to rise. It was almost non-existent at that point.

00:01:52.900 --> 00:01:59.780
But by around the time that the Peter Jackson films came out, the two crossed over and now

00:01:59.780 --> 00:02:07.640
the spelling, Tolkien's spelling dominates. I bring this up not to talk about spelling

00:02:07.920 --> 00:02:14.360
conventions but really just to give you an example of doing digital humanities in the

00:02:14.360 --> 00:02:20.200
sense that what we've done is we've counted some things and compared them across time.

00:02:20.200 --> 00:02:24.520
But one of the things that's sort of hidden behind this sort of chart is just a tremendous

00:02:24.520 --> 00:02:29.320
amount of work to get all these texts digitised and everything. And this is a very common

00:02:29.320 --> 00:02:34.000
thing in digital humanities. It's one thing for us to be able to count things and do visualisations

00:02:34.040 --> 00:02:40.160
like this but there's a lot that needs to get done before that happens. And I'm going

00:02:40.160 --> 00:02:44.680
to be talking about quite a bit of that sort of stuff today.

00:02:44.680 --> 00:02:51.680
I want to briefly say something about my background and entry into all of this with Tolkien. I

00:02:53.440 --> 00:02:58.320
was a huge fan of The Hobbit when I was about 11 years old and the other thing I was really

00:02:58.360 --> 00:03:04.360
interested in was computer games. And at the time computer games were these text adventures.

00:03:04.360 --> 00:03:08.360
You got a little bit of graphics but it was mostly interacting with the world by typing

00:03:08.360 --> 00:03:13.360
sentences where you would say, you know, get lamp and go north and stuff like that. And

00:03:13.360 --> 00:03:18.640
as an 11 year old I wanted to write these kind of games. And reading computer magazines

00:03:18.640 --> 00:03:23.360
about this I discovered that behind these computer games they were doing things like

00:03:23.360 --> 00:03:27.520
parsing the sentences and they had lexicons and stuff like that. So I was learning all

00:03:27.520 --> 00:03:34.520
of this linguistic terminology that planted the seed in my mind to end up studying linguistics

00:03:35.400 --> 00:03:41.960
and use computers to better understand languages and texts.

00:03:41.960 --> 00:03:47.200
The third book in the, well third part of The Lord of the Rings, if you look in the

00:03:47.200 --> 00:03:52.000
appendices, has a tremendous amount of linguistic information and this was another thing that

00:03:52.000 --> 00:03:57.160
sort of planted the seed. These different writing systems and languages and so on that

00:03:57.160 --> 00:04:01.600
was a big part of all this. So I ended up having an interest in computers and also having

00:04:01.600 --> 00:04:06.520
an interest in language. I studied linguistics as an undergraduate and in particular at the

00:04:06.520 --> 00:04:12.920
time I was interested in ancient Greek. And so a lot of the software development that

00:04:12.920 --> 00:04:18.800
I've done over the years has been applying computers to ancient Greek and I work on,

00:04:18.800 --> 00:04:22.900
I built something called the Scaife Viewer, which is the new reading environment for the Perseus

00:04:23.020 --> 00:04:27.940
Digital Library. Some of you may be familiar with Perseus. It's one of the oldest digital

00:04:27.940 --> 00:04:34.140
humanities projects. It's been around since the 80s and basically providing free Greek

00:04:34.140 --> 00:04:41.140
and Latin texts to classicists. And the sort of stuff in the software that I've built includes

00:04:42.060 --> 00:04:49.060
a reading environment here. This is the start of Homer's Iliad but bringing in dictionaries,

00:04:49.060 --> 00:04:54.900
syntactic annotations, translation alignments in this case with Persian and so bringing

00:04:54.900 --> 00:04:59.740
together all this sort of information, images of the manuscripts. This is the Venetus A

00:04:59.740 --> 00:05:06.180
manuscript and so tying the images to the text with scholarly commentaries from the

00:05:06.180 --> 00:05:09.580
middle ages on the right hand side and so on, bringing all this sort of information

00:05:09.580 --> 00:05:16.580
together in an online reading environment. But about six years ago I was actually at

00:05:16.580 --> 00:05:22.780
a conference in Oxford and it occurred to me where I asked the question of myself, what

00:05:22.780 --> 00:05:27.260
if the works of Tolkien, The Hobbit and The Lord of the Rings and The Silmarillion, what

00:05:27.260 --> 00:05:32.980
if they were treated as objects of philological study like these Greek texts that I was studying?

00:05:32.980 --> 00:05:36.540
What if you had reading environments where you could bring in all of this linguistic annotation

00:05:36.540 --> 00:05:43.540
and other data to bear on the text? And so that was the beginnings of the Digital Tolkien

00:05:44.540 --> 00:05:51.340
project, which I described from the very beginning as a scholarly project focused on Tolkien

00:05:51.340 --> 00:05:57.140
from both a corpus linguistic and digital humanities perspective.

00:05:57.140 --> 00:06:00.860
And the way I sort of very generally think about the sorts of things we do, we start

00:06:00.860 --> 00:06:05.260
with the text and we think about the way that the text is structured and I'll talk a little

00:06:05.260 --> 00:06:12.260
bit about each of these. Then once you have a structure to a text you can cite it. You've

00:06:12.900 --> 00:06:16.740
got a way of referring to parts of the text once you understand its structure. I'll say

00:06:16.740 --> 00:06:21.460
a little bit more about that and then you can do annotation analysis and in many cases

00:06:21.460 --> 00:06:25.020
ultimately visualise that.

00:06:25.020 --> 00:06:32.020
So it all started off with getting the text and marking it up, in this case in the Extensible

00:06:32.820 --> 00:06:39.620
Markup Language or XML and this is really just a way of giving some basic structure

00:06:39.620 --> 00:06:46.620
to the text. In the case of novels it's typically chapters and paragraphs. All of Tolkien's

00:06:46.780 --> 00:06:53.020
works also involve poetry as well, so you have the need to mark up that as well. In

00:06:53.020 --> 00:06:57.580
some cases you have other material, things like letters and so on. And so we go through

00:06:57.580 --> 00:07:04.380
the process of marking that up, putting in these extra things you see in angled brackets

00:07:04.380 --> 00:07:07.980
that indicate that it's a paragraph and so on.

00:07:09.620 --> 00:07:16.620
Another thing you'll notice, which turns out to be very important for a lot of the

00:07:17.300 --> 00:07:22.780
sort of analysis and annotation work that I wanted to do, is that we go through all

00:07:22.780 --> 00:07:29.780
the quotation marks and disambiguate whether they are indicating speech or they're apostrophes,

00:07:30.060 --> 00:07:33.540
indicating possession or contraction or something like that. Because if you're wanting to do

00:07:33.540 --> 00:07:38.280
stuff later which we'll get to in a moment of extracting direct speech, then you need

00:07:38.280 --> 00:07:43.800
to understand is that a quote just indicating possession or is it marking the end of direct

00:07:43.800 --> 00:07:48.000
speech and that sort of stuff. So we sort of go through that process of disambiguation.

00:07:48.000 --> 00:07:52.760
You'll also notice that these paragraphs are numbered and this is part of the citation

00:07:52.760 --> 00:07:59.760
aspect of things. One of the things that happens in classics, in Latin and Greek and in medieval

00:08:00.000 --> 00:08:05.560
texts, is that scholars have developed citation systems where you can refer to parts of the

00:08:05.560 --> 00:08:12.560
text independent of a particular print edition. So you don't say page 213 of the Bible, you

00:08:13.200 --> 00:08:20.200
say John 3:16 or something, right? And the same is true with the Iliad, with Herodotus, anything

00:08:20.200 --> 00:08:25.760
like that. You have conventions about how you refer to it. And so one of the first things

00:08:25.760 --> 00:08:32.760
that I wanted to do was shift Tolkien scholarship to use that kind of citation to free scholars

00:08:32.760 --> 00:08:37.680
from having to have a particular edition. Because it's still the case that if you pick

00:08:37.680 --> 00:08:41.280
up a lot of scholarly works, they'll talk about page numbers in a particular edition

00:08:41.280 --> 00:08:45.080
and if you don't have that edition, you don't necessarily know what on earth they're talking

00:08:45.080 --> 00:08:52.080
about. Another reason this is important in the case of Tolkien is that these works are

00:08:52.080 --> 00:09:00.080
still... When we annotate these texts and share our annotations with one another, we typically

00:09:01.080 --> 00:09:08.080
can't share the text itself. I couldn't give you this file. It would give you trouble.

00:09:08.720 --> 00:09:12.660
But if we agree on these citations, these ways of referring to... I'm talking about

00:09:12.660 --> 00:09:18.800
paragraph 37 in chapter 5, then we can talk about and annotate those texts without actually

00:09:18.800 --> 00:09:25.560
having to share the text. So we end up with this sort of citation system and we've done

00:09:25.560 --> 00:09:32.320
this for most of the major published works of Tolkien. And once you have text and you

00:09:32.320 --> 00:09:39.320
have this citation, we have the ability to do searches and map to things. One of the

00:09:39.360 --> 00:09:44.520
early things that I did was build this search engine where you can start to type a particular

00:09:44.520 --> 00:09:50.520
part of text and get back the reference. So if you've got a book and you're saying, how

00:09:50.520 --> 00:09:54.360
am I going to communicate to somebody what the reference for this paragraph is, you can

00:09:54.360 --> 00:10:00.160
start typing in the paragraph and down the bottom you can see this LR 6.09.395, which

00:10:00.160 --> 00:10:07.160
is the only paragraph that contains that n-gram, well I'm back. Which those of you who have

00:10:07.400 --> 00:10:14.400
read Lord of the Rings would know is the final line of the story spoken by Samwise Gamgee.

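NOTE
A minimal sketch of the phrase-to-reference lookup described above, assuming paragraphs are stored under citation references and indexed by word n-grams. The function names, the placeholder text and the X 1.01.00x references are all hypothetical and are not the project's actual code or data.
```python
from collections import defaultdict
PUNCTUATION = ".,;:!?'\""
def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = [w.strip(PUNCTUATION).lower() for w in text.split()]
    return [w for w in words if w]
def build_ngram_index(paragraphs, n=3):
    """Map each word n-gram to the set of references whose paragraph contains it."""
    index = defaultdict(set)
    for ref, text in paragraphs.items():
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(ref)
    return index
def lookup(index, phrase, n=3):
    """Return the references of paragraphs containing every n-gram of the typed phrase."""
    words = tokenize(phrase)
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return set()
    refs = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        refs &= index.get(gram, set())
    return refs
# Placeholder paragraphs and invented references, for illustration only.
paragraphs = {
    "X 1.01.001": "The quick brown fox jumps over the lazy dog.",
    "X 1.01.002": "The quick brown cat sleeps under the warm sun.",
}
index = build_ngram_index(paragraphs)
unique = lookup(index, "fox jumps over")    # found in one paragraph only
shared = lookup(index, "the quick brown")   # found in both paragraphs
```
Intersecting the n-gram sets only approximates an exact phrase match, which is enough to narrow a typed fragment down to candidate paragraphs.
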
00:10:16.280 --> 00:10:20.200
But there's an interesting side effect of building this sort of search engine, which

00:10:20.200 --> 00:10:24.800
is then you can search for terms and get interesting comparisons across the work. So

00:10:24.800 --> 00:10:29.800
you can search for a phrase like the one ring and get a distribution of where that phrase

00:10:29.800 --> 00:10:36.800
appears in Lord of the Rings, in the Silmarillion, in Letters, Unfinished Tales and so on.

00:10:39.640 --> 00:10:45.720
We've built other stuff as well that provides lexical level information, so information

00:10:45.720 --> 00:10:51.400
on individual words, because a lot of the research questions that people have have

00:10:51.400 --> 00:10:58.400
to do with understanding certain things at a word level, a lexical level. So for example

00:11:00.080 --> 00:11:04.760
there are people that want to know when Tolkien talks about trees. Tolkien very famously talks

00:11:04.760 --> 00:11:09.680
about trees a lot. But you can't just search for the word tree, because maybe he uses a

00:11:09.680 --> 00:11:15.480
particular species of tree, maybe he uses other terms, botanical terms. And so you need

00:11:15.480 --> 00:11:22.480
to know what are the botanical terms and tree species and so on that Tolkien uses.

00:11:22.480 --> 00:11:26.960
So one of the things we've been doing with this Tolkien Glossary is basically a database

00:11:26.960 --> 00:11:31.960
of all the words that Tolkien uses with a way of people being able to annotate them

00:11:31.960 --> 00:11:36.640
for various reasons. Another sort of different application of this, a lot of people are interested

00:11:36.640 --> 00:11:43.160
in how Tolkien chooses to use English words that are of Germanic origin as opposed to

00:11:43.160 --> 00:11:50.160
Latinate origin. So obviously English is a strange combination of, because of the Norman

00:11:51.840 --> 00:11:58.840
invasion of sort of, it's a Germanic language but with a tremendous amount of French influence.

00:12:00.240 --> 00:12:05.680
Tolkien was a Germanic philologist, he understood this better than almost anyone else. And there

00:12:05.680 --> 00:12:11.840
are hints throughout his writing of him using that as a device, deliberately choosing to

00:12:11.840 --> 00:12:16.800
use more Germanic words rather than Latinate words at various points in the text. But

00:12:16.800 --> 00:12:20.600
in order to really quantify that we need to know for every single word is it Germanic

00:12:20.600 --> 00:12:25.600
or Latinate and so on. And so having a system for annotating the words in that manner is

00:12:25.600 --> 00:12:30.240
something that we've worked on.

00:12:30.240 --> 00:12:35.520
But I want to come back to just simply counting words here. This is a visualisation of mentions

00:12:35.520 --> 00:12:40.120
of Frodo and Sam by Book of Lord of the Rings. So Lord of the Rings is structured into six

00:12:40.160 --> 00:12:47.160
different books and here you can see the actual number of occurrences. So a couple of things

00:12:47.160 --> 00:12:52.480
to say about this. One is just sort of in terms of the story itself, one thing you'll

00:12:52.480 --> 00:12:59.480
notice is that Book 3 and Book 5 hardly talk about Frodo and Sam at all. And the reason

00:12:59.520 --> 00:13:04.760
is because the way that Tolkien tells the story is once the fellowship gets split we

00:13:04.760 --> 00:13:09.860
follow different groups of people at different times. And to emphasise the separation of

00:13:09.860 --> 00:13:16.860
these various parties within the fellowship, Tolkien separates when he writes about them.

00:13:17.760 --> 00:13:22.260
And at the most extreme case he separates writing about Frodo and Sam who are on their

00:13:22.260 --> 00:13:26.460
own from the rest of the fellowship. And it really emphasises, it gives the reader that

00:13:26.460 --> 00:13:29.520
experience of I don't know what's happening to the other group because I haven't read

00:13:29.520 --> 00:13:34.220
about them for ten chapters. And so Book 3 and Book 5 don't cover what's going on with

00:13:34.220 --> 00:13:40.660
Frodo and Sam at all. The other interesting thing you get here though is the rise of Sam.

00:13:40.660 --> 00:13:47.060
Sam starts off as not a particularly important character but by the end is some would argue

00:13:47.060 --> 00:13:54.060
the most important character in the story. So we see some of that in this. But I do want

00:13:55.380 --> 00:14:00.800
to point out some limitations of what I've done here. This is purely based on occurrences

00:14:00.800 --> 00:14:06.320
of the word Frodo and the word Sam. And there are two issues with that. One is that if

00:14:06.320 --> 00:14:11.720
they're referred to by any other name, it's not going to come up here. And secondly, this

00:14:11.720 --> 00:14:15.360
is not saying anything to do with whether they are in the narrative main line. They

00:14:15.360 --> 00:14:18.960
might be mentioned by another group. And in fact that's why you get a bit of Frodo and

00:14:18.960 --> 00:14:24.600
Sam in Book 3 for example. That's other people talking about Frodo and Sam. And I just bring

00:14:24.600 --> 00:14:29.680
this up because depending on your research question and the kind of thing that you're

00:14:29.680 --> 00:14:35.920
trying to investigate, you need more annotations to tease out those examples. So again one

00:14:35.920 --> 00:14:40.960
of the things that we've been doing is solving both of those problems. So first of all, and

00:14:40.960 --> 00:14:45.560
I'll talk a little bit about this in a few slides, going through and making sure that

00:14:45.560 --> 00:14:52.420
we link the particular name that is used for someone with who that character actually is.

00:14:52.420 --> 00:14:55.940
Again those of you who know the story will know that Strider turns out to be Aragorn.

00:14:55.940 --> 00:14:59.360
So sometimes he's called Strider, sometimes he's called Aragorn. He actually has over

00:14:59.360 --> 00:15:04.800
a hundred different names that are used for him throughout the novel. So it's not quite

00:15:04.800 --> 00:15:11.240
as bad as Tolstoy but it's up there. So in order to for example find when is Aragorn

00:15:11.240 --> 00:15:15.460
talked about, you need to know that he's also called Strider and Elessar and the heir of

00:15:15.460 --> 00:15:20.500
Isildur and all sorts of things like that. But also to know whether the character is

00:15:21.160 --> 00:15:26.160
in the scene, actually on the narrative main line as opposed to merely being referenced,

00:15:26.160 --> 00:15:30.480
talked about, then that's another whole level of annotation that needs to be done and I'll

00:15:30.480 --> 00:15:35.880
talk a little bit about that in a moment. Here's another example not with characters

00:15:35.880 --> 00:15:41.160
but just with words, the word evil versus good and their relative frequency across the

00:15:41.160 --> 00:15:48.000
three books. And it's really quite interesting. You immediately get this sense, okay the Hobbit

00:15:48.000 --> 00:15:53.000
uses the word good a lot more than evil but Silmarillion uses the word evil a lot more

00:15:53.000 --> 00:15:57.700
than good and Lord of the Rings is somewhere in the middle. But again there are certain

00:15:57.700 --> 00:16:02.360
caveats that need to be made about this. The word good in particular does not always mean

00:16:02.360 --> 00:16:08.580
the opposite of evil. It could have just been a good breakfast that they had. And so again

00:16:08.580 --> 00:16:13.220
there's some more annotation that potentially needs to be done with this sort of stuff to

00:16:13.220 --> 00:16:17.620
do word sense disambiguation to make sure that you're really capturing what's going

00:16:17.620 --> 00:16:21.100
on here. But it's still interesting. A lot of these things are not necessarily about

00:16:21.100 --> 00:16:25.640
coming to a conclusion but the starting point of an investigation. You can notice these

00:16:25.640 --> 00:16:30.820
sorts of things and say okay now I want to investigate the text more, go into all the

00:16:30.820 --> 00:16:38.520
places where these words are mentioned and really get a sense of what's going on. Another

00:16:38.520 --> 00:16:42.320
example that was a surprise to me, this is actually a really good example of something

00:16:42.320 --> 00:16:47.180
that kicked things off for me and I ended up doing a whole lot more research and annotation

00:16:47.180 --> 00:16:53.880
as a result of this. So this is whether the word do not is contracted to don't in direct

00:16:53.880 --> 00:17:01.460
speech. And I did not expect these results to be so strong. So in the Hobbit in direct

00:17:01.460 --> 00:17:12.560
speech it's almost always contracted. 420 times per 100,000 tokens as opposed to only 19.6.

00:17:12.560 --> 00:17:19.060
And the Silmarillion never ever contracts don't. It's always do not. But perhaps the

00:17:19.060 --> 00:17:26.440
most interesting thing of all is the Lord of the Rings has split almost exactly 50-50.

00:17:26.440 --> 00:17:31.840
And that raised the question when I did these numbers, what's different about when it's

00:17:31.840 --> 00:17:38.840
contracted and not contracted and is it to do with perhaps who's speaking. I'll come

00:17:38.840 --> 00:17:41.800
back to that in a moment. There's a couple more things I want to say but I will return

00:17:41.800 --> 00:17:47.840
to that question because it really is quite interesting. So I already mentioned a lot

00:17:47.840 --> 00:17:53.360
of this stuff, this idea of going through the text and looking at these named entities

00:17:53.360 --> 00:17:58.600
and mapping them to when you have a place mentioned or a person mentioned actually mapping

00:17:58.600 --> 00:18:04.120
that to the character or the location, tying the names to the characters and places so

00:18:04.120 --> 00:18:11.720
that you can do more with the text than simply search for where particular words are used.

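NOTE
A minimal sketch of tying names to characters, assuming a hand-built alias table. Strider, Elessar and Aragorn come from the talk; the table layout, the function name and the sample sentence are hypothetical. It only handles single-word names, so a phrase like the heir of Isildur would need phrase matching, and it says nothing about whether a character is actually in the scene.
```python
import re
from collections import Counter
# Hypothetical alias table: lowercase surface form mapped to the character it names.
ALIASES = {
    "aragorn": "Aragorn",
    "strider": "Aragorn",
    "elessar": "Aragorn",
    "frodo": "Frodo",
    "sam": "Sam",
}
def count_characters(text, aliases=ALIASES):
    """Count mentions per character, folding every known alias into one entry."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        character = aliases.get(token.lower())
        if character is not None:
            counts[character] += 1
    return counts
# Invented sample sentence, for illustration only.
sample = "Strider led the way. Sam followed Frodo, and Frodo trusted Aragorn."
counts = count_characters(sample)
```
Counting by character rather than by surface form is what lets Strider and Aragorn land in the same total.
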
00:18:11.720 --> 00:18:18.000
But coming back to that question of whether the reason for the difference in the contraction

00:18:18.000 --> 00:18:22.680
of do not to don't is a character issue, of course one of the things that you need to

00:18:22.680 --> 00:18:27.000
know in order to do that is who is saying everything. So once you go through a text

00:18:27.000 --> 00:18:31.200
and you mark up all the direct speech you still need to know who's speaking. And so

00:18:31.200 --> 00:18:37.280
last year, because I wanted to investigate this specific question of do not and don't, it

00:18:37.280 --> 00:18:42.880
ended up being valuable for a whole lot of other things including my dissertation. A

00:18:42.880 --> 00:18:48.120
group of volunteers went through and looked at every bit of direct speech in the Hobbit,

00:18:48.120 --> 00:18:54.200
the Lord of the Rings and the Silmarillion and we had people separately annotate who

00:18:54.200 --> 00:19:00.520
the speaker was and then check that they agreed. And so we went through and for every single

00:19:00.520 --> 00:19:06.640
bit of direct speech we know who the speaker is. And so what that then allows us to do is say okay

00:19:06.640 --> 00:19:11.360
who is doing the contraction in this case. There's a whole lot more we ended up doing

00:19:11.360 --> 00:19:17.720
and I'll say more about that but just on this question of do not to don't. There's a group

00:19:17.720 --> 00:19:22.520
of people that always contract and a group of people that never contract. So no elf ever

00:19:22.520 --> 00:19:31.260
contracts do not to don't. No dwarf ever contracts. The sort of the higher class humans, the Gondorians,

00:19:31.260 --> 00:19:37.380
also never contract. And the Rohirrim, the horsemen of the plains that sort of have a

00:19:37.380 --> 00:19:49.020
more conservative background maybe, they also never contract. In contrast to that, almost

00:19:49.020 --> 00:19:54.620
all of the Hobbits contract, most of the sort of lower class men contract and all the Orcs

00:19:54.620 --> 00:20:00.540
contract, 100% of the time. Where it gets really interesting is the people that can

00:20:00.540 --> 00:20:07.700
do both. Gandalf, Hobbits, the sort of more upper class Hobbits like Frodo, Merry,

00:20:07.700 --> 00:20:13.940
Pippin, Gandalf, Aragorn, they all are able to do both and I've got the percentages there

00:20:13.940 --> 00:20:18.980
of how often they contract. One thing that I have not yet looked at that I think would

00:20:18.980 --> 00:20:25.100
be interesting here is whether it varies depending on who they're speaking to. But we don't have

00:20:25.100 --> 00:20:30.500
that annotated yet. And again another example of how there are these questions that arise

00:20:30.500 --> 00:20:35.880
that need you to then go and annotate the data in a richer way. But I do think it's

00:20:35.880 --> 00:20:44.980
interesting that characters like Gandalf and Aragorn, who are both of a much sort of higher

00:20:44.980 --> 00:20:53.580
background but interact a lot with the common person or Hobbit or whatever, that they are

00:20:53.580 --> 00:21:01.380
quite happy to contract as needed. Okay, one of the sort of related questions to all this

00:21:01.380 --> 00:21:10.060
that dominated my research last year and the dissertation that I did at Lancaster University

00:21:10.060 --> 00:21:15.340
was this whole question of the overall style that the Hobbit, the Lord of the Rings and

00:21:15.340 --> 00:21:20.140
the Silmarillion are written in. It's often been remarked that they're in very different

00:21:20.180 --> 00:21:24.700
styles and it's obvious even if you just read the first paragraph of the three of them,

00:21:24.700 --> 00:21:32.400
they seem to be written in very different styles. But the question is how do we actually

00:21:32.400 --> 00:21:38.820
describe that in a quantitative way? What are the linguistic features of the text that

00:21:38.820 --> 00:21:49.060
really get to that intuitive difference that we pick up? And one way of doing that is to

00:21:49.060 --> 00:21:55.340
look at the use of function words. Because one of the things you can't do, overall words

00:21:55.340 --> 00:22:00.740
in general are going to vary depending on the topic. The content words, particularly

00:22:00.740 --> 00:22:05.780
the common nouns and to a lesser extent the verbs that get used are going to depend on

00:22:05.780 --> 00:22:10.900
what the story is about. And I'll give an example of that in a moment. So what often

00:22:10.980 --> 00:22:17.980
happens in these kinds of studies of variations in style is you look at the function words,

00:22:19.060 --> 00:22:26.060
the prepositions and the articles and so on. So imagine that we decided let's look at the

00:22:27.220 --> 00:22:32.620
top most common function words in each of these three. Will that tell us a difference

00:22:32.620 --> 00:22:37.300
between the three? Well I'm going to show you the top three words in the Hobbit, the

00:22:37.380 --> 00:22:44.380
top three words in Lord of the Rings and the top three words in the Silmarillion. Not particularly

00:22:45.980 --> 00:22:51.940
helpful at this point. Now coming back to what I said about the content words, there

00:22:51.940 --> 00:22:58.180
are certain words that only appear in one of these. So burglar, grumbled, mutton, ruffians,

00:22:58.180 --> 00:23:04.180
hoofs, file, theme, ban, yearned. If you're familiar with the books you would probably

00:23:04.180 --> 00:23:07.660
be able to guess that the first is the Hobbit, the second is Lord of the Rings and the third

00:23:07.660 --> 00:23:13.100
is the Silmarillion. So when we're talking about these content words it's a lot easier. But

00:23:13.100 --> 00:23:20.100
coming back to this, one thing that might surprise you is that while knowing the top

00:23:20.140 --> 00:23:25.500
three words doesn't tell you much about the difference, if we look at the relative frequency

00:23:25.500 --> 00:23:31.540
of those something interesting pops out. The Hobbit and the Lord of the Rings don't really

00:23:31.540 --> 00:23:36.820
differ that much. But look at the Silmarillion. The Silmarillion has considerably higher proportion

00:23:36.820 --> 00:23:43.820
of the definite article the and even more so the word of. And it turns out that if you

00:23:48.460 --> 00:23:53.480
plot every chapter of Lord of the Rings, the Hobbit and the Silmarillion based on the relative

00:23:53.480 --> 00:24:00.480
frequency of of on the x axis and but on the vertical axis and hopefully this will sort

00:24:00.840 --> 00:24:04.720
of come out, the colours might not come out. The Silmarillion which are the light blue

00:24:04.720 --> 00:24:09.400
dots are almost entirely separated from the others. There's only one exception, there's

00:24:09.400 --> 00:24:15.360
only one chapter. In other words, if you had a chapter and I wanted to know whether it

00:24:15.360 --> 00:24:19.960
was from the Silmarillion on the one hand or the Hobbit and Lord of the Rings on the

00:24:19.960 --> 00:24:26.080
other hand, if you just told me the relative proportion of the word of I would be able

00:24:26.080 --> 00:24:32.800
to tell you with only one chapter wrong that it was the Silmarillion. We can actually do

00:24:32.800 --> 00:24:37.600
slightly better than that by not just looking at two which I've done here in these two dimensions

00:24:37.600 --> 00:24:42.600
but look at say the top forty function words. So one of the things that I did was I took

00:24:42.600 --> 00:24:46.240
the top forty function words. Now obviously if we were trying to draw that, that would

00:24:46.240 --> 00:24:53.240
mean a forty dimensional space which is a little hard to deal with. But we have ways

00:24:53.320 --> 00:25:00.320
of basically rotating and stretching multi-dimensional spaces so that you can look at them just in

00:25:00.320 --> 00:25:04.480
two dimensions, flatten them in two dimensions and actually see a lot of the differences.

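NOTE
A minimal sketch of the flattening described above, assuming each chapter has already been reduced to a row of function-word relative frequencies. It uses principal component analysis computed by plain power iteration, which is one way of doing this kind of reduction. The function names and the six rows of numbers are invented for illustration and are not the project's code or data.
```python
import math
import random
def relative_frequencies(tokens, function_words):
    """Return each function word's share of all tokens in one chapter."""
    total = len(tokens)
    return [tokens.count(w) / total for w in function_words]
def top_component(rows, iterations=200, seed=0):
    """Find the leading principal direction of centred rows by power iteration."""
    dims = len(rows[0])
    rng = random.Random(seed)
    vec = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        # Multiply by X^T X without forming the matrix explicitly.
        scores = [sum(r[j] * vec[j] for j in range(dims)) for r in rows]
        new = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            return [0.0] * dims  # no variance left in the data
        vec = [x / norm for x in new]
    return vec
def pca_2d(rows):
    """Project rows onto their first two principal components."""
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dims)]
    centred = [[r[j] - means[j] for j in range(dims)] for r in rows]
    first = top_component(centred)
    xs = [sum(r[j] * first[j] for j in range(dims)) for r in centred]
    # Deflate: remove the first component so power iteration finds the second.
    residual = [[r[j] - x * first[j] for j in range(dims)] for r, x in zip(centred, xs)]
    second = top_component(residual)
    ys = [sum(r[j] * second[j] for j in range(dims)) for r in centred]
    return list(zip(xs, ys))
# Invented frequencies of four function words in six chapters: three from one work, three from another.
rows = [
    [0.080, 0.050, 0.040, 0.010],
    [0.082, 0.052, 0.038, 0.011],
    [0.078, 0.049, 0.041, 0.009],
    [0.060, 0.025, 0.035, 0.022],
    [0.058, 0.024, 0.036, 0.020],
    [0.061, 0.027, 0.034, 0.021],
]
points = pca_2d(rows)
```
With forty function words the rows simply get longer; the projection still yields one two-dimensional point per chapter that can be plotted.
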
00:25:04.480 --> 00:25:09.880
So if you imagine if I put my fingers, imagine the two points at the tip of my fingers, if

00:25:09.880 --> 00:25:14.800
I'm looking at it from this angle I can't distinguish my two fingers but if I look at

00:25:14.800 --> 00:25:18.360
it from this angle it's much clearer that my two fingers are in different points. That's

00:25:18.360 --> 00:25:22.640
the sort of thing that's going on except instead of three dimensions turning into two we've

00:25:22.640 --> 00:25:27.960
got forty dimensions turning into two. But if you do that and plot those points, this

00:25:27.960 --> 00:25:33.480
is called principal component analysis by the way, it's one way of doing this dimensionality

00:25:33.480 --> 00:25:38.520
reduction, you actually get a situation where you can draw a line and completely separate

00:25:38.520 --> 00:25:45.520
all chapters of the Silmarillion from the others. Now one other thing we can do because

00:25:46.320 --> 00:25:51.160
we annotated all of the text, the direct speech with who is speaking is we can do this sort

00:25:51.160 --> 00:25:58.160
of comparison for speakers as well. This is not coming out that well on the screen

00:25:59.360 --> 00:26:06.360
but I'll try to highlight what's going on here. So this is again just taking each chapter,

00:26:06.840 --> 00:26:13.840
the speech of four different characters, Gandalf, Frodo, Sam and Gollum and it's just doing

00:26:14.760 --> 00:26:21.760
a principal component analysis of the top forty function words. And in the bottom middle

00:26:23.040 --> 00:26:30.040
the yellow is Sam, you can see Sam is speaking distinctly from Frodo who is the sort of teal

00:26:30.280 --> 00:26:35.120
in the, you can barely make out between the red. The red is Gandalf. One interesting thing

00:26:35.120 --> 00:26:41.640
about Gandalf is he starts off much closer to Frodo when he's in the Shire talking to

00:26:41.640 --> 00:26:48.640
other hobbits. As the book goes on and he's out in the wide world his speech changes quite

00:26:50.000 --> 00:26:54.360
considerably. But perhaps the most fascinating thing is Gollum in the top left hand corner,

00:26:54.360 --> 00:26:59.600
he doesn't sound like anyone else at all, except you might just be able to make out that there's

00:26:59.600 --> 00:27:06.600
a teal dot right in the middle of Gollum's purple, and that's one chapter when Frodo is

00:27:06.600 --> 00:27:13.600
in Gollum's circle: when he's really under the influence of the Ring, his

00:27:17.040 --> 00:27:22.960
speech changes and that is evident just in the proportion of function words that he uses

00:27:22.960 --> 00:27:30.120
which I find really fascinating. But Tolkien intuitively knew how to change the style of

00:27:30.120 --> 00:27:37.120
speech of these different characters and it comes out very clearly in this sort of

00:27:37.120 --> 00:27:41.800
analysis. A few other things, I'll just quickly go through a bunch of other stuff we've done.

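NOTE
A minimal sketch of the function-word analysis described above, not the project's actual code.
Assumptions: the eight-word list stands in for the top forty function words, the sample
texts are invented placeholders, and the PCA is computed with NumPy's SVD.
```python
import re
from collections import Counter
import numpy as np
# Hypothetical, deliberately tiny word list; the talk uses the top forty function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]
def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid dividing by zero on empty input
    return [counts[word] / total for word in vocabulary]
def pca_2d(profiles):
    """Project rows (one per text) onto their first two principal components."""
    matrix = np.asarray(profiles, dtype=float)
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
# Invented placeholder texts; in practice each would be a chapter or one speaker's lines.
texts = [
    "the road went on and on to the hills and it was a long road",
    "it is said that in the days of old a king ruled in that land",
    "to the east of the river a path led up to a tower",
    "a traveller is seldom late to a meeting in the village",
]
points = pca_2d([function_word_profile(t) for t in texts])
# Each row of points is one text's (x, y) position, ready for a scatter plot.
```
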
00:27:41.800 --> 00:27:48.040
One of the things that I've looked at is shared n-grams between different texts. Where do

00:27:48.040 --> 00:27:55.040
texts use the same consecutive sequences of words? These are the 5-grams shared between the Silmarillion

00:27:55.760 --> 00:27:59.720
and Lord of the Rings, which is not particularly interesting because when you have small n-grams

00:27:59.720 --> 00:28:05.600
it tends to be more indicative of the turns of phrase and style of writing rather than

00:28:05.600 --> 00:28:10.960
content. As you get bigger you start to get more sort of content oriented stuff. So this

00:28:10.960 --> 00:28:16.920
is the 7-grams shared between the Silmarillion and Lord of the Rings and so it actually matches

00:28:16.920 --> 00:28:22.120
up passages in one book with passages that are talking about the same topic in the

00:28:22.120 --> 00:28:29.120
other. I also did this with the Peter Jackson films, comparing the books to the Peter Jackson subtitles

00:28:29.120 --> 00:28:36.000
and so you can sort of see which things from the book got used in the dialogue.

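NOTE
A minimal sketch of the shared n-gram comparison described above, assuming simple
lowercase word tokenisation. The two sample sentences are invented placeholders, not
quotations, and the function names are hypothetical rather than the project's own.
```python
import re
def ngrams(text, n):
    """Set of all runs of n consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
def shared_ngrams(text_a, text_b, n):
    """N-grams that occur in both texts."""
    return ngrams(text_a, n) & ngrams(text_b, n)
text_a = "in the beginning there was a great music and the world was made"
text_b = "they sang that in the beginning there was a great music before the world"
# Short n-grams mostly catch common turns of phrase; longer ones point to shared content.
shared_5 = shared_ngrams(text_a, text_b, 5)
shared_7 = shared_ngrams(text_a, text_b, 7)
```
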
00:28:36.000 --> 00:28:40.280
We haven't just been digitising the latest editions of the books but also different versions

00:28:40.280 --> 00:28:47.280
of the books and so we've done a lot to try to get a sense of how texts have changed,

00:28:47.880 --> 00:28:53.520
sometimes across different versions. These are two different versions of a long poem that we

00:28:53.520 --> 00:29:00.520
visualised the common elements of and so there's a lot of stuff we've been doing there.

00:29:01.560 --> 00:29:06.600
I'll just quickly go through a few more things. One of the things that I'm interested in is

00:29:06.600 --> 00:29:12.560
the whole relationship between sort of the narrative, where you are in the text and where

00:29:12.560 --> 00:29:19.560
you are in story time. So this was a very simple plot that I did just at a chapter level

00:29:19.960 --> 00:29:25.160
trying to show what day a particular chapter is taking place on in Lord of the Rings and

00:29:25.160 --> 00:29:28.960
so you get a really good sense of the interlacing that Tolkien does with time. But this is

00:29:28.960 --> 00:29:33.680
only at a chapter level. I wanted to get much more fine grained and so we've been doing

00:29:33.680 --> 00:29:40.640
this project this year to basically annotate every single paragraph of Lord of the Rings

00:29:40.640 --> 00:29:47.640
with what characters are present, what location they are in and what time it is. And so lots

00:29:48.120 --> 00:29:55.120
of interesting stuff there. I built annotation software specifically for this purpose and

00:29:56.680 --> 00:29:59.720
some of the results that come out of this already are these kinds of visualisations

00:29:59.720 --> 00:30:05.920
of exactly when characters are in a particular paragraph, when they're speaking, also when

00:30:05.920 --> 00:30:12.040
the main narrative line is in a particular location.

00:30:12.040 --> 00:30:17.800
So overall, I'm just finishing up now, we're doing a lot of different things. This is a

00:30:17.800 --> 00:30:23.000
diagram that I drew about a year ago to kind of show all of the various sub-projects within

00:30:23.000 --> 00:30:28.320
the Digital Tolkien project. I've talked about a lot of these but we're also doing stuff

00:30:28.320 --> 00:30:35.320
with maps and timelines and genealogies and a catalogue of verse which is particularly

00:30:36.160 --> 00:30:42.680
significant now because just a couple of months ago the Collected Poems, three volumes of collected

00:30:42.680 --> 00:30:47.720
poems of Tolkien came out. And so we've got digital versions of those and we're building

00:30:47.720 --> 00:30:54.320
a database so that you can query it based on different poetic forms, rhyming schemes,

00:30:54.320 --> 00:30:58.280
metrical schemes and stuff like that. And I won't say much about linked open data because

00:30:58.280 --> 00:31:04.420
I know Christoph will, but we're trying to bring a lot of this together and interoperate better

00:31:04.460 --> 00:31:11.460
with other sources of data about Tolkien through linked open data, both in terms of relationships

00:31:12.180 --> 00:31:17.260
within the fictional world, between people and places and so on and where in the text

00:31:17.260 --> 00:31:21.880
they're mentioned, but also the real world. So for example, annotating here that this

00:31:21.880 --> 00:31:26.500
particular paragraph was discussed in a letter that J.R.R. Tolkien wrote to W.H. Auden on a

00:31:26.500 --> 00:31:33.500
particular date. And there's a bunch of stuff that we're working on in this linked open

00:31:34.020 --> 00:31:41.020
data framework. So lots of fun stuff going on and it's very much a volunteer effort.

00:31:43.740 --> 00:31:49.780
And we have a wonderfully active Discord where we talk about all this sort of stuff and work

00:31:49.780 --> 00:31:54.500
on various things. We have people that are experts on just one tiny thing like we've

00:31:54.500 --> 00:32:01.500
got a former Olympic archer who is interested in annotating any references to archery terms

00:32:01.660 --> 00:32:07.340
and stuff. So she's working with us. And another one's an astronomer who is going through

00:32:07.340 --> 00:32:11.900
and annotating all the astronomical references and so on. So there's something for everyone

00:32:11.900 --> 00:32:16.700
if you'd like to join, and you can find out more about the project at digitaltolkien.com.

00:32:16.700 --> 00:32:23.700
And dead on time, I will end it there.

