WEBVTT

00:00:00.000 --> 00:00:09.000
Thanks a lot.

00:00:09.000 --> 00:00:11.160
Thanks for the invitation and the occasion to be here.

00:00:11.160 --> 00:00:19.800
It's really wonderful to witness the start of digital humanities in earnest at FAU.

00:00:19.800 --> 00:00:20.800
Very happy to be here.

00:00:20.800 --> 00:00:29.160
And, yes, I will talk about linked open data for literary history and how linked open data

00:00:29.160 --> 00:00:35.280
may help us think differently about literary history, but also about how wanting to do

00:00:35.280 --> 00:00:40.000
literary history differently helps us think about linked open data as well.

00:00:40.000 --> 00:00:46.200
You can follow along with the slides if you'd like using the link shown here and at the

00:00:46.200 --> 00:00:49.640
bottom of the slide.

00:00:49.640 --> 00:00:51.560
So this is what I would like to cover.

00:00:51.560 --> 00:00:55.200
First of all, I will talk a little bit about text and data mining or machine learning,

00:00:55.200 --> 00:00:59.240
linked open data, and literary history as separate bits and pieces.

00:00:59.240 --> 00:01:04.880
And what I then will try to show is how we brought these three pieces together in a project

00:01:04.880 --> 00:01:08.720
that ran until last year, mining and modeling text.

00:01:08.720 --> 00:01:12.920
And in "mining and modeling", you already have the text and data mining and the linked open

00:01:12.920 --> 00:01:15.200
data modeling in there.

00:01:15.200 --> 00:01:20.040
I will just show one example of what kinds of questions you can address with this kind of data.

00:01:20.520 --> 00:01:26.920
But there's a huge variety of data in the resulting database.

00:01:26.920 --> 00:01:29.600
So let's get right into it.

00:01:29.600 --> 00:01:35.160
Text and data mining plus LOD plus literary history equals what?

00:01:35.160 --> 00:01:37.400
But first of all, a little bit of background.

00:01:37.400 --> 00:01:42.360
I think that you could say that there are three modes of data and digital humanities.

00:01:42.360 --> 00:01:49.480
Qualitative digital humanities, where data sets are typically small, carefully curated,

00:01:49.560 --> 00:01:54.840
from very specific domains, and often heavily annotated.

00:01:54.840 --> 00:02:00.040
And this could be called smart data, because there are many intelligent, nuanced ways that

00:02:00.040 --> 00:02:01.640
we can interact with them.

00:02:01.640 --> 00:02:06.040
And the prototype of this kind of data is digital scholarly editions, especially genetic

00:02:06.040 --> 00:02:11.520
editions or critical editions, where every comma is marked up in every different variant.

00:02:11.520 --> 00:02:16.560
And so the prototypical example for me is the Faust edition that really pushed us to

00:02:16.560 --> 00:02:19.200
the limits, I think.

00:02:19.240 --> 00:02:24.240
And then there's quantitative digital humanities, where data sets are typically large, more

00:02:24.240 --> 00:02:28.880
or less unannotated, or only lightly annotated.

00:02:28.880 --> 00:02:33.600
They may have errors, they may have all kinds of biases, but they're big and generic, so

00:02:33.600 --> 00:02:35.120
they have their uses.

00:02:35.120 --> 00:02:38.680
And a typical way of doing this kind of digital humanities would be, for example, doing a

00:02:38.680 --> 00:02:45.720
topic modeling on Project Gutenberg, the whole thing, with all its strange biases and lacking

00:02:45.720 --> 00:02:48.440
metadata.

00:02:48.440 --> 00:02:52.480
But I think there's a third way for digital humanities, which is bigger, smarter data

00:02:52.480 --> 00:02:53.680
in the humanities.

00:02:53.680 --> 00:02:57.040
And I don't mean this as a compromise, you know, a little bit bigger, but not quite as

00:02:57.040 --> 00:02:58.040
well annotated.

00:02:58.040 --> 00:03:03.320
I really think we can bring together scale and nuance or detail.

00:03:03.320 --> 00:03:07.880
And one approach to do that is to use text mining, machine learning for annotation, for

00:03:07.880 --> 00:03:12.240
information extraction, and to model everything in linked open data, so that you really have

00:03:12.240 --> 00:03:18.400
richly contextualized knowledge when you perform your analysis.

00:03:19.400 --> 00:03:25.080
And one attempt to do this is the Mining and Modeling Text project.

00:03:25.080 --> 00:03:26.520
So a little bit of background.

00:03:26.520 --> 00:03:27.520
What is text and data mining?

00:03:27.520 --> 00:03:29.200
What is machine learning?

00:03:29.200 --> 00:03:34.320
Basically what we're trying to do with machine learning is to discover relations between

00:03:34.320 --> 00:03:39.080
surface features that we can easily recognize and more interesting features that are not

00:03:39.080 --> 00:03:44.000
so easily recognizable, but that we as humans understand, but that the machine needs to

00:03:44.000 --> 00:03:45.840
sort of learn to understand.

00:03:45.840 --> 00:03:52.120
And so we manually label training data that has these more interesting features,

00:03:52.120 --> 00:03:57.240
let's say direct speech or indirect speech or locations, narrative locations.

00:03:57.240 --> 00:04:04.040
And we teach the machine learning algorithm to recognize the relationship between the

00:04:04.040 --> 00:04:08.920
surface features that the machine can see and the deeper, more interesting features

00:04:08.920 --> 00:04:11.280
that we are interested in.

00:04:11.280 --> 00:04:15.040
And once the machine has learned that relationship, we can use it to annotate larger amounts of

00:04:15.040 --> 00:04:17.080
data than what we could usually do.

00:04:17.080 --> 00:04:22.320
So it's a way of getting richer information about text than the computer can do in a first

00:04:22.320 --> 00:04:23.600
pass, let's say.

00:04:23.600 --> 00:04:29.520
And this can obviously be refined bit by bit to lead to more interesting labels.

00:04:29.520 --> 00:04:33.680
And linguistic annotation, which is what spaCy does, follows this paradigm.

00:04:33.680 --> 00:04:39.880
But the idea is to scale it up to more literary features as well.
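To make this paradigm concrete, here is a minimal, purely illustrative sketch: the sentences, the labels, and the single surface feature (presence of quotation marks) are all invented for this example, standing in for the many features a real pipeline like spaCy would learn from.

```python
# Toy version of the supervised-annotation paradigm described above:
# learn the relation between a visible surface feature and a label
# that a human assigned, then apply it to new text.
from collections import Counter

# Manually labeled training data: (sentence, label).
train = [
    ('"Come here," she said.', "direct_speech"),
    ('"I will not go," he replied.', "direct_speech"),
    ("The carriage rolled through Paris.", "narration"),
    ("Night fell over the estate.", "narration"),
]

def surface_feature(sentence):
    # The surface feature the machine can "see":
    # does the sentence contain quotation marks?
    return '"' in sentence

# "Training": count which label co-occurs with each feature value.
counts = {True: Counter(), False: Counter()}
for sentence, label in train:
    counts[surface_feature(sentence)][label] += 1

def predict(sentence):
    # Predict the majority label for the observed feature value.
    return counts[surface_feature(sentence)].most_common(1)[0][0]

print(predict('"Farewell," whispered the marquise.'))   # direct_speech
print(predict("The letter arrived the next morning."))  # narration
```

Once such a relationship is learned, it can be applied to far more text than could ever be labeled by hand, which is the whole point of the paradigm.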

00:04:39.880 --> 00:04:42.860
Second part of the equation, what is literary history?

00:04:42.860 --> 00:04:46.620
And it's sort of obvious in a way, but at the same time, it's worth stating it.

00:04:46.620 --> 00:04:52.660
So the goal of literary history is to collect and document knowledge about literature over

00:04:52.660 --> 00:04:59.740
time, and to provide explanations of how literary systems change over time.

00:04:59.740 --> 00:05:05.100
And sometimes the biological metaphor of evolution is used for this and has its pros and cons

00:05:05.100 --> 00:05:06.180
maybe.

00:05:06.180 --> 00:05:11.900
But literary history also aims to contextualize literature, as we heard from Anastasia:

00:05:12.620 --> 00:05:17.140
the conditions of production could be studied, for example, or the reception

00:05:17.140 --> 00:05:21.700
of literature could be studied as well, as part of all that literary history can be.

00:05:21.700 --> 00:05:25.300
And then there are some organizational principles of literary history: literary history

00:05:25.300 --> 00:05:29.100
is often organized into national literatures.

00:05:29.100 --> 00:05:33.180
Of course, there are also histories of world literature that transcend this, but

00:05:33.180 --> 00:05:35.300
it's often done that way.

00:05:35.300 --> 00:05:41.460
Period, movements, currents, genres, these are like big organizational units.

00:05:41.460 --> 00:05:45.780
And then of course authors and works with their themes, with the forms, with the characters,

00:05:45.780 --> 00:05:47.460
the plot, etc.

00:05:47.460 --> 00:05:52.180
And then it's typical to look at similarities and differences, so compare things and to

00:05:52.180 --> 00:05:53.820
look at how things change over time.

00:05:53.820 --> 00:05:57.500
So basically that's the scope of what we want to try to do.

00:05:57.500 --> 00:06:02.420
And this is a bit of a caricature, and the field has developed and moved on beyond these basics,

00:06:02.420 --> 00:06:05.420
but that's a good starting point, I guess.

00:06:05.420 --> 00:06:07.900
And then the third element is linked open data.

00:06:07.900 --> 00:06:11.300
And this is just a very simple illustration of what this can look like.

00:06:11.300 --> 00:06:15.900
So this is a page from Wikidata.

00:06:15.900 --> 00:06:24.300
And you can see that there's Han Kang, the South Korean Nobel Prize winner of this year,

00:06:24.300 --> 00:06:28.940
who is the subject of this entry, and so also the subject of all the statements here.

00:06:28.940 --> 00:06:32.580
There's an identifier, a numerical identifier, so the name is just a label and it could be

00:06:32.580 --> 00:06:33.580
in different languages.

00:06:33.580 --> 00:06:37.140
There's a little description, and then there are the statements.

00:06:37.140 --> 00:06:40.020
And they consist of the subject, which is always the same here.

00:06:40.020 --> 00:06:45.180
Then the predicate, like a verb, in this case "instance of", and then the object, "human",

00:06:45.180 --> 00:06:46.340
which is the value.

00:06:46.340 --> 00:06:51.540
So all of linked open data is basically constructed out of these little triples, we call them,

00:06:51.540 --> 00:06:53.940
because they have three elements.

00:06:53.940 --> 00:06:58.900
And linked open data is basically a huge collection of these very, very simple statements.
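As an illustration of this principle, here is a tiny triple store sketched in Python. The identifiers and labels are invented for this example (they are not the real Wikidata Q-numbers), but the structure is the same: everything is a subject-predicate-object triple, and human-readable labels live apart from the identifiers so they can exist in several languages.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("Q_HanKang", "instance_of", "Q_human"),
    ("Q_HanKang", "occupation", "Q_novelist"),
    ("Q_HanKang", "award_received", "Q_NobelPrizeLiterature"),
    ("Q_TheVegetarian", "author", "Q_HanKang"),
}

# Labels are separate from identifiers, so one item can carry
# labels in multiple languages at once.
labels = {"Q_HanKang": {"en": "Han Kang", "ko": "한강"}}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return {
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    }

# Everything stated about Han Kang as a subject:
for t in sorted(query(subject="Q_HanKang")):
    print(t)
```

The power comes from combination: a handful of very simple statement shapes, endlessly linked together, just like the few pieces and rules of chess.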

00:06:58.900 --> 00:07:01.060
And I like to compare this to chess.

00:07:01.060 --> 00:07:06.900
In chess you have just a few kinds of pieces, some very simple rules, and a very limited

00:07:06.900 --> 00:07:08.760
board.

00:07:08.760 --> 00:07:13.200
But the combination of these makes it very, very powerful and very, very complex.

00:07:13.200 --> 00:07:17.860
And so linked open data to my mind is a bit similar in that principle.

00:07:17.860 --> 00:07:22.800
So what we wanted to do in the mining and modeling text project is to bring these three

00:07:22.800 --> 00:07:27.880
things together and so use machine learning to extract information from text, model it

00:07:27.880 --> 00:07:33.940
in linked open data, and use that for literary history, basically.

00:07:33.960 --> 00:07:41.640
So that means we collect information like the one that is at the bottom here, the second

00:07:41.640 --> 00:07:42.900
part.

00:07:42.900 --> 00:07:48.240
For example, bibliographic data, a person is the author of a given work, or content

00:07:48.240 --> 00:07:53.460
related data, like a work is about a given theme, or formal information, like a work

00:07:53.460 --> 00:07:55.860
has a specific narrative form.

00:07:55.860 --> 00:07:57.000
All kinds of statements like this.

00:07:57.000 --> 00:08:00.880
So these are like the patterns or the building blocks, the abstract patterns of the kind

00:08:00.880 --> 00:08:04.520
of knowledge that we're looking for.

00:08:04.520 --> 00:08:12.100
And so basically what this leads to, what we wanted to do is build a Wikidata for literary

00:08:12.100 --> 00:08:13.100
history.

00:08:13.100 --> 00:08:17.260
And so we actually set this up in our own Wikibase instance, so we have our own little Wikidata

00:08:17.260 --> 00:08:21.640
just for literary history, and actually just for the French novel of the second half of the

00:08:21.640 --> 00:08:24.680
18th century as a pilot.

00:08:24.680 --> 00:08:29.480
And what this implies, this approach with linked open data, you could call it an atomization

00:08:29.480 --> 00:08:30.820
of literary history.

00:08:30.820 --> 00:08:35.740
So we're breaking things down to the smallest elements, like subjects, predicates, objects,

00:08:35.740 --> 00:08:36.980
and a lot of those.

00:08:36.980 --> 00:08:42.040
And so we're sort of rejecting the grand narrative approach to literary history, where you have

00:08:42.040 --> 00:08:48.380
some idea that literature becomes, I don't know, the novel becomes the genre of the bourgeoisie

00:08:48.380 --> 00:08:50.380
and then you want to show that.

00:08:50.380 --> 00:08:53.020
You have a thesis and then you show how that happens.

00:08:53.020 --> 00:08:55.300
And so we say, no, we don't have a grand narrative.

00:08:55.300 --> 00:09:00.300
We go break everything down to smaller pieces, but then we reassemble this in a model and

00:09:00.300 --> 00:09:04.860
try to find patterns and sort of build literary history from the bottom up again.

00:09:04.860 --> 00:09:08.580
And we'll see how that differs from other approaches.

00:09:08.580 --> 00:09:12.160
And in doing so, we follow some key values.

00:09:12.160 --> 00:09:13.720
This approach is networked.

00:09:13.720 --> 00:09:18.780
So all the pieces of information are linked to each other, because one author wrote multiple texts,

00:09:18.780 --> 00:09:21.060
one theme is used by multiple texts, et cetera.

00:09:21.060 --> 00:09:25.380
So it creates a network, but also networked outside of our Wikibase.

00:09:25.380 --> 00:09:30.220
So our Wikibase is linked to other Wikibases or to Wikidata, to the French National Library,

00:09:30.860 --> 00:09:34.900
and so it is part of a networked system of networks in a way.

00:09:34.900 --> 00:09:36.340
And that's the semantic web basically.

00:09:36.340 --> 00:09:38.620
So we're trying to be part of that.

00:09:38.620 --> 00:09:40.900
We try to be very open about the data.

00:09:40.900 --> 00:09:46.420
So everything is online, everything is freely available, all the analyses are available,

00:09:46.420 --> 00:09:51.460
all the code, all the data, and everything is linked from that Wikibase.

00:09:51.460 --> 00:09:56.100
It's also collaborative necessarily because doing this requires quite a lot of different

00:09:56.100 --> 00:09:59.620
competencies that usually no single person has.

00:09:59.620 --> 00:10:03.620
And it's also collective in the sense that we rely on other existing data sets

00:10:03.620 --> 00:10:05.380
that we use.

00:10:05.380 --> 00:10:09.780
And it's multilingual, which I like because I think we live in a multilingual world.

00:10:09.780 --> 00:10:14.860
And because we have this distinction between the identifier and the label, the identifier

00:10:14.860 --> 00:10:20.020
stays the same, the numerical one, but the label can be adjusted to any language and

00:10:20.020 --> 00:10:23.860
you can have multiple labels and the system is set up for this.

00:10:23.860 --> 00:10:26.260
So we like this about it.

00:10:26.260 --> 00:10:32.380
But let's see how this went really specifically in the Mining and Modeling Text project.

00:10:32.380 --> 00:10:36.140
So this is just a very simple overview of the project.

00:10:36.140 --> 00:10:40.500
Basically we had three different kinds of sources that we wanted to use and bring together.

00:10:40.500 --> 00:10:43.140
The first of them is bibliographic metadata.

00:10:43.140 --> 00:10:47.980
So just which author wrote which novel, published in what year, things like that.

00:10:47.980 --> 00:10:53.100
And there is a bibliography of that, of our domain, the French 18th century novel that

00:10:53.180 --> 00:10:55.100
we were able to use.

00:10:55.100 --> 00:10:57.820
And I will show it in a second.

00:10:57.820 --> 00:11:01.460
So this gives us the basis, like all the works and authors.

00:11:01.460 --> 00:11:06.600
Then we also used primary texts, so novels in this case, and I will get to that as well,

00:11:06.600 --> 00:11:07.840
to derive features from them.

00:11:07.840 --> 00:11:14.500
So we did analyses on them and derived information from the novels themselves in a way.

00:11:14.500 --> 00:11:18.120
And the third source of information is works of literary history.

00:11:18.160 --> 00:11:23.600
So chapters in overview works about 18th century French literature; for example, there is usually

00:11:23.600 --> 00:11:25.100
a chapter on the novel.

00:11:25.100 --> 00:11:28.800
And so this is another corpus that we used to derive information because obviously they

00:11:28.800 --> 00:11:34.660
will talk about authors and works of that period in such chapters.

00:11:34.660 --> 00:11:40.400
And then bring all that information together in one network of information, basically the

00:11:40.400 --> 00:11:43.000
knowledge network up there.

00:11:43.000 --> 00:11:47.120
So just to illustrate this a little bit, this is from the Bibliographie du genre romanesque français.

00:11:48.120 --> 00:11:51.840
So what you see here is one typical entry for one work.

00:11:51.840 --> 00:11:57.280
You see up there the author name Voltaire and then the title Candide ou l'Optimisme, and

00:11:57.280 --> 00:12:00.080
then all kinds of information and pretty rich information.

00:12:00.080 --> 00:12:02.520
This is not just bibliographic information.

00:12:02.520 --> 00:12:06.920
There are different editions, there are reviews, and something that we were particularly interested

00:12:06.920 --> 00:12:09.120
in because we're not doing book history.

00:12:09.120 --> 00:12:13.840
There's another project using this bibliography for book history, but we're doing literary

00:12:13.840 --> 00:12:14.840
history.

00:12:15.560 --> 00:12:18.880
We're interested in this, what you see here in italics, in the middle of the screen.

00:12:18.880 --> 00:12:22.440
So it gives us information about narrative perspective, troisième personne, third-person

00:12:22.440 --> 00:12:23.440
narrative.

00:12:23.440 --> 00:12:28.760
It gives us information about the narrative locations, Europe and America.

00:12:28.760 --> 00:12:33.520
It gives us the protagonist names, a little bit something about the plot elements like

00:12:33.520 --> 00:12:43.560
voyage, désastres, and also about the themes and even the tone, like the satirical tone there

00:12:43.560 --> 00:12:44.560
at the end.

00:12:45.280 --> 00:12:50.360
We digitized the book, we did OCR on it, and then we trained a machine learning classifier

00:12:50.360 --> 00:12:54.240
to pick out these bits and pieces and get them into structured information.

00:12:54.240 --> 00:12:59.520
And that was the basis for the database because that gave us anchors, authors and works, and

00:12:59.520 --> 00:13:05.800
any additional information from the other two sources we would connect to those.

00:13:05.800 --> 00:13:07.480
And it's an almost complete bibliography.

00:13:07.480 --> 00:13:12.960
We will get to the point with the almost because this was done in the 70s and they missed out

00:13:12.960 --> 00:13:16.040
on a few things.

00:13:16.040 --> 00:13:20.320
So the second source of information is the corpus of novels.

00:13:20.320 --> 00:13:25.200
And so we purpose built a corpus for this because not a lot of novels are digitized

00:13:25.200 --> 00:13:26.200
for this period.

00:13:26.200 --> 00:13:30.680
We even had to train an OCR model for this.

00:13:30.680 --> 00:13:35.440
But we ended up with 200 novels encoded in very simple but very clean XML

00:13:35.440 --> 00:13:38.420
TEI with some metadata.

00:13:38.420 --> 00:13:40.080
And then we did analysis on this.

00:13:40.200 --> 00:13:43.960
For example, topic modeling to know something about the themes of these novels, named entity

00:13:43.960 --> 00:13:48.520
recognition to know about the protagonists and the narrative locations, like which cities

00:13:48.520 --> 00:13:54.840
are mentioned in the text, and stylometry to know how similar is this novel stylistically

00:13:54.840 --> 00:13:56.280
to all the other novels.

00:13:56.280 --> 00:13:58.160
And so we have data on that as well.

00:13:58.160 --> 00:14:02.680
The corpus was published and described by Julia Röttgermann, and you

00:14:02.680 --> 00:14:05.400
can look it up there.

00:14:05.400 --> 00:14:08.560
And then the third source of information is the scholarly literature.

00:14:09.040 --> 00:14:13.240
And here I show an annotation interface for this.

00:14:13.240 --> 00:14:17.960
So this tool, which we used to annotate things, is called INCEpTION.

00:14:17.960 --> 00:14:24.360
And so student helpers annotated these chapters following annotation

00:14:24.360 --> 00:14:29.840
guidelines, especially for information on genre and theme of the novels.

00:14:29.840 --> 00:14:33.360
And one of the nice things about INCEpTION is that it's already part of this linked open

00:14:33.360 --> 00:14:38.440
data ecosystem, because you can connect it to any Wikibase instance, our own instance

00:14:39.320 --> 00:14:43.360
or Wikidata itself, and use it to disambiguate your annotations.

00:14:43.360 --> 00:14:48.200
And you see here, maybe you can see it, there's the bottom one, the yellow one at the bottom,

00:14:48.200 --> 00:14:54.920
it says "Candide", but is it Candide the character in the book, or is it Candide the book?

00:14:54.920 --> 00:14:58.440
So if you look at the context, you will see that it is the book.

00:14:58.440 --> 00:15:03.320
And the annotators can make that explicit by checking the right box here on the right,

00:15:03.320 --> 00:15:06.640
Candide the book, not Candide the character in the book.

00:15:06.840 --> 00:15:11.800
And so, because the system knows these are the categories, and these come

00:15:11.800 --> 00:15:16.720
up automatically when you mark "Candide", you can simply disambiguate this.

00:15:16.720 --> 00:15:20.120
So this gave us a lot of interesting information already, and we wanted to use this for machine

00:15:20.120 --> 00:15:21.980
learning to scale this up.

00:15:21.980 --> 00:15:23.440
That turned out to be really hard.

00:15:23.440 --> 00:15:26.080
So within the project time, we didn't really get to that step.

00:15:26.080 --> 00:15:31.040
So this is the exception where we didn't just use automatically generated data, but also

00:15:31.040 --> 00:15:33.360
manually generated data.

00:15:33.360 --> 00:15:37.440
To bring this all together, we needed to model it in linked open data.

00:15:37.440 --> 00:15:41.600
And so we started developing a data model for literary history, which is a very complicated

00:15:41.600 --> 00:15:44.600
domain and nobody can really agree on anything.

00:15:44.600 --> 00:15:48.240
But we just said, okay, we need to make a proposal for some fundamental things.

00:15:48.240 --> 00:15:52.640
And actually, nobody can even agree on what is fundamental.

00:15:52.640 --> 00:15:55.840
What is the kind of thing that you want to know about literary history?

00:15:55.840 --> 00:16:01.240
But we just went ahead and tried to find some interesting information.

00:16:01.240 --> 00:16:05.880
So there are modules, it's modular, so you can take out bits and pieces without using

00:16:05.880 --> 00:16:08.840
the whole data model.

00:16:08.840 --> 00:16:14.040
But there are modules on theme, on space, on narrative form, on works, on authors, on

00:16:14.040 --> 00:16:19.760
mapping things, and mapping not in the sense of geographical mapping, but mapping identifiers

00:16:19.760 --> 00:16:23.240
to other resources, and many more, many more things.

00:16:23.240 --> 00:16:29.360
And I will just show you as an example, and don't be scared when you see this, just one

00:16:29.360 --> 00:16:31.160
visualization of one of the modules.

00:16:32.080 --> 00:16:35.800
And there's a lot going on here, but we can just focus on some of the central things.

00:16:35.800 --> 00:16:43.040
So there's a node labeled about, and it has on the top left, it has another node that

00:16:43.040 --> 00:16:44.520
says literary work.

00:16:44.520 --> 00:16:48.000
And on the right, it says thematic concept down there.

00:16:48.000 --> 00:16:54.520
So basically, this is about expressing statements where a work is about a theme.

00:16:54.520 --> 00:16:56.040
Now that sounds super simple.

00:16:56.040 --> 00:16:58.240
Why is it such a complicated image?

00:16:58.320 --> 00:17:04.800
That's because in addition to having these statements, we also have two more elements.

00:17:04.800 --> 00:17:08.520
One is, for example, for theme, we have a thematic vocabulary.

00:17:08.520 --> 00:17:12.560
So a controlled vocabulary, a bit like an ontology, but a bit simpler than an ontology,

00:17:12.560 --> 00:17:15.760
that basically says these are the possible themes.

00:17:15.760 --> 00:17:21.120
And we can have multilingual labels, and we can map topics or annotations to these themes.

00:17:21.120 --> 00:17:25.400
And that's the part on the right where we have this thematic vocabulary, and with the

00:17:25.400 --> 00:17:27.220
labels, etc.

00:17:27.220 --> 00:17:31.700
So any annotation is mapped to this thematic vocabulary.

00:17:31.700 --> 00:17:36.820
And the other thing is that every statement that says this work is about this theme has

00:17:36.820 --> 00:17:37.860
a source.

00:17:37.860 --> 00:17:40.900
And that's the part here on the top with stated in.

00:17:40.900 --> 00:17:45.140
So anything that has a source, it could come from the bibliography, it could come from

00:17:45.140 --> 00:17:48.340
topic modeling, it could come from the manual annotations.

00:17:48.340 --> 00:17:51.820
So we make this very transparent, and it allows us also to compare information from different

00:17:51.820 --> 00:17:55.340
sources on the same topic.

00:17:55.460 --> 00:18:02.540
Now, the third important feature of this database is that it's linked also to external databases.

00:18:02.540 --> 00:18:05.940
And this is the illustration of how it's linked to Wikidata.

00:18:05.940 --> 00:18:07.260
And it's linked both ways.

00:18:07.260 --> 00:18:08.500
That's the main important thing.

00:18:08.500 --> 00:18:12.460
So authors, works, and thematic concepts, for example, are linked.

00:18:12.460 --> 00:18:15.280
Also spatial concepts like cities are linked.

00:18:15.280 --> 00:18:20.540
And that means that any item like that in our database has Wikidata identifiers, so

00:18:20.540 --> 00:18:24.380
that we know what it is in Wikidata.

00:18:24.420 --> 00:18:27.100
That means we can pull additional information from Wikidata.

00:18:27.100 --> 00:18:29.980
For example, for cities, we don't encode geographical location.

00:18:29.980 --> 00:18:32.540
We just pull that from Wikidata.

00:18:32.540 --> 00:18:35.960
Because otherwise it would be redundant, it would be error prone, you would have to make

00:18:35.960 --> 00:18:37.700
corrections in multiple places.

00:18:37.700 --> 00:18:38.960
So that's one way.

00:18:38.960 --> 00:18:45.880
But we also set up a MiMoText ID and registered it with Wikidata, and also registered our Wikibase

00:18:45.880 --> 00:18:48.720
with Wikidata as part of the federation system.

00:18:48.720 --> 00:18:54.100
So you can find an author or novel in Wikidata, like the big one, the normal one, and discover

00:18:54.100 --> 00:18:55.940
that, oh, there's a MiMoText ID.

00:18:55.940 --> 00:18:56.940
What do they have?

00:18:56.940 --> 00:18:58.760
And you can pull in our information also from Wikidata.

00:18:58.760 --> 00:19:01.620
So the federation goes both ways, which is pretty cool.

00:19:01.620 --> 00:19:05.200
Also for people to discover our data, because nobody knows that we're doing this.

00:19:05.200 --> 00:19:08.500
But if you are interested in an author and there are very few statements on Wikidata

00:19:08.500 --> 00:19:13.640
and you see, oh, there's another project that has statements on this, this could be interesting.

00:19:13.640 --> 00:19:16.700
The result then is what we call the MiMoTextBase.

00:19:16.700 --> 00:19:21.820
It's on the one hand just like a Wikidata-style page or platform.

00:19:21.820 --> 00:19:24.840
So you can browse and search and click around.

00:19:24.840 --> 00:19:29.340
But more interestingly and more importantly, there's also a so-called SPARQL endpoint.

00:19:29.340 --> 00:19:35.140
So an API to the database where you can write all kinds of queries and ask all kinds of

00:19:35.140 --> 00:19:36.380
questions.

00:19:36.380 --> 00:19:41.940
And this is what I want to illustrate now with an example.

00:19:41.940 --> 00:19:45.340
So we talked about fanfiction and how it's sometimes juicy.

00:19:45.340 --> 00:19:48.300
And that's not a new invention.

00:19:48.300 --> 00:19:56.500
So there is a pretty rich tradition of soft pornographic, erotic, pretty rough pornographic

00:19:56.500 --> 00:19:58.700
literature in the 18th century.

00:19:58.700 --> 00:20:07.140
And there seems to be a sort of structural mapping or concordance with the epistolary form,

00:20:07.140 --> 00:20:08.640
so writing letters.

00:20:08.640 --> 00:20:11.260
There are epistolary novels in the 18th century.

00:20:11.260 --> 00:20:14.740
And they have all kinds of themes, but there seems to be a sort

00:20:14.740 --> 00:20:20.100
of way that libertinage, which covers all of the free thinking, whether philosophical

00:20:20.100 --> 00:20:26.500
or social, goes together well with letters.

00:20:26.500 --> 00:20:34.300
And so this is just an illustration from an early 20th century edition of

00:20:34.300 --> 00:20:41.020
one of the most famous libertine epistolary novels of the 18th century, The Dangerous

00:20:41.020 --> 00:20:42.020
Liaisons.

00:20:42.060 --> 00:20:47.580
You may have seen the movie with John Malkovich and Glenn Close, but the novel is actually

00:20:47.580 --> 00:20:48.620
amazing.

00:20:48.620 --> 00:20:53.140
And so this illustration comes from a later edition, where you see that

00:20:53.140 --> 00:20:58.260
libertinage and letter writing are very closely connected to each other because you basically

00:20:58.260 --> 00:21:00.380
do it at the same time.

00:21:00.380 --> 00:21:03.060
But anyways, a bit of research on this.

00:21:03.060 --> 00:21:09.900
So is there a libertine epistolary novel after 1782, which is the year of Dangerous Liaisons?

00:21:09.900 --> 00:21:10.900
One author said no.

00:21:10.980 --> 00:21:12.860
There's barely anything.

00:21:12.860 --> 00:21:16.740
The epistolary genre is hardly represented in the libertine novel after Laclos, whose name

00:21:16.740 --> 00:21:20.700
stands in for this novel here.

00:21:20.700 --> 00:21:25.700
And then another scholar actually wrote an article on the late libertine epistolary

00:21:25.700 --> 00:21:30.460
novel and speaks about eight different libertine epistolary novels after 1782.

00:21:30.460 --> 00:21:31.460
So what's going on?

00:21:31.460 --> 00:21:33.740
Does it exist or not?

00:21:33.740 --> 00:21:37.420
And on what does it depend?

00:21:37.420 --> 00:21:38.860
And this is what we wanted to investigate.

00:21:38.860 --> 00:21:41.660
So there seems to be this convergence, but when does it happen?

00:21:41.660 --> 00:21:42.820
How does it happen?

00:21:42.820 --> 00:21:48.740
And a database of all this rich metadata on the novel of that time seems to be well placed

00:21:48.740 --> 00:21:50.140
to do this.

00:21:50.140 --> 00:21:58.780
So we try to clarify, or I try to clarify as an example basically, what we can learn

00:21:58.780 --> 00:22:01.300
about this question from our database.

00:22:01.300 --> 00:22:07.180
So first of all, we can query the number of novels that we have in the database for this

00:22:07.260 --> 00:22:12.380
time period, 1782 to 1800, so roughly 20 years.

00:22:12.380 --> 00:22:16.700
And there are 647 novels in the database for that time.

00:22:16.700 --> 00:22:19.780
There are 2000 in total, so that kind of makes sense.

00:22:19.780 --> 00:22:24.620
It's a pretty productive end of the century in a way.

00:22:24.620 --> 00:22:28.020
Then we can also say, okay, but how many of those are epistolary novels?

00:22:28.020 --> 00:22:31.780
Because that's an annotation level that we have.

00:22:31.780 --> 00:22:36.780
And it boils down to 91 novels in query two.

00:22:37.380 --> 00:22:38.380
And what are their topics?

00:22:38.380 --> 00:22:39.380
And what are these about?

00:22:39.380 --> 00:22:40.820
They are candidates in a way.

00:22:40.820 --> 00:22:43.740
They could be about libertinage or not.

00:22:43.740 --> 00:22:48.140
And so this is one of the queries, the query results that we can get.

00:22:48.140 --> 00:22:52.420
And we can look into the queries together in the discussion if you would like to, but

00:22:52.420 --> 00:22:58.660
this is just a result, a visualization that shows that the epistolary novel is actually

00:22:58.660 --> 00:23:04.060
about sentiments, love, and happiness, about correspondence of course, because epistolary

00:23:04.140 --> 00:23:08.220
novels are always also about writing and reading and sending letters.

00:23:08.220 --> 00:23:09.220
It's like a bit self-reflective.

00:23:09.220 --> 00:23:15.020
Family, virtue, travel, so it's not libertine at all.

00:23:15.020 --> 00:23:19.460
Except maybe for, at the top left corner you see crime and passion.

00:23:19.460 --> 00:23:20.940
That could be something.

00:23:20.940 --> 00:23:24.900
But actually when we look into it, we see that this is something different.

00:23:24.900 --> 00:23:29.060
Anyway, so this is one type of information we can get.

00:23:29.060 --> 00:23:31.100
But let's dig a little bit further.

00:23:31.140 --> 00:23:36.380
So how many of those novels, so the epistolary novels of the right period, are actually about

00:23:36.380 --> 00:23:37.380
libertinage?

00:23:37.380 --> 00:23:39.580
It turns out, only one.

00:23:39.580 --> 00:23:44.820
So it breaks down very quickly to just one novel, and it's a novel by the Marquis de

00:23:44.820 --> 00:23:51.060
Sade called Aline et Valcour, which is a sort of compendium of the 18th century novel, one

00:23:51.060 --> 00:23:52.900
of my favorite novels of that time.

00:23:52.900 --> 00:23:55.900
Very long, very complicated, but it has everything in it.

00:23:55.900 --> 00:23:59.020
And it's also epistolary.

00:23:59.020 --> 00:24:02.620
And with the Marquis de Sade, you can imagine that it is also about libertinage.

00:24:02.620 --> 00:24:03.620
But there's only one.

00:24:03.620 --> 00:24:07.140
So how could Melançon write about eight of them?

00:24:07.140 --> 00:24:09.500
What happened here?

00:24:09.500 --> 00:24:12.460
So we'll look into that.

00:24:12.460 --> 00:24:17.180
There are some non-epistolary libertine novels at that time, 14 of them even.

00:24:17.180 --> 00:24:19.820
So apparently it's not like libertinage is dead.

00:24:19.820 --> 00:24:25.860
It's just that this combination seems to be gone, or might be gone.

00:24:26.020 --> 00:24:30.460
And for example, just to show what we can visualize, these libertine novels of that

00:24:30.460 --> 00:24:34.340
time period, we can show where they take place, where the action takes place.

00:24:34.340 --> 00:24:40.060
And this is, like, query six, I think, just the output.

00:24:40.060 --> 00:24:41.060
And it's a bit hard to see.

00:24:41.060 --> 00:24:42.060
And again, we can look into it.

00:24:42.060 --> 00:24:44.620
This is interactive in the database.

00:24:44.620 --> 00:24:46.780
Here it's just static.

00:24:46.780 --> 00:24:51.900
But we can see that it's set in different places of Europe, and one of the novels

00:24:51.900 --> 00:24:55.540
is set in Africa as well.

00:24:56.220 --> 00:25:00.860
In this case, this is like the result of a federated query, where we pull out the information

00:25:00.860 --> 00:25:06.500
on narrative location that we built, and combine it with the latitude and longitude

00:25:06.500 --> 00:25:12.300
information from Wikidata, and then visualize it in the system.

00:25:12.300 --> 00:25:14.340
But let's get back to that question.

00:25:14.340 --> 00:25:15.460
So what went on?

00:25:15.460 --> 00:25:19.580
Why is it so different for Melançon and for our database?

00:25:19.620 --> 00:25:26.260
And it turns out, so I looked at the candidates that Melançon wrote about, what are these

00:25:26.260 --> 00:25:27.260
novels?

00:25:27.260 --> 00:25:30.460
And so some of them are not marked as epistolary in our database.

00:25:30.460 --> 00:25:32.340
So it's a formal feature, you would think.

00:25:32.340 --> 00:25:33.340
It's simple.

00:25:33.340 --> 00:25:34.340
It's yes or no.

00:25:34.340 --> 00:25:36.020
But it's not, because they are mixed forms.

00:25:36.020 --> 00:25:38.640
So there's first-person narrative with some letters.

00:25:38.640 --> 00:25:41.340
And we didn't mark that as an epistolary novel.

00:25:41.340 --> 00:25:43.600
We just marked it as mixed.

00:25:43.600 --> 00:25:44.740
So it's not in there.

00:25:44.740 --> 00:25:46.340
But he said, yeah, there's epistolary.

00:25:46.500 --> 00:25:47.000
That's fine.

00:25:47.000 --> 00:25:50.460
So I will keep that in my corpus, basically.

00:25:50.460 --> 00:25:54.420
Some of them are not marked as libertine in our database.

00:25:54.420 --> 00:25:57.500
But he uses them as examples of libertine novels.

00:25:57.500 --> 00:26:03.780
And I think that's simply because libertinage is not an easy concept to cut into pieces.

00:26:03.780 --> 00:26:06.980
And we actually have an item, libertinage.

00:26:06.980 --> 00:26:10.460
But we also have related topics that are actually included.

00:26:10.460 --> 00:26:15.580
And so you can actually play with the breadth of the concept in a way, by saying, I will

00:26:15.580 --> 00:26:22.540
only look at novels that have libertinage as keyword, or also sexuality or eroticism

00:26:22.540 --> 00:26:24.700
or other things like that.

00:26:24.700 --> 00:26:27.720
So you can vary the precision.

00:26:27.720 --> 00:26:32.500
But even with a rather wide definition, we don't have as many novels marked with this

00:26:32.500 --> 00:26:33.500
topic.

00:26:33.500 --> 00:26:36.820
So it's a question of definition and of the sources.

00:26:36.820 --> 00:26:40.380
And then something that was a bit shocking to me, two novels that Melançon talks about

00:26:40.380 --> 00:26:42.260
are missing from our database.

00:26:42.260 --> 00:26:44.540
And that can happen.

00:26:44.540 --> 00:26:48.660
And it turns out these are two novels that were re-edited after the bibliography was

00:26:48.660 --> 00:26:49.820
published.

00:26:49.820 --> 00:26:52.400
So we probably should have added them in the meantime.

00:26:52.400 --> 00:26:56.100
But it makes sense that they didn't know about them yet.

00:26:56.100 --> 00:26:57.780
Although they have a lot of unedited novels as well.

00:26:57.780 --> 00:27:00.740
But somehow these must have been rediscovered later.

00:27:00.740 --> 00:27:04.420
Anyways, I will come to the conclusion now.

00:27:04.420 --> 00:27:08.100
So basically what we see is that we have multiple perspectives in the database.

00:27:08.100 --> 00:27:12.220
But Melançon has yet another perspective.

00:27:12.220 --> 00:27:13.220
So I conclude.

00:27:13.220 --> 00:27:14.220
Yes, wonderful.

00:27:14.900 --> 00:27:17.940
I think I will just about make it.

00:27:17.940 --> 00:27:21.940
So some challenges and some opportunities of this approach.

00:27:21.940 --> 00:27:27.040
One of the challenges is that there's necessarily a certain complexity reduction when you try

00:27:27.040 --> 00:27:32.620
to take literary history in all of its diversity and express it in triples.

00:27:32.620 --> 00:27:33.620
That's normal.

00:27:33.620 --> 00:27:34.620
But that's inherent in modeling.

00:27:34.620 --> 00:27:38.980
So there's a trade-off between this reduction in complexity and then what you can do with

00:27:38.980 --> 00:27:39.980
the data.

00:27:40.740 --> 00:27:45.700
The upsides of having the data in this shape really outweigh the limitations of the modeling.

00:27:45.700 --> 00:27:51.780
But we have seen that libertinage and even narrative form are not clear-cut categories.

00:27:51.780 --> 00:27:56.460
There's a lack of consensus, as I said, on relevant statements and how to model the domain.

00:27:56.460 --> 00:27:59.180
And that's just a challenge that's not going to go away.

00:27:59.180 --> 00:28:02.940
It's not something you can solve with technical solutions.

00:28:02.940 --> 00:28:05.740
Then there's this federation thing.

00:28:05.740 --> 00:28:09.660
The vision of linked open data and of semantic web is federation.

00:28:10.340 --> 00:28:12.700
It's not to have one central database, everything in Wikidata.

00:28:12.700 --> 00:28:14.980
It's these distributed resources.

00:28:14.980 --> 00:28:17.820
But it's actually really hard to get it to work.

00:28:17.820 --> 00:28:21.980
It took us a long time, and it was only towards the end of the project that we actually managed

00:28:21.980 --> 00:28:23.140
to do this.

00:28:23.140 --> 00:28:27.740
And then sustainability is a big challenge because who's going to take care of our data

00:28:27.740 --> 00:28:29.540
when the project has ended?

00:28:29.540 --> 00:28:33.660
So we have another project so we can sustain it, but it's not really a good solution.

00:28:33.660 --> 00:28:37.980
We could post everything on Wikidata and parts of the data we have pushed to Wikidata because

00:28:37.980 --> 00:28:40.260
we think that's too big to fail in a way.

00:28:40.260 --> 00:28:44.940
And of course we have exported the raw RDF data to Zenodo, but there you can't really

00:28:44.940 --> 00:28:50.660
do anything with it unless you analyze the RDF directly, which you can do, of course,

00:28:50.660 --> 00:28:52.900
but not everybody wants to do that.

00:28:52.900 --> 00:28:55.940
But there are, of course, also many opportunities.

00:28:55.940 --> 00:29:01.940
It's really a great way of linking these heterogeneous data from different types of sources and using

00:29:02.020 --> 00:29:08.220
semantic modeling to bridge differences in granularity, for example, in terminology,

00:29:08.220 --> 00:29:09.980
in language.

00:29:09.980 --> 00:29:14.820
It's a way to model differences of perspective on data because we have these sources, so

00:29:14.820 --> 00:29:19.500
we can have contradictory statements or statements of different granularity on the same thing.

00:29:19.500 --> 00:29:25.100
But that means we are modeling not facts, but perspectives or just statements, things

00:29:25.100 --> 00:29:27.620
we found.

00:29:27.660 --> 00:29:32.540
As somebody said at the conference on linked open data at the beginning of the week,

00:29:32.540 --> 00:29:36.300
history is never wrong.

00:29:36.300 --> 00:29:38.340
It's just what it is.

00:29:38.340 --> 00:29:41.540
There's transparency in knowledge production this way because everything is sourced

00:29:41.540 --> 00:29:45.900
and you can click through to the GitHub repository with the topic model and the corpus and the

00:29:45.900 --> 00:29:47.380
code and everything.

00:29:47.380 --> 00:29:51.980
It's multilingual, which I think is really important, and we can avoid redundancy, as

00:29:51.980 --> 00:29:55.700
I said, by reusing external resources as well.

00:29:55.780 --> 00:29:59.500
Because this is so wonderful, we want to continue doing this, and so we started a new project

00:29:59.500 --> 00:30:05.620
called Linked Open Data in the Humanities, LODinG, where we work on scholarly digital

00:30:05.620 --> 00:30:12.140
editions of correspondences, for example, which you can model as networks, on bibliographical

00:30:12.140 --> 00:30:15.340
data, on the digital humanities themselves.

00:30:15.340 --> 00:30:21.660
This is a project on the institutional history and institutional landscape of digital humanities,

00:30:21.740 --> 00:30:29.140
so you see that it's slightly biased towards Europe and the east coast of North America.

00:30:29.140 --> 00:30:35.180
Another project on bibliographical data and my favorite project on wine labels, such as

00:30:35.180 --> 00:30:40.940
these that we are digitizing and networking and annotating and mapping, of course, and

00:30:40.940 --> 00:30:43.140
showing.

00:30:43.140 --> 00:30:44.140
And that's it.

00:30:44.140 --> 00:30:44.900
Thanks for your attention.

