Monk - frequently asked questions

NWO/Catch - Scratch
ALICE/University of Groningen,Nationaal Archief/Den Haag

Welcome to the website of the Monk system for word retrieval in handwritten manuscript
collections. This is the corner of the system where we deal with 'word-zone' labeling.
Monk is a kind of Google system, intended for finding handwritten words in images (scans) of
historical documents. Monk must learn this trick and it will need
examples. An example is a typed piece of text, a label, associated with an image.
Here, the image will be a 'word zone' (but that may also be a single letter or character).
You can help by giving Monk examples. A minimum number of examples of a word is needed
before Monk can be trained to find it. Here, you can help!

Also see (in Dutch):

Publieke Monk trainer

Monk - FAQ - List of email messages and questions

Logging in
Crowd sourcing of medieval words
Labeling
What to do if a target word image contains other objects in the image?
In Sordex (word training): No more improvements, no new word examples found?
Blue versus Green background of word zone images in a hit list?
Roman numbers
The cog wheel: request for a 'Sordex' recomputation
Difference between word labeling and line transcription

Subject: Login

Dear annotator,

- You can participate in the public crowd sourcing via this site, the 'Monk trainer' (http://tinyurl.com/5w5nqaf)
  It will show random words from the manuscript collections that reside on the Monk server.

- You can also log in as an 'expert trainer' with a user name and password
  on: /cgi-bin/monkweb?cmd=login

- After this, for collections that are suitable, you can perform text line transcription on:

  /cgi-bin/monkweb?cmd=editxt&db=19320&ipage=1&edit=11&ntxt=-1

  By default, Monk shows the segmented text lines in their original color version.
  For the human eye this is a bit more informative than the black-and-white version
  which Monk uses internally for word search. You can change the rendering style in the
  line transcription tool under the button [Update rendering style]

- You can also train words (i.e., label word zone images) on:

  /cgi-bin/monkweb?cmd=TrainedWords&db=19320&trainedwordmethod=Sordex&annot=all&prefix=&sortopt=sorted_hitlistsize

  This shows a hit list in which a word, letter or text pattern can be clicked and labeled individually.
  In contrast to the crowd-source labels, expert labels will be entered into the continuous learning process of Monk.
  The beneficial effects of labeling are noticeable after about one day, on average.

This may all sound quite technical. In normal use you cannot do anything that is seriously 'wrong'.
Just click around and try to label some images. A drop-down menu will show what other users
have typed as word labels, after you entered the first letter.

Best regards,
Lambert Schomaker

Subject: crowd sourcing of medieval words

Dear annotator,
you pointed out that some of the crowd-sourced word labels are not appropriate.
This is not a real problem, because those labels are treated differently from
the expert labels, in Monk. You also note that the function words and articles
are not interesting at all for a Google of handwriting. However, computers
know very little about handwriting, today. Even the labels of menial words
like 'the', 'a', 'and' etc. are extremely useful. By machine learning the
instances of these words are agglomerated (put onto a big pile). A large
number of examples means that these words will be recognized correctly.
The residual words will often be more interesting. Therefore: all labels
are useful! In fact, it is even useful to provide labels that indicate that a particular image
contains no text but, e.g., a horizontal line. By using the prefix character @ as in @HORIZONTAL_LINE,
you can add such special image classes (categories, '@SPECIALs'). Monk can use these
examples in a similar way as in the case of the boring words, such that the irrelevant stuff
in the text scans can be put aside, knowingly.

Finally, of course the words that are rich in meaning are most
important. The more examples, the easier the system can find them. Sometimes
there are multiple words in an image. Please use the underscore _ instead
of a blank. Example: the_president.

Best regards,
Lambert Schomaker
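
The label conventions in this message (underscore instead of a blank, '@' as the prefix of a special class) can be illustrated with a minimal Python sketch. It is not Monk's actual code; the function names normalize_label and is_special are hypothetical and serve only as an illustration.

```python
def normalize_label(typed):
    """Join the typed words with single underscores, e.g. 'the president' -> 'the_president'.

    Case is left untouched, because word labels are case sensitive.
    """
    return "_".join(typed.split())


def is_special(label):
    """True for special image classes such as @HORIZONTAL_LINE."""
    return label.startswith("@")


print(normalize_label("the president"))
print(is_special("@HORIZONTAL_LINE"))
```

Leading and trailing blanks are dropped and runs of blanks collapse into one underscore, so a stray space typed by an annotator does not create a second, nearly identical class.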

Subject: Re: crowd harvest of medieval words

Dear Mr. Schomaker,

I see in the 'crowd harvest' that abbreviations are not expanded: under the
present-day 'en', one of the entries is 'en (with a stroke above the n)'. In
Middle Dutch that is an abbreviation for 'ende'.

Subject: Re: Re: crowd harvest of medieval words
Dear transcriber,
the annotation at the word level (i.e., not the line transcription)
will still show some features that may surprise you. After all,
the classes that are recorded each have a unique 'label'.
If 'ende' (written in full) and 'en@' (with a curl) were both labeled
as 'ende', this would lower the recognition rate: two shapes
would then be thrown into the same bin of
word shapes. Table 1 makes clear how
Monk looks at word images, word codes and text rendering:

Table 1. Monk makes a fundamental distinction between 'word image',
'word code' and 'word text'

Image                  Code                         Text

[word-shape class]  ↔  [unique Monk word label]  ↔  [preferred rendering for users]
(image)                @ende_en                     ende

There is a solution for this: special codes are preceded
by '@'. Thus 2 1/2 can be annotated as @2_en_een_half ('2 and a half'),
if the volunteers agree on this among themselves or follow each other's
example. The special diacritic form for 1/2 (which I cannot yet easily
enter on my system) is still not a standard.
The codes will differ between Apple, Microsoft, Linux,
word processors and the various internet browsers. That is why Monk allows
annotators to develop their own systematics, entirely in
ASCII codes, which will certainly remain parseable for another hundred years, in contrast
to the current encodings (UTF, Unicode, etc.), which are still under development.
As long as @35_en_een_kwart is unique, it can always be mapped
to an internal binary encoding in Unicode, UTF, iso_latin, or TEI (etc.).
For very many of the shapes that we encounter in the material there is
simply no internationally accepted code yet. Therefore Monk
(i.e., the community of Monk transcribers) sets its own standard.

I also see that you already apply a present-day normalization to the
capital letters in proper names and place names. For the line transcription
this is not a problem (it is mainly intended for human use).
For the word labeling, however, it is a problem! If the capital letter
is not actually written, it will not be possible in the future to
separate word-shape models with capitals in the collection
from word-shape models (images) without the capital.
At the word-shape level (and that is the level at which Monk 'thinks'), there is a
big difference between the image for [jansen] and [Jansen]. In principle,
one types what is written, not what one thinks is written or what one
presumes to be the norm. The Ship Journals (Scheepsjournalen) also show that the captains,
very inconveniently, do not write proper names with capitals.

[this problem was later solved with the screen: word-shape codification]
A comparable problem is the contraction Derich soin ==> Derichsoin.
This, too, is a big problem for the recognizer at the word level.
The contraction will have to be done on the basis of rules from the domain,
as post-processing. If there is a large space,
Derich_soin is a better label than 'Derichsoin', and
it is no problem to [...] the separate Derich
and soin [...]
By now we have a lot of experience with these different perspectives.
At a meeting of archivists it turned out that people understood
when I said: even if you label a word class with the code 'XJ765', as long as
you do so consistently, Monk can always have the rendering of a word on screen or printer
constructed from the rendering rules for 'XJ765'.
Such a conversion from a code to a specific text rendering is run-of-the-mill
computing, in contrast to the pattern recognition and image processing methods
of Monk.

The two disciplines look at this material in different ways:
for us, the pixels, the curls and the white spaces matter.
For you, it is usually important to get to the content, to smooth out
imperfections and to work out the diacritics carefully.
As more becomes known about a collection, the systematics of both worlds grow
somewhat towards each other, as we have noticed in our collaboration with archives and transcribers.

Kind regards,
Lambert Schomaker

Subject: labeling

Dear Lambert,

The MONK-system is becoming more clear to me. It looks nice but I
still have a number of questions.
Do I understand it right that the abbreviated words in the labeling
are prefixed with a '@'? For example, if oviu is written
but ovium was intended by the writer, then what should I
do? We have a system for such contractions in the humanities.
For instance, 'Gijsbt' will be transcribed as 'Gijsb(er)t',
as in TEI.

Best regards,

annotator

Dear annotator,

In Monk, we use the principle that, for normal words, one labels what can be seen, not what is intended. However, in the case of contractions, abbreviations, unique glyphs, etc., we use the @SPECIAL codes. As an example, if voirs is written but voirscreven is intended, the Monk standard is to enter the label @[intended]_[visible], i.e., @voirscreven_voirs.
It is important that similar patterns get a unique label, such that Monk can compute a model for them, in order to find similar patterns. The label is the 'bridge' to the shape. Rendering the text in some appropriate or correct way is then easy for computers and can be done at a later stage.

For technical reasons we have chosen the Monk system of encoding @SPECIALs. Notably, Monk uses ASCII characters for the unique encoding, internally. Remember, even a word label such as queen is nothing more than a shape identifier for the word pattern as a whole. If all users within a collection agreed on it, they could decide to use, for example, the code FG4738JD systematically. In that case Monk would also be able to put the corresponding images into a single model for FG4738JD, a code which you and I know to represent 'queen'. Please note that historical manuscripts may contain many visual elements (such as a 'solvit' glyph in acts) that cannot be represented by UTF or Unicode at all! For Monk, a systematic labeling policy is more important than the use of diacritics. The latter are still coded differently on many computers today. Furthermore, even if UTF is implemented, it is uncertain whether an end user will actually see the corresponding character, because the required font may not be present in their operating system. For Monk, ASCII is a much more reliable representation. Again: with unique codes for words and characters, rendering them in a particular font or image at some later stage is not very difficult for an application programmer.
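
The two ideas in this answer, building an @[intended]_[visible] code and rendering any unique code as text at a later stage, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: make_special_label, render and the RENDERING table are hypothetical names, not part of Monk, and the table entries are the examples used on this page.

```python
def make_special_label(intended, visible):
    """Build a code of the form @[intended]_[visible], restricted to ASCII without blanks."""
    for part in (intended, visible):
        if not part or not part.isascii() or any(ch.isspace() for ch in part):
            raise ValueError(f"not a usable ASCII label part: {part!r}")
    return f"@{intended}_{visible}"


# Hypothetical rendering rules: unique code -> preferred text for display.
RENDERING = {
    "@voirscreven_voirs": "voirscreven",
    "@ende_en": "ende",
    "FG4738JD": "queen",
}


def render(label, table=RENDERING):
    """Look up the display text for a code; ordinary words are shown as typed."""
    return table.get(label, label)


print(make_special_label("voirscreven", "voirs"))
print(render("FG4738JD"))
```

The sketch uses an explicit lookup table instead of splitting the code at an underscore, because codes such as @ROMAN_xi_11 or @HORIZONTAL_LINE do not follow the intended/visible pattern. Any code works as long as it is unique and used consistently.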

Subject: I have labeled a word and at first it caught nice siblings in
the Sordex hit list but now no more new word instances are found. Why?

Dear user,

this can mean a number of things. Either the word is 'mined out' (there are no more instances) or, more likely, you have reached the agglomeration capacity of that method (Sordex). The trick is to go to another method (say: Qordex) and check the hit list of that same word. Even a single new instance may help both Qordex itself and Sordex. The principle is that of the Fahrkunst, the old mechanical up/down lift for miners. By adding a label, you push the system up, then go elsewhere and do the same. Consider the left pole in the image to be Sordex and the right pole Qordex, and you will see what I mean. The quality of the recognition is gradually pushed up:

[Figure: Fahrkunst illustration. Left pole: Sordex; right pole: Qordex. Caption: add new labels and shift.]

Lambert

Subject: differences between word labeling and line transcription

Dear Lambert,

I have started to browse the Cuper-Braun correspondence, that has been uploaded to MONK.
Immediately I encountered some issues:

> (1) To start with the interface: 
> To be honest, I do not understand where I need to go / where I need to click to start labeling letterforms.
> - starting with the scan, I succeed in cutting up the page in lines; 
> - every line is then accompanied by five small, square icons; 
> - the last two icons "R" and "Z" send me to a "Label Zone" (only the first one actually works on my computer/browser, since the other one requires 
> a plug-in which is missing);

The [Z] plugin is actually just Java (not Javascript). It is mainly used when the computer-based segmentations do not deliver a correct 'cutting out' of an important word. This means that it would never be found automatically. In such a case, the human user needs to define the rectangle (region of interest, x,y,w,h) with the mouse by drawing it. However, this [Z] is not the preferred route: it is only intended to enter 'seed words' into the Monk learning cycle, for the first time.

> - in the Label Zone [R], the selected line has been rendered many times, each time with another rectangular selection highlighted; the highlighted 
> selections are accompanied by suggestions for labeling;
> - if I click on one of the suggested labels, I am forwarded to a screen where the rectangular selection is magnified, and I have the option of 
> changing the label.
> BUT: most of the times, the selections are of a combination of letterforms that do not match one word, but form a selection of letters from within a 
> word, or from two words.
> Am I supposed to simply go through all the highlighted selections in a line, until I see a selection which matches exactly one word?
> I presume it is of no use to label a selection if it contains two words in a row, or a random selection of letters?

The [R] method will look strange at first, and highly redundant. We are working on a number of future variants for the word labeling. This method [R] can work reasonably well in the following way:
(a) Be sure to have a mouse with a scroll wheel.
(b) Select a line that has no 'green labels' yet (for clarity of this example).
(c) Label the words from left to right. Use the scroll wheel to quickly find the first correctly segmented word. Label it and save the label.
(d) Then put the page in the mode [Only with HUMAN duplicate] and press the [Go] button. This suppresses all the irrelevant grey segments to the left of the last green 'HUMAN' label. In this way, the screen content becomes increasingly simple as you work your way towards the right.
(e) After going to a new line with the black rightward arrow, repeat from step (c). Note that, initially, all the grey segments will appear again when you start on the new line.

> Or is there a more convenient way of adding and correcting labels? If so, where do I click to get there?

Yes, you can always go to the 'Train Words' tool from the main menu. Select your book and select the Sordex (Sorted Index)
or Qordex (Q variant of the sorted index). There, you can see the hit lists for specific lexical words.
The green boxes are 'HUMAN', the pink boxes are machine recognized and need to be cleaned away by labeling them.
After labeling a number of pink boxes to green and pressing the cogwheel, the list will be re-sorted. These computations
take place on the server, so you may want to continue working on another word until the cogwheel stops rotating. This may take
several minutes. In your book, the (admittedly boring) word 'quos' is what Monk calls a 'good prospect': http://tinyurl.com/kypnyvj

Hint: by alternating between Sordex and Qordex you can elevate the recognition performance for a particular lexical word, following the
principle of the 'Fahrkunst'. Each improvement in method B also brings method A to a higher level, which then in its
own way also improves the recognition, etc.

> (2) It was easier to start editing the transcriptions of lines. 
> I wondered if the changes and additions I make to the transcriptions (and save), will be automatically used by MONK in learning, improving its 
> "mastery" of the handwriting of this collection of letters. Or do you have to order the servers at the Zernike to integrate the changes I make to the 
> transcriptions into the learning process?

The goal of the line transcription is threefold:
a) It allows users to search on key words within Monk.
b) Eventually, after some days, the Google robots will also find the transcribed texts, opening up the collection to many more users.
c) The words in the transcribed lines are shown to the users at the moment of labeling an individual word. The transcription gives
them knowledge on how to interpret the letter shapes, and confirming a word is easy by clicking on the correctly transcribed word.
These words are located on the lower right of the [word labeling] screen.

> (3) One volunteer has transcribed
>
> "venusta-
> temq.," as
> "venusta-
> temp..."
> 
> The suffix "-q." looks like "-q3", and it is a very common abbreviation for "-que". Apparently your transcriber has read it as "-p[something]".
> What is the best way to correct this? 
> Simply "temq."? (that is what I have done right now, on scan #10) 
> Or "@temque_temq3"? 
> That might make the transcription quite ponderous, seeing that such abbreviations occur so often...

This is a bit tricky. To begin with: the Monk approach hopes that the line transcribers agree on one systematic mode for a book or document set.
The goal is to have a more or less legible version of the handwritten text. Also, the words need to be searchable (Google).
The restrictions are not as strong as in the case of the word labeling in Monk. In this case, if the expert (you) really thinks
that 'temque' is the most appropriate transcription: then by all means, label it as such!
However, in the word labeling mode, Monk really prefers an extended code such as '@temque_tem_qB_ligature'. Finally, if handwritten
items deviate strongly from the normally legible letter-by-letter word images, then even in the line transcription the use of the @SPECIALs
(i.e., the special user-defined codes starting with @) must be considered. This was done in the Schepenbank Leuven, because there the contractions
and abbreviations are so far from the intended all-letter versions that many contemporary readers cannot even locate the handwritten item in the line-strip
image with the full transcription next to it.

> Thanks in advance,
> Yours kindly,
> ...

Good luck!
Best regards, Lambert

Subject: target word with other words in the same box

Hello Lambert,

One more question about labeling in MONK:

It regularly happens that the hit list contains blocks with more than
one (1) word. Under 'vridages', for example, next to 2 green blocks with
'vridages' there are also 3 red blocks with 'vridages op' and 'des
vridages'. Is it a problem if these word combinations are labeled as a single
item, so that 'vridages op' becomes one (1) item in the hit list?
Or does MONK only label per word, so that
several words cannot be labeled as one (1) item?

Kind regards,
transcriber

If there really are several words in a box, they are entered together in the label: te_Amsterdam. If you type a space, it will automatically be converted into an _ (underscore). There is one exception. If the word has a sufficient number of letters and is rich in meaning, such as a surname or place name, then a single spurious letter on the left or right may be left out: [e Groningen] ↔ [Groningen]. Likewise, [Groninge] may be labeled as [Groningen].

By contrast, it is understandable that labeling [he] as [het] is undesirable: the first two letters do not contain convincing evidence from which a human or a computer could infer which whole word is meant.

Subject: blue instead of green marked word zones in word labeling

Dear Lambert,

I have tried to label a number of words. For example, I changed an '11' into
'@Roman_xi_11'. After saving, the pink box with
the word became blue (Human labeled,
does not fit in current hit list) instead of green.

Subject: Re: blue instead of green marked word zones in word labeling
Dear transcriber,
A blue marking is just as good as a green one. In this case, the word is not marked in green because its label (@Roman_xi_11) differs from the name of the current hit list for '11'. However, after re-sorting the hit list, that blue instance of @Roman_xi_11 will migrate to its own hit list. In a list with the same name, that word instance will be marked in green. The re-sorting occurs at night or on demand, with the cogwheel button, which is available for some Monk users.

Best regards Lambert
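
The colour rule described in this reply can be summarised in a small Python sketch. It only restates the rule as explained on this page (pink for machine-recognized boxes, green and blue for human labels) and is not taken from the Monk source; the function name box_color is hypothetical.

```python
def box_color(hit_list_name, human_label=None):
    """Return the background colour of a word zone shown in a hit list.

    pink  : no human label yet, machine-recognized only
    green : human label equal to the name of the current hit list
    blue  : human label that does not fit the current hit list
    """
    if human_label is None:
        return "pink"
    if human_label == hit_list_name:  # exact, case-sensitive comparison
        return "green"
    return "blue"


print(box_color("11", "@Roman_xi_11"))
```

After the hit list has been re-sorted, the same instance appears in the list that carries its own label, where this rule yields green.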

Subject: General procedure around this 'Sordex'?

Dear Lambert,

I started some word labeling but I am still in the dark about what is happening in the background.
I understand that this 'Sordex' thing is very important.

Dear word labeler,
* Monk first tries to find the most likely word, given a wordzone image, on the basis of shape Feature I (i.e., regular recognition). This yields the primary, raw unsorted /Index for known lexical words. In a secondary stage, it generates for each lexical word a hit list of word-zone images, sorted in order of decreasing similarity to the average word model, from the perspective of shape Feature II (i.e., regular retrieval). This yields the secondary, sorted index, or /Sordex.
* The latter list can be recomputed by pressing the 'cogwheel'. This recomputation will have a beneficial effect on the precision of a /Sordex word hit list if, in the meantime, new human word-zone labels have been entered, such that the word model in Feature-II space will have been improved statistically. However, note that no new words are added to the /Sordex that are not already present in the /Index. In order to harvest fresh new words, the Feature-I recognizer needs to run over unseen pages in the book, with new word hypotheses added to the /Index. Contact the Monk manager for such a request. A book can have the status of 'hot book', i.e., a book that the Monk agents (programs) watch closely for any events, quickly issuing recomputation requests to incorporate newly learned words.
* Alternating between hit lists such as Qordex/Sordex will allow you to uplift the ranking performance on a particular word. This is akin to the 'Fahrkunst' elevator used in ancient mines.

Best regards, Lambert
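
The secondary, sorted stage described above can be illustrated with a minimal Python sketch. It makes several assumptions that are not stated in this FAQ: the feature vectors are plain lists of numbers, the 'average word model' is the mean of the human-labeled examples, and similarity is measured by Euclidean distance. The names mean_vector and sorted_hit_list are hypothetical; the actual Monk features and distance measures may differ.

```python
import math


def mean_vector(vectors):
    """Average word model: the component-wise mean of the labeled example vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def sorted_hit_list(candidates, labeled_examples):
    """Sort (zone_id, feature_vector) pairs by decreasing similarity to the word model.

    A smaller Euclidean distance to the model is treated as a higher similarity (assumption).
    """
    model = mean_vector(labeled_examples)
    return sorted(candidates, key=lambda item: math.dist(item[1], model))


# Hypothetical data: three candidate word zones and two human-labeled examples.
candidates = [("zone_a", [5.0, 5.0]), ("zone_b", [1.0, 1.1]), ("zone_c", [2.0, 2.0])]
labeled = [[1.0, 1.0], [1.0, 1.2]]
for zone_id, _ in sorted_hit_list(candidates, labeled):
    print(zone_id)
```

Each new human label changes the mean, so recomputing the list can change the ranking. The sketch only re-ranks the candidates it is given, which mirrors the remark that a recomputation adds no words that are absent from the /Index.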

Subject: Roman numbers

Dear Lambert,

I have tried to number some words. For example, I have changed a '11' (decimal
eleven) label into @Roman_xi_11 as you requested (...) but now I see blue
colors instead of the desired green color for HUMAN-labeled words.

Subject: Re: Roman numbers, blue vs green word marking

Dear word labeler,
The word labeling in Monk is case sensitive. The given class for your word image was probably @ROMAN_xi_11. You can also infer this from the word suggestions in the drop-down list while typing. Usually it is better to use these suggestions. Please look carefully at the Roman numeral. If the characters are in upper case, like XII, then '@ROMAN_XII_12' would be the best label. In very old handwriting, a lower-case variant is sometimes used, e.g., xc. In that case, the label to enter is '@ROMAN_xc_90'. We add the decimal (Arabic) amount because many contemporary users are not familiar with Roman numerals. Also, as a general rule, Monk prefers a redundant coding for word labels (@veni_ueni as opposed to 'ueni').
Best regards, Lambert
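
As an illustration of the @ROMAN_[as written]_[decimal] convention, here is a minimal Python sketch. The helper names roman_to_int and roman_label are hypothetical, and the sketch does not validate the numeral or handle non-standard historical spellings; the drop-down suggestions in Monk remain the better guide.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral (upper or lower case) to its decimal value."""
    total = 0
    largest_seen = 0
    # Read right to left: a symbol smaller than one already seen is subtracted (xc = 90).
    for symbol in reversed(numeral.lower()):
        value = ROMAN_VALUES[symbol]
        if value < largest_seen:
            total -= value
        else:
            total += value
            largest_seen = value
    return total


def roman_label(written):
    """Build the label, keeping the case exactly as written on the page."""
    return f"@ROMAN_{written}_{roman_to_int(written)}"


print(roman_label("XII"))
print(roman_label("xc"))
```

The numeral is copied into the label exactly as written, because labels are case sensitive: XII and xii would end up in different classes.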

Subject: Cogwheel /Sordex recomputation

Dear transcriber,

you can now also submit a request yourself, with the 'cogwheel button',
for the recomputation of a hit list. The request is picked up periodically by the compute server,
and a new 'Sordex hit list' is made for that word.

Subject: Re: Cogwheel
Dear Lambert,
Thank you for this service! I have cranked up the cogwheel in the meantime. It has now been busy for a while creating a new Sordex. How long does that process take on average?
Kind regards,
transcriber

Subject: Re: Re: Cogwheel
Dear transcriber,
unfortunately, how long a Sordex recomputation takes cannot be predicted well.
It depends on many factors: How large is the word list? How long ago was
the large index refreshed? How much computing capacity can Monk get? For a new
collection of which only a few search terms are known, it will generally
go much faster than for a large existing collection or a thick book.

To the word transcription page

Or try the Monk search engine