Qoruyo: Models for Automatic Transcription of Manuscripts
The Beth Mardutho Qoruyo project seeks to develop tools and resources for successful optical character recognition (OCR) and handwritten-text recognition (HTR) of printed and handwritten Syriac texts.
The project is now pleased to announce the Beth Mardutho Transkribus HTR models, which can automatically transcribe handwritten Syriac documents written in Estrangelo, East Syriac, and Serto, with up to 98% accuracy.
These models were trained as part of the Beth Mardutho Summer Fellowship Program 2019. The Estrangelo and Serto models were trained by Kyle Brunner (NYU), and the East Syriac model was trained by Abigail Pearson (University of Exeter).
What is Transkribus?
Transkribus is a free software program which can automatically transcribe handwritten text from digital images using HTR technology. Completed transcriptions can then be edited, searched, tagged, and exported using tools available in the program.
Using Transkribus, we have trained the following three HTR models to recognise handwritten Syriac in all three scripts with impressive accuracy:
Qoruyo Estrangelo Beta 1.0 – up to 98% accuracy
Qoruyo East Syriac Beta 1.0 – up to 95% accuracy
Qoruyo Serto Beta 1.0 – up to 96% accuracy
These models are accessible to anyone with a Transkribus account and, as they were trained using data from several manuscripts, they can be used successfully on a variety of handwriting styles.
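The page does not say how the accuracy figures above were measured. One common convention for HTR is character accuracy, that is, 1 minus the character error rate (CER), where CER is the edit distance between the model output and a hand-checked transcription divided by the length of that transcription. The sketch below assumes that convention; the function names are illustrative and are not part of Transkribus.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - CER, measured against the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return 1.0 - cer
```

Under this definition, a line of 50 characters with a single wrong character scores 98%. To check a model against your own manuscript, you could compare its output for a few pages with your corrected transcription of the same pages.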
Who should use Transkribus?
Transkribus can benefit a range of research and digitization projects, both large and small.
For scholars or institutions wishing to publish digital or print editions of Syriac manuscripts, the Beth Mardutho Transkribus models can speed up the transcription process and reduce the workload, as minimal manual correction is required to reach an accurate transcription of a text.
Researchers conducting detailed studies on a particular theme, person, or place will benefit from the keyword spotting tool, which pinpoints every appearance of a word in the text. As well as searching for the precise term, Transkribus offers a ‘fuzzy search’ which highlights close matches – useful for catching variant spellings and scribal errors.
Scholars working on editions for digital corpora can tag metadata and export the transcription as an .XML file. Completed transcriptions can also be exported as a .PDF, .TXT, or .DOCX file, making Transkribus a suitable first step for projects where further editing in other platforms is required.
How do I get started?
To begin, register as a Transkribus user and link with the Beth Mardutho account by following the steps below. Once your account is confirmed, you will be able to import images, transcribe the text with an HTR model, conduct a keyword search, and export completed transcriptions. Please see the guides for step-by-step instructions.
Please note that the models on offer are in the beta stage and will continue to be improved as we gather more data. Check the website regularly to ensure you are using the latest version.
How can I get involved?
If you would like to help us to improve our models, either by volunteering as a transcriber or by sharing your transcriptions with us, please email us at firstname.lastname@example.org
How to connect with the Beth Mardutho Transkribus account
1. Visit https://transkribus.eu/Transkribus/ and register for a user account.
2. Download and install the latest version of Transkribus.
3. Under the ‘server’ tab, click ‘User Manager’.
4. In the Username/Email box, type in our address Transkribus@BethMardutho.org and click ‘find user.’
5. When our account appears, select it, and then click ‘add user’.
6. Change the role to ‘Transcriber’ and click ‘OK’.
7. E-mail Transkribus@BethMardutho.org to request access to the model or models you would like to use. Please email from the same address as your Transkribus user account.
8. You will receive an email confirmation from us within 24 hours. The models you have requested will then appear when you click ‘models’ in the ‘tools’ tab.
How to import images into Transkribus and prepare them for HTR
i.) Create a Transkribus account by following the steps above.
ii.) From the main menu, scroll over ‘documents’ and click ‘import documents.’
iii.) Click the folder button to browse your computer and select the images you want to import, then click ‘upload.’ Supported file formats for images are .PDF, .JPEG, .PNG, and .TIFF.
iv.) Once the upload is complete, the file will be listed under the ‘server’ tab. Double-click on it to load the images.
v.) Go to the ‘Layout Analysis’ heading under the ‘Tools’ tab and select the pages you want to prepare, then click ‘run’.
vi.) A dialogue box will open asking you to confirm the job; click ‘yes’.
vii.) When the segmentation is complete you will see a pop-up box stating ‘do you want to reload the current page?’ Click ‘yes.’
viii.) Now each page will show the text regions in green, and the baselines in purple. If needed, these can be corrected manually using the buttons on the left of the viewing pane.
ix.) Finally, correct the reading order. Click the eye symbol and check ‘show regions reading order’ and ‘show lines reading order.’ To reorder, click the number of each text region or line and manually enter a new number.
TIP: If drawing in new baselines manually, always draw them from left to right. Lines drawn from right to left are registered as upside-down.
TIP: The layout analysis tool works very well for straightforward layouts, but manual correction is likely to be needed for texts with heavy marginalia or lots of diacritics.
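Before importing a large folder of scans, it can save time to check that every file is in one of the supported formats listed in step iii. The following is a minimal sketch in Python, run on your own computer and separate from Transkribus. The function names are illustrative, and treating .jpg and .tif as acceptable short spellings of .jpeg and .tiff is an assumption, as the guide does not mention them.

```python
from pathlib import Path

# Formats named in step iii of the guide.
LISTED_EXTENSIONS = {".pdf", ".jpeg", ".png", ".tiff"}
# Assumption: the common short spellings are accepted as well.
ASSUMED_ALIASES = {".jpg", ".tif"}


def is_supported(filename: str, allow_aliases: bool = True) -> bool:
    """Return True if the file extension is one of the supported formats."""
    extension = Path(filename).suffix.lower()
    if extension in LISTED_EXTENSIONS:
        return True
    return allow_aliases and extension in ASSUMED_ALIASES


def unsupported_files(folder: str) -> list:
    """List the names of files in a folder that would need converting first."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and not is_supported(path.name)
    )
```

Calling `unsupported_files("scans")` on a hypothetical folder named `scans` returns the files to convert before uploading; an empty list means the whole folder is ready to import.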
How to use a Beth Mardutho HTR model to transcribe a manuscript
i.) Import your images into Transkribus and prepare them for HTR by following the steps above.
ii.) Under the ‘tools’ tab, go to the ‘text recognition’ section and click ‘run.’
iii.) Select which pages you would like to be transcribed, then click ‘configure’.
iv.) Choose the model you want to use from the list, then click ‘OK’.
v.) A dialogue box will open asking you to confirm the job; click ‘yes.’ A second dialogue box will then open with the Job ID number; click ‘OK’.
vi.) Once the HTR is complete, a dialogue box will open asking if you want to reload the page. Select ‘yes’ and the transcription will load.
How to conduct a keyword search
i.) Transcribe your documents using a Beth Mardutho HTR model by following the steps above.
ii.) Click the binoculars on the top bar.
iii.) Under the ‘full text’ tab, type your search term into the box.
iv.) Click ‘search’ and any results will be listed in the box below.
v.) Double-click on a result to navigate to it.
TIP: The word preview function does not yet work accurately on right-to-left texts.
TIP: Check the ‘fuzzy search’ box to include near-matches such as variant spellings.
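To illustrate the idea behind a fuzzy search, the sketch below scores each word of a transcription against the search term and keeps the near-matches. This guide does not describe how Transkribus itself matches words, so the sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in; the function name and the 0.75 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(term: str, lines: list, threshold: float = 0.75) -> list:
    """Return (line number, word, similarity) for words close to the term.

    Similarity is SequenceMatcher's ratio, from 0.0 (nothing shared)
    to 1.0 (identical). Lines are numbered from 1.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            score = SequenceMatcher(None, term, word).ratio()
            if score >= threshold:
                hits.append((line_no, word, round(score, 2)))
    return hits
```

A lower threshold catches more variant spellings but also returns more false matches, so it is worth trying a few values on text you have already checked by hand. Note that vowel points and other combining marks count as characters in this sketch, so a pointed and an unpointed spelling of the same word will score below 1.0.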
How to export a transcription
i.) Load the document you would like to export by double-clicking on it in the ‘server’ tab.
ii.) From the main menu, scroll over ‘documents’ and click ‘export document.’
iii.) Select the file formats in which you would like to export the transcription: .txt, .docx, .pdf, or .xml.
iv.) Choose which version of the transcription you would like to export.
v.) Select which pages you would like to export and click ‘OK’.
vi.) Once complete, you will receive an email from Transkribus with a zip file containing the exported files.
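If you plan to process an export further outside Transkribus, the plain-text files can be read straight from the zip file without unpacking it by hand. The following is a minimal sketch using Python's standard `zipfile` module. The function name is illustrative, the folder layout inside the zip is not assumed, and UTF-8 encoding of the .txt files is an assumption to verify against your own export.

```python
import zipfile


def read_text_exports(zip_path) -> dict:
    """Map each .txt file inside the export zip to its contents.

    zip_path may be a file path or an open binary file object.
    Assumption: the .txt files are encoded as UTF-8.
    """
    texts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                texts[name] = archive.read(name).decode("utf-8")
    return texts
```

For example, `read_text_exports("export.zip")`, where `export.zip` is a hypothetical file name, returns a dictionary keyed by the path of each text file inside the archive; other formats in the same zip, such as .xml or .pdf, are skipped.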