LiNT-II: readability assessment for Dutch

Table of contents

  1. Introduction
  2. Demo
  3. LiNT and LiNT-II
  4. Linguistic Features
  5. Formula, Scores and Difficulty Levels
  6. References and Credits

Introduction

Demo

Select one of the 4 texts below to see the detailed LiNT-II analysis:

LiNT and LiNT-II

Background and motivation

LiNT-II is a new implementation of the original LiNT (Leesbaarheidsinstrument voor Nederlandse Teksten) tool.

The original LiNT uses the legacy NLP pipeline T-Scan to extract linguistic features from text; this software is difficult to install and run, which makes it unsuitable for many use cases.

LiNT-II is a modern Python package with spaCy under the hood. It can be easily installed with pip and integrated into other software; it is fast and therefore suitable for production setups.

In order to preserve the scientific integrity of the tool, LiNT-II was developed in close collaboration with Henk Pander Maat, one of the researchers who developed the original LiNT.

Original LiNT

The first version of LiNT was developed in the NWO project Toward a validated reading level tool for Dutch (2012-2017). Later versions were developed in the Digital Humanities Lab of Utrecht University.

More details about the original LiNT, including the empirical comprehension study and the development of the model, can be found in the publications listed under References and Credits below.

LiNT-II

In LiNT-II, the linguistic analysis of the text is done with spaCy, instead of the original T-Scan. This includes, for example, splitting the text into sentences and tokens, tagging the part-of-speech of each token (noun, verb, etc.), and parsing the syntactic structure of the sentence. We use the spaCy model nl_core_news_lg.
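As a minimal sketch of these preprocessing steps, the snippet below uses a blank Dutch pipeline with only a rule-based sentence splitter; LiNT-II itself loads the full nl_core_news_lg model, which adds POS tagging and dependency parsing:

```python
import spacy

# Blank Dutch pipeline with only a rule-based sentence splitter;
# LiNT-II itself uses spacy.load("nl_core_news_lg"), which adds
# POS tagging and dependency parsing on top of tokenization.
nlp = spacy.blank("nl")
nlp.add_pipe("sentencizer")

doc = nlp("De Oudegracht is het sfeervolle hart van de stad. Hij is eeuwenoud.")
sentences = [sent.text for sent in doc.sents]
first_tokens = [token.text for token in next(doc.sents)]

print(len(sentences))   # number of sentences found
print(first_tokens)     # tokens of the first sentence
```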

Performing the linguistic analysis with different software affects the values of the linguistic features. We therefore fitted a new model on the comprehension data that was collected for the original LiNT; this new model yields a new LiNT-II formula for calculating the readability score. For more information, see Formula, Scores and Difficulty Levels below.

Linguistic Features

Overview

The readability score of LiNT-II is calculated based on 4 features:

word frequency
  Mean word frequency of all the content words in the text (excluding proper nouns).
  ➡ Less frequent words make a text more difficult.

syntactic dependency length
  Syntactic dependency length (SDL) is the number of words between a syntactic head and its dependent (e.g., verb-subject). We take the biggest SDL in each sentence and calculate their mean value for the whole text.
  ➡ Bigger SDLs make a text more difficult.

content words per clause
  Mean number of content words per clause.
  ➡ A larger number of content words indicates denser information and makes a text more difficult.

proportion concrete nouns
  Mean proportion of concrete nouns out of all the nouns in the text.
  ➡ A smaller proportion of concrete nouns (i.e., many abstract nouns) makes a text more difficult.

Definitions

Word Frequency

Why word frequencies?

Words that are not common in spoken language tend to be less familiar to people and are therefore more difficult to process and understand. We can estimate how familiar a word is by measuring its frequency, i.e., counting its occurrences in a large text corpus (dataset).

Choice of corpus

LiNT-II calculates word frequencies from SUBTLEX-NL (Keuleers et al. 2010): a corpus of Dutch subtitles, which contains about 40 million words. This corpus was chosen for the original LiNT after elaborate analysis and consideration; for details, please refer to the T-Scan manual and Pander Maat & Dekker 2016.

During the development of LiNT-II, we also experimented with frequencies from wordfreq instead of SUBTLEX-NL. The wordfreq data is much bigger and combines multiple genres: SUBTLEX-NL, OpenSubtitles, Wikipedia, NewsCrawl, GlobalVoices, Web text (OSCAR), and Twitter. However, wordfreq frequencies resulted in a poorer fit when fitting the model on the comprehension data. This suggests that SUBTLEX-NL might be a better approximation of spoken language than a bigger corpus that contains a lot of written language, such as news and Wikipedia.

It is important to note that any corpus captures language use only partially. Since the SUBTLEX-NL corpus is based on Dutch subtitles for English-language shows, some words that are common in a Dutch-speaking context may be less frequent there (e.g., fietser "cyclist"). In addition, the shows date from the years 2000-2010; new words from the last 15 years (Instagram, covid) are not in the corpus. Additional corrections were applied to address some of these issues, as described below.

What do the values mean?

We calculate the frequencies on a Zipf scale (Van Heuven et al. 2014):

\[ \text{Zipf value} = \log_{10}(\text{frequency per billion words}) \]

A Zipf value of 1 means that a word appears once per 100 million words, a Zipf value of 2 means that a word appears once per 10 million words, a Zipf value of 3 means that a word appears once per million words, and so on.
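The Zipf computation amounts to the one-line formula above; a minimal illustration (the function name is ours, not part of the package API):

```python
import math

def zipf(count: int, corpus_size: int) -> float:
    """Zipf value = log10 of the word's frequency per billion words."""
    return math.log10(count / corpus_size * 1e9)

# A word occurring 40 times in a 40-million-word corpus
# appears once per million words, i.e. Zipf value 3:
print(round(zipf(40, 40_000_000), 2))  # → 3.0
```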

In line with the original LiNT and Van Heuven et al. 2014, we consider words with a Zipf value smaller than 3 as "uncommon"; these words appear in the SUBTLEX-NL corpus less than once per million words. Examples: afdwaling: 1.66, napraterij: 1.66.

The SUBTLEX-NL corpus with our calculated Zipf values can be found here.

Corrections and exceptions

The corrections and exceptions applied in LiNT-II are the same ones as in the original LiNT.

Syntactic Dependency Length (SDL)

Why SDL?

Syntactic dependency length (SDL) is the number of words between a syntactic head and its dependent (e.g., verb-subject). The bigger the distance between a head and its dependent, the more difficult the sentence is to process and understand. In Dutch, this phenomenon is called a tangconstructie ("pincer construction").

Calculating SDLs

To calculate the SDLs in the sentence, we use the dependency parsing of spaCy. The parser of the Dutch model that we use was trained on the Alpino UD corpus.

For each token in the sentence, we identify its head(s) and then count the number of intervening tokens between the token and its head. The head is generally taken from the spaCy parser, except for the three cases described below. In each sentence, we take the longest SDL as an indicator of difficulty. For the document-level readability analysis, we take the mean of all the sentence-level max SDLs.

Example: In the sentence "De Oudegracht is het sfeervolle hart van de stad.", the longest SDL is between the subject of the sentence Oudegracht and the root (main predicate) of the sentence hart; the max SDL is 3 (three intervening tokens: is, het, sfeervolle).
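As an illustrative sketch (not the package's actual code), the per-sentence max SDL can be computed from a dependency parse represented as a list of 0-based head indices, where the root token points to itself (as in spaCy):

```python
def max_sdl(heads: list[int]) -> int:
    """Longest syntactic dependency length in one sentence:
    the maximum number of tokens between any token and its head."""
    return max(
        abs(i - head) - 1
        for i, head in enumerate(heads)
        if head != i  # skip the root, which is its own head
    )

# "De Oudegracht is het sfeervolle hart van de stad ."
# Head indices of a (hypothetical) UD-style parse;
# the root "hart" (index 5) points to itself.
heads = [1, 5, 5, 5, 5, 5, 8, 8, 5, 5]
print(max_sdl(heads))  # → 3 (Oudegracht -> hart: is, het, sfeervolle)
```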

Corrections and exceptions

There are three cases in which we do not follow spaCy's dependency analysis:

These exceptions and corrections are based on a manual analysis of a sample of 200 sentences performed by Henk Pander Maat, one of the creators of the original LiNT. He identified these three issues as the main systematic differences between the spaCy parser and the parser used in the original LiNT.

Content Words per Clause

Why content words per clause?

A clause is a group of words that contains a subject and a verb. A simple sentence contains one clause; longer sentences may contain additional clauses, for example subordinate clauses or clauses connected with words like "and" or "because". For this metric, the number of clauses is not important; what we analyze is the number of content words in each clause.

A clause with a lot of content words is dense in information and is therefore more difficult to process and understand. For example, compare the sentence "Ik verknalde het proefwerk." with the sentence "Ik verknalde het proefwerk Wiskunde gisteren bij het laatste schoolexamen.". In both cases, the sentence contains one clause (one subject and one verb), but in the second sentence there is a lot more information, which is introduced through four extra content words (Wiskunde, gisteren, laatste, schoolexamen).

Calculating content words per clause

We calculate the number of clauses in the sentence by counting the number of finite verbs, i.e., verbs that show tense. This is done using the spaCy fine-grained part-of-speech tag "WW|pv" (werkwoord, persoonsvorm).

We calculate the number of content words by counting all words with one of the following parts of speech (POS): nouns (NOUN), proper nouns (PROPN), lexical verbs (VERB), and adjectives (ADJ). To these, we add a list of 69 manner adverbs, which we consider content words; other adverbs are not included, since they are considered function words, in line with the original LiNT. For more information, see the T-Scan manual.
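A minimal sketch of both counts, assuming tokens are given as (text, pos_, tag_) triples like those produced by spaCy; the 69-item manner-adverb list described above is omitted for brevity:

```python
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

def content_words_per_clause(tokens):
    """tokens: list of (text, coarse POS, fine-grained CGN tag) triples."""
    # Clauses are counted via finite verbs: fine-grained tag "WW|pv|..."
    clauses = sum(1 for _, _, tag in tokens if tag.startswith("WW|pv"))
    content = sum(1 for _, pos, _ in tokens if pos in CONTENT_POS)
    return content / clauses if clauses else None

# "Ik verknalde het proefwerk." -- one clause, two content words
tokens = [
    ("Ik", "PRON", "VNW|pers|pron"),
    ("verknalde", "VERB", "WW|pv|verl|ev"),
    ("het", "DET", "LID|bep"),
    ("proefwerk", "NOUN", "N|soort|ev"),
    (".", "PUNCT", "LET()"),
]
print(content_words_per_clause(tokens))  # → 2.0
```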

Proportion of Concrete Nouns

Why concrete and abstract nouns?

Concrete nouns refer to specific, tangible items that can be perceived through the senses, like "apple" or "car". Abstract nouns, on the other hand, represent general ideas or concepts that cannot be physically touched, such as "freedom" or "happiness". Research suggests that a more concrete text is easier to understand; for example, adding examples helps understanding because examples make ideas more specific and concrete.

LiNT-II noun list

The noun list was created for the original LiNT and further revised and updated for LiNT-II. The original annotation work was done by Henk Pander Maat, Nick Dekker and N. van Houten; the revisions and additions for LiNT-II were done by Henk Pander Maat.

The list contains 164,671 nouns, annotated for their semantic type (e.g., "human", "place") and semantic class ("abstract", "concrete", "undefined"); the full annotation scheme is described below. The annotations are based on an existing lexicon -- Referentiebestand Nederlands (Martin & Maks 2005) -- which was expanded and revised. For more information about how the original list was created, see the T-Scan manual.

Descriptive statistics of the LiNT-II noun list:

Semantic types scheme

The nouns in the list are divided into 14 semantic types (see the table below), which are in turn classified into two classes: abstract and concrete. Ambiguous words that have both an abstract and a concrete meaning are classified as undefined.

Semantic class  Semantic type                                      Examples
concrete        human                                              economiedocenten, assistent
                nonhuman                                           sardine, eik
                artefact                                           stoel, barometers
                concrete substance                                 modder, lichaamsvloeistoffen
                food and care                                      melk, lettertjesvermicelli
                measure                                            euro, kwartje
                place                                              amsterdam, voorkamer
                time                                               kerstavond, periode
                concrete event                                     ademhaling, stakingsacties
                miscellaneous concrete                             galblaas, vulkaan
abstract        abstract substance                                 fosfor, tumorcellen
                abstract event                                     crisis, status-update
                organization                                       nato, warenautoriteit
                miscellaneous abstract (nondynamic)                motto, woordfrequentie
undefined       ambiguous words that belong to more than one type  steun, underground

Calculating the proportion of concrete nouns

We calculate the proportion of concrete nouns in the document as follows:

\[ \frac{N_{\text{concrete}}}{N_{\text{concrete}} + N_{\text{abstract}} + N_{\text{undefined}}} \]

Formula, Scores and Difficulty Levels

Where Does the LiNT-II Formula Come From?

Original LiNT: data and model

For the development of the original LiNT, an empirical comprehension study was done. In this study, 2700 Dutch high-school students read 120 texts; their understanding of the texts was assessed using a cloze test (fill-in missing words). This comprehension dataset was then used by the researchers to fit a linear regression model; the model expresses which features of the text best predict the students' performance in the cloze test.

The developers of LiNT started with 12 different text features; step by step, they eliminated features that were not predictive enough or were highly correlated with other features. By the end of this process, four features remained: word frequency, syntactic dependency length, content words per clause, and proportion of concrete nouns. These features explain 74% of the variance in the comprehension dataset (adjusted \( R^2 = 0.74 \)). The regression model assigns each of these features a weight (coefficient), and this weighted combination is the formula used to assess text readability.

The research on which LiNT is based, including the empirical comprehension study and the development of the model, is described in the publications listed under References and Credits below.

LiNT-II model

For the development of LiNT-II, we used the same comprehension dataset and the same 4 features as in the original LiNT.

Since LiNT-II uses different software for the linguistic analysis, the feature values differ from those of LiNT; therefore, a new model was fitted on the comprehension data. The LiNT-II model performs as well as the original LiNT model: it explains 74% of the variance in the comprehension dataset (adjusted \( R^2 = 0.74 \)). Additional details about the model are shown below.

Parameter                    Coefficient  Standardized Coefficient (Beta)  Correlation (zero-order)  Partial Correlation  Variance Inflation Factor
constant                     -4.21
word frequency               17.28        0.403                            0.72                      0.56                 1.43
syntactic dependency length  -1.62        -0.255                           -0.66                     -0.34                1.99
content words per clause     -2.54        -0.218                           -0.70                     -0.28                2.30
proportion concrete nouns    16.00        0.246                            0.56                      0.40                 1.27

LiNT-II Formula & Score

The readability score is calculated based on the following formula:

\[ \begin{align} \text{LiNT-II score} = 100 - ( & -\text{4.21} \\ & + \text{17.28} \cdot \text{word frequency} \\ & - \text{1.62} \cdot \text{syntactic dependency length} \\ & - \text{2.54} \cdot \text{content words per clause} \\ & + \text{16.00} \cdot \text{proportion concrete nouns} ) \end{align} \]
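A minimal sketch of the formula as a Python function (the function name is ours; following the "Handling missing values" section below, it returns None when any feature value is missing):

```python
def lint_ii_score(word_frequency, sdl, content_words_per_clause, prop_concrete_nouns):
    """LiNT-II readability score; None if any feature is missing."""
    features = (word_frequency, sdl, content_words_per_clause, prop_concrete_nouns)
    if any(value is None for value in features):
        return None
    return 100 - (
        -4.21
        + 17.28 * word_frequency
        - 1.62 * sdl
        - 2.54 * content_words_per_clause
        + 16.00 * prop_concrete_nouns
    )

# Illustrative feature values (not taken from a real text):
print(round(lint_ii_score(4.0, 3.0, 2.0, 0.5), 2))  # → 37.03
print(lint_ii_score(None, 2, 0, None))              # → None
```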

LiNT-II Difficulty Levels

LiNT-II scores are mapped to 4 difficulty levels. For each level, we estimate how many adult Dutch readers have difficulty understanding texts at that level.

Score     Difficulty level  Proportion of adults who have difficulty understanding this level
[0-34)    1                 14%
[34-46)   2                 29%
[46-58)   3                 53%
[58-100]  4                 78%
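The mapping from score to level in the table above can be sketched as:

```python
def difficulty_level(score: float) -> int:
    """Map a LiNT-II score (0-100) to one of the 4 difficulty levels."""
    if score < 34:
        return 1
    if score < 46:
        return 2
    if score < 58:
        return 3
    return 4

print(difficulty_level(37.0))  # → 2
```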

The estimation is done in the same way as for the original LiNT, based on the comprehension dataset. For a detailed explanation, please refer to Pander Maat et al. 2023.

Handling missing values

When the value of at least one of the four features in the formula is None, the LiNT-II score is not calculated (None is returned). This happens in the following cases:

Examples:

Sentence                                          Word frequency  SDL   Content words per clause  Proportion concrete nouns
Waarom?                                           None            None  None                      None
Waarom is het zo?                                 None            2     0                         None
Misschien is het net opgekomen.                   None*           3     1                         None
Wat rechtsom kan, zou ook linksom moeten kunnen.  4.12            5     1.5                       None

*opgekomen is a content word, but it is in the "skip-list"

References and Credits

LiNT-II

LiNT-II was developed by Jenia Kim (Hogeschool Utrecht, VU Amsterdam), in collaboration with Henk Pander Maat (Utrecht University).

If you use this library, please cite as follows:

@software{lint_ii,
  author = {Kim, Jenia and Pander Maat, Henk},
  title = {{LiNT-II: readability assessment for Dutch}},
  year = {2025},
  url = {https://github.com/vanboefer/lint_ii},
  version = {0.1.0},
  note = {Python package}
}

Original LiNT

The first version of LiNT was developed in the NWO project Toward a validated reading level tool for Dutch (2012-2017). Later versions were developed in the Digital Humanities Lab of Utrecht University.


The readability research on which LiNT is based is described in the PhD thesis of Suzanne Kleijn (English) and in Pander Maat et al. 2023 (Dutch). Please cite as follows:

@article{pander2023lint,
  title={{LiNT}: een leesbaarheidsformule en een leesbaarheidsinstrument},
  author={Pander Maat, Henk and Kleijn, Suzanne and Frissen, Servaas},
  journal={Tijdschrift voor Taalbeheersing},
  volume={45},
  number={1},
  pages={2--39},
  year={2023},
  publisher={Amsterdam University Press}
}
@phdthesis{kleijn2018clozing,
  title={Clozing in on readability: How linguistic features affect and predict text comprehension and on-line processing},
  author={Kleijn, Suzanne},
  year={2018},
  school={Utrecht University}
}