Greynir parses sentences of Icelandic text into parse trees. A parse tree recursively describes the grammatical structure of the sentence, including its noun phrases, verb phrases, prepositional phrases, etc.
The individual tokens (words, numbers, punctuation, etc.) of the sentence correspond to leaves in the parse tree.
By examining and processing the parse tree, information and meaning can be extracted from the sentence.
Here is a short example of what can be done with Greynir:
>>> from reynir import Greynir
>>> g = Greynir()
>>> sent = g.parse_single("Ása sá sól.")
>>> print(sent.tree.view)
S0                               # Root
+-S-MAIN                         # Main sentence
  +-IP                           # Inflected phrase
    +-NP-SUBJ                    # Noun phrase, subject
      +-no_et_nf_kvk: 'Ása'      # Noun, singular, nominative, feminine
    +-VP                         # Verb phrase containing arguments
      +-VP                       # Verb phrase containing verb
        +-so_1_þf_et_p3: 'sá'    # Verb, 1 accusative arg, singular, 3rd p
      +-NP-OBJ                   # Noun phrase, object
        +-no_et_þf_kvk: 'sól'    # Noun, singular, accusative, feminine
+-'.'                            # Punctuation
>>> sent.tree.nouns
['Ása', 'sól']
>>> sent.tree.verbs
['sjá']
>>> # Show the subject noun phrase
>>> sent.tree.S.IP.NP_SUBJ.lemmas
['Ása']
>>> # Show the verb phrase
>>> sent.tree.S.IP.VP.lemmas
['sjá', 'sól']
>>> # Show the object of the verb
>>> sent.tree.S.IP.VP.NP_OBJ.lemmas
['sól']
In the example, S stands for sentence (málsgrein), IP for inflected phrase (beygingarliður), VP for verb phrase (sagnliður), NP_SUBJ for a subject noun phrase (frumlag) and NP_OBJ for an object noun phrase (andlag).
Nonterminal names are listed in the Nonterminals section.
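To make the relationship between the tree and its tokens concrete, the following minimal sketch builds the tree from the example by hand and walks it. It does not use Greynir at all: the `Node` class and its `leaves` and `child` methods are hypothetical names invented for this illustration, and only the node labels are taken from the printed tree above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a nonterminal with children, or a leaf holding a token."""
    tag: str
    text: Optional[str] = None      # Token text, set only on leaves
    lemma: Optional[str] = None     # Lemma, set only on leaves that are words
    children: List["Node"] = field(default_factory=list)

    def leaves(self):
        """Yield the leaves of this subtree from left to right."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()

    def child(self, tag):
        """Return the first direct child carrying the given tag."""
        return next(c for c in self.children if c.tag == tag)


# Hand-built copy of the tree printed for "Ása sá sól."
tree = Node("S0", children=[
    Node("S-MAIN", children=[
        Node("IP", children=[
            Node("NP-SUBJ", children=[Node("no_et_nf_kvk", "Ása", "Ása")]),
            Node("VP", children=[
                Node("VP", children=[Node("so_1_þf_et_p3", "sá", "sjá")]),
                Node("NP-OBJ", children=[Node("no_et_þf_kvk", "sól", "sól")]),
            ]),
        ]),
    ]),
    Node("'.'", "."),
])

# The leaves, read left to right, are exactly the tokens of the sentence
tokens = [leaf.text for leaf in tree.leaves()]
# Descending to the outer verb phrase and collecting lemmas below it
vp = tree.child("S-MAIN").child("IP").child("VP")
vp_lemmas = [leaf.lemma for leaf in vp.leaves()]

print(tokens)     # ['Ása', 'sá', 'sól', '.']
print(vp_lemmas)  # ['sjá', 'sól']
```

The second result matches what `sent.tree.S.IP.VP.lemmas` returns in the example, which suggests why attribute-style navigation is convenient: each step simply narrows the subtree whose leaves are collected.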
What Greynir does
Greynir starts by tokenizing your text, i.e. dividing it up into individual words, numbers, punctuation and other tokens. For this, it uses the separate Tokenizer package, by the same authors, which is automatically installed with Greynir.
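As a rough illustration of what tokenization means, here is a deliberately simplified sketch using only Python's standard `re` module. It is not how the Tokenizer package works: the function name `toy_tokenize` and the three token classes are assumptions made for this example, and the real package recognizes many more token kinds (dates, amounts, abbreviations and so on).

```python
import re

# Simplified token classes, tried in order at each position
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+(?:[.,]\d+)*)"   # Digits, optionally with . or , separators
    r"|(?P<WORD>\w+)"                # Runs of word characters, incl. Icelandic letters
    r"|(?P<PUNCTUATION>[^\w\s])"     # Any single symbol that is not a word char or space
)


def toy_tokenize(text):
    """Return a list of (kind, text) pairs, one per token found in the input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


result = toy_tokenize("Ása sá 3 sólir.")
for kind, txt in result:
    print(kind, txt)
```

Each pair produced here plays the role of one token in the stream that the parser consumes in the next step.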
After tokenization, Greynir proceeds to parse the text according to a context-free grammar for the modern Icelandic language. This grammar contains rules describing how sentences and the various subparts thereof can be validly constructed.
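The idea of grammar rules that describe valid constructions can be sketched with a toy example. The grammar below is invented for illustration and is far smaller than Greynir's actual grammar; the names `GRAMMAR`, `ends` and `is_sentence` are hypothetical, and the naive recursive recognizer says nothing about how Greynir's parser is implemented.

```python
# Toy context-free grammar: each nonterminal maps to its alternative right-hand sides.
# Symbols that are not keys (noun, verb, adj) are terminals, i.e. token categories.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["noun"], ["adj", "NP"]],
    "VP": [["verb"], ["verb", "NP"]],
}


def ends(symbol, cats, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the category of the next token
        if start < len(cats) and cats[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:
            # Advance through the right-hand side one symbol at a time
            positions = {e for p in positions for e in ends(part, cats, p)}
        result |= positions
    return result


def is_sentence(cats):
    """True if the whole category sequence can be derived from S."""
    return len(cats) in ends("S", cats, 0)


print(is_sentence(["noun", "verb", "noun"]))  # True
print(is_sentence(["verb", "noun"]))          # False
```

This simple top-down approach only works because the toy grammar has no left-recursive rules; a grammar for a real language needs a more capable parsing algorithm.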
Almost all sentences are ambiguous, meaning that multiple parse trees can validly describe the same sentence according to the grammar rules. Greynir thus has to choose a single best tree from the forest of possible trees. It does this with a scoring heuristic that assigns higher scores to common word forms and grammatical constructs, and lower scores to rare word forms and uncommon constructs. The parse tree with the highest overall score wins and is returned as the result of the parse.
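Ambiguity and score-based selection can be illustrated with a small sketch. Everything in it is assumed for the sake of the example: the toy rules, the score values in `RULE_SCORE` and the function names are invented, and Greynir's actual heuristic and grammar are considerably more elaborate. The sketch enumerates every tree for a five-token category sequence in which a prepositional phrase can attach either to the verb phrase or to the object, then picks the highest-scoring tree.

```python
# Toy grammar as (left-hand side, right-hand side) pairs
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("noun",)),
    ("NP", ("NP", "PP")),
    ("VP", ("verb", "NP")),
    ("VP", ("VP", "PP")),
    ("PP", ("prep", "NP")),
]

# Invented scores: attaching a PP to the verb phrase is rated as more common
RULE_SCORE = {("VP", ("VP", "PP")): 2, ("NP", ("NP", "PP")): 1}


def trees(symbol, cats, i, j):
    """Yield every tree rooted at `symbol` that covers cats[i:j]."""
    if j == i + 1 and cats[i] == symbol:
        yield symbol                      # A leaf: the token category itself
    for lhs, rhs in RULES:
        if lhs != symbol:
            continue
        if len(rhs) == 1:
            for sub in trees(rhs[0], cats, i, j):
                yield (lhs, sub)
        else:
            for k in range(i + 1, j):     # Try every split point of the span
                for left in trees(rhs[0], cats, i, k):
                    for right in trees(rhs[1], cats, k, j):
                        yield (lhs, left, right)


def score(tree):
    """Sum the scores of all rules used in the tree (unlisted rules score 0)."""
    if isinstance(tree, str):
        return 0
    label, *kids = tree
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    return RULE_SCORE.get((label, rhs), 0) + sum(score(k) for k in kids)


# Categories for a sentence of the shape "Ása sá mann með kíki"
cats = ["noun", "verb", "noun", "prep", "noun"]
forest = list(trees("S", cats, 0, len(cats)))
best = max(forest, key=score)

print(len(forest))   # 2
print(score(best))   # 2
print(best)
```

With these invented scores the reading where the prepositional phrase modifies the verb phrase wins; swapping the two values in `RULE_SCORE` would select the other tree instead.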
Once the best parse tree has been found, it is available for various kinds of queries. You can access word lemmas, extract noun and verb phrases as shown above, look for patterns via wildcard matching, and much more. This is described in detail in the Reference.