Patterns
This section describes grammatical matching patterns that can be used with the SimpleTree.match(), SimpleTree.first_match(), SimpleTree.all_matches() and SimpleTree.top_matches() methods.
Overview
The methods listed above find trees and subtrees within a sentence that match a specific grammatical pattern. The pattern can include conditions that apply to the root of each subtree as well as to its direct or indirect children.
The patterns are given as strings, with pattern tokens separated by whitespace. Examples are given below.
See the documentation of each method for a further explanation of how the given pattern is matched in each case, and how results are returned.
Simple matches
A "literal" within double quotes matches a subtree that covers exactly the given literal text, compared case-insensitively. "Icelandic" thus matches icelandic and ICELANDIC. The literal may have multiple words, separated by spaces: "borgarstjóri reykjavíkur" matches a subtree that covers these two word forms. The matched subtree can be a nonterminal or a terminal node.

A 'literal' within single quotes matches a subtree that covers exactly the given word lemma(s), compared case-insensitively. 'hestur' thus matches hests and Hestinum. The literal may have multiple words, separated by spaces: 'borgarstjóri reykjavík' matches a subtree that covers these two lemmas. ('borgarstjóri reykjavíkur' would never match anything, as reykjavíkur is not the lemma of any word form.) The matched subtree can be a nonterminal or a terminal node.

A @"literal" within double quotes and prefixed with the @ symbol matches a terminal node that corresponds to a token having the given literal text, compared case-insensitively. @"Icelandic" thus matches icelandic and ICELANDIC.

A @'literal' within single quotes and prefixed with the @ symbol matches a terminal node that corresponds to a token having the given word lemma, compared case-insensitively. @'hestur' thus matches hests and Hestinum.

A NONTERMINAL identifier in upper case matches nodes associated with that nonterminal, as well as subcategories thereof. NP thus matches NP as well as NP-OBJ and NP-SUBJ. NP-OBJ only matches NP-OBJ and subcategories thereof.

A terminal identifier in lower case matches nodes associated with the specified category of terminal, and having at least the variants given, if any. no thus matches all noun terminals, while no_nf_et only matches noun terminals in nominative case, singular (but any gender, since a gender variant is not specified).
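The identifier rules above can be illustrated with a small standalone sketch. It is not the library's implementation: the helper names tag_matches, terminal_matches and literal_matches are hypothetical, and the sketch assumes that subcategories are written as hyphen-separated suffixes and that terminal variants are separated by underscores, as in the examples above.

```python
def tag_matches(pattern: str, tag: str) -> bool:
    """Nonterminal rule: a pattern matches its own tag and any
    hyphenated subcategory of it (assumed naming convention)."""
    return tag == pattern or tag.startswith(pattern + "-")


def terminal_matches(pattern: str, terminal: str) -> bool:
    """Terminal rule: same category, and every variant named in the
    pattern must be present on the terminal (order is irrelevant)."""
    p_cat, *p_vars = pattern.split("_")
    t_cat, *t_vars = terminal.split("_")
    return p_cat == t_cat and set(p_vars) <= set(t_vars)


def literal_matches(pattern: str, text: str) -> bool:
    """Literal rule: case-insensitive comparison of the covered text."""
    return pattern.casefold() == text.casefold()


print(tag_matches("NP", "NP-OBJ"))                    # True
print(tag_matches("NP-OBJ", "NP-SUBJ"))               # False
print(terminal_matches("no", "no_et_þf_kk"))          # True
print(terminal_matches("no_nf_et", "no_et_nf_kvk"))   # True
print(terminal_matches("no_nf_et", "no_et_þf_kk"))    # False
print(literal_matches("Icelandic", "ICELANDIC"))      # True
```

The subset test in terminal_matches is what makes an unspecified variant, such as gender in no_nf_et, act as "any".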
Wildcard match
A dot . matches any single tree node.
OR match
(Any1 | Any2 | ...) matches if any one of the options between the parentheses matches. The options are separated by vertical bars |.
Sequence matches
Any1 Any2 Any3 matches the given sequence if each element matches in exactly the given order. The match must be exhaustive, i.e. no child nodes may be left unmatched at the end of the list.

Any+ matches one or more sequential instances of the given Any match.

Any* matches zero or more sequential instances of the given Any match.

Any? matches zero or one instance of the given Any match.

.* thus matches any number of any nodes and is an often-used construct.

[ Any1 Any2 ] matches any node sequence that starts with the two given matches. It does not matter whether the sequence contains more nodes.

[ Any1 Any2 $ ] matches any node sequence where Any1 and Any2 match and there are no further nodes in the sequence. The $ sign is an end-of-sequence marker.

[ Any1 .* Any2 $ ] matches only sequences that start with Any1 and end with Any2.
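The difference between an open bracketed sequence and one closed with $ can be sketched with a small backtracking matcher over a flat list of node tags. This is an illustration only, not the library's matcher: the function match_seq and its anchored parameter are hypothetical, tags are compared by plain equality (no subcategory or variant logic), and anchored=True stands in for the trailing $ marker.

```python
def match_seq(pattern, tags, anchored=False):
    """Return True if `tags` matches the pattern items from its start.

    Each pattern item is a tag or ".", optionally followed by one of
    the quantifiers +, * or ?. With anchored=True, no tags may be left
    over after the last item (the role played by the $ marker).
    """
    def step(pi, ti):
        if pi == len(pattern):
            return ti == len(tags) if anchored else True
        item = pattern[pi]
        quant = item[-1] if len(item) > 1 and item[-1] in "+*?" else ""
        name = item[:-1] if quant else item

        def one(tag):
            return name == "." or tag == name

        if not quant:
            return ti < len(tags) and one(tags[ti]) and step(pi + 1, ti + 1)
        lo = 1 if quant == "+" else 0
        hi = 1 if quant == "?" else len(tags) - ti
        # Count how many consecutive tags this item could consume
        n = 0
        while n < hi and ti + n < len(tags) and one(tags[ti + n]):
            n += 1
        # Try the longest run first, then back off down to the minimum
        return any(step(pi + 1, ti + k) for k in range(n, lo - 1, -1))

    return step(0, 0)


children = ["VP", "NP", "ADVP", "PP"]
print(match_seq(["VP", "NP"], children))                       # True
print(match_seq(["VP", "NP"], children, anchored=True))        # False
print(match_seq(["VP", ".*", "PP"], children, anchored=True))  # True
print(match_seq(["NP?", "VP"], ["VP"]))                        # True
print(match_seq(["NP+"], []))                                  # False
```

Backing off from the longest run is what lets .* give back the final PP so that the closing item can still match.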
Hierarchical matches
Any1 > { Any2 Any3 ... } matches if Any1 matches and has immediate (direct) children that include Any2, Any3 and other given arguments (irrespective of order). This is a set-like operator.

Any1 >> { Any2 Any3 ... } matches if Any1 matches and has children at any sublevel that include Any2, Any3 and other given arguments (irrespective of order). This is a set-like operator.

Any1 > [ Any2 Any3 ... ] matches if Any1 matches and has immediate children that include Any2, Any3 and other given arguments in the order specified. This is a list-like operator.
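The three operators above can be contrasted on a toy tree. The sketch below is a simplified model rather than the library's code: the Node class and the helpers has_set and has_list are hypothetical, tags are compared by plain equality, and the list-like operator is assumed to match from the first child onward, in line with the bracketed sequences described under Sequence matches.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield all nodes below `node`, at any depth."""
    for child in node.children:
        yield child
        yield from descendants(child)


def has_set(node: Node, wanted, deep: bool = False) -> bool:
    """Set-like test: every wanted tag occurs among the direct children
    (like > { ... }) or, with deep=True, among all descendants
    (like >> { ... }). Order is irrelevant."""
    pool = descendants(node) if deep else node.children
    tags = {n.tag for n in pool}
    return all(w in tags for w in wanted)


def has_list(node: Node, wanted) -> bool:
    """List-like test (like > [ ... ]): the direct children start with
    the wanted tags, in the given order (assumed prefix semantics)."""
    tags = [c.tag for c in node.children]
    return tags[: len(wanted)] == list(wanted)


vp = Node("VP", [
    Node("so"),
    Node("NP-OBJ", [Node("no"), Node("NP-POSS", [Node("lo"), Node("no")])]),
])

print(has_set(vp, ["NP-OBJ", "so"]))        # True: order does not matter
print(has_set(vp, ["NP-POSS"]))             # False: not a direct child
print(has_set(vp, ["NP-POSS"], deep=True))  # True: found at a sublevel
print(has_list(vp, ["so", "NP-OBJ"]))       # True
print(has_list(vp, ["NP-OBJ", "so"]))       # False: wrong order
```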
Examples
This pattern will match the root subtree of any sentence that has a verb phrase that refers to a person as an argument:
"S >> { VP >> { NP-OBJ >> person }}"
This pattern will match any sentence that has a verb phrase that refers to a male person as an argument:
"S >> { VP >> { NP-OBJ >> person_kk }}"
Here is a short program using some of the matching features:
from reynir import Greynir

g = Greynir()
my_text = (
    "Reynt er að efla áhuga ungs fólks á borgarstjórnarmálum "
    "með framboðsfundum og skuggakosningum en þótt kjörstaðirnir "
    "í þeim séu færðir inn í framhaldsskólana er þátttakan lítil."
)
# Parse the text as a single sentence
s = g.parse_single(my_text)
print("Parse tree:")
print(s.tree.view)
print("\nAll subjects:\n")
# Walk all descendant nodes and pick out the subject noun phrases
for d in s.tree.descendants:
    if d.match_tag("NP-SUBJ"):
        print(d.text)
print("\nAll masculine noun and pronoun phrases:\n")
# Noun phrases with a masculine noun or pronoun as a direct child
for m in s.tree.all_matches("NP > { (no_kk | pfn_kk) } "):
    print(m.canonical_np)
Output:
Parse tree:
S0
+-S-MAIN
+-IP
+-VP
+-VP
+-so_sagnb: 'Reynt'
+-VP
+-so_et_p3: 'er'
+-IP-INF
+-TO
+-nhm: 'að'
+-VP
+-VP
+-so_1_þf_nh: 'efla'
+-NP-OBJ
+-no_et_þf_kk: 'áhuga'
+-NP-POSS
+-lo_ef_et_hk: 'ungs'
+-no_et_ef_hk: 'fólks'
+-PP
+-P
+-fs_þgf: 'á'
+-NP
+-no_ft_þgf_hk: 'borgarstjórnarmálum'
+-PP
+-P
+-fs_þgf: 'með'
+-NP
+-no_ft_þgf_kk: 'framboðsfundum'
+-C
+-st: 'og'
+-no_ft_þgf_kvk: 'skuggakosningum'
+-C
+-st: 'en'
+-S-MAIN
+-CP-ADV-ACK
+-C
+-st: 'þótt'
+-IP
+-NP-SUBJ
+-no_ft_nf_kk: 'kjörstaðirnir'
+-PP
+-P
+-fs_þgf: 'í'
+-NP
+-pfn_kvk_ft_þgf: 'þeim'
+-VP
+-VP
+-so_ft_p3: 'séu'
+-NP-PRD
+-NP-PRD
+-VP
+-so_lhþt_sb_nf_ft_kk: 'færðir'
+-PP
+-ADVP-DIR
+-ao: 'inn'
+-P
+-fs_þf: 'í'
+-NP
+-no_ft_þf_kk: 'framhaldsskólana'
+-IP
+-VP
+-VP
+-so_et_p3: 'er'
+-NP-SUBJ
+-no_et_nf_kvk: 'þátttakan'
+-NP-PRD
+-lo_sb_nf_et_kvk: 'lítil'
+-'.'
All subjects:
kjörstaðirnir í þeim
þátttakan
All masculine noun and pronoun phrases:
áhugi
framboðsfundur og skuggakosning
kjörstaður
framhaldsskóli