Syntactic collocations
This page lets you search and compare syntactic collocations of words in Polish language corpora. Syntactic collocations are based not on the direct proximity of words in a text, but on their proximity in a syntactic dependency tree: two words (nodes) count as collocates when they are directly connected by a dependency relation in that tree. The main motivation for this approach is that in languages with flexible word order, words that are far apart in linear order can still be closely related syntactically and form typical combinations. The syntactic approach therefore strengthens the statistical co-occurrence signal, which flexible word order can otherwise weaken. This is particularly important for Polish from earlier periods, in which syntactic discontinuities were much more frequent than in contemporary Polish.
Lemmas that appear at least 5 times have been indexed from two corpora: the Corpus of Contemporary Polish (KWJP) and the Electronic Corpus of Polish Texts from the 17th and 18th centuries (KorBa). Each word was assigned a list of the words directly connected to it by an edge in the dependency tree (with some extensions, primarily concerning coordinated words or phrases), along with the label of that edge, which indicates their syntactic function. A co-occurrence coefficient (logDice) was then calculated for each pair. The coefficient has a theoretical maximum of 14, reached when all occurrences of the two words in the corpus are used exclusively in that specific combination. In practice, the value is usually much lower (a detailed description of the measure and its interpretation can be found in [1]). The application only includes collocations with a logDice value of at least 4. Users can raise this threshold with filters, limiting the view to stronger collocations.
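As a rough illustration of the procedure described above, the sketch below counts dependency edges and scores them with logDice as defined in [1], that is, logDice = 14 + log2(2 * f_xy / (f_x + f_y)). The function names, the (head, relation, dependent) layout of an edge, and the example data are assumptions made for illustration, not the application's actual code. Using plain lemma frequencies for f_x and f_y is also an assumption, since the text does not say which frequencies the application uses.

```python
import math
from collections import Counter


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice score from [1]: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def score_collocations(edges, lemma_freq, min_score=4.0):
    """Score (head, relation, dependent) edges and keep the strong ones.

    edges      -- iterable of (head_lemma, relation, dependent_lemma) tuples
                  (hypothetical layout, one tuple per edge occurrence)
    lemma_freq -- mapping from lemma to its corpus frequency (assumed input)
    min_score  -- logDice threshold; 4 mirrors the cut-off mentioned above
    """
    pair_freq = Counter(edges)
    scored = {}
    for (head, relation, dependent), f_xy in pair_freq.items():
        score = log_dice(f_xy, lemma_freq[head], lemma_freq[dependent])
        if score >= min_score:
            scored[(head, relation, dependent)] = score
    return scored


# Made-up example data: three "pić -obj-> kawa" edges and one weak edge.
example_edges = [("pić", "obj", "kawa")] * 3 + [("być", "nsubj", "kawa")]
example_freq = {"pić": 3, "kawa": 4, "być": 4092}
example_scores = score_collocations(example_edges, example_freq)
```

When both words occur only together (f_xy = f_x = f_y), the fraction equals 1 and the score is exactly 14, which matches the theoretical maximum mentioned above. In the made-up data, the weak edge scores 3 and is dropped by the threshold.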
The view for a given word's collocations consists of a set of lists, one for each syntactic function. Each list is sorted by logDice value, from the strongest to the weakest collocations. For each collocation, its absolute and relative frequency (per million words) in the given corpus is also provided. A few example concordances for a given word combination can be displayed by clicking the symbol. Collocations that are significantly more frequent in specific text types within a given corpus are also marked. Individual text types are indicated by graphical symbols (for example, one for press), which are explained in a tooltip shown on hover. The absence of such symbols means that the collocation occurs relatively evenly across all text types.
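The structure of this view can be sketched as follows: collocations are grouped by syntactic function, each group is sorted by logDice in descending order, and the absolute frequency is converted to a frequency per million words. The row layout, the function name, and the corpus size used here are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def build_view(rows, corpus_size):
    """Group collocations by syntactic function and sort each group by logDice.

    rows        -- iterable of (relation, collocate, abs_freq, log_dice) tuples
                   (hypothetical layout)
    corpus_size -- number of words in the corpus (assumed to be known)
    """
    view = defaultdict(list)
    for relation, collocate, abs_freq, score in rows:
        # Relative frequency: occurrences per million words of the corpus.
        per_million = abs_freq * 1_000_000 / corpus_size
        view[relation].append((collocate, score, abs_freq, per_million))
    for group in view.values():
        # Strongest collocations first.
        group.sort(key=lambda row: row[1], reverse=True)
    return dict(view)


# Made-up rows for one word in a corpus of 2,000,000 words.
example_rows = [
    ("obj", "kawa", 50, 9.1),
    ("obj", "herbata", 30, 9.4),
    ("advmod", "często", 20, 6.0),
]
example_view = build_view(example_rows, corpus_size=2_000_000)
```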
The application also allows for comparing the collocations of two words from a given corpus, or the collocations of the same word in two different corpora (if the word appears in both). The comparison is presented as lists similar to those for a single word, but with the collocations most characteristic of each of the two words/corpora at opposite ends. The number displayed next to each collocation is the difference between its logDice values for the two compared words/corpora (if a collocate is absent from the collocation list of one of them, its logDice there is treated as 0). Negative values therefore correspond to collocations that are more typical of the word/corpus against which the comparison is being made.
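A minimal sketch of this comparison, assuming each side is given as a mapping from collocate to logDice value (the function name and the example values are hypothetical):

```python
def compare_collocations(scores_a, scores_b):
    """Return (collocate, difference) pairs sorted from most typical of A
    to most typical of B.

    A collocate missing from one side is treated as logDice = 0 there,
    as described in the text above.
    """
    collocates = set(scores_a) | set(scores_b)
    diffs = {
        c: scores_a.get(c, 0.0) - scores_b.get(c, 0.0)
        for c in collocates
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Made-up logDice values for two compared words (or the same word in two corpora).
example_diff = compare_collocations(
    {"x": 9.0, "y": 5.0},
    {"y": 7.5, "z": 6.0},
)
```

In this made-up example, "x" ends up at one end of the list with a difference of 9.0, "z" at the other end with -6.0, and the shared collocate "y" in between with -2.5.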
All calculations in the application are based on texts that have been automatically processed and annotated, from the modernisation of spelling (in the case of older texts) to morphological and syntactic tagging and lemmatisation. We have made every effort to ensure that the results of these automatic analyses are as accurate as possible. Nevertheless, the collocation lists may contain some errors.
[1] P. Rychlý, A lexicographer-friendly association score, in Proceedings of Recent Advances in Slavonic Natural Language Processing, P. Sojka and A. Horák, eds. Brno, Czech Republic: Masaryk University, 2008, pp. 6–9.