Toward a dynamic interface of lexical synchrony and diachrony
National Taiwan University
2022-11-02
Sinica BOW
(2000-2004; C.-R. Huang, Chang, and Lee 2004)
Chinese Wordnet at Academia Sinica
(2005-2010)
Chinese Wordnet at NTU Taiwan
(2010-)
Note
Note that there is more than one Chinese Wordnet.
Two core components:
Follow PWN (in comparison with Sinica BOW)
Word segmentation principle (C.-R. Huang, Hsieh, and Chen 2017)
Corpus-based decision
Manually created (sense distinction, gloss with controlled vocabulary, etc.)
Some new perspectives in CWN
sense granularity, relation discovery, and gloss with annotation
埔里種的【茶】很好喝 ("The [tea] grown in Puli tastes good")
Co-predicative Nouns
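A phenomenon in which two or more predicates applied to the same argument seem to require that it denote different things (e.g., 種 "grow" selects the tea plant, while 好喝 "tastes good" selects the beverage).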
Gloss ("lexicographic definition") is carefully controlled with limited vocabulary and patterns; e.g., senses with the VH tag (i.e., stative intransitive verbs) are glossed with “形容 or 比喻形容 …” ("describes" or "metaphorically describes" …).
Collocational information and pragmatic information ("tone", etc.) are recorded as additional annotation.
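The controlled-gloss convention for VH senses can be checked mechanically. The following is a minimal sketch, assuming each sense is available as a (POS tag, gloss) pair; the function name, the data layout, and the sample glosses are hypothetical and are not taken from CWN itself. Only the two prefixes come from the text above.

```python
# Prefixes named in the text for VH (stative intransitive verb) glosses.
VH_PREFIXES = ("形容", "比喻形容")


def follows_vh_pattern(pos: str, gloss: str) -> bool:
    """Return True if the gloss respects the VH convention (or is not VH)."""
    if pos != "VH":
        return True  # this sketch only covers the VH convention
    # str.startswith accepts a tuple and matches any of its members
    return gloss.startswith(VH_PREFIXES)


# Made-up entries for illustration only: (POS tag, gloss)
entries = [
    ("VH", "形容高興的。"),
    ("VH", "比喻形容情況危急。"),
    ("VH", "高興的樣子。"),  # breaks the convention
    ("Na", "茶樹的葉子。"),  # not VH, so not checked
]

violations = [entry for entry in entries if not follows_vh_pattern(*entry)]
print(violations)
```

A check of this kind only verifies the surface pattern of a gloss; it says nothing about whether the sense distinction itself is right.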
Figure 2 shows the lemma and sense data distribution
Figure 3 further demonstrates the distribution of different types of relations
Gloss statistics
SemCor
manually sense-tagged corpus
WSD: The Problems
The task as currently defined does not allow for generalization over different words \(\rightarrow\) learning is word-specific.
Training data are needed for every sense of every word, so unknown words cannot be handled. (Unsupervised approaches perform consistently worse than supervised approaches.)
Cannot capture the sense alternation regularities
Gradience is found in many linguistic categories.
Regular polysemy detection: using word vectors (Di Pietro 2013) or sense vectors (Lopukhina and Lopukhin 2016) to detect sense alternations (such as FOOD or ANIMAL)
Recent contextualized vector representations could help us locate where a word meaning lies on the continuum (i.e., in the multidimensional semantic space).
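One way to picture vector-based detection of a regular alternation such as ANIMAL/FOOD is to compare the offset between a word's two sense vectors across words: if the alternation is regular, the offsets should point in a similar direction. The sketch below is only an illustration under that assumption; it is not the method of the cited papers, and the three-dimensional vectors, sense labels, and English example words are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def offset(source, target):
    """Vector pointing from the source sense to the target sense."""
    return [t - s for s, t in zip(source, target)]


# Toy sense vectors (hypothetical values, for illustration only).
sense_vectors = {
    "chicken": {"ANIMAL": [0.9, 0.1, 0.2], "FOOD": [0.2, 0.8, 0.3]},
    "lamb": {"ANIMAL": [0.8, 0.2, 0.1], "FOOD": [0.1, 0.9, 0.2]},
    "bank": {"FINANCE": [0.1, 0.1, 0.9], "RIVER": [0.7, 0.1, 0.8]},
}

chicken_shift = offset(sense_vectors["chicken"]["ANIMAL"], sense_vectors["chicken"]["FOOD"])
lamb_shift = offset(sense_vectors["lamb"]["ANIMAL"], sense_vectors["lamb"]["FOOD"])
bank_shift = offset(sense_vectors["bank"]["FINANCE"], sense_vectors["bank"]["RIVER"])

# Words sharing a regular alternation should have similar shifts;
# an unrelated (homonymous) pair should not.
regular = cosine(chicken_shift, lamb_shift)
irregular = cosine(chicken_shift, bank_shift)
print(regular > irregular)
```

With real sense vectors the contrast would be far noisier than in this toy setup, so any threshold on the similarity would need to be tuned empirically.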
GlossBert
(L. Huang et al. 2019)
The GlossBert model on CWN gloss + SemCor reports 82% accuracy.
Now we have a chance to empirically explore the dominance of word senses, which is essential for both lexical semantic and psycholinguistic studies.
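GlossBert treats WSD as a sentence-pair classification problem: the context sentence is paired with the gloss of each candidate sense, and the model decides whether the pair matches. The sketch below shows how such context-gloss pairs might be constructed before being passed to a classifier. The function name, sense identifiers, and glosses are hypothetical, the marking of the target word follows the weak-supervision style described in the GlossBert paper as far as I understand it, and no model is trained here.

```python
def build_context_gloss_pairs(tokens, target_index, sense_glosses, gold_sense=None):
    """Pair one context sentence with the gloss of every candidate sense.

    tokens        : list of word tokens in the context sentence
    target_index  : position of the word to disambiguate
    sense_glosses : dict mapping sense id -> gloss text
    gold_sense    : annotated sense id, or None for unlabeled data
    """
    target = tokens[target_index]
    # Mark the target word with quotation marks so the model can locate it.
    marked = tokens[:target_index] + [f'"{target}"'] + tokens[target_index + 1:]
    context = " ".join(marked)

    pairs = []
    for sense_id, gloss in sense_glosses.items():
        label = None if gold_sense is None else int(sense_id == gold_sense)
        pairs.append(
            {
                "context": context,
                "gloss": f"{target} : {gloss}",  # target word prepended to the gloss
                "sense_id": sense_id,
                "label": label,  # 1 = correct sense, 0 = other sense
            }
        )
    return pairs


# Hypothetical sense inventory for illustration only.
glosses = {
    "tea.beverage": "a drink made by steeping leaves in hot water",
    "tea.plant": "a shrub whose leaves are harvested",
}
pairs = build_context_gloss_pairs(
    ["She", "drank", "a", "cup", "of", "tea"], 5, glosses, gold_sense="tea.beverage"
)
for pair in pairs:
    print(pair["label"], "|", pair["context"], "|", pair["gloss"])
```

Because every sense contributes one pair, the scores over a word's pairs can also be read as a graded profile over its senses rather than a single hard decision.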
Character Jacobian: the Chinese character (root morpheme) lies at the meaning core (Tseng and Hsieh 2022)
gloss2vec
(Hsieh et al. 2022, submitted)
Gradual change and continual variation
Fusion of Archaic and Modern senses
The puzzle of affixoids
The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. (Tseng et al. 2020)
word, construction, word senses
We've built a parallel corpus of Mandarin as used in Mainland China and Taiwan, with data collected from movie titles and TED talks.
An intralingual machine translation system has been developed, and a sense mapping/induction system is in progress.
The haunting issues of wordhood (and the beautiful scene it has brought us into)
Chinese Wordnet beyond word
Form-meaning pairs
Construction has its own sense
Need to broaden the concept of word
Construction Sense Disambiguation
A synset-structured (lexicalized) ontology doesn't (seem to) work well
unlabeled root vs embodied body
Chinese(s) are neighbors themselves.
Wordnet framework serves as a mirror for Chinese synchronic and diachronic varieties.