Chinese Wordnet 2.0

Toward a dynamic interface of lexical synchrony and diachrony

Shu-Kai Hsieh

National Taiwan University

2022-11-02

Outline

  • Background
  • Chinese Wordnet 2.0: Concepts and Implementation
  • Challenges and Current Works

History

Sinica BOW (2000-2004; C.-R. Huang, Chang, and Lee 2004)

Chinese Wordnet at Academia Sinica (2005-2010)

Chinese Wordnet at NTU Taiwan (2010-)

Note

Note that there is more than one Chinese Wordnet.

WordNet architecture

Two core components:

  • Synset (synonymous set)
  • Paradigmatic lexical (semantic) relations: hyponymy/hypernymy, meronymy/holonymy, etc.
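The two components can be pictured as a small graph structure. The sketch below is only illustrative: the class name, field names, relation labels, and the two toy entries are assumptions, not the CWN schema.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous lemmas sharing one gloss (illustrative only)."""
    synset_id: str
    lemmas: list
    gloss: str
    # relation name -> list of target synset ids
    relations: dict = field(default_factory=dict)


def add_hypernym(store, child_id, parent_id):
    """Record hypernymy on the child and the inverse hyponymy on the parent."""
    store[child_id].relations.setdefault("hypernym", []).append(parent_id)
    store[parent_id].relations.setdefault("hyponym", []).append(child_id)


def hypernym_chain(store, synset_id):
    """Follow the first hypernym link upward until no hypernym is left."""
    chain = []
    seen = {synset_id}
    current = synset_id
    while store[current].relations.get("hypernym"):
        current = store[current].relations["hypernym"][0]
        if current in seen:  # guard against accidental cycles
            break
        seen.add(current)
        chain.append(current)
    return chain


# Toy entries, made up for illustration.
store = {
    "s1": Synset("s1", ["beverage", "drink"], "a liquid for drinking"),
    "s2": Synset("s2", ["tea"], "a beverage made by steeping tea leaves"),
}
add_hypernym(store, "s2", "s1")
```

Storing each relation together with its inverse keeps lookups symmetric: the hyponyms of a synset can be read directly without scanning the whole store.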

Chinese Wordnet

  • Follows PWN (Princeton WordNet), in comparison with Sinica BOW

  • Word segmentation principle (C.-R. Huang, Hsieh, and Chen 2017)

  • Corpus-based decision

  • Manually created (sense distinction, gloss with controlled vocabulary, etc)

The status quo

Theories

Some new perspectives in CWN

sense granularity, relation discovery, and gloss with annotation

Meaning facets vs senses

埔里的【茶】很好喝 (‘The 【tea】 from Puli tastes very good’)

Co-predicative Nouns

A phenomenon in which two or more predicates seem to require that their shared argument denote different things.

Leveraging morpho-semantic relations

(a) Hsieh and Chang (2014)

(b) Tseng and Hsieh (2019)

Figure 1: Different leveraging methods

Glosses as lexicographic resources with add-on annotations

  • The gloss (‘lexicographic definition’) is carefully controlled, with a limited vocabulary and fixed patterns, e.g.,

    • Verbs with the VH tag (i.e., stative intransitive verbs) are glossed with “形容 …” (‘describes …’) or “比喻形容 …” (‘figuratively describes …’).
    • Adverbs are glossed with “表 …” (‘indicates …’).
  • Collocational and pragmatic information (‘tone’, etc.) is recorded as additional annotation.
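Because the gloss patterns are fixed, they can be checked mechanically. The following is a minimal sketch, assuming only the two patterns named above; the rule table, the function name, the use of "D" as the adverb tag, and the sample glosses are all hypothetical.

```python
# Prefix rules taken from the two patterns mentioned above.
# Using "D" as the adverb tag is an assumption for this sketch.
GLOSS_PREFIXES = {
    "VH": ("形容", "比喻形容"),  # stative intransitive verbs
    "D": ("表",),               # adverbs
}


def gloss_follows_pattern(pos, gloss):
    """Return True if the gloss starts with a prefix allowed for its POS tag.

    POS tags without a recorded rule are accepted unconditionally.
    """
    prefixes = GLOSS_PREFIXES.get(pos)
    if prefixes is None:
        return True
    # str.startswith accepts a tuple of alternatives
    return gloss.startswith(prefixes)


# Made-up glosses, for illustration only.
print(gloss_follows_pattern("VH", "形容心情愉快"))
print(gloss_follows_pattern("VH", "心情愉快"))
```

A check of this kind could flag glosses that drift from the controlled patterns before they enter the resource.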

Data Statistics

Zipf’s law (no surprise)

  • Most words have a small number of senses (Zipf’s law)

Comparison with others

  • CWN is the best candidate


Data summary 1/3

Figure 2 shows the lemma and sense data distribution

Figure 2: CWN sense data summary

Data summary 2/3

Figure 3 further demonstrates the distribution of different types of relations

Figure 3: CWN relation data summary

Data summary 3/3

Gloss statistics

GraphAPI and Visualization

Computational Semantic Representations

  • Human-curated and machine-generated lexical semantic resources
  • Open-sourced (GitHub)

SemCor manually sense-tagged corpus

Word Sense Tagger

WSD: The Problems

  • The task as currently defined does not allow for generalization over different words → learning is word-specific.

  • Training data are needed for every sense of every word, and unknown words cannot be handled at all. (Unsupervised approaches perform consistently worse than supervised ones.)

  • It cannot capture regularities in sense alternation.

Distributed approach to model the ‘Gradience’

  • Gradience is found in many linguistic categories.

  • Regular polysemy detection: using word vectors (Di Pietro 2013) or sense vectors (Lopukhina and Lopukhin 2016) to detect sense alternations (such as FOOD or ANIMAL).

  • Recent (contextualized) vector representations could help us locate where a word meaning lies on the continuum (i.e., in the multidimensional semantic space).
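One way to make "locating a meaning on the continuum" concrete is to score a token vector against one prototype vector per sense and read the result as graded membership. This is a sketch under stated assumptions, not the method of the cited papers: the two-dimensional vectors, the sense labels, and the softmax temperature are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def graded_sense_membership(token_vec, sense_prototypes, temperature=0.1):
    """Turn similarities to sense prototypes into weights that sum to 1."""
    sims = {s: cosine(token_vec, p) for s, p in sense_prototypes.items()}
    top = max(sims.values())
    # subtract the maximum before exponentiating for numerical stability
    exps = {s: math.exp((x - top) / temperature) for s, x in sims.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


# Hypothetical prototypes for an ANIMAL/FOOD alternation.
prototypes = {"ANIMAL": [1.0, 0.0], "FOOD": [0.0, 1.0]}
in_between = graded_sense_membership([1.0, 1.0], prototypes)
clear_case = graded_sense_membership([1.0, 0.1], prototypes)
```

A token halfway between the two prototypes receives equal weights, while a token close to one prototype is assigned almost entirely to that sense.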

WSD with Transformer (1)

  • Leveraging wordnet glosses using GlossBERT (L. Huang et al. 2019)
    • a BERT model for word sense disambiguation with gloss knowledge.
  • Our extended GlossBERT model on CWN gloss+SemCor reports 82% accuracy.
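GlossBERT casts WSD as sentence-pair classification over context-gloss pairs. The sketch below shows only the pair-construction step, not the model: the function name, the dictionary keys, the sense ids, and the English placeholder glosses are assumptions and do not reflect the CWN data format.

```python
def build_context_gloss_pairs(context, target, sense_inventory, gold_sense_id=None):
    """Build one context-gloss pair per candidate sense of the target word.

    When a gold sense is given, the label is 1 for the matching gloss and 0
    for every other gloss; at prediction time the label is left as None.
    """
    pairs = []
    for sense_id, gloss in sense_inventory[target].items():
        label = None if gold_sense_id is None else int(sense_id == gold_sense_id)
        pairs.append({
            "sense_id": sense_id,
            "text_a": context,                # sentence containing the target
            "text_b": f"{target}: {gloss}",   # gloss prefixed with the target
            "label": label,
        })
    return pairs


# Hypothetical two-sense inventory with placeholder glosses.
inventory = {"開": {"sense_a": "to open something", "sense_b": "to blossom"}}
pairs = build_context_gloss_pairs("花開了", "開", inventory, gold_sense_id="sense_b")
```

Each pair would then be encoded as a two-segment input for a BERT-style classifier, and the candidate sense with the highest positive score is chosen.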

Word Sense Tagger

  • APIs (GlossBert version) released in 2021
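# pip install -U DistilTag CwnSenseTagger
import DistilTag
import CwnSenseTagger

# Download the pretrained models (only needed once).
DistilTag.download()
CwnSenseTagger.download()

# Import the tagger class and the sense-tagging function.
from DistilTag import DistilTag
from CwnSenseTagger import senseTag

# Segment and POS-tag the raw text, then assign CWN senses.
tagger = DistilTag()
tagged = tagger.tag("<raw text>")
sense_tagged = senseTag(tagged)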

Word Sense frequencies

  • We now have the chance to explore the dominance of word senses empirically, which is essential for both lexical semantic and psycholinguistic studies.

    • e.g., for ‘開’ (kai1, ‘open’), the ‘blossom’ sense is (surprisingly) more dominant than the others (based on 300 randomly chosen sentences from the ASBC corpus).
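Sense dominance of this kind can be estimated by counting sense tags per lemma. The following is a minimal sketch, assuming the tagged corpus has been flattened into (word, sense label) pairs; that format, the labels, and the toy tokens are assumptions, not the tagger's actual output.

```python
from collections import Counter


def sense_distribution(tagged_tokens, lemma):
    """Relative frequency of each sense of `lemma`, most frequent first."""
    counts = Counter(
        sense for word, sense in tagged_tokens
        if word == lemma and sense is not None
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sense: n / total for sense, n in counts.most_common()}


# Toy data, made up for illustration.
tokens = [("開", "blossom"), ("開", "open"), ("開", "blossom"), ("花", "flower")]
dist = sense_distribution(tokens, "開")
dominant = next(iter(dist))
```

The first key of the returned dictionary is the dominant sense, so the ranking for any lemma can be read off directly.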

Word Sense Embeddings

  • We used our tagger to automatically tag ca. 5 million word tokens in the Academia Sinica Balanced Corpus, and indexed the annotated senses.
    • Word sense frequency data are calculated from the tags.
    • The index is tokenized and word2vec is used to obtain the word sense embeddings.
  • Character Jacobian: the Chinese character (root morpheme) lies at the core of word meaning (Tseng and Hsieh 2022)

  • gloss2vec (Hsieh et al. 2022, submitted)
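The "tokenize the index" step for sense embeddings can be sketched as rewriting each sense-tagged word into a single token, so that a word2vec implementation learns one vector per word sense. The (word, sense id) input format, the separator, and the sense ids below are assumptions made for illustration.

```python
def to_sense_tokens(tagged_sentence, sep="#"):
    """Fuse each (word, sense_id) pair into one token; untagged words stay as-is."""
    return [
        word if sense_id is None else f"{word}{sep}{sense_id}"
        for word, sense_id in tagged_sentence
    ]


def build_sense_corpus(tagged_sentences):
    """Convert every tagged sentence into a list of sense tokens."""
    return [to_sense_tokens(sentence) for sentence in tagged_sentences]


# Toy sentences with hypothetical sense ids.
tagged_sentences = [
    [("花", "01"), ("開", "02"), ("了", None)],
    [("開", "01"), ("門", "01")],
]
corpus = build_sense_corpus(tagged_sentences)
```

The resulting token lists are in the sentence-per-list form that word2vec implementations such as gensim's `Word2Vec` accept as training input, so ‘開#01’ and ‘開#02’ end up with separate vectors.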

Chinese(s) in Synchrony and Diachrony

Gradual change and continual variation

Contemporary Mandarin Varieties

  • Fusion of Archaic and Modern senses

    • resulting in expressive vs. receptive word senses, e.g., 【打】水 (‘to pump water out of a well’).

Contemporary Mandarin Varieties

The puzzle of affixoid

The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. (Tseng et al. 2020)

  • E.g. 【化】(huà, ‘-ize’)

Change of affixoid status in diachrony

  • The indeterminate nature of Chinese affixoids
  • Sense status of 家 jiā from the Tang dynasty to the 1980s

Dynamics in Contemporary Mandarin Chinese(s)

【真香】 (zhēn xiāng, ‘so appetizing’)

Dynamics in Contemporary Mandarin Chinese(s)

word, construction, word senses

  • Originally appeared as a fixed phrase in MC (it cannot be replaced with synonymous phrases such as 好香).
  • Gradually spread into TM, where it diversified into new construction senses as well as word senses.

World Chinese(s) and Construction Grammar (CxG)

  • We’ve built a parallel corpus of Mandarin as used in Mainland China and Taiwan.

  • The corpus data were collected from movie titles and TED talks.

  • An intralingual machine translation system has been developed, and a sense mapping/induction system is in progress.

Challenges and On-going Works

The haunting issues of wordhood (and the beautiful scene it has brought us into)

Re-theorizing

Chinese Wordnet beyond word

  • Form-meaning pairs

  • Construction has its own sense

    • (‘還在那邊’, lit. ‘still over there’)
  • Need to broaden the concept of word

  • Construction Sense Disambiguation

Re-structuring Ontologies

  • A synset-structured (lexicalized) ontology does not seem to work well

  • unlabeled root vs embodied body

SUMO

DOLCE

quantum

Conclusions

  • Chinese(s) are neighbors themselves.

  • The wordnet framework serves as a mirror for Chinese synchronic and diachronic varieties.

References

Di Pietro, Giulia. 2013. “Regular Polysemy: A Distributional Semantic Approach.”
Hsieh, Shu-Kai, and Yu-Yun Chang. 2014. “Leveraging Morpho-Semantics for the Discovery of Relations in Chinese Wordnet.” In Proceedings of the Seventh Global Wordnet Conference, 283–89.
Huang, Chu-Ren, Ru-Yng Chang, and Hshiang-Pin Lee. 2004. “Sinica BOW (Bilingual Ontological Wordnet): Integration of Bilingual WordNet and SUMO.” In LREC.
Huang, Chu-Ren, Shu-Kai Hsieh, and Keh-Jiann Chen. 2017. Mandarin Chinese Words and Parts of Speech: A Corpus-Based Study. Routledge.
Huang, Luyao, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. “GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge.” arXiv Preprint arXiv:1908.07245.
Lopukhina, Anastasiya, and Konstantin Lopukhin. 2016. “Regular Polysemy: From Sense Vectors to Sense Patterns.” In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-v), 19–23.
Tseng, Yu-Hsiang, and Shu-Kai Hsieh. 2019. “Augmenting Chinese WordNet Semantic Relations with Contextualized Embeddings.” In Proceedings of the 10th Global Wordnet Conference, 151–59.
———. 2022. “Character Jacobian: Modeling Chinese Character Meanings with Deep Learning Model.” In Proceedings of the 29th International Conference on Computational Linguistics, 152–62.
Tseng, Yu-Hsiang, Shu-Kai Hsieh, Pei-Yi Chen, et al. 2020. “Computational Modeling of Affixoid Behavior in Chinese Morphology.” In Proceedings of the 28th International Conference on Computational Linguistics, 2879–88.