Chinese Wordnet 2.0

Toward a dynamic interface of lexical synchrony and diachrony

Shu-Kai Hsieh

National Taiwan University

2022-11-02

Outline

Background
Chinese Wordnet 2.0: Concepts and Implementation
Challenges and Current Works

History

Sinica BOW (2000-2004; C.-R. Huang, Chang, and Lee 2004)

Chinese Wordnet at Academia Sinica (2005-2010)

Chinese Wordnet at NTU Taiwan (2010-)

Note

Note that there is more than one Chinese Wordnet.

WordNet architecture

Two core components:

Synset (synonymous set)
Paradigmatic lexical (semantic) relations: hyponymy/hypernymy, meronymy/holonymy, etc.
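The two components can be pictured with a minimal sketch. This is not the CWN data model or API: the `Synset` class, the helper functions, the ids `s1` to `s3`, and the English glosses are all hypothetical, chosen only to show synsets as nodes and hypernymy/hyponymy as paired, directed links.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous lemmas sharing one gloss (hypothetical minimal model)."""
    synset_id: str
    lemmas: list[str]
    gloss: str
    # relation name -> ids of the target synsets
    relations: dict[str, list[str]] = field(default_factory=dict)


def add_hypernym(net: dict[str, Synset], hypo_id: str, hyper_id: str) -> None:
    """Record hypernymy and its inverse (hyponymy) in one step."""
    net[hypo_id].relations.setdefault("hypernym", []).append(hyper_id)
    net[hyper_id].relations.setdefault("hyponym", []).append(hypo_id)


def hypernym_chain(net: dict[str, Synset], synset_id: str) -> list[str]:
    """Follow the first hypernym link upward until a root (or a cycle) is reached."""
    chain, seen, current = [], {synset_id}, synset_id
    while True:
        parents = net[current].relations.get("hypernym", [])
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


# Toy network with made-up ids and glosses.
net = {
    "s1": Synset("s1", ["狗", "犬"], "a domesticated canine"),
    "s2": Synset("s2", ["動物"], "a living organism that can move"),
    "s3": Synset("s3", ["生物"], "a living thing"),
}
add_hypernym(net, "s1", "s2")
add_hypernym(net, "s2", "s3")
chain = hypernym_chain(net, "s1")
```

Storing each relation together with its inverse keeps traversal cheap in both directions, which is why the sketch writes both links at once.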

Chinese Wordnet

Follow PWN (in comparison with Sinica BOW)
Word segmentation principle (C.-R. Huang, Hsieh, and Chen 2017)
Corpus-based decision
Manually created (sense distinction, gloss with controlled vocabulary, etc.)

The status quo

latest release 2022
website online

CWN 2.0 Programmable Search

The most comprehensive and fine-grained sense repository and network in Chinese
API and doc freely available

Theories

Some new perspectives in CWN

sense granularity, relation discovery, and gloss with annotation

Leveraging morpho-semantic relations

Gloss as lexicographic resources with add-ons annotations

Gloss (‘lexicographic definition’) is carefully controlled with a limited vocabulary and fixed patterns, e.g.,
- Verbs with the VH tag (i.e., stative intransitive verbs) are glossed with “形容 …” (‘describes …’) or “比喻形容 …” (‘figuratively describes …’).
- Adverbs are glossed with “表 …” (‘indicates …’).
Collocational and pragmatic information (‘tone’, etc.) is recorded as additional annotation.
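Because the gloss patterns are this regular, they can be checked mechanically. The following is a minimal sketch, not part of the CWN tooling: the function name, the use of `D` as the adverb tag, and the sample glosses are assumptions made for illustration.

```python
# POS tag -> gloss openings allowed for that tag.
# "VH" is the tag named on the slide; "D" for adverbs is an assumption.
GLOSS_PREFIXES = {
    "VH": ("形容", "比喻形容"),
    "D": ("表",),
}


def gloss_follows_pattern(pos: str, gloss: str) -> bool:
    """Return True if the gloss opens with one of the patterns for its POS.

    POS tags with no recorded pattern are accepted unconditionally.
    """
    prefixes = GLOSS_PREFIXES.get(pos)
    if prefixes is None:
        return True
    return gloss.startswith(prefixes)


# Made-up glosses, used only to exercise the check.
ok_stative = gloss_follows_pattern("VH", "形容心情愉快。")
ok_figurative = gloss_follows_pattern("VH", "比喻形容事情順利。")
bad_adverb = gloss_follows_pattern("D", "程度很高。")
```

A check like this could flag glosses that drift from the controlled patterns before they enter the database.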

Data Statistics

Zipf’s law (no surprise)

Most words have a small number of senses (Zipf’s law)
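The shape of this distribution can be checked by counting how many lemmas have 1, 2, 3, ... senses. The sketch below assumes a plain dictionary from lemma to sense ids; the inventory shown is a toy example, not CWN data.

```python
from collections import Counter


def sense_count_distribution(sense_inventory: dict[str, list[str]]) -> Counter:
    """Map 'number of senses' -> 'number of lemmas with that many senses'."""
    return Counter(len(senses) for senses in sense_inventory.values())


# Toy inventory with placeholder sense ids.
toy_inventory = {
    "開": ["s1", "s2", "s3", "s4"],
    "打": ["s1", "s2", "s3"],
    "花": ["s1", "s2"],
    "門": ["s1"],
    "犬": ["s1"],
    "井": ["s1"],
}
distribution = sense_count_distribution(toy_inventory)

# List the distribution from fewest to most senses.
for n_senses, n_lemmas in sorted(distribution.items()):
    print(f"{n_senses} sense(s): {n_lemmas} lemma(s)")
```

On a real inventory, plotting these counts on log-log axes is the usual way to inspect how closely the curve follows a power law.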

Comparison with others

CWN is the best candidate


Data summary 1/3

Figure 2 shows the lemma and sense data distribution

Figure 2: cwn sense data summary

Data summary 2/3

Figure 3 further demonstrates the distribution of different types of relations

Figure 3: cwn relation data summary

Data summary 3/3

Gloss statistics

GraphAPI and Visualization

Computational Semantic Representations

human curated and machine generated lexical semantic resources
open-sourced (github)

`SemCor` manually sense-tagged corpus

Word Sense Tagger

WSD: The Problems

The task as currently defined does not allow for generalization over different words \(\rightarrow\) learning is word-specific.
Need training data for every sense of every word, and no chance with unknown words. (unsupervised approaches perform consistently worse than supervised approaches)
Cannot capture the sense alternation regularities

Distributed approach to model the ‘Gradience’

Gradience is found in many linguistic categories.
Regular polysemy detection: using word vectors (Di Pietro 2013) or sense vectors (Lopukhina and Lopukhin 2016) to detect sense alternations (such as FOOD or ANIMAL).
Recent contextualized vector representations could help us locate where a word meaning lies on the continuum (i.e., in the multidimensional semantic space).
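One simple way to picture "locating a meaning on the continuum" is to compare the vector of a single usage with prototype vectors for each sense and turn the similarities into graded memberships. This is an illustrative sketch only, not the method used in the cited work: the two-dimensional vectors, the sense labels, and the softmax weighting are assumptions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def graded_membership(usage_vec: list[float],
                      prototypes: dict[str, list[float]]) -> dict[str, float]:
    """Softmax over cosine similarities: one weight per sense, summing to 1."""
    sims = {sense: cosine(usage_vec, proto) for sense, proto in prototypes.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {sense: math.exp(s) / z for sense, s in sims.items()}


# Toy prototypes; in practice these would be averaged contextual embeddings.
prototypes = {"open": [1.0, 0.0], "blossom": [0.0, 1.0]}
between = graded_membership([1.0, 1.0], prototypes)  # equidistant usage
clear = graded_membership([1.0, 0.0], prototypes)    # usage aligned with "open"
```

A usage that sits between two prototypes receives comparable weights for both, which is the gradient behaviour a hard sense label cannot express.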

WSD with Transformer (1)

Leveraging wordnet glosses using GlossBERT (L. Huang et al. 2019)
- a BERT model for word sense disambiguation with gloss knowledge.
Our extended GlossBERT model on CWN glosses + SemCor reports 82% accuracy.
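GlossBERT casts WSD as sentence-pair classification: the context containing the target word is paired with each candidate gloss, and the model decides whether the pair matches. The sketch below only builds such context-gloss pairs. It is not the extended model reported above; the function name, the sense ids, and the English glosses are hypothetical.

```python
def build_context_gloss_pairs(tokens, target_index, candidate_senses, gold_sense_id=None):
    """Build one (context, gloss) pair per candidate sense of the target word.

    The target is wrapped in quotation marks inside the context and repeated
    in front of the gloss, following the weak-supervision idea of GlossBERT.
    `label` is 1 for the gold sense, 0 otherwise, or None when no gold is given.
    """
    target = tokens[target_index]
    marked = tokens[:target_index] + ['"' + target + '"'] + tokens[target_index + 1:]
    context = "".join(marked)  # Chinese text: join tokens without spaces
    pairs = []
    for sense_id, gloss in candidate_senses.items():
        label = None if gold_sense_id is None else int(sense_id == gold_sense_id)
        pairs.append({
            "context": context,
            "gloss": f"{target}: {gloss}",
            "sense_id": sense_id,
            "label": label,
        })
    return pairs


# Made-up sense ids and glosses for 開.
candidates = {
    "kai_open": "to make something no longer closed",
    "kai_blossom": "(of flowers) to bloom",
}
pairs = build_context_gloss_pairs(["花", "開", "了"], 1, candidates, gold_sense_id="kai_blossom")
```

At prediction time each pair would be scored by a sequence-pair classifier, and the sense whose pair scores highest would be selected.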

Word Sense Tagger

APIs (GlossBERT version) released in 2021

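# pip install -U DistilTag CwnSenseTagger
import DistilTag
import CwnSenseTagger

# Download the pretrained models (needed only once).
DistilTag.download()
CwnSenseTagger.download()

from DistilTag import DistilTag
from CwnSenseTagger import senseTag

# Segment and POS-tag the raw text, then assign a CWN sense to each token.
tagger = DistilTag()
tagged = tagger.tag("<raw text>")
sense_tagged = senseTag(tagged)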

Word Sense frequencies

We now have a chance to explore the dominance of word senses empirically, which is essential for both lexical semantic and psycholinguistic studies.
- e.g., for ‘開’ (kai1, ‘open’), the ‘blossom’ sense is (surprisingly) more dominant than the others (based on 300 randomly chosen sentences from the ASBC corpus)
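Once tokens carry sense tags, sense dominance reduces to counting. The sketch below assumes tagged tokens are available as `(word, sense_id)` pairs, which is a simplification and not necessarily the tagger's real output format; the sense labels and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def sense_frequencies(tagged_tokens):
    """Count sense occurrences per word; tokens without a sense tag are skipped."""
    counts = defaultdict(Counter)
    for word, sense_id in tagged_tokens:
        if sense_id:
            counts[word][sense_id] += 1
    return counts


def dominant_sense(counts, word):
    """Return the most frequent sense of `word` and its share of all tagged uses."""
    total = sum(counts[word].values())
    sense_id, n = counts[word].most_common(1)[0]
    return sense_id, n / total


# Toy data with made-up sense labels.
tagged_tokens = [
    ("花", None), ("開", "blossom"),
    ("他", None), ("開", "open"), ("門", None),
    ("花", None), ("開", "blossom"),
]
counts = sense_frequencies(tagged_tokens)
top_sense, share = dominant_sense(counts, "開")
```

The share is a relative frequency over tagged occurrences only, so its reliability depends on the sample size and on the tagger's accuracy.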

Word Sense Embeddings

We use our tagger to automatically tag ca. 5 million word tokens in the Academia Sinica Balanced Corpus and index the annotated senses.
- Word sense frequency data are calculated from the tags.
- The index is tokenized and word2vec is used to obtain the word sense embeddings.
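The second step can be sketched as follows: each sense-tagged token is rewritten as a single "word plus sense" token, so that word2vec learns one vector per sense rather than per word form. The token format (`開#blossom`), the function names, and the hyperparameters are assumptions. Training needs gensim 4 or later and is therefore kept in a separate function that is not called here.

```python
def to_sense_tokens(tagged_sentences, sep="#"):
    """Turn (word, sense_id) pairs into tokens such as '開#blossom'.

    Tokens without a sense tag are kept as plain words.
    """
    corpus = []
    for sentence in tagged_sentences:
        corpus.append([
            f"{word}{sep}{sense_id}" if sense_id else word
            for word, sense_id in sentence
        ])
    return corpus


def train_sense_embeddings(corpus, dim=100):
    """Train skip-gram word2vec on the sense-indexed corpus (requires gensim >= 4)."""
    from gensim.models import Word2Vec  # imported lazily; only needed for training
    return Word2Vec(sentences=corpus, vector_size=dim, window=5,
                    min_count=1, sg=1, workers=1)


# Toy input with made-up sense labels.
tagged_sentences = [
    [("花", None), ("開", "blossom"), ("了", None)],
    [("他", None), ("開", "open"), ("門", None)],
]
corpus = to_sense_tokens(tagged_sentences)

# With gensim installed:
#   model = train_sense_embeddings(corpus)
#   vector = model.wv["開#blossom"]
```

Because the sense id is fused into the token, no change to the word2vec algorithm is needed; the cost is a sparser vocabulary, which matters for rare senses.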

Other related works

Character Jacobian: Chinese character (root morpheme) lies in the meaning core (Tseng and Hsieh 2022)
gloss2vec (Hsieh et al. 2022, submitted)

Chinese(s) in Synchrony and Diachrony

Gradual change and continual variation

Contemporary Mandarin Varieties

Fusion of Archaic and Modern senses
- resulting in expressive vs. receptive word senses, e.g., 【打】水 (‘to pump water out of a well’).

Contemporary Mandarin Varieties

The puzzle of affixoids

The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. (Tseng et al. 2020)

E.g. 【化】(huà, ‘-ize’)

Change of affixoid status in diachrony

The indeterminate nature of Chinese affixoids
Sense status of 家 jiā from the Tang dynasty to the 1980s

Dynamics in Contemporary Mandarin Chinese(s)

【真香】 (zhēn xiāng, ‘so appetizing’)

Dynamics in Contemporary Mandarin Chinese(s)

word, construction, word senses

Originally appeared as a fixed phrase in MC (it cannot be replaced with synonymous phrases such as 好香).
Gradually spread into TM, but diversified into new construction senses as well as word senses.

World Chinese(s) and Construction Grammar (CxG)

We have built a parallel corpus of Mandarin as used in Mainland China and Taiwan.
Corpus data were collected from movie titles and TED talks.
An intralingual machine translation system has been developed, and a sense mapping/induction system is in progress.

Challenges and On-going Works

The haunting issues of wordhood (and the beautiful scene it has brought us into)

Re-theorizing

Chinese Wordnet beyond word

Form-meaning pairs
Construction has its own sense
- (e.g., ‘還在那邊’, literally ‘still over there’)
Need to broaden the concept of word
Construction Sense Disambiguation

Re-structuring Ontologies

synset-structured (lexicalized) ontology does not seem to work well
unlabeled root vs embodied body

Conclusions

Chinese(s) are neighbors themselves.
Wordnet framework serves as a mirror for Chinese synchronic and diachronic varieties.

References

Di Pietro, Giulia. 2013. “Regular Polysemy: A Distributional Semantic Approach.”

Hsieh, Shu-Kai, and Yu-Yun Chang. 2014. “Leveraging Morpho-Semantics for the Discovery of Relations in Chinese Wordnet.” In Proceedings of the Seventh Global Wordnet Conference, 283–89.

Huang, Chu-Ren, Ru-Yng Chang, and Hshiang-Pin Lee. 2004. “Sinica BOW (Bilingual Ontological Wordnet): Integration of Bilingual WordNet and SUMO.” In LREC.

Huang, Chu-Ren, Shu-Kai Hsieh, and Keh-Jiann Chen. 2017. Mandarin Chinese Words and Parts of Speech: A Corpus-Based Study. Routledge.

Huang, Luyao, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. “GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge.” arXiv Preprint arXiv:1908.07245.

Lopukhina, Anastasiya, and Konstantin Lopukhin. 2016. “Regular Polysemy: From Sense Vectors to Sense Patterns.” In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-v), 19–23.

Tseng, Yu-Hsiang, and Shu-Kai Hsieh. 2019. “Augmenting Chinese WordNet Semantic Relations with Contextualized Embeddings.” In Proceedings of the 10th Global Wordnet Conference, 151–59.

Tseng, Yu-Hsiang, and Shu-Kai Hsieh. 2022. “Character Jacobian: Modeling Chinese Character Meanings with Deep Learning Model.” In Proceedings of the 29th International Conference on Computational Linguistics, 152–62.

Tseng, Yu-Hsiang, Shu-Kai Hsieh, Pei-Yi Chen, et al. 2020. “Computational Modeling of Affixoid Behavior in Chinese Morphology.” In Proceedings of the 28th International Conference on Computational Linguistics, 2879–88.
