hgct

This page showcases HGCT (Hanzi Glyph Corpus Toolkit), a tool for advanced analysis and querying of Chinese text data, demonstrated here on a textbook-derived corpus.

Installation

To start using HGCT, clone the repository and install the required dependencies.

  • Python Version

    Python >= 3.0, < 3.11
    
  • Clone Repository

    git clone git@github.com:lopentu/HanziAnalysisKit.git
    
  • Install Requirements

    cd HanziAnalysisKit && pip install -r requirements.txt
    

Quick Start

Building the Corpus

To prepare your data for use with the HGCT tool, build your corpus from the provided textbook data:

from textbook import build_corpus

# Specify your CSV data file and desired output folder
csv_file = './data/教科書課文.csv'
folder = "textbook_corpus"

# Build the corpus
build_corpus(csv_file, folder)

This will create a textbook_corpus folder in your project directory, containing the processed data ready for analysis with HGCT.

Directory Structure

After building the corpus, your project directory will be organized as follows:

|--- data/
|    |--- 教科書課文.csv
|--- textbook/
|--- textbook_corpus/ # Newly generated
|--- ...

Demonstration

This section provides comprehensive examples demonstrating the capabilities of HGCT for analyzing Chinese text data. Each example includes a brief description and the expected output.

Corpus Reading and Concordancer Initialization

Load your corpus and set up the Concordancer for text analysis:

from hgct import PlainTextReader, Concordancer

# Load the corpus into PlainTextReader
corpus = PlainTextReader(dir_path="textbook_corpus/").corpus

# Initialize the Concordancer with the loaded corpus
c = Concordancer(corpus)
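
The search examples below use a small helper, get_first_n, to collect the first n matches of a CQL query. One minimal sketch (assuming Concordancer exposes a lazy cql_search method; the helper itself is defined here for the demos, not by the library):

```python
from itertools import islice

def get_first_n(cql, n=5, concordancer=None):
    """Collect the first n matches of a CQL query.

    Assumes the Concordancer provides a lazy `cql_search` method;
    defaults to the module-level instance `c` created above.
    """
    cc = concordancer if concordancer is not None else c
    return list(islice(cc.cql_search(cql), n))
```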

Search by Characters

Perform searches based on character-specific criteria.

cql = """
[char="窗"] [char="[一-窗]"]
"""
results = get_first_n(cql, n=5)
results

This prints:

[
    <Concord 門、客廳和{窗戶}旁貼上年畫>,
    <Concord 水匠來,把{窗子}用甎頭堵上>,
    <Concord 子,就指著{窗子}說:「這兩>,
    <Concord 大門上、紙{窗旁},幾乎都貼>,
    <Concord 要把這兩個{窗子}堵起來。」>,
]

To examine detailed information about a specific match, retrieve the data attribute of a match (Concord) object.

result_1 = results[0]
result_1.data

This prints:

{
    'left': '門、客廳和',
    'keyword': '窗戶',
    'right': '旁貼上年畫',
    'position': (3, 11, 0, 25),
    'meta': {
        'id': '4S/90-1-有趣的年畫.txt',
        'time': {
            'label': '教科書課文 - 4下',
            'ord': 4,
            'year': ['90', '88', '69']
        },
        'text': {
            'lesson': '1',
            'title': '有趣的年畫',
            'year': '90'
        }
    },
    'captureGroups': {}
}
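
Since every match exposes left context, keyword, right context, and metadata through its data attribute, results can also be tabulated for inspection. A small sketch (concords_to_df is our own illustrative helper, not part of hgct):

```python
import pandas as pd

def concords_to_df(results):
    """Collect keyword-in-context rows from concordance matches."""
    rows = []
    for r in results:
        d = r.data
        rows.append({
            'left': d['left'],
            'keyword': d['keyword'],
            'right': d['right'],
            'title': d['meta']['text']['title'],  # lesson title from metadata
        })
    return pd.DataFrame(rows)
```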

Search by Character Components

Explore additional character searches using Kangxi radicals and Ideographic Description Characters (IDCs).

Kangxi Radical

cql = """
[radical="穴"]
"""
get_first_n(cql, 5)

This prints:

[
    <Concord 更飛進了太{空}。萊特兄弟>,
    <Concord 以後,一有{空},我就要哥>,
    <Concord 高的掛在天{空}。好多人看>,
    <Concord 笑語,聞著{空}氣中淡淡的>,
    <Concord 箭手,不論{空}中的飛雁,>,
]

IDC

cql = """
[compo="木" & idc="horz2"] # 'horz2' represents '⿰'
"""
get_first_n(cql, 5)

This prints:

[
    <Concord 舞動,這個{栩}栩如生的掌>,
    <Concord 動,這個栩{栩}如生的掌中>,
    <Concord 城河上的小{橋},穿過古老>,
    <Concord 巒溪的長虹{橋}附近,兩岸>,
    <Concord 來到人間的{橋}梁。小弟弟>,
]

Radical Semantic Type

One can also search by the semantic type of Kangxi radicals based on Ma’s (2016) classification.

cql = '''
[semtag="植物"] [semtag="植物"]
'''
get_first_n(cql, 5)

This prints:

[
    <Concord 握細緻的小{楷筆},一筆一畫>,
    <Concord 年的歷史,{梁柱}雕刻很細緻>,
    <Concord 在二十世紀{萌芽}的新科技,>,
    <Concord 印象深刻。{花蓮}秀姑巒溪的>,
    <Concord 遠到宜蘭、{花蓮},深入原住>,
]

Search by Phonetic Properties

Explore phonetic aspects of the text by utilizing specific phonetic attributes available in the corpus.

Mandarin

cql = '''
[phon="ㄨㄥ" & tone="1" & sys="moe"]
'''
get_first_n(cql, 5)

This prints:

[
    <Concord 前有一個富{翁},很迷信。>,
    <Concord 起來。」富{翁}聽了,就叫>,
    <Concord 。他就對富{翁}說:「這棵>,
    <Concord 棵樹。」富{翁}聽了,覺得>,
    <Concord 有一天,富{翁}的朋友來,>,
]

Middle Chinese

cql = '''
[韻母="東" & 聲調="平" & sys="廣韻"]
'''
get_first_n(cql, 5)

This prints:

[
    <Concord 我要學給有{蟲}的樹治病,>,
    <Concord ,只剩竹節{蟲}與枯葉蝶,>,
    <Concord 驚嘆。竹節{蟲}與枯葉蝶就>,
    <Concord 眼前,竹節{蟲}和竹子的細>,
    <Concord 比賽,竹節{蟲}與枯葉蝶勝>,
]

Setting Up Analysis Tools

Initialize the necessary tools for component analysis, concordance searches, and dispersion analysis. This setup enables detailed linguistic exploration within the textbook_corpus.

from hgct import CompoAnalysis, PlainTextReader, Concordancer, Dispersion

CA = CompoAnalysis(PlainTextReader("textbook_corpus/", auto_load=False))
CC = Concordancer(PlainTextReader("textbook_corpus/").corpus)
DP = Dispersion(PlainTextReader("textbook_corpus/").corpus)

Frequency Distribution Analysis

Character

Retrieve the most common characters from a 5th grade fall textbook. The tp parameter specifies the type; here it is set to “chr” for characters.

CA.freq_distr(tp="chr", subcorp_idx=5).most_common(4)

This prints:

[('的', 108), (',', 104), ('。', 68), ('了', 47)]

Characters with a given radical

Calculate the frequency of characters containing the radical “水” across the same textbook.

CA.freq_distr(tp=None, radical="水", subcorp_idx=5).most_common(4)

This prints:

[('洲', 21), ('海', 19), ('沒', 16), ('湖', 16)]

Characters with a given IDC component

Analyze the frequency of characters containing the component “土” in a vertical (idc="vert2") arrangement within the same textbook.

CA.freq_distr(tp=None, compo="土", idc="vert2", subcorp_idx=5)  # 'vert2' represents '⿱'

This prints:

Counter({'王': 30,
         '去': 27,
         '走': 10,
         '堅': 5,
         '幸': 5,
         '墓': 5,
         '至': 4,
         '壁': 4,
         '堂': 3,
         '主': 2,
         '基': 2,
         '赤': 1,
         '堡': 1})

Dispersion Analysis

Analyze how uniformly certain characters are spread throughout the corpus to understand their usage patterns. In this example, we look at the dispersion of ‘的’ (a function word) and ‘花’ (a content word).

import pandas as pd

df_disp = []

for ch in '的花':
    stats, raw = DP.char_dispersion(
        char=ch, subcorp_idx=0, return_raw=True
    )
    d = {
        'char': ch,
        'Range(%)': '{:.2f}'.format(100 * stats['Range'] / raw['n']),  # Normalize range as a percentage of total segments
        **stats  # Include all other stats
    }
    df_disp.append(d)

df_disp = pd.DataFrame(df_disp)
df_disp

This prints:

  char Range(%)  Range        DP    DPnorm  KLdivergence  JuillandD  RosengrenS
0    的   100.00      9  0.179573  0.187460      0.126840   0.840689    0.955655
1    花    33.33      3  0.686952  0.717124      1.899763   0.384915    0.290457
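
For reference, the DP column above is Gries's Deviation of Proportions: half the summed absolute differences between each corpus part's observed share of the item's occurrences and that part's share of the whole corpus. A minimal standalone sketch (our own illustration, independent of hgct's implementation):

```python
def gries_dp(freqs, part_sizes):
    """Gries's DP from per-part frequencies.

    freqs      -- occurrences of the item in each corpus part
    part_sizes -- total token count of each corpus part
    """
    total_f = sum(freqs)
    total_s = sum(part_sizes)
    # Half the summed absolute deviations of observed vs. expected shares
    return 0.5 * sum(
        abs(f / total_f - s / total_s)
        for f, s in zip(freqs, part_sizes)
    )
```

An item occurring proportionally in every part scores near 0; one confined to few parts approaches 1, matching the contrast between ‘的’ and ‘花’ above.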

Ngram and Collocation Analysis

Analyze bigrams and their associations using various statistical measures to better understand the common pairings and their strengths in a 5th grade fall textbook corpus.

# Frequency distribution of 2-gram ngrams
CC.freq_distr_ngrams(n=2, subcorp_idx=5).most_common(4)

This prints:

[('他們', 26), ('母親', 26), ('我們', 25), ('一個', 23)]

# Bigram associations sorted by G-squared statistic
bi_asso = CC.bigram_associations(subcorp_idx=5, sort_by="Gsq")
bi_asso[0]

This prints:

(
 '母親',
 {
    'MI': 8.254205080149351,
    'Xsq': 7934.8041752841245,
    'Gsq': 329.2349988448261,
    'Dice': 0.9285714285714286,
    'DeltaP21': 0.9626357378320729,
    'DeltaP12': 0.8964426252943788,
    'FisherExact': 3.6528671910409246e-72,
    'RawCount': 26
 }
)

# Tabulate the top 5 bigram associations
d = pd.DataFrame([{'bigram': x[0], **x[1]} for x in bi_asso][:5])
d

This prints:

  bigram        MI        Xsq       Gsq      Dice  DeltaP21  DeltaP12   FisherExact  RawCount
0     母親  8.254205  7934.8042  329.2350  0.928571  0.962636  0.896443  3.652867e-72        26
1     時候  7.674781  3871.3328  211.5859  0.593750  0.422222  0.997167  1.488648e-46        19
2     企鵝  9.466194  9195.0000  196.5797  1.000000  1.000000  1.000000  1.869793e-42        13
3     南極  8.435315  5531.9219  195.0192  0.761905  0.940196  0.639891  2.816916e-43        16
4     小昌  7.235897  2700.5653  186.4523  0.455696  0.295082  0.995314  3.869527e-41        18
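
For reference, these association scores all derive from a 2×2 contingency table of bigram counts. A minimal sketch of MI, G², and Dice (our own illustration, not hgct's internal code; the count arguments are assumed inputs):

```python
import math

def bigram_scores(o11, f1, f2, n):
    """Association scores from a 2x2 contingency table of bigram counts.

    o11 -- count of the bigram (w1, w2)
    f1  -- count of w1 as a first element
    f2  -- count of w2 as a second element
    n   -- total number of bigram tokens
    """
    e11 = f1 * f2 / n            # expected bigram count under independence
    mi = math.log2(o11 / e11)    # pointwise mutual information
    # G-squared: 2 * sum over the four cells of O * ln(O / E)
    gsq = 0.0
    for o, row, col in ((o11, f1, f2),
                        (f1 - o11, f1, n - f2),
                        (f2 - o11, n - f1, f2),
                        (n - f1 - f2 + o11, n - f1, n - f2)):
        e = row * col / n
        if o > 0:
            gsq += o * math.log(o / e)
    dice = 2 * o11 / (f1 + f2)
    return {'MI': mi, 'Gsq': 2 * gsq, 'Dice': dice}
```

A bigram whose parts never occur apart (like 企鵝 above) reaches Dice = 1.0, while frequent-but-promiscuous parts lower it.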