hgct
This page showcases the HGCT (Hanzi Glyph Corpus Toolkit), a tool designed for advanced analysis and querying of Chinese text data, demonstrated through a textbook-derived corpus.
Installation
To starting using HGCT, clone the repository and install the required dependencies.
Python Version
Python >= 3.0, < 3.11
Clone Repository
git clone git@github.com:lopentu/HanziAnalysisKit.git
Install Requirement
cd HanziAnalysisKit && pip install -r requirements.txt
Quick Start
Building the Corpus
To prepare your data for use with the HGCT tool, build your corpus from the provided textbook data:
from textbook import build_corpus
# Specify your CSV data file and desired output folder
csv_file = './data/教科書課文.csv'
folder = "textbook_corpus"
# Build the corpus
build_corpus(csv_file, folder)
This will create a textbook_corpus folder in your project directory, containing the processed data ready for analysis with HGCT.
Directory Structure
After building the corpus, your project directory will be organized as follows:
|--- data/
| |--- 教科書課文.csv
|--- textbook/
|--- textbook_corpus/ # Newly generated
|--- ...
Demonstration
This section provides comprehensive examples demonstrating the capabilities of HGCT for analyzing Chinese text data. Each example includes a brief description and the expected output.
Corpus Reading and Concordancer Initialization
Load your corpus and set up the Concordancer for text analysis:
from hgct import PlainTextReader, Concordancer
# Load the corpus into PlainTextReader
corpus = PlainTextReader(dir_path="textbook_corpus/").corpus
# Initialize the Concordancer with the loaded corpus
c = Concordancer(corpus)
Utility Function for CQL Search
Create a function to facilitate Corpus Query Language (CQL) searches and display results:
def get_first_n(cql, n=10, left=5, right=5):
out = []
for i, r in enumerate(c.cql_search(cql, left=left, right=right)):
if i == n: break
out.append(r)
return out
Search by Characters
Perform searches based on character-specific criteria.
cql = """
[char="窗"] [char="[一-窗]"]
"""
results = get_first_n(cql, n=5)
results
This prints:
[
<Concord 門、客廳和{窗戶}旁貼上年畫>,
<Concord 水匠來,把{窗子}用甎頭堵上>,
<Concord 子,就指著{窗子}說:「這兩>,
<Concord 大門上、紙{窗旁},幾乎都貼>,
<Concord 要把這兩個{窗子}堵起來。」>,
]
To examine detailed information about a specific match, retrieve the data
attribute from a Concordance object.
result_1 = results[0]
result_1.data
This prints:
{
'left': '門、客廳和',
'keyword': '窗戶',
'right': '旁貼上年畫',
'position': (3, 11, 0, 25),
'meta': {
'id': '4S/90-1-有趣的年畫.txt',
'time': {
'label': '教科書課文 - 4下',
'ord': 4,
'year': ['90', '88', '69']
},
'text': {
'lesson': '1',
'title': '有趣的年畫',
'year': '90'
}
},
'captureGroups': {}
}
Search by Character Components
Explore additional character searches using Kangxi radicals and Ideographic Description Characters (IDCs).
Kangxi Radical
cql = """
[radical="穴"]
"""
get_first_n(cql, 5)
This prints:
[
<Concord 更飛進了太{空}。萊特兄弟>,
<Concord 以後,一有{空},我就要哥>,
<Concord 高的掛在天{空}。好多人看>,
<Concord 笑語,聞著{空}氣中淡淡的>,
<Concord 箭手,不論{空}中的飛雁,>,
]
IDC
cql = """
[compo="木" & idc="horz2"] # 'horz2' represents '⿰'
"""
get_first_n(cql, 5)
This prints:
[
<Concord 舞動,這個{栩}栩如生的掌>,
<Concord 動,這個栩{栩}如生的掌中>,
<Concord 城河上的小{橋},穿過古老>,
<Concord 巒溪的長虹{橋}附近,兩岸>,
<Concord 來到人間的{橋}梁。小弟弟>,
]
Radical Semantic Type
One can also search by the semantic type of Kangxi radicals based on Ma’s (2016) classification.
cql = '''
[semtag="植物"] [semtag="植物"]
'''
get_first_n(cql, 5)
This prints:
[
<Concord 握細緻的小{楷筆},一筆一畫>,
<Concord 年的歷史,{梁柱}雕刻很細緻>,
<Concord 在二十世紀{萌芽}的新科技,>,
<Concord 印象深刻。{花蓮}秀姑巒溪的>,
<Concord 遠到宜蘭、{花蓮},深入原住>,
]
Search by Phonetic Properties
Explore phonetic aspects of the text by utilizing specific phonetic attributes available in the corpus.
Mandarin
cql = '''
[phon="ㄨㄥ" & tone="1" & sys="moe"]
'''
get_first_n(cql, 5)
This prints:
[
<Concord 前有一個富{翁},很迷信。>,
<Concord 起來。」富{翁}聽了,就叫>,
<Concord 。他就對富{翁}說:「這棵>,
<Concord 棵樹。」富{翁}聽了,覺得>,
<Concord 有一天,富{翁}的朋友來,>,
]
Middle Chinese
cql = '''
[韻母="東" & 聲調="平" & sys="廣韻"]
'''
get_first_n(cql, 5)
This prints:
[
<Concord 我要學給有{蟲}的樹治病,>,
<Concord ,只剩竹節{蟲}與枯葉蝶,>,
<Concord 驚嘆。竹節{蟲}與枯葉蝶就>,
<Concord 眼前,竹節{蟲}和竹子的細>,
<Concord 比賽,竹節{蟲}與枯葉蝶勝>,
]
Setting Up Analysis Tools
Initialize the necessary tools for component analysis, concordance searches, and dispersion analysis. This setup enables detailed linguistic exploration within the textbook_corpus
.
from hgct import CompoAnalysis, PlainTextReader, Concordancer, Dispersion
CA = CompoAnalysis(PlainTextReader("textbook_corpus/", auto_load=False))
CC = Concordancer(PlainTextReader("textbook_corpus/").corpus)
DP = Dispersion(PlainTextReader("textbook_corpus/").corpus)
Frequency Distribution Analysis
Character
Retrieve the most common characters from a 5th grade fall textbook. The tp
parameter specifies the type; here it is set to “chr” for characters.
CA.freq_distr(tp="chr", subcorp_idx=5).most_common(4)
This prints:
[('的', 108), (',', 104), ('。', 68), ('了', 47)]
Characters with a given radical
Calculate the frequency of characters containing the radical “水” across the same textbook.
CA.freq_distr(tp=None, radical="水", subcorp_idx=5).most_common(4)
This prints:
[('洲', 21), ('海', 19), ('沒', 16), ('湖', 16)]
Characters with a given IDC component
Analyze the frequency of characters containing the IDC component “土” with a vertical arrangement in the same textbook.
CA.freq_distr(tp=None, compo="土", idc="vert2", subcorp_idx=5)
This prints:
Counter({'王': 30,
'去': 27,
'走': 10,
'堅': 5,
'幸': 5,
'墓': 5,
'至': 4,
'壁': 4,
'堂': 3,
'主': 2,
'基': 2,
'赤': 1,
'堡': 1})
Dispersion Analysis
Analyze how uniformly certain characters are spread throughout the corpus to understand their usage patterns. In this example, we look at the dispersion of ‘的’ (a function word) and ‘花’ (a content word).
import pandas as pd
df_disp = []
for ch in '的花':
stats, raw = DP.char_dispersion(
char=ch, subcorp_idx=0, return_raw=True
)
d = {
'char': ch,
'Range(%)': '{:.2f}'.format(100 * stats['Range'] / raw['n']), # Normalize range as a percentage of total segments
**stats # Include all other stats
}
df_disp.append(d)
df_disp = pd.DataFrame(df_disp)
df_disp
This prints:
Char |
Range(%) |
Range |
DP |
DPnorm |
KLdivergence |
JuillandD |
RosengrenS |
---|---|---|---|---|---|---|---|
的 |
100.00 |
9 |
0.179573 |
0.187460 |
0.126840 |
0.840689 |
0.955655 |
花 |
33.33 |
3 |
0.686952 |
0.717124 |
1.899763 |
0.384915 |
0.290457 |
Ngram and Collocation Analysis
Analyze bigrams and their associations using various statistical measures to better understand the common pairings and their strengths in a 5th grade fall textbook corpus.
# Frequency distribution of 2-gram ngrams
CC.freq_distr_ngrams(n=2, subcorp_idx=5).most_common(4)
This prints:
[('他們', 26), ('母親', 26), ('我們', 25), ('一個', 23)]
# Bigram associations sorted by G-squared statistic
bi_asso = CC.bigram_associations(subcorp_idx=5, sort_by="Gsq")
bi_asso[0]
This prints:
(
'母親',
{
'MI': 8.254205080149351,
'Xsq': 7934.8041752841245,
'Gsq': 329.2349988448261,
'Dice': 0.9285714285714286,
'DeltaP21': 0.9626357378320729,
'DeltaP12': 0.8964426252943788,
'FisherExact': 3.6528671910409246e-72,
'RawCount': 26
}
)
d = pd.DataFrame([{'bigram': x[0], **x[1]} for x in bi_asso][:5])
d
This prints:
bigram |
MI |
Xsq |
Gsq |
Dice |
DeltaP21 |
DeltaP12 |
FisherExact |
RawCount |
---|---|---|---|---|---|---|---|---|
母親 |
8.254205 |
7934.8042 |
329.2350 |
0.928571 |
0.962636 |
0.896443 |
3.652867e-72 |
26 |
時候 |
7.674781 |
3871.3328 |
211.5859 |
0.593750 |
0.422222 |
0.997167 |
1.488648e-46 |
19 |
企鵝 |
9.466194 |
9195.0000 |
196.5797 |
1.000000 |
1.000000 |
1.000000 |
1.869793e-42 |
13 |
南極 |
8.435315 |
5531.9219 |
195.0192 |
0.761905 |
0.940196 |
0.639891 |
2.816916e-43 |
16 |
小昌 |
7.235897 |
2700.5653 |
186.4523 |
0.455696 |
0.295082 |
0.995314 |
3.869527e-41 |
18 |