Collostructional Analysis with Python

In [1]:
from APIsearch import search, get_capture_groups, top_n
from collo_measures import cca, dca, rank_collo

1. Covarying Collexeme Analysis (CCA)

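Measures how strongly the words filling two lexical slots of the same construction tend to co-occur.

e.g., the object and the action verb in the 把 (bǎ) construction, as in: 把 時間(slot1) 花(slot2) 在... ("to spend time on ...")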
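The co-occurrence tendency described above is usually scored from a 2×2 contingency table built for each (slot1, slot2) pair. The block below is a minimal sketch of that idea using the log-likelihood ratio (G2). The function name `g2_for_pair` and the toy counts are invented for illustration, and the actual `collo_measures.cca` implementation may differ in its details.

```python
import math

def g2_for_pair(freq_table, slot1, slot2):
    """Log-likelihood ratio (G2) for one (slot1, slot2) pair.

    freq_table maps (slot1_word, slot2_word) tuples to co-occurrence counts.
    """
    n = sum(freq_table.values())
    a = freq_table.get((slot1, slot2), 0)
    # Marginal totals: all pairs with this slot1 word / this slot2 word.
    r1 = sum(f for (w1, _), f in freq_table.items() if w1 == slot1)
    c1 = sum(f for (_, w2), f in freq_table.items() if w2 == slot2)
    observed = [a, r1 - a, c1 - a, n - r1 - c1 + a]
    expected = [
        r1 * c1 / n,
        r1 * (n - c1) / n,
        (n - r1) * c1 / n,
        (n - r1) * (n - c1) / n,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Invented toy counts, not taken from the corpus.
toy = {("時間", "花"): 8, ("時間", "放"): 2, ("重心", "花"): 1, ("重心", "放"): 9}
print(g2_for_pair(toy, "時間", "花"))
```

A G2 of 0 means the pair occurs exactly as often as the marginal totals predict; larger values indicate a stronger departure from independence.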

CQL

[word="把" & pos="P"] [pos!="N[abcd].*|COMMACATEGORY|PERIODCATEGORY"]* obj:[pos="N[abcd].*"] v:[pos="V.*"]
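To make the query easier to read, it can be assembled from its four parts. This is only a sketch: the variable names are invented, and the comments give the usual CQL reading of each part rather than anything specific to the `APIsearch` backend.

```python
# 把 tagged as a preposition (P): the construction marker.
BA = '[word="把" & pos="P"]'
# Zero or more tokens that are neither nouns (N[abcd]...) nor commas/periods,
# so the match cannot skip over a noun or run across a clause boundary.
FILLER = '[pos!="N[abcd].*|COMMACATEGORY|PERIODCATEGORY"]*'
# The first noun after 把, captured under the label "obj".
OBJ = 'obj:[pos="N[abcd].*"]'
# The token right after that noun, required to be a verb, captured as "v".
VERB = 'v:[pos="V.*"]'

cql = " ".join([BA, FILLER, OBJ, VERB])
print(cql)
```

Joining the parts with single spaces reproduces the query string shown above.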
In [2]:
cql = '[word="把" & pos="P"] [pos!="N[abcd].*|COMMACATEGORY|PERIODCATEGORY"]* obj:[pos="N[abcd].*"] v:[pos="V.*"]'
search_results, requested_urls = search(cql, board="Boy-Girl", year_from=2019, year_to=2019, number=None)
len(search_results)
Found 8977 results
Out[2]:
8977
In [3]:
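# Count how often each (object, verb) pair occurs in the 把 construction.
freq_table = {}
for hit in search_results:
    gramrel = get_capture_groups(hit)
    # Take the first element captured under each label of the CQL query.
    Obj, Act = gramrel['obj'][0], gramrel['v'][0]

    k = (Obj, Act)
    if k not in freq_table:
        freq_table[k] = 0
    freq_table[k] += 1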

len(freq_table)
Out[3]:
6264
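The same pair counting can be written more compactly with `collections.Counter`. This is a sketch under the assumption that each capture-group result is a dict mapping a label to a list of strings, as the indexing `gramrel['obj'][0]` above suggests. The helper `count_pairs` and the toy data are invented.

```python
from collections import Counter

def count_pairs(capture_groups):
    """Count (object, verb) pairs from an iterable of capture-group dicts."""
    return Counter((g["obj"][0], g["v"][0]) for g in capture_groups)

# Invented capture groups standing in for get_capture_groups(hit) results.
toy_groups = [
    {"obj": ["錢"], "v": ["花"]},
    {"obj": ["時間"], "v": ["花"]},
    {"obj": ["錢"], "v": ["花"]},
]
pair_counts = count_pairs(toy_groups)
print(pair_counts.most_common(1))
```

`Counter` is a `dict` subclass, so the result can be used wherever a plain dict of counts is expected.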
In [4]:
top_n(freq_table, 15)
Out[4]:
[(('重心', '放'), 65),
 (('錢', '給'), 45),
 (('錢', '拿去'), 42),
 (('重點', '放'), 39),
 (('話', '講'), 39),
 (('人', '當'), 38),
 (('錢', '花'), 33),
 (('時間', '花'), 33),
 (('女生', '當'), 29),
 (('責任', '推給'), 26),
 (('心思', '放'), 23),
 (('時間', '浪費'), 22),
 (('女友', '當'), 22),
 (('男生', '當'), 22),
 (('焦點', '放'), 20)]
In [5]:
cca_results = cca(freq_table)
In [6]:
rank_collo(cca_results, sort_by='G2', freq_cutoff=3)[:15]
Out[6]:
[(('重心', '放'), 271.421560848666, 65),
 (('重點', '放'), 183.20998938024928, 39),
 (('妹', '達'), 173.7096502719227, 19),
 (('錢', '拿去'), 173.27124551993205, 42),
 (('話', '講'), 161.3834669743639, 39),
 (('心', '聊'), 155.68343186694932, 19),
 (('責任', '推給'), 148.0138228120744, 26),
 (('時間', '花'), 137.15892850244444, 33),
 (('責任', '推到'), 112.98412079076587, 17),
 (('錢', '花'), 110.52864351326383, 33),
 (('時間', '浪費'), 108.61381164211798, 22),
 (('錯', '推給'), 105.97270943400696, 13),
 (('心思', '放'), 105.32060147547003, 23),
 (('距離', '拉開'), 102.1130824791797, 9),
 (('焦點', '放'), 93.53232078379003, 20)]
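The ranking step can be pictured as a filter followed by a sort. The sketch below works on (item, score, frequency) tuples shaped like the output above. The function `rank_by_score` is hypothetical and is not the `rank_collo` implementation; the inclusive cutoff is an assumption based on items with frequency 3 appearing in later outputs with `freq_cutoff=3`.

```python
def rank_by_score(rows, freq_cutoff=3):
    """Keep rows whose frequency is at least freq_cutoff, highest score first.

    rows is a list of (item, score, frequency) tuples.
    """
    kept = [row for row in rows if row[2] >= freq_cutoff]
    return sorted(kept, key=lambda row: row[1], reverse=True)

# Invented rows for illustration only.
toy_rows = [
    (("a", "x"), 5.0, 10),
    (("b", "y"), 9.0, 2),   # high score but too rare: dropped by the cutoff
    (("c", "z"), 7.0, 3),
]
ranked = rank_by_score(toy_rows, freq_cutoff=3)
print(ranked)
```

The cutoff matters because association measures can assign high scores to pairs seen only once or twice, where the evidence is thin.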
In [7]:
rank_collo(cca_results, sort_by='fisher_exact', freq_cutoff=3)[:15]
Out[7]:
[(('重心', '放'), 138.13724018692318, 65),
 (('重點', '放'), 93.63402248631904, 39),
 (('錢', '拿去'), 88.9350611803136, 42),
 (('妹', '達'), 87.70386959728724, 19),
 (('話', '講'), 83.0890997460714, 39),
 (('心', '聊'), 79.51399246447232, 19),
 (('責任', '推給'), 76.16474172420881, 26),
 (('時間', '花'), 70.87859313042334, 33),
 (('責任', '推到'), 58.32442867914655, 17),
 (('錢', '花'), 57.56853869447204, 33),
 (('時間', '浪費'), 56.297417776301586, 22),
 (('錯', '推給'), 54.51960351446885, 13),
 (('心思', '放'), 54.478503047634995, 23),
 (('距離', '拉開'), 52.17269143420717, 9),
 (('焦點', '放'), 48.48194515983087, 20)]
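Fisher-based collostruction strength is commonly reported as a negative log of the exact p-value, so that larger numbers mean stronger attraction. The following is a minimal standard-library sketch of that idea. The function `fisher_strength`, the one-sided test, and the natural log are all assumptions; the source does not say which variant `collo_measures` uses.

```python
import math

def fisher_strength(a, b, c, d):
    """-ln of the one-sided Fisher exact p-value for the table [[a, b], [c, d]].

    The p-value is the hypergeometric probability of seeing a count of at
    least `a` in the top-left cell, given fixed row and column totals.
    """
    r1, c1, n = a + b, a + c, a + b + c + d
    # Sum the hypergeometric numerators over all tables at least as extreme.
    tail = sum(
        math.comb(r1, k) * math.comb(n - r1, c1 - k)
        for k in range(a, min(r1, c1) + 1)
    )
    # Work with logs of exact integers to avoid float underflow on large tables.
    return math.log(math.comb(n, c1)) - math.log(tail)

# Invented toy tables: a strongly associated pair and an independent one.
print(fisher_strength(9, 1, 1, 9))
print(fisher_strength(5, 5, 5, 5))
```

Exact integer arithmetic keeps the sketch usable for corpus-sized totals, where the p-value itself would be too small to represent as a float.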

2. Distinctive Collexeme Analysis (DCA)

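Compares the lexical preferences of the corresponding slot across two (or more) constructions, for example the 把 (bǎ) construction versus the 將 (jiāng) construction:

  1. Verb preference
  2. Object preference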
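For two constructions, the comparison above reduces to a 2×2 table per word: its count in each construction against all other words in that construction. The sketch below computes a signed G2 in that spirit, positive when the word is over-represented in the first construction and negative otherwise. The function `signed_g2` and the toy counts are invented, and the `dca` function used in this notebook may be implemented differently.

```python
import math

def signed_g2(freq_table, word, cx_a, cx_b):
    """Signed log-likelihood ratio of `word` for construction cx_a vs cx_b.

    freq_table maps construction -> {word: count}.
    """
    a = freq_table[cx_a].get(word, 0)
    b = freq_table[cx_b].get(word, 0)
    n_a = sum(freq_table[cx_a].values())
    n_b = sum(freq_table[cx_b].values())
    n, w = n_a + n_b, a + b
    observed = [a, n_a - a, b, n_b - b]
    expected = [w * n_a / n, (n - w) * n_a / n, w * n_b / n, (n - w) * n_b / n]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Positive: attracted to cx_a; negative: attracted to cx_b.
    return g2 if a >= expected[0] else -g2

# Invented toy counts, not taken from the corpus.
toy = {"把": {"當": 30, "給": 20}, "將": {"視為": 8, "給": 2}}
print(signed_g2(toy, "當", "把", "將"))
print(signed_g2(toy, "視為", "把", "將"))
```

Because 把 is far more frequent overall, the expected counts are scaled by each construction's total, so a word can be distinctive for 將 even with a small raw count.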

CQL

  • 將/把 Obj V
    construction:[word="將|把" & pos="P"] [pos!="N[abcd].*|COMMACATEGORY|PERIODCATEGORY"]* obj:[pos="N[abcd].*"] v:[pos="V.*"]
In [8]:
cql = 'construction:[word="將|把" & pos="P"] [pos!="N[abcd].*|COMMACATEGORY|PERIODCATEGORY"]* obj:[pos="N[abcd].*"] v:[pos="V.*"]'
search_results, requested_urls = search(cql, board="Boy-Girl", year_from=2019, year_to=2019, number=None)
len(search_results)
Found 9243 results
Out[8]:
9243

2.1 Verb preference

In [9]:
# Per-construction verb frequencies: freq_table[construction][verb] -> count
freq_table = {'把': {}, '將': {}}

for hit in search_results:
    gramrel = get_capture_groups(hit)
    # Type is the matched construction marker (把 or 將); Act is the verb.
    Type, Act = gramrel['construction'][0], gramrel['v'][0]

    if Act not in freq_table[Type]:
        freq_table[Type][Act] = 0
    freq_table[Type][Act] += 1
In [10]:
dca_results = dca(freq_table)
rank_collo(dca_results, sort_by='G2', freq_cutoff=3)[:10]
Pos: attract to 把
Neg: attract to 將
Out[10]:
[('當', 20.243037700027056, 459),
 ('給', 8.440691320250224, 243),
 ('講', 6.463272099931488, 110),
 ('花', 4.692759307985961, 80),
 ('想', 4.516031699030498, 77),
 ('丟', 3.3393456854430585, 57),
 ('拿來', 3.3393456854430585, 57),
 ('搞', 3.1630670599328394, 54),
 ('交給', 3.1043204966558857, 53),
 ('告訴', 2.928119722432687, 50)]
In [11]:
rank_collo(dca_results, sort_by='G2', freq_cutoff=3)[-1:-11:-1]
Out[11]:
[('視為', -35.847320549195416, 18),
 ('整理', -9.50887685227265, 7),
 ('列為', -7.650962649824157, 3),
 ('無限', -5.557709099452445, 6),
 ('套到', -5.557709099452445, 6),
 ('交到', -5.0783699392913455, 7),
 ('問', -4.298055978140039, 9),
 ('分', -3.419970801844095, 12),
 ('變成', -2.962685691673147, 14),
 ('高', -2.7758098477978104, 3)]

2.2 Object preference

In [12]:
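# Per-construction object frequencies: freq_table[construction][object] -> count
freq_table = {'把': {}, '將': {}}

for hit in search_results:
    gramrel = get_capture_groups(hit)
    # Type is the matched construction marker (把 or 將); Obj is the object noun.
    Type, Obj = gramrel['construction'][0], gramrel['obj'][0]

    if Obj not in freq_table[Type]:
        freq_table[Type][Obj] = 0
    freq_table[Type][Obj] += 1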
In [13]:
dca_results = dca(freq_table)
rank_collo(dca_results, sort_by='G2', freq_cutoff=3)[:10]
Pos: attract to 把
Neg: attract to 將
Out[13]:
[('事情', 14.44430933034662, 244),
 ('話', 12.22859668032919, 207),
 ('錢', 5.4966119339174035, 359),
 ('妹', 4.98743585718279, 85),
 ('女友', 4.242322416189659, 158),
 ('小孩', 4.2216160931407165, 72),
 ('心', 4.045044911538693, 69),
 ('想法', 3.3393456854430585, 57),
 ('照片', 2.693276106447513, 46),
 ('女兒', 2.5758931654122454, 44)]
In [14]:
rank_collo(dca_results, sort_by='G2', freq_cutoff=3)[-1:-11:-1]
Out[14]:
[('物品', -9.50887685227265, 7),
 ('網友', -7.650962649824157, 3),
 ('意見', -5.557709099452445, 6),
 ('問題', -5.319274474376449, 146),
 ('經驗', -5.137425705920231, 18),
 ('伴侶', -4.6630495351146095, 8),
 ('愛情', -3.534983614441842, 26),
 ('情緒', -3.361910031958626, 85),
 ('女性', -3.1810420745778942, 13),
 ('感情', -2.9867987997911083, 90)]