Language Resources and Large Language Models: possible linkages
Hands-on code session
Language Models
Probability distributions over sequences of words, tokens or characters (Shannon, 1948; 1951)
As a core task in NLP, often framed as next token prediction.
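As a quick illustration, the sketch below queries a small pretrained causal LM for the probability distribution over the next token. It assumes the Hugging Face transformers library and the gpt2 checkpoint, which are not part of the session materials.

```python
# A minimal sketch of language modeling as next-token prediction,
# assuming Hugging Face `transformers` and the small `gpt2` model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Language resources are the food for"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution over the next token, P(w_t | w_1 .. w_{t-1}).
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx):>12s}  {prob.item():.3f}")
```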
Pre-trained Large Language Models
Expectations of reasoning ability arose only after large language models began to show emergent behavior (emergence) in structural understanding.
Rosetta Stone Problems
hallucination and knowledge-cutoff
BART’s hallucination (every model hallucinates quite severely)
Cute mistakes are harmless, but in important decisions (e.g., legal kinship and inheritance relations) they can cause serious trouble.
Continual learning
Life-long learning for Human and Machine
Illustrations: language and knowledge resources as the food for AI.
Language (and Knowledge) Resources in linguistics and language technology
data, tools, and advice
WordNet
architecture: two core components (synsets and the semantic relations among them)
Follow PWN (in comparison with Sinica BOW)
Word segmentation principle (Huang, Hsieh, and Chen 2017)
Corpus-based decision
Manually created (sense distinction, gloss with controlled vocabulary, etc.)
Some new perspectives in CWN
sense granularity, relation discovery, gloss and annotation in parallel
埔里種的【茶】很好喝 ‘The [tea] grown in Puli tastes great.’
Co-predicative Nouns
a phenomenon in which two or more predicates seem to require that their shared argument denote different things. In the example above, 種 ‘grow’ selects the tea plant, while 好喝 ‘tasty to drink’ selects the brewed beverage.
Glosses (lexicographic definitions) are carefully controlled, using a limited vocabulary and fixed lexical patterns; e.g., senses with the VH tag (i.e., stative intransitive verbs) are glossed with “形容 …” (‘describing …’) or “比喻形容 …” (‘metaphorically describing …’).
Collocational information and pragmatic information (‘tone’, etc.) are recorded as additional annotations.
Figure 2 further demonstrates the distribution of different types of relations
Visualization
sense tagger
Transformer-based sense tagger
Leveraging WordNet glosses using GlossBERT (Huang et al. 2019), a BERT model for word sense disambiguation with gloss knowledge.
Our extended GlossBERT model, trained on CWN glosses plus SemCor, reports 82% accuracy.
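As a rough illustration of the GlossBERT-style setup, the sketch below scores (context, gloss) pairs with a sentence-pair classifier. The checkpoint path and the candidate glosses are placeholders, not the actual CWN model or lexicon entries.

```python
# A minimal sketch of the GlossBERT idea: word sense disambiguation framed as
# sentence-pair classification over (context, gloss) pairs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "path/to/cwn-glossbert"   # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

context = "埔里種的茶很好喝"
candidate_glosses = [
    "植物名,葉子可以製成飲品",       # illustrative gloss: tea as a plant
    "以茶葉沖泡而成的飲料",           # illustrative gloss: tea as a beverage
]

# Score each (context, gloss) pair; the highest-scoring gloss is the predicted sense.
scores = []
for gloss in candidate_glosses:
    inputs = tokenizer(context, gloss, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    scores.append(torch.softmax(logits, dim=-1)[0, 1].item())  # assumes label 1 = "gloss matches"

best = max(range(len(candidate_glosses)), key=lambda i: scores[i])
print(candidate_glosses[best])
```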
Chinese SemCor
sense frequency distribution in corpus (see the sketch after this list)
word sense acquisition
tracking sense evolution
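A minimal sketch of the first item above (sense frequency distribution), assuming a sense-tagged corpus represented as simple (word, sense_id) pairs; the toy data and sense ids are invented for illustration.

```python
# Count how often each sense id occurs in a sense-tagged corpus.
from collections import Counter

tagged_corpus = [
    ("茶", "06653801"), ("茶", "06653801"), ("茶", "06653802"),
    ("喝", "04013601"),
]  # toy data; sense ids are placeholders, not actual CWN ids

sense_freq = Counter(sense for _, sense in tagged_corpus)
for sense, freq in sense_freq.most_common():
    print(sense, freq)
```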
How are (autoregressive) LLMs related to LR/KR?
Pre-train, Prompt, and Predict (Liu et al. 2021)
refers to methods for communicating with an LLM to steer its behavior toward desired outcomes without updating the model weights.
Note
prompt: a natural language description of the task.
prompt design: involves instructions and context passed to the LLM to achieve a desired task.
prompt engineering: the practice of developing optimal (clear, concise, informative) prompts to efficiently use LLMs for a variety of applications.
(Text) Generation is a meta capability of Large Language Models & Prompt Engineering is the key to unlocking it.
basic elements
A prompt is composed of the following components:
zero and few shot
(Weng 2023)
Zero-shot learning: a model can handle tasks or categories it was never explicitly taught during training.
Few-shot learning: adapting a model from only a handful of examples (in prompting, these examples are supplied directly in the prompt).
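A minimal sketch of zero-shot vs. few-shot prompting, assuming the openai Python SDK (v1-style client) and the gpt-3.5-turbo model used for the examples in this session; the Chinese word-sense task and example sentences are illustrative.

```python
# Zero-shot vs. few-shot prompting with the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# Zero-shot: only the task description, no examples.
print(ask("判斷下列句子中「茶」的詞義(植物或飲料):埔里種的茶很好喝"))

# Few-shot: a handful of worked examples are placed directly in the prompt.
few_shot = """句子:他在山上種茶 → 詞義:植物
句子:這杯茶太燙了 → 詞義:飲料
句子:埔里種的茶很好喝 → 詞義:"""
print(ask(few_shot))
```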
Chain-of-Thought
(Wei et al. 2023)
Self-consistency sampling
(Weng 2023)
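A minimal sketch combining chain-of-thought prompting with self-consistency sampling: several reasoning paths are sampled at a non-zero temperature and the final answers are aggregated by majority vote. The helper, the arithmetic question, and the answer-extraction convention are assumptions for illustration.

```python
# Chain-of-thought prompting with self-consistency (majority vote over samples).
from collections import Counter
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Q: A library has 23 dictionaries and buys 3 boxes of 12 each. "
    "How many dictionaries does it have now?\n"
    "A: Let's think step by step, and end with 'Answer: <number>'."
)

def sample_answer() -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": COT_PROMPT}],
        temperature=0.7,    # non-zero temperature gives diverse reasoning paths
    )
    text = resp.choices[0].message.content
    return text.rsplit("Answer:", 1)[-1].strip()

votes = Counter(sample_answer() for _ in range(5))
print(votes.most_common(1)[0][0])   # majority-vote answer
```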
For recent developments, see the Prompting Guide.
Persona
Setting a persona is also important, sociolinguistically.
word sense disambiguation
sense to action
word sense induction
code-switching wsd
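A minimal sketch of prompt-based word sense disambiguation on a code-switched sentence, in the spirit of the tasks listed above; the prompt wording and the CWN-style glosses are invented for illustration.

```python
# Prompt-based WSD: ask the chat model to pick the best-fitting gloss.
from openai import OpenAI

client = OpenAI()

sentence = "這個 proposal 的 tea break 安排在三點。"
target = "tea"
glosses = {
    "1": "植物名,葉子可以製成飲品",   # illustrative gloss, not an actual CWN entry
    "2": "以茶葉沖泡而成的飲料",
}

prompt = (
    f"句子:{sentence}\n"
    f"目標詞:{target}\n"
    "下列哪一個詞義最符合句中的用法?只回答編號。\n"
    + "\n".join(f"{k}. {v}" for k, v in glosses.items())
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```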
Exploratory Prompting Analysis
instruction tuning
Human rating is based on the word’s appropriateness of interpretation, the meanings’ correspondence to the word’s part of speech, their avoidance of oversimplification/overgeneralization, and their compliance with the prompt’s requirements.
The top 600 most frequent words are rated to further analyze their error types.
Even if we find ways to ask the savant about its various skills, there is still a ceiling.
In-context learning (~ prompting)
involves providing input to the language model in a specific format to elicit the desired output.
context window size
restricts the model’s ability to process long sequences of information effectively.
Various hallucinations and fixations (Hallucination)
appears when the generated content is nonsensical or unfaithful to the provided source content.
reluctance to express uncertainty or to revise its premises, and a tendency to yield to authority
incapable of accessing factual/up-to-date information; or no authentic sources provided
Toward a more linguistic knowledge-aware LLMs
Model compression and parameter-efficient fine-tuning techniques (quantization, LoRA, QLoRA, …) have made fine-tuning large language models much more feasible.
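A minimal sketch of a QLoRA-style fine-tuning setup, assuming the transformers, peft, and bitsandbytes libraries; the base checkpoint path and the LoRA hyperparameters are placeholders.

```python
# Parameter-efficient fine-tuning: 4-bit quantized base model + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "path/to/taiwan-llama",          # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)   # only the LoRA adapters are trainable
model.print_trainable_parameters()
```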
LoLlama
Built on top of Taiwan-LLaMa (Lin and Chen, 2023), which was pre-trained on over 5 billion tokens of Traditional Chinese. The model was further fine-tuned on over 490K multi-turn conversational examples to enable instruction-following and context-aware responses.
Commercial offerings are good, but they are expensive and come with no guarantee of data security.
Open-source LLMs are getting better and better, and compression techniques are maturing. But training is not cheap, and the results are often subjected to censorship over political values.
(A complaint: wanting the horse to run while giving it no grass to eat.)
Retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process.
We obtain vectors by removing the last layer and taking the output from the second-to-last layer.
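A minimal sketch of this extraction step, assuming a BERT-style encoder from Hugging Face transformers; the bert-base-chinese checkpoint and the mean pooling over tokens are assumptions, not necessarily the exact setup used here.

```python
# Sentence vectors from the second-to-last hidden layer of an encoder.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embeddings + one tensor per layer
    second_to_last = hidden_states[-2]                  # skip the final layer
    return second_to_last.mean(dim=1).squeeze(0)        # mean-pool over tokens

v1, v2 = embed("埔里種的茶很好喝"), embed("我想喝一杯茶")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```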
Vector DB/ Vector Stores
a higher-level architecture
augmented LLMs
localized and customized
upload data and compute statistics (word frequency, word sense frequency, etc.) via the llama-index data loader
vectorize the data and search/compare it semantically
given a few shots, predict the sense and automatically generate the gloss (and its relations to other senses)
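A minimal sketch of this workflow with llama-index (using the pre-0.10 import layout); the data directory and the query are placeholders.

```python
# Load local data, build a vector index, and query it semantically.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # data loader
index = VectorStoreIndex.from_documents(documents)      # vectorize the data
query_engine = index.as_query_engine()

# Semantic search / comparison over the uploaded data.
print(query_engine.query("「茶」在這批語料中最常見的詞義是什麼?"))
```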
Note
All examples are tested with gpt-3.5-turbo (using OpenAI’s API). It uses the default configuration, i.e., temperature=0.0.
Orchestration frameworks provide a way to manage and control LLMs.
Like a project manager who coordinates all ongoing work items and makes sure every project reaches the desired final outcome.
LlamaIndex
Semantic Kernel
LangChain
langchain
is an open source framework that allows AI developers to combine LLMs like GPT-4 with external sources of computation and data.
Chains: The core of langchain. Components (and even other chains) can be strung together to create chains.
Prompt templates: Prompt templates are templates for different types of prompts, like “chatbot”-style templates, ELI5 question answering, etc.
LLMs: Large language models
Indexing Utils: Ways to interact with specific data (embeddings, vectorstores, document loaders)
Tools: Ways to interact with the outside world (search, calculators, etc)
Agents: Agents use LLMs to decide what actions should be taken. Tools like web search or calculators can be used, and all are packaged into a logical loop of operations.
Memory: Short-term memory, long-term memory.
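A minimal sketch stringing some of these components (an LLM, a prompt template, a chain) together, assuming the classic langchain API with an OpenAI chat model; the template and inputs are illustrative.

```python
# A simple LLMChain: prompt template + chat model.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["word", "sentence"],
    template="請判斷「{word}」在下列句子中的詞義並給出定義:{sentence}",
)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(word="茶", sentence="埔里種的茶很好喝"))
```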
Familiarity with orchestration frameworks (e.g., langchain) has become a core skill for deploying LLM applications.