Language Resources and Large Language Models: possible linkages
Hands-on code session
Probability distributions over sequences of words, tokens or characters (Shannon, 1948; 1951)
As a core task in NLP, often framed as next token prediction.
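Next-token prediction can be made concrete with a tiny count-based model. Below is a minimal character-level bigram language model; the training string is a toy assumption for illustration only.

```python
from collections import Counter

# A minimal character-level bigram language model: next-token prediction
# reduces to estimating P(next char | current char) from corpus counts.
corpus = "the cat sat on the mat"  # toy corpus, an assumption for illustration

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (prev, next) pairs
unigrams = Counter(corpus[:-1])              # counts of each conditioning char

def next_char_prob(prev: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / unigrams[prev]

# In this corpus, 't' is followed by 'h' half the time and by ' ' the other half.
print(next_char_prob("t", "h"))  # -> 0.5
```

Modern LLMs replace the count table with a neural network over token sequences, but the probabilistic framing is the same.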
Pre-trained Large Language Models
Only after large language models began to show emergent behavior (emergence) in structural understanding did expectations of reasoning ability arise.
Rosetta Stone Problems
hallucination and knowledge-cutoff
BART’s hallucination (every vendor’s model hallucinates quite severely)
Life-long learning for Human and Machine
[Images: language and knowledge resources as the food for AI]
Language (and Knowledge) Resources in linguistics and language technology
data, tools, and advice
WordNet architecture: two core components: synsets (sets of synonymous word senses) and the semantic relations linking them.
Follows Princeton WordNet (PWN), in comparison with Sinica BOW.
Word segmentation principle (Huang, Hsieh, and Chen 2017)
Manually created (sense distinctions, glosses with controlled vocabulary, etc.)
Some new perspectives in CWN
sense granularity, relation discovery, gloss and annotation in parallel
a phenomenon where two or more predicates seem to require that their argument denotes different things.
Gloss (lexicographic definition) is carefully controlled with limited vocabulary and lexical patterns, e.g.,
VHtag senses (i.e., stative intransitive verbs) are glossed with “形容 or 比喻形容 …” (“to describe” or “to figuratively describe …”).
Collocational information and pragmatic information (‘tone’, etc.) are recorded as additional annotations.
Figure 2 further demonstrates the distribution of different types of relations
Transformer-based sense tagger
Leveraging WordNet glosses using GlossBERT (Huang et al. 2019), a BERT model for word sense disambiguation with gloss knowledge.
A GlossBERT model trained on CWN glosses + SemCor reports 82% accuracy.
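GlossBERT casts word sense disambiguation as sentence-pair classification: each candidate sense yields one (context, gloss) pair, and the model scores how well the gloss fits the context. The sketch below shows the pair construction; a toy Lesk-style word-overlap scorer stands in for the BERT classifier, and the sense inventory is invented for illustration.

```python
# Toy sense inventory (an assumption for illustration; CWN/PWN would supply these).
senses = {
    "bank.n.1": "a financial institution that accepts deposits",
    "bank.n.2": "sloping land beside a body of water",
}

def make_pairs(context: str, target: str):
    """Build the (context, gloss) pairs a GlossBERT-style model would classify.
    The target word is marked inline, mirroring GlossBERT's weak supervision signal."""
    return [(f"{context} [TGT] {target} [TGT]", gloss, sense_id)
            for sense_id, gloss in senses.items()]

def overlap_score(context: str, gloss: str) -> int:
    # Stand-in for the BERT pair score: simple word overlap (Lesk-style).
    return len(set(context.lower().split()) & set(gloss.lower().split()))

def disambiguate(context: str, target: str) -> str:
    pairs = make_pairs(context, target)
    return max(pairs, key=lambda p: overlap_score(p[0], p[1]))[2]

print(disambiguate("she deposits money at the bank", "bank"))  # -> bank.n.1
```

In the real system, `overlap_score` is replaced by a fine-tuned BERT scoring each context–gloss pair.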
sense frequency distribution in corpus
word sense acquisition
tracking sense evolution
What is the relationship between (autoregressive) LLMs and LR/KR?
Pre-train, Prompt, and Predict (Liu et al. 2021)
Prompting refers to methods of communicating with an LLM to steer its behavior toward desired outcomes without updating the model weights.
prompt: a natural language description of the task.
prompt design: involves instructions and context passed to the LLM to achieve a desired task.
prompt engineering: the practice of developing optimal (clear, concise, informative) prompts to efficiently use LLMs for a variety of applications.
(Text) Generation is a meta capability of Large Language Models & Prompt Engineering is the key to unlocking it.
A prompt is composed of the following components:
zero- and few-shot prompting (Weng 2023) …
Zero-shot learning: implies that a model can recognize things that have not explicitly been taught in training.
Few-shot learning: refers to adapting a model with minimal examples.
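The two settings differ only in whether worked examples are included in the prompt. Below is a sketch of both prompt formats; the sentiment-labelling task and examples are assumptions for illustration.

```python
# Zero-shot vs. few-shot prompt construction for a toy sentiment task.
INSTRUCTION = "Label the sentiment of the sentence as positive or negative."

def zero_shot(query: str) -> str:
    # No examples: the model must rely on what it learned in pre-training.
    return f"{INSTRUCTION}\n\nSentence: {query}\nLabel:"

def few_shot(query: str, examples: list[tuple[str, str]]) -> str:
    # A handful of (sentence, label) demonstrations precede the query.
    shots = "\n".join(f"Sentence: {s}\nLabel: {l}" for s, l in examples)
    return f"{INSTRUCTION}\n\n{shots}\n\nSentence: {query}\nLabel:"

demo = [("What a great talk!", "positive"), ("This was a waste of time.", "negative")]
print(few_shot("I enjoyed the session.", demo))
```

Either string would then be sent to the model as-is; only the few-shot version carries in-context demonstrations.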
Chain-of-Thought (Wei et al. 2023)
Self-consistency sampling (Weng 2023)
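Self-consistency sampling draws several chain-of-thought completions (at temperature > 0) and returns the majority-vote answer. In the sketch below the LLM is replaced by a deterministic stub that answers wrongly on some calls (an assumption for illustration); in practice each call would be a sampled API request.

```python
from collections import Counter

def stub_model(question: str, call_idx: int) -> str:
    # Stand-in for a sampled LLM call: answers "42" on most calls,
    # a spurious "41" on every 4th call.
    return "41" if call_idx % 4 == 0 else "42"

def self_consistency(question: str, n_samples: int = 9) -> str:
    # Sample several answers, then take the most common one.
    answers = [stub_model(question, i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # -> 42 (majority vote over 9 samples)
```

The occasional wrong answer is outvoted, which is the whole point of the technique.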
For recent developments, see the Prompting Guide.
Persona setting is also important, sociolinguistically.
word sense disambiguation
sense to action
word sense induction
Exploratory Prompting Analysis
Human rating is based on the word’s appropriateness of interpretation, the meanings’ correspondence to the word’s part of speech, their avoidance of oversimplification/overgeneralization, and their compliance with the prompt’s requirements.
The top 600 most frequent words are rated to further analyze their error types.
Even devising ways to ask a savant about its many skills runs into a ceiling.
In-context learning (~ prompting) involves providing input to the language model in a specific format to elicit the desired output.
Context window size restricts the model’s ability to process long sequences of information effectively.
Hallucination appears when the generated content is nonsensical or unfaithful to the provided source content.
Reluctance to express uncertainty or to revise premises, and a tendency to yield to authority.
incapable of accessing factual/up-to-date information; or no authentic sources provided
Toward a more linguistic knowledge-aware LLMs
Model compression techniques such as quantization (QLoRA, …) have made fine-tuning large language models much more feasible.
LoLlama, built on top of Taiwan-LLaMa (Lin and Chen, 2023), which was pre-trained on over 5 billion tokens of Traditional Chinese. The model was further fine-tuned on over 490K multi-turn conversational data to enable instruction-following and context-aware responses.
Open-source LLMs keep improving and compression techniques keep maturing. But training is not cheap, and the results are often subject to censorship over political values.
Retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process.
We obtain vectors by removing the model’s last layer and taking the output of the second-to-last layer.
Vector DB/ Vector Stores
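The retrieval step can be sketched as a minimal in-memory vector store: embed documents once, then retrieve by cosine similarity. The bag-of-words “embedding” and the tiny vocabulary are assumptions for illustration; a real pipeline would call an embedding model and a vector DB.

```python
import math

VOCAB = ["cat", "dog", "bank", "river", "money"]  # toy vocabulary (assumption)

def embed(text: str) -> list[float]:
    # Toy embedding: bag-of-words counts over VOCAB.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the cat chased the dog", "money in the bank", "the river bank"]
index = [(d, embed(d)) for d in docs]  # the "vector store"

def search(query: str, k: int = 1):
    # Rank stored documents by cosine similarity to the query vector.
    q = embed(query)
    return sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)[:k]

print(search("deposit money")[0][0])  # -> money in the bank
```

Retrieved passages would then be prepended to the prompt to ground the LLM’s answer.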
a higher-level architecture
localized and customized
Upload data to compute statistics (word frequency, word sense frequency, etc.) via the llama-index data loader.
Vectorize the data for semantic search and comparison.
Given a few shots, predict the sense and automatically generate the gloss (and its relations to other senses).
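The gloss-generation step can be sketched as a few-shot chat prompt in CWN’s controlled gloss style. The two demonstration glosses below follow the “形容 …” pattern described earlier but are invented for illustration, and the message format follows OpenAI-style chat completions.

```python
# Few-shot demonstrations: (word, gloss) pairs in the controlled style
# (invented examples, an assumption for illustration).
FEW_SHOT = [
    ("高興", "形容因如願以償而心情愉快。"),
    ("美麗", "形容外觀好看，令人產生好感。"),
]

def gloss_messages(target_word: str) -> list[dict]:
    """Build chat messages asking the model to gloss a new word,
    preceded by few-shot (word, gloss) demonstrations."""
    messages = [{
        "role": "system",
        "content": "You write dictionary glosses using a controlled vocabulary.",
    }]
    for word, gloss in FEW_SHOT:
        messages.append({"role": "user", "content": f"Gloss the word: {word}"})
        messages.append({"role": "assistant", "content": gloss})
    messages.append({"role": "user", "content": f"Gloss the word: {target_word}"})
    return messages

msgs = gloss_messages("快樂")
print(len(msgs))  # -> 6 (system + 2 demonstrations + final query)
```

The resulting list would be passed as the `messages` argument of a chat-completion API call.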
All examples are tested with gpt-3.5-turbo (via OpenAI’s API), using the default configurations, i.e.,
Orchestration frameworks provide a way to manage and control LLMs.
langchain is an open-source framework that allows AI developers to combine LLMs like GPT-4 with external sources of computation and data.
Chains: The core of langchain. Components (and even other chains) can be strung together to create chains.
Prompt templates: Prompt templates are templates for different types of prompts, like “chatbot”-style templates, ELI5 question-answering, etc.
LLMs: Large language models
Indexing Utils: Ways to interact with specific data (embeddings, vectorstores, document loaders)
Tools: Ways to interact with the outside world (search, calculators, etc)
Agents: Agents use LLMs to decide what actions should be taken. Tools like web search or calculators can be used, and all are packaged into a logical loop of operations.
Memory: Short-term memory, long-term memory.
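The components above can be sketched in miniature with plain Python: a prompt template piped into an LLM, with a short-term memory of past outputs. This mimics the chain concept rather than the actual langchain API (an assumption for illustration), and the LLM is a stub that echoes its prompt.

```python
class PromptTemplate:
    """Minimal prompt template: a format string with named slots."""
    def __init__(self, template: str):
        self.template = template
    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)

def stub_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; just wraps the prompt it received.
    return f"ANSWER({prompt})"

class Chain:
    """Template -> LLM pipeline with a short-term memory of outputs."""
    def __init__(self, template: PromptTemplate, llm):
        self.template, self.llm = template, llm
        self.memory: list[str] = []
    def run(self, **kwargs) -> str:
        prompt = self.template.format(**kwargs)
        output = self.llm(prompt)
        self.memory.append(output)  # short-term memory
        return output

qa = Chain(PromptTemplate("Q: {question}\nA:"), stub_llm)
print(qa.run(question="What is CWN?"))
```

In langchain proper, the same roles are played by its prompt-template, LLM, and memory components, composed into chains.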
Orchestration frameworks (e.g., langchain) have become a core skill for deploying LLM applications.