
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of a summary by counting the number of overlapping units, such as n-grams, word sequences, and word pairs, between the model-generated text and the reference texts.
Variants:
- ROUGE-N: focuses on n-grams (N-word phrases). ROUGE-1 and ROUGE-2 (unigrams and bigrams, respectively) are the most common.
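As a minimal sketch of the idea (not the official `rouge-score` package), ROUGE-N can be computed as the recall of reference n-grams that also appear in the candidate; `ngrams` and `rouge_n` below are illustrative helper names:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """Illustrative ROUGE-N: recall of reference n-grams covered by the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())  # clipped overlap count
    return overlap / max(sum(ref.values()), 1)

# ROUGE-1 compares unigrams; ROUGE-2 would use n=2 for bigrams
print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))  # 5 of 6 reference unigrams matched
```

Production setups usually also report precision and the F1 score, and ROUGE-L (longest common subsequence) in addition to ROUGE-N.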


BLEU is calculated as the geometric mean of the n-gram precisions, multiplied by the brevity penalty (BP): $\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $p_n$ is the precision for n-grams and $w_n$ is its weight.
The BLEU score typically ranges from 0 to 1, where 0 indicates no overlap between the translated text and the reference translations, and 1 indicates a perfect match with the references.
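A minimal sentence-level sketch of this formula, assuming a single reference and uniform weights $w_n = 1/N$ (the `bleu` and `ngrams` names are illustrative, not a library API):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Illustrative BLEU: geometric mean of clipped n-gram precisions
    p_n (uniform weights), multiplied by the brevity penalty BP."""
    cand_toks, ref_toks = candidate.split(), reference.split()
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(cand_toks, n))
        ref = Counter(ngrams(ref_toks, n))
        overlap = sum((cand & ref).values())       # clipped matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0                             # any zero precision makes BLEU 0
        log_p_sum += math.log(overlap / total) / max_n  # w_n = 1/max_n
    c, r = len(cand_toks), len(ref_toks)
    bp = 1.0 if c > r else math.exp(1 - r / c)     # brevity penalty
    return bp * math.exp(log_p_sum)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Real implementations (e.g. sacreBLEU) add smoothing for zero n-gram counts and support multiple references.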




Model or System Evaluations


A few key differences

Retrieval-Augmented Generation (RAG): from an LLMOps perspective



Weights & Biases (wandb)
promptfoo
Ethics and Safety

Making the MBTI Test an Amazing Evaluation for Large Language Models

Should AI be completely honest?

Yesterday's science fiction, tomorrow's reality
Does AI bear legal responsibility? (Does it pay taxes? Hold copyright? Is it capable of committing a crime?)
Can you marry an AI?
Let's design an evaluation task (model or task evaluation)
See this repo: Papers and resources for LLMs evaluation
Think about it: what hasn't been tried yet? Why can't it be done?
- Alignment and Calibration - Reinforcement Learning and LLMs
- Tracking and Developing LLMs
---

## Alignment and Calibration - Reinforcement Learning and LLMs