How can you measure the quality of a large language model? What tools can measure bias, toxicity, and truthfulness in a model? In this episode, I return to the Real Python podcast to discuss techniques and tools for evaluating LLMs with Python.
I provide some background on large language models and how they absorb vast amounts of information about relationships between words using a type of neural network called a transformer. We discuss training datasets and the quality issues that can arise from imperfectly curated sources.
We dig into ways to measure levels of bias, toxicity, and hallucinations using Python. I share three benchmarking datasets and links to resources to get you started. We also discuss ways to augment models using agents or plugins, which can access search engine results or other authoritative sources.
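
For a flavor of what measuring toxicity in Python can look like, here is a minimal sketch that assumes the Hugging Face `evaluate` library (with `transformers` and `torch` installed) is available. The episode covers its own set of tools and benchmarking datasets, so treat this as one illustrative option rather than the specific approach we walk through:

```python
import evaluate

# The `evaluate` library ships a "toxicity" measurement backed by a
# pretrained hate-speech classifier; loading it downloads the model.
toxicity = evaluate.load("toxicity", module_type="measurement")

# A few hypothetical model completions to score.
outputs = [
    "Thanks for asking! Here is a step-by-step explanation.",
    "That is a stupid question and you should already know the answer.",
]

# Each text gets a probability-like toxicity score between 0 and 1.
results = toxicity.compute(predictions=outputs)

for text, score in zip(outputs, results["toxicity"]):
    print(f"{score:.3f}  {text}")
```

The same library exposes related measurements (such as "regard" for bias toward demographic groups), which makes it a convenient starting point before moving on to full benchmarking datasets.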
