Real Python Podcast | Natural language processing and how ML models understand text

July 29, 2022

Cover image by Real Python Podcast

How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? In this Real Python Podcast episode, I talk about how machine learning (ML) models understand text.

I explain why ML models require data in a structured format, which means transforming text documents into rows and columns. I cover the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the tools of stemming, lemmatization, and count vectorization.
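To make the rows-and-columns idea concrete, here is a minimal sketch in plain Python. It is not code from the episode: the helper names (`tokenize`, `build_vocabulary`, `count_vectorize`, `binary_vectorize`) and the two sample sentences are made up for illustration, and the tokenizer is deliberately naive. Each document becomes a row and each vocabulary word becomes a column.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace, trim punctuation."""
    tokens = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [tok for tok in tokens if tok]

def build_vocabulary(docs):
    """Collect every distinct token; the sorted order fixes the column order."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vectorize(docs, vocab):
    """Bag-of-words counts: one row per document, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows

def binary_vectorize(docs, vocab):
    """Binary vectorization: 1 if the term appears in the document, else 0."""
    return [[1 if count > 0 else 0 for count in row]
            for row in count_vectorize(docs, vocab)]

# Hypothetical sample documents, invented for this sketch.
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab = build_vocabulary(docs)

print(vocab)
print(count_vectorize(docs, vocab))
print(binary_vectorize(docs, vocab))
```

The only difference between the two outputs is the column for "the", which appears twice in each sentence: count vectorization records 2, while binary vectorization records 1. Stemming or lemmatization would slot into `tokenize`, so that related forms such as "chased" and "chases" share a single column. In practice, a library such as NLTK would handle that step.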

We jump into word embedding models next. I talk about WordNet, Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for starting with text classification, implementing sentiment analysis, and building projects using these tools. I also share multiple resources to help you continue exploring NLP and modeling.
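As a rough illustration of the intuition behind word embeddings, the sketch below represents each word by counts of the words that appear near it, then compares words with cosine similarity. This is a simple count-based stand-in, not word2vec and not the Gensim API. The function names, the window size, and the tiny corpus are all assumptions made for the example.

```python
import math

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words within `window` positions of it."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0] * len(vocab) for word in vocab}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:  # skip the word itself
                    vectors[word][index[sentence[j]]] += 1
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical pre-tokenized corpus, invented for this sketch.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "the", "dog"],
]
vocab, vectors = cooccurrence_vectors(sentences)

print(cosine_similarity(vectors["cat"], vectors["dog"]))
print(cosine_similarity(vectors["milk"], vectors["water"]))
```

Words that occur in similar contexts end up with similar vectors. Models such as word2vec build on the same idea, but they learn dense vectors from large corpora rather than counting neighbors directly.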