How do you process and classify text documents in Python? What are the fundamental techniques and building blocks for Natural Language Processing (NLP)? In this Real Python episode I talk about how machine learning (ML) models understand text.
I explain how ML models require data in a structured format, which means transforming text documents into rows and columns. I cover the most straightforward approach, called binary vectorization. We discuss the bag-of-words method and the techniques of stemming, lemmatization, and count vectorization.
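To make the idea concrete, here is a minimal from-scratch sketch of binary and count vectorization. It uses only the standard library, and the sample sentences are made up for illustration; libraries like scikit-learn provide production-ready versions of the same idea.

```python
# From-scratch binary and count vectorization (illustrative toy example).
from collections import Counter

docs = [
    "python makes text processing fun",
    "text classification needs structured text data",
]

# Build a fixed vocabulary: one column per unique word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Count vectorization: how many times each vocabulary word appears."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def binary_vector(doc):
    """Binary vectorization: 1 if the word appears at all, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocab]

# Each document becomes one row; each vocabulary word is one column.
rows = [count_vector(doc) for doc in docs]
```

Note how the second document yields a count of 2 for "text" under count vectorization but just 1 under binary vectorization; that difference is the whole distinction between the two approaches.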
We jump into word embedding models next. I talk about WordNet, the Natural Language Toolkit (NLTK), word2vec, and Gensim. Our conversation lays a foundation for getting started with text classification, implementing sentiment analysis, and building projects with these tools. I also share multiple resources to help you continue exploring NLP and modeling.
