Python text mining

To be successful in this course, you will need to know how to program in Python. The expectation is that you have completed the first three courses in this Applied Data Science with Python series, specifically Course 1 (Introduction to Data Science in Python) and Course 3 (Applied Machine Learning in Python), so that you are familiar with the numpy and pandas libraries for data manipulation and the scikit-learn toolkit for machine learning algorithms.

Primitive constructs in Text

Sentences / input strings
Words or Tokens
Characters
Document, larger files

This course covers all of these concepts and their properties, using Python.
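As a rough illustration of the constructs listed above, the sketch below maps them onto plain Python strings. The sample text, the variable names, and the naive splitting rules are assumptions made for this example, not material from the course.

```python
# Hypothetical sample document; any string would do.
document = "Ethics are built right into the ideals of the UN. Text mining is fun."

# Sentences: naive split on ". " (real sentence splitting is harder).
sentences = [s.strip() for s in document.split(". ") if s.strip()]

# Words / tokens: whitespace split of the first sentence.
words = sentences[0].split(" ")

# Characters: a string is already a sequence of characters.
characters = list(words[0])

print(len(sentences))  # number of sentences found
print(len(words))      # number of tokens in the first sentence
print(characters)      # characters of the first token
```

Splitting on a fixed delimiter is only a starting point; abbreviations and punctuation usually call for a proper tokenizer.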

Week 1

Useful links:

Week 2

In this module, we will talk about natural language.

What is Natural Language?

Language used for everyday communication by humans
- English
- Chinese
- Spanish

as opposed to artificial computer languages

Natural Language Processing: any computation or manipulation of natural language
Natural languages evolve
- new words get added
- old words lose popularity
- language rules themselves may change.

NLP Tasks: A Broad Spectrum

Counting words, counting frequency of words
Finding sentence boundaries
Part of speech tagging
Parsing the sentence structure
Identifying semantic roles
Identifying entities in a sentence
Finding which pronoun refers to which entity
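The first task in this list, counting words and their frequencies, can be sketched with the standard library alone. The sample text and the lowercase-letters-only token rule are assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "the cat sat on the mat. the dog sat too."

# Lowercase the text and keep runs of letters as tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Counter maps each distinct token to its number of occurrences.
freq = Counter(tokens)

# The two most frequent tokens with their counts.
top = freq.most_common(2)
print(top)
```

NLTK offers a similar frequency structure of its own; `Counter` is used here only so the sketch has no external dependencies.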

An Introduction to NLTK

NLTK: Natural Language Toolkit
Open source library in Python
Has support for most NLP tasks
Also provides access to numerous text corpora

Usage of NLTK

Import the library:
import nltk
Download some text corpora:
nltk.download()

from nltk.book import *

For more information, see the Week 2 lab.

Tokenization

Recall splitting a sentence into words / tokens
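To make the idea concrete, here is a minimal regex-based tokenizer compared with a plain whitespace split. The pattern, the function name `tokenize`, and the sample sentence are assumptions for illustration; NLTK's own `word_tokenize` follows different conventions (for example, it splits "shouldn't" into "should" and "n't").

```python
import re

# Hypothetical sample sentence.
sentence = "Children shouldn't drink a sugary drink before bed."

# Whitespace split keeps punctuation glued to words ("bed.").
naive_tokens = sentence.split()

# A word with an optional internal apostrophe, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Return word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize(sentence)
print(naive_tokens)
print(tokens)
```

The regex version separates the final period into its own token, which is usually what later steps such as tagging expect.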

Part-of-speech (POS) Tagging

Recall high school grammar: nouns, verbs, adjectives,...

Ambiguity in POS Tagging

Parsing Sentence Structure

Ambiguity in Parsing

POS Tagging & Parsing Complexity

Take Home Concepts

Examples of Text Classification

Supervised Learning

Supervised Classification Step

Supervised Classification Model

Divide Dataset in two parts

Classification paradigms

Questions to ask in Supervised Learning

Why is textual data unique?

Types of textual features (1)

Types of textual features (2)

Types of textual features (3)

Naive Bayes Classifiers

Case study: Classifying text search queries

Probabilistic model

Bayes' Rule
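For reference, Bayes' rule in the form usually used for classification, where $y$ is a class label and $X$ is the observed data. The symbols are an assumed notation and may differ from the course slides.

```latex
P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)}
```

Here $P(y)$ is the prior, $P(X \mid y)$ the likelihood, and $P(y \mid X)$ the posterior. Since $P(X)$ is the same for every class, it can be ignored when only the most probable class is needed.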

Naive Bayes Classification

Example classification

Naïve Bayes: Learning parameters

Naïve Bayes: Smoothing
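A minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. The training queries, the labels, and the function names `train_nb` and `predict` are hypothetical; this is not scikit-learn's or NLTK's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (query, label) pairs.
train = [
    ("python download", "computer science"),
    ("python snake", "zoology"),
    ("python programming language", "computer science"),
    ("snake venom", "zoology"),
]

def train_nb(examples, alpha=1.0):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab, alpha

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    class_counts, word_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        # Smoothed denominator: class word total plus alpha per vocabulary word.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(train)
print(predict(model, "python download"))
```

Without smoothing, a word that never occurs with a class (such as "download" with "zoology") would give a probability of zero and make the logarithm undefined; adding `alpha` to every count avoids that.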

Take Home Concepts

Two Naïve Bayes Variants For Text

Support Vector Machine

Decision Boundaries

Choosing a Decision Boundary

Finding a Linear Boundary

SVM: Multi-class classification

SVM Parameters (1): Parameter C

SVM Parameters (2): Other Parameters

Take Home Messages

Using Sklearn's NaiveBayesClassifier

Using Sklearn's SVM Classifier

Model Selection in Scikit-learn

Supervised Text Classification in NLTK

Using NLTK's NaiveBayesClassifier

Using NLTK's SklearnClassifier

Take Home Concepts

Application of semantic similarity

WordNet

Semantic similarity using WordNet

Coming back to our deer example

Similarity with NLP in Python

Distributional Similarity: Context

Strength of association between words
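One common measure of association strength is pointwise mutual information (PMI), defined as log [P(w, c) / (P(w) P(c))]. The sketch below estimates it from co-occurrence counts; the toy context windows, the choice of log base 2, and the function name `pmi` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each inner list is one context window.
windows = [
    ["deer", "forest"],
    ["deer", "forest"],
    ["deer", "road"],
    ["car", "road"],
]

# Number of windows in which each word appears.
word_counts = Counter(w for win in windows for w in set(win))

# Number of windows in which each unordered pair of words co-occurs.
pair_counts = Counter()
for win in windows:
    items = sorted(set(win))
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            pair_counts[(items[i], items[j])] += 1

n = len(windows)

def pmi(w1, w2):
    """PMI in bits, estimated from window-level probabilities."""
    pair = tuple(sorted((w1, w2)))
    p_pair = pair_counts[pair] / n
    if p_pair == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2(p_pair / ((word_counts[w1] / n) * (word_counts[w2] / n)))

print(pmi("deer", "forest"))
print(pmi("deer", "road"))
```

A positive score means the two words co-occur more often than their individual frequencies would suggest, and a negative score means less often.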

Take Home Concepts

What is Topic Modeling?

Generative Models for Text

Generative Model can be complex

Latent Dirichlet Allocation (LDA)

Topic Modeling in Practice

Topic Modeling: Summary

Working with LDA in Python

Take Home Concepts

Information is hidden in free-text

Information Extraction

Fields of Interest

Named Entity Recognition

Approaches to identify named entities

Relation extraction

Co-reference resolution

Question Answering

Take Home Concepts

Additional Resources & Readings
