camara94/python-text-mining

Python text mining

In order to be successful in this course, you will need to know how to program in Python. The expectation is that you have completed the first three courses in this Applied Data Science with Python series, specifically Course 1 on Introduction to Data Science in Python and Course 3 on Applied Machine Learning in Python, so that you are familiar with the numpy and pandas Python libraries for data manipulation, and scikit-learn toolkit for machine learning algorithms.

Primitive constructs in Text

  • Sentences / input strings
  • Words or Tokens
  • Characters
  • Document, larger files

This course covers all of these concepts and their properties, using Python.
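As a rough illustration of these levels, the sketch below breaks one short document into sentences, words and characters using only the standard `re` module. The sample text and the splitting rules are assumptions chosen for brevity, not the course's own method.

```python
import re

# Hypothetical sample document (not from the course material).
document = "Text mining is fun. It starts with raw strings! Do you agree?"

# Sentences: split on whitespace that follows ., ! or ? (a rough heuristic).
sentences = re.split(r"(?<=[.!?])\s+", document.strip())

# Words / tokens: runs of letters, digits or apostrophes.
words = re.findall(r"[A-Za-z0-9']+", document)

# Characters: a string is already a sequence of characters.
characters = list(document)

print(len(sentences), "sentences")
print(len(words), "words")
print(len(characters), "characters")
```

Real text needs more care (abbreviations such as "Dr." break the sentence heuristic), which is one reason dedicated tokenizers exist.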

Week 1

Useful links:

  1. https://docs.python.org/3/library/re.html

  2. https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

  3. https://ieva.rocks/2016/08/07/cleaning-text-for-nlp/

  4. https://chrisalbon.com/python/cleaning_text.html
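The links above deal with regular expressions and text cleaning. A minimal sketch of a cleaning step with the standard `re` module might look like the following; the function name `clean_text` and the exact cleaning rules are assumptions for illustration only.

```python
import re

def clean_text(raw):
    """Lowercase, drop HTML tags, URLs and punctuation, collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(clean_text("<p>Visit https://example.com NOW!!!  It's   great.</p>"))
# prints: visit now it s great
```

Note that stripping the apostrophe turns "it's" into two tokens; whether that is acceptable depends on the task.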

Week 2

In this module, we will talk about natural language.

What is Natural Language?

  • Language used for everyday communication by humans
    • English
    • Chinese
    • Spanish

as opposed to artificial computer languages.

What is Natural Language Processing?

  • Any computation or manipulation of natural language
  • Natural languages evolve
    • new words get added
    • old words lose popularity
    • language rules themselves may change

NLP Tasks: A Broad Spectrum

  • Counting words, counting frequency of words
  • Finding sentence boundaries
  • Part of speech tagging
  • Parsing the sentence structure
  • Identifying semantic roles
  • Identifying entities in a sentence
  • Finding which pronoun refers to which entity
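The simplest task in this list, counting word frequencies, can be sketched with the standard library alone. The sample sentence and the lowercase-letters token pattern are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "The cat sat on the mat. The dog sat too."

# Lowercase first so that "The" and "the" are counted together.
tokens = re.findall(r"[a-z']+", text.lower())
freq = Counter(tokens)

print(freq.most_common(2))
# prints: [('the', 3), ('sat', 2)]
```

The later tasks in the list (tagging, parsing, co-reference) need linguistic models and are where a toolkit such as NLTK becomes useful.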

An Introduction to NLTK

  • NLTK: Natural Language Toolkit
  • Open source library in Python
  • Has support for most NLP tasks
  • Also provides access to numerous text corpora

Usage of NLTK

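  • Import
    import nltk

  • Let's get some text corpora
    nltk.download()  # opens the NLTK downloader to fetch corpora

    from nltk.book import *  # assumption: loading the sample texts was the intended import

    For more information, see the Week 2 lab.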

Tokenization

  • Recall splitting a sentence into words / tokens
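To make the idea concrete without requiring any downloads, here is a small regex-based tokenizer. It is only a stand-in: `simple_tokenize` is a hypothetical helper, and NLTK's `nltk.word_tokenize` covers many more cases (for example, it handles contractions differently).

```python
import re

def simple_tokenize(sentence):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

tokens = simple_tokenize("Children shouldn't drink a sugary drink before bed.")
print(tokens)
```

Compared with a plain `str.split()`, this separates the final period from "bed", so the word and the punctuation become distinct tokens.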

Part-of-speech (POS) Tagging

  • Recall high school grammar: nouns, verbs, adjectives,...
    image 2

Ambiguity in POS Tagging

image 3

Parsing Sentence Structure

image 4

Ambiguity in Parsing

image 5

image 6

POS Tagging & Parsing Complexity

image 8

Take Home Concepts

image 10

Examples of Text Classification

image  11

Supervised Learning

image  12

Supervised Classification Step

image  14

Supervised Classification Model

image  15

Divide the dataset into two parts

image  16
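A minimal sketch of such a split, assuming a shuffled hold-out of 25% of the labeled examples. The helper `split_dataset` and the toy data are hypothetical; scikit-learn offers `sklearn.model_selection.train_test_split` for the same purpose.

```python
import random

def split_dataset(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of items and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled examples: (text, label).
labeled = [
    ("free money now", "spam"), ("meeting at noon", "ham"),
    ("win a prize", "spam"), ("lunch tomorrow?", "ham"),
    ("cheap pills", "spam"), ("project update", "ham"),
    ("claim your reward", "spam"), ("see you soon", "ham"),
]

train, test = split_dataset(labeled)
print(len(train), "training examples,", len(test), "test examples")
```

The model is fitted on `train` only; `test` stays untouched until evaluation so that the reported accuracy reflects unseen data.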

Classification paradigms

image  19

Questions to ask in Supervised Learning

image 20

Why is textual data unique?

image 22

Types of textual features (1)

image 23

Types of textual features (2)

image 24

Types of textual features (3)

image 25
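The slides are not reproduced here, so the following sketch assumes that word counts and word bigrams are among the feature types meant. The function name `extract_features` is hypothetical.

```python
from collections import Counter

def extract_features(text):
    """Return (unigram counts, bigram counts) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    # Adjacent word pairs capture phrases such as "white house".
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

unigrams, bigrams = extract_features("The White House is white")
print(unigrams)
print(bigrams)
```

Lowercasing here merges "White" and "white"; in tasks where capitalization carries meaning (for example proper nouns), that normalization may discard a useful signal.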

Naive Bayes Classifiers

Case study: Classifying text search queries

image 26

image 27

Probabilistic model

image 29

Bayes' Rule

image 30
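For reference, Bayes' rule in its standard form, written with $y$ for a class label and $X$ for the observed text (the symbol names are assumptions, since the slide itself is not reproduced here):

```latex
P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)}
\qquad\text{that is,}\qquad
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
```

Because $P(X)$ is the same for every class, comparing classes only requires the numerator $P(y)\,P(X \mid y)$.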

Naive Bayes Classification

image 32

image 33

Example classification

image 34

Naïve Bayes: Learning parameters

image 35

image 37

Naïve Bayes: Smoothing

image 38
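A minimal sketch of a multinomial Naive Bayes classifier with Laplace (add-one) smoothing, written from the standard textbook formulation rather than taken from the slides. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocab."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def predict_nb(tokens, priors, word_counts, vocab, alpha=1.0):
    """Return the label with the highest smoothed log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-alpha smoothing keeps unseen (word, class) pairs above zero.
            p = (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training queries.
docs = [
    ("python download".split(), "cs"),
    ("python code tutorial".split(), "cs"),
    ("python snake habitat".split(), "zoology"),
    ("snake venom".split(), "zoology"),
]
model = train_nb(docs)
print(predict_nb("python tutorial download".split(), *model))
print(predict_nb("python venom".split(), *model))
```

Without smoothing, "venom" would have probability zero under the "cs" class and "tutorial" would have probability zero under "zoology", so a single unseen word would veto a class outright.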

Take Home Concepts

image 39

Two Naïve Bayes Variants For Text

image 40

Support Vector Machine

Decision Boundaries

image 41

Choosing a Decision Boundary

image 42

Finding a Linear Boundary

image 44

image 45

image 46

SVM: Multi-class classification

image 47

image 48

image 49

image 50

image 51

image 52

SVM Parameters (1): Parameter C

image 53

SVM Parameters (2): Other Parameters

image 54

Take Home Messages

image 55

Using Sklearn's Naive Bayes Classifier

image 57

Using Sklearn's SVM Classifier

image 58

Model Selection in Scikit-learn

image 60

image 59
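One common model selection technique is k-fold cross-validation. The sketch below shows only how the folds are formed, in plain Python; `k_fold_indices` is a hypothetical helper, and in practice scikit-learn's `sklearn.model_selection.cross_val_score` handles both the splitting and the scoring.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, so averaging the k scores uses every labeled example for evaluation once.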

Supervised Text Classification in NLTK

image 61

Using NLTK's NaiveBayesClassifier

image 62

Using NLTK's SklearnClassifier

image 63

Take Home Concept

image 65

Application of semantic similarity

image 68

WordNet

image 69

Semantic similarity using WordNet

image 70
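A sketch of path-based similarity, computed as 1 / (shortest path length + 1), on a tiny hand-made hierarchy. The `parent` mapping below is a hypothetical taxonomy invented for illustration, not WordNet's actual structure; with NLTK installed, the synset method `path_similarity` from `nltk.corpus.wordnet` computes this on the real hierarchy.

```python
# Hypothetical hypernym tree: child -> parent (None marks the root).
parent = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
    "ungulate": None,
}

def path_to_root(node, parent):
    """Return the list of nodes from node up to the root."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def path_similarity(a, b, parent):
    """1 / (number of edges between a and b via their lowest common ancestor + 1)."""
    depth_b = {n: d for d, n in enumerate(path_to_root(b, parent))}
    for steps_a, node in enumerate(path_to_root(a, parent)):
        if node in depth_b:
            return 1.0 / (steps_a + depth_b[node] + 1)
    return 0.0  # no common ancestor

for other in ("elk", "giraffe", "horse"):
    print("deer vs", other, round(path_similarity("deer", other, parent), 2))
```

In this toy tree, "deer" is closest to "elk" (one edge apart) and farthest from "horse" (six edges apart).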

Coming back to our deer example

image 71

Similarity with NLP in Python

image 66

Distributional Similarity: Context

image 72

Strength of association between words

image 75
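One widely used measure of association strength is pointwise mutual information, PMI(w, c) = log(P(w, c) / (P(w) P(c))). The slide is not reproduced here, so treating PMI as the measure in question is an assumption, and the co-occurrence counts below are invented.

```python
import math
from collections import Counter

def pmi(pair_counts, w, c):
    """Pointwise mutual information (natural log) from co-occurrence counts."""
    total = sum(pair_counts.values())
    p_wc = pair_counts[(w, c)] / total
    p_w = sum(n for (a, _), n in pair_counts.items() if a == w) / total
    p_c = sum(n for (_, b), n in pair_counts.items() if b == c) / total
    if p_wc == 0:
        return -math.inf  # the pair never co-occurs
    return math.log(p_wc / (p_w * p_c))

# Hypothetical (word, context) co-occurrence counts.
pair_counts = Counter({
    ("strong", "tea"): 3,
    ("powerful", "tea"): 1,
    ("strong", "computer"): 1,
    ("powerful", "computer"): 3,
})

print(pmi(pair_counts, "strong", "tea"))
print(pmi(pair_counts, "strong", "computer"))
```

A positive value means the pair co-occurs more often than independence would predict; a negative value means less often.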

Take Home Concepts

image 67

What is Topic Modeling?

image 76

image 77

image 78

Generative Models for Text

image 79

Generative Model can be complex

image 80

Latent Dirichlet Allocation (LDA)

image 81

Topic Modeling in Practice

image 82

Topic Modeling: Summary

image 83

Working with LDA in Python

image 86

image 88

Take Home Concepts

image 89

Information is hidden in free-text

image 90

Information Extraction

image 92

Fields of Interest

image 93

Named Entity Recognition

image 94

Approaches to identify named entities

image 95

image 96
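For well-formatted entity types, regular expressions are one common approach. The sketch below assumes two specific formats (ISO-style dates and US-style phone numbers); the patterns and the helper `find_entities` are hypothetical, and less regular entities such as person or organization names usually call for a trained classifier instead.

```python
import re

# Assumed formats: YYYY-MM-DD dates and (NNN) NNN-NNNN phone numbers.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def find_entities(text):
    """Return a list of (entity_type, matched_text) pairs."""
    entities = [("DATE", m.group()) for m in DATE.finditer(text)]
    entities += [("PHONE", m.group()) for m in PHONE.finditer(text)]
    return entities

print(find_entities("Call (555) 123-4567 before 2017-03-15."))
# prints: [('DATE', '2017-03-15'), ('PHONE', '(555) 123-4567')]
```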

Relation extraction

image 97

Co-reference resolution

image 98

Question Answering

image 99

Take Home Concepts

image 100

Additional Resources & Readings
