TextPrettifier

TextPrettifier is a Python library for cleaning text data. It removes emojis, internet words, HTML tags, URLs, numbers, special characters, and stopwords, and it expands contractions.

TextPrettifier Key Features

1. Removing Emojis

The remove_emojis method removes emojis from the text.

2. Removing Internet Words

The remove_internet_words method removes internet-specific words from the text.

3. Removing HTML Tags

The remove_html_tags method removes HTML tags from the text.

4. Removing URLs

The remove_urls method removes URLs from the text.

5. Removing Numbers

The remove_numbers method removes numbers from the text.

6. Removing Special Characters

The remove_special_chars method removes special characters from the text.

7. Expanding Contractions

The remove_contractions method expands contractions in the text.

8. Removing Stopwords

The remove_stopwords method removes stopwords from the text.

Additional Functionality

The sigma_cleaner method accepts two optional flags, is_lower and is_token, which control the form of its output:

  • If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
  • If only is_lower is True, the text is returned in lowercase.
  • If only is_token is True, the text is returned as a list of tokens.
  • If neither is_lower nor is_token is True, the text is returned as is.
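The four cases above reduce to two independent post-processing steps. Here is a minimal sketch, assuming whitespace tokenization (the library's actual tokenizer may differ); the helper name `finalize` is hypothetical and is not part of TextPrettifier:

```python
from typing import List, Union


def finalize(text: str, is_lower: bool = False, is_token: bool = False) -> Union[str, List[str]]:
    """Hypothetical helper mirroring the is_lower / is_token behavior described above."""
    if is_lower:
        text = text.lower()
    if is_token:
        # Assumption: tokens are whitespace-separated; the library may tokenize differently.
        return text.split()
    return text


print(finalize("Hello World", is_lower=True, is_token=True))  # ['hello', 'world']
print(finalize("Hello World", is_lower=True))                 # hello world
print(finalize("Hello World", is_token=True))                 # ['Hello', 'World']
print(finalize("Hello World"))                                # Hello World
```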

Installation

You can install TextPrettifier using pip:

pip install text-prettifier

Then import the class:

from text_prettifier import TextPrettifier

Initialize TextPrettifier

text_prettifier = TextPrettifier()

Example: Remove Emojis

emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)

Output Hi,Pythonogist! I Python.

Example: Remove HTML tags

html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)

Output Hello, world!
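For readers curious how tag stripping can work, here is a standard-library sketch. It is not TextPrettifier's implementation: the function name `strip_tags` is hypothetical, and a regular expression like this is only adequate for simple markup.

```python
import re

# Matches anything that looks like a tag: "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Hypothetical helper: drop HTML tags and keep the text between them."""
    return TAG_RE.sub("", text)


print(strip_tags("<p>Hello, <b>world</b>!</p>"))  # Hello, world!
```

For malformed or deeply nested HTML, a real parser such as the standard library's html.parser is usually a safer choice than a regex.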

Example: Remove URLs

url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)

Output Visit our website at
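A comparable effect can be sketched with a single regular expression. This is an illustration only, not the library's code: `strip_urls` is a hypothetical name, and the pattern assumes a URL starts with http://, https://, or www. and runs to the next whitespace.

```python
import re

# Assumption: a URL starts with http://, https://, or www. and ends at whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Hypothetical helper: remove URLs, then trim leftover edge whitespace."""
    return URL_RE.sub("", text).strip()


print(strip_urls("Visit our website at https://example.com"))  # Visit our website at
```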

Example: Remove numbers

number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)

Output There are apples
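Removing digits usually leaves a double space behind, so a cleaner typically collapses whitespace afterwards. The sketch below shows one way to do both steps with the standard library; `strip_numbers` is a hypothetical name, and whether TextPrettifier normalizes whitespace the same way is an assumption.

```python
import re


def strip_numbers(text: str) -> str:
    """Hypothetical helper: remove digit runs and collapse the gaps they leave."""
    without_digits = re.sub(r"\d+", "", text)
    # Removing "123" leaves two adjacent spaces; split/join collapses them.
    return " ".join(without_digits.split())


print(strip_numbers("There are 123 apples"))  # There are apples
```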

Example: Remove special characters

special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)

Output Hello world
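What counts as a "special character" depends on the pattern used. As a rough sketch (not the library's implementation; `strip_special_chars` is a hypothetical name), the version below keeps only ASCII letters, digits, and whitespace, which also means accented and other non-ASCII letters would be dropped:

```python
import re


def strip_special_chars(text: str) -> str:
    """Hypothetical helper: keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


print(strip_special_chars("Hello, @world!"))  # Hello world
```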

Example: Expand contractions

contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)

Output I cannot do it

Example: Remove stopwords

stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)

Output This test

Example: Apply all cleaning methods

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)

Output Hello world 123 apples cannot test

To get the cleaned text in lowercase and as a list of tokens, pass is_token=True and is_lower=True:

all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text,is_token=True,is_lower=True)
print(all_cleaned)

Output ['hello', 'world', '123', 'apples', 'cannot', 'test']

Note: remove_numbers is not included in sigma_cleaner because numbers sometimes carry useful information in NLP tasks. If you want to remove numbers, apply remove_numbers separately to the output of sigma_cleaner.
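The two-step approach from the note can be wrapped in a small helper. This sketch uses only the two methods shown in this README, sigma_cleaner and remove_numbers. The helper name `clean_without_numbers` is hypothetical, and it assumes sigma_cleaner is called without is_token so that remove_numbers receives a string rather than a list.

```python
def clean_without_numbers(prettifier, text: str) -> str:
    """Hypothetical helper: run sigma_cleaner, then strip numbers from its output.

    `prettifier` is expected to be a TextPrettifier instance.
    """
    cleaned = prettifier.sigma_cleaner(text)   # no flags set: returns a string
    return prettifier.remove_numbers(cleaned)  # second pass removes the digits


# Usage (requires `pip install text-prettifier`):
# from text_prettifier import TextPrettifier
# print(clean_without_numbers(TextPrettifier(), "There are 123 apples."))
```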

Contact Information

Feel free to reach out to me on social media:

GitHub LinkedIn Twitter Facebook

License

This project is licensed under the MIT License - see the LICENSE file for details.