TextPrettifier is a Python library for cleaning text data. It removes emojis, internet-specific words, HTML tags, URLs, numbers, special characters, and stopwords, and it expands contractions.
The remove_emojis method removes emojis from the text.
The remove_internet_words method removes internet-specific words from the text.
The remove_html_tags method removes HTML tags from the text.
The remove_urls method removes URLs from the text.
The remove_numbers method removes numbers from the text.
The remove_special_chars method removes special characters from the text.
The remove_contractions method expands contractions in the text.
The remove_stopwords method removes stopwords from the text.
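To give a feel for what cleaners like these do, here is a minimal regex-based sketch of three of them. It is an illustration only, not TextPrettifier's actual implementation: the helper names (strip_html_tags, strip_urls, strip_numbers, collapse_spaces) and the patterns are assumptions, and the library's handling of whitespace and edge cases may differ.

```python
import re

# Illustrative stand-ins only; these are NOT TextPrettifier's own code.
TAG_RE = re.compile(r"<[^>]+>")                  # anything shaped like <...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # http(s) links and bare www. links
NUMBER_RE = re.compile(r"\d+")                   # runs of digits


def strip_html_tags(text: str) -> str:
    """Drop anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def strip_urls(text: str) -> str:
    """Drop http(s):// and www. style URLs."""
    return URL_RE.sub("", text)


def strip_numbers(text: str) -> str:
    """Drop runs of digits."""
    return NUMBER_RE.sub("", text)


def collapse_spaces(text: str) -> str:
    """Squeeze the whitespace gaps left behind by a removal."""
    return " ".join(text.split())


print(strip_html_tags("<p>Hello, <b>world</b>!</p>"))
print(collapse_spaces(strip_urls("Visit our website at https://example.com")))
print(collapse_spaces(strip_numbers("There are 123 apples")))
```

Simple patterns like these are easy to read but do not cope with malformed HTML or unusual URL schemes, which is one reason to prefer a tested library method over ad hoc regexes.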
sigma_cleaner accepts the flags is_lower and is_token, which control the form of the returned text:
- If is_lower and is_token are both True, the text is returned in lowercase and as a list of tokens.
- If only is_lower is True, the text is returned in lowercase.
- If only is_token is True, the text is returned as a list of tokens.
- If neither is_lower nor is_token is True, the text is returned as is.
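The four cases above reduce to two independent switches. The sketch below shows that logic in isolation; the function name finalize is hypothetical, and whitespace splitting is an assumed stand-in for whatever tokenizer the library really uses.

```python
from typing import List, Union


def finalize(text: str, is_lower: bool = False, is_token: bool = False) -> Union[str, List[str]]:
    """Apply the lowercase and tokenize switches to already-cleaned text.

    Hypothetical helper: whitespace splitting is an assumption, not
    necessarily the tokenizer TextPrettifier uses.
    """
    if is_lower:
        text = text.lower()
    if is_token:
        return text.split()
    return text


sample = "Hello world 123 apples"
print(finalize(sample))                               # unchanged string
print(finalize(sample, is_lower=True))                # lowercase string
print(finalize(sample, is_token=True))                # list of tokens
print(finalize(sample, is_lower=True, is_token=True)) # lowercase list of tokens
```

Because the two flags do not interact, each can be reasoned about on its own: is_lower changes the casing, is_token changes the return type.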
You can install TextPrettifier using pip:
pip install text-prettifier
from text_prettifier import TextPrettifier
text_prettifier = TextPrettifier()
emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_emojis = text_prettifier.remove_emojis(emoji_text)
print(cleaned_emojis)
Output Hi,Pythonogist! I Python.
html_text = "<p>Hello, <b>world</b>!</p>"
cleaned_html = text_prettifier.remove_html_tags(html_text)
print(cleaned_html)
Output Hello,world!
url_text = "Visit our website at https://example.com"
cleaned_urls = text_prettifier.remove_urls(url_text)
print(cleaned_urls)
Output Visit our website at
number_text = "There are 123 apples"
cleaned_numbers = text_prettifier.remove_numbers(number_text)
print(cleaned_numbers)
Output There are apples
special_text = "Hello, @world!"
cleaned_special = text_prettifier.remove_special_chars(special_text)
print(cleaned_special)
Output Hello world
contraction_text = "I can't do it"
cleaned_contractions = text_prettifier.remove_contractions(contraction_text)
print(cleaned_contractions)
Output I cannot do it
stopwords_text = "This is a test"
cleaned_stopwords = text_prettifier.remove_stopwords(stopwords_text)
print(cleaned_stopwords)
Output This test
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text)
print(all_cleaned)
Output Hello world 123 apples cannot test
To get the cleaned text lowercased and tokenized, pass is_token=True and is_lower=True:
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(all_cleaned)
Output ['hello', 'world', '123', 'apples', 'cannot', 'test']
Note: remove_numbers is not included in sigma_cleaner because numbers sometimes carry useful information in NLP tasks. If you want to remove numbers, apply remove_numbers separately to the output of sigma_cleaner.
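As a rough sketch of that follow-up step, the helper below strips digits from cleaned output in either form, a string or a list of tokens. The name drop_numbers and its regex are assumptions made for illustration. With the library itself, the string case would be text_prettifier.remove_numbers(all_cleaned); this README does not say whether remove_numbers accepts a token list, so the list branch here is handled by hand.

```python
import re
from typing import List, Union

DIGITS_RE = re.compile(r"\d+")


def drop_numbers(cleaned: Union[str, List[str]]) -> Union[str, List[str]]:
    """Remove digit runs from a cleaned string or from a list of tokens.

    Hypothetical stand-in for applying remove_numbers after sigma_cleaner.
    """
    if isinstance(cleaned, str):
        # Remove digits, then squeeze the gap they leave behind.
        return " ".join(DIGITS_RE.sub("", cleaned).split())
    # Token list: strip digits per token and drop tokens that become empty.
    stripped = (DIGITS_RE.sub("", token) for token in cleaned)
    return [token for token in stripped if token]


print(drop_numbers("Hello world 123 apples cannot test"))
print(drop_numbers(["hello", "world", "123", "apples", "cannot", "test"]))
```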
This project is licensed under the MIT License - see the LICENSE file for details.