Elixir package for stripping whitespace and normalizing dirty text

fireproofsocks/stripper

Stripper

Stripper is an Elixir package for normalizing input from unpredictable sources (such as web scraping). It is useful as a pre-processing step in ETL pipelines for machine learning or data analysis. It is parser-based rather than regular-expression-based, so it does all its work in a single pass over the input, which should keep it fast.

Why the name? Because it describes the purpose and it's memorable -- get over it ;)

Examples

Normalizing whitespace:

iex> Stripper.Whitespace.normalize!("   random\tstuff\fI   scraped\t\t\tfrom\nthe web\n\n")
"random stuff I scraped from the web"

This reduces all Unicode whitespace and separator characters to the humble space. Runs of multiple spaces are collapsed into one, and, as the example shows, leading and trailing whitespace is removed.
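
To give a feel for what "one pass" means here, the following is a minimal sketch of single-pass whitespace collapsing using binary pattern matching. It is not Stripper's implementation: the module name WhitespaceSketch and the function collapse/1 are hypothetical, and it only recognizes a handful of ASCII whitespace characters, whereas Stripper covers Unicode whitespace and separators.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical illustration only; NOT part of the Stripper package.
  # Assumption: only these ASCII whitespace characters are treated as whitespace.
  @ws [?\s, ?\t, ?\n, ?\r, ?\f, ?\v]

  # Walks the binary once, emitting a single space for each run of whitespace
  # and dropping leading and trailing whitespace.
  def collapse(input) when is_binary(input) do
    input
    # Starting with prev_space? = true makes leading whitespace disappear.
    |> do_collapse([], true)
    |> drop_trailing_space()
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end

  # End of input: return the (reversed) accumulator.
  defp do_collapse(<<>>, acc, _prev_space?), do: acc

  # Whitespace: emit one space unless the previous output was already a space.
  defp do_collapse(<<c::utf8, rest::binary>>, acc, prev_space?) when c in @ws do
    if prev_space? do
      do_collapse(rest, acc, true)
    else
      do_collapse(rest, [?\s | acc], true)
    end
  end

  # Any other valid UTF-8 codepoint is copied through unchanged.
  defp do_collapse(<<c::utf8, rest::binary>>, acc, _prev_space?) do
    do_collapse(rest, [<<c::utf8>> | acc], false)
  end

  # Bytes that are not valid UTF-8 are copied through as-is.
  defp do_collapse(<<b, rest::binary>>, acc, _prev_space?) do
    do_collapse(rest, [b | acc], false)
  end

  # The accumulator is reversed, so a trailing space sits at its head.
  defp drop_trailing_space([?\s | rest]), do: rest
  defp drop_trailing_space(acc), do: acc
end
```

Each character is inspected exactly once and the only state carried along is whether the last emitted character was a space, which is why no second trimming or squeezing pass is needed.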

Simplifying quotes:

iex> Stripper.Quotes.normalize!(~S|‘make’ «it» „stop“|)
"'make' \"it\" \"stop\""

See the online documentation for more information.
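
As a rough illustration of the quote simplification shown in the example above, the sketch below maps a few typographic quote characters to their ASCII equivalents in one pass. It is not Stripper's implementation: the module name QuoteSketch, the function simplify/1, and the particular set of codepoints are assumptions chosen for illustration, and the package itself may recognize a different or larger set. The sketch also assumes the input is valid UTF-8.

```elixir
defmodule QuoteSketch do
  # Hypothetical illustration only; NOT part of the Stripper package.
  # Assumption: these codepoints are treated as single quotes
  # (U+2018, U+2019, U+201A, U+2039, U+203A).
  @single [0x2018, 0x2019, 0x201A, 0x2039, 0x203A]

  # Assumption: these codepoints are treated as double quotes
  # (U+201C, U+201D, U+201E, U+00AB, U+00BB).
  @double [0x201C, 0x201D, 0x201E, 0x00AB, 0x00BB]

  # Visits each codepoint once and rebuilds the string, replacing the
  # typographic quotes listed above with ' or ".
  def simplify(input) when is_binary(input) do
    for <<c::utf8 <- input>>, into: "" do
      cond do
        c in @single -> "'"
        c in @double -> "\""
        true -> <<c::utf8>>
      end
    end
  end
end
```

Calling QuoteSketch.simplify/1 on the string from the example above yields the same result as shown there, since every quote character in that input appears in one of the two lists.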

Installation

The package is available in Hex and can be installed by adding stripper to your list of dependencies in mix.exs:

def deps do
  [
    {:stripper, "~> 1.4.0"}
  ]
end

Contributing

See the Contributing Guidelines for more information.

Image Attribution

The logo image is "wire strippers" by Designs by MB from the Noun Project.
