
A fast and accurate rule-based sentence segmentation tool for Ruby.

louismullie/scalpel

About

Scalpel is the result of my inability to find a simple and elegant solution to sentence segmentation in Ruby. Machine learning approaches, both unsupervised (punkt-segmenter) and supervised (tactful_tokenizer), depend on proper domain-specific training to work well. Stanford's tokenize-first, group-later method (stanford-core-nlp) does not cope well with ill-formatted content. Finally, extensive rule-based methods (srx-english) are very accurate but slow.

Scalpel is based on a very simple principle that reduces the complexity of sentence segmentation: it is simpler and more efficient to find the periods that do not indicate the end of a sentence than to find those that do. These occurrences are temporarily replaced by "placeholder" characters, the text is then split into sentences, and finally the placeholders are replaced by the original characters.
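The principle above can be sketched in a few lines of Ruby. This is only an illustration of the mask, split, restore idea and not Scalpel's actual rule set: the module name `PlaceholderSplitter`, the tiny abbreviation list, the decimal-number rule, and the choice of `"\u0001"` as the placeholder are all assumptions made for the example.

```ruby
# Hypothetical sketch of the placeholder technique; NOT Scalpel's real rules.
module PlaceholderSplitter
  # Assumed never to appear in the input text.
  PLACEHOLDER = "\u0001"

  # Tiny, made-up list of abbreviations whose period does not end a sentence.
  ABBREVIATIONS = %w[Mr Mrs Dr Prof].freeze

  ABBREVIATION_PERIOD = /\b(#{ABBREVIATIONS.join('|')})\./
  DECIMAL_PERIOD      = /(?<=\d)\.(?=\d)/

  def self.split(text)
    # 1. Mask the periods that do not indicate the end of a sentence.
    masked = text
             .gsub(ABBREVIATION_PERIOD) { "#{Regexp.last_match(1)}#{PLACEHOLDER}" }
             .gsub(DECIMAL_PERIOD, PLACEHOLDER)

    # 2. Split on whitespace that follows the remaining terminal punctuation.
    sentences = masked.split(/(?<=[.!?])\s+/)

    # 3. Put the original periods back.
    sentences.map { |sentence| sentence.tr(PLACEHOLDER, '.') }
  end
end

p PlaceholderSplitter.split('Dr. Smith paid $3.50 for coffee. Then he left.')
# → ["Dr. Smith paid $3.50 for coffee.", "Then he left."]
```

Because the non-terminal periods are masked first, the split step itself can stay a single, cheap regular expression; all of the language-specific knowledge lives in the masking rules.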

Usage

Install the gem:

gem install scalpel

Then require it and pass the text to Scalpel.cut:

require 'scalpel'
Scalpel.cut("some text")

Contributing

Feel free to fork the project and send me a pull request!
