Nlp, Python and period

**Paul Boddie** · Aug 4 '08, 10:25 AM

Re: Nlp, Python and period

On 4 Aug, 11:59, Fred Mangusta <a...@bbb.it> wrote:

> Hi,
>
> are you aware of any nlp packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (eg
> Mr., f...@home.co.uk, etc)?

I wouldn't mind finding out about such packages, either. I see that
NLTK offers a few options, with the following tokeniser being
interesting if you don't mind training the software:

http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer

There was also discussion of this topic on Ned Batchelder's blog a
while back:

http://nedbatchelder.com/blog/200804/separating_sentences.html

My comment on there (that I'm using a regular expression with some
postprocessing) still stands.

Paul

**John Machin** · Aug 4 '08, 10:25 AM

Re: Nlp, Python and period

On Aug 4, 7:59 pm, Fred Mangusta <a...@bbb.it> wrote:

> Hi,
>
> are you aware of any nlp packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (eg
> Mr., f...@home.co.uk, etc)?

google("python nltk") ... it may do what you want.

**Fred Mangusta** · Aug 4 '08, 10:35 AM

Re: Nlp, Python and period

Hi Paul,

thanks for replying. I'm interested in knowing more about your regex
approach, but as you point out in your comment, it seems that access to the
SourceForge mail archive is restricted. Is there any way I can read
about it? Would you be so kind as to cut and paste it here, for instance?

Thanks!
F.

Paul Boddie wrote:

> There was also discussion of this topic on Ned Batchelder's blog a
> while back:
>
> http://nedbatchelder.com/blog/200804/separating_sentences.html
>
> My comment on there (that I'm using a regular expression with some
> postprocessing) still stands.
>
> Paul

**Paul Boddie** · Aug 4 '08, 01:35 PM

Re: Nlp, Python and period

On 4 Aug, 12:34, Fred Mangusta <a...@bbb.it> wrote:

> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, it seems that access to the
> SourceForge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind as to cut and paste it here, for instance?

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

    import re

    sentence_pattern = re.compile(
        r'(' +
            r'[\(\"\[]*' +       # Quoting or bracketing (optional)
            r'[A-Z,a-z,0-9]' +   # Match sentence with specific start character
            r'.+?' +             # Match sentence content - "?" means non-greedy
            r'[\.\!\?]' +        # End of sentence
            r'[\)\"\]]*' +       # End quoting or bracketing
        r')' +
        r'(\s+)' +               # Spaces
        r'[\(\"\[]*' +           # Quoting or bracketing (optional)
        r'[A-Z,0-9]'             # Match sentence with specific start character
        )

This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations, and I maintain a list of these
against which I test the supposed ends of sentences that the regular
expression provides. In addition, I also try to detect initials (e.g.
G. van Rossum) which the regular expression may regard as the end of a
sentence.
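
For readers who want to experiment, here is a minimal sketch of how the regular expression and the postprocessing described above might fit together. It is not the code Paul actually uses: the abbreviation list, the initials check, and the names `ABBREVIATIONS`, `is_false_boundary` and `split_sentences` are all hypothetical, chosen only for illustration. The pattern is copied from the post unchanged (including the commas inside the character classes, which match a literal comma). Because the pattern consumes the first character of the following sentence, the sketch restarts each search at the end of the whitespace group instead of using `finditer`.

```python
import re

# Pattern from the post: group 1 is the candidate sentence,
# group 2 the whitespace that follows it.
SENTENCE_PATTERN = re.compile(
    r'('
        r'[\(\"\[]*'        # opening quote or bracket (optional)
        r'[A-Z,a-z,0-9]'    # first character of the sentence
        r'.+?'              # sentence content, non-greedy
        r'[\.\!\?]'         # end-of-sentence punctuation
        r'[\)\"\]]*'        # closing quote or bracket (optional)
    r')'
    r'(\s+)'                # whitespace between sentences
    r'[\(\"\[]*'            # opening quote or bracket (optional)
    r'[A-Z,0-9]'            # first character of the next sentence
)

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof.", "e.g.", "i.e.", "etc."}

# A single capital letter plus a period is treated as an initial ("G.").
INITIAL = re.compile(r'^[A-Z]\.$')


def is_false_boundary(candidate):
    """Return True if the candidate ends in an abbreviation or an initial."""
    last_word = candidate.split()[-1].strip('()[]"')
    return last_word in ABBREVIATIONS or INITIAL.match(last_word) is not None


def split_sentences(text):
    """Split text into sentences using the regex plus postprocessing."""
    sentences = []
    start = 0  # where the current sentence begins
    pos = 0    # where to look for the next candidate boundary
    while True:
        match = SENTENCE_PATTERN.search(text, pos)
        if match is None:
            break
        candidate = text[start:match.end(1)].strip()
        pos = match.end(2)  # resume at the start of the next word
        if is_false_boundary(candidate):
            continue        # keep extending the current sentence
        sentences.append(candidate)
        start = pos
    # The pattern needs a following sentence, so the last one is left over.
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences
```

With these assumptions, `split_sentences("Mr. Smith met G. Rossum yesterday. They talked for hours! Was it fun?")` returns `["Mr. Smith met G. Rossum yesterday.", "They talked for hours!", "Was it fun?"]`. The heuristics are crude: a sentence that really does end in "etc." or in a single capital letter will be glued onto the next one.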

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul
