Nlp, Python and period

This topic is closed.
  • Fred Mangusta

    Nlp, Python and period

    Hi,

    are you aware of any nlp packages or algorithms in Python to spot
    whether a '.' represents an end of sentence or rather something else (eg
    Mr., foo@home.co.uk, etc)?

    Thanks

    F.
  • Paul Boddie

    #2
    Re: Nlp, Python and period

    On 4 Aug, 11:59, Fred Mangusta <a...@bbb.it> wrote:
    > Hi,
    >
    > are you aware of any nlp packages or algorithms in Python to spot
    > whether a '.' represents an end of sentence or rather something else (eg
    > Mr., f...@home.co.uk, etc)?

    I wouldn't mind finding out about such packages, either. I see that
    NLTK offers a few options, with the following tokeniser being
    interesting if you don't mind training the software:



    There was also discussion of this topic on Ned Batchelder's blog a
    while back:

    One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.


    My comment on there (that I'm using a regular expression with some
    postprocessing) still stands.

    Paul


    • John Machin

      #3
      Re: Nlp, Python and period

      On Aug 4, 7:59 pm, Fred Mangusta <a...@bbb.it> wrote:
      > Hi,
      >
      > are you aware of any nlp packages or algorithms in Python to spot
      > whether a '.' represents an end of sentence or rather something else (eg
      > Mr., f...@home.co.uk, etc)?

      google("python nltk") ... it may do what you want.



      • Fred Mangusta

        #4
        Re: Nlp, Python and period

        Hi Paul,

        thanks for replying. I'm interested in knowing more about your regex
        approach, but as you point out in your comment, it seems that access to
        the SourceForge mail archive is restricted. Is there any way I can read
        about it? Would you be so kind as to cut and paste it here, for instance?

        Thanks!
        F.

        Paul Boddie wrote:
        > There was also discussion of this topic on Ned Batchelder's blog a
        > while back:
        >
        > One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.
        >
        > My comment on there (that I'm using a regular expression with some
        > postprocessing) still stands.
        >
        > Paul


        • Paul Boddie

          #5
          Re: Nlp, Python and period

          On 4 Aug, 12:34, Fred Mangusta <a...@bbb.it> wrote:
          >
          > thanks for replying. I'm interested in knowing more about your regex
          > approach, but as you point out in your comment, seems like access to the
          > sourceforge mail archive is restricted. Is there any way I can read
          > about it? Would you be so kind to cut and paste it here for instance?

          I can't log into SourceForge, possibly because I've forgotten my
          password, but I can give you a fairly similar regular expression which
          does some of the work:

          import re

          sentence_pattern = re.compile(
              r'(' +
                  r'[\(\"\[]*' +    # Quoting or bracketing (optional)
                  r'[A-Za-z0-9]' +  # Match sentence with specific start character
                  r'.+?' +          # Match sentence content - "?" means non-greedy
                  r'[\.\!\?]' +     # End of sentence
                  r'[\)\"\]]*' +    # End quoting or bracketing
              r')' +
              r'(\s+)' +            # Spaces
              r'[\(\"\[]*' +        # Quoting or bracketing (optional)
              r'[A-Z0-9]'           # Match sentence with specific start character
              )

          This is mostly the same as that posted to SourceForge, but with some
          enhancements; I've indented the part which actually produces the
          matched sentence text in a group. Unfortunately, some postprocessing
          is required to deal with abbreviations, and I maintain a list of these
          against which I test the supposed ends of sentences that the regular
          expression provides. In addition, I also try to detect initials (e.g.
          G. van Rossum) which the regular expression may regard as the end of a
          sentence.
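
The postprocessing described above might look roughly like the sketch below. It is not the code from the SourceForge post: the names ABBREVIATIONS, BOUNDARY and split_sentences are hypothetical, the abbreviation list is a tiny placeholder, and the boundary pattern is a simplified variant that uses a lookahead so the start of the next sentence is not consumed.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (stored without the
# trailing period, lower-cased); a real list would be much longer.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "vs"}

# Candidate boundary: a terminator, optional closing quotes/brackets, then
# whitespace (group 1), followed by an optional opening quote/bracket and a
# capital letter or digit (checked with a lookahead, not consumed).
BOUNDARY = re.compile(r'[.!?][)"\]]*(\s+)(?=[("\[]*[A-Z0-9])')


def split_sentences(text):
    """Split text at candidate boundaries, skipping abbreviations and initials."""
    sentences = []
    start = 0
    for m in BOUNDARY.finditer(text):
        if text[m.start()] == '.':
            # Look at the word immediately before the period.
            words = text[start:m.start()].split()
            last = words[-1].lstrip('("[') if words else ''
            if last.lower() in ABBREVIATIONS:
                continue  # e.g. "Mr." is not an end of sentence
            if len(last) == 1 and last.isupper():
                continue  # a single capital such as "G." is treated as an initial
        # Keep everything up to (but not including) the separating whitespace.
        sentences.append(text[start:m.start(1)])
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ('Mr. Smith wrote to foo@home.co.uk yesterday. He got no reply! '
              '"Why not?" G. Rossum asked.')
    for sentence in split_sentences(sample):
        print(sentence)
```

Periods inside foo@home.co.uk are never candidates because no whitespace follows them. The heuristics remain crude: a sentence that really does end in an abbreviation or in a single capital letter (such as "I.") will be merged with the one after it.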

          As I noted, I'd be interested to hear of any better solutions which
          don't involve training.

          Paul
