split into sentences with preg_split

  • mark1491
    New Member
    • Apr 2008
    • 2


    I am trying to split a string into sentences with preg_split, but I would like it to not split initials.
    Examples:

    Mark H. Doolittle is my name. What is yours. (I don't want it to split at the middle initial, but I do want it to split after 'name'.)

    The company is called H.R. Block. They are a good company. (I don't want it to split after 'H' or 'R'.)

    Here is the code I have, but it causes a split at every period. Can someone help me out? I'm not good at regex yet.

    [PHP]$sentences = preg_split ("/[.?!]+/", $data);[/PHP]
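
For reference, here is a minimal sketch of what the pattern above does with the first example sentence. The sample string and the `PREG_SPLIT_NO_EMPTY` flag are additions for illustration; the flag only drops the empty piece left after the final period.

```php
<?php
// Sample text taken from the first example in the post.
$data = "Mark H. Doolittle is my name. What is yours.";

// The original pattern: split on any run of '.', '?' or '!'.
// The matched punctuation is consumed, so it disappears from the pieces.
$sentences = preg_split("/[.?!]+/", $data, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
// Pieces: "Mark H", " Doolittle is my name", " What is yours"
```

The period after the initial is treated like any other, which is exactly the unwanted split described in the post.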
  • ronverdonk
    Recognized Expert Specialist
    • Jul 2006
    • 4259

    #2
    That is quite difficult to distinguish. Example: what would such an expression do with sentences like
    Code:
    That is for you and I. But not for him.
    No expression can distinguish between your "Mark H. Doolittle" and "and I. But".

    Ronald


    • Atli
      Recognized Expert Expert
      • Nov 2006
      • 5062

      #3
      I can't see a simple solution to this either.

      For this to work, the code would have to be able to distinguish between end-of-sentence periods and those used for other purposes, such as names and numbers.

      Your code would have to recognize every possible use of a period and decide whether or not it really is the end of a sentence.
      Honestly, I doubt that is "doable"... (I would say impossible, but nothing is "impossible" ;P)
      Last edited by Atli; Apr 6 '08, 03:50 PM. Reason: Submitted before I finished writing :P


      • mark1491
        New Member
        • Apr 2008
        • 2

        #4
        Originally posted by ronverdonk
        That is quite difficult to distinguish. Example: what would such an expression do with sentences like
        Code:
        That is for you and I. But not for him.
        No expression can distinguish between your "Mark H. Doolittle" and "and I. But".

        Ronald
        I understand that there will definitely be some problems, but most likely the only one-letter word that would fall at the end of a sentence is "I", as in your example:

        Code:
        That is for you and I. But not for him.
        I'll figure that out later, but for now is there any way you could show me how I would write a preg_split that would:

        not split on any "." that comes directly after a single-letter instance of [a-zA-Z], but still split on every other "."
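
One possible way to express that rule is with PCRE lookbehind assertions. This is only a sketch under a few assumptions that are not from the thread: sentences are separated by whitespace, the punctuation should stay attached to each sentence, and the helper name `split_sentences` is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): split text into sentences,
// leaving alone any period that follows a single standalone letter.
function split_sentences(string $data): array
{
    // (?<=[.?!])         the split point must come right after '.', '?' or '!'
    // (?<!\b[A-Za-z]\.)  ...but not right after a lone letter plus a period ("H." or "R.")
    // \s+                the whitespace between sentences is what gets removed
    $pattern = '/(?<=[.?!])(?<!\b[A-Za-z]\.)\s+/';

    return preg_split($pattern, trim($data), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_sentences("Mark H. Doolittle is my name. What is yours."));
// Pieces: "Mark H. Doolittle is my name.", "What is yours."

print_r(split_sentences("The company is called H.R. Block. They are a good company."));
// Pieces: "The company is called H.R. Block.", "They are a good company."
```

As pointed out earlier in the thread, the rule cannot tell an initial from the word "I": with this pattern "That is for you and I. But not for him." comes back as a single piece. Multi-letter abbreviations such as "Dr." or "Inc." are not covered either and would still cause a split.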
