RegExp split for Spell Check

  • SmokeWilliams

    RegExp split for Spell Check

    Hi,
    I am working on a spell checker for my rich-text editor. I cannot use
    any open source, and must develop everything myself. I need a RegExp
    pattern to split text into a word array. I have been doing it by
    splitting on spaces or <p> tags. I run into a problem with the
    rich-text part of my editor. When I change the font, it wraps the text
    in a tag. The tag looks something like
    <font face="arial">some words</font>. My split breaks the text at the
    space in "font face", so I need to split on spaces unless they are
    within an HTML tag. I am just looking for the pattern for my RegExp.
    I know there may be better ways for me to do it, but right now I just
    need help with this issue.

    Thanks in advance.

    Pete
  • Evertjan.

    #2
    Re: RegExp split for Spell Check

    SmokeWilliams wrote on 23 nov 2007 in comp.lang.javascript:
    > I am working on a Spell checker for my richtext editor.
    > I cannot use any open source, and must develop everything myself.

    Why? At least look at all the code you can find. Writing complex
    code from scratch forgoes the benefit of years of code
    experimentation by the world's programmers collectively.

    > I need a RegExp pattern to split text into a word array.

    Why? Does it matter how you do it? Parsing seems so much simpler.

    > I have been doing it by
    > splitting by spaces or <p> tags. I run into a problem with the
    > richtext part of my editor. When I change the font, it wraps the text
    > in a tag.
    > the tag has something like <font face="arial">some words</font>

    That is last century's code. Why not use <span> and CSS exclusively?

    > This splits the text at font face so I need to split on spaces
    > unless they are within the HTML tag.
    > I am just looking for the pattern for my regExp.
    > I know there may be better ways for me to do
    > it, but right now I just need help with this issue.

    I think that by stipulating the above unnecessary constraints, you will
    get yourself into much trouble.

    However try this:

    var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)
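
A minimal sketch of how that one-liner behaves on the kind of markup described in the first post. The helper name `stripTagsAndSplit` and the sample string are assumptions for illustration only. The extra filter step is not part of the suggestion above; it drops the empty strings that a leading or trailing tag otherwise leaves at the ends of the array.

```javascript
// Replace every tag with a space, then split on runs of whitespace.
// stripTagsAndSplit is a hypothetical helper name, not from the thread.
function stripTagsAndSplit(textString) {
  return textString
    .replace(/<[^>]*>/g, ' ')   // each tag becomes a single space
    .split(/\s+/)               // split on one or more whitespace characters
    .filter(function (w) { return w.length > 0; }); // drop empty edge entries
}

const sample = '<font face="arial">some words</font>';
const words = stripTagsAndSplit(sample);
console.log(words.join('|')); // prints "some|words"
```

Note that this approach discards the tags, so word positions no longer line up with the original string.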

    --
    Evertjan.
    The Netherlands.
    (Please change the x'es to dots in my emailaddress)


    • Randy Webb

      #3
      Re: RegExp split for Spell Check

      Evertjan. said the following on 11/23/2007 1:49 PM:
      > SmokeWilliams wrote on 23 nov 2007 in comp.lang.javascript:
      <snip>
      >> the tag has something like <font face="arial">some words</font>
      >
      > That is last century's code. Why not use <span> and CSS exclusively?

      Because that is what the browsers put in the code in a contentEditable
      element :)

      --
      Randy
      Chance Favors The Prepared Mind
      comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
      Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/


      • Evertjan.

        #4
        Re: RegExp split for Spell Check

        Randy Webb wrote on 23 nov 2007 in comp.lang.javascript:
        > Evertjan. said the following on 11/23/2007 1:49 PM:
        >> SmokeWilliams wrote on 23 nov 2007 in comp.lang.javascript:
        >
        > <snip>
        >
        >>> the tag has something like <font face="arial">some words</font>
        >>
        >> That is last century's code. Why not use <span> and CSS exclusively?
        >
        > Because that is what the browsers put in the code in a contentEditable
        > element :)

        So why use contentEditable if you cannot control it?

        Wouldn't a simple <div> with onkeypress do?

        --
        Evertjan.
        The Netherlands.
        (Please change the x'es to dots in my emailaddress)


        • Dr J R Stockton

          #5
          Re: RegExp split for Spell Check

          In comp.lang.javascript message <Xns99F1C9A6EB296eejj99@194.109.133.242>,
          Fri, 23 Nov 2007 18:49:24, Evertjan. <exjxw.hannivoort@interxnl.net>
          posted:
          >
          >However try this:
          >
          >var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)
          >

          If the page contains <script>...<\/script> then ISTM that the script
          will be spell-checked; likewise the content of any textarea and possibly
          others.

          Could one write the full text to a page or div as HTML (useful anyway)
          and read it back as .innerText for spell-checking?

          --
          (c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
          Web <URL:http://www.merlyn.demon.co.uk/> - FAQqish topics, acronyms & links;
          Astro stuff via astron-1.htm, gravity0.htm ; quotings.htm, pascal.htm, etc.
          No Encoding. Quotes before replies. Snip well. Write clearly. Don't Mail News.


          • Randy Webb

            #6
            Re: RegExp split for Spell Check

            Evertjan. said the following on 11/23/2007 6:16 PM:
            > Randy Webb wrote on 23 nov 2007 in comp.lang.javascript:
            >
            >> Evertjan. said the following on 11/23/2007 1:49 PM:
            >>> SmokeWilliams wrote on 23 nov 2007 in comp.lang.javascript:
            >> <snip>
            >>>
            >>>> the tag has something like <font face="arial">some words</font>
            >>> That is last century's code. Why not use <span> and CSS exclusively?
            >> Because that is what the browsers put in the code in a contentEditable
            >> element :)
            >
            > So why use contentEditable if you cannot control it?

            That is basically all it is. It isn't so much the contentEditable that
            does it but rather the built-in functions (most notably in IE) that do
            the formatting. I haven't messed with it in a long time but I do
            remember that the styling of text was horrible.

            --
            Randy
            Chance Favors The Prepared Mind
            comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
            Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/


            • Randy Webb

              #7
              Re: RegExp split for Spell Check

              Dr J R Stockton said the following on 11/23/2007 3:01 PM:
              > In comp.lang.javascript message <Xns99F1C9A6EB296eejj99@194.109.133.242>,
              > Fri, 23 Nov 2007 18:49:24, Evertjan. <exjxw.hannivoort@interxnl.net>
              > posted:
              >> However try this:
              >>
              >> var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)
              >
              > If the page contains <script>...<\/script> then ISTM that the script
              > will be spell-checked; likewise the content of any textarea and possibly
              > others.
              >
              > Could one write the full text to a page or div as HTML (useful anyway)
              > and read it back as .innerText for spell-checking?

              The idea of spell-checking, in the sense of a true spell-checker, is
              almost impossible to implement in a browser due to the inherent size of
              the dictionary that you must use.

              --
              Randy
              Chance Favors The Prepared Mind
              comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
              Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/


              • Dr J R Stockton

                #8
                Re: RegExp split for Spell Check

                In comp.lang.javascript message <XpudnWyYkKlj69ra4p2dnAA@giganews.com>,
                Fri, 23 Nov 2007 19:57:39, Randy Webb <HikksNotAtHome@aol.com> posted:
                >
                >The idea of spell-checking, in the sense of a true spell-checker is
                >almost impossible to implement in a browser due to the inherent size of
                >the dictionary that you must use.
                >
                Fifty thousand words is sufficient for ordinary use. I have to hand a
                "Universal" pocket dictionary of a language resembling English, with 407
                pages of two columns of about 15 words each; so about a quarter of that
                size. The Little Oxford dictionary, 606 * 2 * 20, is about 25000 words.

                I have to hand the New Testament in Basic English; its Note refers to
                Basic English having 850 words, and to the NT using another 150
                particular to the topic. It lacks the richness of the King James
                version; but the text looks quite normal.

                A spell-checker for use by the younger half of school-children would not
                need very many words.

                An alphabetical list of words, compressed, should not need much more
                than two bytes per word.
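
One way to get close to that figure is front coding, which stores for each word only the number of leading characters it shares with the previous word, plus the differing tail. Treating "compressed" as front coding is an assumption here, since the post does not name a scheme; the tiny word list and the function names are made up for illustration.

```javascript
// Front coding for a sorted word list: each entry keeps the length of the
// prefix shared with the previous word and the remaining suffix.
function frontEncode(sortedWords) {
  const out = [];
  let prev = '';
  for (const word of sortedWords) {
    let shared = 0;
    const max = Math.min(prev.length, word.length);
    while (shared < max && prev[shared] === word[shared]) shared++;
    out.push([shared, word.slice(shared)]);
    prev = word;
  }
  return out;
}

// Rebuild the original list from the (shared, suffix) pairs.
function frontDecode(entries) {
  const words = [];
  let prev = '';
  for (const [shared, suffix] of entries) {
    const word = prev.slice(0, shared) + suffix;
    words.push(word);
    prev = word;
  }
  return words;
}

const list = ['spell', 'spelled', 'speller', 'spelling', 'spells'];
const encoded = frontEncode(list);
// encoded is [[0,"spell"],[5,"ed"],[6,"r"],[5,"ing"],[5,"s"]]
const decoded = frontDecode(encoded);
```

In this sample the four words after the first need only 1 to 3 suffix characters each, which is the effect the estimate relies on.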

                So the list of words need be no longer than my largest Web page,
                currently 105000 bytes; and that's quite acceptable over broadband if
                expected and cached properly.

                There should be plenty of room to store such data in Javascript, from
                what I've read here in other threads.

                Lookup needs be no faster than typing, and properly implemented should
                need only O(log2(N)) comparisons when using the main dictionary. It
                would seem a potentially smart move to cache in a sub-dictionary the
                words actually already seen (right or wrong) in the current text, since
                words are often repeated. FAQ 2.3 contains about 675 words, but only
                about 343 different ones. One third of its words are in the Top 8, "the
                to and of in not a is".

                The sub-dictionary can be pre-loaded with the commonest good and bad
                spellings, if that helps.
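
A rough sketch of that lookup scheme: a binary search over a sorted main dictionary, fronted by a cache of words already seen in the current text. The names (`makeChecker`, `check`) and the eight-word dictionary are assumptions for illustration; a real list would be loaded separately.

```javascript
// Binary search over a sorted array of words, with a cache ("sub-dictionary")
// of words already checked, right or wrong. All names are illustrative.
function makeChecker(sortedWords) {
  const seen = new Map(); // word -> true (good) or false (bad)
  let comparisons = 0;

  function inDictionary(word) {
    let lo = 0;
    let hi = sortedWords.length - 1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      comparisons++;
      if (sortedWords[mid] === word) return true;
      if (sortedWords[mid] < word) lo = mid + 1;
      else hi = mid - 1;
    }
    return false;
  }

  return {
    // Consult the cache first; fall back to the main dictionary once per word.
    check: function (word) {
      if (!seen.has(word)) seen.set(word, inDictionary(word));
      return seen.get(word);
    },
    comparisons: function () { return comparisons; }
  };
}

// The "Top 8" words from the post, sorted for the binary search.
const checker = makeChecker(['a', 'and', 'in', 'is', 'not', 'of', 'the', 'to']);
const good = checker.check('the');
const bad = checker.check('teh');
```

Repeated words cost no further comparisons, because the second and later checks are answered from the cache.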

                Of course, it would be quite wrong to impose the full OED on an
                unsuspecting dial-up user.

                --
                (c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
                Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
                Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
                Do not Mail News to me. Before a reply, quote with ">" or "" (SonOfRFC1036)


                • Randy Webb

                  #9
                  Re: RegExp split for Spell Check

                  Dr J R Stockton said the following on 11/24/2007 4:32 PM:
                  > In comp.lang.javascript message <XpudnWyYkKlj69ra4p2dnAA@giganews.com>,
                  > Fri, 23 Nov 2007 19:57:39, Randy Webb <HikksNotAtHome@aol.com> posted:
                  >> The idea of spell-checking, in the sense of a true spell-checker is
                  >> almost impossible to implement in a browser due to the inherent size of
                  >> the dictionary that you must use.
                  >
                  > Fifty thousand words is sufficient for ordinary use. I have to hand a
                  > "Universal" pocket dictionary of a language resembling English, with 407
                  > pages of two columns of about 15 words each; so about a quarter of that
                  > size. The Little Oxford dictionary, 606 * 2 * 20, is about 25000 words.

                  I found a text file after looking for almost an hour. It has 213,558
                  words in it. The text file is 2.4 MB. The biggest problem with even a
                  25,000 word dictionary is going to be lookup time. That can be helped a
                  lot by splitting it up into 26 dictionaries by beginning letter.

                  Too bad I can't look up half of those words to find out what they
                  mean. What the heck is a zakkeu?
                  --
                  Randy
                  Chance Favors The Prepared Mind
                  comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
                  Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/


                  • Randy Webb

                    #10
                    Re: RegExp split for Spell Check

                    Randy Webb said the following on 11/26/2007 8:26 PM:

                    <snip>
                    if(Dic['word here'])
                    Testing that with a 215,000 word dictionary, the results were almost
                    instantaneous. It did tell me that the word list I had is pretty useless
                    since it didn't have the word "test" in it.
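
A minimal sketch of the keyed lookup behind `if(Dic['word here'])`. The rest of the code was snipped from the post, so how `Dic` was built is an assumption here. This version uses an object with no prototype, so that words such as "constructor" are not reported as present by accident; `buildDictionary` and `isKnown` are hypothetical names.

```javascript
// Build a lookup table from a word list (the list is a made-up sample).
function buildDictionary(words) {
  const table = Object.create(null); // no inherited keys such as "constructor"
  for (const w of words) table[w.toLowerCase()] = true;
  return table;
}

// A word is known only if it was explicitly stored in the table.
function isKnown(table, word) {
  return table[word.toLowerCase()] === true;
}

const Dic = buildDictionary(['test', 'spell', 'check']);
const hasTest = isKnown(Dic, 'Test');          // true
const hasCtor = isKnown(Dic, 'constructor');   // false
```

Property lookup on an object like this does not scan the list, which fits the near-instant results reported above.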
                    --
                    Randy
                    Chance Favors The Prepared Mind
                    comp.lang.javascript FAQ - http://jibbering.com/faq/index.html
                    Javascript Best Practices - http://www.JavascriptToolbox.com/bestpractices/


                    • Dr J R Stockton

                      #11
                      Re: RegExp split for Spell Check

                      In comp.lang.javascript message <D4ydnZVYXZ2aatHa4p2dnAA@giganews.com>,
                      Wed, 28 Nov 2007 00:05:18, Randy Webb <HikksNotAtHome@aol.com> posted:
                      >
                      >I guessed at the 500kb based on 215,000 entries being 4.5Mb. Creating a
                      >test file with 25,000 entries in it where each entry is 6 characters
                      >long - to create an "average" word length - the file was 439Kb so I
                      >wasn't far off. Of course, the actual size would depend on the 25,000
                      >words you used.
                      >
                      25000 6-character words in 7-bit ASCII, with CRLF separators, needs
                      exactly 200kB. It may use more if created in Word, or if encoded in a
                      manner allowing letters other than A-Z.

                      For ordinary English, one only needs A to Z, hyphen, apostrophe and a
                      separator, so 5-bit characters could be used by mere packing:
                      25000*5*6/8 is under 100 kbytes, before any additional compression.
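
The arithmetic in the two paragraphs above can be checked directly. The variable names are illustrative; the figures (25000 words, 6 letters each, CRLF separators, 5 bits per character) come from the post.

```javascript
const wordCount = 25000;
const lettersPerWord = 6;

// 7-bit ASCII stored one byte per character, plus 2 bytes of CRLF per word.
const asciiBytes = wordCount * (lettersPerWord + 2);   // 200000

// 26 letters + hyphen + apostrophe + separator = 29 symbols, which fits in
// 5 bits because 2^5 = 32.
const symbols = 26 + 2 + 1;
const bitsPerSymbol = Math.ceil(Math.log2(symbols));   // 5

// The post's estimate: 6 packed characters per word at 5 bits each.
const packedBytes = wordCount * lettersPerWord * bitsPerSymbol / 8; // 93750
```

Packing a separator with every word as well would raise the packed figure somewhat; the post's estimate counts the six letters only.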

                      Of course, dictionary words are longer than the average.

                      --
                      (c) John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
                      Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
                      Proper <= 4-line sig. separator as above, a line exactly "-- " (SonOfRFC1036)
                      Do not Mail News to me. Before a reply, quote with ">" or "" (SonOfRFC1036)


                      • SmokeWilliams

                        #12
                        Re: RegExp split for Spell Check

                        This is exactly what I was afraid of. I know it isn't the best
                        solution. I know there are better ways. I need a pattern to be used
                        only in the split because I need to maintain the length of the
                        string. So again, I am asking whether anyone knows how to make a
                        pattern that splits text on spaces or carriage returns; "\r| " is
                        the split I am using now. But as I stated above, I need to ignore the
                        spaces within HTML tags. Please help me. Just the simple pattern will
                        do. Thanks.
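
If the aim is to keep positions valid against the original string, one option is to walk the string with a global pattern and record each word's offset, rather than splitting. This is a sketch of an alternative, not the split pattern asked for; `wordsWithOffsets` is a hypothetical name and the sample markup is assumed.

```javascript
// Walk the string token by token. A token is either a whole tag or a run of
// characters that are neither whitespace nor "<". Tags are skipped; words are
// recorded with their position in the untouched original string.
function wordsWithOffsets(html) {
  const token = /<[^>]*>|[^\s<]+/g;
  const result = [];
  let m;
  while ((m = token.exec(html)) !== null) {
    if (m[0].charAt(0) !== '<') {
      result.push({ word: m[0], index: m.index });
    }
  }
  return result;
}

const source = '<font face="arial">some words</font>';
const found = wordsWithOffsets(source);
// found[0] is { word: "some", index: 19 }; found[1] is { word: "words", index: 24 }
```

Because nothing is replaced, `source.substr(index, word.length)` always gives back the word, so a misspelling can be highlighted in place.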

                        Pete


                        • SmokeWilliams

                          #13
                          Re: RegExp split for Spell Check

                          Hello Evertjan, thanks for replying.

                          > However try this:
                          >
                          > var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)

                          I need a pattern that will split without replacing. So I need to
                          split on spaces or carriage returns, but not spaces that are within
                          html tags. I know there are better ways, but I am using an IFrame in
                          IE and I work for a government agency which doesn't allow me to use
                          open source. I am depending on a RegEx wizard out there to supply me
                          with the pattern.

                          So I need a pattern that matches any space or carriage return that is
                          not within an html tag.

                          <font face="arial" size=2>test</font><p>yo this is a test

                          Splitting this text should return an array containing:
                          1: <font face="arial" size=2>test</font>
                          2: yo
                          3: this
                          4: is
                          5: a
                          6: test
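
One pattern that produces that array for this sample, sketched with a negative lookahead: split on `<p>`, or on whitespace that is not followed by a `>` before the next `<`. This is an assumption about what would satisfy the request, not a general HTML parser; a literal `>` in the text, or a `<` inside an attribute value, would confuse it.

```javascript
// Split on <p>, or on whitespace that is not inside a tag. "Inside a tag" is
// approximated as: a ">" appears later with no "<" in between.
const splitter = /<p>|\s+(?![^<]*>)/;

const s = '<font face="arial" size=2>test</font><p>yo this is a test';
const parts = s.split(splitter);
console.log(parts.join('\n'));
// <font face="arial" size=2>test</font>
// yo
// this
// is
// a
// test
```

The original string is left untouched, so nothing is replaced before or after the split.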

                          Thanks for your help.

                          Pete


                          • pr

                            #14
                            Re: RegExp split for Spell Check

                            SmokeWilliams wrote:
                            > <font face="arial" size=2>test</font><p>yo this is a test
                            >
                            > Splitting this text should return an array containing:
                            > 1: <font face="arial" size=2>test</font>
                            > 2: yo
                            > 3: this
                            > 4: is
                            > 5: a
                            > 6: test

                            Try:

                            alert('<font face="arial" size=2>test</font><p>yo this is a test'
                                .replace(/\s(?=[^<]*>)/g, "~").split(/<p>|\s/).join("\n"));

                            You can either replace the '~'s with spaces again or leave them in;
                            either way, your string lengths are the same as the original HTML (as
                            long as you clear up the <p> != whitespace issue).
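
The same idea without the alert, as a small sketch that also puts the spaces back afterwards. The variable names are illustrative, and restoring every "~" assumes the text itself contains no tilde, which is a limitation of using a placeholder character.

```javascript
const html = '<font face="arial" size=2>test</font><p>yo this is a test';

// Step 1: mask whitespace inside tags. Each space becomes one "~", so the
// masked string has the same length as the original.
const masked = html.replace(/\s(?=[^<]*>)/g, '~');

// Step 2: split on <p> or on the remaining (unmasked) whitespace.
// Step 3: turn the placeholders back into spaces in each piece.
const pieces = masked.split(/<p>|\s/).map(function (p) {
  return p.replace(/~/g, ' ');
});
// pieces[0] is '<font face="arial" size=2>test</font>', followed by
// "yo", "this", "is", "a", "test"
```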


                            • Thomas 'PointedEars' Lahn

                              #15
                              Re: RegExp split for Spell Check

                              SmokeWilliams wrote:
                              > I need a pattern that will split without replacing. So I need to
                              > split on spaces or carriage returns, but not spaces that are within
                              > html tags. I know there are better ways, but I am using an IFrame in
                              > IE and I work for a government agency which doesn't allow me to use
                              > open source. I am depending on a RegEx wizard out there to supply me
                              > with the pattern.
                              >
                              > So I need a pattern that matches any space or carriage return that is
                              > not within an html tag.
                              >
                              > <font face="arial" size=2>test</font><p>yo this is a test
                              >
                              > Splitting this text should return an array containing:
                              > 1: <font face="arial" size=2>test</font>
                              > 2: yo
                              > 3: this
                              > 4: is
                              > 5: a
                              > 6: test

                              Suppose you have

                              var s = '<font face="arial" size=2>test</font><p>yo this is a test';

                              Either you have a weird idea of "html tag" (HTML is an acronym, BTW),
                              or (which is more likely) instead you want the resulting array to be

                              ['', 'test', '', 'yo', 'this', 'is', 'a', 'test']

                              This could be achieved by using tags as additional delimiters:

                              var a = s.split(/<[^>]+>|\s+/);

                              Microsoft JScript will not include the empty strings in the array.
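
Since engines differ on whether the empty strings appear, a sketch that filters them out gives the same word list either way. The filter step and the name `wordsOnly` are additions for illustration, not part of the suggestion above.

```javascript
var s = '<font face="arial" size=2>test</font><p>yo this is a test';

// Tags and runs of whitespace both act as delimiters.
var a = s.split(/<[^>]+>|\s+/);
// In a conforming engine: ['', 'test', '', 'yo', 'this', 'is', 'a', 'test']

// Drop the empty strings so the result does not depend on the engine.
var wordsOnly = a.filter(function (w) { return w !== ''; });
// wordsOnly: ['test', 'yo', 'this', 'is', 'a', 'test']
```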


                              PointedEars
                              --
                              realism: HTML 4.01 Strict
                              evangelism: XHTML 1.0 Strict
                              madness: XHTML 1.1 as application/xhtml+xml
                              -- Bjoern Hoehrmann
