making tsearch2 dictionaries

  • Ben

    making tsearch2 dictionaries

    I'm trying to make myself a dictionary for tsearch2 that converts
    numbers to their english word equivalents. This seems to be working
    great, except that I can't figure out how to make my lexize function
    return multiple lexemes. For instance, I'd like "100" to get converted
    to {one,hundred}, not {"one hundred"} as is currently happening.

    How do I specify the output of the lexize function so that this will
    happen?


    ---------------------------(end of broadcast)---------------------------
    TIP 4: Don't 'kill -9' the postmaster

  • Ben

    #2
    Re: making tsearch2 dictionaries

    Okay, so I was actually able to answer this question on my own, in a
    manner of speaking. It seems the way to do this is to merely return a
    larger char** array, with one element for each word. But I was having
    trouble with postgres crashing, because (I think) it tries to free each
    element independently before using all of them. I had set each element
    to a different null-terminated chunk of the same palloc'd memory
    segment. Having never written C stored procs before, I take it that's
    bad practice?

    Anyway, now that this is working, my next question is: can I take the
    lexemes from one dictionary lookup and pipe them into another
    dictionary? I see that I can have redundant dictionaries, such that if
    lexemes aren't found in one it'll try another, but that's not quite the
    same.

    For instance, the en_stem dictionary converts "hundred" into "hundr".
    Right now, my dictionary converts "100" into "one" and "hundred", but
    I'd like it to filter both one and hundred through the en_stem
    dictionary to arrive at "one" and "hundr".

    It also occurs to me I could pipe things through an ispell dictionary
    and be able to handle misspellings...

    On Sun, 2004-02-15 at 15:35, Ben wrote:
    > I'm trying to make myself a dictionary for tsearch2 that converts
    > numbers to their english word equivalents. This seems to be working
    > great, except that I can't figure out how to make my lexize function
    > return multiple lexemes. For instance, I'd like "100" to get converted
    > to {one,hundred}, not {"one hundred"} as is currently happening.
    >
    > How do I specify the output of the lexize function so that this will
    > happen?



    • Teodor Sigaev

      #3
      Re: making tsearch2 dictionaries

      From http://www.sai.msu.su/~megera/oddmus...ch_V2_in_Brief

      Table for storing dictionaries. The dict_init field stores the Oid of the
      function that initializes the dictionary. Dict_init takes one option, the
      text value from dict_initoption, and should return the internal
      representation (structure) of the dictionary. The structure must be
      malloc'ed or palloc'ed in TopMemoryContext. Dict_init is called only once
      per process. The dict_lexize field stores the Oid of the function that
      lemmatizes a lexeme. Input values: the dictionary structure, a pointer to
      the string, and its length. Output: a pointer to an array of pointers to
      C-strings. The last pointer in the array must be NULL. Returning NULL means
      that the dictionary can't resolve the word, while returning an empty array
      means that the dictionary knows the input word but considers it a stop
      word.
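
As a rough illustration of the return convention described above, here is a minimal C sketch. It is not tsearch2 code: `toy_lexize`, `xalloc`, and `copy_str` are hypothetical names, `malloc` stands in for `palloc` so the example compiles outside the backend, and the real lexize function is invoked through the PostgreSQL function manager rather than with this simplified signature.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for palloc(): allocate or abort (assumption of this sketch). */
static void *xalloc(size_t n)
{
    void *p = malloc(n);
    assert(p != NULL);
    return p;
}

/* Give every lexeme its own allocation so each can be freed on its own. */
static char *copy_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = xalloc(n);
    memcpy(p, s, n);
    return p;
}

/*
 * Hypothetical lexize body showing the three possible outcomes:
 *   "100" -> {"one", "hundred", NULL}   several lexemes
 *   "the" -> {NULL}                     known word, treated as a stop word
 *   other -> NULL                       word not recognized
 */
static char **toy_lexize(const char *word, size_t len)
{
    char **res;

    if (len == 3 && memcmp(word, "100", 3) == 0) {
        res = xalloc(3 * sizeof(char *));   /* two lexemes + terminator */
        res[0] = copy_str("one");
        res[1] = copy_str("hundred");
        res[2] = NULL;
        return res;
    }
    if (len == 3 && memcmp(word, "the", 3) == 0) {
        res = xalloc(sizeof(char *));       /* only the terminator */
        res[0] = NULL;
        return res;
    }
    return NULL;
}
```

A caller can tell the three cases apart by checking the returned pointer first and then the first array element.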


      --
      Teodor Sigaev E-mail: teodor@sigaev.ru


      • Tom Lane

        #4
        Re: making tsearch2 dictionaries

        Ben <bench@silentmedia.com> writes:
        > Okay, so I was actually able to answer this question on my own, in a
        > manner of speaking. It seems the way to do this is to merely return a
        > larger char** array, with one element for each word. But I was having
        > trouble with postgres crashing, because (I think) it tries to free each
        > element independently before using all of them. I had set each element
        > to a different null-terminated chunk of the same palloc'd memory
        > segment. Having never written C stored procs before, I take it that's
        > bad practice?

        Given Teodor's response, I think the issue is probably that you were
        palloc'ing in too short-lived a context. But whatever the problem is,
        you'll narrow it down a lot faster if you build with --enable-cassert.
        I wouldn't ever recommend trying to debug C functions without that.

        regards, tom lane


        • Teodor Sigaev

          #5
          Re: making tsearch2 dictionaries

          Excuse me, I was too brief.
          I mean that your dictionary's lexize method should return a pointer to an
          array with 3 elements: the first should point to the C-string "one", the
          second to the C-string "hundred", and the third is NULL.
          The array and the C-strings should be palloc'ed in a short-lived context,
          because they live only while the text is being parsed.





          --
          Teodor Sigaev E-mail: teodor@sigaev.ru


          • Ben

            #6
            Re: making tsearch2 dictionaries

            Thanks for the replies. Just to clarify what I was doing, my quasicode
            looked something like:

            phrase = palloc(8);
            phrase = "foo\0bar\0 ";
            res = palloc(3);
            res[0] = phrase[0];
            res[1] = phrase[5];
            res[2] = 0;

            That crashed. Once I changed it to:

            res = palloc(3);
            res[0] = palloc(4);
            res[0] = "foo\0";
            res[1] = palloc(4);
            res[2] = "bar\0";
            res[3] = 0;

            it worked.

            Anyway, I'm happy to forget my pain with this if only I could figure out
            how to pipe the lexemes from one dictionary into another dictionary. :)

            On Mon, 2004-02-16 at 08:09, Teodor Sigaev wrote:
            > Excuse me, but I was too brief.
            > I mean your lexize method of dictionary should return pointer to array with 3
            > elements:
            > first should points to "one" C-string, second - to "hundred" C-string and 3rd is
            > NULL.
            > Array and C-strings should be palloc'ed in short-lived context, because it's
            > lives during parse text only.



            • Ben

              #7
              Re: making tsearch2 dictionaries

              Like I said, quasicode. :)

              And in fact I see I even put an off-by-one error in this last email that
              wasn't in my function. (Honest!) Should have been "res[1] = phrase[4]"
              in the first section.

              Are there docs for making parsers? Or anything like gendict?

              On Mon, 2004-02-16 at 09:25, Teodor Sigaev wrote:
              > :)
              > I hope you mean:
              > res = palloc(3);
              > res[0] = palloc(4);
              > memcpy(res[0] ,"foo", 4);
              > res[1] = palloc(4);
              > memcpy(res[1] ,"bar", 4);
              > res[2] = 0;
              >
              > Look at indexes of res.



              • Teodor Sigaev

                #8
                Re: making tsearch2 dictionaries



                Ben wrote:
                > Thanks for the replies. Just to clarify what I was doing, quasicode
                > looked something like:
                >
                > phrase = palloc(8);
                > phrase = "foo\0bar\0 ";
                > res = palloc(3);
                > res[0] = phrase[0];
                > res[1] = phrase[5];
                > res[2] = 0;
                >
                > That crashed. Once I changed it to:
                >
                > res = palloc(3);
                > res[0] = palloc(4);
                > res[0] = "foo\0";
                > res[1] = palloc(4);
                > res[2] = "bar\0";
                > res[3] = 0;
                >
                > it worked.
                :)
                I hope you mean:
                res = palloc(3);
                res[0] = palloc(4);
                memcpy(res[0] ,"foo", 4);
                res[1] = palloc(4);
                memcpy(res[1] ,"bar", 4);
                res[2] = 0;

                Look at indexes of res.
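
For readers following along, here is one way the corrected idea could look as compilable C. It is only a sketch under stated assumptions: `split_words` is a hypothetical helper, and `malloc` stands in for `palloc` so it builds outside the backend. Note that the pointer array needs room for one `char *` per word plus the terminator (that is, `(count + 1) * sizeof(char *)` bytes rather than a bare element count), and that every word gets its own allocation, so freeing the elements one by one is safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split a space-separated phrase such as "one hundred" into a
 * NULL-terminated array of separately allocated C-strings.
 */
static char **split_words(const char *phrase)
{
    const char *p = phrase;
    size_t count = 0;
    size_t i = 0;
    char **res;

    /* First pass: count the words so the pointer array can be sized. */
    while (*p) {
        while (*p == ' ')
            p++;
        if (*p) {
            count++;
            while (*p && *p != ' ')
                p++;
        }
    }

    res = malloc((count + 1) * sizeof(char *));
    assert(res != NULL);

    /* Second pass: copy each word into its own buffer. */
    p = phrase;
    while (*p) {
        while (*p == ' ')
            p++;
        if (*p) {
            const char *start = p;
            size_t len;

            while (*p && *p != ' ')
                p++;
            len = (size_t) (p - start);
            res[i] = malloc(len + 1);
            assert(res[i] != NULL);
            memcpy(res[i], start, len);
            res[i][len] = '\0';
            i++;
        }
    }
    res[i] = NULL;   /* terminator goes right after the last word */
    return res;
}
```

Because no element points into the middle of a shared buffer, a consumer that releases each string independently does not corrupt its neighbours.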

                --
                Teodor Sigaev E-mail: teodor@sigaev.ru


                • Teodor Sigaev

                  #9
                  Re: making tsearch2 dictionaries

                  Small docs are available at


                  and in the current implementation at contrib/tsearch2/wparser_def.c. The largest
                  part of that code deals with headline generation.

                  Ben wrote:
                  > Like I said, quasicode. :)
                  >
                  > And in fact I see I even put an off-by-one error in this last email that
                  > wasn't in my function. (Honest!) Should have been "res[1] = phrase[4]"
                  > in the first section.
                  >
                  > Are there docs for making parsers? Or anything like gendict?

                  --
                  Teodor Sigaev E-mail: teodor@sigaev.ru


                  • Oleg Bartunov

                    #10
                    Re: making tsearch2 dictionaries

                    btw, Ben, if you get your dictionary working, could you describe the
                    development process so that other people will appreciate your work?
                    This part of the tsearch2 documentation is very weak.

                    Oleg


                    Regards,
                    Oleg
                    _____________________________________________________________
                    Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
                    Sternberg Astronomical Institute, Moscow University (Russia)
                    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
                    phone: +007(095)939-16-83, +007(095)939-23-83


                    • Ben

                      #11
                      Re: making tsearch2 dictionaries

                      So I noticed. ;) The dictionary's working, and I'd be happy to expand
                      upon the documentation. Just point me at something to work on.

                      But, like I said, I really want to figure out a way to pipe the output
                      of my dictionary through another dictionary. If I can't do that, it
                      doesn't seem as useful, because "100" (handled by my dictionary) and
                      "one hundred" (handled by en_stem) currently don't generate the same
                      ts_vector.

                      Once I figure out how to tweak the parser to parse things the way I
                      want, I can expand upon those docs too. Looks like I'm going to need to
                      reach waaaay back into my brain and dust off my flex knowledge for that,
                      though....

                      On Mon, 2004-02-16 at 10:33, Oleg Bartunov wrote:
                      > btw, Ben, if you get your dictionary working, could you describe process
                      > of developing so other people will appreciate your work. This part of
                      > tsearch2 documentation is very weak.
                      >
                      > Oleg



                      • Oleg Bartunov

                        #12
                        Re: making tsearch2 dictionaries

                        On Mon, 16 Feb 2004, Ben wrote:
                        > So I noticed. ;) The dictionary's working, and I'd be happy to expand
                        > upon the documentation. Just point me at something to work on.

                        I think you could simply write a short paper, "How I made a custom dictionary
                        for tsearch2". From what I've read, your dictionary could be interesting to
                        people, especially if you describe the motivation and usage.
                        Do you want '100' and 'hundred' to be fully equivalent? Then
                        if you search for '100' you will find documents containing 'hundred'.
                        Interestingly, you will also find '123', because '123' becomes
                        'one hundred twenty three'.
                        > But, like I said, I really want to figure out a way to pipe the output
                        > of my dictionary through the another dictionary. If I can't do that, it
                        > doesn't seem as useful, because "100" (handled by my dictionary) and
                        > "one hundred" (handled by en_stem) currently don't generate the same
                        > ts_vector.

                        What's the problem? You can configure which dictionaries are used, and in
                        what order, for a given type of token (pg_ts_cfgmap table).
                        Aha, I got your problem:

                        www=# select * from ts_debug('one hundred');
                             ts_name     | tok_type | description |  token  | dict_name | tsvector
                        -----------------+----------+-------------+---------+-----------+----------
                         default_russian | lword    | Latin word  | one     | {en_stem} | 'one'
                         default_russian | lword    | Latin word  | hundred | {en_stem} | 'hundr'

                        'hundred' becomes 'hundr'. You could use a synonym dictionary, which is
                        rather simple
                        (see http://www.sai.msu.su/~megera/oddmus...earch_V2_Notes for details).
                        Once a word is recognized by the synonym dictionary, it will not be passed
                        to the next dictionary! This is how tsearch2 works with any dictionary.
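
The "first dictionary that recognizes the word wins" behaviour described above can be modelled in a few lines of C. This is a toy model, not tsearch2 internals: `lexize_fn`, `toy_synonym`, `toy_stem`, and `run_chain` are hypothetical names, and the two toy dictionaries know only the word "hundred".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A toy dictionary: returns a NULL-terminated lexeme list, or NULL if
 * it does not recognize the word. */
typedef const char **(*lexize_fn)(const char *word);

static const char *syn_out[]  = { "hundred", NULL };
static const char *stem_out[] = { "hundr", NULL };

/* Toy synonym dictionary: maps "hundred" to itself, unstemmed. */
static const char **toy_synonym(const char *word)
{
    return strcmp(word, "hundred") == 0 ? syn_out : NULL;
}

/* Toy stemmer: maps "hundred" to "hundr". */
static const char **toy_stem(const char *word)
{
    return strcmp(word, "hundred") == 0 ? stem_out : NULL;
}

/* Try each dictionary in order and stop at the first non-NULL answer.
 * Later dictionaries never see a word that an earlier one recognized. */
static const char **run_chain(const lexize_fn *dicts, size_t n,
                              const char *word)
{
    size_t i;

    assert(dicts != NULL && word != NULL);
    for (i = 0; i < n; i++) {
        const char **res = dicts[i](word);

        if (res != NULL)
            return res;
    }
    return NULL;
}
```

With the chain `{ toy_synonym, toy_stem }` the word "hundred" comes back unstemmed, because the stemmer is never consulted; with the stemmer alone it comes back as "hundr". That is why the output of one dictionary is not fed into the next.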

                        > Once I figure out how to tweak the parser to parse things they way I
                        > want, I can expand upon those docs too. Looks like I'm going to need to
                        > reach waaaay back into my brain and dust off my flex knowledge for that,
                        > though....

                        What do you want from the parser?

                        Regards,
                        Oleg
                        _____________________________________________________________
                        Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
                        Sternberg Astronomical Institute, Moscow University (Russia)
                        Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
                        phone: +007(095)939-16-83, +007(095)939-23-83


                        • Ben

                          #13
                          Re: making tsearch2 dictionaries

                          On Tue, 2004-02-17 at 03:15, Oleg Bartunov wrote:
                          > Do you want '100' or 'hundred' will be fully equivalent ? So,
                          > if you search '100' you will find document with 'hundred'. Interesting,
                          > that you will find '123', because '123' will be 'one hundred twenty three'.

                          Yeah, for a general case of documents I'm not sure how accurate it would
                          make things, but I'm trying to index music artist names and song titles,
                          where I'd get things like "3 Dog Night".... or is that "Three Dog
                          Night"? :)
                          > What's the problem ? You may configure which dictionaries and in what order
                          > should be used for given type of token (pg_ts_cfgmap table).
                          > Aha, I got your problem:

                          > Once word is recognized by synonym dictionary it will not pass to
                          > next dictionary ! This is how tsearch2 is working with any dictionary.

                          Yep, that's my problem. :) And it seems that if I could pass the normal
                          words into an ispell dictionary before passing them on to the en_stem
                          dictionary, I'd get spell checking for free. Unless there's a better way
                          to give "did you mean: <your search spelled correctly>?" results....?

                          I know doing this would increase the size of the generated ts_vector,
                          but for my case, where what I'm indexing is generally only a few words
                          anyway, that's not an issue. As it is, I'm already going to get rid of
                          the stop words file, so that I can actually find things like "The Who."

                          How hard do you think it would be to change up the behavior to make this
                          happen? I
                          > What do you want from parser ?

                          I want to be able to recognize symbols, such as the degree (°) and
                          vulgar half (½) symbols.



                          • Oleg Bartunov

                            #14
                            Re: making tsearch2 dictionaries

                            On Tue, 17 Feb 2004, Ben wrote:
                            > Yep, that's my problem. :) And it seems that if I could pass the normal
                            > words into an ispell dictionary before passing them on to the en_stem
                            > dictionary, I'd get spell checking for free. Unless there's a better way
                            > to give "did you mean: <your search spelled correctly>?" results....?

                            If the ispell dictionary recognizes a word, that word will not pass to en_stem.
                            We know how to add a "query spelling feature" to tsearch2, we're just waiting
                            for sponsorship :) Meanwhile, you could use our trgm module, which
                            implements trigram-based spelling correction. You need to maintain a
                            separate table with all words of interest (say, from the tsvectors) and
                            look up query words in that table using the bestmatch function.
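For readers unfamiliar with trigram matching, the idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the trgm module's code: the padding scheme, the Jaccard-style score, the 0.3 cutoff, and the names `trigrams`, `similarity`, and `best_match` are all assumptions made for the example, and the `words` list stands in for the separate table of indexed words.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word."""
    # Assumption: pad with two spaces in front and one behind so that
    # word boundaries contribute trigrams of their own.
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(query_word, known_words, threshold=0.3):
    """Return the known word most similar to query_word, or None.

    known_words plays the role of the word table; threshold is an
    arbitrary illustrative cutoff, not a value taken from trgm.
    """
    best = max(known_words, key=lambda w: similarity(query_word, w), default=None)
    if best is None or similarity(query_word, best) < threshold:
        return None
    return best


# Hypothetical word table built from indexed titles.
words = ["hundred", "night", "three", "dog"]
suggestion = best_match("hundrd", words)  # candidate for "did you mean ...?"
```

A misspelled query word that shares most of its trigrams with an indexed word gets that word back as a suggestion; a word with no overlap gets nothing.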
                            > I know doing this would increase the size of the generated ts_vector,
                            > but for my case, where what I'm indexing is generally only a few words
                            > anyway, that's not an issue. As it is, I'm already going to get rid of
                            > the stop words file, so that I can actually find things like "The Who."
                            >
                            > How hard do you think it would be to change up the behavior to make this
                            > happen? I
                            >
                            > > What do you want from parser ?
                            >
                            > I want to be able to recognize symbols, such as the degree (ôá) and
                            > vulgar half (ôî) symbols.

                            You mean '(TA)', '(TH)' ? I think it's not very difficult. What'd be
                            a token type ( parenthesis_word :?)

                            Regards,
                            Oleg
                            _____________________________________________________________
                            Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
                            Sternberg Astronomical Institute, Moscow University (Russia)
                            Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
                            phone: +007(095)939-16-83, +007(095)939-23-83


                            • Ben

                              #15
                              Re: making tsearch2 dictionaries

                              On Tue, 17 Feb 2004, Oleg Bartunov wrote:
                              > If the ispell dictionary recognizes a word, that word will not pass to en_stem.
                              > We know how to add a "query spelling feature" to tsearch2, we're just waiting
                              > for sponsorship :) Meanwhile, you could use our trgm module, which
                              > implements trigram-based spelling correction. You need to maintain a
                              > separate table with all words of interest (say, from the tsvectors) and
                              > look up query words in that table using the bestmatch function.

                              Hm, I'll take a look at this approach. I take it you think piping
                              dictionary output to more dictionaries in the chain is a bad idea? :)
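The difference between the current behaviour (the first dictionary that recognizes a token ends the lookup) and the piping being asked about can be sketched as below. This is a toy model in Python, not tsearch2's implementation: `number_dict` and `toy_stemmer` are hypothetical stand-ins that only know the "100" and "hundred" examples from this thread, and letting unrecognized lexemes pass through unchanged is a design assumption of the sketch.

```python
# Toy "dictionaries": each maps a token to a list of lexemes,
# or returns None when it does not recognize the token.
def number_dict(token):
    table = {"100": ["one", "hundred"]}  # hypothetical, deliberately tiny
    return table.get(token)


def toy_stemmer(token):
    # Crude stand-in for en_stem; it only knows the thread's example.
    stems = {"hundred": "hundr"}
    return [stems.get(token, token)]


def first_match(token, dictionaries):
    """First dictionary to recognize the token wins; later ones never see it."""
    for lookup in dictionaries:
        lexemes = lookup(token)
        if lexemes is not None:
            return lexemes
    return []


def piped(token, dictionaries):
    """Feed every lexeme produced by one stage into the next stage."""
    lexemes = [token]
    for lookup in dictionaries:
        next_lexemes = []
        for lex in lexemes:
            result = lookup(lex)
            # Assumption: an unrecognized lexeme passes through unchanged.
            next_lexemes.extend(result if result is not None else [lex])
        lexemes = next_lexemes
    return lexemes


chain = [number_dict, toy_stemmer]
stopped_early = first_match("100", chain)  # stemmer is never consulted
piped_through = piped("100", chain)        # both stages applied in order
```

With the first-match rule, "100" stops at the number dictionary; with piping, each resulting word also goes through the stemmer.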
                              > > > What do you want from parser ?
                              > >
                              > > I want to be able to recognize symbols, such as the degree (ôá) and
                              > > vulgar half (ôî) symbols.
                              >
                              > You mean '(TA)', '(TH)' ? I think it's not very difficult. What'd be
                              > a token type ( parenthesis_word :?)

                              Uh, not sure how you got (TA) and (TH)... if you view the original
                              message with UTF-8 encoding, the symbols come out fine. Or maybe
                              you'd just have better luck pointing a browser at a page like
                              http://homepages.comnet.co.nz/~r-mah...text/utf8.html. I want to be
                              able to recognize a subset of these symbols, and I'd want another
                              dictionary of mine to handle the symbol token and return both the symbol
                              and its common name as lexemes, in case people spell out the symbol
                              instead of entering it.
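A symbol dictionary of the kind described could behave roughly like this. The sketch is in Python for brevity (a real tsearch2 dictionary would be written in C), and the table contents and the name `symbol_lexize` are hypothetical choices for illustration only.

```python
# Hypothetical table mapping a symbol token to its spelled-out name(s).
SYMBOL_NAMES = {
    "\u00b0": ["degree"],  # degree sign
    "\u00bd": ["half"],    # vulgar fraction one half
}


def symbol_lexize(token):
    """Return the symbol itself plus its common name(s) as lexemes.

    Returns None for tokens the dictionary does not recognize, so a
    caller could fall back to another dictionary.
    """
    names = SYMBOL_NAMES.get(token)
    if names is None:
        return None
    return [token] + names


degree_lexemes = symbol_lexize("\u00b0")
```

Indexing both forms means a search for the typed symbol and a search for the spelled-out name would hit the same entry.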

