making tsearch2 dictionaries

  • Oleg Bartunov

    #16
    Re: making tsearch2 dictionaries

    On Tue, 17 Feb 2004, Ben wrote:
    > On Tue, 17 Feb 2004, Oleg Bartunov wrote:
    >
    > > If the ispell dictionary recognizes a word, that word will not pass to
    > > en_stem. We know how to add a "query spelling feature" to tsearch2, just
    > > waiting for sponsorships :) Meanwhile, you could use our trgm module, which
    > > implements trigram-based spelling correction. You need to maintain a
    > > separate table with all words of interest (say, from tsvectors) and
    > > search query words in that table using the bestmatch function.
    >
    > Hm, I'll take a look at this approach. I take it you think piping
    > dictionary output to more dictionaries in the chain is a bad idea? :)

    It's unpredictable and I still don't get your idea of pipelining, but
    in general, I have nothing against it.
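
For reference, the trigram approach mentioned in the quoted text above can be sketched roughly as follows. This is plain Python, not the trgm module itself: the helper names (`trigrams`, `similarity`, `best_match`), the padding scheme, and the Jaccard-style score are all assumptions made for illustration, and the word list stands in for the separate table of known words.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word.

    The padding (two leading spaces, one trailing) is an assumption chosen
    so that word boundaries contribute trigrams too.
    """
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(query, vocabulary):
    """Pick the known word most similar to the query, or None if there are none."""
    return max(vocabulary, key=lambda w: similarity(query, w), default=None)


# Hypothetical vocabulary, e.g. words collected from the indexed documents.
known_words = ["search", "stem", "parser"]
suggestion = best_match("serach", known_words)
```

A real setup would likely also apply a minimum similarity threshold so that unrelated words are not suggested.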
    >
    > > > > What do you want from the parser?
    > > >
    > > > I want to be able to recognize symbols, such as the degree (°) and
    > > > vulgar half (½) symbols.
    > >
    > > You mean '(TA)', '(TH)'? I think it's not very difficult. What would be
    > > the token type (parenthesis_word :?)
    >
    > uh, not sure how you got (TA) and (TH)... if you look at the original
    > message with utf-8 unicode encoding, the symbols come out fine. Or, maybe
    > you'd just have better luck pointing a browser at a page like

    Yup:)
    > http://homepages.comnet.co.nz/~r-mah...text/utf8.html. I want to be
    > able to recognize a subset of these symbols, and I'd want another
    > dictionary I'd make to handle the symbol token to return both the symbol
    > and the common name as lexemes, in case people spell out the symbol
    > instead of entering it.
    >

    Aha, the same way as we handle complex words with a hyphen: we return
    the whole word and its parts. So you need to introduce a new type of token
    in the parser and use a synonym dictionary, which in turn will return
    the symbol token and a human-readable word.
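
A minimal sketch of that idea, in plain Python rather than the tsearch2 dictionary API: a synonym-style lookup that returns both the symbol and a spelled-out name for a symbol token. The function name `symbol_dictionary`, the mapping, and the chosen names are hypothetical; returning `None` for an unknown token is an assumption modelled on a dictionary declining a word.

```python
# Hypothetical symbol-to-name mapping; a real synonym dictionary would be
# configured with whatever subset of symbols is of interest.
SYMBOL_NAMES = {
    "\u00b0": "degree",  # degree sign
    "\u00bd": "half",    # vulgar fraction one half
}


def symbol_dictionary(token):
    """Return [symbol, common name] for a known symbol, else None."""
    name = SYMBOL_NAMES.get(token)
    if name is None:
        return None  # token not recognized by this dictionary
    return [token, name]
```

Indexing both lexemes means a query matches whether the user enters the symbol itself or spells out its name.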

    Regards,
    Oleg
    ________________________________________________________________
    Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
    Sternberg Astronomical Institute, Moscow University (Russia)
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(095)939-16-83, +007(095)939-23-83

    ---------------------------(end of broadcast)---------------------------
    TIP 4: Don't 'kill -9' the postmaster


    • Ben

      #17
      Re: making tsearch2 dictionaries

      On Tue, 17 Feb 2004, Oleg Bartunov wrote:
      > It's unpredictable and I still don't get your idea of pipelining, but
      > in general, I have nothing against it.

      Oh, well, the idea is that instead of the dictionary searching stopping at
      the first dictionary in the chain that returns a lexeme, it would take
      each of the lexemes returned and pass them on to the next dictionary in
      the chain.

      So if I specified that numbers were to be handled by my num2english
      dictionary, followed by en_stem, and then tried to get a vector for "100",
      num2english would return "one" and "hundred". Then both "one" and
      "hundred" would each be looked up in en_stem, and the union of these
      lexemes would be the final result.

      Similarly, if a Latin word gets piped through an ispell dictionary before
      being sent to en_stem, each possible spelling would be stemmed.
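
To make the difference concrete, here is a rough sketch in plain Python (not tsearch2 code). The two toy dictionaries, the function names `first_match` and `pipeline`, and the rule that an unrecognized lexeme passes through to the next dictionary unchanged are all assumptions for illustration.

```python
def num2english(token):
    """Toy stand-in for num2english: knows only a couple of tokens."""
    table = {"100": ["one", "hundred"], "1000s": ["thousands"]}
    return table.get(token)  # None means "not recognized"


def toy_stem(token):
    """Toy stand-in for en_stem: lowercase and strip a trailing 's'."""
    word = token.lower()
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return [word]


def first_match(token, chain):
    """Stop at the first dictionary in the chain that returns lexemes."""
    for dictionary in chain:
        lexemes = dictionary(token)
        if lexemes is not None:
            return lexemes
    return []


def pipeline(token, chain):
    """Pass every lexeme on to the next dictionary and union the results."""
    current = [token]
    for dictionary in chain:
        following = []
        for item in current:
            output = dictionary(item)
            if output is None:
                output = [item]  # assumption: unrecognized items pass through
            for lexeme in output:
                if lexeme not in following:
                    following.append(lexeme)
        current = following
    return current


chain = [num2english, toy_stem]
stopped = first_match("1000s", chain)  # stemmer never sees the output
piped = pipeline("1000s", chain)       # output is stemmed as well
```

With the first strategy the stemmer never sees what num2english produced; with the second, each returned lexeme is stemmed before the results are combined.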
      > Aha, the same way as we handle complex words with a hyphen: we return
      > the whole word and its parts. So you need to introduce a new type of token
      > in the parser and use a synonym dictionary, which in turn will return
      > the symbol token and a human-readable word.

      Okay, that makes sense. I'll look more into how hyphenated words are being
      handled now.


      ---------------------------(end of broadcast)---------------------------
      TIP 6: Have you searched our list archives?


