website doc search is extremely SLOW

  • John Sidney-Woollett

    #16
    Re: website doc search is extremely SLOW

    I think that Oleg's new search offering looks really good and fast. (I
    can't wait till I have some task that needs tsearch!).

    I agree with Dave that searching the docs is more important for me than
    the sites - but it would be really nice to have both, in one tool.

    I built something similar for the Tate Gallery in the UK - here you can
    select the type of content that you want returned, either static pages or
    dynamic. You can see the idea at
    http://www.tate.org.uk/search/defaul...oil&action=new

    This is custom built (using java/Oracle), supports stemming, boolean
    operators, exact phrase matching, relevancy and matched term highlighting.

    You can switch on/off the types of documents that you are not interested
    in. Using this analogy, a search facility that could offer you results
    from i) the docs and/or ii) the postgres sites static pages would be very
    useful.

    John Sidney-Woollett

    Dave Cramer said:
    > Marc,
    >
    > No it doesn't spider, it is a specialized tool for searching documents.
    >
    > I'm curious, what value is there to being able to count the number of
    > url's ?
    >
    > It does do things like query all documents where CREATE AND TABLE are n
    > words apart, just as fast, I would think these are more valuable to
    > document searching?
    >
    > I think the challenge here is what do we want to search. I am betting
    > that folks use this page as they would man? ie. what is the command for
    > create trigger?
    >
    > As I said my offer stands to help out, but I think if the goal is to
    > search the entire website, then this particular tool is not useful.
    >
    > At this point I am working on indexing the sgml directly as it has less
    > cruft in it. For instance all the links that appear in every summary are
    > just noise.
    >
    >
    > Dave
    >
    > On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
    >> On Wed, 31 Dec 2003, Dave Cramer wrote:
    >>
    >> > I can modify mine to be client server if you want?
    >> >
    >> > It is a java app, so we need to be able to run jdk1.3 at least?
    >>
    >> jdk1.4 is available on the VMs ... does your spider? for instance, you
    >> mention that you have the docs indexed right now, but we are currently
    >> indexing:
    >>
    >> Server http://archives.postgresql.org/
    >> Server http://advocacy.postgresql.org/
    >> Server http://developer.postgresql.org/
    >> Server http://gborg.postgresql.org/
    >> Server http://pgadmin.postgresql.org/
    >> Server http://techdocs.postgresql.org/
    >> Server http://www.postgresql.org/
    >>
    >> will it be able to handle:
    >>
    >> 186_archives=# select count(*) from url;
    >> count
    >> --------
    >> 393551
    >> (1 row)
    >>
    >> as fast as you are finding with just the docs?
    >>
    >> ----
    >> Marc G. Fournier Hub.Org Networking Services
    >> (http://www.hub.org)
    >> Email: scrappy@hub.org Yahoo!: yscrappy ICQ:
    >> 7615664
    >>
    > --
    > Dave Cramer
    > 519 939 0336
    > ICQ # 1467551
    >




    • Ericson Smith

      #17
      Re: website doc search is extremely SLOW

      You should probably take a look at the Swish project. For a certain
      project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
      but with over 600,000 documents to index, both took too long to conduct
      searches, especially as the database was swapped in and out of memory
      based on search segment. MySQL full text was the most unusable.

      Swish uses its own internal DB format, and comes with a simple spider as
      well. You can make it search by category, date and other nifty criteria
      also.
      http://swish-e.org

      You can take a look over at the project and do some searches to see what
      I mean:
      http://cbd-net.com
      Warmest regards,
      Ericson Smith
      Tracking Specialist/DBA
      +-----------------------+-----------------------------+
      | http://www.did-it.com | "When I'm paid, I always    |
      | eric@did-it.com       | follow the job through.     |
      | 516-255-0500          | You know that." -Angel Eyes |
      +-----------------------+-----------------------------+



      John Sidney-Woollett wrote:
      > I think that Oleg's new search offering looks really good and fast. (I
      > can't wait till I have some task that needs tsearch!).
      >
      > I agree with Dave that searching the docs is more important for me than
      > the sites - but it would be really nice to have both, in one tool.
      >
      > [...]






      • John Sidney-Woollett

        #18
        Re: website doc search is extremely SLOW

        Wow, you're right - I could have probably saved myself a load of time! :)

        Although you do learn a lot reinventing the wheel... ...or at least you
        hit the same issues and insights others did before...

        John

        Ericson Smith said:
        > You should probably take a look at the Swish project. For a certain
        > project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
        > but with over 600,000 documents to index, both took too long to conduct
        > searches, especially as the database was swapped in and out of memory
        > based on search segment. MySQL full text was the most unusable.
        >
        > Swish uses its own internal DB format, and comes with a simple spider as
        > well. You can make it search by category, date and other nifty criteria
        > also.
        > http://swish-e.org
        >
        > You can take a look over at the project and do some searches to see what
        > I mean:
        > http://cbd-net.com
        >
        > [...]






        • Dave Cramer

          #19
          Re: website doc search is extremely SLOW

          The search engine I am using is Lucene:
          Apache Lucene™ is a high-performance, full-featured search engine
          library written entirely in Java.

          It too uses its own internal database format, optimized for searching;
          it is quite flexible, and allows searching on arbitrary fields as well.

          The section on querying explains more.

          It is even possible to index text data inside a database.

          Dave
          On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
          > Wow, you're right - I could have probably saved myself a load of time! :)
          >
          > Although you do learn a lot reinventing the wheel... ...or at least you
          > hit the same issues and insights others did before...
          >
          > John
          >
          > [...]
          --
          Dave Cramer
          519 939 0336
          ICQ # 1467551




          • Dave Cramer

            #20
            Re: website doc search is extremely SLOW

            Well, it appears there are quite a few solutions to use, so the next
            question should be: what are we trying to accomplish here?

            One thing I think is that the documentation search should be
            limited to the documentation.

            Who is in a position to make the decision of which solution to use?

            Dave
            On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
            > Wow, you're right - I could have probably saved myself a load of time! :)
            >
            > Although you do learn a lot reinventing the wheel... ...or at least you
            > hit the same issues and insights others did before...
            >
            > John
            >
            > [...]
            --
            Dave Cramer
            519 939 0336
            ICQ # 1467551




            • Eric Ridge

              #21
              Re: website doc search is extremely SLOW

              On Dec 31, 2003, at 5:40 AM, Arjen van der Meijden wrote:
              > The main advantage of taking this out of your sql database is that it
              > runs on its own custom built storage system (and you could offload it
              > to another machine, like we did).
              > Btw, if you really need an "in database" solution, read back the
              > postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's
              > working on integrating xapian in postgresql as a FTI)

              Hi, that's me! I'm working on it right now, and it's coming along really
              well. I actually hope to have it integrated with Postgres' storage
              subsystem by the end of the day and to have it returning useful
              results.

              eric






              • Marc G. Fournier

                #22
                Re: website doc search is extremely SLOW

                On Wed, 31 Dec 2003, Dave Cramer wrote:
                > Marc,
                >
                > No it doesn't spider, it is a specialized tool for searching documents.
                >
                > I'm curious, what value is there to being able to count the number of
                > url's ?

                Sorry, that was just an example of the # of docs that have to be searched
                through ... again, the *biggest* thing that is searched is the mailing
                list archives, so without spidering, not sure how we'll be able to pull
                that in ... ?


                ----
                Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664





                • Marc G. Fournier

                  #23
                  Re: website doc search is extremely SLOW

                  On Tue, 30 Dec 2003, Joshua D. Drake wrote:
                  > Hello,
                  >
                  > Why are we not using Tsearch2?

                  Because nobody has built it yet? Oleg's stuff is nice, but we want
                  something that we can build into the existing web sites, not a standalone
                  site ...

                  I keep searching the web hoping someone has come up with a 'tsearch2'
                  based search engine that does the spidering, but, unless it's sitting right
                  in front of my eyes and I'm not seeing it, I haven't found it yet :(
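
                  For reference, a minimal sketch of what a tsearch2-based setup on
                  a 7.4 contrib install might look like (table and column names here
                  are hypothetical, and the spidering is exactly the part tsearch2
                  does not provide):

                  -- Hypothetical schema; some external spider would have to fill it.
                  CREATE TABLE pages (
                      rec_id  serial PRIMARY KEY,
                      url     text NOT NULL,
                      body    text
                  );

                  -- contrib/tsearch2 supplies the tsvector type, to_tsvector(),
                  -- to_tsquery() and the @@ operator.
                  ALTER TABLE pages ADD COLUMN body_idx tsvector;
                  UPDATE pages SET body_idx = to_tsvector('default', body);
                  CREATE INDEX pages_body_idx ON pages USING gist (body_idx);

                  -- Find pages containing both words, via the GiST index.
                  SELECT url FROM pages
                  WHERE body_idx @@ to_tsquery('default', 'create & trigger');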

                  Out of everything I've found so far, mnogosearch is one of the best ... I
                  just wish I could figure out where the bottleneck for it was, since, from
                  reading their docs, their method of storing the data doesn't appear to be
                  particularly off. I'm tempted to try their caching storage manager, and
                  getting away from SQL totally, but I *really* want to showcase PostgreSQL
                  on this :(

                  ----
                  Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                  Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664



                  • Bruce Momjian

                    #24
                    Re: website doc search is extremely SLOW

                    Marc G. Fournier wrote:
                    > [...]
                    >
                    > Out of everything I've found so far, mnogosearch is one of the best ... I
                    > just wish I could figure out where the bottleneck for it was, since, from
                    > reading their docs, their method of storing the data doesn't appear to be
                    > particularly off. I'm tempted to try their caching storage manager, and
                    > getting away from SQL totally, but I *really* want to showcase PostgreSQL
                    > on this :(

                    Well, PostgreSQL is being un-showcased in the current setup, that's for
                    sure. :-(

                    --
                    Bruce Momjian | http://candle.pha.pa.us
                    pgman@candle.pha.pa.us | (610) 359-1001
                    + If your life is a hard drive, | 13 Roberts Road
                    + Christ can be your backup. | Newtown Square, Pennsylvania 19073



                    • Marc G. Fournier

                      #25
                      Re: website doc search is extremely SLOW

                      On Wed, 31 Dec 2003, Bruce Momjian wrote:

                      > > Out of everything I've found so far, mnogosearch is one of the best ... I
                      > > just wish I could figure out where the bottleneck for it was, since, from
                      > > reading their docs, their method of storing the data doesn't appear to be
                      > > particularly off. I'm tempted to try their caching storage manager, and
                      > > getting away from SQL totally, but I *really* want to showcase PostgreSQL
                      > > on this :(
                      >
                      > Well, PostgreSQL is being un-showcased in the current setup, that's for
                      > sure. :-(

                      Agreed ... I could install the MySQL backend, which it's designed for, and
                      advertise it as PostgreSQL? :)

                      ----
                      Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                      Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664



                      • Bruce Momjian

                        #26
                        Re: website doc search is extremely SLOW

                        Marc G. Fournier wrote:
                        > On Wed, 31 Dec 2003, Bruce Momjian wrote:
                        >
                        > > > Out of everything I've found so far, mnogosearch is one of the best ... I
                        > > > just wish I could figure out where the bottleneck for it was, since, from
                        > > > reading their docs, their method of storing the data doesn't appear to be
                        > > > particularly off. I'm tempted to try their caching storage manager, and
                        > > > getting away from SQL totally, but I *really* want to showcase PostgreSQL
                        > > > on this :(
                        > >
                        > > Well, PostgreSQL is being un-showcased in the current setup, that's for
                        > > sure. :-(
                        >
                        > Agreed ... I could install the MySQL backend, which it's designed for, and
                        > advertise it as PostgreSQL? :)

                        I would be curious to know if it is faster --- that would tell us if it
                        is tuned only for MySQL.

                        Have you tried CLUSTER? I think the MySQL ISAM files are
                        auto-clustered, and clustering is usually important for full-text
                        searches.
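
                        As a sketch (using the mnogosearch table and index names that
                        show up later in this thread; 7.4's CLUSTER syntax takes the
                        index name first), that would be something like:

                        -- Physically rewrite ndict8 in n8_word (word_id) order so rows
                        -- for one word land on adjacent pages. Takes an exclusive lock
                        -- and has to be re-run as the table churns.
                        CLUSTER n8_word ON ndict8;
                        ANALYZE ndict8;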


                        --
                        Bruce Momjian | http://candle.pha.pa.us
                        pgman@candle.pha.pa.us | (610) 359-1001
                        + If your life is a hard drive, | 13 Roberts Road
                        + Christ can be your backup. | Newtown Square, Pennsylvania 19073



                        • Dave Cramer

                          #27
                          Re: website doc search is extremely SLOW

                          On Wed, 2003-12-31 at 18:43, Bruce Momjian wrote:
                          > [...]
                          >
                          > Well, PostgreSQL is being un-showcased in the current setup, that's for
                          > sure. :-(
                          In fact this is a very bad advertisement for postgres

                          --
                          Dave Cramer
                          519 939 0336
                          ICQ # 1467551




                          • Marc G. Fournier

                            #28
                            Mnogosearch (Was: Re: website doc search is ... )

                            On Wed, 31 Dec 2003, Bruce Momjian wrote:

                            > [...]
                            >
                            > I would be curious to know if it is faster --- that would tell us if it
                            > is tuned only for MySQL.
                            >
                            > Have you tried CLUSTER? I think the MySQL ISAM files are
                            > auto-clustered, and clustering is usually important for full-text
                            > searches.

                            Actually, check out http://www.mnogosearch.com ... the way they do the
                            indexing doesn't (at least, as far as I can tell) make use of full-text
                            searching. Simplistically, it appears to take the web page, sort -u all
                            the words it finds, removes all 'stopwords' (and, the, in, etc) from the
                            result, and then dumps the resultant words to the database, link'd to the
                            URL ...

                            We're using crc-multi, so a CRC value of the word is what is stored in the
                            database, not the actual word itself ... the '-multi' part spreads the
                            words across several tables depending on the word size, to keep total # of
                            rows down ...

                            The slow part on the database is finding those words, as can be seen by
                            the following search on 'SECURITY INVOKER':

                            Jan 1 01:21:05 pgsql74 postgres[59959]: [44-1] LOG: statement: SELECT ndict8.url_id,ndict8.intag FROM ndict8, url WHERE ndict8.word_id=417851441 AND url.rec_id=ndict8.url_id
                            Jan 1 01:21:05 pgsql74 postgres[59959]: [44-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
                            Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
                            Jan 1 01:22:00 pgsql74 postgres[59959]: [46-1] LOG: statement: SELECT ndict7.url_id,ndict7.intag FROM ndict7, url WHERE ndict7.word_id=-509484498 AND url.rec_id=ndict7.url_id
                            Jan 1 01:22:00 pgsql74 postgres[59959]: [46-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms

                            ndict8 looks like:

                            186_archives=# select count(1) from ndict8;
                              count
                            ---------
                             6320380
                            (1 row)

                            186_archives=# select count(1) from ndict8 where word_id=417851441;
                             count
                            -------
                             15532
                            (1 row)

                            186_archives=# \d ndict8
                                  Table "public.ndict8"
                             Column  |  Type   |     Modifiers
                            ---------+---------+--------------------
                             url_id  | integer | not null default 0
                             word_id | integer | not null default 0
                             intag   | integer | not null default 0
                            Indexes:
                                "n8_url" btree (url_id)
                                "n8_word" btree (word_id)


                            and ndict7 looks like:

                            186_archives=# select count(1) from ndict7;
                            count
                            ---------
                            8400616
                            (1 row)
                            186_archives=# select count(1) from ndict7 where word_id=-509484498;
                            count
                            -------
                            333
                            (1 row)

                            186_archives=# \d ndict7
                                  Table "public.ndict7"
                             Column  |  Type   |     Modifiers
                            ---------+---------+--------------------
                             url_id  | integer | not null default 0
                             word_id | integer | not null default 0
                             intag   | integer | not null default 0
                            Indexes:
                                "n7_url" btree (url_id)
                                "n7_word" btree (word_id)


                            The slowdown is the LIKE condition, as the ndict[78] word_id conditions
                            return near instantly when run individually, and when I run the 'url/LIKE'
                            condition, it takes "forever" ...

                            186_archives=# select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
                             count
                            --------
                             304811
                            (1 row)

                            186_archives=# explain analyze select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
                                                                                  QUERY PLAN
                            ------------------------------------------------------------------------------------------------------------------
                             Aggregate  (cost=93962.19..93962.19 rows=1 width=0) (actual time=5833.084..5833.088 rows=1 loops=1)
                               ->  Seq Scan on url  (cost=0.00..93957.26 rows=1968 width=0) (actual time=0.069..4387.378 rows=304811 loops=1)
                                     Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
                             Total runtime: 5833.179 ms
                            (4 rows)


                            Hrmmm ... I don't have much (any) experience with tsearch, but could it be
                            used to replace the LIKE? Then again, when it's returning 300k rows out of
                            393k, it wouldn't help much on the above, would it?
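
                            (A sketch of one thing to try: the (url.url || '')
                            concatenation by itself keeps the planner from using any
                            index on url. If mnogosearch's generated SQL can be edited
                            to drop it, a btree built with 7.4's text_pattern_ops
                            opclass can serve the prefix LIKE regardless of locale:)

                            -- Illustrative only; assumes mnogosearch's query templates
                            -- can be changed, and the index name is made up.
                            CREATE INDEX url_url_pattern ON url (url text_pattern_ops);

                            SELECT ndict8.url_id, ndict8.intag
                            FROM ndict8, url
                            WHERE ndict8.word_id=417851441
                              AND url.rec_id=ndict8.url_id
                              AND url.url LIKE 'http://archives.postgresql.org/%';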

                            The full first query:

                            SELECT ndict8.url_id, ndict8.intag
                            FROM ndict8, url
                            WHERE ndict8.word_id=417851441
                            AND url.rec_id=ndict8.url_id
                            AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');

                            returns 13415 rows, and explain analyze shows:

                            -----------------------------------------------------------------------------------------------------------------------------------
                             Nested Loop  (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
                               ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
                                     Index Cond: (word_id = 417851441)
                               ->  Index Scan using url_rec_id on url  (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
                                     Index Cond: (url.rec_id = "outer".url_id)
                                     Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
                             Total runtime: 1520.145 ms
                            (7 rows)

                            Which, of course, doesn't come close to matching what the duration showed
                            in the original, most likely due to caching :(

                            The server that the database is on rarely jumps above a loadavg of 1, isn't
                            using any swap (after 77 days up, used swap is 0% -or- 17meg) and the
                            database itself is on a strip'd file system ...

                            I'm open to ideas/things to try here ...

                            The whole 'process' of the search shows the following times for the
                            queries:

                            pgsql74# grep 59959 /var/log/pgsql | grep duration
                            Jan 1 01:21:05 pgsql74 postgres[59959]: [39-1] LOG: duration: 25.663 ms
                            Jan 1 01:21:05 pgsql74 postgres[59959]: [41-1] LOG: duration: 4.376 ms
                            Jan 1 01:21:05 pgsql74 postgres[59959]: [43-1] LOG: duration: 11.179 ms
                            Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [49-1] LOG: duration: 7.886 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [51-1] LOG: duration: 1.516 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [53-1] LOG: duration: 3.539 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [55-1] LOG: duration: 109.890 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [57-1] LOG: duration: 15.582 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [59-1] LOG: duration: 1.631 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [61-1] LOG: duration: 0.838 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [63-1] LOG: duration: 2.148 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [65-1] LOG: duration: 0.810 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [67-1] LOG: duration: 1.211 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [69-1] LOG: duration: 0.798 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [71-1] LOG: duration: 0.861 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [73-1] LOG: duration: 0.748 ms
                            Jan 1 01:22:01 pgsql74 postgres[59959]: [75-1] LOG: duration: 0.555 ms

                            With the two >1000ms queries being the above two ndict[78] queries ...

                            Doing two subsequent searches, on "setuid functions" and "privilege
                            rules", just so that caching isn't involved, shows pretty much the same
                            distribution:

                            grep 61697 /var/log/pgsql | grep duration
                            Jan 1 01:44:25 pgsql74 postgres[61697]: [41-1] LOG: duration: 1.244 ms
                            Jan 1 01:44:25 pgsql74 postgres[61697]: [43-1] LOG: duration: 21.868 ms
                            Jan 1 01:44:25 pgsql74 postgres[61697]: [45-1] LOG: duration: 17.956 ms
                            Jan 1 01:44:29 pgsql74 postgres[61697]: [47-1] LOG: duration: 4452.326 ms
                            Jan 1 01:44:57 pgsql74 postgres[61697]: [49-1] LOG: duration: 27992.581 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [51-1] LOG: duration: 357.158 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [53-1] LOG: duration: 1.338 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [55-1] LOG: duration: 11.438 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [57-1] LOG: duration: 63.389 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [59-1] LOG: duration: 134.941 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [61-1] LOG: duration: 0.570 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [63-1] LOG: duration: 0.489 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [65-1] LOG: duration: 0.477 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [67-1] LOG: duration: 0.470 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [69-1] LOG: duration: 0.471 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [71-1] LOG: duration: 0.468 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [73-1] LOG: duration: 0.473 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [75-1] LOG: duration: 0.466 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [77-1] LOG: duration: 0.469 ms
                            Jan 1 01:44:58 pgsql74 postgres[61697]: [79-1] LOG: duration: 0.515 ms

                            and:

                            grep 61869 /var/log/pgsql | grep duration
                            Jan 1 01:46:50 pgsql74 postgres[61869]: [41-1] LOG: duration: 19.776 ms
                            Jan 1 01:46:50 pgsql74 postgres[61869]: [43-1] LOG: duration: 58.352 ms
                            Jan 1 01:46:50 pgsql74 postgres[61869]: [45-1] LOG: duration: 0.897 ms
                            Jan 1 01:46:53 pgsql74 postgres[61869]: [47-1] LOG: duration: 2859.331 ms
                            Jan 1 01:47:47 pgsql74 postgres[61869]: [49-1] LOG: duration: 54774.241 ms
                            Jan 1 01:47:47 pgsql74 postgres[61869]: [51-1] LOG: duration: 14.926 ms
                            Jan 1 01:47:47 pgsql74 postgres[61869]: [53-1] LOG: duration: 1.502 ms
                            Jan 1 01:47:47 pgsql74 postgres[61869]: [55-1] LOG: duration: 3.865 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [57-1] LOG: duration: 110.435 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [59-1] LOG: duration: 0.646 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [61-1] LOG: duration: 0.503 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [63-1] LOG: duration: 0.498 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [65-1] LOG: duration: 0.484 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [67-1] LOG: duration: 0.487 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [69-1] LOG: duration: 0.478 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [71-1] LOG: duration: 0.479 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [73-1] LOG: duration: 0.480 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [75-1] LOG: duration: 0.478 ms
                            Jan 1 01:47:48 pgsql74 postgres[61869]: [77-1] LOG: duration: 0.477 ms

So it looks like it's those joins that are really killing things ...
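One way to test that theory, reusing the word_id from the log above (a sketch,
untested; enable_nestloop is just a diagnostic toggle, not a fix):

SET enable_nestloop = off;
-- does a hash or merge join beat the planner's nested loop for the same query?
EXPLAIN ANALYZE
SELECT ndict8.url_id, ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id = 417851441
  AND url.rec_id = ndict8.url_id
  AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
RESET enable_nestloop;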

                            Note that I haven't made many changes to the postgresql.conf file, so
                            there might be something really obvious I've overlooked, but here are the
uncommented ones (i.e. ones I've modified from defaults):

                            tcpip_socket = true
                            max_connections = 512
shared_buffers = 10000 # min 16, at least max_connections*2, 8KB each
                            sort_mem = 10240 # min 64, size in KB
                            vacuum_mem = 81920 # min 1024, size in KB
                            syslog = 2 # range 0-2; 0=stdout; 1=both; 2=syslog
                            syslog_facility = 'LOCAL0'
                            syslog_ident = 'postgres'
                            log_connections = true
                            log_duration = false
                            log_statement = false
                            lc_messages = 'C' # locale for system error message strings
                            lc_monetary = 'C' # locale for monetary formatting
                            lc_numeric = 'C' # locale for number formatting
                            lc_time = 'C' # locale for time formatting
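Two planner knobs I haven't touched and that might matter here, both still at
their 7.4 defaults (worth a look, at least):

SHOW effective_cache_size; -- default is 1000 8KB buffers, low for a dedicated box
SHOW random_page_cost;     -- default is 4; often lowered when most data is cached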


                            ----
                            Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                            Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

                            ---------------------------(end of broadcast)---------------------------
                            TIP 6: Have you searched our list archives?




                            • Marc G. Fournier

                              #29
                              Re: website doc search is extremely SLOW

                              On Wed, 31 Dec 2003, Dave Cramer wrote:
> In fact this is a very bad advertisement for postgres

                              I just posted a very very long email of what I'm seeing in the logs, as
                              well as various query runs ... it may just be something that I need to
tune that I'm overlooking :( the queries aren't particularly complex :(

                              ----
                              Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                              Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

                              ---------------------------(end of broadcast)---------------------------
                              TIP 6: Have you searched our list archives?




                              • Dave Cramer

                                #30
                                Re: Mnogosearch (Was: Re: website doc search is ... )

                                What is the locale of the database?

LIKE won't use an index unless the database is in the 'C' locale, or you're
on 7.4 and change the operator class of the index.
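Something like this, untested, and assuming url.url is a text column (index
names are made up):

-- 7.4: a pattern_ops index lets an anchored LIKE use a btree in any locale
CREATE INDEX url_url_pattern ON url (url text_pattern_ops);

-- mnogosearch writes the filter as (url || '') LIKE '...', which a plain
-- column index won't match; a 7.4 expression index can cover that form,
-- though the planner will only use it if the prefix is selective enough
CREATE INDEX url_url_expr_pattern ON url ((url || '') text_pattern_ops);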

                                Dave
On Wed, 2003-12-31 at 20:49, Marc G. Fournier wrote:
                                > On Wed, 31 Dec 2003, Bruce Momjian wrote:
>
> > Marc G. Fournier wrote:
                                > > > On Wed, 31 Dec 2003, Bruce Momjian wrote:
                                > > >
                                > > > > > Out of everything I've found so far, mnogosearch is one of the best ... I
                                > > > > > just wish I could figure out where the bottleneck for it was, since, from
                                > > > > > reading their docs, their method of storing the data doesn't appear to be
                                > > > > > particularly off. I'm tempted to try their caching storage manager, and
                                > > > > > getting away from SQL totally, but I *really* want to showcase PostgreSQL
                                > > > > > on this :(
                                > > > >
                                > > > > Well, PostgreSQL is being un-showcased in the current setup, that's for
                                > > > > sure. :-(
                                > > >
> > > Agreed ... I could install the MySQL backend, which it's designed for, and
> > > advertise it as PostgreSQL? :)
                                > >
                                > > I would be curious to know if it is faster --- that would tell use if it
                                > > is tuned only for MySQL.
                                > >
                                > > Have you tried CLUSTER? I think the MySQL ISAM files are
                                > > auto-clustered, and clustering is usually important for full-text
> > searches.
                                >
                                > Actually, check out http://www.mnogosearch.com ... the way they do the
                                > indexing doesn't (at least, as far as I can tell) make use of full-text
                                > searching. Simplistically, it appears to take the web page, sort -u all
> the words it finds, remove all 'stopwords' (and, the, in, etc) from the
> result, and then dump the resultant words to the database, link'd to the
                                > URL ...
                                >
                                > We're using crc-multi, so a CRC value of the word is what is stored in the
                                > database, not the actual word itself ... the '-multi' part spreads the
                                > words across several tables depending on the word size, to keep total # of
                                > rows down ...
                                >
                                > The slow part on the database is finding those words, as can be seen by
                                > the following search on 'SECURITY INVOKER':
                                >
> Jan 1 01:21:05 pgsql74 postgres[59959]: [44-1] LOG: statement: SELECT ndict8.url_id,ndict8.intag FROM ndict8, url WHERE ndict8.word_id=417851441 AND url.rec_id=ndict8.url_id
> Jan 1 01:21:05 pgsql74 postgres[59959]: [44-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
> Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
> Jan 1 01:22:00 pgsql74 postgres[59959]: [46-1] LOG: statement: SELECT ndict7.url_id,ndict7.intag FROM ndict7, url WHERE ndict7.word_id=-509484498 AND url.rec_id=ndict7.url_id
> Jan 1 01:22:00 pgsql74 postgres[59959]: [46-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
> Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms
                                >
                                > ndict8 looks like:
                                >
                                > 186_archives=# select count(1) from ndict8;
                                > count
                                > ---------
                                > 6320380
                                > (1 row)
> 186_archives=# select count(1) from ndict8 where word_id=417851441;
                                > count
                                > -------
                                > 15532
                                > (1 row)
                                >
                                > 186_archives=# \d ndict8
                                > Table "public.ndi ct8"
                                > Column | Type | Modifiers
                                > ---------+---------+--------------------
                                > url_id | integer | not null default 0
                                > word_id | integer | not null default 0
                                > intag | integer | not null default 0
                                > Indexes:
                                > "n8_url" btree (url_id)
                                > "n8_word" btree (word_id)
                                >
                                >
                                > and ndict7 looks like:
                                >
                                > 186_archives=# select count(1) from ndict7;
                                > count
                                > ---------
                                > 8400616
                                > (1 row)
                                > 186_archives=# select count(1) from ndict7 where word_id=-509484498;
                                > count
                                > -------
                                > 333
                                > (1 row)
                                >
                                > 186_archives=# \d ndict7
                                > Table "public.ndi ct7"
                                > Column | Type | Modifiers
                                > ---------+---------+--------------------
                                > url_id | integer | not null default 0
                                > word_id | integer | not null default 0
                                > intag | integer | not null default 0
                                > Indexes:
                                > "n7_url" btree (url_id)
                                > "n7_word" btree (word_id)
                                >
                                >
                                > The slowdown is the LIKE condition, as the ndict[78] word_id conditions
                                > return near instantly when run individually, and when I run the 'url/LIKE'
                                > condition, it takes "forever" ...
                                >
> 186_archives=# select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
                                > count
                                > --------
                                > 304811
                                > (1 row)
                                >
> 186_archives=# explain analyze select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
                                > QUERY PLAN
                                > ------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=93962.19..93962.19 rows=1 width=0) (actual time=5833.084..5833.088 rows=1 loops=1)
> -> Seq Scan on url (cost=0.00..93957.26 rows=1968 width=0) (actual time=0.069..4387.378 rows=304811 loops=1)
> Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
                                > Total runtime: 5833.179 ms
                                > (4 rows)
                                >
                                >
                                > Hrmmm ... I don't have much (any) experience with tsearch, but could it be
> used to replace the LIKE? Then again, when it's returning 300k rows out of
                                > 393k, it wouldn't help much on the above, would it?
                                >
                                > The full first query:
                                >
> SELECT ndict8.url_id,ndict8.intag
> FROM ndict8, url
> WHERE ndict8.word_id=417851441
> AND url.rec_id=ndict8.url_id
> AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
                                >
                                > returns 13415 rows, and explain analyze shows:
                                >
                                > -----------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
> -> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
> Index Cond: (word_id = 417851441)
> -> Index Scan using url_rec_id on url (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
> Index Cond: (url.rec_id = "outer".url_id)
> Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
                                > Total runtime: 1520.145 ms
                                > (7 rows)
                                >
                                > Which, of course, doesn't come close to matching what the duration showed
> in the original, most likely due to caching :(
                                >
> The server that the database is on rarely jumps above a loadavg of 1, isn't
                                > using any swap (after 77 days up, used swap is 0% -or- 17meg) and the
                                > database itself is on a strip'd file system ...
                                >
                                > I'm open to ideas/things to try here ...
                                >
                                > The whole 'process' of the search shows the following times for the
                                > queries:
                                >
                                > pgsql74# grep 59959 /var/log/pgsql | grep duration
                                > Jan 1 01:21:05 pgsql74 postgres[59959]: [39-1] LOG: duration: 25.663 ms
                                > Jan 1 01:21:05 pgsql74 postgres[59959]: [41-1] LOG: duration: 4.376 ms
                                > Jan 1 01:21:05 pgsql74 postgres[59959]: [43-1] LOG: duration: 11.179 ms
                                > Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [49-1] LOG: duration: 7.886 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [51-1] LOG: duration: 1.516 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [53-1] LOG: duration: 3.539 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [55-1] LOG: duration: 109.890 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [57-1] LOG: duration: 15.582 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [59-1] LOG: duration: 1.631 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [61-1] LOG: duration: 0.838 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [63-1] LOG: duration: 2.148 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [65-1] LOG: duration: 0.810 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [67-1] LOG: duration: 1.211 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [69-1] LOG: duration: 0.798 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [71-1] LOG: duration: 0.861 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [73-1] LOG: duration: 0.748 ms
                                > Jan 1 01:22:01 pgsql74 postgres[59959]: [75-1] LOG: duration: 0.555 ms
                                >
                                > With the two >1000ms queries being the above two ndict[78] queries ...
                                >
                                > Doing two subsequent searches, on "setuid functions" and "privilege
                                > rules", just so that caching isn't involved, shows pretty much the same
                                > distribution:
                                >
                                > grep 61697 /var/log/pgsql | grep duration
                                > Jan 1 01:44:25 pgsql74 postgres[61697]: [41-1] LOG: duration: 1.244 ms
                                > Jan 1 01:44:25 pgsql74 postgres[61697]: [43-1] LOG: duration: 21.868 ms
                                > Jan 1 01:44:25 pgsql74 postgres[61697]: [45-1] LOG: duration: 17.956 ms
                                > Jan 1 01:44:29 pgsql74 postgres[61697]: [47-1] LOG: duration: 4452.326 ms
                                > Jan 1 01:44:57 pgsql74 postgres[61697]: [49-1] LOG: duration: 27992.581 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [51-1] LOG: duration: 357.158 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [53-1] LOG: duration: 1.338 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [55-1] LOG: duration: 11.438 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [57-1] LOG: duration: 63.389 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [59-1] LOG: duration: 134.941 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [61-1] LOG: duration: 0.570 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [63-1] LOG: duration: 0.489 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [65-1] LOG: duration: 0.477 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [67-1] LOG: duration: 0.470 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [69-1] LOG: duration: 0.471 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [71-1] LOG: duration: 0.468 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [73-1] LOG: duration: 0.473 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [75-1] LOG: duration: 0.466 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [77-1] LOG: duration: 0.469 ms
                                > Jan 1 01:44:58 pgsql74 postgres[61697]: [79-1] LOG: duration: 0.515 ms
                                >
                                > and:
                                >
                                > grep 61869 /var/log/pgsql | grep duration
                                > Jan 1 01:46:50 pgsql74 postgres[61869]: [41-1] LOG: duration: 19.776 ms
                                > Jan 1 01:46:50 pgsql74 postgres[61869]: [43-1] LOG: duration: 58.352 ms
                                > Jan 1 01:46:50 pgsql74 postgres[61869]: [45-1] LOG: duration: 0.897 ms
                                > Jan 1 01:46:53 pgsql74 postgres[61869]: [47-1] LOG: duration: 2859.331 ms
                                > Jan 1 01:47:47 pgsql74 postgres[61869]: [49-1] LOG: duration: 54774.241 ms
                                > Jan 1 01:47:47 pgsql74 postgres[61869]: [51-1] LOG: duration: 14.926 ms
                                > Jan 1 01:47:47 pgsql74 postgres[61869]: [53-1] LOG: duration: 1.502 ms
                                > Jan 1 01:47:47 pgsql74 postgres[61869]: [55-1] LOG: duration: 3.865 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [57-1] LOG: duration: 110.435 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [59-1] LOG: duration: 0.646 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [61-1] LOG: duration: 0.503 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [63-1] LOG: duration: 0.498 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [65-1] LOG: duration: 0.484 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [67-1] LOG: duration: 0.487 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [69-1] LOG: duration: 0.478 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [71-1] LOG: duration: 0.479 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [73-1] LOG: duration: 0.480 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [75-1] LOG: duration: 0.478 ms
                                > Jan 1 01:47:48 pgsql74 postgres[61869]: [77-1] LOG: duration: 0.477 ms
                                >
> So it looks like it's those joins that are really killing things ...
                                >
                                > Note that I haven't made many changes to the postgresql.conf file, so
                                > there might be something really obvious I've overlooked, but here are the
> uncommented ones (i.e. ones I've modified from defaults):
                                >
                                > tcpip_socket = true
                                > max_connections = 512
> shared_buffers = 10000 # min 16, at least max_connections*2, 8KB each
                                > sort_mem = 10240 # min 64, size in KB
                                > vacuum_mem = 81920 # min 1024, size in KB
                                > syslog = 2 # range 0-2; 0=stdout; 1=both; 2=syslog
                                > syslog_facility = 'LOCAL0'
                                > syslog_ident = 'postgres'
                                > log_connections = true
                                > log_duration = false
                                > log_statement = false
                                > lc_messages = 'C' # locale for system error message strings
                                > lc_monetary = 'C' # locale for monetary formatting
                                > lc_numeric = 'C' # locale for number formatting
                                > lc_time = 'C' # locale for time formatting
                                >
                                >
                                > ----
                                > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
                                > Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
                                >
                                > ---------------------------(end of broadcast)---------------------------
                                > TIP 6: Have you searched our list archives?
                                >
                                > http://archives.postgresql.org
>
                                --
                                Dave Cramer
                                519 939 0336
                                ICQ # 1467551


                                ---------------------------(end of broadcast)---------------------------
                                TIP 2: you can get off all lists at once with the unregister command
                                (send "unregister YourEmailAddres sHere" to majordomo@postg resql.org)
