performing a whitespace-insensitive query

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Andy Fish

    performing a whitespace-insensitive query

    Hi,

    I am trying to concoct a query that will join rows on the basis of a
    whitespace insensitive comparison. For instance if one row has the value
    'a<space>b' and another has the value 'a<space><space >b' I want them to be
    considered equal (but not the same as 'ab')

    I am happy to do some T-SQL if that helps. Also I have a full-text index on
    the column in question, but note that I am comparing rows against each
    other, not to a fixed string.

    This is for a one-off job, so if there is no obvious way to do it on the
    database, I will just bcp out the data and knock up some perl or something
    to do it. All other things being equal I would rather do it on the database
    though.

    Many thanks

    Andy


  • Simon Hayes

    #2
    Re: performing a whitespace-insensitive query


    "Andy Fish" <ajfish@blueyon der.co.uk> wrote in message
    news:QTaUb.6399 $t05.58794173@n ews-text.cableinet. net...[color=blue]
    > Hi,
    >
    > I am trying to concoct a query that will join rows on the basis of a
    > whitespace insensitive comparison. For instance if one row has the value
    > 'a<space>b' and another has the value 'a<space><space >b' I want them to be
    > considered equal (but not the same as 'ab')
    >
    > I am happy to do some T-SQL if that helps. Also I have a full-text index[/color]
    on[color=blue]
    > the column in question, but note that I am comparing rows against each
    > other, not to a fixed string.
    >
    > This is for a one-off job, so if there is no obvious way to do it on the
    > database, I will just bcp out the data and knock up some perl or something
    > to do it. All other things being equal I would rather do it on the[/color]
    database[color=blue]
    > though.
    >
    > Many thanks
    >
    > Andy
    >
    >[/color]

    If you want to it in the database, and assuming that you have SQL2000, then
    one possibility is to write a user-defined function to iterate over a string
    and reduce all whitespace characters to a single space, using the standard
    string functions. If you only have spaces (ie no tabs or newlines), then
    even some nested REPLACE() functions might work.

    Another way would be to write a function which instantiates the Regex COM
    object using the sp_OA% procedures, and does the same thing with a regular
    expression. But both these approaches would be very slow (unless perhaps you
    used your UDF as a computed column and indexed it), and the second one
    requires sysadmin permissions.

    I suspect, though, that your BCP and Perl approach would probably be the
    quickest and easiest solution.

    Simon


    Comment

    • Gert-Jan Strik

      #3
      Re: performing a whitespace-insensitive query

      SELECT *
      FROM A
      INNER JOIN B
      ON Replace(
      Replace(
      Replace(
      Replace(
      Replace(
      Replace(A.col, replicate('X',1 32),'X')
      , replicate('X', 21),'X')
      , replicate('X', 6),'X')
      , replicate('X', 3),'X')
      , replicate('X', 2),'X')
      , replicate('X', 2),'X')
      = Replace(
      Replace(
      Replace(
      Replace(
      Replace(
      Replace(B.col, replicate('X',1 32),'X')
      , replicate('X', 21),'X')
      , replicate('X', 6),'X')
      , replicate('X', 3),'X')
      , replicate('X', 2),'X')
      , replicate('X', 2),'X')


      Hope this helps,
      Gert-Jan

      Andy Fish wrote:[color=blue]
      >
      > Hi,
      >
      > I am trying to concoct a query that will join rows on the basis of a
      > whitespace insensitive comparison. For instance if one row has the value
      > 'a<space>b' and another has the value 'a<space><space >b' I want them to be
      > considered equal (but not the same as 'ab')
      >
      > I am happy to do some T-SQL if that helps. Also I have a full-text index on
      > the column in question, but note that I am comparing rows against each
      > other, not to a fixed string.
      >
      > This is for a one-off job, so if there is no obvious way to do it on the
      > database, I will just bcp out the data and knock up some perl or something
      > to do it. All other things being equal I would rather do it on the database
      > though.
      >
      > Many thanks
      >
      > Andy[/color]

      Comment

      • Gert-Jan Strik

        #4
        Re: performing a whitespace-insensitive query

        Hmmm, again I posted to fast. Replace all X with space in the solution
        below and it should work...

        Gert-Jan


        Gert-Jan Strik wrote:[color=blue]
        >
        > SELECT *
        > FROM A
        > INNER JOIN B
        > ON Replace(
        > Replace(
        > Replace(
        > Replace(
        > Replace(
        > Replace(A.col, replicate('X',1 32),'X')
        > , replicate('X', 21),'X')
        > , replicate('X', 6),'X')
        > , replicate('X', 3),'X')
        > , replicate('X', 2),'X')
        > , replicate('X', 2),'X')
        > = Replace(
        > Replace(
        > Replace(
        > Replace(
        > Replace(
        > Replace(B.col, replicate('X',1 32),'X')
        > , replicate('X', 21),'X')
        > , replicate('X', 6),'X')
        > , replicate('X', 3),'X')
        > , replicate('X', 2),'X')
        > , replicate('X', 2),'X')
        >
        > Hope this helps,
        > Gert-Jan
        >
        > Andy Fish wrote:[color=green]
        > >
        > > Hi,
        > >
        > > I am trying to concoct a query that will join rows on the basis of a
        > > whitespace insensitive comparison. For instance if one row has the value
        > > 'a<space>b' and another has the value 'a<space><space >b' I want them to be
        > > considered equal (but not the same as 'ab')
        > >
        > > I am happy to do some T-SQL if that helps. Also I have a full-text index on
        > > the column in question, but note that I am comparing rows against each
        > > other, not to a fixed string.
        > >
        > > This is for a one-off job, so if there is no obvious way to do it on the
        > > database, I will just bcp out the data and knock up some perl or something
        > > to do it. All other things being equal I would rather do it on the database
        > > though.
        > >
        > > Many thanks
        > >
        > > Andy[/color][/color]

        Comment

        • Andy Fish

          #5
          Re: performing a whitespace-insensitive query

          Thanks Gert-Jan, this definitely looks like the right way to do it.

          unfortunately I neglected to mention that I have 3/4 million rows in the
          table. I left the query running overnight for 9 hours (it used 100%cpu on my
          athlon 1.5 ghz) but I had to cancel it. However, it did return 22,500
          correct results (I have no idea what proportion this represents).

          I think it must have thrown the query optimization out because I can't
          beleive it really takes that long. I think I will create temporary columns
          with the whitespace stripped (using your function) and then join on those
          instead.

          after seeing your post I was fascinated by the magic numbers used. I found a
          previous thread from 2002 where there was a lot of discussion and you said
          that they were an educated guess but verified empirically. The problem of
          finding the optimal numbers has lodged itself into my brain now and won't
          let go, so I will let you know if I get anywhere with it.

          Andy

          "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
          news:402153E9.B F779772@toomuch spamalready.nl. ..[color=blue]
          > Hmmm, again I posted to fast. Replace all X with space in the solution
          > below and it should work...
          >
          > Gert-Jan
          >
          >
          > Gert-Jan Strik wrote:[color=green]
          > >
          > > SELECT *
          > > FROM A
          > > INNER JOIN B
          > > ON Replace(
          > > Replace(
          > > Replace(
          > > Replace(
          > > Replace(
          > > Replace(A.col, replicate('X',1 32),'X')
          > > , replicate('X', 21),'X')
          > > , replicate('X', 6),'X')
          > > , replicate('X', 3),'X')
          > > , replicate('X', 2),'X')
          > > , replicate('X', 2),'X')
          > > = Replace(
          > > Replace(
          > > Replace(
          > > Replace(
          > > Replace(
          > > Replace(B.col, replicate('X',1 32),'X')
          > > , replicate('X', 21),'X')
          > > , replicate('X', 6),'X')
          > > , replicate('X', 3),'X')
          > > , replicate('X', 2),'X')
          > > , replicate('X', 2),'X')
          > >
          > > Hope this helps,
          > > Gert-Jan
          > >
          > > Andy Fish wrote:[color=darkred]
          > > >
          > > > Hi,
          > > >
          > > > I am trying to concoct a query that will join rows on the basis of a
          > > > whitespace insensitive comparison. For instance if one row has the[/color][/color][/color]
          value[color=blue][color=green][color=darkred]
          > > > 'a<space>b' and another has the value 'a<space><space >b' I want them[/color][/color][/color]
          to be[color=blue][color=green][color=darkred]
          > > > considered equal (but not the same as 'ab')
          > > >
          > > > I am happy to do some T-SQL if that helps. Also I have a full-text[/color][/color][/color]
          index on[color=blue][color=green][color=darkred]
          > > > the column in question, but note that I am comparing rows against each
          > > > other, not to a fixed string.
          > > >
          > > > This is for a one-off job, so if there is no obvious way to do it on[/color][/color][/color]
          the[color=blue][color=green][color=darkred]
          > > > database, I will just bcp out the data and knock up some perl or[/color][/color][/color]
          something[color=blue][color=green][color=darkred]
          > > > to do it. All other things being equal I would rather do it on the[/color][/color][/color]
          database[color=blue][color=green][color=darkred]
          > > > though.
          > > >
          > > > Many thanks
          > > >
          > > > Andy[/color][/color][/color]


          Comment

          • Andy Fish

            #6
            Re: performing a whitespace-insensitive query

            OK, I think I have figured out the algorithm to produce the magic numbers in
            the formula.

            say 10 is a magic number (i.e. it's one of the ones in the list of replicate
            functions), then doing a replace of 10 spaces with 1 will reduce anything up
            to 108 spaces down to 18 or less. obviously it will also reduce anything up
            to 109 spaces with 19 or less, but it just naturally seems that the 108->18
            is giving us the best "leverage" from the number 10.

            so in general, putting 'n' in the formula reduces n(n+1)-2 spaces to 2(n-1)

            In the above example the next term above 10 would need to reduce to X spaces
            to 108 spaces or less

            solving 2(n-1)=108 gives us n=55, so 55 would reduce anything up to 3078
            spaces down to 108 or less

            in the general case, say the term after n is m, then

            n(n+1)-2 = 2(m-1)

            or

            m = 1/2 (n)(n+1)

            this generates the sequence 2, 3, 6, 21, 231, 26796, 359026206,
            644499084768903 00, 207689535133977 000000000000000 0000. I'm sure there is
            some rounding in the last couple of terms (I only used excel).

            Anyone with high school math will notice that the formula for m is the sum
            of all numbers adding up to m, which does make sense although I'm not sure I
            could prove why. one analagy is that if you took a triangle of billiard
            balls (start with 3) then spread them out so that these form the bottom row
            of a bigger next triangle, then spread all those out to form the bottom row
            of the next triangle and so on.

            Anyway I hope that's of interest to someone. Oh who am I kidding, it was
            such a buzz figuring it out I don't care if anyone ever reads this.

            Andy


            "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
            news:402153E9.B F779772@toomuch spamalready.nl. ..[color=blue]
            > Hmmm, again I posted to fast. Replace all X with space in the solution
            > below and it should work...
            >
            > Gert-Jan
            >
            >
            > Gert-Jan Strik wrote:[color=green]
            > >
            > > SELECT *
            > > FROM A
            > > INNER JOIN B
            > > ON Replace(
            > > Replace(
            > > Replace(
            > > Replace(
            > > Replace(
            > > Replace(A.col, replicate('X',1 32),'X')
            > > , replicate('X', 21),'X')
            > > , replicate('X', 6),'X')
            > > , replicate('X', 3),'X')
            > > , replicate('X', 2),'X')
            > > , replicate('X', 2),'X')
            > > = Replace(
            > > Replace(
            > > Replace(
            > > Replace(
            > > Replace(
            > > Replace(B.col, replicate('X',1 32),'X')
            > > , replicate('X', 21),'X')
            > > , replicate('X', 6),'X')
            > > , replicate('X', 3),'X')
            > > , replicate('X', 2),'X')
            > > , replicate('X', 2),'X')
            > >
            > > Hope this helps,
            > > Gert-Jan
            > >
            > > Andy Fish wrote:[color=darkred]
            > > >
            > > > Hi,
            > > >
            > > > I am trying to concoct a query that will join rows on the basis of a
            > > > whitespace insensitive comparison. For instance if one row has the[/color][/color][/color]
            value[color=blue][color=green][color=darkred]
            > > > 'a<space>b' and another has the value 'a<space><space >b' I want them[/color][/color][/color]
            to be[color=blue][color=green][color=darkred]
            > > > considered equal (but not the same as 'ab')
            > > >
            > > > I am happy to do some T-SQL if that helps. Also I have a full-text[/color][/color][/color]
            index on[color=blue][color=green][color=darkred]
            > > > the column in question, but note that I am comparing rows against each
            > > > other, not to a fixed string.
            > > >
            > > > This is for a one-off job, so if there is no obvious way to do it on[/color][/color][/color]
            the[color=blue][color=green][color=darkred]
            > > > database, I will just bcp out the data and knock up some perl or[/color][/color][/color]
            something[color=blue][color=green][color=darkred]
            > > > to do it. All other things being equal I would rather do it on the[/color][/color][/color]
            database[color=blue][color=green][color=darkred]
            > > > though.
            > > >
            > > > Many thanks
            > > >
            > > > Andy[/color][/color][/color]


            Comment

            • Delbert Glass

              #7
              Re: performing a whitespace-insensitive query

              Consider: literals.

              Here are some solutions
              with maximium literal length of 19:
              [19, 18, [6,7], 3, 2, 2]
              [19, 17, [6,7], 4, 3, 2]
              [19, 18, [6,7], 4, 3, 2]

              Bye,
              Delbert Glass


              Comment

              • Gert-Jan Strik

                #8
                Re: performing a whitespace-insensitive query

                In case you are interested, I just gave sort of an explanation a few
                days ago. See


                Gert-Jan


                Andy Fish wrote:[color=blue]
                >
                > OK, I think I have figured out the algorithm to produce the magic numbers in
                > the formula.
                >
                > say 10 is a magic number (i.e. it's one of the ones in the list of replicate
                > functions), then doing a replace of 10 spaces with 1 will reduce anything up
                > to 108 spaces down to 18 or less. obviously it will also reduce anything up
                > to 109 spaces with 19 or less, but it just naturally seems that the 108->18
                > is giving us the best "leverage" from the number 10.
                >
                > so in general, putting 'n' in the formula reduces n(n+1)-2 spaces to 2(n-1)
                >
                > In the above example the next term above 10 would need to reduce to X spaces
                > to 108 spaces or less
                >
                > solving 2(n-1)=108 gives us n=55, so 55 would reduce anything up to 3078
                > spaces down to 108 or less
                >
                > in the general case, say the term after n is m, then
                >
                > n(n+1)-2 = 2(m-1)
                >
                > or
                >
                > m = 1/2 (n)(n+1)
                >
                > this generates the sequence 2, 3, 6, 21, 231, 26796, 359026206,
                > 644499084768903 00, 207689535133977 000000000000000 0000. I'm sure there is
                > some rounding in the last couple of terms (I only used excel).
                >
                > Anyone with high school math will notice that the formula for m is the sum
                > of all numbers adding up to m, which does make sense although I'm not sure I
                > could prove why. one analagy is that if you took a triangle of billiard
                > balls (start with 3) then spread them out so that these form the bottom row
                > of a bigger next triangle, then spread all those out to form the bottom row
                > of the next triangle and so on.
                >
                > Anyway I hope that's of interest to someone. Oh who am I kidding, it was
                > such a buzz figuring it out I don't care if anyone ever reads this.
                >
                > Andy
                >
                > "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
                > news:402153E9.B F779772@toomuch spamalready.nl. ..[color=green]
                > > Hmmm, again I posted to fast. Replace all X with space in the solution
                > > below and it should work...
                > >
                > > Gert-Jan
                > >
                > >
                > > Gert-Jan Strik wrote:[color=darkred]
                > > >
                > > > SELECT *
                > > > FROM A
                > > > INNER JOIN B
                > > > ON Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(A.col, replicate('X',1 32),'X')
                > > > , replicate('X', 21),'X')
                > > > , replicate('X', 6),'X')
                > > > , replicate('X', 3),'X')
                > > > , replicate('X', 2),'X')
                > > > , replicate('X', 2),'X')
                > > > = Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(
                > > > Replace(B.col, replicate('X',1 32),'X')
                > > > , replicate('X', 21),'X')
                > > > , replicate('X', 6),'X')
                > > > , replicate('X', 3),'X')
                > > > , replicate('X', 2),'X')
                > > > , replicate('X', 2),'X')
                > > >
                > > > Hope this helps,
                > > > Gert-Jan
                > > >
                > > > Andy Fish wrote:
                > > > >
                > > > > Hi,
                > > > >
                > > > > I am trying to concoct a query that will join rows on the basis of a
                > > > > whitespace insensitive comparison. For instance if one row has the[/color][/color]
                > value[color=green][color=darkred]
                > > > > 'a<space>b' and another has the value 'a<space><space >b' I want them[/color][/color]
                > to be[color=green][color=darkred]
                > > > > considered equal (but not the same as 'ab')
                > > > >
                > > > > I am happy to do some T-SQL if that helps. Also I have a full-text[/color][/color]
                > index on[color=green][color=darkred]
                > > > > the column in question, but note that I am comparing rows against each
                > > > > other, not to a fixed string.
                > > > >
                > > > > This is for a one-off job, so if there is no obvious way to do it on[/color][/color]
                > the[color=green][color=darkred]
                > > > > database, I will just bcp out the data and knock up some perl or[/color][/color]
                > something[color=green][color=darkred]
                > > > > to do it. All other things being equal I would rather do it on the[/color][/color]
                > database[color=green][color=darkred]
                > > > > though.
                > > > >
                > > > > Many thanks
                > > > >
                > > > > Andy[/color][/color][/color]

                Comment

                • Delbert Glass

                  #9
                  Re: performing a whitespace-insensitive query

                  Here are (the?) 24 solutions
                  with fewer then 7 stages
                  and lengths less than 20.

                  Solution TotalLength
                  -------------------- -----------
                  [19, 17, 6, 3, 2, 2] 49
                  [19, 17, 6, 3, 3, 2] 50
                  [19, 17, 6, 4, 2, 2] 50
                  [19, 17, 6, 4, 3, 2] 51
                  [19, 17, 7, 3, 2, 2] 50
                  [19, 17, 7, 3, 3, 2] 51
                  [19, 17, 7, 4, 2, 2] 51
                  [19, 17, 7, 4, 3, 2] 52
                  [19, 18, 6, 3, 2, 2] 50
                  [19, 18, 6, 3, 3, 2] 51
                  [19, 18, 6, 4, 2, 2] 51
                  [19, 18, 6, 4, 3, 2] 52
                  [19, 18, 7, 3, 2, 2] 51
                  [19, 18, 7, 3, 3, 2] 52
                  [19, 18, 7, 4, 2, 2] 52
                  [19, 18, 7, 4, 3, 2] 53
                  [19, 19, 6, 3, 2, 2] 51
                  [19, 19, 6, 3, 3, 2] 52
                  [19, 19, 6, 4, 2, 2] 52
                  [19, 19, 6, 4, 3, 2] 53
                  [19, 19, 7, 3, 2, 2] 52
                  [19, 19, 7, 3, 3, 2] 53
                  [19, 19, 7, 4, 2, 2] 53
                  [19, 19, 7, 4, 3, 2] 54

                  Bye,
                  Delbert Glass


                  Comment

                  • Gert-Jan Strik

                    #10
                    Re: performing a whitespace-insensitive query

                    I have run just a few tests with solution (1) [19, 17, 6, 3, 2, 2] and
                    (2) [132, 21, 6, 3, 2, 2] on SQL7.0, and the second solution uses
                    significantly less CPU.

                    On a testset with values 'a'+space(2)+'a ' .. 'a'+space(7998) +'a', the
                    results were:
                    solution 1: approx. 3780 ms
                    solution 2: approx. 2750 ms

                    On a testset with more small values and fewer large values, the results
                    show similar differences: 4280 vs 3110 ms and 953 vs 719 ms

                    Gert-Jan


                    Delbert Glass wrote:[color=blue]
                    >
                    > Here are (the?) 24 solutions
                    > with fewer then 7 stages
                    > and lengths less than 20.
                    >
                    > Solution TotalLength
                    > -------------------- -----------
                    > [19, 17, 6, 3, 2, 2] 49
                    > [19, 17, 6, 3, 3, 2] 50
                    > [19, 17, 6, 4, 2, 2] 50
                    > [19, 17, 6, 4, 3, 2] 51
                    > [19, 17, 7, 3, 2, 2] 50
                    > [19, 17, 7, 3, 3, 2] 51
                    > [19, 17, 7, 4, 2, 2] 51
                    > [19, 17, 7, 4, 3, 2] 52
                    > [19, 18, 6, 3, 2, 2] 50
                    > [19, 18, 6, 3, 3, 2] 51
                    > [19, 18, 6, 4, 2, 2] 51
                    > [19, 18, 6, 4, 3, 2] 52
                    > [19, 18, 7, 3, 2, 2] 51
                    > [19, 18, 7, 3, 3, 2] 52
                    > [19, 18, 7, 4, 2, 2] 52
                    > [19, 18, 7, 4, 3, 2] 53
                    > [19, 19, 6, 3, 2, 2] 51
                    > [19, 19, 6, 3, 3, 2] 52
                    > [19, 19, 6, 4, 2, 2] 52
                    > [19, 19, 6, 4, 3, 2] 53
                    > [19, 19, 7, 3, 2, 2] 52
                    > [19, 19, 7, 3, 3, 2] 53
                    > [19, 19, 7, 4, 2, 2] 53
                    > [19, 19, 7, 4, 3, 2] 54
                    >
                    > Bye,
                    > Delbert Glass[/color]

                    Comment

                    • Delbert Glass

                      #11
                      Re: performing a whitespace-insensitive query

                      I was looking for the earliest solution (with values in decreasing sequence)
                      not the fastest (earliest finishing) solution.
                      I think what's fastest is going to depend on what the
                      distribution of the data is and how REPLACE is implemented.

                      Here are some solutions:
                      [[77..104], 9, 4, 2, 2, 2]
                      with a bunch of two-s.
                      I'm thinking [104, 9, 4, 2, 2, 2]
                      might be good to use on data that have lots
                      of occurances of two space runs.
                      (I suspect how REPLACE is implemented
                      will have a significant impact how efficiently
                      these solutions are executed.)

                      Here is a solution:
                      [445, 22, 7, 4, 3, 2]
                      in which I upsized the numbers
                      but made no attempt to find the absolute largest,
                      most largest, or etc.

                      As you surely already know,
                      larger values help you skip along;
                      while, smaller values help you shorten
                      the string so there is not so much to
                      dig through (and copy, etc) in later passes.
                      I think, in general, the last two solutions
                      mentioned above [104, 9, 4, 2, 2, 2]
                      and [445, 22, 7, 4, 3, 2]
                      will run much closer to the speed of
                      your solution [132, 21, 6, 3, 2, 2]
                      then the solution [ 19, 17, 6, 3, 2, 2] does
                      even though they have fewer values in common.

                      My guess is the way you picked your values
                      makes a reasonable trade off between
                      wanting the values to be large and
                      wanting the values to be small
                      and thus makes it a reasonable general solution.

                      ---

                      In some solutions, the values are not in decreasing order.
                      Such solutions should not be overlooked.

                      ---

                      Don't forget there might be 8000 "spaces"
                      that need to get reduce to one "space".

                      Bye,
                      Delbert Glass


                      "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
                      news:4023EEDE.D 13DFD21@toomuch spamalready.nl. ..[color=blue]
                      > I have run just a few tests with solution (1) [19, 17, 6, 3, 2, 2] and
                      > (2) [132, 21, 6, 3, 2, 2] on SQL7.0, and the second solution uses
                      > significantly less CPU.
                      >
                      > On a testset with values 'a'+space(2)+'a ' .. 'a'+space(7998) +'a', the
                      > results were:
                      > solution 1: approx. 3780 ms
                      > solution 2: approx. 2750 ms
                      >
                      > On a testset with more small values and fewer large values, the results
                      > show similar differences: 4280 vs 3110 ms and 953 vs 719 ms
                      >
                      > Gert-Jan
                      >
                      >
                      > Delbert Glass wrote:[color=green]
                      > >
                      > > Here are (the?) 24 solutions
                      > > with fewer then 7 stages
                      > > and lengths less than 20.
                      > >
                      > > Solution TotalLength
                      > > -------------------- -----------
                      > > [19, 17, 6, 3, 2, 2] 49
                      > > [19, 17, 6, 3, 3, 2] 50
                      > > [19, 17, 6, 4, 2, 2] 50
                      > > [19, 17, 6, 4, 3, 2] 51
                      > > [19, 17, 7, 3, 2, 2] 50
                      > > [19, 17, 7, 3, 3, 2] 51
                      > > [19, 17, 7, 4, 2, 2] 51
                      > > [19, 17, 7, 4, 3, 2] 52
                      > > [19, 18, 6, 3, 2, 2] 50
                      > > [19, 18, 6, 3, 3, 2] 51
                      > > [19, 18, 6, 4, 2, 2] 51
                      > > [19, 18, 6, 4, 3, 2] 52
                      > > [19, 18, 7, 3, 2, 2] 51
                      > > [19, 18, 7, 3, 3, 2] 52
                      > > [19, 18, 7, 4, 2, 2] 52
                      > > [19, 18, 7, 4, 3, 2] 53
                      > > [19, 19, 6, 3, 2, 2] 51
                      > > [19, 19, 6, 3, 3, 2] 52
                      > > [19, 19, 6, 4, 2, 2] 52
                      > > [19, 19, 6, 4, 3, 2] 53
                      > > [19, 19, 7, 3, 2, 2] 52
                      > > [19, 19, 7, 3, 3, 2] 53
                      > > [19, 19, 7, 4, 2, 2] 53
                      > > [19, 19, 7, 4, 3, 2] 54
                      > >
                      > > Bye,
                      > > Delbert Glass[/color][/color]


                      Comment

                      • Andy Fish

                        #12
                        Re: performing a whitespace-insensitive query

                        sorry for a late reply but I have been away

                        thanks for pointing me at this post - I guess it hadn't reached google's
                        server when I did the search.

                        I especially liked Robert Carnegie's approach. Unfortunately I don't have
                        the time to benchmark the possible options but using robert's approach I did
                        this

                        1. create a second column on the table with the duplicate spaces removed (30
                        secs)
                        2. created an index on this new column (30 secs)
                        3. do a join to find all rows where the original value is different but
                        space-insensitive value is the same (15 secs)

                        as for the proper solution with the stacked replaces, I think my numbers
                        give the greatest amount of squashing power for the least number of replace
                        functions, even though they might use more CPU.

                        Andy


                        "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
                        news:4022C3CE.5 94D1E89@toomuch spamalready.nl. ..[color=blue]
                        > In case you are interested, I just gave sort of an explanation a few
                        > days ago. See
                        >[/color]
                        http://groups.google.com/groups?hl=e...spamalready.nl[color=blue]
                        >
                        > Gert-Jan
                        >
                        >
                        > Andy Fish wrote:[color=green]
                        > >
                        > > OK, I think I have figured out the algorithm to produce the magic[/color][/color]
                        numbers in[color=blue][color=green]
                        > > the formula.
                        > >
                        > > say 10 is a magic number (i.e. it's one of the ones in the list of[/color][/color]
                        replicate[color=blue][color=green]
                        > > functions), then doing a replace of 10 spaces with 1 will reduce[/color][/color]
                        anything up[color=blue][color=green]
                        > > to 108 spaces down to 18 or less. obviously it will also reduce anything[/color][/color]
                        up[color=blue][color=green]
                        > > to 109 spaces with 19 or less, but it just naturally seems that the[/color][/color]
                        108->18[color=blue][color=green]
                        > > is giving us the best "leverage" from the number 10.
                        > >
                        > > so in general, putting 'n' in the formula reduces n(n+1)-2 spaces to[/color][/color]
                        2(n-1)[color=blue][color=green]
                        > >
                        > > In the above example the next term above 10 would need to reduce to X[/color][/color]
                        spaces[color=blue][color=green]
                        > > to 108 spaces or less
                        > >
                        > > solving 2(n-1)=108 gives us n=55, so 55 would reduce anything up to 3078
                        > > spaces down to 108 or less
                        > >
                        > > in the general case, say the term after n is m, then
                        > >
                        > > n(n+1)-2 = 2(m-1)
                        > >
                        > > or
                        > >
                        > > m = 1/2 (n)(n+1)
                        > >
                        > > this generates the sequence 2, 3, 6, 21, 231, 26796, 359026206,
                        > > 644499084768903 00, 207689535133977 000000000000000 0000. I'm sure there is
                        > > some rounding in the last couple of terms (I only used excel).
                        > >
                        > > Anyone with high school math will notice that the formula for m is the[/color][/color]
                        sum[color=blue][color=green]
                        > > of all numbers adding up to m, which does make sense although I'm not[/color][/color]
                        sure I[color=blue][color=green]
                        > > could prove why. one analagy is that if you took a triangle of billiard
                        > > balls (start with 3) then spread them out so that these form the bottom[/color][/color]
                        row[color=blue][color=green]
                        > > of a bigger next triangle, then spread all those out to form the bottom[/color][/color]
                        row[color=blue][color=green]
                        > > of the next triangle and so on.
                        > >
                        > > Anyway I hope that's of interest to someone. Oh who am I kidding, it was
                        > > such a buzz figuring it out I don't care if anyone ever reads this.
                        > >
                        > > Andy
                        > >
                        > > "Gert-Jan Strik" <sorry@toomuchs pamalready.nl> wrote in message
                        > > news:402153E9.B F779772@toomuch spamalready.nl. ..[color=darkred]
                        > > > Hmmm, again I posted to fast. Replace all X with space in the solution
                        > > > below and it should work...
                        > > >
                        > > > Gert-Jan
                        > > >
                        > > >
                        > > > Gert-Jan Strik wrote:
                        > > > >
                        > > > > SELECT *
                        > > > > FROM A
                        > > > > INNER JOIN B
                        > > > > ON Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(A.col, replicate('X',1 32),'X')
                        > > > > , replicate('X', 21),'X')
                        > > > > , replicate('X', 6),'X')
                        > > > > , replicate('X', 3),'X')
                        > > > > , replicate('X', 2),'X')
                        > > > > , replicate('X', 2),'X')
                        > > > > = Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(
                        > > > > Replace(B.col, replicate('X',1 32),'X')
                        > > > > , replicate('X', 21),'X')
                        > > > > , replicate('X', 6),'X')
                        > > > > , replicate('X', 3),'X')
                        > > > > , replicate('X', 2),'X')
                        > > > > , replicate('X', 2),'X')
                        > > > >
                        > > > > Hope this helps,
                        > > > > Gert-Jan
                        > > > >
                        > > > > Andy Fish wrote:
                        > > > > >
                        > > > > > Hi,
                        > > > > >
                        > > > > > I am trying to concoct a query that will join rows on the basis of[/color][/color][/color]
                        a[color=blue][color=green][color=darkred]
                        > > > > > whitespace insensitive comparison. For instance if one row has the[/color]
                        > > value[color=darkred]
                        > > > > > 'a<space>b' and another has the value 'a<space><space >b' I want[/color][/color][/color]
                        them[color=blue][color=green]
                        > > to be[color=darkred]
                        > > > > > considered equal (but not the same as 'ab')
                        > > > > >
                        > > > > > I am happy to do some T-SQL if that helps. Also I have a full-text[/color]
                        > > index on[color=darkred]
                        > > > > > the column in question, but note that I am comparing rows against[/color][/color][/color]
                        each[color=blue][color=green][color=darkred]
                        > > > > > other, not to a fixed string.
                        > > > > >
                        > > > > > This is for a one-off job, so if there is no obvious way to do it[/color][/color][/color]
                        on[color=blue][color=green]
                        > > the[color=darkred]
                        > > > > > database, I will just bcp out the data and knock up some perl or[/color]
                        > > something[color=darkred]
                        > > > > > to do it. All other things being equal I would rather do it on the[/color]
                        > > database[color=darkred]
                        > > > > > though.
                        > > > > >
                        > > > > > Many thanks
                        > > > > >
                        > > > > > Andy[/color][/color][/color]


                        Comment

                        Working...