How to find transposed data and near misses

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • JJA

    How to find transposed data and near misses

    I would like some advice on a data and query problem I face. I have a
    data table with a "raw key" value which is not guaranteed to be valid
    at its source. Normally, this value will be 9 numeric digits and map to
    a "names" table where the entity is given assigned an "official name".

    My problem is that I'd like to be able to identify data values that are
    "close" to being "correct". For example, in the case of a
    nine digit number such as 077467881, I'd like to be able to identify
    rows with values close to this raw string. That is, if
    there were a row with a value for this column that was "off" by say, a
    transposed single digit (such as 077647881 in this example)
    I would like to find a query to locate the "close candidates" in a
    result set. If I can find rows having a raw key
    value that is close to a "good key" then I can allow my user to use
    other criteria to possibly assign the "close key" as
    an alternate or alias of the official key. Here is part of my schema:

    CREATE TABLE MYData (
    StateCD char (2) NOT NULL ,
    CountyCD char (3) NOT NULL ,
    MYID int NULL ,
    RawNumString varchar(9) NULL ,
    SaleMnYear datetime NOT NULL ,
    NumberWidgets int NOT NULL ,
    )

    CREATE TABLE MYNames (
    MYID int IDENTITY (1, 1) NOT NULL ,
    OfficialName varchar (70) NOT NULL ,
    CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
    (
    MYID
    )
    )
    CREATE TABLE MYAltID (
    RawNumString varchar (9) NOT NULL ,
    MYID int NOT NULL ,
    CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
    (
    RawNumString
    ) ,
    CONSTRAINT FK_HasName FOREIGN KEY
    (
    MYID
    ) REFERENCES MYNames (
    MYID
    )
    )
    So, how to generalize something like:
    SELECT * FROM MYData WHERE RawNumString = '077467881'
    OR RawNumString = '077647881'

  • JT

    #2
    Re: How to find transposed data and near misses

    For what it's worth...

    The LIKE operator can perform several forms of wildcard comparisons against
    2 strings. For example:

    if '90120' like '9_120' print 'Yes' else print 'No'
    if '90120' like '9012[0..9]' print 'Yes' else print 'No'
    if '90120' like '*0120' print 'Yes' else print 'No'

    Yes
    Yes
    Yes

    The SoundEx function returns a checksum for a character string, but not
    numbers. It basically disregards vowels and double letters and returns a 4
    char result. For example:

    print soundex('Robert ')
    print soundex('Robert o')
    print soundex('Rabert ie')
    print soundex('Rabbit ')
    print soundex('Rob')

    R163
    R163
    R163
    R130
    R100

    These can be included in a where clause. For example:
    SELECT * FROM MYData WHERE RawNumString like '*7746*'
    SELECT * FROM MYData WHERE SoundEx(RawName ) = SoundEx('France sco')

    Keep in mind that performing like or soundex comparisons do not take
    advantage of indexes, so performance could be a problem on a large table.


    "JJA" <johna@cbmiweb. com> wrote in message
    news:1117641848 .807351.148700@ g47g2000cwa.goo glegroups.com.. .[color=blue]
    > I would like some advice on a data and query problem I face. I have a
    > data table with a "raw key" value which is not guaranteed to be valid
    > at its source. Normally, this value will be 9 numeric digits and map to
    > a "names" table where the entity is given assigned an "official name".
    >
    > My problem is that I'd like to be able to identify data values that are
    > "close" to being "correct". For example, in the case of a
    > nine digit number such as 077467881, I'd like to be able to identify
    > rows with values close to this raw string. That is, if
    > there were a row with a value for this column that was "off" by say, a
    > transposed single digit (such as 077647881 in this example)
    > I would like to find a query to locate the "close candidates" in a
    > result set. If I can find rows having a raw key
    > value that is close to a "good key" then I can allow my user to use
    > other criteria to possibly assign the "close key" as
    > an alternate or alias of the official key. Here is part of my schema:
    >
    > CREATE TABLE MYData (
    > StateCD char (2) NOT NULL ,
    > CountyCD char (3) NOT NULL ,
    > MYID int NULL ,
    > RawNumString varchar(9) NULL ,
    > SaleMnYear datetime NOT NULL ,
    > NumberWidgets int NOT NULL ,
    > )
    >
    > CREATE TABLE MYNames (
    > MYID int IDENTITY (1, 1) NOT NULL ,
    > OfficialName varchar (70) NOT NULL ,
    > CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
    > (
    > MYID
    > )
    > )
    > CREATE TABLE MYAltID (
    > RawNumString varchar (9) NOT NULL ,
    > MYID int NOT NULL ,
    > CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
    > (
    > RawNumString
    > ) ,
    > CONSTRAINT FK_HasName FOREIGN KEY
    > (
    > MYID
    > ) REFERENCES MYNames (
    > MYID
    > )
    > )
    > So, how to generalize something like:
    > SELECT * FROM MYData WHERE RawNumString = '077467881'
    > OR RawNumString = '077647881'
    >[/color]


    Comment

    • Erland Sommarskog

      #3
      Re: How to find transposed data and near misses

      [posted and mailed, please reply in ews]

      JJA (johna@cbmiweb. com) writes:[color=blue]
      > I would like some advice on a data and query problem I face. I have a
      > data table with a "raw key" value which is not guaranteed to be valid
      > at its source. Normally, this value will be 9 numeric digits and map to
      > a "names" table where the entity is given assigned an "official name".
      >
      > My problem is that I'd like to be able to identify data values that are
      > "close" to being "correct". For example, in the case of a
      > nine digit number such as 077467881, I'd like to be able to identify
      > rows with values close to this raw string. That is, if
      > there were a row with a value for this column that was "off" by say, a
      > transposed single digit (such as 077647881 in this example)
      > I would like to find a query to locate the "close candidates" in a
      > result set. If I can find rows having a raw key
      > value that is close to a "good key" then I can allow my user to use
      > other criteria to possibly assign the "close key" as
      > an alternate or alias of the official key. Here is part of my schema:[/color]

      Fuzzy logic is not for the faint of heart, and it's definitely not my
      area of expertise.

      Assuming that you always have nine digits, one approach is compare
      character by character and if 7 or more match, count this as a possible
      match:

      SELECT *
      FROM tbl
      WHERE CASE WHEN substring(col, 1, 1) = substring(@val, 1, 1)
      THEN 1 ELSE 0
      END +
      CASE WHEN substring(col, 2, 1) = substring(@val, 2, 1)
      THEN 1 ELSE 0
      END +
      ...
      CASE WHEN substring(col, 9, 1) = substring(@val, 9, 1)
      THEN 1 ELSE 0
      END >= 7


      --
      Erland Sommarskog, SQL Server MVP, esquel@sommarsk og.se

      Books Online for SQL Server SP3 at
      Get the flexibility you need to use integrated solutions, apps, and innovations in technology with your data, wherever it lives—in the cloud, on-premises, or at the edge.

      Comment

      • --CELKO--

        #4
        Re: How to find transposed data and near misses

        Have you ever worked with check digits before? They can prevent errors
        in data entry instead of trying to patch them after the fact. The idea
        of keeping an invalid key does not sound like a good design.

        Comment

        • JJA

          #5
          Re: How to find transposed data and near misses

          Yes, I know this is not good design but we are getting a raw data file
          from another organization and we have no control over their practices.
          Most occurrences of this number are "valid" but it is clear from
          looking at the data that there is no validation at the source. The
          nature of the data is such that if we can identify 7 or 8 bytes of data
          as being the same as another 9 byte and valid "key", we could assume
          the key could be improved to point at the same 9 byte valid entity. So,
          I thought I'd run this notion past the world of experts for some ideas.

          Comment

          • JJA

            #6
            Re: How to find transposed data and near misses

            Thanks very much for this neat suggestion. It is exactly what I hoped
            for and I can implement this a stored procedure with a couple of
            parameters. I will provide a little interface where the analyst can
            launch the sproc and see if there are any "near-misses". Very cool
            application of the CASE facility. Thanks again.

            Comment

            • Erland Sommarskog

              #7
              Re: How to find transposed data and near misses

              JJA (johna@cbmiweb. com) writes:[color=blue]
              > Thanks very much for this neat suggestion. It is exactly what I hoped
              > for and I can implement this a stored procedure with a couple of
              > parameters. I will provide a little interface where the analyst can
              > launch the sproc and see if there are any "near-misses". Very cool
              > application of the CASE facility. Thanks again.[/color]

              Glad to hear that the idea was useful to use. Whether it suffices remains
              to see. As I said that fuzzy-logic stuff is horrible.


              --
              Erland Sommarskog, SQL Server MVP, esquel@sommarsk og.se

              Books Online for SQL Server SP3 at
              Get the flexibility you need to use integrated solutions, apps, and innovations in technology with your data, wherever it lives—in the cloud, on-premises, or at the edge.

              Comment

              Working...