How to parse an address string in any language

  • pranyht
    New Member
    • Jul 2010
    • 5


    For example, passing "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" to the parseAddress function returns:

    2299 Lewes-Georgetown Hwy
    A. P. Croll & Son
    Georgetown
    DE
    19947

    I have thought of the following algorithm, but I am having trouble implementing it. Can anyone help me write the code? I know C and a bit of C++, so I would prefer the code in one of those languages.

    My algorithm:

    1) Work backward. Start from the zip code, which will be near the end and in one of two known formats: XXXXX or XXXXX-XXXX. If it doesn't appear, you can assume you're in the city/state portion, below.

    The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format or spelled out in words. You know what these will be, too; there are only 50 of them. You could also Soundex the words to help compensate for spelling errors.
    Before that is the city, and it's probably on the same line as the state.

    You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector.
    The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.

    It's going to be near-impossible to detect a name on the first or second line. Still, if the line doesn't start with a number, or if it's prefixed with "attn:" or "attention to:", that could give you a hint as to whether it's a name or an address line.
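The backward-working steps above could be sketched in C++ roughly as follows (C++17, for std::optional). This is only a sketch under assumptions that are not in the post: a single-line US address, a two-letter state abbreviation, a comma before the city, and a house number marking where the street begins. ParsedAddress, trim and isUsState are hypothetical names made up for illustration; only parseAddress comes from the question.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type; the field names are illustrative only.
struct ParsedAddress {
    std::string name, street, city, state, zip;
};

// Strip surrounding whitespace and stray commas.
static std::string trim(const std::string& s) {
    const char* junk = " \t\r\n,";
    std::size_t b = s.find_first_not_of(junk);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(junk);
    return s.substr(b, e - b + 1);
}

// The 50 state codes plus DC, checked by substring lookup.
static bool isUsState(const std::string& code) {
    static const std::string codes =
        " AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI"
        " MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT"
        " VT VA WA WV WI WY DC ";
    return code.size() == 2 && codes.find(" " + code + " ") != std::string::npos;
}

std::optional<ParsedAddress> parseAddress(const std::string& input) {
    ParsedAddress out;
    std::string rest = trim(input);
    std::smatch m;

    // Step 1: optional ZIP (XXXXX or XXXXX-XXXX) at the very end.
    static const std::regex zipRe(R"(\b(\d{5}(?:-\d{4})?)$)");
    if (std::regex_search(rest, m, zipRe)) {
        out.zip = m[1].str();
        rest = trim(rest.substr(0, m.position(1)));
    }

    // Step 2: a two-letter state code must now be the last token.
    static const std::regex stateRe(R"((?:^|[\s,])([A-Za-z]{2})$)");
    if (!std::regex_search(rest, m, stateRe)) return std::nullopt;
    out.state = m[1].str();
    for (char& c : out.state)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    if (!isUsState(out.state)) return std::nullopt;
    rest = trim(rest.substr(0, m.position(1)));

    // Step 3: the city is whatever follows the last remaining comma.
    std::size_t comma = rest.rfind(',');
    if (comma == std::string::npos) return std::nullopt;
    out.city = trim(rest.substr(comma + 1));
    rest = trim(rest.substr(0, comma));

    // Step 4: the street starts at the first token that begins with a digit;
    // anything before it is treated as the name.
    for (std::size_t i = 0; i < rest.size(); ++i) {
        bool tokenStart = (i == 0) || rest[i - 1] == ' ';
        if (tokenStart && std::isdigit(static_cast<unsigned char>(rest[i]))) {
            out.name = trim(rest.substr(0, i));
            out.street = rest.substr(i);
            return out;
        }
    }
    out.street = rest;  // no house number found, so no name is split off
    return out;
}
```

For the example in the question this yields the name "A. P. Croll & Son", the street "2299 Lewes-Georgetown Hwy", the city "Georgetown", the state "DE" and the zip "19947". Spelled-out state names, PO boxes, suite lines and "attn:" prefixes are not handled here.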

    Any help would be appreciated.
  • weaknessforcats
    Recognized Expert Expert
    • Mar 2007
    • 9214

    #2
    Unless you know the format of the input, it will be difficult to parse it.

    Can you insist on a) specific field widths, b) CSV format, or c) token identifiers?

    Token identifiers are things like NAME=, ADDRESS=, etc.
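If token identifiers are an option, the parsing becomes almost trivial. Here is a minimal C++ sketch, assuming the fields are separated by semicolons and written as KEY=value with no padding. The separator, the key names beyond NAME and ADDRESS, and the function name parseTagged are assumptions for illustration only.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Split "KEY=value;KEY=value;..." into a key -> value map.
// Items without an '=' are skipped rather than treated as errors.
std::map<std::string, std::string> parseTagged(const std::string& input) {
    std::map<std::string, std::string> fields;
    std::istringstream in(input);
    std::string item;
    while (std::getline(in, item, ';')) {
        std::size_t eq = item.find('=');
        if (eq == std::string::npos) continue;  // malformed item
        fields[item.substr(0, eq)] = item.substr(eq + 1);
    }
    return fields;
}
```

Given "NAME=A. P. Croll & Son;ADDRESS=2299 Lewes-Georgetown Hwy;CITY=Georgetown;STATE=DE;ZIP=19947", the map holds five entries, and a lookup such as fields.at("CITY") gives "Georgetown". A value that itself contains a semicolon would need an escaping rule, which this sketch does not define.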


    • donbock
      Recognized Expert Top Contributor
      • Mar 2008
      • 2427

      #3
      Your example has comma separators between street address and city; and between city and state.
      • Can you count on all these commas being present?
      • Are you sure there isn't a separator between name and street address?
      • Is the comma between city and state optional?
      • What other variant formats do you have to support (post office box, rural route, suite number, c/o, department, military APO/FPO, etc)? [and those are just some US variants]
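To see why the comma questions matter, here is a small C++ sketch that splits a line on commas and trims each piece. The function name splitFields is made up for illustration, and the sketch assumes commas never appear inside a field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a line on commas and trim spaces/tabs from each field.
std::vector<std::string> splitFields(const std::string& line) {
    const std::size_t npos = std::string::npos;
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        std::size_t comma = line.find(',', start);
        std::string piece =
            line.substr(start, comma == npos ? npos : comma - start);
        std::size_t b = piece.find_first_not_of(" \t");
        std::size_t e = piece.find_last_not_of(" \t");
        fields.push_back(b == npos ? "" : piece.substr(b, e - b + 1));
        if (comma == npos) break;
        start = comma + 1;
    }
    return fields;
}
```

The example address gives three fields: name plus street, city, and "DE 19947". Drop the comma between city and state and the same call gives only two fields, so the field count alone cannot tell the parser which layout it is looking at.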

      You might find Frank's Compulsive Guide to Postal Addresses interesting.


      • pranyht
        New Member
        • Jul 2010
        • 5

        #4
        Well, yes, I understand it's almost impossible to write code that would be 100% accurate. Luckily, though, we've been told that we can make any assumptions we like, so maybe you could restrict the input to some standard format and write the code based on that. How would one write the code then?


        • pranyht
          New Member
          • Jul 2010
          • 5

          #5
          public class Address
          {
              public string Street { get; set; }      // Lunkad Tower, 6th floor
              public string Locality { get; set; }    // Viman Nagar
              public string City { get; set; }        // Pune
              public string State { get; set; }       // MH, Maharashtra
              public string PostalCode { get; set; }  // 60611
              public string Country { get; set; }     // e.g. India, IN
          }

          Can anyone help me write the code?


          • Oralloy
            Recognized Expert Contributor
            • Jun 2010
            • 988

            #6
            pranyht,

            There are a huge number of ways an address can be expressed. Have you got any limits on what you are going to have to process? U.S.A. only addresses? German addresses? International mail to Italy?

            Constrain the problem, so you can start to generate a viable solution.

            Once you have the problem space worked out, then you can start analysis and formulate a solution.

            That said, I would recommend that you write a pattern-matching engine to scan the addresses. Then take the first hit or the best hit, depending on how you implement it. Once you have your hits, re-check the parsed address to make sure what you found is valid.
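A rough C++ sketch of that idea, using std::regex (C++17): a table of named patterns is tried in order, the first hit is taken, and the hit is re-checked before it is accepted. The two patterns, the names Match, Pattern, looksValid and matchAddress, and the validity rule (a street must contain a digit) are all assumptions made for illustration, not a complete rule set.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result: which pattern hit, plus the captured fields.
struct Match {
    std::string pattern, street, city, state, zip;
};

struct Pattern {
    std::string name;
    std::regex re;
};

// Post-match sanity check (illustrative rule: the street needs a digit).
static bool looksValid(const Match& m) {
    return std::any_of(m.street.begin(), m.street.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

std::optional<Match> matchAddress(const std::string& input) {
    // Patterns are tried in order; the first valid hit wins.
    static const std::vector<Pattern> patterns = {
        {"street, city, ST zip",
         std::regex(R"(^\s*(.+?),\s*([^,]+?),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$)")},
        {"street, city ST zip",
         std::regex(R"(^\s*(.+?),\s*([^,]+?)\s+([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$)")},
    };
    for (const Pattern& p : patterns) {
        std::smatch m;
        if (!std::regex_match(input, m, p.re)) continue;
        Match hit{p.name, m[1].str(), m[2].str(), m[3].str(), m[4].str()};
        if (looksValid(hit)) return hit;  // re-check before accepting
    }
    return std::nullopt;  // nothing matched: the caller must handle failure
}
```

Supporting another layout means adding one more entry to the table. Choosing the best hit rather than the first would mean collecting every valid hit and comparing them.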

            By way of comparison, I spent several days of my life writing a general date/time/timestamp parser for a commercial website. There were about 15 general patterns that I expected (e.g. "MMDDYY", "YYYY-MM-DD", RFC, etc.). I checked all of them using pattern matching, validated them, and when there were multiple valid hits, I selected the "correct" one (or I compared them and verified that the result was the same). Invalid inputs were kicked out with exceptions.

            That took days' worth of analysis and false starts. The final code only took a few hours, once I'd worked out how I was going to solve the problem.

            That said, be sure to code self-defensively. Any algorithm you may come up with will surely fail at some point, especially when dealing with input from real people. Expect Failures!
