best design for parse

  • gs

    best design for parse

    Let's say I have to deal with various date formats, and I am given a format
    string that is one of the following:
    dd/mm/yyyy
    mm/dd/yyyy
    dd/mmm/yyyy
    mmm/dd/yyyy
    dd/mm/yy
    mm/dd/yy
    dd/mmm/yy
    mmm/dd/yy
    dd/mm
    What is the best way to come up with a relevant regex for the incoming format
    string?
    a) use two arrays and match statically
    b) use a regex to find the order


  • Herfried K. Wagner [MVP]

    #2
    Re: best design for parse

    Maybe you are looking for 'DateTime.ParseExact'.
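
A minimal sketch of what such a call looks like, assuming the format mask is already known. One thing to watch: .NET custom format strings are case-sensitive, so the month is MM (lowercase mm means minutes) and a mask written as dd/mm/yyyy has to be passed as dd/MM/yyyy. The sample value and the use of CultureInfo.InvariantCulture are assumptions made for illustration.

```vbnet
Imports System
Imports System.Globalization

Module ParseExactSketch
    Sub Main()
        ' Hypothetical input; the mask uses MM (month), not mm (minutes).
        Dim input As String = "07/01/2007"
        Dim parsed As DateTime = DateTime.ParseExact(input, "dd/MM/yyyy", CultureInfo.InvariantCulture)

        ' Write the date back out in the normalized yyyy-MM-dd form.
        Console.WriteLine(parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
    End Sub
End Module
```

With the day-first mask, the sample prints 2007-01-07.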

    --
    Herfried K. Wagner [MVP]
    <URL:http://dotnet.mvps.org/>
    <URL:http://dotnet.mvps.org/dotnet/faqs/>


    • GS

      #3
      Re: best design for parse

      Thank you, I'll give that a shot. Hopefully it will take care of most of what I
      need, or at least make the rest easier, except for one thing.

      I am dealing with lines of string data (up to 300 lines), and the position of the
      date fields may not be known beforehand, although for a given set of lines
      they stay in the same place 99.999% of the time, except for the odd comment,
      which is not that critical.

      • Cor Ligthert [MVP]

        #4
        Re: best design for parse

        GS,

        In 2007 you can maybe avoid this, and everything like DateTime.ParseExact,
        by having a look at the globalization support that Microsoft has nicely built
        in, and then at the related ToString option.
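
A small sketch of the globalization route mentioned here, assuming the culture of the source data is known. The two cultures and the sample value are assumptions chosen only to show that the same text reads differently depending on the culture passed to DateTime.Parse.

```vbnet
Imports System
Imports System.Globalization

Module CultureSketch
    Sub Main()
        ' Assumed cultures: en-GB writes the day first, en-US writes the month first.
        Dim british As New CultureInfo("en-GB")
        Dim american As New CultureInfo("en-US")

        Dim dayFirst As DateTime = DateTime.Parse("07/01/2007", british)     ' 7 January 2007
        Dim monthFirst As DateTime = DateTime.Parse("07/01/2007", american)  ' 1 July 2007

        ' ToString with an explicit pattern gives the same layout for both.
        Console.WriteLine(dayFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
        Console.WriteLine(monthFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
    End Sub
End Module
```

This only helps when the whole input is a date; it does not by itself locate dates that are mixed in with other text.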

        Cor

        • GS

          #5
          Re: best design for parse

          Thank you, Cor.

          However, I must be thick. I don't quite get your drift with regard to
          2007. Are we talking about a new release of Visual Studio or the .NET
          Framework, or just a release or patch due to come out in 2007?

          How would that handle date strings mixed with other data?

          Actually, the original source of the data is a displayed HTML table placed on
          the clipboard. The objective is to standardize the date strings to yyyy-mm-dd and
          then pass them on to other components for processing and storage.

          • Cor Ligthert [MVP]

            #6
            Re: best design for parse

            GS,

            I was thinking of writing that this does not apply in the case of web pages.
            However, Windows Forms is the default in this newsgroup, so please mention
            that next time.

            Cor

            • Stephany Young

              #7
              Re: best design for parse

              It is?

              Newsgroup microsoft.public.dotnet.languages.vb provides a forum for
              questions and general discussion of Visual Basic .NET.


              • GS

                #8
                Re: best design for parse

                The target is actually part of a Windows .NET application: a WinForms form
                that embeds a WebBrowser control.

                Although the clipboard source may well be an HTML table, I can get the
                text. The resulting text will have columns delimited by a couple of
                space-like characters.
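
A sketch of splitting such a line into columns, assuming "a couple of space-like characters" means a run of two or more whitespace characters. The sample line and the \s{2,} delimiter are assumptions; the real delimiter would have to be checked against actual clipboard text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SplitColumnsSketch
    Sub Main()
        ' Hypothetical clipboard line: columns separated by two or more whitespace characters.
        Dim line As String = "07/01/2007   Widget order    12.50"

        ' \s{2,} is an assumed delimiter; a single space inside a column survives.
        Dim columns As String() = Regex.Split(line.Trim(), "\s{2,}")

        For i As Integer = 0 To columns.Length - 1
            Console.WriteLine("{0}: {1}", i, columns(i))
        Next
    End Sub
End Module
```

Once the line is split, the date column can be handled on its own instead of being searched for in the whole line.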

                I am just at the design stage, trying to find an easy-to-maintain approach
                that will yield adequate performance on target PCs.

                • GS

                  #9
                  Re: best design for parse

                  Thanks to all who have pitched in so far.

                  Let me give it another shot.

                  It looks like an easier way out would be:
                  1. Copy the date format string into a regex string holder, and then derive
                  the relevant regex expression, to be used for date normalization later in
                  part 2:
                  - replace yyyy in the regex string with the year regex expression,
                    associated with the year identifier
                  - look for yy, replace it with 20yy, and repeat the step above
                  - replace mmm with the month regex expression, associated with the month
                    identifier
                  - replace mm with the 2-digit month regex expression, associated with the
                    month identifier
                  - replace dd with the 2-digit day regex expression, associated with the day
                    identifier

                  2. Use the resulting regex in a regex replace to normalize to yyyy-mm-dd.
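
The two steps above can be sketched as follows. This is one possible reading, not the poster's code: the function names, the named groups (year, month, day), the English month abbreviations, and the rule that a date may not touch a letter or digit on either side are all assumptions. A mask without a year (dd/mm) is left untouched because the thread does not say which year to assume.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MaskToRegexSketch
    ' English three-letter month names are an assumption.
    Private ReadOnly MonthNames As String() = _
        {"jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"}

    ' Step 1: turn a mask such as "dd/mmm/yy" into a regex with named groups.
    Function MaskToPattern(ByVal mask As String) As String
        ' Longer tokens come first so that yyyy wins over yy and mmm over mm.
        Dim body As String = Regex.Replace(Regex.Escape(mask.ToLowerInvariant()), _
                                           "yyyy|yy|mmm|mm|dd", AddressOf TokenToGroup)
        ' Assumed delimiter rule: no letter or digit directly before or after the date.
        Return "(?<![0-9A-Za-z])" & body & "(?![0-9A-Za-z])"
    End Function

    Private Function TokenToGroup(ByVal token As Match) As String
        Select Case token.Value
            Case "yyyy"
                Return "(?<year>\d{4})"
            Case "yy"
                Return "(?<year>\d{2})"
            Case "mmm"
                Return "(?<month>[A-Za-z]{3})"
            Case "mm"
                Return "(?<month>\d{2})"
            Case Else ' "dd"
                Return "(?<day>\d{2})"
        End Select
    End Function

    ' Step 2: rewrite every match as yyyy-mm-dd.
    Function Normalize(ByVal line As String, ByVal mask As String) As String
        Return Regex.Replace(line, MaskToPattern(mask), AddressOf ToIso)
    End Function

    Private Function ToIso(ByVal found As Match) As String
        Dim yearText As String = found.Groups("year").Value
        Dim monthText As String = found.Groups("month").Value
        If yearText.Length = 0 Then Return found.Value ' no year in the mask: leave as-is
        If yearText.Length = 2 Then yearText = "20" & yearText ' the 20yy rule from the post
        If monthText.Length = 3 Then
            Dim index As Integer = Array.IndexOf(MonthNames, monthText.ToLowerInvariant())
            If index < 0 Then Return found.Value ' not a month name: leave as-is
            monthText = (index + 1).ToString("00")
        End If
        Return yearText & "-" & monthText & "-" & found.Groups("day").Value
    End Function

    Sub Main()
        Console.WriteLine(Normalize("Paid 07/Jan/07 in full", "dd/mmm/yy"))
    End Sub
End Module
```

The sample prints "Paid 2007-01-07 in full". Note that this approach only checks the shape of the text: 99/99/2007 would be rewritten as happily as a real date, so a validation pass is still needed if bad input is possible.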


                  Any problem with the above approach?

                  • Cor Ligthert [MVP]

                    #10
                    Re: best design for parse

                    Stephany,

                    You will have seen (you are not a newbie) how much time it took, especially
                    for me, before it was accepted that the VB.NET language as used in ASP.NET
                    is also part of the language and not of the framework, and is therefore a
                    subject for this newsgroup. Maybe you even saw that last week I wrote that
                    again in the C# newsgroup.

                    I only asked the OP to say so if the question is specific to a web page
                    (which seems not to be the case). Most of the people answering here take
                    Windows Forms as the default, and in the case of date/times I seldom ask,
                    because there is unfortunately no DateTime value equivalent in HTML.

                    Cor

                    • Stephany Young

                      #11
                      Re: best design for parse

                      I think that you are missing the whole point.

                      Regular Expressions (Regex) are about pattern matching, not format matching.

                      It does not matter whether the source data comes from an HTML page, a Windows
                      Forms TextBox or a disk file. The source data is the source data and that is
                      all there is to it.

                      If the source data only contained one instance of a 'date' in dd/MM/yyyy
                      format then to find it by your methodology, you would need to test for up to
                      3,719,628 permutations from 01/01/0001 all the way up to 31/12/9999, i.e.,
                      31 (days) * 12 (months) * 9999 (years). Couple this with the other 8
                      'formats' and you can see how such a task will quickly become unmanageable.

                      But ... what you really are looking for is a sequence of 2 digits followed
                      by a slash followed by 2 digits followed by a slash followed by 4 digits.
                      That immediately takes care of 2 of your 'formats'. Off the top of my head
                      the regex for that is "[0-9]{2}/[0-9]{2}/[0-9]{4}".

                      The next pattern you are looking for is 2 digits followed by a slash
                      followed by 3 alphas followed by a slash followed by 4 digits.
                      "[0-9]{2}/[A-Za-z]{3}/[0-9]{4}".

                      The next pattern you are looking for is 3 alphas followed by a slash
                      followed by 2 digits followed by a slash followed by 4 digits.
                      "[A-Za-z]{3}/[0-9]{2}/[0-9]{4}".

                      The next 4 formats are taken care of by varying the above.
                      "[0-9]{2}/[0-9]{2}/[0-9]{2}", "[0-9]{2}/[A-Za-z]{3}/[0-9]{2}" and
                      "[A-Za-z]{3}/[0-9]{2}/[0-9]{2}" respectively.

                      The last format is simply the pattern "[0-9]{2}/[0-9]{2}".

                      Now, the real secret is what directly precedes and follows your 'dates'. For
                      instance, are your 'dates' ALWAYS 'wrapped' in a tag? E.g.,
                      <td>07/01/2007</td>. It might be that there is always a space character
                      directly before the 'date' and another directly after the 'date'. Any such
                      information will allow you to 'tune' your pattern so that it doesn't pick up
                      false positives. The pattern [0-9]{2}/[0-9]{2} would pick up the 01/02 out
                      of 01MyQuite01/02YourQuote02.
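
One way to do that tuning, sketched with lookarounds. The rule used here (a date may not be directly preceded or followed by a letter or digit) is an assumption; the right rule depends on what actually surrounds the dates in the source text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DelimiterSketch
    Sub Main()
        Dim bare As String = "[0-9]{2}/[0-9]{2}"
        ' Assumed rule: the date may not touch a letter or digit on either side.
        Dim guarded As String = "(?<![0-9A-Za-z])" & bare & "(?![0-9A-Za-z])"

        Dim noise As String = "01MyQuite01/02YourQuote02"
        Dim cell As String = "<td>01/02</td>"

        Console.WriteLine(Regex.IsMatch(noise, bare))     ' True: the false positive
        Console.WriteLine(Regex.IsMatch(noise, guarded))  ' False: rejected by the lookbehind
        Console.WriteLine(Regex.IsMatch(cell, guarded))   ' True: the tags count as delimiters
    End Sub
End Module
```

Lookarounds check the neighbouring characters without making them part of the match, so the matched text is still just the date.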

                      All the patterns need to be put together in one regular expression with ORs
                      (|) so that you can find all the candidate dates in one operation.

                      "\d{2}/\d{2}/\d{4}|\d{2}/[A-Za-z]{3}/\d{4}|[A-Za-z]{3}/\d{2}/\d{4}|\d{2}/\d{2}/\d{2}|\d{2}/[A-Za-z]{3}/\d{2}|[A-Za-z]{3}/\d{2}/\d{2}|\d{2}/\d{2}"

                      Please feel free to jump in here if I've got that wrong because I'm by no
                      means a regex expert.
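
A sketch of running that combined pattern with Regex.Matches. The sample sentence is made up for illustration. The order of the alternatives matters: the four-digit-year patterns are listed before the two-digit ones, so 07/01/2007 is matched whole rather than as 07/01/20.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FindCandidatesSketch
    Sub Main()
        ' The combined pattern from the post, split over lines for readability.
        Dim pattern As String = _
            "\d{2}/\d{2}/\d{4}|\d{2}/[A-Za-z]{3}/\d{4}|[A-Za-z]{3}/\d{2}/\d{4}|" & _
            "\d{2}/\d{2}/\d{2}|\d{2}/[A-Za-z]{3}/\d{2}|[A-Za-z]{3}/\d{2}/\d{2}|" & _
            "\d{2}/\d{2}"

        ' Hypothetical sample text.
        Dim source As String = "Ordered 07/01/2007, shipped Jan/09/2007, due 15/02."

        For Each candidate As Match In Regex.Matches(source, pattern)
            Console.WriteLine("{0} at index {1}", candidate.Value, candidate.Index)
        Next
    End Sub
End Module
```

On the sample text this lists three candidates: 07/01/2007, Jan/09/2007 and 15/02.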


                      Once you have your candidate dates (matches) you need to deal with each one
                      in turn.

                      As Herfried said earlier, you need to use DateTime.ParseExact.

                      For that you need an array of strings to hold all your formats.

                      Dim _formats As String() = New String() {"dd/MM/yyyy", "MM/dd/yyyy", _
                        "dd/MMM/yyyy", "MMM/dd/yyyy", "dd/MM/yy", "MM/dd/yy", "dd/MMM/yy", _
                        "MMM/dd/yy", "dd/MM"}

                      For each candidate, call DateTime.ParseExact, trapping the exception if one
                      occurs:

                      ' DateTimeStyles needs Imports System.Globalization
                      Dim _d As DateTime

                      Try
                        _d = DateTime.ParseExact(_candidate, _formats, Nothing, DateTimeStyles.None)
                        ' DateTime.ParseExact succeeded so we can deal with it
                        ...
                      Catch _ex As FormatException
                        ' Because we know that _candidate is not an empty string and none of the
                        ' elements of _formats is an empty string, _candidate does not contain a
                        ' date and time that corresponds to any element of _formats
                        ...
                      End Try
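
A variation on the Try/Catch above, sketched with DateTime.TryParseExact, which reports failure through its return value instead of an exception. The candidate strings and the use of CultureInfo.InvariantCulture are assumptions. The formats are tried in array order, so an ambiguous value such as 07/01/2007 is read by whichever of dd/MM/yyyy and MM/dd/yyyy comes first; when the mask for a batch is known, passing only that mask avoids the ambiguity.

```vbnet
Imports System
Imports System.Globalization

Module TryParseSketch
    Sub Main()
        Dim formats As String() = {"dd/MM/yyyy", "MM/dd/yyyy", "dd/MMM/yyyy", "MMM/dd/yyyy", _
                                   "dd/MM/yy", "MM/dd/yy", "dd/MMM/yy", "MMM/dd/yy", "dd/MM"}
        ' Hypothetical candidates, as they might come out of the regex search.
        Dim candidates As String() = {"07/01/2007", "Jan/09/2007", "99/99/9999"}

        For Each candidate As String In candidates
            Dim parsed As DateTime
            If DateTime.TryParseExact(candidate, formats, CultureInfo.InvariantCulture, _
                                      DateTimeStyles.None, parsed) Then
                Console.WriteLine("{0} -> {1}", candidate, _
                                  parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
            Else
                Console.WriteLine("{0} -> not a date", candidate)
            End If
        Next
    End Sub
End Module
```

With dd/MM/yyyy listed first, the first candidate comes out as 2007-01-07, the second as 2007-01-09, and the third is reported as not a date.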

                      • GS

                        #12
                        Re: best design for parse

                        You are sort of on the same track as I am.


                        I must first apologize: I did not tell you the complete story.

                        Although the application does not know beforehand exactly what format the
                        data may come in, part of the application allows the user to define and
                        record, as a favourite for a website:
                        - whether to extract by text or HTML
                        - header content and format
                        - record format and date format (that is where the date format mask
                          comes in)
                        - optionally, an ordinal number for each column, or a re-ordering
                        - trailer content and format

                        For a given batch, at least for the body, the date format is uniform.

                        Furthermore, given the need to make the extract process generic and
                        adaptable to the front end that takes the user definitions, I believe it
                        would be easier to "normalize" the date strings to "yyyy-mm-dd".

                        Also, the end target may not necessarily be a SQL database; it may be
                        text, pasted into a Word report or into Excel by the user.


                        Therefore, I can transform the date format mask into a regex in the
                        appropriate format, with identifiers, and use Regex.Replace to normalize
                        the date. As a matter of fact, the date separator does not have to be /;
                        it can be a space, as long as there are identifiable delimiters around the
                        date string.

                        I already have code for dealing with regexes for dates from a prior project.
                        All I have to do is adapt it to the present need.

                        Who knows, maybe I have taken a totally offbeat track.

                        • Stephany Young

                          #13
                          Re: best design for parse

                          Again you're missing the point.

                          I think the best thing you can do is post a relatively small sample of the
                          text you are attempting to parse.

                          While you're doing that, execute the following and observe the results. It
                          demonstrates what I am talking about:

                          Imports System
                          Imports System.Globalization
                          Imports System.Text.RegularExpressions

                          Module DateScanDemo
                              Sub Main()
                                  ' Sample text: one date per supported format, plus a part
                                  ' number that looks like a date but is not one.
                                  Dim _source As String = _
                                      "On 07/01/2007 the quick brown fox jumps over the lazy dog." & Environment.NewLine & _
                                      "On 08/01/2007 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On Jan/09/2007 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On 10/Jan/2007 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On 11/01/07 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On 01/12/07 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On Jan/13/07 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On 14/Jan/07 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "On 15/01 the quick brown fox again jumps over the lazy dog." & Environment.NewLine & _
                                      "The part number XYZ/72/84 is now discontinued."

                                  ' One alternative per supported shape; the four-digit-year
                                  ' alternatives come first so the longer match wins.
                                  Dim _regex As New Regex("\d{2}/\d{2}/\d{4}|[A-Za-z]{3}/\d{2}/\d{4}|\d{2}/[A-Za-z]{3}/\d{4}|\d{2}/\d{2}/\d{2}|[A-Za-z]{3}/\d{2}/\d{2}|\d{2}/[A-Za-z]{3}/\d{2}|\d{2}/\d{2}")

                                  ' Formats handed to ParseExact (note the upper-case MM for month).
                                  Dim _formats As String() = {"dd/MM/yyyy", "MM/dd/yyyy", "MMM/dd/yyyy", "dd/MMM/yyyy", _
                                      "dd/MM/yy", "MM/dd/yy", "dd/MMM/yy", "MMM/dd/yy", "dd/MM"}

                                  Dim _candidates As Integer = 0
                                  Dim _matches As Integer = 0

                                  Dim _match As Match = _regex.Match(_source)

                                  While _match.Success
                                      _candidates += 1
                                      Console.WriteLine("{0} found at index {1}", _match.Value, _match.Index)
                                      Try
                                          ' InvariantCulture keeps "/" and the month names independent
                                          ' of the machine's regional settings.
                                          Console.WriteLine("Converted value = {0:yyyy-MM-dd}", _
                                              DateTime.ParseExact(_match.Value, _formats, CultureInfo.InvariantCulture, DateTimeStyles.None))
                                          _matches += 1
                                      Catch _ex As FormatException
                                          ' The candidate had the right shape but is not a valid date.
                                          Console.WriteLine(_ex.Message)
                                      End Try
                                      _match = _match.NextMatch()
                                  End While

                                  Console.WriteLine("{0} candidates found", _candidates)
                                  Console.WriteLine("{0} matches found", _matches)
                              End Sub
                          End Module
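
One thing worth checking alongside the demonstration above is that a string such as 01/12/07 is valid under more than one of the listed masks, so the text alone cannot tell you which one was meant. The following is a small sketch of that effect only; the module name is made up for the example.

```vbnet
Imports System
Imports System.Globalization

Module AmbiguitySketch
    Sub Main()
        Dim raw As String = "01/12/07"

        ' The same text parses successfully under both masks...
        Dim asDayFirst As DateTime = _
            DateTime.ParseExact(raw, "dd/MM/yy", CultureInfo.InvariantCulture)
        Dim asMonthFirst As DateTime = _
            DateTime.ParseExact(raw, "MM/dd/yy", CultureInfo.InvariantCulture)

        ' ...but yields two different dates.
        Console.WriteLine(asDayFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))   ' 2007-12-01
        Console.WriteLine(asMonthFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)) ' 2007-01-12
    End Sub
End Module
```

This is why knowing the mask for a given set of input matters more than the regex used to locate the candidate text.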


                          • Cor Ligthert [MVP]

                            #14
                            Re: best design for parse

                            GS,

                            As long as you don't know the date format, there is probably nothing you
                            can do. As soon as you know the date format, you can try to use
                            DateTime.ParseExact with the given pattern.
                            (Don't forget to set the mm to uppercase yourself; don't leave that to the user.)

                            Cor
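
The uppercase remark matters because in .NET custom format strings a lower-case `m` stands for minutes and an upper-case `M` for months. A minimal sketch of doing that conversion in code is shown below; the helper name `ToDotNetFormat` is hypothetical, and it assumes the user-facing masks contain only d, m and y tokens plus separators, as in the list at the top of the thread.

```vbnet
Imports System
Imports System.Globalization

Module MaskCaseSketch
    ' Hypothetical helper: converts a user-facing mask such as "dd/mmm/yyyy"
    ' into a .NET custom format string by upper-casing the month token.
    ' Assumes the mask never contains a time part, where "m" would be valid.
    Function ToDotNetFormat(ByVal userMask As String) As String
        Return userMask.Replace("m", "M")
    End Function

    Sub Main()
        Dim netFormat As String = ToDotNetFormat("dd/mmm/yyyy")
        Dim parsed As DateTime = _
            DateTime.ParseExact("10/Jan/2007", netFormat, CultureInfo.InvariantCulture)

        Console.WriteLine(netFormat)                                                      ' dd/MMM/yyyy
        Console.WriteLine(parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))    ' 2007-01-10
    End Sub
End Module
```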

                            • GS

                              #15
                              Re: best design for parse

                              Looks like I am not expressing myself clearly. The application does not
                              know in advance which format will be used, but for a given set it does
                              know which date format it is dealing with, and it can expect the same
                              format throughout that set of input. I should not have used the term
                              "batch"; I meant a set of records. The only possible variation is that
                              some records in certain sets may be split across 2 lines, but that is
                              not critical, as the conditions can be described beforehand and
                              normalized by another parse component.

                              Sample data:

                              Set 1: date format mask is "dd MMM"
                              Date    Parts ID  Parts Description         Location  Quantity  Unit Cost  Total Cost
                              11 Dec  A1234987  Sample Parts description  W1I1R4S1  2         10.00      20.00
                              15 Dec  A1234988  Sample Parts description            1         10.00      20
                              18 Dec  A1234988  Sample Parts description            1         10.00      20
                              19 Dec  A1234988  Sample Parts description            1         10.00      20
                              12 Dec  A1234988  Sample Parts description            1         10.00      20


                              Set 2: date format mask is "dd MM yy"
                              Date      Parts ID  Parts Description         Location  Quantity  Unit Cost  Total Cost
                              11 12 06  A1234987  Sample Parts description  W1I1R4S1  2         10.00      20.00
                              15 12 06  A1234988  Sample Parts description            1         10.00      20
                              18 12 06  A1234988  Sample Parts description            1         10.00      20
                              19 12 06  A1234988  Sample Parts description            1         10.00      20
                              12 12 06  A1234988  Sample Parts description            1         10.00      20

                              Set 3: date format mask is "dd/MMM/06"
                                                  Parts Description         Location  Quantity  Unit Cost  Total Cost
                              11/12/06  A1234987  Sample Parts description  W1I1R4S1  2         10.00      20.00
                              15/12/06  A1234988  Sample Parts description            1         10.00      20
                              18/12/06  A1234988  Sample Parts description            1         10.00      20
                              19/12/06  A1234988  Sample Parts description            1         10.00      20
                              12/12/06  A1234988  Sample Parts description            1         10.00      20

                              Set 4: date format mask is ""
                              Date       Parts ID  Parts Description         Location  Quantity  Unit Cost  Total Cost
                              11/dec/06  A1234987  Sample Parts description  W1I1R4S1  2         10.00      20.00
                              15/dec/06  A1234988  Sample Parts description            1         10.00      20
                              18/dec/06  A1234988  Sample Parts description            1         10.00      20
                              19/dec/06  A1234988  Sample Parts description            1         10.00      20
                              12/dec/06  A1234988  Sample Parts description            1         10.00      20

                              How do I deal with a format without a year? I do have clues from other
                              parts of the originating website, and an optional default set by the user.
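
One possible way to handle a year-less mask such as "dd MMM", sketched under the assumption that a default year is available from the user or from elsewhere on the page: append that year to the text and to the mask before parsing. The helper name `ParseWithDefaultYear` is hypothetical. As far as I know, parsing a year-less mask directly makes .NET fill in the current year, which is probably not what is wanted for older data.

```vbnet
Imports System
Imports System.Globalization

Module DefaultYearSketch
    ' Hypothetical helper: parses a date that carries no year by appending
    ' a caller-supplied default year to both the text and the mask.
    Function ParseWithDefaultYear(ByVal raw As String, _
                                  ByVal mask As String, _
                                  ByVal defaultYear As Integer) As DateTime
        Dim withYear As String = _
            raw & " " & defaultYear.ToString("0000", CultureInfo.InvariantCulture)
        Return DateTime.ParseExact(withYear, mask & " yyyy", CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        ' First record of Set 1, with 2006 as the assumed default year.
        Dim parsed As DateTime = ParseWithDefaultYear("11 Dec", "dd MMM", 2006)
        Console.WriteLine(parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))   ' 2006-12-11
    End Sub
End Module
```

A set that spans a year boundary (December followed by January) would still need a rule for when to advance the year; that is not covered here.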

                              The sample data shows the date format varying from set to set, but
                              within a given set the date format I need to deal with is consistent,
                              and the user has influence over the date format mask used.

                              Following Cor's suggestion: don't let the user enter the format, but let
                              the user pick from a list. That will likely be the case, at least in version 0.