C# and regex issue

  • Nightcrawler

    C# and regex issue

    Hi all,

    I am trying to use regular expressions to parse mp3 titles into
    three different groups (artist, title and remix). I currently have
    three ways to name an mp3 file:

    Artist - Title [Remix]
    Artist - Title (Remix)
    Artist - Title

    I have approached the problem the following way.

    First I check whether this regex matches:

    (?<artist>.*?) - (?<title>.*?) \[(?<remix>.*?)\]

    If not, I move on to see if this one matches:

    (?<artist>.*?) - (?<title>.*?) \((?<remix>.*?)\)

    If not, I move on to see if this one matches:

    (?<artist>.*?) - (?<title>.*?)

    However, I run into two problems.

    1. The last regex does not work.
    2. I have to execute these regular expressions in the above order for
    the result to be correct. If I ran a working version of the last
    regex first, it would match every time.

    So my two questions are:

    1. Is there a better way to do this? Do I have to execute the regular
    expressions in order for this to work? It could be problematic if I
    introduce more naming conventions.
    2. How do I get the last regular expression to work?

    Any help is appreciated.

    Thanks

  • Jesse Houwing

    #2
    Re: C# and regex issue

    * Nightcrawler wrote, On 21-5-2007 5:56:
    > 1. The last regex does not work.
    The last regex has nothing to force it beyond the start of the title.
    Because the quantifier is *, the title group does not have to match even
    a single character, and the reluctant modifier (the ? after the *) tells
    the engine to stop as soon as possible, so the title comes back empty.
    Either make the last quantifier greedy:

    (?<artist>.*?) - (?<title>.*)

    or, better yet, anchor the pattern so it has to cover the whole name:

    ^(?<artist>.*?) - (?<title>.*?)$

    Either change fixes your problem.
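
A minimal C# sketch of the difference, using the plain "Artist - Title" form from the original post. The class and method names here are made up for illustration; only `Regex.Match` and the `Groups` indexer come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyTailDemo
{
    // Returns whatever the "title" group captured (empty string if nothing).
    public static string CaptureTitle(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Groups["title"].Value;
    }

    public static void Main()
    {
        const string name = "Artist - Title";

        // Lazy quantifier with nothing after it: the title is captured as "".
        Console.WriteLine("[" + CaptureTitle(@"(?<artist>.*?) - (?<title>.*?)", name) + "]");

        // Greedy quantifier: the title runs to the end of the input.
        Console.WriteLine("[" + CaptureTitle(@"(?<artist>.*?) - (?<title>.*)", name) + "]");

        // Anchored: the lazy title is forced to grow until $ can match.
        Console.WriteLine("[" + CaptureTitle(@"^(?<artist>.*?) - (?<title>.*?)$", name) + "]");
    }
}
```

The first call prints `[]`, the other two print `[Title]`.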
    > 1. Is there a better way to do this? Do I have to execute the regular
    > expressions in order for this to work? It could be problematic if I
    > introduce more naming conventions.
    My guess is that you'll have to use a predetermined order in which to
    execute your search. Otherwise there is no way for any engine to know
    which of the matching variants to use. Alternatively, you could be more
    precise as to which characters each captured group can contain. So
    instead of .*? you could write [a-z0-9'.,]*? which would make it easier
    to write patterns that don't actually overlap.
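
One way to keep a predetermined order manageable is to hold the patterns in a list, most specific first, and take the first one that matches. The sketch below assumes the three naming conventions from the original post, anchored with ^ and $. The `OrderedPatterns` class and its `Parse` method are hypothetical names, not an existing API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrderedPatterns
{
    // Most specific convention first; the catch-all "Artist - Title" form goes last.
    private static readonly Regex[] Conventions =
    {
        new Regex(@"^(?<artist>.*?) - (?<title>.*?) \[(?<remix>.*?)\]$"),
        new Regex(@"^(?<artist>.*?) - (?<title>.*?) \((?<remix>.*?)\)$"),
        new Regex(@"^(?<artist>.*?) - (?<title>.*)$"),
    };

    // Returns the first successful match, or Match.Empty when no convention applies.
    public static Match Parse(string name)
    {
        foreach (Regex convention in Conventions)
        {
            Match m = convention.Match(name);
            if (m.Success)
            {
                return m;
            }
        }
        return Match.Empty;
    }

    public static void Main()
    {
        foreach (string name in new[] { "Artist - Title [Remix]", "Artist - Title (Remix)", "Artist - Title" })
        {
            Match m = Parse(name);
            Console.WriteLine("{0} | {1} | {2}",
                m.Groups["artist"].Value, m.Groups["title"].Value, m.Groups["remix"].Value);
        }
    }
}
```

Adding a new naming convention then means adding one entry to the array at the right position, rather than another branch in the calling code.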

    Jesse


    • Kevin Spencer

      #3
      Re: C# and regex issue

      (?<artist>\w+)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?

      Explanation:
      There are 4 distinct parts to this:

      (?<artist>\w+) Find a string of word characters. Captures to group "artist"

      \s+-\s+ Followed by 1 or more spaces, followed by a hyphen, followed by 1
      or more spaces

      (?<title>\w+) Find a string of word characters. Captures to group "title"

      (?:\s+[\(\[](?<remix>\w+)[)\]])?

      Non-capturing group, of which there may be 0 or 1. Begins with 1 or more
      spaces, followed by 1 of the characters '(' or '['. This is followed by a
      named capturing group called "remix" which is defined as 1 or more word
      characters. This is followed by 1 of the characters ')' or ']'.

      This assumes that there will always be an artist and a title, but that remix
      may be omitted.
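
A short C# sketch of how this pattern could be used, assuming single-word artist, title and remix values as in the three forms from the original post. The `SingleWordPattern` class and `Describe` method are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SingleWordPattern
{
    private static readonly Regex Pattern = new Regex(
        @"(?<artist>\w+)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?");

    // Formats the three groups as "artist | title | remix"; an absent remix shows as "(none)".
    public static string Describe(string name)
    {
        Match m = Pattern.Match(name);
        if (!m.Success)
        {
            return "no match";
        }
        Group remix = m.Groups["remix"];
        return m.Groups["artist"].Value + " | " + m.Groups["title"].Value + " | "
            + (remix.Success ? remix.Value : "(none)");
    }

    public static void Main()
    {
        Console.WriteLine(Describe("Artist - Title [Remix]"));
        Console.WriteLine(Describe("Artist - Title (Remix)"));
        Console.WriteLine(Describe("Artist - Title"));
    }
}
```

Checking `Group.Success` is how the optional remix is told apart from an empty one. Note that the two character classes are independent, so a mismatched pair such as `[Remix)` would also be accepted.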

      --
      HTH,

      Kevin Spencer
      Microsoft MVP

      Printing Components, Email Components,
      FTP Client Classes, Enhanced Data Controls, much more.
      DSI PrintManager, Miradyne Component Libraries: http://www.miradyne.net

      • Nightcrawler

        #4
        Re: C# and regex issue

        Thank you. I tried your regex on a sample of 10 titles and it didn't
        really work. Here are my ten samples that I used:

        From P-60 - Sinking With The Fall
        JP Conley - Karma Moods [Soul Mix]
        Soul Beats - Wherever You Go... [Love Mix]
        Thievery Corporation - Doors Of Perception
        Thievery Corporation - Holographic Universe
        Ananda Project - Universal Love [Jay-J's Shifted Up Mix]
        Collective Sound Members - Switch
        Cool Touch - Gravity
        Dennis Ferrer - Church Lady Part 2 [Bryan Cox Remix]
        Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]

        After I ran the regular expression on the titles above, here is what
        the groups caught:

        Artist
        -----------------------------
        60
        Conley
        Beats
        Corporation
        Corporation
        Project
        Members
        Touch
        Ferrer
        Air

        Title
        -----------------------------
        Sinking
        Karma
        Wherever
        Doors
        Holographic
        Universal
        Switch
        Gravity
        Church
        Cherry

        Remix
        -----------------------------
        Nothing was captured here

        Please let me know what is wrong.

        Thanks


        • Jesse Houwing

          #5
          Re: C# and regex issue

          * Nightcrawler wrote, On 21-5-2007 18:07:
          The regex only looks for artists and titles that are made up of \w+,
          which means one word. As you can see, there are multiple words here. A
          better solution could be to substitute '\w+' with '\w+( \w+)*', which
          catches one word followed by any number of other words.

          The same goes for the title. There are also some characters in your
          titles, like the ' and the ., which aren't in the \w shortcut. A better
          solution might be \S here, which means any non-whitespace character.

          (?<artist>\S+(\s+\S+)*)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?

          This does not actually do the trick yet, as it captures too much in the
          artist field. After playing around with reluctant modifiers and a few
          other small modifications I came up with this:

          ^(?<artist>\S+(\s+\S+)*?)\s+?-\s+?(?<title>\S+(\s+\S+)*?)(?:\s+[\(\[](?<remix>\S+(\s+[^)\]]+)*?)[)\]])?\r?$
          MultiLine ON
          ExplicitCapture ON

          which works on all the examples you've provided which didn't work
          before. But after expanding on the testcases I found a few other things
          that didn't work.

          This is what I finally came up with:

          ^(?<artist>\S+?([ \t]+\S+)*?)[ \t]+-[ \t]+(?<title>\S+([ \t]+\S+)*?)([ \t]+?(\((?<remix>[^\)]+(\s+[^\)]+)*?)\)|\[(?<remix>[^\]]+(\s+[^\]]+)*?)\]))?\r?$
          MultiLine ON
          ExplicitCapture ON

          Even though it works, I would recommend not to use it as such. Please
          try to come up with a better way, unless this is a one time thing. The
          regex above is hardly readable and almost unmaintainable.
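
For reference, a sketch of how that final pattern might be wired up in C#, with the two options set through `RegexOptions`. It assumes the pattern is meant to be one unbroken string with no stray spaces outside the `[ \t]` classes. The `FinalPattern` class and `Describe` method are made-up names, and the pattern is split across two literals only for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FinalPattern
{
    private static readonly Regex Pattern = new Regex(
        @"^(?<artist>\S+?([ \t]+\S+)*?)[ \t]+-[ \t]+(?<title>\S+([ \t]+\S+)*?)" +
        @"([ \t]+?(\((?<remix>[^\)]+(\s+[^\)]+)*?)\)|\[(?<remix>[^\]]+(\s+[^\]]+)*?)\]))?\r?$",
        RegexOptions.Multiline | RegexOptions.ExplicitCapture);

    // Formats the three groups as "artist | title | remix"; an absent remix shows as "(none)".
    public static string Describe(string name)
    {
        Match m = Pattern.Match(name);
        if (!m.Success)
        {
            return "no match";
        }
        Group remix = m.Groups["remix"];
        return m.Groups["artist"].Value + " | " + m.Groups["title"].Value + " | "
            + (remix.Success ? remix.Value : "(none)");
    }

    public static void Main()
    {
        Console.WriteLine(Describe("Cool Touch - Gravity"));
        Console.WriteLine(Describe("JP Conley - Karma Moods [Soul Mix]"));
        Console.WriteLine(Describe("Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]"));
    }
}
```

.NET allows the same group name (`remix`) to appear in both alternatives, which is why a single `Groups["remix"]` lookup covers the round and the square bracket forms.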

          Jesse

          • Kevin Spencer

            #6
            Re: C# and regex issue

            My apologies, Nightcrawler.

            Revised Standard Version:

            (?<artist>.+)(?=(\s+-\s+))\1(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))

            Part of the problem with my first attempt was that it didn't account for
            spaces in the Artist or Title. Another was that I was not aware of the
            rules, which include the possibility that there might be hyphens (or other
            characters) in the Artist, Title, or Remix, and finally, that Title might
            contain parenthesized groups of characters, just like Remix. Your examples
            were very helpful!

            A short explanation of the above:

            (?<artist>.+)(?=(\s+-\s+))\1

            This indicates that "artist" should be any characters that MUST be followed
            by 1 or more spaces, a hyphen, and 1 or more spaces. Because .+ is greedy,
            the split happens at the LAST hyphen that has 1 or more spaces on both
            sides, so a spaced hyphen inside the Title would be mistaken for the
            separator, but a hyphen which does NOT have a space on either the left or
            right side is okay. The assumption is that the hyphen between "artist" and
            "title" will have spaces on BOTH sides.

            I put the "space-space" sequence into an unnamed capturing group inside the
            assertion, because a lookahead checks the text without consuming it, and
            the sequence still has to be consumed in order to match the rest of the
            line. Thus, the first part ends with "\1", a backreference which matches
            (and consumes) the "space-space" sequence that group 1 captured.

            (?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))

            This was the tricky part, since the "title" may have parenthesized character
            groups in it, which look just like the "remix," further complicated by the
            fact that "remix" may be absent. Note that this is not perfect, and I will
            explain why in a bit.

            It puts 2 possible combinations into an OR-ing non-capturing group. The
            first possible combination is:

            (?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))

            This uses a double-question-mark quantifier, which makes the first ("title")
            part optional and matches it lazily. It is a rare construct, but necessary
            in this case: we assume that the title WILL be there, but the lazy part
            leaves room for the last part if there are any parenthesized groups of
            characters in the "title." This is followed by the "remix" group, which is
            defined as either a '(' followed by 1 or more non-')' characters, followed
            by a ')', or a '[' followed by 1 or more non-']' characters, followed by a
            ']'. This ensures that if the remix is present, it will be captured.
            However, if the remix is NOT present, we need an alternative:

            (?<title>.+)

            Captures the rest of the string, if the first alternative fails.

            Now, as to why these rules are not perfect, let's have a look at one of the
            items in your list:

            Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]

            Obviously, the [DJ AM Mix] is the Remix. Why obviously? Well, it is the last
            parenthesized expression in the string. But what if you left the Remix off?

            Air - Cherry Blossom Girl (Because You Blossom)

            NOW, "Cherry Blossom Girl" becomes the title, and "(Because You Blossom)"
            becomes the Remix. Why? Because it is the last parenthesized expression in
            the string. Now, even a human being could not tell the difference, because
            you are using a rule that states that the last parenthesized expression
            in the string is the Remix. In other words, your rules for "remix" overlap
            your rules for "title." The only solution to this would be to further
            qualify the rules. That is, you would have to either restrict the rules for
            "title" to a certain pair of brackets, or restrict the rules for "remix" to
            a certain pair of brackets.
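
The overlap can be reproduced with a few lines of C#. This is only a sketch around the revised pattern: the `RevisedPattern` class and `Describe` method are hypothetical names, and the `Trim()` call is an addition, needed because the lazy title group keeps the space in front of the bracket.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RevisedPattern
{
    // \1 refers to the only unnamed capturing group, the one inside the lookahead.
    private static readonly Regex Pattern = new Regex(
        @"(?<artist>.+)(?=(\s+-\s+))\1" +
        @"(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))");

    // Formats the three groups as "artist | title | remix"; an absent remix shows as "(none)".
    public static string Describe(string name)
    {
        Match m = Pattern.Match(name);
        if (!m.Success)
        {
            return "no match";
        }
        Group remix = m.Groups["remix"];
        return m.Groups["artist"].Value + " | " + m.Groups["title"].Value.Trim() + " | "
            + (remix.Success ? remix.Value : "(none)");
    }

    public static void Main()
    {
        // With the remix present, the last bracketed expression is the remix.
        Console.WriteLine(Describe("Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]"));

        // With the remix left off, part of the title is reported as the remix.
        Console.WriteLine(Describe("Air - Cherry Blossom Girl (Because You Blossom)"));
    }
}
```

The first line prints `Air | Cherry Blossom Girl (Because You Blossom) | [DJ AM Mix]` and the second `Air | Cherry Blossom Girl | (Because You Blossom)`. Unlike the earlier patterns, this one keeps the brackets inside the remix group.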

            Thanks for the challenge!

            --
            HTH,

            Kevin Spencer
            Microsoft MVP


            • Nightcrawler

              #7
              Re: C# and regex issue

              On May 22, 8:40 am, "Kevin Spencer" <unclechut...@n othinks.comwrot e:
              My apologies, Nightcrawler.
              >
              Revised Standard Version:
              >
              (?<artist>.+)(? =(\s+-\s+))\1(?:(?<ti tle>.+)??(?<rem ix>(?:\([^\)]+\)|\[[^\]]­+\]))|(?<title>.+) )
              >
              Part of the problem with my first was that it didn't account for spaces in
              the Artist or Title. Another was that I was not aware of the rules, which
              include the possibility that there might be hyphens (or other characters)in
              the Artist, Title, or Remix, and finally, that Title might contain
              parenthetized groups of characters, just like Remix. Your examples were very
              helpful!
              >
              A short explanation of the above:
              >
              (?<artist>.+)(? =(\s+-\s+))\1
              >
              This indicates that "artist" should be any characters that MUST be followed
              by 1 or more spaces, a hyphen, and 1 or more spaces. This means that the
              test will fail if the Artist contains a hyphen which has 1 or more spaceson
              both sides, but that a hyphen which does NOT have a space on either the left
              or right side is okay. The assertion is that the hyphen between "artist" and
              "title" will have spaces on BOTH sides.
              >
              I put the "space-space" sequence into an unnamed capturing group, becauseit
              has to be captured after the assertion, which does NOT capture it, in order
              to match the rest of the line. Thus, the first part ends with "\1" which
              captures the "space-space" sequence.
              >
              (?:(?<title>.+) ??(?<remix>(?:\ ([^\)]+\)|\[[^\]]+\]))|(?<title>.+) )
              >
              This was the tricky part, since the "title" may have parenthetized character
              groups in it, which look just like the "remix," further complicated by the
              fact that "remix" may be absent. Note that this is not perfect, and I will
              explain why in a bit.
              >
              It puts 2 possible combinations into an OR-ing non-capturing group. The
              first possible combination is:
              >
              (?<title>.+)??( ?<remix>(?:\([^\)]+\)|\[[^\]]+\]))
              >
              This uses a double-question-mark quantifier, which makes the first ("title")
              part optional, and matches it lazily, a rare construct, but necessary in
              this case, as we assume that the title WILL be there, but the lazy part
              leaves room for the last part if there are any parenthetized groups of
              characters in the "title." This is followed by the "remix" group, which is
              defined as either a '(' followed by 1 or more non-')' characters, followed
              by a ')', or a '[' followed by 1 or more non-']' characters, followed by a
              ']'. This ensures that if the remix is present, it will be captured.
              However, if the remix is NOT present, we need an alternative:
              >
              (?<title>.+)
              >
              Captures the rest of the string, if the first alternative fails.
              >
              Now, as to why these rules are not perfect, let's have a look at one of the
              items in your list:
              >
              Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]
              >
              Obviously, the [DJ AM Mix] is the Remix. Why obviously? Well, it is the last
              parenthetized expression in the string. But what if you left the Remix off?
              >
              Air - Cherry Blossom Girl (Because You Blossom)
              >
              NOW, "Cherry Blossom Girl" becomes the title, and "(Because You Blossom)"
              becomes the Remix. Why? Because it is the last parenthetized expression in
              the string. Now, even a human being could not tell the difference, because
              you are using a rule that states that the last the parenthetized expression
              in the string is the Remix. In other words, your rules for "remix" overlap
              your rules for "title." The only solution to this would be to further
              qualify the rules. That is, you would have to either restrict the rules for
              "title" to a certain pair of brackets, or restrict the rules for "remix" to
              a certain pair of brackets.
              >
              Thanks for the challenge!
              >
              --
              HTH,
              >
              Kevin Spencer
              Microsoft MVP
              >
              Printing Components, Email Components,
              FTP Client Classes, Enhanced Data Controls, much more.
              DSI PrintManager, Miradyne Component Libraries:http://www.miradyne.net
              >
              "Nightcrawl er" <thomas.zale... @gmail.comwrote in message
              >
              news:1179763631 .747550.209130@ y2g2000prf.goog legroups.com...
              >
              >
              >
              On May 21, 7:31 am, "Kevin Spencer" <unclechut...@n othinks.comwrot e:
              (?<artist>\w+)\ s+-\s+(?<title>\w+ )(?:\s+[\(\[](?<remix>\w+)[)\]])?
              >
              Explanation:
              There are 4 distinct parts to this:
              >
              (?<artist>\w+) Find a string of word characters. Captures to group "artist"
              >
              \s+-\s+ Followed by 1 or more whitespace characters, followed by a hyphen,
              followed by 1 or more whitespace characters
              >
              (?<title>\w+) Find a string of word characters. Captures to group "title"
              >
              (?:\s+[\(\[](?<remix>\w+)[)\]])?
              >
              Non-capturing group, of which there may be 0 or 1. Begins with 1 or more
              spaces, followed by 1 of the characters '(' or '['. This is followed by a
              named capturing group called "remix" which is defined as 1 or more word
              characters. This is followed by 1 of the characters ')' or ']'.
              >
              This assumes that there will always be an artist and a title, but that
              remix may be omitted.
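A minimal C# sketch of how this pattern behaves on multi-word names, assuming the stray spaces in the posted pattern are transmission artifacts. The helper class `WordOnlyPatternDemo` is hypothetical and not part of the thread. Because `\w+` cannot cross a space, the match can only begin at the last word before the hyphen, the title stops at its first word, and the optional remix group then has nothing to attach to.

```csharp
using System.Text.RegularExpressions;

public static class WordOnlyPatternDemo
{
    // The pattern from the post above, with the stray spaces removed (assumption).
    public const string Pattern =
        @"(?<artist>\w+)\s+-\s+(?<title>\w+)(?:\s+[\(\[](?<remix>\w+)[)\]])?";

    // Returns the three groups of the first match; a group that did not
    // participate in the match yields an empty string.
    public static (string Artist, string Title, string Remix) Parse(string fileName)
    {
        Match m = Regex.Match(fileName, Pattern);
        return (m.Groups["artist"].Value,
                m.Groups["title"].Value,
                m.Groups["remix"].Value);
    }
}
```

With this sketch, `WordOnlyPatternDemo.Parse("JP Conley - Karma Moods [Soul Mix]")` returns `("Conley", "Karma", "")`.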
              >
              --
              HTH,
              >
              Kevin Spencer
              Microsoft MVP
              >
              Printing Components, Email Components,
              FTP Client Classes, Enhanced Data Controls, much more.
              DSI PrintManager, Miradyne Component Libraries:http://www.miradyne.net
              >
              Thank you. I tried your regex on a sample of 10 titles and it didn't
              really work. Here are my ten samples that I used:
              >
              From P-60 - Sinking With The Fall
              JP Conley - Karma Moods [Soul Mix]
              Soul Beats - Wherever You Go... [Love Mix]
              Thievery Corporation - Doors Of Perception
              Thievery Corporation - Holographic Universe
              Ananda Project - Universal Love [Jay-J's Shifted Up Mix]
              Collective Sound Members - Switch
              Cool Touch - Gravity
              Dennis Ferrer - Church Lady Part 2 [Bryan Cox Remix]
              Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]
              >
              After I ran the regular expression on the titles above, here is what
              the groups caught:
              >
              Artist
              -----------------------------
              60
              Conley
              Beats
              Corporation
              Corporation
              Project
              Members
              Touch
              Ferrer
              Air
              >
              Title
              -----------------------------
              Sinking
              Karma
              Wherever
              Doors
              Holographic
              Universal
              Switch
              Gravity
              Church
              Cherry
              >
              Remix
              -----------------------------
              Nothing was captured here
              >
              Please let me know what is wrong.
              >
              Thanks
              Kevin,

              Wow! Thanks!

              Do you think it would make more sense to ask the user to define how
              they name their files in simple terms and then build the regex in code
              instead? I can see how it can get complicated trying to guess.

              Say someone told me they use %artist% - %title% [%remix%], how would
              you build a regex for that case for instance?

              On another note, do you accept freelance work?

              Please let me know.

              Thanks


              • Kevin Spencer

                #8
                Re: C# and regex issue

                Do you think it would make more sense to ask the user to define how
                they name their files in simple terms and then build the regex in code
                instead? I can see how it can get complicated trying to guess.
                Well, regular expressions are "simply" reflections of rules that define
                patterns. In order to create an effective regular expression, you need to
                define the rules that identify the patterns. In your case, the rules were
                fairly simple, but a little too loose:

                Artist pattern:
                Any string that ends with a " - " character sequence, followed by a Title
                pattern, and optionally followed by a Remix pattern.

                Title pattern:
                Any string that is preceded by an Artist pattern, optionally followed by a
                Remix pattern.

                Remix pattern:
                Any string preceded by an Artist and a Title pattern, enclosed in round or
                square brackets.

                You'll note that ALL of the pattern rules have to be met for a match. This
                is because each of these patterns is a *part* of a match. A match is not a
                match unless all parts match. That is why the Artist pattern includes the
                assertion that it is followed by a Title pattern and an optional Remix
                pattern, and so on.

                The "looseness" problem occurs because of the number of "Any string" rules
                in the rules. This allows, for example, a Title to end with a string
                enclosed in round or square brackets, which, combined with the absence of a
                Remix pattern (allowed), causes the parenthesized end of the Title to be
                identified as the absent Remix.
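The three rules above can be sketched as one anchored pattern. This is only an illustration under assumptions: the class name `LooseRulesDemo` and the exact pattern are hypothetical, not taken from the thread, and the Remix is assumed to be a bracketed group at the very end of the name. It makes the looseness visible: when the Remix is left off, a bracketed tail of the Title is reported as the Remix.

```csharp
using System.Text.RegularExpressions;

public static class LooseRulesDemo
{
    // Artist: anything up to the first " - ".
    // Title:  anything after it (lazy, so it yields to a trailing remix).
    // Remix:  an optional (...) or [...] group at the end of the string.
    private static readonly Regex Rules = new Regex(
        @"^(?<artist>.+?)\s+-\s+(?<title>.+?)(?:\s+[\(\[](?<remix>[^\)\]]+)[\)\]])?$");

    public static (string Artist, string Title, string Remix) Parse(string fileName)
    {
        Match m = Rules.Match(fileName);
        return (m.Groups["artist"].Value,
                m.Groups["title"].Value,
                m.Groups["remix"].Value);
    }
}
```

Parsing "Air - Cherry Blossom Girl (Because You Blossom) [DJ AM Mix]" keeps the parenthesized part in the title, but parsing "Air - Cherry Blossom Girl (Because You Blossom)" gives the title "Cherry Blossom Girl" and the remix "Because You Blossom", which is the overlap described above.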

                Because you're working with media lists, supplied by end users, you don't
                want the rules to be so complex that the users have trouble following them.
                This invites error by the users, and you'll probably have plenty of that
                anyway! So, what you need is to make the rules as loose as possible, while
                still enabling them to be parsed by a regular expression.

                As I said, one idea would be to require a difference in the brackets used in
                the Title and the Remix. Another would be to require a separator sequence,
                such as the " - " character sequence, between the Title and Remix. This
                would allow the user to continue to use any character sequence in all of
                them.
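As a rough sketch of the separator idea, assuming the Remix (when present) follows a second " - " sequence: the class name `SeparatorRuleDemo` and the pattern are hypothetical illustrations, not something posted in the thread.

```csharp
using System.Text.RegularExpressions;

public static class SeparatorRuleDemo
{
    // Hypothetical rule: "Artist - Title" or "Artist - Title - Remix".
    // Brackets no longer carry meaning, so they may appear anywhere in the title.
    private static readonly Regex Rule = new Regex(
        @"^(?<artist>.+?)\s+-\s+(?<title>.+?)(?:\s+-\s+(?<remix>.+))?$");

    public static (string Artist, string Title, string Remix) Parse(string fileName)
    {
        Match m = Rule.Match(fileName);
        return (m.Groups["artist"].Value,
                m.Groups["title"].Value,
                m.Groups["remix"].Value);
    }
}
```

A hyphen without surrounding whitespace, as in "Jay-J's", is not treated as a separator. The trade-off is that a title which itself contains " - " would be split at that point.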

                There are other possibilities as well. If you keep the goal in mind (rules
                as loose as possible while retaining clarity for parsing), you can come up
                with your own if you like. For example, thinking about it a bit more, the
                rule that the Remix may be omitted, combined with the "Any string" rule for
                the Title could also be overcome by a rule that the Remix is required, and
                if the user doesn't have Remix information, he/she could simply add a pair
                of empty brackets to the end:

                Air - Cherry Blossom Girl (Because You Blossom) [ ]
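With the Remix brackets always present, the name has the same shape as the user template asked about earlier, "%artist% - %title% [%remix%]". One possible way to turn such a template into a regex is sketched below. The class `TemplateRegexBuilder` and its `Build` method are hypothetical names, and the sketch assumes placeholders are written as %name% and that every literal character in the template must appear exactly as typed.

```csharp
using System.Text.RegularExpressions;

public static class TemplateRegexBuilder
{
    // Converts a template such as "%artist% - %title% [%remix%]" into an
    // anchored regex with one named group per %placeholder%.
    public static Regex Build(string template)
    {
        // Escape the literal parts so "[", "(" and similar lose their regex meaning.
        string escaped = Regex.Escape(template);

        // Swap each %name% placeholder for a lazy named group.
        string body = Regex.Replace(
            escaped,
            @"%(\w+)%",
            m => "(?<" + m.Groups[1].Value + ">.*?)");

        // Anchor both ends so the whole file name has to fit the template.
        return new Regex("^" + body + "$");
    }
}
```

For example, `TemplateRegexBuilder.Build("%artist% - %title% [%remix%]").Match("Air - Cherry Blossom Girl (Because You Blossom) [ ]")` succeeds with the parenthesized text kept in the title group, and the remix group holds only the space between the brackets, so trimming it gives an empty remix.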
                On another note, do you accept freelance work?
                I don't, but my company might hire me out for a contract job. Let me know,
                and I can send you some contact information.

                --
                HTH,

                Kevin Spencer
                Microsoft MVP

                Printing Components, Email Components,
                FTP Client Classes, Enhanced Data Controls, much more.
                DSI PrintManager, Miradyne Component Libraries:


                "Nightcrawler" <thomas.zaleski@gmail.com> wrote in message
                news:1179895575.403548.67660@q75g2000hsh.googlegroups.com...
                On May 22, 8:40 am, "Kevin Spencer" <unclechut...@nothinks.com> wrote:
                My apologies, Nightcrawler.
                >
                Revised Standard Version:
                >
                (?<artist>.+)(?=(\s+-\s+))\1(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))
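A small C# sketch for trying the revised pattern, assuming the spaces scattered through the posted pattern are forum artifacts and should be removed. The wrapper class `RevisedPatternDemo` is hypothetical. Note that the remix capture includes its brackets and the title capture can keep the space in front of them, so the sketch trims the artist and title.

```csharp
using System.Text.RegularExpressions;

public static class RevisedPatternDemo
{
    // The revised pattern with the stray spaces removed (assumption).
    // \1 refers to the unnamed group inside the lookahead, i.e. the " - "
    // separator, because .NET numbers unnamed groups before named ones.
    private static readonly Regex Revised = new Regex(
        @"(?<artist>.+)(?=(\s+-\s+))\1(?:(?<title>.+)??(?<remix>(?:\([^\)]+\)|\[[^\]]+\]))|(?<title>.+))");

    public static (string Artist, string Title, string Remix) Parse(string fileName)
    {
        Match m = Revised.Match(fileName);
        // Trim because the title capture can end with the space before the bracket.
        return (m.Groups["artist"].Value.Trim(),
                m.Groups["title"].Value.Trim(),
                m.Groups["remix"].Value);
    }
}
```

The pattern relies on .NET allowing the same group name ("title") in both alternatives, so it may not port unchanged to other regex engines.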
                <snip>
