Stripping C-style comments using a Python regexp

  • lorinh@gmail.com

    Stripping C-style comments using a Python regexp

    Hi Folks,

    I'm trying to strip C/C++ style comments (/* ... */ or // ) from
    source code using Python regexps.

    If I don't have to worry about comments embedded in strings, it seems
    pretty straightforward (this is what I'm using now):

    import re

    cpp_pat = re.compile(r"""
        /\* .*? \*/  |   # C comments
        // [^\n\r]*      # C++ comments
        """, re.S | re.X)
    s = open('myprog.cpp').read()
    stripped = cpp_pat.sub(' ', s)

    However, the sticking point is dealing with tokens like /* embedded
    within a string:

    const char *mystr = "This is /*trouble*/";

    I've inherited a working Perl script, which I'd like to reimplement in
    Python so that I don't have to spawn a new Perl process in my Python
    program each time I want to strip comments from a file. The Perl script
    looks like this:

    #!/usr/bin/perl -w

    $/ = undef; # no line delimiter
    $_ = <>; # read entire file

    s! ((['"]) (?: \\. | .)*? \2) | # skip quoted strings
    /\* .*? \*/ | # delete C comments
    // [^\n\r]* # delete C++ comments
    ! $1 || ' ' # change comments to a single space
    !xseg; # ignore white space, treat as single line
    # evaluate result, repeat globally
    print;

    The Perl regexp above uses some sort of conditional to deal with this,
    by replacing a quoted string with itself if the initial match is a
    quoted string. Is there some equivalent feature in Python regexps?

    Lorin

  • Lonnie Princehouse

    #2
    Re: Stripping C-style comments using a Python regexp

    > Is there some equivalent feature in Python regexps?

    import re

    cpp_pat = re.compile(r'(/\*.*?\*/)|(".*?")', re.S)

    def subfunc(match):
        # Keep quoted strings (group 2); drop comments (group 1).
        if match.group(2):
            return match.group(2)
        else:
            return ''

    stripped_c_code = cpp_pat.sub(subfunc, c_code)


    ...I suppose this is what the Perl code might do, but I'm not sure,
    since trying to read it hurts my brain...


    • Jeff Epler

      #3
      Re: Stripping C-style comments using a Python regexp

      #------------------------------------------------------------------------
      import re, sys

      def q(c):
          """Returns a regular expression that matches a region delimited by c,
          inside which c may be escaped with a backslash"""

          return r"%s(\\.|[^%s])*%s" % (c, c, c)

      double_quoted_string = q('"')
      single_quoted_string = q("'")
      c_comment = r"/\*.*?\*/"
      # Stop before the newline so a C++ comment never joins two lines
      # (joining would break a preprocessor directive that ends in a comment).
      cxx_comment = r"//[^\n]*"

      rx = re.compile("|".join([double_quoted_string, single_quoted_string,
                                c_comment, cxx_comment]), re.DOTALL)

      def replace(x):
          x = x.group(0)
          if x.startswith("/"): return ' '
          return x

      result = rx.sub(replace, sys.stdin.read())
      sys.stdout.write(result)
      #------------------------------------------------------------------------

      The regular expression matches ""-strings, ''-character-constants,
      c-comments, and c++-comments. The replace function returns ' ' (space)
      when the matched thing was a comment, or the original thing otherwise.
      Depending on your use for this code, replace() should return as many
      '\n's as are in the matched thing, or ' ' otherwise, so that line
      numbers remain unchanged.
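
A minimal sketch of that line-preserving variant, not tested beyond a few inputs. The names quoted, token_rx, replace_keep_lines and strip_comments are made up for illustration, and two things are assumptions rather than part of the code above: the string pattern excludes a bare backslash from the character class, and the C++ comment pattern stops before the newline.

```python
import re

def quoted(c):
    """Regex for a region delimited by c; a backslash escapes the next character."""
    return r"%s(?:\\.|[^%s\\])*%s" % (c, c, c)

# Same four alternatives as above: strings, character constants,
# C comments, C++ comments (the last one leaves the newline alone).
token_rx = re.compile(
    "|".join([quoted('"'), quoted("'"), r"/\*.*?\*/", r"//[^\n]*"]),
    re.DOTALL,
)

def replace_keep_lines(match):
    text = match.group(0)
    if text.startswith("/"):
        # Comment: emit one newline per newline it contained,
        # or a single space if it was all on one line.
        return "\n" * text.count("\n") or " "
    return text  # string or character literal, returned unchanged

def strip_comments(source):
    return token_rx.sub(replace_keep_lines, source)
```

With this replacement a comment spanning several lines collapses to the same number of line breaks, so compiler messages for the stripped file should still point at the right lines of the original.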

      Basically, the regular expression is a tokenizer, and replace() chooses
      what to do with each recognized token. Things not recognized as tokens
      by the regular expression are left unchanged.
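
The tokenizer view can be made explicit with named groups, so the code that decides what to do with a token looks at its kind instead of its first character. This is only a sketch: the group names (string, char, block, line) and the helpers tokens and strip_comments are hypothetical, and the patterns are a slightly stricter variant of the ones above.

```python
import re

# One named alternative per token kind; names are illustrative only.
TOKEN_RX = re.compile(
    r"""
      (?P<string>  " (?: \\. | [^"\\] )* " )   # double-quoted string literal
    | (?P<char>    ' (?: \\. | [^'\\] )* ' )   # character constant
    | (?P<block>   /\* .*? \*/ )               # C comment
    | (?P<line>    // [^\n]* )                 # C++ comment
    """,
    re.DOTALL | re.VERBOSE,
)

def tokens(source):
    """Yield (kind, text) pairs; text between matches is reported as 'code'."""
    pos = 0
    for m in TOKEN_RX.finditer(source):
        if m.start() > pos:
            yield "code", source[pos:m.start()]
        # lastgroup is the name of the alternative that matched.
        yield m.lastgroup, m.group(0)
        pos = m.end()
    if pos < len(source):
        yield "code", source[pos:]

def strip_comments(source):
    # Replace comment tokens with a space, pass everything else through.
    return "".join(" " if kind in ("block", "line") else text
                   for kind, text in tokens(source))
```

Anything the pattern does not recognize comes back as a "code" chunk, which corresponds to the "left unchanged" behavior of rx.sub() described above.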

      Jeff
      PS this is the test file I used:
      /* ... */ xyzzy;
      456 // 123
      const char *mystr = "This is /*trouble*/";
      /* * */
      /* /* */
      // /* /* */
      /* // /* */
      /*
      * */



        • lorinh@gmail.com

          #5
          Re: Stripping C-style comments using a Python regexp

          Neat! I didn't realize that re.sub could take a function as an
          argument. Thanks.

          Lorin
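
For reference, a rough Python counterpart of the Perl script, using a function as the replacement argument to sub(). It is a sketch under the assumption that the Perl pattern carries over unchanged (Python's re module supports the \2 backreference and the lazy quantifier used here); the names pat, keep_string_or_space and strip_comments are made up for illustration.

```python
import re

# Group 1 captures a quoted string; the comment alternatives capture nothing.
pat = re.compile(
    r"""
      ( (['"]) (?: \\. | . )*? \2 )   # group 1: quoted string, kept as-is
    | /\* .*? \*/                     # C comment
    | // [^\n\r]*                     # C++ comment
    """,
    re.S | re.X,
)

def keep_string_or_space(match):
    # Plays the role of the Perl replacement expression  $1 || ' ' :
    # group(1) is None whenever a comment alternative matched.
    return match.group(1) or " "

def strip_comments(source):
    return pat.sub(keep_string_or_space, source)

print(strip_comments('const char *mystr = "This is /*trouble*/"; /* gone */'))
```

The string literal survives because it matches first and is returned unchanged, while the trailing comment becomes a single space.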
