String manipulation

  • marco.minerva@gmail.com

    String manipulation

    Hi all!

    I have a file that contains some expressions such as "kindest
    regard" and "yours sincerely". I must create a Python script that
    checks whether a text contains one or more of these expressions and,
    if so, replaces the spaces in the expression with the character
    "_". For example, the text

    Yours sincerely, Marco.

    must be transformed into:

    Yours_sincerely, Marco.

    Now I have written this code:

    filemw = codecs.open(sys.argv[1], "r", "iso-8859-1").readlines()
    filein = codecs.open(sys.argv[2], "r", "iso-8859-1").readlines()

    mw = ""
    for line in filemw:
        mw = mw + line.strip() + "|"

    mwfind_re = re.compile(r"^(" + mw + ")", re.IGNORECASE|re.VERBOSE)
    mwfind_subst = r"_"

    for line in filein:
        line = line.strip()
        if (line != ""):
            line = mwfind_re.sub(mwfind_subst, line)
            print line

    It correctly identifies the expressions, but doesn't replace the
    character in the right way. How can I do what I want?

    Thanks in advance.
    --
    Marco Minerva, marco.minerva@gmail.com


  • Alexander Schmolck

    #2
    Re: String manipulation


    All the code is untested, but should give you the idea.

    marco.minerva@gmail.com writes:
    > Hi all!
    >
    > I have a file that contains some expressions such as "kindest
    > regard" and "yours sincerely". I must create a Python script that
    > checks whether a text contains one or more of these expressions and,
    > if so, replaces the spaces in the expression with the character
    > "_". For example, the text
    >
    > Yours sincerely, Marco.
    >
    > must be transformed into:
    >
    > Yours_sincerely, Marco.
    >
    > Now I have written this code:
    >
    > filemw = codecs.open(sys.argv[1], "r", "iso-8859-1").readlines()
    > filein = codecs.open(sys.argv[2], "r", "iso-8859-1").readlines()
    >
    > mw = ""
    > for line in filemw:
    >     mw = mw + line.strip() + "|"

    One "|" too many. Generally, use join instead of many individual string +s.

    mwfind_re_string = "(%s)" % "|".join(line.strip() for line in filemw)

    > mwfind_re = re.compile(r"^(" + mw + ")", re.IGNORECASE|re.VERBOSE)

    mwfind_re = re.compile(mwfind_re_string, re.IGNORECASE)
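
To see why the extra "|" matters, here is a minimal sketch in Python 3 syntax (the thread's own code is Python 2). The phrase list and the variable names are made up for illustration. A trailing "|" leaves an empty final alternative, so the pattern can match the empty string at any position, while join only places the separator between items.

```python
import re

# Illustrative phrase list; in the thread these come from a file.
phrases = ["kindest regard", "yours sincerely"]

# Concatenating with a trailing "|" leaves an empty final alternative.
with_trailing_bar = ""
for phrase in phrases:
    with_trailing_bar = with_trailing_bar + phrase + "|"

# join() only puts the separator between items.
joined = "|".join(phrases)

trailing_re = re.compile("(" + with_trailing_bar + ")", re.IGNORECASE)
joined_re = re.compile("(" + joined + ")", re.IGNORECASE)

# The empty alternative lets the first pattern "match" text with no phrase in it.
print(trailing_re.search("nothing to see here") is not None)
print(joined_re.search("nothing to see here") is not None)
```

The first line printed is True (an empty match), the second is False.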
    > mwfind_subst = r"_"
    >
    > for line in filein:

    That doesn't work. What about "kindest\nregard"? I think you're best off
    reading the whole file in (don't forget to close the files, BTW).

    >     line = line.strip()
    >     if (line != ""):
    >         line = mwfind_re.sub(mwfind_subst, line)
    >         print line
    >
    > It correctly identifies the expressions, but doesn't replace the
    > character in the right way. How can I do what I want?

    Use the fact that you can also use a function as a substitution.

    print mwfind_re.sub(lambda match: match.group().replace(' ', '_'),
                        "".join(line.strip() for line in filein))

    'as
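
The reply above can be put together as a small self-contained sketch. It is written in Python 3 syntax, and the helper names `build_phrase_re` and `underscore_phrases` are hypothetical, not from the thread. It also adds `re.escape`, which the thread's code does not use, so that punctuation in a phrase is not read as regex syntax.

```python
import re

def build_phrase_re(phrases):
    """Compile one case-insensitive pattern matching any of the phrases."""
    # re.escape is an addition to the thread's code; it keeps characters
    # such as "." in a phrase from being interpreted as regex syntax.
    alternatives = "|".join(re.escape(p.strip()) for p in phrases if p.strip())
    return re.compile("(%s)" % alternatives, re.IGNORECASE)

def underscore_phrases(text, phrase_re):
    """Replace spaces with "_" inside every matched phrase."""
    # The replacement is a function: it receives each match object and
    # returns the text that should be put in its place.
    return phrase_re.sub(lambda match: match.group().replace(" ", "_"), text)

# Lines as they would come from the phrase file, newline included.
phrase_re = build_phrase_re(["kindest regard\n", "yours sincerely\n"])
print(underscore_phrases("Yours sincerely, Marco.", phrase_re))
# prints: Yours_sincerely, Marco.
```

Because the replacement function works on `match.group()`, the original capitalisation of the text is kept; only the spaces inside the matched phrase change.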


    • Alexander Schmolck

      #3
      Re: String manipulation

      Alexander Schmolck <a.schmolck@gmail.com> writes:
      > That doesn't work. What about "kindest\nregard"? I think you're best off
      > reading the whole file in (don't forget to close the files, BTW).

      I should have written "that may not always work, depending on whether the set
      phrases you're interested in can also span lines". If in doubt, it's better
      to assume they can.

      'as


      • marco.minerva@gmail.com

        #4
        Re: String manipulation

        On 4 Apr, 17:39, Alexander Schmolck <a.schmo...@gmail.com> wrote:
        > All the code is untested, but should give you the idea.

        Hi Alexander!

        Thank you very much, your code works perfectly!

        --
        Marco Minerva, marco.minerva@gmail.com



        • Alexander Schmolck

          #5
          Re: String manipulation

          marco.minerva@gmail.com writes:
          > Thank you very much, your code works perfectly!

          One thing I forgot: you might want to make the whitespace handling a bit more
          robust/general, e.g. by using something along the lines of

          set_phrase.replace(' ', r'\w+')

          'as


          • marco.minerva@gmail.com

            #6
            Re: String manipulation

            On 4 Apr, 21:47, Alexander Schmolck <a.schmo...@gmail.com> wrote:
            > marco.mine...@gmail.com writes:
            > > Thank you very much, your code works perfectly!
            >
            > One thing I forgot: you might want to make the whitespace handling a bit more
            > robust/general, e.g. by using something along the lines of
            >
            > set_phrase.replace(' ', r'\w+')
            >
            > 'as

            Hi!

            Thanks again... But where must I insert this instruction?

            --
            Marco Minerva, marco.minerva@gmail.com



            • Alexander Schmolck

              #7
              Re: String manipulation

              marco.minerva@gmail.com writes:
              > On 4 Apr, 21:47, Alexander Schmolck <a.schmo...@gmail.com> wrote:
              > > marco.mine...@gmail.com writes:
              > > > Thank you very much, your code works perfectly!
              > > One thing I forgot: you might want to make the whitespace handling a bit more
              > > robust/general, e.g. by using something along the lines of
              > >
              > > set_phrase.replace(' ', r'\w+')

              Oops, sorry I meant r'\s+'.

              > > 'as
              >
              > Hi!
              > Thanks again... But where must I insert this instruction?

              If you're sure the code already does what you want you can forget about my
              remark; I was thinking of transforming individual patterns like so: 'kindest
              regard' -> r'kindest\s+regard', but it really depends on the details of your
              spec, which I'm not familiar with.
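
One place such a transformation could go is where the phrases are turned into a regular expression. The following is only a sketch in Python 3 syntax; `phrase_to_pattern` and `underscore_match` are hypothetical names. It splits the phrase into words and joins them with `\s+` rather than calling `.replace(' ', r'\s+')`, so that each word can be passed through `re.escape` separately.

```python
import re

def phrase_to_pattern(set_phrase):
    """Turn a literal phrase into a regex in which any run of whitespace
    is accepted between the words."""
    words = set_phrase.split()
    return r"\s+".join(re.escape(word) for word in words)

def underscore_match(match):
    """Replace every whitespace run in the matched text with one "_"."""
    return re.sub(r"\s+", "_", match.group())

pattern = re.compile(phrase_to_pattern("kindest regard"), re.IGNORECASE)

for sample in ["kindest regard", "kindest\tregard", "kindest\nregard", "kindestregard"]:
    print(repr(sample), bool(pattern.search(sample)))

print(pattern.sub(underscore_match, "my kindest\nregard to you"))
```

The first three samples match and the last does not, because `\s+` requires at least one whitespace character. Note that the replacement function also has to handle tabs and newlines once the pattern accepts them, which is why it uses `re.sub` instead of `.replace(' ', '_')`.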

              For example you clearly want to do some amount of whitespace normalization
              (because you use ``.strip()``), but how much? The most extreme you could go is

              input = " ".join(file.read().split()) # all newlines, tabs, multiple spaces -> " "

              In which case you don't need to worry about modifying the patterns to take
              care of possible whitespace variations. Another possibility is that you
              specify the patterns you want to replace as regexps in the file e.g.

              \bkind(?:est)?\b\s+regard(?:s)?\b
              \byours,\b
              ...
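
Both options can be tried out quickly. The sketch below uses Python 3 syntax; the sample strings and the name `normalize_whitespace` are made up for illustration. The first part collapses all whitespace before matching, the second leaves the text alone and relies on a regexp like the first pattern listed above.

```python
import re

def normalize_whitespace(text):
    """Collapse newlines, tabs and runs of spaces into single spaces."""
    return " ".join(text.split())

# Option 1: normalize the input, then plain literal phrases are enough.
raw = "please send my kindest\n   regards,\tMarco"
print(normalize_whitespace(raw))

# Option 2: keep the text as it is and let the pattern describe the variation.
kind_regards_re = re.compile(r"\bkind(?:est)?\b\s+regard(?:s)?\b", re.IGNORECASE)
for sample in ["kindest regard", "kind regards", "kindest\nregards,",
               "mankind regards other species as inferior"]:
    print(repr(sample), bool(kind_regards_re.search(sample)))
```

The leading `\b` is what keeps "mankind regards" from matching: there is no word boundary between "man" and "kind".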

              In any case I'd suggest the following: think about what possible edge cases
              your input can contain and how you'd like to handle them; then write them up
              as unit tests (use doctest or unittest and StringIO) and finally modify your
              code until it passes all the tests. Here are some examples of possible test
              patterns:


              - """kindest regard,"""
              - """kindest regard"""
              - """kindest\tregard"""
              - """kind regards"""
              - """mankind regards other species as inferior"""
              - """... and please send your wife my kindest
              regards,"""
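
As a rough illustration, the test patterns above could be written up with unittest like this (Python 3 syntax). The function `underscore_phrases`, the pattern `PHRASE_RE` and every expected result are assumptions made for the sketch; the real expectations depend on the spec, for example on whether "kind regards" should count as a set phrase at all.

```python
import re
import unittest

# Hypothetical pattern and helper, defined here only so the tests can run.
PHRASE_RE = re.compile(r"\bkind(?:est)?\s+regards?\b", re.IGNORECASE)

def underscore_phrases(text):
    """Replace each whitespace run inside a matched phrase with one "_"."""
    return PHRASE_RE.sub(lambda m: re.sub(r"\s+", "_", m.group()), text)

class UnderscorePhrasesTest(unittest.TestCase):
    def test_phrase_followed_by_comma(self):
        self.assertEqual(underscore_phrases("kindest regard,"), "kindest_regard,")

    def test_bare_phrase(self):
        self.assertEqual(underscore_phrases("kindest regard"), "kindest_regard")

    def test_tab_between_words(self):
        self.assertEqual(underscore_phrases("kindest\tregard"), "kindest_regard")

    def test_variant_spelling(self):
        self.assertEqual(underscore_phrases("kind regards"), "kind_regards")

    def test_mankind_is_left_alone(self):
        text = "mankind regards other species as inferior"
        self.assertEqual(underscore_phrases(text), text)

    def test_phrase_spanning_two_lines(self):
        text = "... and please send your wife my kindest\nregards,"
        expected = "... and please send your wife my kindest_regards,"
        self.assertEqual(underscore_phrases(text), expected)

# Run the suite without exiting the interpreter afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnderscorePhrasesTest)
result = unittest.TextTestRunner().run(suite)
```

Writing the expected output down first forces a decision on each edge case before the regular expression is touched.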

              Finally, if you're looking for a programming exercise you could try the
              following: rather than working on strings and using regexps, work on a
              "stream" of words (i.e. ["kindest", "regards", ...]) and write your own code
              to match sequences of words.
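
One possible shape for that exercise, as a sketch only (Python 3 syntax; the function name and the data layout are invented here): walk through the list of words and, at each position, check whether the next few words spell out one of the phrases.

```python
def underscore_word_sequences(words, phrases):
    """Join runs of words that spell out a phrase with "_".

    words   -- list of tokens, e.g. the result of text.split()
    phrases -- list of phrases, each given as a list of lowercase words
    """
    result = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            candidate = words[i:i + len(phrase)]
            # Skip empty phrases; compare case-insensitively.
            if phrase and [w.lower() for w in candidate] == phrase:
                result.append("_".join(candidate))
                i += len(phrase)
                break
        else:
            # No phrase starts at this position: keep the word as it is.
            result.append(words[i])
            i += 1
    return result

phrases = [["kindest", "regards"], ["yours", "sincerely"]]
words = "Please send my kindest\nregards to all".split()
print(" ".join(underscore_word_sequences(words, phrases)))
# prints: Please send my kindest_regards to all
```

Splitting on whitespace takes care of newlines and tabs for free, but punctuation stays attached to the words, so "sincerely," would not match "sincerely" in this simple version.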

              'as

              p.s. BTW, I overlooked the ``.readlines()`` before, but you don't need it --
              files are iterable and you also want to hang on to the opened file object so
              that you can close it when you're done.
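
A small sketch of that last point, in Python 3 syntax: it uses the built-in `open` with an `encoding` argument instead of the thread's `codecs.open`, and a `with` block so the file is closed automatically. The temporary file only exists to make the example self-contained.

```python
import os
import tempfile

# Create a throwaway phrase file so the sketch can run on its own.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="iso-8859-1") as out:
    out.write("kindest regard\nyours sincerely\n")

# Iterate over the file object directly; no .readlines() is needed,
# and the with block closes the file when it ends.
with open(path, "r", encoding="iso-8859-1") as filemw:
    phrases = [line.strip() for line in filemw]

print(phrases)
print(filemw.closed)

os.remove(path)
```

Keeping a name for the file object (here `filemw`) is what makes it possible to close it, either explicitly or through `with`.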
