Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

  • Synonymous

    Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

    Hello,

    Can regular expressions compare file names to one another? It seems RE
    can only compare against input I give it, while I want it to compare
    the names amongst themselves and give me matches if the first x
    characters are the same.

    For example:

    cccat
    cccap
    cccan
    dddfa
    dddfg
    dddfz

    Would result in the 'ddd' and the 'ccc' being grouped together if I
    specified it to look for a match of the first 3 characters.

    What I am trying to do is build a script that will automatically
    create directories based on duplicates like this, starting with, say, 10
    characters and going down to 1. This way "Vacation1.jpg,
    Vacation2.jpg" would be sent to their own directory (if I specify that the
    first 8 characters must match) and "Cat1.jpg, Cat2.jpg" would
    (with 3) as well.

    Thanks for your help and interest!

    S M
  • tiissa

    #2
    Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

    Synonymous wrote:
    > Can regular expressions compare file names to one another? It seems RE
    > can only compare with input I give it, while I want it to compare
    > amongst itself and give me matches if the first x characters are
    > similar.
    Do you have to use regular expressions?

    If you know the number of characters to match can't you just compare slices?

    In [1]: f1,f2='cccat','cccap'

    In [2]: f1[:3]
    Out[2]: 'ccc'

    In [3]: f1[:3]==f2[:3]
    Out[3]: True

    It seems to me you just have to compare each file to the next one (after
    having sorted your list).
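The sort-then-compare idea above can be sketched as a small function. This is only a minimal sketch, assuming a fixed prefix length `n`; the name `group_adjacent` is made up for illustration and does not come from the thread.

```python
def group_adjacent(names, n):
    """Group names whose first n characters match, by comparing each
    sorted name with the one just before it."""
    groups = []
    previous = None
    for name in sorted(names):
        if previous is not None and name[:n] == previous[:n]:
            groups[-1].append(name)   # same prefix as the previous name
        else:
            groups.append([name])     # prefix changed: start a new group
        previous = name
    return groups


files = ['cccat', 'dddfa', 'cccap', 'dddfz', 'cccan', 'dddfg']
print(group_adjacent(files, 3))
```

Sorting first is what makes a single pass enough: names that share a prefix end up next to each other, so each name only needs to be compared with its predecessor.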


    • tiissa

      #3
      Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

      tiissa wrote:
      > If you know the number of characters to match can't you just compare
      > slices?
      If you don't, you can still do it by hand:

      In [7]: def cmp(s1,s2):
      ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
      ....:     diff_index=''.join(diff_map).find(chr(True))
      ....:     if -1==diff_index:
      ....:         return min(len(s1), len(s2))
      ....:     else:
      ....:         return diff_index
      ....:

      In [8]: cmp('cccat','cccap')
      Out[8]: 4

      In [9]: cmp('ccc','cccap')
      Out[9]: 3

      In [10]: cmp('cccat','dddfa')
      Out[10]: 0
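The same measurement (how many leading characters two names share) can be written without the `chr` trick. The sketch below is one possible alternative, not the poster's code; `common_prefix_len` is a hypothetical name chosen so that it does not shadow Python 2's built-in `cmp`. `os.path.commonprefix` is a real standard-library function that returns the shared prefix itself.

```python
import os.path


def common_prefix_len(s1, s2):
    """Return how many leading characters s1 and s2 have in common."""
    count = 0
    for a, b in zip(s1, s2):   # zip stops at the end of the shorter string
        if a != b:
            break
        count += 1
    return count


print(common_prefix_len('cccat', 'cccap'))   # 4
print(common_prefix_len('ccc', 'cccap'))     # 3
print(common_prefix_len('cccat', 'dddfa'))   # 0

# The standard library can return the shared prefix of a whole list.
print(os.path.commonprefix(['cccat', 'cccap', 'cccan']))   # ccca
```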


      • Kent Johnson

        #4
        Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

        tiissa wrote:
        > Synonymous wrote:
        >
        >> Can regular expressions compare file names to one another? It seems RE
        >> can only compare with input I give it, while I want it to compare
        >> amongst itself and give me matches if the first x characters are
        >> similar.
        >
        > Do you have to use regular expressions?
        >
        > If you know the number of characters to match can't you just compare
        > slices?
        >
        > It seems to me you just have to compare each file to the next one (after
        > having sorted your list).

        itertools.groupby() can do the comparing and grouping:

        >>> import itertools
        >>> def groupbyPrefix(lst, n):
        ...     lst.sort()
        ...     def key(item):
        ...         return item[:n]
        ...     return [ list(items) for k, items in itertools.groupby(lst, key=key) ]
        ...
        >>> names = ['cccat', 'cccap', 'cccan', 'cccbt', 'ccddd', 'dddfa', 'dddfg', 'dddfz']
        >>> groupbyPrefix(names, 3)
        [['cccan', 'cccap', 'cccat', 'cccbt'], ['ccddd'], ['dddfa', 'dddfg', 'dddfz']]
        >>> groupbyPrefix(names, 2)
        [['cccan', 'cccap', 'cccat', 'cccbt', 'ccddd'], ['dddfa', 'dddfg', 'dddfz']]
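The original question asked for prefix lengths to be tried from 10 characters down to 1. A rough sketch of how `itertools.groupby` might be applied repeatedly for that is shown here; the function name `group_by_longest_prefix`, its parameters, and the rule that only groups of two or more names count are assumptions made for illustration, not something stated in the thread.

```python
from itertools import groupby


def group_by_longest_prefix(names, longest=10, shortest=1):
    """Try prefix lengths from `longest` down to `shortest`.

    A name is assigned to the first (longest) prefix it shares with at
    least one other name. Returns (groups, leftover), where groups maps
    each prefix to its names and leftover lists the unmatched names.
    """
    remaining = sorted(names)
    groups = {}
    for n in range(longest, shortest - 1, -1):
        leftover = []
        for prefix, items in groupby(remaining, key=lambda name: name[:n]):
            items = list(items)
            if len(items) > 1:
                groups[prefix] = items    # two or more names share this prefix
            else:
                leftover.extend(items)    # retry with a shorter prefix
        remaining = leftover              # still sorted, so groupby keeps working
    return groups, remaining


names = ['Vacation1.jpg', 'Vacation2.jpg', 'Cat1.jpg', 'Cat2.jpg', 'notes.txt']
groups, leftover = group_by_longest_prefix(names)
print(groups)
print(leftover)
```

With these names, the two vacation pictures are grouped at length 8 and the two cat pictures at length 3, while `notes.txt` is left over because nothing shares even its first character.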

        Kent


        • Synonymous

          #5
          Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

          tiissa <tiissa@nonfree.fr> wrote in message news:<42623ba8$0$10322$636a15ce@news.free.fr>...
          > tiissa wrote:
          > > If you know the number of characters to match can't you just compare
          > > slices?
          > If you don't, you can still do it by hand:
          >
          > In [7]: def cmp(s1,s2):
          > ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
          > ....:     diff_index=''.join(diff_map).find(chr(True))
          > ....:     if -1==diff_index:
          > ....:         return min(len(s1), len(s2))
          > ....:     else:
          > ....:         return diff_index
          > ....:
          >
          > In [8]: cmp('cccat','cccap')
          > Out[8]: 4
          >
          > In [9]: cmp('ccc','cccap')
          > Out[9]: 3
          >
          > In [10]: cmp('cccat','dddfa')
          > Out[10]: 0

          I will look at that, although if I have 300 images I don't want to type
          all the comparisons (In [9]: cmp('ccc','cccap')) by hand; it would
          just be easier to sort them then :).

          I got it somewhat close to working in visual basic:

          If Left$(Cells(iRow, 1).Value, Count) = Left$(Cells(iRow - 1, 1).Value, Count) Then

          When walking down the list, it takes the leftmost 'Count'
          characters of the current cell, compares them with the leftmost
          'Count' characters of the cell in the row above, and then does the task (i.e.
          makes a directory, moves the files) if they are equal.

          I will look for a Left$(str) function in Python that returns the first X
          characters :)).

          Thank you for your help!

          Synonymous


          • John Machin

            #6
            Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

            On 17 Apr 2005 18:12:19 -0700, sm.synonymous@gmail.com (Synonymous)
            wrote:

            >
            >I will look for a Left$(str) function that looks at the first X
            >characters for python :)).
            >

            Wild goose chase alert! AFAIK there isn't one. Python uses slice
            notation instead of left/mid/right/substr/whatever functions. I do
            suggest that instead of looking for such a beastie, you read this
            section of the Python Tutorial: 3.1.2 Strings.

            Then, if you think that that was a good use of your time, you might
            like to read the *whole* tutorial :))

            HTH,

            John


            • Dennis Lee Bieber

              #7
              Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

              On 17 Apr 2005 18:12:19 -0700, sm.synonymous@gmail.com (Synonymous)
              declaimed the following in comp.lang.python:

              >
              > I will look for a Left$(str) function that looks at the first X
              > characters for python :)).
              >

              BASIC's
              Left$(str, x)

              is essentially Python's
              str[:x]

              and a comparison of two would be
              somestring[:X] == anotherstring[:X]
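For someone coming from BASIC, the slice equivalents can be wrapped in small helpers while getting used to the notation. The functions `left`, `right` and `mid` below are hypothetical conveniences written for this illustration; Python itself does not provide them, and plain slices are the usual idiom.

```python
def left(s, n):
    """Like BASIC's Left$(s, n): the first n characters."""
    return s[:n]


def right(s, n):
    """Like BASIC's Right$(s, n): the last n characters."""
    return s[-n:] if n > 0 else ''   # s[-0:] would return the whole string


def mid(s, start, length):
    """Like BASIC's Mid$(s, start, length); BASIC counts positions from 1."""
    return s[start - 1:start - 1 + length]


print(left('Vacation1.jpg', 8))                       # Vacation
print(right('Vacation1.jpg', 3))                      # jpg
print(mid('Vacation1.jpg', 9, 1))                     # 1
print(left('Cat1.jpg', 3) == left('Cat2.jpg', 3))     # True
```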


              --
              > ============================================================== <
              > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
              > wulfraed@dm.net | Bestiaria Support Staff <
              > ============================================================== <
              > Home Page: <http://www.dm.net/~wulfraed/> <
              > Overflow Page: <http://wlfraed.home.netcom.com/> <


              • tiissa

                #8
                Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

                Synonymous wrote:
                > tiissa <tiissa@nonfree.fr> wrote in message news:<42623ba8$0$10322$636a15ce@news.free.fr>...
                >
                >>tiissa wrote:
                >>
                >>>If you know the number of characters to match can't you just compare
                >>>slices?
                >>
                >>If you don't, you can still do it by hand:
                >>
                >>In [7]: def cmp(s1,s2):
                >> ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
                >> ....:     diff_index=''.join(diff_map).find(chr(True))
                >> ....:     if -1==diff_index:
                >> ....:         return min(len(s1), len(s2))
                >> ....:     else:
                >> ....:         return diff_index
                >> ....:
                >
                > I will look at that, although if i have 300 images i dont want to type
                > all the comparisons (In [9]: cmp('ccc','cccap')) by hand, it would
                > just be easier to sort them then :).

                I didn't mean you had to type it by hand. I was thinking of writing a
                small script (as opposed to using something from the standard tools). It might
                look like:

                In [22]: def make_group(L):
                ....:     root,res='',[]
                ....:     for i in range(1,len(L)):
                ....:         if ''==root:
                ....:             root=L[i][:cmp(L[i-1],L[i])]
                ....:             if ''==root:
                ....:                 res.append((L[i-1],[L[i-1]]))
                ....:             else:
                ....:                 res.append((root,[L[i-1],L[i]]))
                ....:         elif len(root)==cmp(root,L[i]):
                ....:             res[-1][1].append(L[i])
                ....:         else:
                ....:             root=''
                ....:     if ''==root:
                ....:         res.append((L[-1],[L[-1]]))
                ....:     return res
                ....:

                In [23]: L=['cccat','cccap','cccan','dddfa','dddfg','dddfz']

                In [24]: L.sort()

                In [25]: make_group(L)
                Out[25]: [('ccca', ['cccan', 'cccap', 'cccat']), ('dddf', ['dddfa',
                'dddfg', 'dddfz'])]


                However I guarantee no optimality in the number of classes (but, hey,
                that's when you don't specify the size of the prefix).
                (Actually, I guarantee nothing at all ;p)
                But in particular, a file can end up singled out in a group of its own:

                In [26]: make_group(['cccan','cccap','cccat','cccb'])
                Out[26]: [('ccca', ['cccan', 'cccap', 'cccat']), ('cccb', ['cccb'])]


                It is a matter of choice: either you want to specify by hand the size of
                the prefix and you'd rather look at itertools as pointed out by Kent, or
                you don't and a variation with the above code might do the job.


                • Synonymous

                  #9
                  Re: Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

                  Hello!

                  I was trying to create a program to search for the largest common
                  substring among filenames in a directory, then move the files
                  into a directory named after that substring. I have succeeded, with help, in doing so and
                  here is the code.

                  Thanks for your help!

                  --- Code ---

                  # This program was created with feedback from: smeghead and sirup plus
                  # aum of I2P; and also tiissa and John Machin of comp.lang.python
                  # Thank you very much.
                  # I still get the odd error in this, but it was 1 out of 2500 files
                  # successfully sorted. Make sure you have a directory under c:/test/
                  # called 'aa' and have your files in c:/test/
                  # I release this code into the public domain :o), send feedback to
                  # sm.synonymous@gmail.com
                  import pickle
                  import os
                  import shutil
                  os.chdir('/test')
                  aaaa=2
                  aa='aa'
                  x=0
                  y=20
                  # y is the prefix length: start at 20 characters and stop after 3
                  while y != 2:
                      print(y)
                      List = []
                      for fileName in os.listdir('/test/'):
                          Directory = fileName
                          List.append(Directory)
                      # sentinel entries, intended to give every real name a neighbour on each side
                      List.append("A111111111111")
                      List.sort()
                      List.append("Z111111111111")
                      ListLength = len(List) - 1
                      x = 0
                      while x < ListLength:
                          ListLength = len(List) - 1
                          b = List[x]
                          c = List[x + 1]
                          backward1 = List[x - 1]
                          d = b[:y]
                          e = c[:y]
                          backward2 = backward1[:y]
                          f = str(d)
                          g = str(e)
                          backward3 = str(backward2)
                          if f==g:
                              # same prefix as the next name: move into aa/<prefix>
                              if os.path.isdir(aa+"/"+f) == True:
                                  shutil.move(b,aa+"/"+f)
                              else:
                                  os.mkdir(aa+"/"+f)
                                  #os.mkdir(f)
                                  shutil.move(b,aa+"/"+f)
                          else:
                              if f==backward3:
                                  # same prefix as the previous name: last member of its group
                                  if os.path.isdir(aa+"/"+f) == True:
                                      shutil.move(b,aa+"/"+f)
                                  else:
                                      os.mkdir(aa+"/"+f)
                                      #os.mkdir(f)
                                      shutil.move(b,aa+"/"+f)
                              else:
                                  aaaa=3
                          x = x + 1
                      y = y - 1

                  --- End Code ---
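For comparison, the same task could be written more compactly with `itertools.groupby`, as suggested earlier in the thread. What follows is only a sketch for Python 3 under stated assumptions: the function names `move_groups` and `sort_directory` and their parameters are invented here, only regular files are considered (so an existing target directory such as 'aa' is skipped), and prefixes that are not valid directory names on the target file system are not handled.

```python
import os
import shutil
from itertools import groupby


def move_groups(src, dest, n):
    """Move files in `src` that share their first n characters into
    dest/<prefix>/. Files whose prefix is unique stay where they are."""
    names = sorted(
        name for name in os.listdir(src)
        if os.path.isfile(os.path.join(src, name))   # skip subdirectories
    )
    moved = {}
    for prefix, items in groupby(names, key=lambda name: name[:n]):
        items = list(items)
        if len(items) < 2:
            continue                                 # nothing to group with
        target = os.path.join(dest, prefix)
        os.makedirs(target, exist_ok=True)
        for name in items:
            shutil.move(os.path.join(src, name), os.path.join(target, name))
        moved[prefix] = items
    return moved


def sort_directory(src, dest, longest=20, shortest=3):
    """Repeat move_groups for prefix lengths from longest down to shortest,
    mirroring the 20-down-to-3 loop of the script above."""
    for n in range(longest, shortest - 1, -1):
        move_groups(src, dest, n)
```

A call such as `sort_directory('/test', '/test/aa')` would correspond to the layout described in the script's comments. Trying it on a copy of the files first seems prudent, since moves are not undone.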

