Starting at the beginning...

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • hjnathan
    New Member
    • Feb 2008
    • 7

    Starting at the beginning...

    Hello...

    I m very new to programming but not to computers. I am trying to expand my skills beyond what I do in Excel. I started working with Python about a month ago. I feel things are close to clicking for me but I need a bit of a nudge from an expert to get me over the hump.

    I am not looking for help writing a script per se, but actually looking for insight on how to approach my project. I have been reading about tuples and lists and pickles and shelves and I find myself a bit perplexed on which way to proceed.

    The project starts with a .csv file that I download from a website. There are 14 columns of data and roughly 1500 rows or records.

    Initially, I need to read the file, reorder the columns, strip some characters from one column, compare the values of 2 different fields and highlight a deviation of greater than 20%, sum the value of one column and print revised records to a file.

    I have been playing around with and have created pickle files and shelve files but have had problems visualizing the order in which to work this out...do I even want to go that route or should I just be looking at re, maybe???

    Your thoughts would be appreciated!!
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    Python built-in module csv may be ideal for your application. Instead of that, the approach to this partly depends on the data itself. If your data has a header line, you can read all the data into a dictionary using the column headers as keys. I have a simple example. This will not work if any of your data has embedded commas (the data with embedded commas must be enclosed in quotes). Following is the sample file:
    Code:
    Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC
    24, 7/8,A325,3 1/2,,1 Hardened,No
    448, 7/8,A325,3 1/4,,1 Hardened,No
    484, 7/8,A325,3,,1 Hardened,No
    508, 7/8,A325,2 3/4,,1 Hardened,No
    243, 7/8,A325,2 1/2,,1 Hardened,No
    479, 7/8,A325,2 1/4,,1 Hardened,No
    2632, 7/8,A325,2,,1 Hardened,No
    3378, 7/8,A325,1 3/4,,1 Hardened,No
    8, 3/4,A325,2,,1 Hardened,No
    8, 7/8,A307,11,,,No
    24, 7/8,A307,9,,,No
    36, 7/8,A307,7,,,No
    The following code parses and compiles the data so it can be manipulated:[code=Python]fn = 'sample.csv'
    f = open(fn)
    # read first line of file
    s = f.readline().st rip()

    # the first line contains the column headers
    # use the column headers as dictionary keys
    hdrList = s.split(',')
    colList = [[] for i in range(len(hdrLi st))]
    for line in f:
    for i, item in enumerate(line. strip().split(' ,')):
    colList[i].append(item)
    f.close()

    # create the dictionary
    dd = dict(zip(hdrLis t, colList))

    # reorder the columnar data
    # in this case, the length of hdrList = 7
    # column numbers = 1-7
    # list indices begin with 0
    neworder = [1,3,5,2,4,7]
    newhdrList = [hdrList[i-1] for i in neworder]

    # compile the reordered data
    outList = [newhdrList[:]] + [[] for i in range(len(dd.va lues()[0]))]
    for key in newhdrList:
    for i, item in enumerate(dd[key]):
    outList[i+1].append(item)

    # print the data
    # also suitable for writing to file
    print '\n'.join([','.join([s for s in item]) for item in outList])
    # print the sum of column 'Count'
    print 'Total number of bolts = %d' % sum([int(i) for i in dd['Count']])[/code]Printed output:[code=Python]>>> Count,Type,Hd Wash,Dia.,Lengt h,TC
    24,A325,, 7/8,3 1/2,No
    448,A325,, 7/8,3 1/4,No
    484,A325,, 7/8,3,No
    508,A325,, 7/8,2 3/4,No
    243,A325,, 7/8,2 1/2,No
    479,A325,, 7/8,2 1/4,No
    2632,A325,, 7/8,2,No
    3378,A325,, 7/8,1 3/4,No
    8,A325,, 3/4,2,No
    8,A307,, 7/8,11,No
    24,A307,, 7/8,9,No
    36,A307,, 7/8,7,No
    Total number of bolts = 8272
    >>> [/code]HTH

    Comment

    • hjnathan
      New Member
      • Feb 2008
      • 7

      #3
      Thanks for the quick response!

      Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

      It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

      I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...corr ect?

      I do believe this may have been the nudge I needed...thanks very much!!
      Last edited by hjnathan; Feb 15 '08, 01:45 AM. Reason: not sure why I can't see what I posted...now I got it!!

      Comment

      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        Originally posted by hjnathan
        Thanks for the quick response!

        Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

        It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

        I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...corr ect?

        I do believe this may have been the nudge I needed...thanks very much!!
        You are very welcome. I am pleased hearing from someone with your enthusiasm.

        Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.[code=Python]compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]

        for i, item in enumerate(compL ist):
        diff = item[0]-item[1]
        if abs(diff) > 1:
        print 'Record %d difference is %0.4f' % (i, diff)[/code]

        Comment

        • hjnathan
          New Member
          • Feb 2008
          • 7

          #5
          Originally posted by bvdet
          You are very welcome. I am pleased hearing from someone with your enthusiasm.

          Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.[code=Python]compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]

          for i, item in enumerate(compL ist):
          diff = item[0]-item[1]
          if abs(diff) > 1:
          print 'Record %d difference is %0.4f' % (i, diff)[/code]
          Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison " code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??

          Comment

          • bvdet
            Recognized Expert Specialist
            • Oct 2006
            • 2851

            #6
            Originally posted by hjnathan
            Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison " code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??
            My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

            Count,Dia.,Type ,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
            24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
            448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
            ............... .....

            Comment

            • hjnathan
              New Member
              • Feb 2008
              • 7

              #7
              Originally posted by bvdet
              My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

              Count,Dia.,Type ,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
              24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
              448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
              ............... .....
              So, those should be keys then...I'l play with it some more. Thanks!

              Comment

              • hjnathan
                New Member
                • Feb 2008
                • 7

                #8
                Originally posted by hjnathan
                So, those should be keys then...I'l play with it some more. Thanks!
                Well I have spent a few hours trying to get the column comparison code to work.
                Here is the error I get...

                compList = [[float(item), float(dd['shipping_actua l'][i])] for i, item in enumerate(dd['shipping'])]

                for i, item in enumerate(compL ist):
                diff = item[0]-item[1]
                if abs(diff) > 1:
                print 'Record %d difference is %0.4f' % (i, diff)


                Here is a bit of my data...

                ['product_order_ qty', 'shipping_actua l', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocat ion', 'shipping', 'product_price' , 'ShipToAttn', 'number_boxes', 'product_ship_d ate', 'ship_method', 'ShipToCompany' , 'product_name', 'customer_name']

                26.66,30.31,649 R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                26.66,6.22,649R 2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                26.66,6.91,649R 2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
                26.66,30.31,649 R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                13.99,13.39,649 R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
                13.99,6.22,649R 2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
                13.99,6.22,649R 2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI


                ...and my (your) code...

                Code:
                fn = 'newtest.csv'
                f = open(fn)
                # read first line of file
                s = f.readline().strip()
                 
                # the first line contains the column headers
                # use the column headers as dictionary keys
                hdrList = s.split(',')
                colList = [[] for i in range(len(hdrList))]
                for line in f:
                    for i, item in enumerate(line.strip().split(',')):
                        colList[i].append(item)
                f.close()
                 
                # create the dictionary
                dd = dict(zip(hdrList, colList))
                print dd.keys() 
                
                
                # reorder the columnar data 
                # in this case, the length of hdrList = 15
                # column numbers = 1-15
                # list indices begin with 0
                neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]
                newhdrList = [hdrList[i-1] for i in neworder]
                 
                # compile the reordered data
                outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]
                for key in newhdrList:
                    for i, item in enumerate(dd[key]):
                        outList[i+1].append(item)        
                        
                # print the data
                # also suitable for writing to file
                print '\n'.join([','.join([s for s in item]) for item in outList])
                # print the sum of column 'Count'
                print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])
                
                compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]
                 
                for i, item in enumerate(compList):
                    diff = item[0]-item[1]
                    if abs(diff) > 1:
                        print 'Record %d difference is %0.4f' % (i, diff)

                I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

                If you could tell me why I am getting this error I would really appreciate it.

                Thanks again for your assistance!!

                Comment

                • bvdet
                  Recognized Expert Specialist
                  • Oct 2006
                  • 2851

                  #9
                  Originally posted by hjnathan
                  Well I have spent a few hours trying to get the column comparison code to work.
                  Here is the error I get...

                  compList = [[float(item), float(dd['shipping_actua l'][i])] for i, item in enumerate(dd['shipping'])]

                  for i, item in enumerate(compL ist):
                  diff = item[0]-item[1]
                  if abs(diff) > 1:
                  print 'Record %d difference is %0.4f' % (i, diff)


                  Here is a bit of my data...

                  ['product_order_ qty', 'shipping_actua l', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocat ion', 'shipping', 'product_price' , 'ShipToAttn', 'number_boxes', 'product_ship_d ate', 'ship_method', 'ShipToCompany' , 'product_name', 'customer_name']

                  26.66,30.31,649 R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                  26.66,6.22,649R 2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                  26.66,6.91,649R 2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
                  26.66,30.31,649 R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
                  13.99,13.39,649 R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
                  13.99,6.22,649R 2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
                  13.99,6.22,649R 2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI


                  ...and my (your) code...

                  Code:
                  fn = 'newtest.csv'
                  f = open(fn)
                  # read first line of file
                  s = f.readline().strip()
                   
                  # the first line contains the column headers
                  # use the column headers as dictionary keys
                  hdrList = s.split(',')
                  colList = [[] for i in range(len(hdrList))]
                  for line in f:
                      for i, item in enumerate(line.strip().split(',')):
                          colList[i].append(item)
                  f.close()
                   
                  # create the dictionary
                  dd = dict(zip(hdrList, colList))
                  print dd.keys() 
                  
                  
                  # reorder the columnar data 
                  # in this case, the length of hdrList = 15
                  # column numbers = 1-15
                  # list indices begin with 0
                  neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]
                  newhdrList = [hdrList[i-1] for i in neworder]
                   
                  # compile the reordered data
                  outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]
                  for key in newhdrList:
                      for i, item in enumerate(dd[key]):
                          outList[i+1].append(item)        
                          
                  # print the data
                  # also suitable for writing to file
                  print '\n'.join([','.join([s for s in item]) for item in outList])
                  # print the sum of column 'Count'
                  print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])
                  
                  compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]
                   
                  for i, item in enumerate(compList):
                      diff = item[0]-item[1]
                      if abs(diff) > 1:
                          print 'Record %d difference is %0.4f' % (i, diff)

                  I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

                  If you could tell me why I am getting this error I would really appreciate it.

                  Thanks again for your assistance!!
                  I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.

                  Comment

                  • hjnathan
                    New Member
                    • Feb 2008
                    • 7

                    #10
                    Originally posted by bvdet
                    I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.
                    I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
                    I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!

                    Comment

                    • bvdet
                      Recognized Expert Specialist
                      • Oct 2006
                      • 2851

                      #11
                      Originally posted by hjnathan
                      I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
                      I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!
                      I'm glad you were able to solve the problem. Thanks for the update.

                      Comment

                      Working...