Hey all.
I can't figure out what is wrong with my script and would love some help. First off, the examples in my script are based on running a MapReduce job in Hadoop, and the part I am struggling with is the reduce step. My basic input looks something like this:
ID--VAL1--VAL2
41,0,1
41,1,0
41,1,0
46,0,1
46,0,1
46,1,0
46,1,0
Basically, I need to loop through each line and check whether its ID equals the ID from the previous line. If it does, I keep a running sum of both VAL1 and VAL2, and at the end of each ID I output those sums plus a total of VAL1 + VAL2.
So the output should be something like this:
ID--VAL1_SUM--VAL2_SUM--TOTAL
41,2,1,3
46,2,2,4
The script I have does exactly that, but only for the first set of IDs (41). It always leaves out the last set of IDs (46), so I am getting only this:
ID--VAL1_SUM--VAL2_SUM--TOTAL
41,2,1,3
Anyhow, any help would be appreciated.
Cheers,
Eric
PS: I am stuck using Python 2.3 on RedHat servers.
Code:
#!/usr/bin/python
import sys

TmpArr = []
OutArr = []
i = int(0)
j = int(0)
id = ""
VAL1 = int(0)
VAL2 = int(0)
TOTAL = int(0)

for line in sys.stdin:
    j += 1
    try:
        line = line.strip()
        TmpArr = line.split(',')
        if i == 0:
            #first loop always adds line to OutArr too
            OutArr = line.split(',')
            id = OutArr[0]
            VAL1 = VAL1 + int(OutArr[1])
            VAL2 = VAL2 + int(OutArr[2])
            i += 1
        else:
            #now check if the new line id = previous line..
            if TmpArr[0] == OutArr[0]:
                id = TmpArr[0]
                VAL1 = VAL1 + int(TmpArr[1])
                VAL2 = VAL2 + int(TmpArr[2])
                OutArr = line.split(',')
                i += 1
            else:
                #if the new line id != previous line.. then print sums...
                TOTAL = VAL1 + VAL2
                print( id + ',' + str(VAL1) + ',' + str(VAL2) + ',' + str(TOTAL) )
                OutArr = line.split(',')
                id = OutArr[0]
                VAL1 = int(OutArr[1])
                VAL2 = int(OutArr[2])
    except ValueError:
        break
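A likely cause, for what it's worth: the script only prints when it sees the ID change, so when the input runs out the sums for the last ID (46) are still sitting in VAL1 and VAL2 and never get printed. Below is a minimal sketch of one way around that, not a definitive fix. It assumes the input is already sorted by ID, which is what a streaming reducer would normally receive. `reduce_lines` is a hypothetical helper name, not something from the script above, and the code tries to stick to features that should work on Python 2.3.

```python
def reduce_lines(lines):
    """Sum VAL1 and VAL2 per ID for input that is already sorted by ID."""
    results = []
    current_id = None
    val1 = 0
    val2 = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(',')
        if current_id is not None and fields[0] != current_id:
            # ID changed: emit the finished group, then reset the sums
            results.append('%s,%d,%d,%d' % (current_id, val1, val2, val1 + val2))
            val1 = 0
            val2 = 0
        current_id = fields[0]
        val1 = val1 + int(fields[1])
        val2 = val2 + int(fields[2])
    # The step the loop above is missing: emit the last group once input ends
    if current_id is not None:
        results.append('%s,%d,%d,%d' % (current_id, val1, val2, val1 + val2))
    return results


# Sample rows copied from the input shown earlier in the post
sample = ['41,0,1', '41,1,0', '41,1,0',
          '46,0,1', '46,0,1', '46,1,0', '46,1,0']
for row in reduce_lines(sample):
    print(row)
# prints:
# 41,2,1,3
# 46,2,2,4
```

In the actual reducer, `sys.stdin` could be passed in place of `sample` (after `import sys`), since the function only needs something it can iterate over line by line.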