How to extract a key from a log in Python
I wrote some Python code to extract a key from a log. With the same log,
it works well on a single machine, but when I run it on Hadoop it fails.
I guess there is a bug in the way I use the regex. Can anyone give me some
comments? Does Hadoop not support regex?
The Python code aims to extract qry and rc, sum the values in rc, and
then print the result as qry query_count rc_count. When I run it on
Hadoop, it reports
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1.
From what I found on Google, this means there may be a bug in the mapper
code. So how can I fix it?
The log format looks like this:
NOTICE: 01-03 23:57:23: [a.cpp][b][222] show_ver=11 sid=ae1d esid=6WVj
uid=D1 a=20 qry=cars qid0=293 loc_src=4 phn=0 mid=0 wvar=c op=0 qry_src=0
op_type=1 src=110|120|111 at=60942 rc=3|1|1 discount=20 indv_type=0
rep_query=
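Since each record is a run of key=value fields, one alternative to a single long regular expression is to split the line into a dictionary first. The following is only a sketch: parse_fields, rc_total and FIELD are hypothetical names, and it assumes that keys are made of word characters and that a value never contains an "=" sign itself.

```python
import re

# A value runs (lazily) up to the next " key=" or to the end of the line,
# so a multi-word qry value stays in one piece.
FIELD = re.compile(r'(\w+)=(.*?)(?=\s+\w+=|\s*$)')


def parse_fields(line):
    """Return the key=value fields of one log line as a dict of strings."""
    return dict(FIELD.findall(line.strip()))


def rc_total(fields):
    """Sum the numeric parts of the '|'-separated rc field (0 if absent)."""
    parts = fields.get('rc', '').split('|')
    return sum(int(p) for p in parts if p.isdigit())
```

A line with no key=value fields simply gives an empty dictionary, so a missing qry or rc can be checked with fields.get() instead of raising an exception.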
And my Python code is this:
import sys
import re

for line in sys.stdin:
    count_result = 0
    line = line.strip()
    match=re.search('.*qry=(.*?)qid0.*rc=(.*?)discount',line).groups()
    if (len(match)<2):
        continue
    counts_tmp = match[1].strip()
    counts=counts_tmp.split('|')
    for count in counts:
        if count.isdigit():
            count_result += int(count)
    key_tmp = match[0].strip()
    if key_tmp.strip():
        key = key_tmp.split('\t')
        key = ' '.join(key)
        print '%s\t%s\t%s' %(key,1,count_result)
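One likely cause of the failure (an assumption, since the Hadoop task logs are not shown) is that re.search returns None for any input line that does not match, so calling .groups() on it raises AttributeError. The script then exits with a non-zero status, which Hadoop streaming reports as "subprocess failed with code 1". The sketch below checks for None before using the match. The names PATTERN, map_line and main are hypothetical, and the pattern is slightly tightened compared with the one above.

```python
import re
import sys

# Compiled once. \brc= avoids matching the "rc=" inside "src=".
PATTERN = re.compile(r'qry=(.*?)\s+qid0=.*?\brc=(.*?)\s+discount=')


def map_line(line):
    """Return 'qry<TAB>1<TAB>rc_sum' for a matching line, else None."""
    match = PATTERN.search(line.strip())
    if match is None:  # re.search returns None when nothing matches
        return None
    qry, rc = match.groups()
    key = ' '.join(qry.split())  # collapse tabs and repeated spaces in the key
    if not key:
        return None
    rc_sum = sum(int(c) for c in rc.split('|') if c.isdigit())
    return '%s\t%d\t%d' % (key, 1, rc_sum)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read log lines from stdin and write one record per matching line."""
    for line in stdin:
        record = map_line(line)
        if record is not None:
            stdout.write(record + '\n')
```

To use this as the streaming mapper, the script would end with a call to main(). Lines that do not match are skipped silently here. Writing them to sys.stderr instead would make them visible in the task logs.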