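I want to extract users' access records from my website logs, leaving out search engine traffic. What command should I use?
Python handles this well.
Here is what a typical Apache log looks like: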
192.168.1.1 - - [20/Nov/2011:01:10:35 +0100] "GET /feed.atom HTTP/1.0" 200 259653
192.168.1.2 - - [20/Nov/2011:01:10:49 +0100] "GET /feed.atom HTTP/1.1" 304 153
192.168.1.3 - - [20/Nov/2011:01:10:50 +0100] "GET /2008/1/23/no HTTP/1.0" 404 472
192.168.1.4 - - [20/Nov/2011:01:10:50 +0100] "GET /feed.atom?_qt=data HTTP/1.1"
First, open the log file:
with open('/var/log/apache2/access.log') as f:
    for line in f:
        pass  # handle each log line here
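Then extract the access records and count the most requested URIs:
import re
from collections import defaultdict
from heapq import nlargest

with open('log.txt') as f:
    count = defaultdict(int)  # URI -> number of requests
    for line in f:
        # Capture the request path between the HTTP method and "HTTP/"
        match = re.search(r' "\w+ (.*?) HTTP/', line)
        if match is None:
            continue
        # Drop the query string so /feed.atom?_qt=data counts as /feed.atom
        uri = match.group(1).split('?')[0]
        count[uri] += 1

# The five most requested URIs, highest count first
most_common = nlargest(5, count.items(), key=lambda x: x[1])
print(most_common)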
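
The counting script above includes every client, so search engine crawlers are still in the totals. The sample lines shown earlier carry no User-Agent field, so they cannot be classified on their own. The following is only a rough sketch of one way to skip crawlers, assuming the server writes Apache's Combined Log Format (which appends the referer and User-Agent to each line). The marker list BOT_MARKERS and the helper names is_search_engine and user_requests are hypothetical and not part of the original answer.

```python
import re

# Assumed substrings that identify crawler User-Agents; extend as needed.
BOT_MARKERS = ('bot', 'spider', 'crawl', 'slurp')

# Combined Log Format: ... "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'"\w+ (?P<uri>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def is_search_engine(user_agent):
    """Return True if the User-Agent contains any assumed crawler marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)


def user_requests(lines):
    """Yield the URI of each request not made by a recognised crawler."""
    for line in lines:
        match = COMBINED.search(line)
        if match is None:
            continue  # not in combined format, so the client cannot be classified
        if is_search_engine(match.group('agent')):
            continue
        # Drop the query string, as in the counting script
        yield match.group('uri').split('?')[0]
```

To combine it with the counting script, one option is to loop over user_requests(f) instead of the raw lines and increment count[uri] for each URI it yields. Substring matching on the User-Agent is a heuristic: crawlers that send a browser-like User-Agent will not be caught.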