Hacking on Apache Log Files with Python

There are plenty of tools out there for doing web request analysis. I wanted to pull some information out of some Apache log files without all that overhead. Here's what I did:

I got Rory McCann's apache-log-parser (after some googling; it's on Github and on pypi).  I set up a Python virtual environment using Doug Hellmann's virtualenvwrapper, activated it, and then used:

 pip install apache-log-parser  

Since I'd never used apache-log-parser before, I had to get familiar with it. I discovered that, to use it, I had to know the format string that Apache was using to log information for my site. This was a two-step process, figured out by reading the Log Files section of the Apache documentation and poking about with grep.

First, I searched in the Apache configuration files for the CustomLog directive that's associated with the virtual host I wanted to analyze. This gave me a 'nickname' for the log configuration. More spelunking in Apache config files -- this time in the main configuration file -- turned up the definition of that nickname (Apache uses the LogFormat directive for this purpose):

 $ cd /etc/apache2/  
$ grep CustomLog sites-enabled/foo.nowhere.org
CustomLog /var/log/apache2/foo.nowhere.org-access.log combined
$ grep combined apache2.conf
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

It's that LogFormat string that needs to be given to Rory's code so it knows how to parse your log files.

After some experimenting in the Python interpreter to get a feel for the code and its capabilities, I wrote a few lines of my own to wrap the setup and file reading operations:

 #!/usr/bin/env python  
# -*- coding: utf-8 -*-

import apache_log_parser
import glob
import logging

# supported log file formats
APACHE_COMBINED="%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
APACHE_COMMON="%h %l %u %t \"%r\" %>s %b"

def gulp(log_file_path, pattern=APACHE_COMBINED):
""" import and parse log files """
log_data=[]
line_parser=apache_log_parser.make_parser(pattern)
for file_name in glob.glob(log_file_path):
logging.info("file_name: %s" % file_name)
file = open(file_name, 'r')
lines = file.readlines()
file.close()
logging.info(" read %s lines" % len(lines))
for line in lines:
line_data=line_parser(line)
log_data.append(line_data)
logging.info("total number of events parsed: %s" % len(log_data))
return log_data

For this particular server, I had multiple log files, but I wanted to have all the requests from all of them (parsed into dictionaries by Rory's code) in a single list for subsequent analysis. So, back to the Python interpreter:

 >>> import logging  
>>> logging.basicConfig(level=logging.INFO)
>>> import loggulper
>>> d = loggulper.gulp("/path/to/my/log/files/foo.nowhere.org-access.*")

I'll spare you the logging messages. This took several minutes. I ended up with about 1.5 million requests in the list. Real life intervened. How to save this data for later without having to run through the whole import process again? Pickle it.

 >>> import cPickle as pickle  
>>> out_name = "logdata.pickle"
>>> outf = open(out_name, 'w')
>>> pickler = pickle.Pickler(outf, pickle.HIGHEST_PROTOCOL)
>>> pickler.dump(d)
<cPickle.Pickler object at 0x1044044e8>
>>> outf.close()

The whole list was saved to a 933.3 MB file in just a few seconds (full disclosure: I have a solid-state drive). It was nearly as quick to read back in a couple of days later (new interpreter session and all):

 >>> import cPickle as pickle  
>>> in_name="logdata.pickle"
>>> inf=open(in_name, 'r')
>>> unpickler=pickle.Unpickler(inf)
>>> d=unpickler.load()
>>> len(d)
1522015
>>> d[0].keys()
['status', 'request_header_referer', 'remote_user', 'request_header_user_agent__browser__family', 'request_header_user_agent__is_mobile', 'request_header_user_agent__browser__version_string', 'request_header_user_agent', 'request_http_ver', 'request_header_user_agent__os__version_string', 'remote_logname', 'time_recieved_isoformat', 'time_recieved', 'request_first_line', 'request_header_user_agent__os__family', 'request_method', 'request_url', 'remote_host', 'time_recieved_datetimeobj', 'response_bytes_clf']

It's important to notice at this point that the word "received" is misspelled "recieved" in keys in the dictionaries returned by apache-log-parser. If you don't notice this early on, it will cause some amount of frustration.

It turned out that my log data included events past the end of reporting period I'd been given (ending 31 May 2014), so I needed to filter out just those requests that fell within the reporting period. Python list comprehensions to the rescue:

 >>> dates=[req['time_recieved_datetimeobj'] for req in d]  
>>> max(dates)
datetime.datetime(2014, 7, 13, 11, 37, 31)
>>> min(dates)
datetime.datetime(2013, 7, 21, 3, 41, 26)
>>> from datetime import datetime
>>> d_relevant=[req for req in d if req['time_recieved_datetimeobj'] < datetime(2014,06,01)]
>>> dates=[req['time_recieved_datetimeobj'] for req in d_relevant]
>>> max(dates)
datetime.datetime(2014, 5, 31, 23, 59, 17)
>>> min(dates)
datetime.datetime(2013, 7, 21, 3, 41, 26)

Now to separate requests made by self-identified bots and spiders from the rest of the traffic:

 >>> robots=[req for req in d_relevant if 'bot' in req['request_header_user_agent'].lower() or 'spider' in req['request_header_user_agent'].lower()]  
>>> len(robots)
848450
>>> humans=[req for req in d_relevant if 'bot' not in req['request_header_user_agent'].lower() and 'spider' not in req['request_header_user_agent'].lower()]
>>> len(humans)
486249
>>> len(robots) + len(humans) == len(d_relevant)
True
>>> unique_bots=[]
>>> for bot in robots:
... if bot['request_header_user_agent'] not in unique_bots:
... unique_bots.append(bot['request_header_user_agent'])
...
>>> len(unique_bots)
229

Aside: yes, I know there could well still be automated agents in the "humans" list; I've only filtered out those that are not operated by the sneaky or the uninformed. Let's not worry about that issue for now.

Stay tuned for the next installment, wherein I hope we actually learn something about how our server is being used.