This post shows how to analyze raw log data with the Python data analysis library Pandas. I’m using Apache access logs as an example and will show how to:

  • Parse the log data into a Pandas DataFrame (DF)
  • Analyze the data within the DF

Using Jupyter Notebook for instant results

Jupyter Notebook is an excellent tool to do this kind of analysis.

pip install notebook
jupyter notebook

It automatically renders Pandas DFs as formatted tables and makes it quick to test different things.

Installing Pandas

Pandas can be installed with pip. I also install pyarrow, because running Pandas without it prints a warning that it will soon be a required dependency.

pip install pandas
pip install pyarrow

You can also install dependencies from within the Jupyter Notebook. Just add ! in front of the command in a code cell.

!pip install pandas
!pip install pyarrow
# Add "> /dev/null" at the end if you want to omit the output from the installation.

Importing dependencies and downloading log sample

First, we need to import the necessary dependencies.

import pandas as pd
import re
import os
import urllib.request

Then we will download a log sample and save it as log_samples/apache.sample.log. I’m using a log sample from Elastic’s GitHub repo.

Change the APACHE_LOG_FILE_URL variable if you want to use a sample from some other source.

APACHE_LOG_FILE_URL = 'https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs'
if not os.path.isdir("log_samples"):
    os.mkdir("log_samples")
apache_log_file = 'log_samples/apache.sample.log'
urllib.request.urlretrieve(APACHE_LOG_FILE_URL, apache_log_file)

Normalizing logs to Pandas DataFrame

Here I will show how to parse the sample Apache access log file into a Pandas DataFrame. The main task in the parser code below is to normalize the data from each log event using the regular expression defined in the variable APACHE_RE.

APACHE_RE = r'^((?:\d+\.){3}\d+).*\[(.*)\]\s"(.*)"\s(\d{3})\s(\d+|-)\s"(.*)"\s"(.*)"$'
apache_lines = []
# Loop appends the regular expression capture groups of each line to "apache_lines"
with open(apache_log_file, 'r') as apache_log:
    for line in apache_log:
        r = re.search(APACHE_RE, line)
        if r:
            apache_lines.append(list(r.groups()))
        else:
            # Print lines that did not match the pattern so they can be inspected
            print(line)
# This line converts "apache_lines" to a DF and specifies the column names
df = pd.DataFrame(apache_lines, columns=['IP', 'DATE', 'URL', 'STATUS', 'BYTES', 'REFERRER', 'UserAgent'])
df
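A quick way to sanity-check the pattern is to run it against a single hand-written line. The log line below is made up, but follows the combined log format the sample uses:

```python
import re

APACHE_RE = r'^((?:\d+\.){3}\d+).*\[(.*)\]\s"(.*)"\s(\d{3})\s(\d+|-)\s"(.*)"\s"(.*)"$'

# A made-up line in the Apache combined log format
line = ('83.149.9.216 - - [17/May/2015:10:05:03 +0000] '
        '"GET /index.html HTTP/1.1" 200 203023 '
        '"http://semicomplete.com/" "Mozilla/5.0"')

r = re.search(APACHE_RE, line)
print(r.groups())  # (IP, date, request, status, bytes, referrer, user agent)
```

If a real line from your own logs fails to match, this is the easiest place to tweak the pattern until it does.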

Now the log data is in the variable df, and outputting it looks like the table below.

Note that I did not use print(df), because inside a Jupyter Notebook the last expression in a cell is automatically rendered as a formatted table.

With print(df), the output would appear as plain text, just as if the code were run directly in a terminal.

IP DATE URL STATUS BYTES REFERRER UserAgent
0 83.149.9.216 17/May/2015:10:05:03 +0000 GET /presentations/logstash-monitorama-2013/im... 200 203023 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
1 83.149.9.216 17/May/2015:10:05:43 +0000 GET /presentations/logstash-monitorama-2013/im... 200 171717 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
2 83.149.9.216 17/May/2015:10:05:47 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 26185 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
3 83.149.9.216 17/May/2015:10:05:12 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 7697 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
4 83.149.9.216 17/May/2015:10:05:07 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 2892 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
... ... ... ... ... ... ... ...
9995 63.140.98.80 20/May/2015:21:05:28 +0000 GET /blog/tags/puppet?flav=rss20 HTTP/1.1 200 14872 http://www.semicomplete.com/blog/tags/puppet?f... Tiny Tiny RSS/1.11 (http://tt-rss.org/)
9996 63.140.98.80 20/May/2015:21:05:50 +0000 GET /blog/geekery/solving-good-or-bad-problems... 200 10756 - Tiny Tiny RSS/1.11 (http://tt-rss.org/)
9997 66.249.73.135 20/May/2015:21:05:00 +0000 GET /?flav=atom HTTP/1.1 200 32352 - Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
9998 180.76.6.56 20/May/2015:21:05:56 +0000 GET /robots.txt HTTP/1.1 200 - - Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...
9999 46.105.14.53 20/May/2015:21:05:15 +0000 GET /blog/tags/puppet?flav=rss20 HTTP/1.1 200 14872 - UniversalFeedParser/4.2-pre-314-svn +http://fe...
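All columns are parsed as strings. If you plan to do time-based analysis later, one optional extra step is converting the DATE column into real timestamps with pd.to_datetime. A minimal sketch on a one-row stand-in frame (the DATETIME column name is my own choice):

```python
import pandas as pd

# One-row stand-in for the parsed DataFrame (hypothetical data)
df = pd.DataFrame({'DATE': ['17/May/2015:10:05:03 +0000']})

# Apache's timestamp layout: day/month/year:hour:minute:second timezone
df['DATETIME'] = pd.to_datetime(df['DATE'], format='%d/%b/%Y:%H:%M:%S %z')
```

The examples below keep working on the string columns either way, so this step can be skipped.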

Analyzing data in DataFrame

Now that we have the data in a Pandas DF, we can investigate it in different ways.

Extract events by HTTP status code

We can, for example, extract all events where the HTTP status code was 401 or 403.

The output of statements like the one below is also shown directly as a formatted table within the Jupyter Notebook, so there is no need for print statements.

# Filter dataframe based on column STATUS
df[(df['STATUS'] == '403') | (df['STATUS'] == '401')]

(In Pandas, the OR and AND operators are | and &.)

IP DATE URL STATUS BYTES REFERRER UserAgent
3028 94.153.9.168 18/May/2015:11:05:47 +0000 GET /presentations/vim/+++++++++++++++++++++++... 403 676 http://www.semicomplete.com/presentations/vim/... Opera/9.80 (Windows NT 6.1; WOW64; U; ru) Pres...
8685 208.115.113.88 20/May/2015:10:05:01 +0000 GET /svnweb/xpathtool/ HTTP/1.1 403 305 - Mozilla/5.0 (compatible; Ezooms/1.0; help@moz....
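As a side note, the same filter can be written with isin, which scales better when there are more status codes to match. A sketch with toy data (the variable name errors is mine):

```python
import pandas as pd

# Toy stand-in for the parsed log DataFrame
df = pd.DataFrame({'STATUS': ['200', '403', '401', '200']})

# Equivalent to (df['STATUS'] == '403') | (df['STATUS'] == '401')
errors = df[df['STATUS'].isin(['401', '403'])]
```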

Examine user agents

The example below extracts all unique user agents from the data.

pd.DataFrame(df['UserAgent'].unique(), columns=['UserAgent'])
UserAgent
0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
1 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26....
2 Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko...
3 Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma...
4 Tiny Tiny RSS/1.11 (http://tt-rss.org/)
... ...
554 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT;...
555 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US...
556 Mozilla/5.0 (Windows NT 5.0; rv:12.0) Gecko/20...
557 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3...
558 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US...

This example shows the top 10 user agents by number of occurrences.

pd.DataFrame(df['UserAgent'].value_counts()[:10].index, columns=['UserAgent'])
UserAgent
0 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
1 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
2 UniversalFeedParser/4.2-pre-314-svn +http://fe...
3 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) G...
4 Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma...
5 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
6 Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
7 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27....
8 Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko...
9 Tiny Tiny RSS/1.11 (http://tt-rss.org/)
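Taking .index in the example above throws away the counts and keeps only the user-agent strings. If you also want to see how many times each agent occurred, keep the value_counts result as-is: its index holds the agent strings and its values the tallies. A sketch with toy data:

```python
import pandas as pd

# Toy data standing in for the UserAgent column
df = pd.DataFrame({'UserAgent': ['A', 'A', 'B', 'A', 'C', 'B']})

# value_counts() sorts by frequency, most common first
top = df['UserAgent'].value_counts()[:10]
```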