This post shows how to analyze raw log data with the Python data analysis library Pandas. I’m using Apache access logs as an example and will show how to:

  • Parse the log data into a Pandas DataFrame (DF)
  • Analyze the data within the DF

Using Jupyter Notebook for instant results

Jupyter Notebook is an excellent tool to do this kind of analysis.

pip install notebook
jupyter notebook

It automatically renders Pandas DFs as formatted tables and makes it quick to test different things.

Installing Pandas

Pandas can be installed with pip. I also install pyarrow, because running Pandas without it prints a warning that it will soon be a required dependency.

pip install pandas
pip install pyarrow

You can also install dependencies from within the Jupyter Notebook. Just add ! in front of the command in a code cell.

!pip install pandas
!pip install pyarrow
# Add "> /dev/null" at the end if you want to omit the output from the installation.

Importing dependencies and downloading log sample

First, we need to import the necessary dependencies.

import pandas as pd
import re
import os
import urllib.request

Then we will download a log sample and save it as log_samples/apache.sample.log. I’m using a log sample from Elastic’s GitHub repo.

Change the APACHE_LOG_FILE_URL variable if you want to use a sample from some other source.

APACHE_LOG_FILE_URL = 'https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs'
if not os.path.isdir("log_samples"):
    os.mkdir("log_samples")
apache_log_file = 'log_samples/apache.sample.log'
urllib.request.urlretrieve(APACHE_LOG_FILE_URL, apache_log_file)

Normalizing logs to Pandas DataFrame

Here I will show how to parse the sample Apache access log file into a Pandas DataFrame. The main task in the parser code below is to normalize the data from each log event using the regular expression defined in the variable APACHE_RE.

APACHE_RE = r'^((?:\d+\.){3}\d+).*\[(.*)\]\s"(.*)"\s(\d{3})\s(\d+|-)\s"(.*)"\s"(.*)"$'
apache_lines = []
# Loop appends the regular expression capture groups of each line to "apache_lines"
with open(apache_log_file, 'r') as apache_log:
    for line in apache_log:
        r = re.search(APACHE_RE, line)
        if r:
            apache_lines.append(list(r.groups()))
        else:
            # Print lines that did not match the pattern so they can be inspected
            print(line)
# This line converts "apache_lines" to a DF and specifies the column names
df = pd.DataFrame(apache_lines, columns=['IP', 'DATE', 'URL', 'STATUS', 'BYTES', 'REFERRER', 'UserAgent'])
df
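A quick way to sanity-check the pattern is to run it against a single hand-written line. The log line below is made up, but follows the combined log format the sample uses:

```python
import re

APACHE_RE = r'^((?:\d+\.){3}\d+).*\[(.*)\]\s"(.*)"\s(\d{3})\s(\d+|-)\s"(.*)"\s"(.*)"$'

# A made-up line in the Apache combined log format
line = ('83.149.9.216 - - [17/May/2015:10:05:03 +0000] '
        '"GET /index.html HTTP/1.1" 200 203023 '
        '"http://semicomplete.com/" "Mozilla/5.0"')

r = re.search(APACHE_RE, line)
print(r.groups())  # (IP, date, request, status, bytes, referrer, user agent)
```

If a real line from your own logs fails to match, this is the easiest place to tweak the pattern until it does.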

Now the log data is in the variable df, and outputting it looks like the table below.

Note that I did not use print(df), because inside a Jupyter Notebook the last expression in a cell is automatically rendered as a formatted table.

With print(df), the output would appear as plain text, just as if the code were run directly in a terminal.

IP DATE URL STATUS BYTES REFERRER UserAgent
0 83.149.9.216 17/May/2015:10:05:03 +0000 GET /presentations/logstash-monitorama-2013/im... 200 203023 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
1 83.149.9.216 17/May/2015:10:05:43 +0000 GET /presentations/logstash-monitorama-2013/im... 200 171717 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
2 83.149.9.216 17/May/2015:10:05:47 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 26185 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
3 83.149.9.216 17/May/2015:10:05:12 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 7697 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
4 83.149.9.216 17/May/2015:10:05:07 +0000 GET /presentations/logstash-monitorama-2013/pl... 200 2892 http://semicomplete.com/presentations/logstash... Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
... ... ... ... ... ... ... ...
9995 63.140.98.80 20/May/2015:21:05:28 +0000 GET /blog/tags/puppet?flav=rss20 HTTP/1.1 200 14872 http://www.semicomplete.com/blog/tags/puppet?f... Tiny Tiny RSS/1.11 (http://tt-rss.org/)
9996 63.140.98.80 20/May/2015:21:05:50 +0000 GET /blog/geekery/solving-good-or-bad-problems... 200 10756 - Tiny Tiny RSS/1.11 (http://tt-rss.org/)
9997 66.249.73.135 20/May/2015:21:05:00 +0000 GET /?flav=atom HTTP/1.1 200 32352 - Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
9998 180.76.6.56 20/May/2015:21:05:56 +0000 GET /robots.txt HTTP/1.1 200 - - Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...
9999 46.105.14.53 20/May/2015:21:05:15 +0000 GET /blog/tags/puppet?flav=rss20 HTTP/1.1 200 14872 - UniversalFeedParser/4.2-pre-314-svn +http://fe...
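All columns are parsed as strings. If you plan to do time-based analysis later, one optional extra step is converting the DATE column into real timestamps with pd.to_datetime. A minimal sketch on a one-row stand-in frame (the DATETIME column name is my own choice):

```python
import pandas as pd

# One-row stand-in for the parsed DataFrame (hypothetical data)
df = pd.DataFrame({'DATE': ['17/May/2015:10:05:03 +0000']})

# Apache's timestamp layout: day/month/year:hour:minute:second timezone
df['DATETIME'] = pd.to_datetime(df['DATE'], format='%d/%b/%Y:%H:%M:%S %z')
```

The examples below keep working on the string columns either way, so this step can be skipped.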

Analyzing data in DataFrame

Now that we have the data in a Pandas DF, we can investigate it in different ways.

Extract events by HTTP status code

We can, for example, extract all events where the HTTP status code was 401 or 403.

The output of statements like the one below is also shown directly as a formatted table within the Jupyter Notebook, so there is no need for print statements.

# Filter dataframe based on column STATUS
df[(df['STATUS'] == '403') | (df['STATUS'] == '401')]

(In Pandas, the OR and AND operators are | and &.)

IP DATE URL STATUS BYTES REFERRER UserAgent
3028 94.153.9.168 18/May/2015:11:05:47 +0000 GET /presentations/vim/+++++++++++++++++++++++... 403 676 http://www.semicomplete.com/presentations/vim/... Opera/9.80 (Windows NT 6.1; WOW64; U; ru) Pres...
8685 208.115.113.88 20/May/2015:10:05:01 +0000 GET /svnweb/xpathtool/ HTTP/1.1 403 305 - Mozilla/5.0 (compatible; Ezooms/1.0; help@moz....
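As a side note, the same filter can be written with isin, which scales better when there are more status codes to match. A sketch with toy data (the variable name errors is mine):

```python
import pandas as pd

# Toy stand-in for the parsed log DataFrame
df = pd.DataFrame({'STATUS': ['200', '403', '401', '200']})

# Equivalent to (df['STATUS'] == '403') | (df['STATUS'] == '401')
errors = df[df['STATUS'].isin(['401', '403'])]
```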

Examine user agents

The example below extracts all unique user agents from the data.

pd.DataFrame(df['UserAgent'].unique(), columns=['UserAgent'])
UserAgent
0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
1 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26....
2 Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko...
3 Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma...
4 Tiny Tiny RSS/1.11 (http://tt-rss.org/)
... ...
554 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT;...
555 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US...
556 Mozilla/5.0 (Windows NT 5.0; rv:12.0) Gecko/20...
557 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3...
558 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US...

This example shows the top 10 user agents by number of occurrences.

pd.DataFrame(df['UserAgent'].value_counts()[:10].index, columns=['UserAgent'])
UserAgent
0 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
1 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)...
2 UniversalFeedParser/4.2-pre-314-svn +http://fe...
3 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) G...
4 Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma...
5 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...
6 Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
7 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27....
8 Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko...
9 Tiny Tiny RSS/1.11 (http://tt-rss.org/)
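Taking .index in the example above throws away the counts and keeps only the user-agent strings. If you also want to see how many times each agent occurred, keep the value_counts result as-is: its index holds the agent strings and its values the tallies. A sketch with toy data:

```python
import pandas as pd

# Toy data standing in for the UserAgent column
df = pd.DataFrame({'UserAgent': ['A', 'A', 'B', 'A', 'C', 'B']})

# value_counts() sorts by frequency, most common first
top = df['UserAgent'].value_counts()[:10]
```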