Log analysis with Pandas
This post shows how to analyze raw log data with the Python data analysis library Pandas. I’m using Apache’s access logs as an example and will show how to:
- Parse log data into a Pandas DataFrame (DF)
- Analyze data within the DF
Using Jupyter Notebook for instant results
Jupyter Notebook is an excellent tool to do this kind of analysis.
pip install notebook
jupyter notebook
It automatically renders Pandas DFs as tables and makes it quick to test out different things.
Installing Pandas
Pandas can be installed with pip. I also install pyarrow, because Pandas warns that it will soon be a required dependency.
pip install pandas
pip install pyarrow
You can also install dependencies within the Jupyter Notebook. Just add ! in front of the command in a code cell.
!pip install pandas
!pip install pyarrow
# Add "> /dev/null" at the end if you want to omit the output from the installation.
Importing dependencies and downloading log sample
First we need to import necessary dependencies.
import pandas as pd
import re
import os
import urllib.request
Then we will download a log sample and save it as log_samples/apache.sample.log.
I’m using a log sample from Elastic’s GitHub repo. Change the APACHE_LOG_FILE_URL variable if you want to use a sample from some other source.
APACHE_LOG_FILE_URL = 'https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs'
if not os.path.isdir("log_samples"):
    os.mkdir("log_samples")
apache_log_file = 'log_samples/apache.sample.log'
urllib.request.urlretrieve(APACHE_LOG_FILE_URL, apache_log_file)
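Before parsing anything, it helps to peek at one raw line to see the Apache combined log format we will be matching against:
# Print the first raw log line to see the format we are about to parse
with open(apache_log_file, 'r') as apache_log:
    print(apache_log.readline().strip())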
Normalizing logs to Pandas DataFrame
Here I will show how to parse the sample Apache access log file into a Pandas DataFrame. The main task in the parser code below is to normalize data from each log event using the regular expression defined in the variable APACHE_RE.
APACHE_RE = r'^((?:\d+\.){3}\d+).*\[(.*)\]\s"(.*)"\s(\d{3})\s(\d+|-)\s"(.*)"\s"(.*?)"?\s*$'
# The last group is non-greedy and the closing quote optional, so the quote
# stays out of the captured user agent even on lines where it is missing.
apache_lines = []
# Loop appends regular expression search results to variable "apache_lines";
# lines that do not match the pattern are printed out for inspection
with open(apache_log_file, 'r') as apache_log:
    for line in apache_log:
        r = re.search(APACHE_RE, line)
        if r:
            if len(r.groups()) == 7:
                apache_lines.append(list(r.groups()))
        else:
            print(line)
# This line converts "apache_lines" to DF and specifies column names
df = pd.DataFrame(apache_lines, columns = ['IP', 'DATE', 'URL', 'STATUS', 'BYTES', 'REFERRER', 'UserAgent'])
df
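To make the capture groups concrete, here is the pattern applied to a single log line. The line below is made up for illustration, but it follows the same combined log format as the sample file:
# A hypothetical log line in Apache combined log format (not from the sample file)
sample_line = ('83.149.9.216 - - [17/May/2015:10:05:03 +0000] '
               '"GET /index.html HTTP/1.1" 200 203023 '
               '"http://semicomplete.com/" "Mozilla/5.0"')
print(re.search(APACHE_RE, sample_line).groups())
# ('83.149.9.216', '17/May/2015:10:05:03 +0000', 'GET /index.html HTTP/1.1',
#  '200', '203023', 'http://semicomplete.com/', 'Mozilla/5.0')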
Now the log data is in the variable df and outputting the content looks like the table below. Note that I did not use print(df), because I’m running the code in a Jupyter Notebook, which automatically renders the table in graphical format. With print(df) the output would be shown as plain text, as if the code were run directly in a terminal.
| | IP | DATE | URL | STATUS | BYTES | REFERRER | UserAgent |
|---|---|---|---|---|---|---|---|
| 0 | 83.149.9.216 | 17/May/2015:10:05:03 +0000 | GET /presentations/logstash-monitorama-2013/im... | 200 | 203023 | http://semicomplete.com/presentations/logstash... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 1 | 83.149.9.216 | 17/May/2015:10:05:43 +0000 | GET /presentations/logstash-monitorama-2013/im... | 200 | 171717 | http://semicomplete.com/presentations/logstash... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 2 | 83.149.9.216 | 17/May/2015:10:05:47 +0000 | GET /presentations/logstash-monitorama-2013/pl... | 200 | 26185 | http://semicomplete.com/presentations/logstash... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 3 | 83.149.9.216 | 17/May/2015:10:05:12 +0000 | GET /presentations/logstash-monitorama-2013/pl... | 200 | 7697 | http://semicomplete.com/presentations/logstash... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 4 | 83.149.9.216 | 17/May/2015:10:05:07 +0000 | GET /presentations/logstash-monitorama-2013/pl... | 200 | 2892 | http://semicomplete.com/presentations/logstash... | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 63.140.98.80 | 20/May/2015:21:05:28 +0000 | GET /blog/tags/puppet?flav=rss20 HTTP/1.1 | 200 | 14872 | http://www.semicomplete.com/blog/tags/puppet?f... | Tiny Tiny RSS/1.11 (http://tt-rss.org/) |
| 9996 | 63.140.98.80 | 20/May/2015:21:05:50 +0000 | GET /blog/geekery/solving-good-or-bad-problems... | 200 | 10756 | - | Tiny Tiny RSS/1.11 (http://tt-rss.org/) |
| 9997 | 66.249.73.135 | 20/May/2015:21:05:00 +0000 | GET /?flav=atom HTTP/1.1 | 200 | 32352 | - | Mozilla/5.0 (compatible; Googlebot/2.1; +http:... |
| 9998 | 180.76.6.56 | 20/May/2015:21:05:56 +0000 | GET /robots.txt HTTP/1.1 | 200 | - | - | Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2... |
| 9999 | 46.105.14.53 | 20/May/2015:21:05:15 +0000 | GET /blog/tags/puppet?flav=rss20 HTTP/1.1 | 200 | 14872 | - | UniversalFeedParser/4.2-pre-314-svn +http://fe... |
Analyzing data in DataFrame
Now that we have the data in a Pandas DF, we can investigate it in different ways.
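All columns are plain strings at this point, which is what the examples below assume. If you want to do time- or traffic-based analysis, a minimal sketch for converting DATE and BYTES to proper types could look like this (the format string matches the sample’s timestamp layout):
# Parse timestamps like "17/May/2015:10:05:03 +0000" into timezone-aware datetimes
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%b/%Y:%H:%M:%S %z')
# BYTES is "-" for responses without a body, so coerce those to NaN
df['BYTES'] = pd.to_numeric(df['BYTES'], errors='coerce')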
Extract events by HTTP status code
We can, for example, extract all events where the HTTP status code was 401 or 403. The output of statements like the one below is also rendered as a graphical table within the Jupyter Notebook, so there is no need for print statements.
# Filter dataframe based on column STATUS
df[(df['STATUS'] == '403') | (df['STATUS'] == '401')]
(In Pandas the OR / AND operators are | / &.)
| | IP | DATE | URL | STATUS | BYTES | REFERRER | UserAgent |
|---|---|---|---|---|---|---|---|
| 3028 | 94.153.9.168 | 18/May/2015:11:05:47 +0000 | GET /presentations/vim/+++++++++++++++++++++++... | 403 | 676 | http://www.semicomplete.com/presentations/vim/... | Opera/9.80 (Windows NT 6.1; WOW64; U; ru) Pres... |
| 8685 | 208.115.113.88 | 20/May/2015:10:05:01 +0000 | GET /svnweb/xpathtool/ HTTP/1.1 | 403 | 305 | - | Mozilla/5.0 (compatible; Ezooms/1.0; help@moz.... |
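An equivalent filter, which scales better when matching against many status codes, uses Pandas’ isin method and produces the same table as above:
# Same filter expressed with isin
df[df['STATUS'].isin(['401', '403'])]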
Examine user agents
The example below extracts all unique user agents from the data.
pd.DataFrame(df['UserAgent'].unique(), columns=['UserAgent'])
| | UserAgent |
|---|---|
| 0 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 1 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.... |
| 2 | Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko... |
| 3 | Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma... |
| 4 | Tiny Tiny RSS/1.11 (http://tt-rss.org/) |
| ... | ... |
| 554 | Mozilla/4.0 (compatible; MSIE 5.0; Windows NT;... |
| 555 | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US... |
| 556 | Mozilla/5.0 (Windows NT 5.0; rv:12.0) Gecko/20... |
| 557 | Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.3... |
| 558 | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US... |
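If you only need the number of distinct user agents rather than the full list, nunique gives it directly:
# Count distinct user agents
df['UserAgent'].nunique()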
This example shows the top 10 user agents by number of occurrences.
pd.DataFrame(df['UserAgent'].value_counts()[:10].index, columns=['UserAgent'])
| | UserAgent |
|---|---|
| 0 | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi... |
| 1 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1)... |
| 2 | UniversalFeedParser/4.2-pre-314-svn +http://fe... |
| 3 | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) G... |
| 4 | Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Ma... |
| 5 | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53... |
| 6 | Mozilla/5.0 (compatible; Googlebot/2.1; +http:... |
| 7 | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.... |
| 8 | Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko... |
| 9 | Tiny Tiny RSS/1.11 (http://tt-rss.org/) |
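The example above keeps only the user agent strings. If you also want the occurrence counts themselves, keep the full value_counts result:
# Top 10 user agents together with their occurrence counts
df['UserAgent'].value_counts().head(10)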