Inhouse Logfile Analytics Email
~ 5 Minute Read.
I recently mentioned that I wrote a script to do some inhouse Apache log file analytics to monitor traffic on my blog. I love Python for this kind of quick scripting stuff, so here is how I did that with pandas, jinja2 and the Python standard library.
Parsing Apache Logfiles with Pandas
I didn’t dig into this much since I don’t have a lot of experience with pandas and was happy to find a solution on Modesto Mas’ blog. A cool thing to note, though, is that pandas will transparently read your compressed logs, which is super neat!
Reading a log in Apache Combined format looks like this:
import datetime

import pandas as pd
import pytz


def parse_datetime(x):
    # x looks like '[12/Dec/2015:18:25:11 +0100]'
    dt = datetime.datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    dt_tz = int(x[-6:-3])*60 + int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))


def read_log(filename):
    return pd.read_csv(
        filename,
        # split on whitespace, but not inside "..." or [...]
        sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
        engine='python',
        na_values='-',
        header=None,
        usecols=[0, 3, 4, 5, 6, 7, 8],
        names=['ip', 'time', 'request', 'status', 'size', 'referer',
               'user_agent'],
        converters={'time': parse_datetime, 'status': int, 'size': int})
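To make the slicing in parse_datetime less opaque, here is what it does to a hypothetical timestamp field (note that the offset math assumes positive timezone offsets; a negative offset would need the sign applied to the minutes as well):

# The raw field looks like '[12/Dec/2015:18:25:11 +0100]':
#   x[1:-7]  -> '12/Dec/2015:18:25:11'  (date and time)
#   x[-6:-3] -> '+01'                   (offset hours, signed)
#   x[-3:-1] -> '00'                    (offset minutes)
print(parse_datetime('[12/Dec/2015:18:25:11 +0100]'))
# 2015-12-12 18:25:11+01:00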
I use this to parse all of the logs and then combine the data frames using
import glob

data = pd.concat([read_log(f) for f in glob.glob('path-to/my-access-log*')],
                 ignore_index=True)
The glob also picks up the rotated, compressed log files. As mentioned above, pandas doesn’t care whether the input is compressed or not and reads the log files correctly either way.
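With default logrotate settings, the pattern would match something like this (hypothetical filenames and output):

for f in sorted(glob.glob('path-to/my-access-log*')):
    print(f)
# path-to/my-access-log
# path-to/my-access-log.1
# path-to/my-access-log.2.gz  <- read transparently, since pd.read_csv
#                                defaults to compression='infer'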
Some cleanup is needed, like normalizing the filenames: every request to a blog post, e.g. blog.squareys.de/inhouse-logfile-analytics-email, should end with a slash, i.e. blog.squareys.de/inhouse-logfile-analytics-email/, which makes counting filename values easier later.
From the request I generate a new column “filename”, roughly along these lines:
requested_files = pd.DataFrame(
    {'filename': data.request.apply(lambda row: row.split(' ')[1])})
data = data.join(requested_files)

# Fix trailing / inconsistencies
data['filename'] = data['filename'].apply(
    lambda filename: filename
    if filename.endswith('/') or '.' in filename
    else filename + '/')

# Fix / prefix inconsistencies
data['filename'] = data['filename'].apply(
    lambda filename: filename[1:] if filename.startswith('//') else filename)
This could well be the most inefficient way to do it; I’m happy about improvement suggestions! (As mentioned, I don’t have much experience with pandas.)
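For what it’s worth, here is a vectorized sketch using pandas’ string methods that should behave like the lambdas above; treat it as an untested suggestion rather than a drop-in replacement:

# Extract the path from the request line, vectorized
data['filename'] = data.request.str.split(' ').str[1]

# Append a trailing slash where there is neither one nor a file extension
needs_slash = ~data.filename.str.endswith('/') \
    & ~data.filename.str.contains('.', regex=False)
data.loc[needs_slash, 'filename'] += '/'

# Collapse a leading double slash into a single one
data['filename'] = data.filename.str.replace(r'^//', '/', regex=True)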
I then go ahead and classify the type of file: Is it a blog post or an image…?
import os

# Classify file type
# (PAGES: top-level slugs of static pages, defined elsewhere in the script)
def classify_filetype(row):
    path, file = os.path.split(row.filename)
    file = file.split("?")[0]
    base, ext = os.path.splitext(file)
    if ext in ['.png', '.jpg', '.bmp', '.gif']:
        return "IMG"
    elif ext in ['.wasm']:
        return "WASM"
    elif ext in ['.woff', '.woff2']:
        return "FONT"
    elif ext in ['.css']:
        return "STYLE"
    elif ext in ['.js', '.min.js']:
        return "SCRIPT"
    elif file == "robots.txt":
        return "BOT"
    elif ext == ".xml" and file.startswith("sitemap"):
        return "SITEMAP"
    elif file.endswith(".atom.xml") or path.startswith("/feed"):
        return "FEED"
    elif file == "favicon.ico":
        return "FAVICON"
    elif not ext:
        if path == "/" and not file:
            return "ROOT"
        dir = path.split('/')[1]
        if dir in PAGES:
            return "PAGE"
        if dir in ["tag", "category", "author"]:
            return "ARCHIVE"
        # listing like category, author, tag or root
        return "BLOG"
    elif ext in ['.html', '.php']:
        return "PAGE"
    return "SUSPECT"


def clean_referer(x):
    # missing referers are NaN after na_values='-' above
    if not isinstance(x, str):
        return x
    # cut away surrounding "" and shorten
    return x[1:40] + "..." if len(x) > 43 else x[1:-1]


data.referer = data.referer.apply(clean_referer)

new_cols = pd.DataFrame({
    'filetype': pd.Categorical(data.apply(classify_filetype, axis=1)),
})
data = data.join(new_cols)
Again, super rough and probably incomplete, but it does the job! You’ll also notice I do some cleanup there: some people just hammer your website with super long referral URLs, which are a pain to view in an analytics report, so I cut them to a maximum of 43 characters. I don’t currently use the referer information anyway, except for grouping and counting it.
It’s a measure to recognize spikes more than anything else… so that I can thank people who refer to my blog? Who knows what this could be useful for some day :)
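For completeness, the grouping and counting boils down to a value_counts() call, and the same idea doubles as a sanity check for the filetype classifier (a large SUSPECT count hints at file types it doesn’t know yet); both are suggestions, not part of the original script:

# Top referers by hits (NaN, i.e. missing referers, is excluded)
print(data.referer.value_counts().head(10))

# Sanity check: how many requests fall into each category?
print(data.filetype.value_counts())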
Analyzing with Pandas
# Only analyze successful page requests
data = data[data.status == 200]
Now we have a big table of unreadable data. What to do next? How about getting the top blog posts:
top_blog_posts = data[data.filetype == "BLOG"] \
    .filename.value_counts()[lambda x: x > 10]
Well, that’s easy! Just take all rows with the BLOG filetype and count the filename values! (That gives us hits, not unique visitors, beware of that.) The final filter ensures we only keep blog posts that have been viewed more than 10 times. To show just the top ten posts, go for .head(10).
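If unique visitors are what you are after, counting distinct IPs per post is a rough way to approximate them; a sketch along the same lines:

# Approximate unique visitors by counting distinct IPs per blog post
top_by_visitors = data[data.filetype == "BLOG"] \
    .groupby('filename').ip.nunique() \
    .sort_values(ascending=False).head(10)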
All accesses in the last 24 hours?
# now() needs to be timezone-aware to compare against the parsed timestamps
yesterday = datetime.datetime.now(pytz.UTC) - pd.to_timedelta("1day")

last_24_h = data[data.time > yesterday][['filename', 'ip', 'filetype', 'size']] \
    .groupby('filename') \
    .agg({'ip': 'count', 'filetype': 'first', 'size': 'sum'})
last_24_h = last_24_h[last_24_h.ip > 1].sort_values('ip', ascending=False)
Also not that hard; groupby is very powerful! Use 'ip': 'nunique' in the aggregation to retrieve (approximate) visitors instead of hits.
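Spelled out, that visitor variant of the aggregation would look like this:

# Same aggregation, but counting distinct IPs as a proxy for visitors
last_24_h_visitors = data[data.time > yesterday].groupby('filename').agg(
    {'ip': 'nunique', 'filetype': 'first', 'size': 'sum'})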
Generating the Report with Jinja2
Hurray, we can get useful data out of our logs! Now we just need to format it nicely and send it by email.
The most important thing here is to create a custom filter which lets you display nice UTF-8 emoji instead of the category names.
CATEGORY_TO_EMOJI = {
    "IMG": "🖼️",
    "FONT": "",
    "WASM": "🖥️",
    "PAGE": "📃",
    "BLOG": "📝",
    "SUSPECT": "❓",
    "ARCHIVE": "📁",
    "BOT": "🤖",
    "SCRIPT": "",
    "STYLE": "",
    "FEED": "",
    "FAVICON": "",
    "SITEMAP": "🗺️",
    "ROOT": "🌱",
}

def emojify(cat):
    return CATEGORY_TO_EMOJI[cat] if cat in CATEGORY_TO_EMOJI else cat

# ... later add it to the environment:
env.filters['emojify'] = emojify
Definitely the most important thing. Second most important is actually creating the jinja template:
<html>
<head>
  <!-- Let jinja copy-paste some css file here -->
  <style>
    {% include 'm-grid.css' %}
  </style>
</head>
<body>
<div class="container">
  <div class="m-row">
    <div class="m-col-m-6">
      <h2><small>{{now}}</small><br/>Website Analytics Report</h2>
    </div>
  </div>
  <div class="m-row">
    <div class="m-col-m-4">
      <h2>Last 24 hours</h2>
      <table>
        <tr>
          <th>Filetype</th>
          <th>Filename</th>
          <th>Hits</th>
        </tr>
        {%- for filename, ip, filetype in last_24_h.itertuples() %}
        <tr>
          <td>{{filetype|emojify}}</td>
          <td>{{filename}}</td>
          <td>{{ip}}</td>
        </tr>
        {%- endfor %}
      </table>
    </div>
    <div class="m-col-m-2">
      <h2>Top Blog Posts</h2>
      <table>
        <tr>
          <th>Post</th>
          <th>Hits</th>
        </tr>
        {%- for post, hits in top_blog_posts.iteritems() %}
        <tr>
          <td>{{post}}</td>
          <td>{{hits}}</td>
        </tr>
        {%- endfor %}
      </table>
    </div>
    <!-- ... -->
  </div>
</div>
</body>
</html>
This can be read and rendered with the simplest jinja2 API:
from jinja2 import Environment, FileSystemLoader, select_autoescape

env = Environment(
    loader=FileSystemLoader('.'),
    autoescape=select_autoescape(['html'])
)
template = env.get_template("template.html")
# Note: this reuses the name `data`, shadowing the DataFrame from above
data = template.render(
    now=datetime.datetime.now(),
    last_24_h=last_24_h[['ip', 'filetype']],
    top_blog_posts=top_blog_posts,
    # ... whatever you can dream of
)
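A small optional step (not strictly part of the pipeline): dump the rendered report to a file to preview it in a browser before mailing it around.

# Preview the rendered report in a browser before sending it
with open('report-preview.html', 'w', encoding='utf-8') as f:
    f.write(data)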
Sending the Mail
And off it goes!
import email.message
import smtplib

s = smtplib.SMTP('localhost', port=...)
s.login("admin", "password")  # Yeah, these are my real credentials, of course!

msg = email.message.EmailMessage()
msg.set_charset('utf-8')
msg['Subject'] = 'Analytics for ' + page  # page: site name, set elsewhere
msg['From'] = "analytics@example.com"
msg['To'] = ", ".join(["admin@the.page", "datascientist@void.online"])
msg.set_content('Please view in HTML')
msg.add_alternative(data, subtype='html')  # the rendered template from above

s.send_message(msg)
s.quit()
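One caveat: if the SMTP server is not localhost, the connection should be encrypted before logging in. A minimal sketch, assuming a hypothetical remote server that offers STARTTLS on port 587:

# Encrypted variant for a remote SMTP server (hypothetical host and port)
s = smtplib.SMTP('mail.example.com', port=587)
s.starttls()  # upgrade to TLS before sending any credentials
s.login("admin", "password")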
Conclusion
What a wall of code this was today! I hope you take something away from this and maybe make use of it yourself! I believe having control over your own analytics is super fun, and it keeps you thinking about whether you are interpreting your data properly, training yourself not to fall into traps.
Written in 45 minutes, edited in 10 minutes.