Inhouse Logfile Analytics Email

~ 5 Minute Read.

I re­cent­ly men­tioned that I wrote a script to do some in­house apache log file an­a­lyt­ics to su­per­vise traf­ic on my blog. I love python for this kind of quick script­ing stuff, so here is how I did that with pan­das, jin­ja2 and python stan­dard li­braries.

Pars­ing Apache Log­files with Pan­das

I didn’t dig in­to this much since I don’t have a lot of ex­pe­ri­ence with pan­das and was hap­py to find a so­lu­tion on Modesto Mas’ blog. A cool thing to note, though, is that pan­das will au­to­mat­i­cal­ly parse your com­pressed logs cor­rect­ly, which is su­per neat!

Read­ing a log in Apache Com­bined for­mat looks like this:

def parse_datetime(x):
    dt = datetime.datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))


def read_log(filename):
    return pd.read_csv(
        filename,
        sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
        engine='python',
        na_values='-',
        header=None,
        usecols=[0, 3, 4, 5, 6, 7, 8],
        names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
        converters={'time': parse_datetime,
                    'status': int,
                    'size': int})

I use this to parse all of the logs and then com­bine the da­ta frames us­ing

data = pd.concat([read_log(f) for f in glob.glob('path-to/my-access-log*')],
                 ignore_index=True)

Where the glob does the mag­ic to al­so get the com­pressed old log files. As men­tioned above, pan­das will not care about whether the in­put is com­pressed or not and cor­rect­ly read the log files.

Some cleanup is need­ed, like nor­mal­iz­ing the file­names, en­sur­ing ev­ery re­quest to a blog post, e.g. blog.squareys.de/inhouse-logfile-analytics-email ends with a slash, i.e. blog.squareys.de/inhouse-logfile-analytics-email/, which makes count­ing file­name val­ues eas­i­er lat­er. From the re­quest I gen­er­ate a new col­umn “file­name” rough­ly along the lines of this:

requested_files = pd.DataFrame({'filename': data.request.apply(lambda row: row.split(' ')[1])})
data = data.join(requested_files)
# Fix trailing / inconsistencies
data['filename'] = data['filename'].apply(
    lambda filename: filename if filename.endswith('/') or '.' in filename else filename + '/')
# Fix / prefix inconsistencies
data['filename'] = data['filename'].apply(
    lambda filename: filename[1:] if filename.startswith('//') else filename)

Which could be the most in­ef­fi­cient way to do this—I’m hap­py about im­prove­ment sug­ges­tions! (As men­tioned, I don’t have ex­pe­ri­ence with pan­das.)

I then go ahead and clas­si­fy the type of file: Is it a blog post or an im­age…?

# Classify file type
def classify_filetype(row):
    path, file = os.path.split(row.filename)
    file = file.split("?")[0]
    base, ext = os.path.splitext(file)
    if ext in ['.png', '.jpg', '.bmp', '.gif']:
        return "IMG"
    elif ext in ['.wasm']:
        return "WASM"
    elif ext in ['.woff', '.woff2']:
        return "FONT"
    elif ext in ['.css']:
        return "STYLE"
    elif ext in ['.js', '.min.js']:
        return "SCRIPT"
    elif file == "robots.txt":
        return "BOT"
    elif ext == ".xml" and file.startswith("sitemap"):
        return "SITEMAP"
    elif file.endswith(".atom.xml") or path.startswith("/feed"):
        return "FEED"
    elif file == "favicon.ico":
        return "FAVICON"
    elif not ext:
        if path == "/" and not file:
            return "ROOT"

        dir = path.split('/')[1]
        if dir in PAGES:
            return "PAGE"
        else:
            if dir in ["tag", "category", "author"]:
                return "ARCHIVE"  # listing like category, author, tag or root
            return "BLOG"
    elif ext in ['.html', '.php']:
        return "PAGE"
    return "SUSPECT"

def clean_referer(x):
    # cut away surrounding "" and shorten
    return x[1:40] + "..." if len(x) > 43 else x[1:-1]

data.referer = data.referer.apply(clean_referer)
new_cols = pd.DataFrame({
    'filetype': pd.Categorical(data.apply(classify_filetype, axis=1)),
})
data = data.join(new_cols)

Again, su­per rough and prob­a­bly in­com­plete, but does the job! You’ll al­so no­tice I do some cleanup there: Some peo­ple just ham­mer you web­site with su­per long refer­al URLs which are a pain to view in an an­a­lyt­ics re­port, so I cut them to max 43 char­ac­ters. I don’t cur­rent­ly use the ref­er­er in­for­ma­tion any­way ex­cept group­ing and count­ing it.

It’s a mea­sure to rec­og­nize spikes more than any­thing else… so that I can thank peo­ple who re­fer to my blog? Who knows what this could be use­ful for some day :)

An­a­lyz­ing with Pan­das

# Only analyze successful page requests
data = data[data.status == 200]

Now we have a big ta­ble of un­read­i­ble da­ta. What to do next? How about get­ting the top blog posts:

top_blog_posts = data[data.filetype == ["BLOG"]].filename.value_counts()[lambda x: x > 10]

Well, that’s easy! Just take all posts with BLOG file­type and count the file­name val­ues! (That gives us hits, not unique vis­i­tors, be­ware of that.) The fi­nal fil­ter en­sures we on­ly keep blog posts that have been viewed more than 10 times. To show the top ten posts, go for .head(10).

All ac­cess­es in the last 24 hours?

yesterday = datetime.datetime.now() - pd.to_timedelta("1day")

last_24_h = data[data.time > yesterday][['filename', 'ip', 'filetype', 'size']].groupby('filename').agg({'ip': 'count', 'filetype': 'first', 'size': 'sum'})
last_24_h = last_24_h[last_24_h.ip > 1].sort_values('ip', ascending=False)

Al­so not that hard. Group is very pow­er­ful! Use 'ip': count_distinct to re­trieve vis­i­tors in­stead of hits.

Gen­er­at­ing the Re­port with Jin­ja2

Hur­ray, we can get use­ful da­ta out of our logs. Now we just need to nice­ly for­mat it and send it per email.

Most im­por­tant thing here is to cre­ate a cus­tom fil­ter which al­lows you to dis­play nice UTF-8 emo­jis in­stead of the cat­e­go­ry.

CATEGORY_TO_EMOJI = {
    "IMG": "🖼️",
    "FONT": "",
    "WASM": "🖥️",
    "PAGE": "📃",
    "BLOG": "📝",
    "SUSPECT": "❓",
    "ARCHIVE": "📁",
    "BOT": "🤖",
    "SCRIPT": "",
    "STYLE": "",
    "FEED": "",
    "FAVICON": "",
    "SITEMAP": "🗺️",
    "ROOT": "🌱",
}

def emojify(cat):
    return CATEGORY_TO_EMOJI[cat] if cat in CATEGORY_TO_EMOJI else cat

# ... later add it to the environment:
env.filters['emojify'] = emojify

Def­i­nite­ly the most im­por­tant thing. Sec­ond most im­por­tant is ac­tu­al­ly cre­at­ing the jin­ja tem­plate:

<html>
    <head>
        <!-- Let jinja copy-paste some css file here -->
        <style>
            {% include 'm-grid.css' %}
        </style>
    </head>
    <body>
        <div class="container">
            <div class="m-row">
                <div class="m-col-m-6">
                    <h2><small>{{now}}</small></br>Website Analytics Report</h2>
                </div>
            </div>
            <div class="m-row">
                <div class="m-col-m-4">
                    <h2>Last 24 hours</h2>
                    <table>
                        <tr>
                            <th>Filetype</th>
                            <th>Filename</th>
                            <th>Hits</th>
                        </tr>
                        {%- for filename, ip, filetype in last_24_h.itertuples() %}
                        <tr> <td>{{filetype|emojify}}</td> <td>{{filename}}</td> <td>{{ip}}</td> </tr>
                        {%- endfor %}
                    </table>
                </div>
                <div class="m-col-m-2">
                    <h2>Top Blog Posts</h2>
                    <table>
                        <tr>
                            <th>Post</th>
                            <th>Hits</th>
                        </tr>
                        {%- for post, hits in top_blog_posts.iteritems() %}
                        <tr> <td>{{post}}</td> <td>{{hits}}</td> </tr>
                        {%- endfor %}
                    </table>
                </div>
                <!-- ... -->
            </div>
        </div>
    </body>
</html>

Which can be read and ren­dered with sim­plest jin­ja2 API:

from jinja2 import Environment, select_autoescape
env = Environment(
    loader=jinja2.FileSystemLoader('.'),
    autoescape=select_autoescape(['html'])
)
template = env.get_template("template.html")

data = template.render(
    now=datetime.datetime.now(),
    last_24_h=last_24_h[['ip', 'filetype']],
    top_blog_posts=top_blog_posts,
    # ... whatever you can dream of
)

Send the Mail

And off it goes!

s = smtplib.SMTP('localhost', port=...)
s.login("admin", "password")  # Yeah, these are my real credentials, of course!

msg = email.message.EmailMessage()
msg.set_charset('utf-8')
msg['Subject'] = 'Analytics for ' + page
msg['From'] = "analytics@example.com"
msg['To'] = ", ".join(["admin@the.page", "datascientist@void.online"])

msg.set_content('Please view in HTML')
msg.add_alternative(data, subtype='html')

s.send_message(msg)

s.quit()

Con­clu­sion

What a wall of code this was to­day! I hope you take some­thing out of this and maybe make use of it your­self! I be­lieve hav­ing con­trol over your own an­a­lyt­ics is su­per fun and keeps your mind think­ing about whether you are prop­er­ly in­ter­pret­ing your da­ta, train­ing your­self not to fall in­to traps.

Writ­ten in 45 min­utes, ed­it­ed in 10 min­utes.