Log-Parser

A modular log parser that parses NASA's Apache Web logs and stores processed logs as CSV.

This code builds per-host sessions from the raw log and assigns a session ID to every line in the log.
Definition of session: A session is a window of activity from a single host. A session ends when that host has been inactive for at least 15 minutes.
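
The session rule above can be sketched as follows. This is a minimal illustration, not the code in LogParser.py: the function name `assign_session_ids`, the constant `SESSION_TIMEOUT`, and the input shape (a time-ordered list of `(host, epoch_seconds)` pairs) are assumptions made for the example. The `<host>_<n>` ID format follows the output sample in this README.

```python
SESSION_TIMEOUT = 15 * 60  # seconds of inactivity after which a session ends


def assign_session_ids(events):
    """Return one session ID per event.

    `events` is assumed to be an iterable of (host, epoch_seconds)
    pairs already sorted by time (hypothetical input shape).
    """
    last_seen = {}  # host -> timestamp of that host's previous request
    counters = {}   # host -> number of sessions started so far
    session_ids = []
    for host, ts in events:
        prev = last_seen.get(host)
        # First request from this host, or a gap of at least 15 minutes:
        # start a new session for the host.
        if prev is None or ts - prev >= SESSION_TIMEOUT:
            counters[host] = counters.get(host, 0) + 1
        last_seen[host] = ts
        session_ids.append(f"{host}_{counters[host]}")
    return session_ids
```

Keeping only the last-seen timestamp and a counter per host lets the whole log be processed in one pass, provided the lines arrive in time order.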

Input (Raw) log schema:
<host name> <log name> <time> <method> <url> <response> <bytes>
Sample: piweba3y.prodigy.com - 807301196 GET /shuttle/missions/missions.html 200 8677

Output (Processed) log schema:
<host name>, <session ID>, <date DD-MM-YYYY>, <time HH:MM:SS>, <method>, <response>
Sample: piweba3y.prodigy.com, piweba3y.prodigy.com_1, 01-08-1995, 18:19:56, GET, 200

| Field name | Description |
| --- | --- |
| host | When possible, the hostname making the request. Uses the IP address if the hostname was unavailable. |
| logname | Unused, always - |
| time | In seconds, since 1970 |
| method | HTTP method: GET, HEAD, or POST |
| url | Requested path |
| response | HTTP response code |
| bytes | Number of bytes in the reply |
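
As an illustration of how a raw line maps onto the processed schema, here is a small sketch. The helper name `to_output_row` is hypothetical, and two things are assumed: fields contain no embedded whitespace, and the epoch time is rendered in UTC (which reproduces the sample output shown above). The actual script may handle either differently.

```python
from datetime import datetime, timezone


def to_output_row(raw_line, session_id):
    """Convert one raw log line into a processed, comma-separated row."""
    # Assumes exactly seven whitespace-separated fields per line.
    host, logname, epoch, method, url, response, nbytes = raw_line.split()
    # `time` is seconds since 1970; interpreted here as UTC (assumption).
    when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return ", ".join([
        host,
        session_id,
        when.strftime("%d-%m-%Y"),
        when.strftime("%H:%M:%S"),
        method,
        response,
    ])


raw = "piweba3y.prodigy.com - 807301196 GET /shuttle/missions/missions.html 200 8677"
print(to_output_row(raw, "piweba3y.prodigy.com_1"))
# piweba3y.prodigy.com, piweba3y.prodigy.com_1, 01-08-1995, 18:19:56, GET, 200
```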

Dataset:

NASA Apache Web Logs (http://opensource.indeedeng.io/imhotep/docs/sample-data/)

Pre-processing
Some of the host names contain commas; these have been replaced by dots. Otherwise, when the .csv file is read, the part of the host name after the comma would be read as the log name (the next column), and every other attribute in the same log entry would be shifted and misread.
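
The effect of this replacement can be seen in a short sketch. The helper `sanitize_host` and the host name `gw1,example.com` are made up for illustration; only the comma-to-dot substitution itself comes from the description above.

```python
import csv


def sanitize_host(host):
    """Replace commas in a host name with dots, as described above."""
    return host.replace(",", ".")


def column_count(host, logname="-"):
    """Join two fields with a bare comma and count the columns read back."""
    line = ",".join([host, logname])
    return len(next(csv.reader([line])))


bad_host = "gw1,example.com"  # made-up host name containing a comma
print(column_count(bad_host))                  # 3: the host spills into the next column
print(column_count(sanitize_host(bad_host)))   # 2: host and log name stay aligned
```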

Usage:

1. Cloning the repository.
git clone https://github.com/roshni-b/Log-Parser.git
cd Log-Parser
2. Decompressing log files.
gunzip nasa_19950630.22-19950728.12.tsv.gz
gunzip nasa_19950731.22-19950831.22.tsv.gz
3. Running the script.
python LogParser.py
