Analysing Apache HTTP Server Logs With Hadoop - Part 1   May 06, 2015

The Apache HTTP Server seems to be declining in popularity, but it still has a huge market share. I had been toying with MapReduce and Pig lately, and I thought that processing log files with Hadoop would be a cool little project to get the hang of things. I started with a MapReduce project using the Java API, and while researching I found that I could do it much more easily with Apache Pig. This post describes the Java API approach; a follow-up post will cover the Pig solution.

I created a Maven project in Eclipse with the following pom.xml file. This ensures I have the right libraries pulled into the project to extend the Mapper and Reducer classes.
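The original pom.xml isn't reproduced here, but the key part is the Hadoop dependency that provides the Mapper and Reducer base classes. A minimal sketch of that dependency block is below; the version (and whether your distribution wants `hadoop-client` or separate `hadoop-common`/`hadoop-mapreduce-client-core` artifacts) is an assumption and will vary with your cluster.

```xml
<!-- Illustrative only: the artifact and version should match
     the Hadoop distribution you are running against. -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
</dependency>
```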

The next step was to create the Main class. This would parse the command-line arguments, define the regular expression used to parse the Apache log format, determine the field indexes I wanted to generate counts for, and finally define where the output files should be stored.

The regular expression I used to parse the Apache Combined Log Format is below.

^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"

The regex string above is the escaped Java String version; you can see an explanation of the unescaped version here, along with a sample log entry showing how each section matches.
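As a quick sanity check, the regex can be exercised with plain java.util.regex before it goes anywhere near Hadoop. The sample line below is the standard Combined Log Format example, not an entry from my own logs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedLogParser {
    // The Combined Log Format regex from above, as a Java String literal.
    static final Pattern LOG = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        Matcher m = LOG.matcher(line);
        if (m.matches()) {
            System.out.println("ip=" + m.group(1));      // 127.0.0.1
            System.out.println("request=" + m.group(5)); // GET /apache_pb.gif HTTP/1.0
            System.out.println("status=" + m.group(6));  // 200
            System.out.println("bytes=" + m.group(7));   // 2326
        }
    }
}
```

Group 1 is the client IP, 4 the timestamp, 5 the request line, 6 the status code, 7 the response size, 8 the referrer and 9 the user agent.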

I decided it was best to keep the regular expression in the Main class: to handle another log format I only have to write a different regular expression, and the same CountMapper can be reused as-is.

The CountMapper class processes each entry in the log file to generate a k/v pair for every field we're interested in counting. The field indexes are encoded as a string of integers in the job conf variable fieldsToCount. For each field of interest, CountMapper emits the field's value as the key and an IntWritable of 1 as the value. The reducer will then sum these IntWritable values for each distinct key, generating a count of occurrences for that key.
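Hadoop boilerplate aside, the per-line emission logic can be sketched in plain Java. The class and method names here are mine, not the post's actual code, and fieldsToCount is simplified to an int array rather than a job conf string:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MapStep {
    static final Pattern LOG = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // For one log line, emit a (fieldValue, 1) pair for every field index in
    // fieldsToCount -- the same shape CountMapper writes as (Text, IntWritable)
    // pairs via context.write(). Lines that don't match the regex emit nothing.
    static List<Map.Entry<String, Integer>> map(String line, int[] fieldsToCount) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = LOG.matcher(line);
        if (m.matches()) {
            for (int f : fieldsToCount) {
                out.add(new SimpleEntry<>(m.group(f), 1));
            }
        }
        return out;
    }
}
```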

As stated above, the Reducer simply sums occurrences of the entries with the same key. Each CountReducer receives all CountMapper outputs sharing one key and sums them up, emitting a single record: the common key and an IntWritable representing the number of occurrences of that key.
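The reduce side can likewise be simulated without the Hadoop runtime. This sketch (names mine, not the post's) folds the (key, 1) pairs from the map step into per-key totals, which is the same arithmetic CountReducer performs over each key's Iterable of IntWritable values:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceStep {
    // Sum the 1s emitted for each key. In real Hadoop the shuffle phase
    // groups pairs by key first; merge() stands in for that grouping here.
    static Map<String, Integer> countByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }
}
```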

So when this is run, each line of input from the Apache log file is parsed into a number of fields according to the regular expression in the Main class. The indexes of the fields we want counted are passed to the CountMapper class, which emits one key/value pair per field of interest; if we are counting four fields, we emit four key/value pairs for every input line matching the regular expression, each with a value of 1.

Several CountReducer instances are then created, and each one handles all the records sharing a given key, summing their values. The output from each CountReducer is therefore the key it is totalling and a value representing the count of occurrences of that key in the input file.

To gather information more complex than simple occurrence counts, the Map and Reduce steps need to become more complicated, and it is at that point that the classes start to become less generic. So I decided to look into alternatives, found Pig to be more intuitive for this task, and will discuss the Pig solution in a follow-up post very soon.
