{"id":855,"date":"2015-06-15T18:43:50","date_gmt":"2015-06-15T18:43:50","guid":{"rendered":"http:\/\/www.van-maanen.com\/?p=855"},"modified":"2015-06-15T18:43:50","modified_gmt":"2015-06-15T18:43:50","slug":"map-and-reduce-what-happens","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=855","title":{"rendered":"Map and reduce &#8211; what happens?"},"content":{"rendered":"<p>In Big Data, the concept of mapping and reducing plays a huge role. The idea is that a massive dataset is split over several servers. On each server, a part of the data is processed. This step is called the map step, and the code that performs it is the mapper. In a subsequent step, the partial results are merged into an outcome. This latter step is called the reduce step. The two steps communicate through key-value pairs.<br \/>\nIn a well-known example (MaxTemperature), this mechanism is demonstrated in a Java programme. This programme consists of 3 classes: a driver, a mapper and a reducer. The driver is shown below.<\/p>\n<pre>\/\/ cc MaxTemperature Application to find the maximum temperature in the weather dataset\n\/\/ vv MaxTemperature\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapreduce.Job;\nimport org.apache.hadoop.mapreduce.lib.input.FileInputFormat;\nimport org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;\npublic class MaxTemperature {\npublic static void main(String[] args) throws Exception {\nif (args.length != 2) {\n\/\/ args[0] and args[1] may not exist here, so only print the usage message\nSystem.err.println(\"Usage: MaxTemperature [input path] [output path]\");\nSystem.exit(-1);\n}\n@SuppressWarnings(\"deprecation\")\nJob job = new Job();\njob.setJarByClass(MaxTemperature.class);\njob.setJobName(\"Max temperature\");\nFileInputFormat.addInputPath(job, new Path(args[0]));\nSystem.out.println(\"invoer is \" + args[0]);\nFileOutputFormat.setOutputPath(job, new 
Path(args[1]));\nSystem.out.println(\"uitvoer is \" + args[1]);\njob.setMapperClass(MaxTemperatureMapper.class);\njob.setReducerClass(MaxTemperatureReducer.class);\njob.setOutputKeyClass(Text.class);\njob.setOutputValueClass(IntWritable.class);\nSystem.exit(job.waitForCompletion(true) ? 0 : 1);\n}\n}\n\/\/ ^^ MaxTemperature\n\n<\/pre>\n<p>This programme registers the two other classes with the job. The mapper is registered via job.setMapperClass and the reducer via job.setReducerClass. The mapper is coded below:<\/p>\n<pre>\n\/\/ cc MaxTemperatureMapper Mapper for maximum temperature example\n\/\/ vv MaxTemperatureMapper\nimport java.io.IOException;\nimport java.io.BufferedWriter;\nimport java.io.File;\nimport java.io.FileWriter;\nimport java.util.Date;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.LongWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapreduce.Mapper;\npublic class MaxTemperatureMapper\nextends Mapper<LongWritable, Text, Text, IntWritable> {\nprivate static final int MISSING = 9999;\n@Override\npublic void map(LongWritable key, Text value, Context context)\nthrows IOException, InterruptedException {\nString line = value.toString();\nString year = line.substring(15, 19);\nFile file = new File(\"\/home\/hduser\/example-mapper.txt\");\nif (!file.exists()) {\n\tfile.createNewFile();\n}\nFileWriter fw = new FileWriter(file.getAbsoluteFile(),true);\nBufferedWriter output = new BufferedWriter(fw);\nDate date = new Date();\nint airTemperature;\nif (line.charAt(87) == '+') { \/\/ parseInt doesn't like leading plus signs\nairTemperature = Integer.parseInt(line.substring(88, 92));\n} else {\nairTemperature = Integer.parseInt(line.substring(87, 92));\n}\noutput.append(\"mappert is jaar \" + date.toString() +\">\"+ year + \" temp  \" + airTemperature + \"\\n\");\noutput.close();\nString quality = line.substring(92, 93);\nif (airTemperature != MISSING && quality.matches(\"[01459]\")) {\ncontext.write(new Text(year), new IntWritable(airTemperature));\n}\n}\n}\n\/\/ ^^ 
MaxTemperatureMapper\n<\/pre>\n<p>In this class, the input is read as a key-value pair. In turn, the output is written as a new key-value pair, which consists of a year and a temperature measurement. To see exactly which values are communicated, the mapper also appends each pair to a local file. The file (&#8220;\/home\/hduser\/example-mapper.txt&#8221;) contains these lines:<\/p>\n<pre>\nmappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12341\nmappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12342\nmappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12343\nmappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12345\n<\/pre>\n<p>The key-value pairs that are communicated are 1975 &#8211; 12341, 1975 &#8211; 12342, etc. The resulting key-value pairs are processed in the subsequent reducer part, which has this code:<\/p>\n<pre>\n\n\/\/ cc MaxTemperatureReducer Reducer for maximum temperature example\n\/\/ vv MaxTemperatureReducer\nimport java.io.IOException;\nimport java.io.BufferedWriter;\nimport java.io.File;\nimport java.io.FileWriter;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapreduce.Reducer;\npublic class MaxTemperatureReducer\nextends Reducer<Text, IntWritable, Text, IntWritable> {\n@Override\npublic void reduce(Text key, Iterable<IntWritable> values,\nContext context)\nthrows IOException, InterruptedException {\nint maxValue = Integer.MIN_VALUE;\nFile file = new File(\"\/home\/hduser\/example-reducer.txt\");\nBufferedWriter output = new BufferedWriter(new FileWriter(file));\nfor (IntWritable value : values) {\nint waarde = value.get();\nmaxValue = Math.max(maxValue, waarde);\noutput.write(\"mappert is gelezen waarde \" + waarde + \" max  \" + maxValue + \"\\n\");\n}\ncontext.write(key, new IntWritable(maxValue));\noutput.close();\n}\n\n}\n\/\/ ^^ MaxTemperatureReducer\n<\/pre>\n<p>This part also writes a file that contains the values as they are processed. 
The values are 12345, 12343, etc.<\/p>\n<pre>\nmappert is gelezen waarde 12345 max  12345\nmappert is gelezen waarde 12343 max  12345\nmappert is gelezen waarde 12342 max  12345\nmappert is gelezen waarde 12341 max  12345\n<\/pre>\n<p>From these values the maximum is calculated.<br \/>\nThe final result (key and maximum) can then be read from the HDFS output file with:<br \/>\n\/usr\/local\/hadoop\/bin\/hadoop dfs -cat \/user\/output51\/part-r-00000. This shows: 1975\t12345, which is the final outcome of this exercise.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Big Data, the concept of mapping and reducing plays a huge role. The idea is that a massive dataset is split over several servers. On each server, a part of the data is investigated. This part is called a mapper. In a subsequent part, these parts are merged into an outcome. This latter [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":856,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-855","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-warehousing"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/855","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=855"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/855\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%
2Fv2%2Fmedia&parent=855"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=855"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=855"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}