{"id":1519,"date":"2017-01-30T13:29:58","date_gmt":"2017-01-30T13:29:58","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1519"},"modified":"2017-01-30T13:29:58","modified_gmt":"2017-01-30T13:29:58","slug":"getting-a-histogram-from-big-data-with-scala","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1519","title":{"rendered":"Getting a histogram from Big Data with Scala"},"content":{"rendered":"<p>Scala can be used as a tool to manipulate big data. Used together with Spark, it combines two strong tools: Spark, with its ability to bypass the MapReduce bottleneck, and Scala, with its short learning curve.<\/p>\n<p>The idea that Scala can be closely integrated with Spark is already clear when the Spark shell is started. It can be started with the command &#8220;spark-shell&#8221; from the terminal. After a few seconds the Spark banner is shown with the note &#8220;Using Scala version 2.10.4&#8221;. A few lines below, one sees &#8220;Spark context available as sc&#8221;. This sc object is what we will use.<\/p>\n<p>We have quite a number of files in an HDFS directory. They can be made accessible as an RDD via:<\/p>\n<pre>\nval rawblocks = sc.textFile(\"\/linkage\")\n<\/pre>\n<p>Such an RDD is a dataset composed of a number of lines. The number of lines can be seen with rawblocks.count(), which yields: res63: Long = 5749142.<br \/>\nTo do anything with this RDD, we must (1) remove the header lines, (2) split the lines into fields and (3) parse the numeric fields as numbers. 
Let us do so.<\/p>\n<p>Header lines can be removed with this function and filter:<\/p>\n<pre>\n\/\/ A header line is any line that contains the column name \"id_1\"\ndef isHeader(line: String): Boolean = {\n      line.contains(\"id_1\")\n      }\n\n\/\/ followed by\n\nval noheader = rawblocks.filter(x => !isHeader(x))\n<\/pre>\n<p>Splitting the lines and parsing the numeric fields is a bit cumbersome, but the code is pretty straightforward:<\/p>\n<pre>\n\/\/ Missing values are marked with \"?\"; map them to NaN\ndef toDouble(s: String) = {\n      if (\"?\".equals(s)) Double.NaN else s.toDouble\n      }\n\ncase class MatchData(id1: Int, id2: Int,\n      scores: Array[Double], matched: Boolean)\n\n\/\/ Fields 0-1 are ids, fields 2-10 are scores, field 11 is the match flag\ndef parse(line: String) = {\n      val pieces = line.split(',')\n      val id1 = pieces(0).toInt\n      val id2 = pieces(1).toInt\n      val scores = pieces.slice(2, 11).map(toDouble)\n      val matched = pieces(11).toBoolean\n      MatchData(id1, id2, scores, matched)\n      }\n\n\/\/ followed by\n\nval parsed = noheader.map(line => parse(line))\n<\/pre>\n<p>Then we are ready to create the histogram, here a count of the records per value of matched:<\/p>\n<pre>\nval matchCounts = parsed.map(md => md.matched).countByValue()\nval matchCountsSeq = matchCounts.toSeq\nmatchCountsSeq.foreach(println)\n<\/pre>\n<p>Finally, a less complicated script that calculates the mean of a column:<\/p>\n<pre>\nval famblocks = sc.textFile(\"\/user\/hdfs\/fam\")\ncase class MatchData(id1: Int, id2: String)\ndef parse(line: String) = {\n      val pieces = line.split(',')\n      val id1 = pieces(0).toInt\n      val id2 = pieces(1)\n      MatchData(id1,id2)\n      }\nval parsed = famblocks.map(line => parse(line))\n\/\/ mean of the numeric id1 column\nval meanId1 = parsed.map(md => md.id1).mean()\n<\/pre>\n<p>Here fam is an HDFS file whose lines each contain a number and a name, separated by a comma (for example 1,tom).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scala can be used as a tool to manipulate big data. 
Used together with Spark, it combines two strong tools: Spark, with its ability to bypass the MapReduce bottleneck, and Scala, with its short learning curve. The idea that Scala can be closely integrated with Spark is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1520,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1519","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1519","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1519"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1519\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1519"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1519"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1519"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}