{"id":1558,"date":"2017-02-06T15:34:55","date_gmt":"2017-02-06T15:34:55","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1558"},"modified":"2017-02-06T15:34:55","modified_gmt":"2017-02-06T15:34:55","slug":"scala-merging-files","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1558","title":{"rendered":"Scala merging files"},"content":{"rendered":"<p>In a previous post, I showed how two files can be merged in Scala. The idea was that RDDs were translated as data frames and a join was undertaken on these.<br \/>\nIn this post, the philosophy is slightly different. Now the RDD is rewritten as a key-value pair with a unique key. This then allows a merge on this unique key.<br \/>\nLet us first see how an RDD can be created with a unique key:<\/p>\n<pre>\nval counts = sc.textFile(\"\/user\/hdfs\/keyvalue\").flatMap(line => line.split(',')).map(word => (word,1)).reduceByKey((v1,v2) => v1+v2)\n<\/pre>\n<p>The file &#8220;keyvalue&#8221; is read and each line is split on commas. Because flatMap is used, each word becomes a record of its own. If the original file contains 6 words, we end up with 6 records. We then map each word to a pair with the word as key and &#8220;1&#8221; as value. Finally, reduceByKey sums the values per key, so each distinct word ends up as a unique key with its count. This result could serve as a word count example.<br \/>\nI created a similar RDD (&#8220;counts1&#8221;) that also has the words as keys, but with the value &#8220;2&#8221; for each occurrence.<\/p>\n<pre>\nval counts1 = sc.textFile(\"\/user\/hdfs\/keyvalue\").flatMap(line => line.split(',')).map(word => (word,2)).reduceByKey((v1,v2) => v1+v2)\n<\/pre>\n<p>The join can then be performed as:<\/p>\n<pre>\nval pipo = counts1.join(counts)\n<\/pre>\n<p>The outcome can be shown with pipo.foreach(println). Each record has the form (word, (value from counts1, value from counts)).
<\/p>\n<p>A similar script runs as follows:<\/p>\n<pre>\nval kbreqs = sc.textFile(\"\/user\/hdfs\/keyuser\").filter(line => line.contains(\"KBDOC\")).keyBy(line => line.split(' ')(5))\nval kblist = sc.textFile(\"\/user\/hdfs\/keydoc\").keyBy(line => line.split(':')(0))\nval titlereqs = kbreqs.join(kblist)\n<\/pre>\n<p>A final script:<\/p>\n<pre>\nval logs = sc.textFile(\"\/user\/hdfs\/naamtoev\")\nval userreqs = logs.map(line => line.split(' ')).map(words => (words(0),1)).reduceByKey((v1,v2) => v1 + v2)\nval accountsdata = \"\/user\/hdfs\/naamstraat\"\nval accounts = sc.textFile(accountsdata).keyBy(line => line.split(',')(0))\nval accounthits = accounts.join(userreqs)\nfor (pair <- accounthits) {printf(\"%s, %s, %s, %s\\n\",pair._1,pair._2._1,\" score= \",pair._2._2)}\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In a previous post, I showed how two files can be merged in Scala. The idea was that RDDs were translated as data frames and a join was undertaken on these. In this post, the philosophy is slightly different. Now the RDD is rewritten as a key-value pair with a unique key. 
This then allows [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1559,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1558","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1558","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1558"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1558\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1558"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1558"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1558"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}