{"id":1539,"date":"2017-02-03T10:03:59","date_gmt":"2017-02-03T10:03:59","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1539"},"modified":"2017-02-03T10:03:59","modified_gmt":"2017-02-03T10:03:59","slug":"scala-and-rdds","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1539","title":{"rendered":"Scala and RDDs"},"content":{"rendered":"<p>RDDs are the basic unit in Scala on Spark. The abbreviation stands for Resilient Distributed Dataset, This shows that we are talking on full data sets that are stored persistently on a distributed network. So the unit of work is comparable to a table. We have two different operations on this RDD. These are a filter, where some rows are left out and a map where the rows are manipulated. Below, I collected two sets of examples of scala statements on such RDD. This might then be used as a cookbook for future use.<\/p>\n<p>The mapping functions:<\/p>\n<pre>\nval mydata_uc = mydata.map(line => line.toUpperCase())\nval myrdd2 = myrdd1.map(pair => JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])\nval myrdd2 = mydata.map(line => line.split(' '))\nval myrdd2 = mydata.flatMap(line => line.split(' '))\nvar pipo = logs.map(line => line.length)\nvar pipo = log.map(line => (line.split(' ')(0),line.split(' ')(2)))\nval users = sc.textFile(\"\/user\/hdfs\/keyvalue\").map(line => line.split(',')).map(fields => (fields(0),(fields(0),fields(1)))) \nval users = sc.textFile(\"\/user\/hdfs\/keyvalue\").keyBy(line => line.split(',')(0))\nval counts = sc.textFile(\"\/user\/hdfs\/keyvalue\").flatMap(line => line.split(',')).map(fields => (fields,1))\nval counts = sc.textFile(\"\/user\/hdfs\/keyvalue\").flatMap(line => line.split(',')).map(fields => (fields,1)).reduceByKey((v1,v2) => v1+v2)\nval avglens = sc.textFile(\"\/user\/hdfs\/naamtoev\").flatMap(line => line.split(\" \")).map(word => (word,word.length)).groupByKey().map(pair => (pair._1, pair._2.sum\/pair._2.size.toDouble)) \n<\/pre>\n<p>The filter function:<\/p>\n<pre>\nval mydata_uc = mydata.filter(line => line.startsWith(\"I\"))\nvar jpglogs = logs.filter(line => line.contains(\".jpg\"))\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>RDDs are the basic unit in Scala on Spark. The abbreviation stands for Resilient Distributed Dataset, This shows that we are talking on full data sets that are stored persistently on a distributed network. So the unit of work is comparable to a table. We have two different operations on this RDD. These are a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1540,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-1539","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-warehousing"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1539","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1539"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1539\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1539"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1539"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1539"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}