{"id":1483,"date":"2017-01-04T12:27:36","date_gmt":"2017-01-04T12:27:36","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1483"},"modified":"2017-01-04T12:27:36","modified_gmt":"2017-01-04T12:27:36","slug":"a-python-script-with-many-steps","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1483","title":{"rendered":"A Python script with many steps"},"content":{"rendered":"<p>PySpark is the Python API for Apache Spark. It therefore offers a wonderful combination of Spark, which can circumvent the limitations set by the MapReduce framework, and Python, which is relatively simple.<\/p>\n<p>The diagram below shows some steps that might be used.<\/p>\n<p>sc.textFile reads a file and exposes it as an RDD (resilient distributed dataset): a dataset that is distributed over nodes and that can be recreated quickly if part of it is lost.<br \/>\n<a href=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2017\/01\/Drawing1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1488\" src=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2017\/01\/Drawing1.jpg\" alt=\"\" width=\"451\" height=\"976\" \/><\/a><\/p>\n<p>flatMap creates multiple lines from one line, for example one line per word.<\/p>\n<p>map processes one line at a time. Here, two fields are created from each word: the original word and the length of that word.<\/p>\n<p>filter keeps only the lines that satisfy a condition.<\/p>\n<p>groupByKey groups the lines by the first field, which acts as the key.<\/p>\n<p>A second map then translates the grouped result into something human-readable.<\/p>\n<p>collect returns the results to the driver as a list, so that they can be displayed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>PySpark is the Python API for Apache Spark. It therefore offers a wonderful combination of Spark, which can circumvent the limitations set by the MapReduce framework, and Python, which is relatively simple. 
The diagram below shows some steps that might be used. sc.textFile reads a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1484,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-1483","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-warehousing"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1483","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1483"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1483\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1483"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1483"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1483"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}