{"id":801,"date":"2015-01-20T14:26:40","date_gmt":"2015-01-20T14:26:40","guid":{"rendered":"http:\/\/www.van-maanen.com\/?p=801"},"modified":"2015-01-20T14:26:40","modified_gmt":"2015-01-20T14:26:40","slug":"pig-yet-another-approach-to-handling-big-data","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=801","title":{"rendered":"Pig: yet another approach to handling big data"},"content":{"rendered":"<p>In <a href=\"http:\/\/www.van-maanen.com\/?p=780\" title=\"Hadoop: my first java programme\">another post<\/a>, I discussed how Java can be used to analyse data in a Big Data environment. The problem then lies with Java itself. Java is not a tool for the faint-hearted; it is difficult. Moreover, one must comply with a structure in which one writes two programmes: a map programme and a reduce programme. These programmes communicate through key-value pairs. This structure might be too strict for the problem at hand.<br \/>\n<!--more--><\/p>\n<p>Hence, Big Data development is difficult if one uses Java as the vehicle for analysing Big Data.<\/p>\n<p>Pig addresses these issues. This tool offers two advantages: it provides a relatively simple language, and it removes the need to express every step in terms of key-value pairs.<\/p>\n<p>The language is relatively simple to <a href=\"http:\/\/www.van-maanen.com\/piglatin_ref2.pdf\">learn<\/a>.<\/p>\n<p>Let me show a simple programme that helped me to understand what Pig is all about. 
I used this dataset:<\/p>\n<pre>\n10001\t42\t07\n10020\t42\t07\n10031\t42\t08\n10011\t42\t08\n10051\t42\t09\n<\/pre>\n<p>The programme is as follows:<\/p>\n<pre>\n-- Load the tab-separated file and name its three int fields.\nA = LOAD '\/infauser\/ww-ii-data.txt' USING PigStorage('\\t') AS (voorraad:int, year:int,lokatie:int);\ndescribe A;\n-- Group the tuples by lokatie; each group holds a bag of the matching tuples.\nX = GROUP A by lokatie;\ndescribe X;\n-- Count the tuples in each group.\nB = FOREACH X GENERATE group AS lokatie, COUNT(A.voorraad) AS voorraad;\n-- Print the result.\nDUMP B;\n<\/pre>\n<p>The first statement reads the records from the flat file. It loads a relation whose tuples have three fields: voorraad (stock), year and lokatie (location). The second statement (describe) verifies that structure. Its output is A: {voorraad: int,year: int,lokatie: int}, so A has the three fields that were expected.<br \/>\nAs a next step, the set of tuples is grouped by lokatie. The result can be seen in the output of the subsequent describe, which shows:<\/p>\n<pre>\nX: {group: int,A: {(voorraad: int,year: int,lokatie: int)}}\n<\/pre>\n<p>Each tuple in X has two levels. On the top level, we have group and A. On the level beneath, A is a bag of the original tuples, with the fields A.voorraad, A.year and A.lokatie. This implies that a subsequent step must refer to group and A.voorraad etc. In the next step, the lower level is aggregated with &#8220;COUNT&#8221;. The final step then shows the results as they are stored in B:<\/p>\n<pre>\n(7,2)\n(8,2)\n(9,1)\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In another post, I discussed how Java can be used to analyse data in a Big Data environment. The problem then lies with Java itself. Java is not a tool for the faint-hearted; it is difficult. 
Moreover, one must comply with a structure where one must write two programmes: a mapping programme and a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":802,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-801","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=801"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/801\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=801"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}