{"id":1382,"date":"2016-11-28T22:06:38","date_gmt":"2016-11-28T22:06:38","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1382"},"modified":"2016-11-28T22:06:38","modified_gmt":"2016-11-28T22:06:38","slug":"with-python-in-hive","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1382","title":{"rendered":"With Python in Hive"},"content":{"rendered":"<p>This short note describes how an HDFS file can be stored in a Hive context. If it is stored in a Hive context, it can be accessed from outside via ODBC. It is also possible to access the data as a SQL-compliant database. The idea is that an abstraction is created on top of the HDFS datasets. One may then access the HDFS datasets much like an ordinary database.<br \/>\nWe will use Python via Spark. This avoids the bottleneck that MapReduce creates.<br \/>\nOne starts Python via Spark with the command &#8220;pyspark&#8221;. If everything goes well, we see:<br \/>\n<a href=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2016\/11\/Untitled-14.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1388\" src=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2016\/11\/Untitled-14.png\" alt=\"untitled\" width=\"728\" height=\"143\" \/><\/a><br \/>\nTwo variables are important: sc, which is an anchor point for methods that can be used within Spark, and HiveContext, which can be used as a starting point for Hive methods.<\/p>\n<p>We first import the relevant library and create the context:<\/p>\n<pre>\nfrom pyspark.sql import HiveContext\nsqlContext = HiveContext(sc)\n<\/pre>\n<p>Then the table is defined:<\/p>\n<pre>\nsqlContext.sql(\"CREATE TABLE IF NOT EXISTS HiveTom (key STRING, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'\")\n<\/pre>\n<p>In the last step, an existing HDFS file is loaded into that table definition. Note that LOAD DATA INPATH moves the file into the table's storage location rather than copying it:<\/p>\n<pre>\nsqlContext.sql(\"LOAD DATA INPATH 'hdfs:\/Chapter5\/uit2' INTO TABLE HiveTom\")\n<\/pre>\n<p>We may now approach 
this dataset as a table. The table name is HiveTom. One possibility is to access the table via ODBC, for which we can download an ODBC connector. Each distribution (Cloudera, MapR, Hortonworks) has an ODBC connector. Once it is installed, we may retrieve the data in an ODBC-compliant tool. As an example, we may do this in Excel:<br \/>\n<a href=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2016\/11\/Untitled-15.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/62.131.51.129\/wp-content\/uploads\/2016\/11\/Untitled-15.png\" alt=\"untitled\" width=\"445\" height=\"302\" class=\"alignnone size-full wp-image-1393\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This short note describes how an HDFS file can be stored in a Hive context. If it is stored in a Hive context, it can be accessed from outside via ODBC. It is also possible to access the data as a SQL-compliant database. The idea is that an abstraction is created on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1383,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-1382","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-nice-to-know"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1382","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1382"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1382\/revisions"}
],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1382"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1382"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1382"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}