{"id":1085,"date":"2015-10-22T21:16:37","date_gmt":"2015-10-22T21:16:37","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1085"},"modified":"2015-10-22T21:16:37","modified_gmt":"2015-10-22T21:16:37","slug":"scoop-and-hive","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1085","title":{"rendered":"Sqoop and Hive"},"content":{"rendered":"<p>It is possible to use Sqoop to directly load from a RDBMS into Hive. This opens interesting possibilities. Data that are stored in a RDBMS and that need to be analysed on a cheaper platform, can be migrated via Sqoop to a Hadoop platform. Sqoop is generaly seen as a reliable medium to undertake such migration. The command is straigthforward:<\/p>\n<pre>sqoop import --connect jdbc:mysql:\/\/62.131.51.999\/tom --table persons -m 1 --username thom --password thom24257  --hive-import\n<\/pre>\n<p>I noticed from the log that the import is done in three\u00a0steps:<\/p>\n<ul>\n<li>In a first step, the data are imported from the RDMBS and the data are stored as HDFS datasets.<\/li>\n<li>The table is defined on Hive in the metadata store.<\/li>\n<li>In the third\u00a0step, the data are moved from the HDFS platform to the Hive Warehouse.<\/li>\n<\/ul>\n<p>An alternative script is provided below:<\/p>\n<pre>\nsqoop import-all-tables \\\n    -m 1 \\\n    --connect jdbc:mysql:\/\/quickstart:3306\/retail_db \\\n    --username=retail_dba \\\n    --password=cloudera \\\n    --compression-codec=snappy \\\n    --as-parquetfile \\\n    --warehouse-dir=\/user\/hive\/warehouse \\\n\t--hive-overwrite\n<\/pre>\n<p>This alternative shows how the HDFS datasets can be stored in a compressed format. Here snappy is used. The format is a parquet file. An alternative would be an avro file. That would be achieved by a &#8220;&#8211;as-avrodatafile \\&#8221; in stead of &#8220;&#8211;as-parquetfile \\&#8221;.<\/p>\n<p>Another alternative is importing from Oracle that can be done as:<\/p>\n<pre>\nsqoop import \\\n--connect \"jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))\" \\\n--table COUNTRIES \\\n--columns \"NAME, LOCAL\"  \\\n--num-mappers 1 \\\n--compression-codec=snappy \\\n--as-avrodatafile \\\n--username ANONYMOUS \\\n--password xxx99998 \\\n--warehouse-dir=\/user\/hive\/warehouse \\\n--hive-overwrite\n<\/pre>\n<p>Once the data are in the Hive metadatastore, they can be accessed via a jar-based JDBC connector or an ODBC connector. For the JDBC connector, I used Cloudera_HiveJDBC4_2.5.4.1006 that contains the necessary jars. For ODBC, I used ClouderaHiveODBC64.msi. Doing so, I could see the data in, say Excel:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/van-maanen.com\/wp-content\/uploads\/2015\/10\/Naamloos.png\" alt=\"\" width=\"337\" height=\"517\" class=\"alignnone size-full wp-image-1911\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>It is possible to use Sqoop to directly load from a RDBMS into Hive. This opens interesting possibilities. Data that are stored in a RDBMS and that need to be analysed on a cheaper platform, can be migrated via Sqoop to a Hadoop platform. Sqoop is generaly seen as a reliable medium to undertake such [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1086,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-1085","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-nice-to-know"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1085","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1085"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1085\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1085"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1085"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1085"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}