{"id":1442,"date":"2016-12-16T10:12:30","date_gmt":"2016-12-16T10:12:30","guid":{"rendered":"http:\/\/62.131.51.129\/?p=1442"},"modified":"2016-12-16T10:12:30","modified_gmt":"2016-12-16T10:12:30","slug":"manipulating-avro","status":"publish","type":"post","link":"http:\/\/archief.van-maanen.com\/?p=1442","title":{"rendered":"Manipulating Avro"},"content":{"rendered":"<p>Avro files are binary files that contain both the data and the schema that describes the data. This makes Avro a very interesting file format. One may send such a file to any application that is able to read Avro files. Just as an example: one may write the file in (say) PHP and read it in (say) Java. In previous posts I showed how such a file can be written and read with PHP. <a href=\"http:\/\/62.131.51.129\/index.php\/2015\/08\/10\/avro-getting-it-work\/\">See an earlier post here.<\/a><br \/>\nIn this note I show how one may use a jar file to create and read an Avro file. The jar file is avro-tools-1.8.1.jar. This jar file enables us to create an Avro file from a schema definition and a JSON file. The schema file looks like:<\/p>\n<pre>\n{\n  \"type\" : \"record\",\n  \"name\" : \"twitter_schema\",\n  \"namespace\" : \"com.miguno.avro\",\n  \"fields\" : [ {\n    \"name\" : \"username\",\n    \"type\" : \"string\",\n    \"doc\"  : \"Name of the user account on Twitter.com\"\n  }, {\n    \"name\" : \"tweet\",\n    \"type\" : \"string\",\n    \"doc\"  : \"The content of the user's Twitter message\"\n  }, {\n    \"name\" : \"timestamp\",\n    \"type\" : \"long\",\n    \"doc\"  : \"Unix epoch time in seconds\"\n  } ],\n  \"doc\" : \"A basic schema for storing Twitter messages\"\n}\n<\/pre>\n<p>whereas the JSON data file looks like:<\/p>\n<pre>\n{\"username\":\"miguno\",\"tweet\":\"Rock: Nerf paper, scissors is fine.\",\"timestamp\":1366150681}\n{\"username\":\"BlizzardCS\",\"tweet\":\"Works as intended.  
Terran is IMBA.\",\"timestamp\":1366154481}\n<\/pre>\n<p>These two files can then be combined into an Avro file with:<\/p>\n<pre>\njava -jar \"C:\/Program Files\/Java\/avro-tools-1.8.1.jar\" fromjson --schema-file D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user.avsc D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user.json > D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user.avro\n<\/pre>\n<p>We now have an Avro file. This is a binary file. It can be translated back to a JSON file with:<\/p>\n<pre>\njava -jar \"C:\/Program Files\/Java\/avro-tools-1.8.1.jar\" tojson D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user.avro > D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user2.json\n<\/pre>\n<p>Likewise, the schema can be extracted with:<\/p>\n<pre>\njava -jar \"C:\/Program Files\/Java\/avro-tools-1.8.1.jar\" getschema D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user.avro > D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\user2.avsc\n<\/pre>\n<p>For me, this utility is very handy for inspecting the result of a Sqoop command. Roughly stated, such a Sqoop command imports the contents of a database table into HDFS. Such a command may look like:<\/p>\n<pre>\nsqoop import \\\n--connect \"jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))\" \\\n--username scott --password binvegni \\\n--table fam \\\n--columns \"NUMMER, NAAM\" \\\n--m 1 \\\n--target-dir \/loudacre\/fam_avro \\\n--null-non-string '\\\\N' \\\n--as-avrodatafile\n<\/pre>\n<p>The output of such a command might be an Avro file named, for example, part-m-00000.avro. The question is: how do I know that this file contains the correct data? 
I can then copy the Avro file to Windows and translate it with:<\/p>\n<pre>\njava -jar \"C:\/Program Files\/Java\/avro-tools-1.8.1.jar\" tojson D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\part-m-00000.avro > D:\\Users\\tmaanen\\CloudStation\\java\\avro2\\part-m-00000.json\n<\/pre>\n<p>Inspecting the resulting JSON file confirms that the Avro file contains the expected data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Avro files are binary files that contain both the data and the schema that describes the data. This makes Avro a very interesting file format. One may send such a file to any application that is able to read Avro files. Just as an example: one may write the file in (say) PHP and read it in (say) Java. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1443,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1442","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1442","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1442"}],"version-history":[{"count":0,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=\/wp\/v2\/posts\/1442\/revisions"}],"wp:attachment":[{"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1442"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&
post=1442"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/archief.van-maanen.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1442"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}