Apache Sqoop: Import from RDBMS to HDFS in ORC Format

The Apache Sqoop import tool offers the capability to import data from an RDBMS table (MySQL, Oracle, SQL Server, etc.) to HDFS. Sqoop import provides native support to store data in text files as well as binary formats such as Avro and Parquet. There's no native support to import in ORC format. However, it's still possible to import in ORC format by using Sqoop and Hive in tandem.
First, let's see how Sqoop import natively supports text format and the binary formats Avro and Parquet. After that, we will see how we can leverage Hive with Sqoop to import in ORC format.
Import as Text Files
By default, the import tool stores the data as text files if no format is specified. We can also explicitly instruct the import tool to store the data as text files using the --as-textfile argument.
In the example below, we are importing data from a MySQL table called departments into the HDFS location /user/cloudera/import_example/text/. We are using the --as-textfile argument to instruct Sqoop to store the data in text format in HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/text/" \
--as-textfile
--connect, --username, and --password contain the connection details of the RDBMS.
--table is the RDBMS table from which the data will be imported.
--warehouse-dir is the parent output location in HDFS. In this location, Sqoop will create a directory with the name of the RDBMS table to contain all the data files.
--num-mappers specifies the number of map tasks to run in parallel. Each map task creates an output file. To keep the example simple, we are setting the number of map tasks to 1 so that only one output file is created. (A multi-mapper variant is sketched at the end of this section.)
After running the above Sqoop command, we can verify that the data was indeed stored as text. Listing the files in the output directory shows that the files have no file extension.
$ hadoop fs -ls /user/cloudera/import_example/text/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 03:53 /user/cloudera/import_example/text/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 60 2019-05-31 03:53 /user/cloudera/import_example/text/departments/part-m-00000
You can observe that the contents of /user/cloudera/import_example/text/departments/part-m-00000 are in text format.
$ hadoop fs -cat /user/cloudera/import_example/text/departments/part-m-00000 | head
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
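As mentioned above, here is a multi-mapper variant of the text import. With --num-mappers greater than 1, Sqoop splits the table across map tasks using a split column (the primary key by default, or one supplied explicitly with --split-by) and writes one output file per task. This is a sketch, not output from the walkthrough: the output path /user/cloudera/import_example/text_parallel/ and the choice of department_id as the split column are illustrative.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 2 \
--split-by department_id \
--warehouse-dir "/user/cloudera/import_example/text_parallel/" \
--as-textfile
With two mappers, the departments output directory should contain two data files, part-m-00000 and part-m-00001, each holding a slice of the table.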
Import as Binary Files – Avro/Parquet
Similar to the --as-textfile parameter, the import tool has parameters to import the data in the binary formats Avro and Parquet.
- Avro: --as-avrodatafile
- Parquet: --as-parquetfile (a Parquet import is sketched after the Avro example below)
Let's use the following command to import the data in Avro format. Note that we are passing the argument --as-avrodatafile.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/avro/" \
--as-avrodatafile
Listing the output directory shows that the data file has a .avro extension. We can peek into the file to confirm that it is indeed in Avro format.
$ hadoop fs -ls /user/cloudera/import_example/avro/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 450 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/part-m-00000.avro
$ hadoop fs -cat /user/cloudera/import_example/avro/departments/part-m-00000.avro | head
Objavro.schema�{"type":"record","name":"departments","doc":"Sqoop import of departments","fields":[{"name":"department_id","type":["null","int"],"default":null,"columnName":"department_id","sqlType":"4"},{"name":"department_name","type":["null","string"],"default":null,"columnName":"department_name","sqlType":"12"}],"tableName":"departments"}�M�ܣ����^�`�
�FitnessFootweaApparel
Golf
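The Parquet import is analogous; the sketch below only swaps the format argument and the output location (the path /user/cloudera/import_example/parquet/ is illustrative).
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/parquet/" \
--as-parquetfile
Listing the output directory should show a data file with a .parquet extension.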
Import as ORC Files
There is no equivalent of --as-avrodatafile and --as-parquetfile to save as an ORC file. However, we can leverage Sqoop's HCatalog integration feature to import as an ORC file. You can think of HCatalog as an API to access the Hive metastore. This method involves two steps.
- Create a Hive database with your desired HDFS warehouse location
- Run the Sqoop import command to import from the RDBMS table to the HCatalog table
Create Hive Database
A Hive database must exist in order to use Sqoop with HCatalog. We can create a Hive database and provide the location where the data of its managed tables will be stored. If we don't specify a location, the default warehouse location will be used.
The command below is run in the Hive CLI to create a Hive database called orc_db with hdfs:///user/cloudera/import_example/orc/ as the location. This location plays the role of the --warehouse-dir parameter of the Sqoop import.
hive> create database orc_db location "hdfs:///user/cloudera/import_example/orc/";
OK
Time taken: 0.431 seconds
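If you want to confirm where the database will keep its tables before importing, Hive can describe the database; the location column in its output should match the path we passed to the create database statement.
hive> describe database orc_db;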
Sqoop Import with HCatalog Integration
We can use Sqoop import to import from an RDBMS table to an HCatalog/Hive table. Sqoop import's HCatalog integration can even create the HCatalog/Hive table automatically during the import.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--hcatalog-database orc_db \
--hcatalog-table departments \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile"
--hcatalog-database refers to the Hive database that we want to use.
--hcatalog-table refers to the Hive table. In this example, we are calling it departments. A directory named departments will be created in the location we specified with the create database command.
--create-hcatalog-table indicates that the Hive table has to be created as part of the import job. The job will fail if the table already exists (see the re-run sketch after this list).
--hcatalog-storage-stanza specifies the storage format of the newly created table. In this example, we are storing it as an ORC file.
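As noted above, --create-hcatalog-table works only for the first run. Once the table exists, drop that flag; to the best of my understanding, Sqoop then appends the imported rows to the existing table, which keeps the storage format it was created with. A sketch of such a re-run, under the same assumptions as before:
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--hcatalog-database orc_db \
--hcatalog-table departments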
After we run the Sqoop command, we can see that a directory called departments was created under the database location hdfs:///user/cloudera/import_example/orc.
$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments
When we list the contents of the departments directory, we can see that one file was created. Though it doesn't have an extension, rest assured that it's an ORC file.
$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc/departments
Found 1 items
-rw-r--r-- 1 cloudera cloudera 377 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments/part-m-00000
We can verify that it’s an ORC file by peeking at the top few lines of the file. Observe that the first line of the file reads “ORC”.
$ hadoop fs -cat hdfs:///user/cloudera/import_example/orc/departments/part-m-00000 | head
ORC
P+
6PJ�R�be!i6%Q.vǂ�Ģ�!�Ғ���b�� �UFitnessFootwearApparelGolfOutdoorsFan Shop
Fxt�V�b�`���ь
�IBH3��`�IBH3i!&
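Because the import also registered a Hive table, we can sanity-check the data with a simple query (or, if your Hive version ships the ORC file dump utility, with hive --orcfiledump <path-to-file>). The query below is a verification step, not part of the import itself.
hive> select * from orc_db.departments limit 3;
The rows returned should match the contents of the MySQL departments table.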
TL;DR
Sqoop import has native support to import in Avro and Parquet formats but not in ORC format. We can leverage Sqoop import's HCatalog integration to import the data from an RDBMS to HDFS in ORC format.
First, we create a Hive database and specify its HDFS location. Then, we import data from the RDBMS table into a Hive table stored in ORC format. The ORC data is stored under a directory with the name of the Hive table, inside the HDFS location of the Hive database.