HDFS Copy Files and Directories: hadoop fs -cp

Copying files and directories is a routine part of managing data in the Hadoop Distributed File System (HDFS). The hadoop fs -cp command copies data from one HDFS path to another, which is useful for reorganizing, distributing, and backing up your data.

In this blog post, we will explore the hadoop fs -cp command’s usage and its most commonly used flags with examples to help you become proficient in copying files and directories in HDFS.

Command Syntax:

hadoop fs -cp [options] <source_path> <destination_path>

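source_path: The path to the file or directory to be copied. Several source paths can be given, in which case the destination must be a directory.
destination_path: The path to the new location of the file or directory.
options: The options below are the ones documented for hadoop fs -cp. Availability depends on the Hadoop version, so run hadoop fs -help cp to see what your installation supports.
- -f: Overwrite the destination if it already exists.
  - Example: hadoop fs -cp -f /user/hadoop/myfile.txt /user/hadoop/mydir
- -p: Preserve file attributes. With no argument, -p preserves timestamps, ownership, and permissions. The form -p[topax] selects timestamps (t), ownership (o), permissions (p), ACLs (a), and extended attributes (x).
  - Example: hadoop fs -cp -p /user/hadoop/mydir /user/hadoop/newdir
- -d: Skip creation of the temporary file with the ._COPYING_ suffix.
  - Example: hadoop fs -cp -d /user/hadoop/myfile.txt /user/hadoop/mydir
- -t <thread_count>: Set the number of copy threads (default 1) in releases that support it. This helps when copying directories that contain more than one file.
  - Example: hadoop fs -cp -t 4 /user/hadoop/mydir /user/hadoop/newdir

Flags that hadoop fs -cp does not provide:
- -r: Not needed. Directories are copied recursively by default.
  - Example: hadoop fs -cp /user/hadoop/mydir /user/hadoop/newdir
- -ignore: There is no option to skip files by pattern. Select the files to copy with a glob in the source path, or use hadoop distcp with its -filters option.
  - Example: hadoop fs -cp "/user/hadoop/source/*.csv" /user/hadoop/destination/
- -delete: There is no option to remove the source after copying. Use hadoop fs -mv to move data within HDFS.
  - Example: hadoop fs -mv /user/hadoop/myfile.txt /user/hadoop/mydir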

Basic Copy Operation

This basic usage copies a file or directory from the source path to the destination path within HDFS.

hadoop fs -cp <source_path> <destination_path>

Example:

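In this example, the file file.txt is copied from the /user/hadoop/ directory into the copy directory in the same location. The copy directory should already exist; if it does not, the command typically writes the data to a new file named copy rather than into a directory.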

hadoop fs -cp /user/hadoop/file.txt /user/hadoop/copy/
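
As a hedged sketch of the same operation in script form, the shell function below creates the destination directory if needed, copies the file into it, and lists the result. The function name copy_and_verify is hypothetical and not part of Hadoop; the sketch assumes only that the hadoop fs -mkdir -p, -cp, and -ls subcommands are available on your installation.

```shell
# copy_and_verify is an illustrative helper name, not part of Hadoop.
# $1 = source file in HDFS
# $2 = destination directory in HDFS, written without a trailing slash
#
# Steps: create the directory (no error if it exists), copy the file into it
# under its original name, then list the directory to confirm the copy.
copy_and_verify() {
  hadoop fs -mkdir -p "$2" &&
  hadoop fs -cp "$1" "$2/" &&
  hadoop fs -ls "$2"
}
```

For the example above, the call would be copy_and_verify /user/hadoop/file.txt /user/hadoop/copy. The && chain stops at the first failing step, so the listing only runs after a successful copy.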

Overwrite Existing Destination (-f)

The -f flag forces the command to overwrite the destination if it already exists.

hadoop fs -cp -f <source_path> <destination_path>

Example:

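With the -f flag, the command copies overwrite.txt into the existing backup directory and replaces /user/hadoop/backup/overwrite.txt if that file is already there. The backup directory itself is not replaced. Without -f, the command fails when the destination file already exists.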

hadoop fs -cp -f /user/hadoop/overwrite.txt /user/hadoop/backup/
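
The choice between a plain copy and a forced one can be wrapped in a small shell function. This is only a sketch: the function name copy_maybe_force and its optional third argument are made up for illustration and are not part of Hadoop.

```shell
# copy_maybe_force is an illustrative helper name, not part of Hadoop.
# $1 = source path, $2 = destination path
# $3 = optional; pass the word "force" to overwrite an existing destination
copy_maybe_force() {
  if [ "${3:-}" = "force" ]; then
    # -f replaces the destination file if it already exists.
    hadoop fs -cp -f "$1" "$2"
  else
    # Without -f the copy fails when the destination file already exists.
    hadoop fs -cp "$1" "$2"
  fi
}
```

Keeping the forced form behind an explicit argument makes accidental overwrites less likely in scripts, for example copy_maybe_force /user/hadoop/overwrite.txt /user/hadoop/backup/ force.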

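Copy Directories Recursively (no flag needed)

hadoop fs -cp copies a directory together with all of its subdirectories and files by default. Unlike the Unix cp command, it has no -r flag, and passing one is typically rejected as an illegal option.

hadoop fs -cp <source_directory> <destination_directory>

Example:

The command below copies the source_dir directory and its contents. If /user/hadoop/destination/ already exists, the copy is created inside it as /user/hadoop/destination/source_dir; otherwise /user/hadoop/destination itself becomes the copy.

hadoop fs -cp /user/hadoop/source_dir /user/hadoop/destination/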
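
Because hadoop fs -cp descends into directories on its own, a directory copy in a script needs no recursion flag. The sketch below copies a directory and then lists the destination recursively so the nested entries can be checked. The function name copy_tree is hypothetical; the sketch assumes the -ls -R subcommand is available.

```shell
# copy_tree is an illustrative helper name, not part of Hadoop.
# $1 = source directory, $2 = destination path
copy_tree() {
  # No -r flag here: cp copies the directory contents recursively by itself.
  hadoop fs -cp "$1" "$2" || return 1
  # -ls -R walks the destination recursively so nested files are shown.
  hadoop fs -ls -R "$2"
}
```

For the example above, the call would be copy_tree /user/hadoop/source_dir /user/hadoop/destination/.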

Preserve Attributes (-p)

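The -p flag preserves attributes of the source in the copy. Used on its own, it preserves timestamps, ownership, and permissions. The extended form -p[topax] selects what to preserve: timestamps (t), ownership (o), permissions (p), ACLs (a), and extended attributes (x). The replication factor is not one of the preserved attributes.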

hadoop fs -cp -p <source_path> <destination_path>

Example:

Using the -p flag, the command copies the entire documents directory to the backup directory while preserving its attributes.

hadoop fs -cp -p /user/hadoop/documents /user/hadoop/backup/
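
The attribute letters accepted after -p are t, o, p, a, and x. As a hedged illustration, the function below checks the requested letters before building the flag, so a typo fails early instead of reaching Hadoop. The function name copy_preserving and its argument order are assumptions made for this sketch.

```shell
# copy_preserving is an illustrative helper name, not part of Hadoop.
# $1 = attribute letters drawn from t, o, p, a, x (empty string means plain -p)
# $2 = source path, $3 = destination path
copy_preserving() {
  case "$1" in
    # Any character outside the set topax is rejected before hadoop runs.
    *[!topax]*)
      echo "unsupported attribute letters: $1" >&2
      return 2
      ;;
  esac
  # An empty $1 yields "-p"; "ta" yields "-pta", and so on.
  hadoop fs -cp "-p$1" "$2" "$3"
}
```

For example, copy_preserving "" /user/hadoop/documents /user/hadoop/backup/ runs the command shown above, while copy_preserving topax requests ACLs and extended attributes as well.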

Copy and Rename Files

Renaming a file while copying it needs no flag: give the new file name as the destination path. The -t flag does not rename anything. In the Hadoop releases that support it, -t <thread_count> sets the number of threads used for the copy.

hadoop fs -cp <source_path> <destination_path>

Example:

In this example, the file file.txt is copied from the /user/hadoop/source/ directory to the /user/hadoop/target/ directory and is named new_file.txt in the destination. The target directory must already exist.

hadoop fs -cp /user/hadoop/source/file.txt /user/hadoop/target/new_file.txt
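
In Hadoop releases whose hadoop fs -help cp output lists a -t option, the value after -t is a thread count, which mainly helps when a directory with many files is copied. The sketch below validates the count before passing it on. The function name copy_with_threads is hypothetical, and the sketch assumes your release supports -t; on releases that do not, the command fails.

```shell
# copy_with_threads is an illustrative helper name, not part of Hadoop.
# $1 = thread count, $2 = source directory, $3 = destination path
copy_with_threads() {
  case "$1" in
    # Reject empty values, non-digits, and numbers starting with 0 (including 0).
    ''|*[!0-9]*|0*)
      echo "thread count must be a positive integer" >&2
      return 2
      ;;
  esac
  # Assumes a release where cp accepts -t <thread count>.
  hadoop fs -cp -t "$1" "$2" "$3"
}
```

A call such as copy_with_threads 4 /user/hadoop/source /user/hadoop/target would request four copy threads.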

Exclude Files Matching a Pattern

hadoop fs -cp has no -ignore flag and no other option for excluding files by pattern. Two alternatives are available: select only the files you want with a glob in the source path, or use hadoop distcp, whose -filters option takes a file of regular expressions for paths to exclude.

hadoop fs -cp "<source_glob>" <destination_path>

Example:

In this example, only the files with the “.csv” extension in /user/hadoop/source/ are copied to /user/hadoop/destination/, so log files and everything else in that directory are left out. The glob is quoted so that the local shell passes it to Hadoop unexpanded, and the destination must be an existing directory because the glob can match several files. This can be helpful, for instance, if you want to copy data files but not log files during a data migration process.

hadoop fs -cp "/user/hadoop/source/*.csv" /user/hadoop/destination/
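
Since hadoop fs -cp itself cannot exclude files by pattern, one workaround is to list the source directory and copy every entry except the unwanted ones. The function below is a minimal sketch of that idea, not a built-in feature: the name copy_top_level_except is made up, it assumes the -ls -C option (paths only) is available, and it filters only the top-level entries, so a subdirectory is copied whole, including any matching files inside it.

```shell
# copy_top_level_except is an illustrative helper name, not part of Hadoop.
# $1 = source directory, $2 = existing destination directory
# $3 = file-name suffix to skip, for example .log
copy_top_level_except() {
  # -ls -C prints one path per line with no other columns.
  hadoop fs -ls -C "$1" | while IFS= read -r path; do
    case "$path" in
      *"$3") continue ;;   # skip entries whose path ends with the suffix
    esac
    # Redirect stdin so the copy cannot consume the remaining path list.
    hadoop fs -cp "$path" "$2" < /dev/null
  done
}
```

A call such as copy_top_level_except /user/hadoop/source /user/hadoop/destination/ .log copies everything at the top level of the source except the log files. Each entry is copied with a separate hadoop invocation, so for large directories distcp is likely the better fit.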

Delete Source After Copy

hadoop fs -cp has no -delete flag; it always leaves the source in place. To end up with the data only at the destination, move it with hadoop fs -mv, or copy it and then remove the source with hadoop fs -rm once the copy has succeeded. (hadoop distcp does have a -delete option, but it removes files from the destination that are missing from the source, not the source itself.)

hadoop fs -mv <source_path> <destination_path>

Example:

In this example, the file file.txt is moved from the /user/hadoop/source/ directory to the /user/hadoop/destination/ directory. After the command completes, file.txt exists only in the destination directory.

hadoop fs -mv /user/hadoop/source/file.txt /user/hadoop/destination/
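
When a copy followed by a cleanup is preferred over a move, the two steps can be chained so that the source is removed only after the copy succeeds. The sketch below shows one way to do that; the function name copy_then_remove is hypothetical and not part of Hadoop.

```shell
# copy_then_remove is an illustrative helper name, not part of Hadoop.
# $1 = source path, $2 = destination path
copy_then_remove() {
  # Stop here if the copy fails, so the source is never removed prematurely.
  hadoop fs -cp "$1" "$2" || return 1
  # -r lets the same helper remove a directory source as well as a file.
  hadoop fs -rm -r "$1"
}
```

For example, copy_then_remove /user/hadoop/source/file.txt /user/hadoop/destination/. If the HDFS trash feature is enabled, the removed source is moved to the trash rather than deleted immediately.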

Combining Multiple Flags

You can combine multiple flags in a single command to tailor the behavior of the hadoop fs -cp command according to your specific requirements. Only flags that the command actually supports can be combined; -r, -ignore, and -delete are not among them. Let’s walk through an example that uses -f and -p together:

hadoop fs -cp -f -p /user/hadoop/source/ /user/hadoop/destination/

The command performs the following actions:

Copies /user/hadoop/source/ and everything under it to /user/hadoop/destination/. Directories are copied recursively without any extra flag.
Overwrites destination files that already exist (-f).
Preserves timestamps, ownership, and permissions (-p).

Using multiple flags in a single command allows you to fine-tune the behavior of the hadoop fs -cp command, making it a versatile tool for your data management needs within the Hadoop ecosystem.