Apache Spark lets you execute Linux commands from the Spark Shell and the PySpark Shell. This comes in handy during development for tasks like listing the contents of an HDFS directory or a local directory.
These facilities come from the standard libraries of Scala and Python, so they can also be used within Spark and PySpark applications to execute Linux commands programmatically or to store their output in a variable.
The Spark Shell runs on Scala, so any Scala library can be used from it. Scala's standard library includes the sys.process package, which handles the execution of external processes and provides a simple way to run Linux commands.
After the sys.process._ package is imported, any Linux command can be run in the format "some linux command".!
For more information about sys.process, refer to the official documentation.
scala> import sys.process._
scala> "hadoop fs -ls /".!
scala> "echo 'hello, world!'".!
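The .! operator runs the command and returns its exit code. To store the command's output in a variable, as mentioned earlier, sys.process also provides the .!! operator, which returns the command's standard output as a String (and throws an exception on a non-zero exit code). A minimal sketch, runnable in the Spark Shell or any plain Scala REPL:

```scala
import sys.process._

// .! executes the command and returns its exit code (0 on success)
val status: Int = "echo hello".!

// .!! executes the command and returns its standard output as a String
val output: String = "echo hello".!!
```
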
PySpark runs on Python, so any Python module can be used from it. Python's built-in os module provides operating-system-dependent functionality, including a method called system, which takes a command as an argument and executes it in a subshell.
First, import the os module. Then pass the Linux command to be executed as a parameter to the os.system method.
For more information on os.system, refer to the official documentation.
>>> import os
>>> os.system("hadoop fs -ls /")
>>> os.system("echo 'Hello, world!'")
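Note that os.system returns the command's exit status, not its output. To store the output in a variable, as mentioned earlier, one option is the standard subprocess module. A minimal sketch using subprocess.run:

```python
import subprocess

# subprocess.run captures the command's output, unlike os.system,
# which only returns the exit status
result = subprocess.run(
    ["echo", "Hello, world!"],
    capture_output=True,  # capture stdout and stderr
    text=True,            # decode output as str instead of bytes
)

print(result.returncode)  # exit status: 0 on success
print(result.stdout)      # the command's standard output
```
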
It’s fairly simple to execute Linux commands from the Spark Shell and the PySpark Shell. Scala’s sys.process package and Python’s os.system method can be used in the Spark Shell and PySpark Shell respectively to execute Linux commands, and the same libraries work within Spark applications as well.