aztk.spark package¶
aztk.spark.client module¶
-
class
aztk.spark.client.
Client
(secrets_configuration: aztk.spark.models.models.SecretsConfiguration)[source]¶ Bases:
aztk.client.client.CoreClient
The client used to create and manage Spark clusters
-
cluster
¶
-
job
¶
-
-
class
aztk.spark.client.cluster.
ClusterOperations
(context)[source]¶ Bases:
aztk.spark.client.base.operations.SparkBaseOperations
Spark ClusterOperations object
-
_core_cluster_operations
¶ aztk.spark.client.cluster.CoreClusterOperations
-
_spark_base_cluster_operations
¶
-
create
(cluster_configuration: aztk.spark.models.models.ClusterConfiguration, wait: bool = False)[source]¶ Create a cluster.
Parameters: - cluster_configuration (
ClusterConfiguration
) – Configuration for the cluster to be created. - wait (
bool
) – if True, this function will block until the cluster creation is finished.
Returns: A Cluster object representing the state and configuration of the cluster.
Return type: aztk.spark.models.Cluster
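A minimal usage sketch of `create`, assuming `client` is an already-authenticated aztk.spark.Client and `configuration` is a prepared aztk.spark.models.ClusterConfiguration (neither is constructed here):

```python
# Hypothetical sketch: create a Spark cluster and block until it is ready.
# `client` and `configuration` are assumed to exist already.
def create_and_wait(client, configuration):
    # wait=True blocks until cluster creation has finished; the returned
    # Cluster object describes the cluster's state and configuration.
    return client.cluster.create(cluster_configuration=configuration, wait=True)
```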
-
delete
(id: str, keep_logs: bool = False)[source]¶ Delete a cluster.
Parameters: - id ( str ) – the id of the cluster to delete. - keep_logs ( bool ) – if True, the logs related to this cluster are not deleted. Defaults to False.
Returns: True if the deletion process was successful.
Return type: bool
-
get
(id: str)[source]¶ Get details about the state of a cluster.
Parameters: id ( str ) – the id of the cluster to get.
Returns: A Cluster object representing the state and configuration of the cluster.
Return type: aztk.spark.models.Cluster
-
list
()[source]¶ List all clusters.
Returns: List of Cluster objects, each representing the state and configuration of the cluster.
Return type: List[aztk.spark.models.Cluster]
-
submit
(id: str, application: aztk.spark.models.models.ApplicationConfiguration, remote: bool = False, wait: bool = False, internal: bool = False)[source]¶ Submit an application to a cluster.
Parameters: - id (
str
) – the id of the cluster to submit the application to. - application (
aztk.spark.models.ApplicationConfiguration
) – Application definition. - remote (
bool
) – If True, the application file will not be uploaded, it is assumed to be reachable by the cluster already. This is useful when your application is stored in a mounted Azure File Share and not the client. Defaults to False. - internal (
bool
) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. This only applies if the cluster’s SchedulingTarget is not set to SchedulingTarget.Any. Defaults to False. - wait (
bool
, optional) – If True, this function blocks until the application has completed. Defaults to False.
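A short sketch of `submit`, assuming `client` is an authenticated aztk.spark.Client and `app` is an aztk.spark.models.ApplicationConfiguration built elsewhere:

```python
# Hypothetical sketch: submit an application to a cluster and block
# until it has completed. `client` and `app` are assumed to exist.
def submit_and_wait(client, cluster_id, app):
    # wait=True blocks until the application has completed
    client.cluster.submit(id=cluster_id, application=app, wait=True)
```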
-
create_user
(id: str, username: str, password: str = None, ssh_key: str = None)[source]¶ Create a user on every node in the cluster
Parameters: - username (
str
) – name of the user to create. - pool_id (
str
) – id of the cluster to create the user on. - ssh_key (
str
, optional) – ssh public key to create the user with, must use ssh_key or password. Defaults to None. - password (
str
, optional) – password for the user, must use ssh_key or password. Defaults to None.
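Because `create_user` requires exactly one of ssh_key or password, a small wrapper can enforce that up front. This is a sketch, not part of the aztk API; `client` is assumed to be an authenticated aztk.spark.Client:

```python
# Hypothetical helper: create a cluster user with either an SSH public
# key or a password, but never both and never neither.
def add_cluster_user(client, cluster_id, username, ssh_key=None, password=None):
    # exactly one of ssh_key or password must be supplied
    if (ssh_key is None) == (password is None):
        raise ValueError("supply exactly one of ssh_key or password")
    client.cluster.create_user(
        id=cluster_id, username=username, ssh_key=ssh_key, password=password)
```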
-
get_application_state
(id: str, application_name: str)[source]¶ Get the state of a submitted application
Parameters: - id ( str ) – the id of the cluster the application was submitted to. - application_name ( str ) – the name of the application to get the state of.
Returns: the state of the application
Return type: aztk.spark.models.ApplicationState
-
list_applications
(id: str)[source]¶ Get all tasks that have been submitted to the cluster
Parameters: id ( str ) – the name of the cluster the tasks belong to
Returns: list of aztk applications
Return type: [aztk.spark.models.Application]
-
run
(id: str, command: str, host=False, internal: bool = False, timeout=None)[source]¶ Run a bash command on every node in the cluster
Parameters: - id (
str
) – the id of the cluster to run the command on. - command (
str
) – the bash command to execute on the node. - internal (
bool
) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False. - container_name=None (
str
, optional) – the name of the container to run the command in. If None, the command will run on the host VM. Defaults to None. - timeout=None (
str
, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns: list of NodeOutput objects containing the output of the run command
Return type: List[aztk.spark.models.NodeOutput]
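A sketch of collecting the per-node results of `run`. The NodeOutput attribute names used here (`id`, `output`) are assumptions, not confirmed by this reference; `client` is assumed to be an authenticated aztk.spark.Client:

```python
# Hypothetical sketch: run a bash command on every node and collect
# (node id, output) pairs from the returned NodeOutput objects.
def run_everywhere(client, cluster_id, command):
    return [(out.id, out.output)
            for out in client.cluster.run(id=cluster_id, command=command)]
```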
-
node_run
(id: str, node_id: str, command: str, host=False, internal: bool = False, timeout=None, block=True)[source]¶ Run a bash command on the given node
Parameters: - id (
str
) – the id of the cluster to run the command on. - node_id (
str
) – the id of the node in the cluster to run the command on. - command (
str
) – the bash command to execute on the node. - internal (
bool
) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False. - container_name=None (
str
, optional) – the name of the container to run the command in. If None, the command will run on the host VM. Defaults to None. - timeout=None (
str
, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None. - block=True (
bool
, optional) – If True, the command blocks until execution is complete.
Returns: object containing the output of the run command
Return type: aztk.spark.models.NodeOutput
-
copy
(id: str, source_path: str, destination_path: str, host: bool = False, internal: bool = False, timeout: int = None)[source]¶ Copy a file to every node in a cluster.
Parameters: - id (
str
) – the id of the cluster to copy files with. - source_path (
str
) – the local path of the file to copy. - destination_path (
str
, optional) – the path on each node the file is copied to. - container_name (
str
, optional) – the name of the container to copy to or from. If None, the copy operation will occur on the host VM. Defaults to None. - internal (
bool
, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False. - timeout (
int
, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns: A list of NodeOutput objects representing the output of the copy command.
Return type: List[aztk.spark.models.NodeOutput]
-
download
(id: str, source_path: str, destination_path: str = None, host: bool = False, internal: bool = False, timeout: int = None)[source]¶ Download a file from every node in a cluster.
Parameters: - id (
str
) – the id of the cluster to copy files with. - source_path (
str
) – the path of the file to copy from. - destination_path (
str
, optional) – the local directory path where the output should be written. If None, a SpooledTemporaryFile will be returned in the NodeOutput object, else the file will be written to this path. Defaults to None. - container_name (
str
, optional) – the name of the container to copy to or from. If None, the copy operation will occur on the host VM. Defaults to None. - internal (
bool
, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False. - timeout (
int
, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns: A list of NodeOutput objects representing the output of the copy command.
Return type: List[aztk.spark.models.NodeOutput]
-
diagnostics
(id, output_directory: str = None, brief: bool = False)[source]¶ Run diagnostics on every node in a cluster and download the resulting files.
Parameters: - id ( str ) – the id of the cluster to run diagnostics on. - output_directory ( str , optional) – the local directory to write the diagnostic output to. Defaults to None. - brief ( bool , optional) – if True, only a brief subset of the diagnostic output is collected. Defaults to False.
Returns: A list of NodeOutput objects representing the output of the diagnostic command.
Return type: List[aztk.spark.models.NodeOutput]
-
get_application_log
(id: str, application_name: str, tail=False, current_bytes: int = 0)[source]¶ Get the log for a running or completed application
Parameters: - id (
str
) – the id of the cluster to run the command on. - application_name (
str
) – the name of the application to get the log of. - tail (
bool
, optional) – If True, get the remaining bytes after current_bytes. Otherwise, the whole log will be retrieved. Only use this if streaming the log as it is being written. Defaults to False. - current_bytes (
int
) – Specifies the last seen byte, so only the bytes after current_bytes are retrieved. Only useful if streaming the log as it is being written. Only used if tail is True.
Returns: a model representing the output of the application.
Return type: aztk.spark.models.ApplicationLog
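The tail/current_bytes pair supports streaming a log as it is written. The following sketch shows one way to poll with it; the ApplicationLog attribute names (`log`, `total_bytes`, `application_state`) and the `"completed"` state value are assumptions, and `client` is assumed to be an authenticated aztk.spark.Client:

```python
import time

# Hypothetical sketch: stream an application's log by repeatedly fetching
# only the bytes written since the last poll (tail=True + current_bytes).
def stream_application_log(client, cluster_id, app_name, poll_seconds=5, sink=print):
    current_bytes = 0
    while True:
        log = client.cluster.get_application_log(
            id=cluster_id, application_name=app_name,
            tail=True, current_bytes=current_bytes)
        if log.log:
            sink(log.log)                      # emit only the new chunk
        current_bytes = log.total_bytes        # remember the last seen byte
        if log.application_state == "completed":
            return current_bytes
        time.sleep(poll_seconds)
```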
-
get_remote_login_settings
(id: str, node_id: str)[source]¶ Get the remote login information for a node in a cluster
Parameters: - id ( str ) – the id of the cluster the node is in. - node_id ( str ) – the id of the node to get the remote login settings for.
Returns: Object that contains the ip address and port combination to login to a node
Return type: aztk.spark.models.RemoteLogin
-
wait
(id: str, application_name: str)[source]¶ Wait until the application has completed
Parameters: - id ( str ) – the id of the cluster the application was submitted to. - application_name ( str ) – the name of the application to wait for.
-
get_configuration
(id: str)[source]¶ Get the initial configuration of the cluster
Parameters: id ( str ) – the id of the cluster
Returns: the initial configuration of the cluster
Return type: aztk.spark.models.ClusterConfiguration
-
ssh_into_master
(id, username, ssh_key=None, password=None, port_forward_list=None, internal=False)[source]¶ Open an SSH tunnel to the Spark master node and forward the specified ports
Parameters: - id (
str
) – the id of the cluster - username (
str
) – the name of the user to open the ssh session with - ssh_key (
str
, optional) – the ssh_key to authenticate the ssh user with. Must specify either ssh_key or password. - password (
str
, optional) – the password to authenticate the ssh user with. Must specify either password or ssh_key. - port_forward_list (
aztk.spark.models.PortForwardingSpecification
, optional) – List of the ports to forward. - internal (
bool
, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
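A sketch of opening a tunnel with port forwarding. The PortForwardingSpecification construction is not shown (its field names are not given in this reference), so the list is passed through as-is; `client` is assumed to be an authenticated aztk.spark.Client:

```python
# Hypothetical sketch: open an SSH tunnel to the Spark master node and
# forward the given ports (e.g. the master web UI).
def tunnel_to_master(client, cluster_id, username, ssh_key, port_forward_list):
    client.cluster.ssh_into_master(
        id=cluster_id, username=username, ssh_key=ssh_key,
        port_forward_list=port_forward_list)
```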
-
-
class
aztk.spark.client.job.
JobOperations
(context)[source]¶ Bases:
aztk.spark.client.base.operations.SparkBaseOperations
Spark JobOperations object
-
_core_job_operations
¶ aztk.client.cluster.CoreJobOperations
-
list
()[source]¶ List all jobs.
Returns: List of aztk.models.Job objects, each representing the state and configuration of the job.
Return type: List[Job]
-
delete
(id, keep_logs: bool = False)[source]¶ Delete a job.
Parameters: - id ( str ) – the id of the job to delete. - keep_logs ( bool ) – if True, the logs related to this job are not deleted. Defaults to False.
Returns: True if the deletion process was successful.
Return type: bool
-
get
(id)[source]¶ Get details about the state of a job.
Parameters: id ( str ) – the id of the job to get.
Returns: A job object representing the state and configuration of the job.
Return type: aztk.spark.models.Job
-
get_application
(id, application_name)[source]¶ Get information on a submitted application
Parameters: - id ( str ) – the id of the job the application was submitted to. - application_name ( str ) – the name of the application to get.
Returns: object representing the state and output of an application
Return type: aztk.spark.models.Application
-
get_application_log
(id, application_name)[source]¶ Get the log for a running or completed application
Parameters: - id ( str ) – the id of the job the application was submitted to. - application_name ( str ) – the name of the application to get the log of.
Returns: a model representing the output of the application.
Return type: aztk.spark.models.ApplicationLog
-
list_applications
(id)[source]¶ List all applications defined as part of a job
Parameters: id ( str ) – the id of the job to list the applications of
Returns: a list of all applications defined as part of the job
Return type: List[aztk.spark.models.Application]
-
stop_application
(id, application_name)[source]¶ Stops a submitted application
Parameters: - id ( str ) – the id of the job the application was submitted to. - application_name ( str ) – the name of the application to stop.
Returns: True if the stop was successful, else False
Return type: bool
-
submit
(job_configuration: aztk.spark.models.models.JobConfiguration, wait: bool = False)[source]¶ Submit a job
Jobs are a cluster definition and one or many application definitions which run on the cluster. The job’s cluster will be allocated and configured, then the applications will be executed with their output stored in Azure Storage. When all applications have completed, the cluster will be automatically deleted.
Parameters: - job_configuration (
aztk.spark.models.JobConfiguration
) – Model defining the job’s configuration. - wait (
bool
) – If True, blocks until job is completed. Defaults to False.
Returns: Model representing the state of the job.
Return type: aztk.spark.models.Job
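A minimal sketch of submitting a job, assuming `client` is an authenticated aztk.spark.Client and `job_configuration` is an aztk.spark.models.JobConfiguration prepared elsewhere:

```python
# Hypothetical sketch: submit a job (cluster definition + applications)
# and block until every application in it has completed; the job's
# cluster is deleted automatically afterwards.
def run_job(client, job_configuration):
    return client.job.submit(job_configuration=job_configuration, wait=True)
```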
-