aztk.spark package

aztk.spark.client module

class aztk.spark.client.Client(secrets_configuration: aztk.spark.models.models.SecretsConfiguration)[source]

Bases: aztk.client.client.CoreClient

The client used to create and manage Spark clusters

cluster

aztk.spark.client.cluster.ClusterOperations – Cluster operations

job

aztk.spark.client.job.JobOperations – Job operations
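
For example, a client might be constructed like this (a minimal sketch; the SecretsConfiguration fields shown assume a service principal setup and should be checked against your aztk version):

    import aztk.spark
    from aztk.spark.models import SecretsConfiguration, ServicePrincipalConfiguration

    # NOTE: the field names below are illustrative assumptions; consult your
    # aztk version's SecretsConfiguration for the exact schema.
    secrets = SecretsConfiguration(
        service_principal=ServicePrincipalConfiguration(
            tenant_id="<tenant-id>",
            client_id="<client-id>",
            credential="<credential>",
            batch_account_resource_id="<batch-account-resource-id>",
            storage_account_resource_id="<storage-account-resource-id>",
        ))

    client = aztk.spark.Client(secrets)
    client.cluster  # ClusterOperations, documented below
    client.job      # JobOperations, documented below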

class aztk.spark.client.cluster.ClusterOperations(context)[source]

Bases: aztk.spark.client.base.operations.SparkBaseOperations

Spark ClusterOperations object

_core_cluster_operations

aztk.client.cluster.CoreClusterOperations

create(cluster_configuration: aztk.spark.models.models.ClusterConfiguration, wait: bool = False)[source]

Create a cluster.

Parameters:
  • cluster_configuration (ClusterConfiguration) – Configuration for the cluster to be created.
  • wait (bool) – if True, this function will block until the cluster creation is finished.
Returns:

A Cluster object representing the state and configuration of the cluster.

Return type:

aztk.spark.models.Cluster
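
A minimal create call might look like the following sketch (the ClusterConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ClusterConfiguration

    # NOTE: field names are illustrative assumptions.
    cluster_config = ClusterConfiguration(
        cluster_id="my-spark-cluster",
        size=4,                   # number of dedicated nodes
        vm_size="standard_d2_v2",
    )

    # wait=True blocks until cluster creation is finished.
    cluster = client.cluster.create(cluster_config, wait=True)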

delete(id: str, keep_logs: bool = False)[source]

Delete a cluster.

Parameters:
  • id (str) – the id of the cluster to delete.
  • keep_logs (bool) – If True, the logs related to this cluster in Azure Storage are not deleted. Defaults to False.
Returns:

True if the deletion process was successful.

Return type:

bool

get(id: str)[source]

Get details about the state of a cluster.

Parameters:id (str) – the id of the cluster to get.
Returns:A Cluster object representing the state and configuration of the cluster.
Return type:aztk.spark.models.Cluster
list()[source]

List all clusters.

Returns:
List of Cluster objects each representing the state
and configuration of the cluster.
Return type:List[aztk.spark.models.Cluster]
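
Together, get and list allow existing clusters to be inspected, as in this short sketch:

    for cluster in client.cluster.list():
        print(cluster.id)

    cluster = client.cluster.get("my-spark-cluster")
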
submit(id: str, application: aztk.spark.models.models.ApplicationConfiguration, remote: bool = False, wait: bool = False, internal: bool = False)[source]

Submit an application to a cluster.

Parameters:
  • id (str) – the id of the cluster to submit the application to.
  • application (aztk.spark.models.ApplicationConfiguration) – Application definition
  • remote (bool) – If True, the application file will not be uploaded, it is assumed to be reachable by the cluster already. This is useful when your application is stored in a mounted Azure File Share and not the client. Defaults to False.
  • wait (bool, optional) – If True, this function blocks until the application has completed. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. This only applies if the cluster’s SchedulingTarget is not set to SchedulingTarget.Any. Defaults to False.
Returns:

None
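
A submission might look like the following sketch (the ApplicationConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ApplicationConfiguration

    # NOTE: field names are illustrative assumptions.
    app = ApplicationConfiguration(
        name="pi-estimate",
        application="/path/to/pi.py",  # uploaded unless remote=True
        application_args=["100"],
    )

    client.cluster.submit("my-spark-cluster", app, wait=True)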

create_user(id: str, username: str, password: str = None, ssh_key: str = None)[source]

Create a user on every node in the cluster

Parameters:
  • id (str) – the id of the cluster to create the user on.
  • username (str) – the name of the user to create.
  • password (str, optional) – password for the user; must specify either ssh_key or password. Defaults to None.
  • ssh_key (str, optional) – ssh public key to create the user with; must specify either ssh_key or password. Defaults to None.
Returns:

None
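
For example, a user might be created with an SSH public key, as in this sketch:

    # One of ssh_key or password is required.
    with open("/home/me/.ssh/id_rsa.pub") as f:
        client.cluster.create_user("my-spark-cluster", "sparkuser", ssh_key=f.read())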

get_application_state(id: str, application_name: str)[source]

Get the state of a submitted application

Parameters:
  • id (str) – the id of the cluster the application was submitted to
  • application_name (str) – the name of the application to get
Returns:

the state of the application

Return type:

aztk.spark.models.ApplicationState

run(id: str, command: str, host=False, internal: bool = False, timeout=None)[source]

Run a bash command on every node in the cluster

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • command (str) – the bash command to execute on the node.
  • host (bool, optional) – if True, the command runs directly on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

list of NodeOutput objects containing the output of the run command

Return type:

List[aztk.spark.models.NodeOutput]
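
A sketch of running a command fleet-wide (the NodeOutput attribute names used here, id and output, are assumptions; check them against your aztk version):

    outputs = client.cluster.run("my-spark-cluster", "df -h /")
    for node_output in outputs:
        print(node_output.id, node_output.output)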

node_run(id: str, node_id: str, command: str, host=False, internal: bool = False, timeout=None, block=True)[source]

Run a bash command on the given node

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • node_id (str) – the id of the node in the cluster to run the command on.
  • command (str) – the bash command to execute on the node.
  • host (bool, optional) – if True, the command runs directly on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
  • block (bool, optional) – If True, the command blocks until execution is complete. Defaults to True.
Returns:

object containing the output of the run command

Return type:

aztk.spark.models.NodeOutput

copy(id: str, source_path: str, destination_path: str, host: bool = False, internal: bool = False, timeout: int = None)[source]

Copy a file to every node in a cluster.

Parameters:
  • id (str) – the id of the cluster to copy files with.
  • source_path (str) – the local path of the file to copy.
  • destination_path (str) – the path on each node the file is copied to.
  • host (bool, optional) – if True, the copy operation occurs on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

A list of NodeOutput objects representing the output of the copy command.

Return type:

List[aztk.spark.models.NodeOutput]

download(id: str, source_path: str, destination_path: str = None, host: bool = False, internal: bool = False, timeout: int = None)[source]

Download a file from every node in a cluster.

Parameters:
  • id (str) – the id of the cluster to copy files with.
  • source_path (str) – the path of the file to copy from.
  • destination_path (str, optional) – the local directory path where the output should be written. If None, a SpooledTemporaryFile will be returned in the NodeOutput object, else the file will be written to this path. Defaults to None.
  • host (bool, optional) – if True, the copy operation occurs on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

A list of NodeOutput objects representing the output of the copy command.

Return type:

List[aztk.spark.models.NodeOutput]
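
A sketch of pushing a file to every node and pulling results back (paths are illustrative):

    # Copy a local file to the same path on every node.
    client.cluster.copy("my-spark-cluster", "conf/log4j.properties",
                        "/mnt/log4j.properties")

    # Download a file from every node into ./out (one NodeOutput per node).
    results = client.cluster.download("my-spark-cluster", "/mnt/results.csv",
                                      destination_path="./out")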

diagnostics(id, output_directory: str = None, brief: bool = False)[source]

Run diagnostics on every node in a cluster and download the results.

Parameters:
  • id (str) – the id of the cluster to run diagnostics on.
  • output_directory (str, optional) – the local directory path where the output should be written. If None, a SpooledTemporaryFile will be returned in the NodeOutput object, else the file will be written to this path. Defaults to None.
  • brief (bool, optional) – if True, collect a reduced set of diagnostic information. Defaults to False.
Returns:

A list of NodeOutput objects representing the output of the diagnostics command.

Return type:

List[aztk.spark.models.NodeOutput]

get_application_log(id: str, application_name: str, tail=False, current_bytes: int = 0)[source]

Get the log for a running or completed application

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • application_name (str) – the name of the application to get the log of
  • tail (bool, optional) – If True, get the remaining bytes after current_bytes. Otherwise, the whole log will be retrieved. Only use this if streaming the log as it is being written. Defaults to False.
  • current_bytes (int) – Specifies the last seen byte, so only the bytes after current_bytes are retrieved. Only useful if streaming the log as it is being written. Only used if tail is True.
Returns:

a model representing the output of the application.

Return type:

aztk.spark.models.ApplicationLog
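
The tail and current_bytes parameters support polling a log while it is being written. A sketch, assuming ApplicationLog exposes log, total_bytes, and application_state attributes (assumptions to verify against your aztk version):

    import time

    current_bytes = 0
    while True:
        app_log = client.cluster.get_application_log(
            "my-spark-cluster", "pi-estimate", tail=True, current_bytes=current_bytes)
        print(app_log.log, end="")           # only the newly written bytes
        current_bytes = app_log.total_bytes  # remember where we left off
        if app_log.application_state == "completed":  # state value is an assumption
            break
        time.sleep(3)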

get_remote_login_settings(id: str, node_id: str)[source]

Get the remote login information for a node in a cluster

Parameters:
  • id (str) – the id of the cluster the node is in
  • node_id (str) – the id of the node in the cluster
Returns:

Object that contains the ip address and port combination to login to a node

Return type:

aztk.spark.models.RemoteLogin

wait(id: str, application_name: str)[source]

Wait until the application has completed

Parameters:
  • id (str) – the id of the cluster the application was submitted to
  • application_name (str) – the name of the application to wait for
Returns:

None

get_configuration(id: str)[source]

Get the initial configuration of the cluster

Parameters:id (str) – the id of the cluster
Returns:A ClusterConfiguration object representing the initial configuration of the cluster.
Return type:aztk.spark.models.ClusterConfiguration
ssh_into_master(id, username, ssh_key=None, password=None, port_forward_list=None, internal=False)[source]

Open an SSH tunnel to the Spark master node and forward the specified ports

Parameters:
  • id (str) – the id of the cluster
  • username (str) – the name of the user to open the ssh session with
  • ssh_key (str, optional) – the ssh_key to authenticate the ssh user with. Must specify either ssh_key or password.
  • password (str, optional) – the password to authenticate the ssh user with. Must specify either password or ssh_key.
  • port_forward_list (aztk.spark.models.PortForwardingSpecification, optional) – List of the ports to forward.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
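
A sketch of tunneling the Spark Web UI to localhost (the PortForwardingSpecification field names are assumptions; check them against your aztk version):

    from aztk.spark.models import PortForwardingSpecification

    client.cluster.ssh_into_master(
        "my-spark-cluster",
        username="sparkuser",
        password="<password>",  # or ssh_key=...
        port_forward_list=[
            # NOTE: field names are illustrative assumptions.
            PortForwardingSpecification(remote_port=8080, local_port=8080),
        ])
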
class aztk.spark.client.job.JobOperations(context)[source]

Bases: aztk.spark.client.base.operations.SparkBaseOperations

Spark JobOperations object

_core_job_operations

aztk.client.job.CoreJobOperations

list()[source]

List all jobs.

Returns:List of aztk.spark.models.Job objects each representing the state and configuration of the job.
Return type:List[aztk.spark.models.Job]
delete(id, keep_logs: bool = False)[source]

Delete a job.

Parameters:
  • id (str) – the id of the job to delete.
  • keep_logs (bool) – If True, the logs related to this job in Azure Storage are not deleted. Defaults to False.
Returns:

True if the deletion process was successful.

Return type:

bool

get(id)[source]

Get details about the state of a job.

Parameters:id (str) – the id of the job to get.
Returns:A Job object representing the state and configuration of the job.
Return type:aztk.spark.models.Job
get_application(id, application_name)[source]

Get information on a submitted application

Parameters:
  • id (str) – the name of the job the application was submitted to
  • application_name (str) – the name of the application to get
Returns:

Object representing the state and output of an application

Return type:

aztk.spark.models.Application

get_application_log(id, application_name)[source]

Get the log for a running or completed application

Parameters:
  • id (str) – the id of the job the application was submitted to.
  • application_name (str) – the name of the application to get the log of
Returns:

a model representing the output of the application.

Return type:

aztk.spark.models.ApplicationLog

list_applications(id)[source]

List all applications defined as part of a job

Parameters:id (str) – the id of the job to list the applications of
Returns:a list of all applications defined as a part of the job
Return type:List[aztk.spark.models.Application]
stop(id)[source]

Stop a submitted job

Parameters:id (str) – the id of the job to stop
Returns:None
stop_application(id, application_name)[source]

Stop a submitted application

Parameters:
  • id (str) – the id of the job the application belongs to
  • application_name (str) – the name of the application to stop
Returns:

True if the stop was successful, else False

Return type:

bool

submit(job_configuration: aztk.spark.models.models.JobConfiguration, wait: bool = False)[source]

Submit a job

Jobs are a cluster definition and one or many application definitions which run on the cluster. The job’s cluster will be allocated and configured, then the applications will be executed with their output stored in Azure Storage. When all applications have completed, the cluster will be automatically deleted.

Parameters:
  • job_configuration (aztk.spark.models.JobConfiguration) – the definition of the cluster and the applications to run on it.
  • wait (bool, optional) – if True, this function blocks until the job has completed. Defaults to False.
Returns:

Model representing the state of the job.

Return type:

aztk.spark.models.Job
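
A sketch of submitting and waiting on a job (the JobConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ApplicationConfiguration, JobConfiguration

    # NOTE: field names are illustrative assumptions.
    job_config = JobConfiguration(
        id="nightly-etl",
        applications=[
            ApplicationConfiguration(name="etl", application="/path/to/etl.py"),
        ],
        vm_size="standard_d2_v2",
        max_dedicated_nodes=4,
    )

    job = client.job.submit(job_config)
    client.job.wait("nightly-etl")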

wait(id)[source]

Wait until the job has completed.

Parameters:id (str) – the id of the job to wait for
Returns:None