aztk.spark package

aztk.spark.client module

class aztk.spark.client.Client(secrets_configuration: aztk.spark.models.models.SecretsConfiguration)[source]

Bases: aztk.client.client.CoreClient

The client used to create and manage Spark clusters

cluster

aztk.spark.client.cluster.ClusterOperations – Cluster operations

job

aztk.spark.client.job.JobOperations – Job operations
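
For example, a client might be constructed like this (a minimal sketch; the SecretsConfiguration fields shown assume a service principal setup and should be checked against your aztk version):

    import aztk.spark
    from aztk.spark.models import SecretsConfiguration, ServicePrincipalConfiguration

    # NOTE: the field names below are illustrative assumptions; consult your
    # aztk version's SecretsConfiguration for the exact schema.
    secrets = SecretsConfiguration(
        service_principal=ServicePrincipalConfiguration(
            tenant_id="<tenant-id>",
            client_id="<client-id>",
            credential="<credential>",
            batch_account_resource_id="<batch-account-resource-id>",
            storage_account_resource_id="<storage-account-resource-id>",
        ))

    client = aztk.spark.Client(secrets)
    client.cluster  # ClusterOperations, documented below
    client.job      # JobOperations, documented below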

class aztk.spark.client.cluster.ClusterOperations(context)[source]

Bases: aztk.spark.client.base.operations.SparkBaseOperations

Spark ClusterOperations object

_core_cluster_operations

aztk.client.cluster.CoreClusterOperations

create(cluster_configuration: aztk.spark.models.models.ClusterConfiguration, wait: bool = False)[source]

Create a cluster.

Parameters:
  • cluster_configuration (ClusterConfiguration) – Configuration for the cluster to be created.
  • wait (bool) – if True, this function will block until the cluster creation is finished.
Returns:

A Cluster object representing the state and configuration of the cluster.

Return type:

aztk.spark.models.Cluster
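
A minimal create call might look like the following sketch (the ClusterConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ClusterConfiguration

    # NOTE: field names are illustrative assumptions.
    cluster_config = ClusterConfiguration(
        cluster_id="my-spark-cluster",
        size=4,                   # number of dedicated nodes
        vm_size="standard_d2_v2",
    )

    # wait=True blocks until cluster creation is finished.
    cluster = client.cluster.create(cluster_config, wait=True)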

delete(id: str, keep_logs: bool = False)[source]

Delete a cluster.

Parameters:
  • id (str) – the id of the cluster to delete.
  • keep_logs (bool) – If True, the logs related to this cluster in Azure Storage are not deleted. Defaults to False.
Returns:

True if the deletion process was successful.

Return type:

bool

get(id: str)[source]

Get details about the state of a cluster.

Parameters:id (str) – the id of the cluster to get.
Returns:A Cluster object representing the state and configuration of the cluster.
Return type:aztk.spark.models.Cluster
list()[source]

List all clusters.

Returns:
List of Cluster objects each representing the state
and configuration of the cluster.
Return type:List[aztk.spark.models.Cluster]
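
Together, get and list allow existing clusters to be inspected, as in this short sketch:

    for cluster in client.cluster.list():
        print(cluster.id)

    cluster = client.cluster.get("my-spark-cluster")
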
submit(id: str, application: aztk.spark.models.models.ApplicationConfiguration, remote: bool = False, wait: bool = False, internal: bool = False)[source]

Submit an application to a cluster.

Parameters:
  • id (str) – the id of the cluster to submit the application to.
  • application (aztk.spark.models.ApplicationConfiguration) – Application definition
  • remote (bool) – If True, the application file will not be uploaded, it is assumed to be reachable by the cluster already. This is useful when your application is stored in a mounted Azure File Share and not the client. Defaults to False.
  • wait (bool, optional) – If True, this function blocks until the application has completed. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. This only applies if the cluster’s SchedulingTarget is not set to SchedulingTarget.Any. Defaults to False.
Returns:

None
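
A submission might look like the following sketch (the ApplicationConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ApplicationConfiguration

    # NOTE: field names are illustrative assumptions.
    app = ApplicationConfiguration(
        name="pi-estimate",
        application="/path/to/pi.py",  # uploaded unless remote=True
        application_args=["100"],
    )

    client.cluster.submit("my-spark-cluster", app, wait=True)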

create_user(id: str, username: str, password: str = None, ssh_key: str = None)[source]

Create a user on every node in the cluster

Parameters:
  • id (str) – the id of the cluster to create the user on.
  • username (str) – the name of the user to create.
  • password (str, optional) – password for the user; must specify either ssh_key or password. Defaults to None.
  • ssh_key (str, optional) – ssh public key to create the user with; must specify either ssh_key or password. Defaults to None.
Returns:

None
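
For example, a user might be created with an SSH public key, as in this sketch:

    # One of ssh_key or password is required.
    with open("/home/me/.ssh/id_rsa.pub") as f:
        client.cluster.create_user("my-spark-cluster", "sparkuser", ssh_key=f.read())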

get_application_state(id: str, application_name: str)[source]

Get the state of a submitted application

Parameters:
  • id (str) – the id of the cluster the application was submitted to
  • application_name (str) – the name of the application to get
Returns:

the state of the application

Return type:

aztk.spark.models.ApplicationState

run(id: str, command: str, host=False, internal: bool = False, timeout=None)[source]

Run a bash command on every node in the cluster

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • command (str) – the bash command to execute on the node.
  • host (bool, optional) – if True, the command runs directly on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

list of NodeOutput objects containing the output of the run command

Return type:

List[aztk.spark.models.NodeOutput]
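
A sketch of running a command fleet-wide (the NodeOutput attribute names used here, id and output, are assumptions; check them against your aztk version):

    outputs = client.cluster.run("my-spark-cluster", "df -h /")
    for node_output in outputs:
        print(node_output.id, node_output.output)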

node_run(id: str, node_id: str, command: str, host=False, internal: bool = False, timeout=None, block=True)[source]

Run a bash command on the given node

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • node_id (str) – the id of the node in the cluster to run the command on.
  • command (str) – the bash command to execute on the node.
  • host (bool, optional) – if True, the command runs directly on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
  • block (bool, optional) – If True, the command blocks until execution is complete. Defaults to True.
Returns:

object containing the output of the run command

Return type:

aztk.spark.models.NodeOutput

copy(id: str, source_path: str, destination_path: str, host: bool = False, internal: bool = False, timeout: int = None)[source]

Copy a file to every node in a cluster.

Parameters:
  • id (str) – the id of the cluster to copy files with.
  • source_path (str) – the local path of the file to copy.
  • destination_path (str) – the path on each node the file is copied to.
  • host (bool, optional) – if True, the copy operation occurs on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

A list of NodeOutput objects representing the output of the copy command.

Return type:

List[aztk.spark.models.NodeOutput]

download(id: str, source_path: str, destination_path: str = None, host: bool = False, internal: bool = False, timeout: int = None)[source]

Download a file from every node in a cluster.

Parameters:
  • id (str) – the id of the cluster to copy files with.
  • source_path (str) – the path of the file to copy from.
  • destination_path (str, optional) – the local directory path where the output should be written. If None, a SpooledTemporaryFile will be returned in the NodeOutput object, else the file will be written to this path. Defaults to None.
  • host (bool, optional) – if True, the copy operation occurs on the host VM rather than in the Spark container. Defaults to False.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
  • timeout (int, optional) – The timeout in seconds for establishing a connection to the node. Defaults to None.
Returns:

A list of NodeOutput objects representing the output of the copy command.

Return type:

List[aztk.spark.models.NodeOutput]
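
A sketch of pushing a file to every node and pulling results back (paths are illustrative):

    # Copy a local file to the same path on every node.
    client.cluster.copy("my-spark-cluster", "conf/log4j.properties",
                        "/mnt/log4j.properties")

    # Download a file from every node into ./out (one NodeOutput per node).
    results = client.cluster.download("my-spark-cluster", "/mnt/results.csv",
                                      destination_path="./out")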

diagnostics(id, output_directory: str = None, brief: bool = False)[source]

Run diagnostics on every node in a cluster and download the results.

Parameters:
  • id (str) – the id of the cluster to run diagnostics on.
  • output_directory (str, optional) – the local directory path where the output should be written. If None, a SpooledTemporaryFile will be returned in the NodeOutput object, else the file will be written to this path. Defaults to None.
  • brief (bool, optional) – if True, collect a reduced set of diagnostic information. Defaults to False.
Returns:

A list of NodeOutput objects representing the output of the diagnostics command.

Return type:

List[aztk.spark.models.NodeOutput]

get_application_log(id: str, application_name: str, tail=False, current_bytes: int = 0)[source]

Get the log for a running or completed application

Parameters:
  • id (str) – the id of the cluster to run the command on.
  • application_name (str) – the name of the application to get the log of
  • tail (bool, optional) – If True, get the remaining bytes after current_bytes. Otherwise, the whole log will be retrieved. Only use this if streaming the log as it is being written. Defaults to False.
  • current_bytes (int) – Specifies the last seen byte, so only the bytes after current_bytes are retrieved. Only useful if streaming the log as it is being written. Only used if tail is True.
Returns:

a model representing the output of the application.

Return type:

aztk.spark.models.ApplicationLog
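
The tail and current_bytes parameters support polling a log while it is being written. A sketch, assuming ApplicationLog exposes log, total_bytes, and application_state attributes (assumptions to verify against your aztk version):

    import time

    current_bytes = 0
    while True:
        app_log = client.cluster.get_application_log(
            "my-spark-cluster", "pi-estimate", tail=True, current_bytes=current_bytes)
        print(app_log.log, end="")           # only the newly written bytes
        current_bytes = app_log.total_bytes  # remember where we left off
        if app_log.application_state == "completed":  # state value is an assumption
            break
        time.sleep(3)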

get_remote_login_settings(id: str, node_id: str)[source]

Get the remote login information for a node in a cluster

Parameters:
  • id (str) – the id of the cluster the node is in
  • node_id (str) – the id of the node in the cluster
Returns:

Object that contains the ip address and port combination to login to a node

Return type:

aztk.spark.models.RemoteLogin

wait(id: str, application_name: str)[source]

Wait until the application has completed

Parameters:
  • id (str) – the id of the cluster the application was submitted to
  • application_name (str) – the name of the application to wait for
Returns:

None

get_configuration(id: str)[source]

Get the initial configuration of the cluster

Parameters:id (str) – the id of the cluster
Returns:A ClusterConfiguration object representing the initial configuration of the cluster.
Return type:aztk.spark.models.ClusterConfiguration
ssh_into_master(id, username, ssh_key=None, password=None, port_forward_list=None, internal=False)[source]

Open an SSH tunnel to the Spark master node and forward the specified ports

Parameters:
  • id (str) – the id of the cluster
  • username (str) – the name of the user to open the ssh session with
  • ssh_key (str, optional) – the ssh_key to authenticate the ssh user with. Must specify either ssh_key or password.
  • password (str, optional) – the password to authenticate the ssh user with. Must specify either password or ssh_key.
  • port_forward_list (aztk.spark.models.PortForwardingSpecification, optional) – List of the ports to forward.
  • internal (bool, optional) – if True, this will connect to the node using its internal IP. Only use this if running within the same VNET as the cluster. Defaults to False.
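
A sketch of tunneling the Spark Web UI to localhost (the PortForwardingSpecification field names are assumptions; check them against your aztk version):

    from aztk.spark.models import PortForwardingSpecification

    client.cluster.ssh_into_master(
        "my-spark-cluster",
        username="sparkuser",
        password="<password>",  # or ssh_key=...
        port_forward_list=[
            # NOTE: field names are illustrative assumptions.
            PortForwardingSpecification(remote_port=8080, local_port=8080),
        ])
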
class aztk.spark.client.job.JobOperations(context)[source]

Bases: aztk.spark.client.base.operations.SparkBaseOperations

Spark JobOperations object

_core_job_operations

aztk.client.job.CoreJobOperations

list()[source]

List all jobs.

Returns:List of aztk.spark.models.Job objects each representing the state and configuration of the job.
Return type:List[aztk.spark.models.Job]
delete(id, keep_logs: bool = False)[source]

Delete a job.

Parameters:
  • id (str) – the id of the job to delete.
  • keep_logs (bool) – If True, the logs related to this job in Azure Storage are not deleted. Defaults to False.
Returns:

True if the deletion process was successful.

Return type:

bool

get(id)[source]

Get details about the state of a job.

Parameters:id (str) – the id of the job to get.
Returns:A Job object representing the state and configuration of the job.
Return type:aztk.spark.models.Job
get_application(id, application_name)[source]

Get information on a submitted application

Parameters:
  • id (str) – the name of the job the application was submitted to
  • application_name (str) – the name of the application to get
Returns:

Object representing the state and output of an application

Return type:

aztk.spark.models.Application

get_application_log(id, application_name)[source]

Get the log for a running or completed application

Parameters:
  • id (str) – the id of the job the application was submitted to.
  • application_name (str) – the name of the application to get the log of
Returns:

a model representing the output of the application.

Return type:

aztk.spark.models.ApplicationLog

list_applications(id)[source]

List all applications defined as part of a job

Parameters:id (str) – the id of the job to list the applications of
Returns:a list of all applications defined as a part of the job
Return type:List[aztk.spark.models.Application]
stop(id)[source]

Stop a submitted job

Parameters:id (str) – the id of the job to stop
Returns:None
stop_application(id, application_name)[source]

Stop a submitted application

Parameters:
  • id (str) – the id of the job the application belongs to
  • application_name (str) – the name of the application to stop
Returns:

True if the stop was successful, else False

Return type:

bool

submit(job_configuration: aztk.spark.models.models.JobConfiguration, wait: bool = False)[source]

Submit a job

Jobs are a cluster definition and one or many application definitions which run on the cluster. The job’s cluster will be allocated and configured, then the applications will be executed with their output stored in Azure Storage. When all applications have completed, the cluster will be automatically deleted.

Parameters:
  • job_configuration (aztk.spark.models.JobConfiguration) – the definition of the cluster and the applications to run on it.
  • wait (bool, optional) – if True, this function blocks until the job has completed. Defaults to False.
Returns:

Model representing the state of the job.

Return type:

aztk.spark.models.Job
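
A sketch of submitting and waiting on a job (the JobConfiguration fields shown are assumptions; check them against your aztk version):

    from aztk.spark.models import ApplicationConfiguration, JobConfiguration

    # NOTE: field names are illustrative assumptions.
    job_config = JobConfiguration(
        id="nightly-etl",
        applications=[
            ApplicationConfiguration(name="etl", application="/path/to/etl.py"),
        ],
        vm_size="standard_d2_v2",
        max_dedicated_nodes=4,
    )

    job = client.job.submit(job_config)
    client.job.wait("nightly-etl")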

wait(id)[source]

Wait until the job has completed.

Parameters:id (str) – the id of the job to wait for
Returns:None