Databricks on AWS¶

sync.awsdatabricks.apply_project_recommendation(job_id: str, project_id: str, recommendation_id: str) → sync.api.Response[str]¶

Updates jobs with project recommendation

Parameters

job_id (str) – ID of job to apply prediction to
project_id (str) – Sync project ID
recommendation_id (str) – Sync project recommendation ID

Returns

ID of applied recommendation

Return type

Response[str]

sync.awsdatabricks.create_and_record_run(run: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) → sync.api.Response[str]¶

Applies the Databricks run configuration and creates a prediction based on the result.

If a project is specified the resulting prediction is added to it. This function waits for run to complete.

Parameters

run (dict) – Databricks run configuration
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.create_and_wait_for_run(run: dict) → sync.api.Response[str]¶

Creates a Databricks run from the incoming configuration and returns the final status.

This function waits for the run to complete.

Parameters: run (dict) – Databricks run configuration
Returns: result state, e.g. “SUCCESS”
Return type: Response[str]

sync.awsdatabricks.create_cluster(config: dict) → sync.api.Response[str]¶

Create Databricks cluster from the provided configuration.

The configuration should match that described here: https://docs.databricks.com/dev-tools/api/latest/clusters.html#create

Parameters: config (dict) – cluster configuration
Returns: cluster ID
Return type: Response[str]

sync.awsdatabricks.create_run(run: dict) → sync.api.Response[str]¶

Creates a run based off the incoming Databricks run configuration

Parameters: run (dict) – run object
Returns: run ID
Return type: Response[str]

sync.awsdatabricks.create_submission_for_run(run_id: str, plan_type: str, compute_type: str, project_id: str, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) → sync.api.Response[str]¶

Create a Submission for the specified Databricks run.

Parameters

run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the prediction

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.create_submission_with_cluster_info(run_id: str, project_id: str, cluster: Dict, cluster_info: Dict, cluster_activity_events: Dict, plan_type: sync.models.DatabricksPlanType, compute_type: sync.models.DatabricksComputeType, skip_eventlog: bool = False, dbfs_eventlog_file_size: int = 0) → sync.api.Response[str]¶: Create a Submission for the specified Databricks run given a cluster report

sync.awsdatabricks.get_access_report(log_url: Optional[str] = None) → sync.models.AccessReport¶

Reports access to Databricks, AWS and Sync required for integrating jobs with Sync. Access is partially determined by the configuration of this library and boto3.

Parameters: log_url (str, optional) – location of event logs, defaults to None
Returns: access report
Return type: AccessReport

sync.awsdatabricks.get_all_cluster_events(cluster_id: str)¶: Fetches all ClusterEvents for a given Databricks cluster, optionally within a time window. Pages will be followed and returned as 1 object

sync.awsdatabricks.get_cluster(cluster_id: str) → sync.api.projects.Response[dict]¶

Get Databricks cluster.

Parameters: cluster_id (str) – cluster ID
Returns: cluster object
Return type: Response[dict]

sync.awsdatabricks.get_cluster_report(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete: bool = False, exclude_tasks: Optional[Collection[str]] = None) → sync._databricks.Response[DatabricksClusterReport]¶

Fetches the cluster information required to create a Sync prediction

Parameters

run_id (str) – Databricks run ID
plan_type (str) – Databricks Pricing Plan, e.g. “Standard”
compute_type (str) – Cluster compute type, e.g. “Jobs Compute”
project_id (str, optional) – The Sync Project ID this report should be generated for. This is good to provide in general, but especially for multi-cluster jobs.
allow_incomplete (bool, optional, defaults to False) – Whether creating a cluster report with incomplete data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the report

Returns

cluster report

Return type

Response[DatabricksClusterReport]

sync.awsdatabricks.get_project_cluster(cluster: dict, project_id: str, region_name: Optional[str] = None) → sync.api.projects.Response[dict]¶

Apply project configuration to a cluster.

The cluster is updated with tags and a log configuration to facilitate project continuity.

Parameters

cluster (dict) – Databricks cluster object
project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration

Returns

project job object

Return type

Response[dict]

sync.awsdatabricks.get_project_cluster_settings(project_id: str, region_name: Optional[str] = None) → sync.api.projects.Response[dict]¶

Gets cluster configuration for a project.

This configuration is intended to be used to update the cluster of a Databricks job so that its runs can be included in a Sync project.

Parameters

project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration

Returns

project cluster settings - a subset of a Databricks cluster object

Return type

Response[dict]

sync.awsdatabricks.get_project_job(job_id: str, project_id: str, region_name: Optional[str] = None) → sync.api.projects.Response[dict]¶

Apply project configuration to a job.

The job can only have tasks that run on the same job cluster. That cluster is updated with tags and a log configuration to facilitate project continuity. The result can be tested in a one-off run or applied to an existing job to surface run-time (see run_job_object()) or cost optimizations.

Parameters

job_id (str) – ID of basis job
project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration

Returns

project job object

Return type

Response[dict]

sync.awsdatabricks.get_recommendation_job(job_id: str, project_id: str, recommendation_id: str) → sync.api.projects.Response[dict]¶

Apply the recommendation to the specified job.

The basis job can only have tasks that run on the same cluster. That cluster is updated with the configuration from the prediction and returned in the result job configuration. Use this function to apply a prediction to an existing job or test a prediction with a one-off run.

Parameters

job_id (str) – basis job ID
project_id (str) – Sync project ID
recommendation_id (str) – recommendation ID

Returns

job object with recommendation applied to it

Return type

Response[dict]

sync.awsdatabricks.handle_successful_job_run(job_id: str, run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) → sync._databricks.Response[Dict[str, str]]¶

Create’s Sync project submissions for each eligible cluster in the run (see record_run())

If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.

For projects with auto_apply_recs=True, apply latest recommended cluster configurations.

Parameters

job_id (str) – Databricks job ID
run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional, defaults to None) – Sync project ID
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude

Returns

map of project ID to submission or prediction ID

Return type

Response[Dict[str, str]]

sync.awsdatabricks.record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) → sync._databricks.Response[Dict[str, str]]¶

Create’s Sync project submissions for each eligible cluster in the run.

If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.

Parameters

run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional, defaults to None) – Sync project ID
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude

Returns

map of project ID to submission or prediction ID

Return type

Response[Dict[str, str]]

sync.awsdatabricks.run_and_record_job(job_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) → sync.api.Response[str]¶

Runs the specified job and creates a prediction based on the result.

If a project is specified the prediction is added to it.

Parameters

job_id (str) – Databricks job ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_and_record_job_object(job: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) → sync.api.Response[str]¶

Creates a one-off Databricks run based on the provided job object.

Job tasks must use the same job cluster, and that cluster must be configured to store the event logs in S3.

Parameters

job (dict) – Databricks job object
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_and_record_project_job(job_id: str, project_id: str, plan_type: str, compute_type: str, region_name: Optional[str] = None) → sync.api.Response[str]¶

Runs the specified job and adds the result to the project.

This function waits for the run to complete.

Parameters

job_id (str) – Databricks job ID
project_id (str) – Sync project ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
region_name (str, optional) – region name, defaults to AWS configuration

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_job_object(job: dict) → sync._databricks.Response[Tuple[str, str]]¶

Create a Databricks one-off run based on the job configuration.

Parameters: job (dict) – Databricks job object
Returns: run ID, and optionally ID of newly created cluster
Return type: Response[Tuple[str, str]]

sync.awsdatabricks.terminate_cluster(cluster_id: str) → sync.api.projects.Response[dict]¶

Terminate Databricks cluster and wait to return final state.

Parameters: cluster_id (str) – Databricks cluster ID
Returns: Databricks cluster object with state: “TERMINATED”
Return type: Response[str]

sync.awsdatabricks.wait_for_and_record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) → sync.api.Response[str]¶

Waits for a run to complete before creating a prediction.

The run must save 1 event log to S3. If a project is specified the prediction is added to that project.

Parameters

run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.wait_for_final_run_status(run_id: str) → sync.api.Response[str]¶

Waits for run returning final status.

Parameters: run_id (str) – Databricks run ID
Returns: result state, e.g. “SUCCESS”
Return type: Response[str]

sync.awsdatabricks.wait_for_run_and_cluster(run_id: str) → sync.api.Response[str]¶

Waits for final run status and returns it after terminating the cluster.

Parameters: run_id (str) – Databricks run ID
Returns: result state, e.g. “SUCCESS”
Return type: Response[str]

Databricks on AWS¶

Sync Library

Navigation

Related Topics