Databricks on AWS

sync.awsdatabricks.apply_project_recommendation(job_id: str, project_id: str, recommendation_id: str) sync.api.Response[str]

Updates jobs with project recommendation

Parameters
  • job_id (str) – ID of job to apply prediction to

  • project_id (str) – Sync project ID

  • recommendation_id (str) – Sync project recommendation ID

Returns

ID of applied recommendation

Return type

Response[str]

sync.awsdatabricks.create_and_record_run(run: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str]

Applies the Databricks run configuration and creates a prediction based on the result.

If a project is specified the resulting prediction is added to it. This function waits for run to complete.

Parameters
  • run (dict) – Databricks run configuration

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.create_and_wait_for_run(run: dict) sync.api.Response[str]

Creates a Databricks run from the incoming configuration and returns the final status.

This function waits for the run to complete.

Parameters

run (dict) – Databricks run configuration

Returns

result state, e.g. “SUCCESS”

Return type

Response[str]

sync.awsdatabricks.create_cluster(config: dict) sync.api.Response[str]

Create Databricks cluster from the provided configuration.

The configuration should match that described here: https://docs.databricks.com/dev-tools/api/latest/clusters.html#create

Parameters

config (dict) – cluster configuration

Returns

cluster ID

Return type

Response[str]

sync.awsdatabricks.create_run(run: dict) sync.api.Response[str]

Creates a run based off the incoming Databricks run configuration

Parameters

run (dict) – run object

Returns

run ID

Return type

Response[str]

sync.awsdatabricks.create_submission_for_run(run_id: str, plan_type: str, compute_type: str, project_id: str, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync.api.Response[str]

Create a Submission for the specified Databricks run.

Parameters
  • run_id (str) – Databricks run ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional) – Sync project ID, defaults to None

  • allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable

  • exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the prediction

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.create_submission_with_cluster_info(run_id: str, project_id: str, cluster: Dict, cluster_info: Dict, cluster_activity_events: Dict, plan_type: sync.models.DatabricksPlanType, compute_type: sync.models.DatabricksComputeType, skip_eventlog: bool = False, dbfs_eventlog_file_size: int = 0) sync.api.Response[str]

Create a Submission for the specified Databricks run given a cluster report

sync.awsdatabricks.get_access_report(log_url: Optional[str] = None) sync.models.AccessReport

Reports access to Databricks, AWS and Sync required for integrating jobs with Sync. Access is partially determined by the configuration of this library and boto3.

Parameters

log_url (str, optional) – location of event logs, defaults to None

Returns

access report

Return type

AccessReport

sync.awsdatabricks.get_all_cluster_events(cluster_id: str)

Fetches all ClusterEvents for a given Databricks cluster, optionally within a time window. Pages will be followed and returned as 1 object

sync.awsdatabricks.get_cluster(cluster_id: str) sync.api.projects.Response[dict]

Get Databricks cluster.

Parameters

cluster_id (str) – cluster ID

Returns

cluster object

Return type

Response[dict]

sync.awsdatabricks.get_cluster_report(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[DatabricksClusterReport]

Fetches the cluster information required to create a Sync prediction

Parameters
  • run_id (str) – Databricks run ID

  • plan_type (str) – Databricks Pricing Plan, e.g. “Standard”

  • compute_type (str) – Cluster compute type, e.g. “Jobs Compute”

  • project_id (str, optional) – The Sync Project ID this report should be generated for. This is good to provide in general, but especially for multi-cluster jobs.

  • allow_incomplete (bool, optional, defaults to False) – Whether creating a cluster report with incomplete data should be allowable

  • exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the report

Returns

cluster report

Return type

Response[DatabricksClusterReport]

sync.awsdatabricks.get_project_cluster(cluster: dict, project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict]

Apply project configuration to a cluster.

The cluster is updated with tags and a log configuration to facilitate project continuity.

Parameters
  • cluster (dict) – Databricks cluster object

  • project_id (str) – Sync project ID

  • region_name (str, optional) – region name, defaults to AWS configuration

Returns

project job object

Return type

Response[dict]

sync.awsdatabricks.get_project_cluster_settings(project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict]

Gets cluster configuration for a project.

This configuration is intended to be used to update the cluster of a Databricks job so that its runs can be included in a Sync project.

Parameters
  • project_id (str) – Sync project ID

  • region_name (str, optional) – region name, defaults to AWS configuration

Returns

project cluster settings - a subset of a Databricks cluster object

Return type

Response[dict]

sync.awsdatabricks.get_project_job(job_id: str, project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict]

Apply project configuration to a job.

The job can only have tasks that run on the same job cluster. That cluster is updated with tags and a log configuration to facilitate project continuity. The result can be tested in a one-off run or applied to an existing job to surface run-time (see run_job_object()) or cost optimizations.

Parameters
  • job_id (str) – ID of basis job

  • project_id (str) – Sync project ID

  • region_name (str, optional) – region name, defaults to AWS configuration

Returns

project job object

Return type

Response[dict]

sync.awsdatabricks.get_recommendation_job(job_id: str, project_id: str, recommendation_id: str) sync.api.projects.Response[dict]

Apply the recommendation to the specified job.

The basis job can only have tasks that run on the same cluster. That cluster is updated with the configuration from the prediction and returned in the result job configuration. Use this function to apply a prediction to an existing job or test a prediction with a one-off run.

Parameters
  • job_id (str) – basis job ID

  • project_id (str) – Sync project ID

  • recommendation_id (str) – recommendation ID

Returns

job object with recommendation applied to it

Return type

Response[dict]

sync.awsdatabricks.handle_successful_job_run(job_id: str, run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[Dict[str, str]]

Create’s Sync project submissions for each eligible cluster in the run (see record_run())

If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.

For projects with auto_apply_recs=True, apply latest recommended cluster configurations.

Parameters
  • job_id (str) – Databricks job ID

  • run_id (str) – Databricks run ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional, defaults to None) – Sync project ID

  • allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable

  • exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude

Returns

map of project ID to submission or prediction ID

Return type

Response[Dict[str, str]]

sync.awsdatabricks.record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[Dict[str, str]]

Create’s Sync project submissions for each eligible cluster in the run.

If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.

Parameters
  • run_id (str) – Databricks run ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional, defaults to None) – Sync project ID

  • allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable

  • exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude

Returns

map of project ID to submission or prediction ID

Return type

Response[Dict[str, str]]

sync.awsdatabricks.run_and_record_job(job_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str]

Runs the specified job and creates a prediction based on the result.

If a project is specified the prediction is added to it.

Parameters
  • job_id (str) – Databricks job ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_and_record_job_object(job: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str]

Creates a one-off Databricks run based on the provided job object.

Job tasks must use the same job cluster, and that cluster must be configured to store the event logs in S3.

Parameters
  • job (dict) – Databricks job object

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_and_record_project_job(job_id: str, project_id: str, plan_type: str, compute_type: str, region_name: Optional[str] = None) sync.api.Response[str]

Runs the specified job and adds the result to the project.

This function waits for the run to complete.

Parameters
  • job_id (str) – Databricks job ID

  • project_id (str) – Sync project ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • region_name (str, optional) – region name, defaults to AWS configuration

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.run_job_object(job: dict) sync._databricks.Response[Tuple[str, str]]

Create a Databricks one-off run based on the job configuration.

Parameters

job (dict) – Databricks job object

Returns

run ID, and optionally ID of newly created cluster

Return type

Response[Tuple[str, str]]

sync.awsdatabricks.terminate_cluster(cluster_id: str) sync.api.projects.Response[dict]

Terminate Databricks cluster and wait to return final state.

Parameters

cluster_id (str) – Databricks cluster ID

Returns

Databricks cluster object with state: “TERMINATED”

Return type

Response[str]

sync.awsdatabricks.wait_for_and_record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str]

Waits for a run to complete before creating a prediction.

The run must save 1 event log to S3. If a project is specified the prediction is added to that project.

Parameters
  • run_id (str) – Databricks run ID

  • plan_type (str) – either “Standard”, “Premium” or “Enterprise”

  • compute_type (str) – e.g. “Jobs Compute”

  • project_id (str, optional) – Sync project ID, defaults to None

Returns

prediction ID

Return type

Response[str]

sync.awsdatabricks.wait_for_final_run_status(run_id: str) sync.api.Response[str]

Waits for run returning final status.

Parameters

run_id (str) – Databricks run ID

Returns

result state, e.g. “SUCCESS”

Return type

Response[str]

sync.awsdatabricks.wait_for_run_and_cluster(run_id: str) sync.api.Response[str]

Waits for final run status and returns it after terminating the cluster.

Parameters

run_id (str) – Databricks run ID

Returns

result state, e.g. “SUCCESS”

Return type

Response[str]