Databricks on AWS¶
- sync.awsdatabricks.apply_project_recommendation(job_id: str, project_id: str, recommendation_id: str) sync.api.Response[str] ¶
Updates jobs with project recommendation
- Parameters
job_id (str) – ID of job to apply prediction to
project_id (str) – Sync project ID
recommendation_id (str) – Sync project recommendation ID
- Returns
ID of applied recommendation
- Return type
Response[str]
- sync.awsdatabricks.create_and_record_run(run: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str] ¶
Applies the Databricks run configuration and creates a prediction based on the result.
If a project is specified the resulting prediction is added to it. This function waits for run to complete.
- Parameters
run (dict) – Databricks run configuration
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.create_and_wait_for_run(run: dict) sync.api.Response[str] ¶
Creates a Databricks run from the incoming configuration and returns the final status.
This function waits for the run to complete.
- Parameters
run (dict) – Databricks run configuration
- Returns
result state, e.g. “SUCCESS”
- Return type
Response[str]
- sync.awsdatabricks.create_cluster(config: dict) sync.api.Response[str] ¶
Create Databricks cluster from the provided configuration.
The configuration should match that described here: https://docs.databricks.com/dev-tools/api/latest/clusters.html#create
- Parameters
config (dict) – cluster configuration
- Returns
cluster ID
- Return type
Response[str]
- sync.awsdatabricks.create_run(run: dict) sync.api.Response[str] ¶
Creates a run based off the incoming Databricks run configuration
- Parameters
run (dict) – run object
- Returns
run ID
- Return type
Response[str]
- sync.awsdatabricks.create_submission_for_run(run_id: str, plan_type: str, compute_type: str, project_id: str, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync.api.Response[str] ¶
Create a Submission for the specified Databricks run.
- Parameters
run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the prediction
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.create_submission_with_cluster_info(run_id: str, project_id: str, cluster: Dict, cluster_info: Dict, cluster_activity_events: Dict, plan_type: sync.models.DatabricksPlanType, compute_type: sync.models.DatabricksComputeType, skip_eventlog: bool = False, dbfs_eventlog_file_size: int = 0) sync.api.Response[str] ¶
Create a Submission for the specified Databricks run given a cluster report
- sync.awsdatabricks.get_access_report(log_url: Optional[str] = None) sync.models.AccessReport ¶
Reports access to Databricks, AWS and Sync required for integrating jobs with Sync. Access is partially determined by the configuration of this library and boto3.
- Parameters
log_url (str, optional) – location of event logs, defaults to None
- Returns
access report
- Return type
AccessReport
- sync.awsdatabricks.get_all_cluster_events(cluster_id: str)¶
Fetches all ClusterEvents for a given Databricks cluster, optionally within a time window. Pages will be followed and returned as 1 object
- sync.awsdatabricks.get_cluster(cluster_id: str) sync.api.projects.Response[dict] ¶
Get Databricks cluster.
- Parameters
cluster_id (str) – cluster ID
- Returns
cluster object
- Return type
Response[dict]
- sync.awsdatabricks.get_cluster_report(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[DatabricksClusterReport] ¶
Fetches the cluster information required to create a Sync prediction
- Parameters
run_id (str) – Databricks run ID
plan_type (str) – Databricks Pricing Plan, e.g. “Standard”
compute_type (str) – Cluster compute type, e.g. “Jobs Compute”
project_id (str, optional) – The Sync Project ID this report should be generated for. This is good to provide in general, but especially for multi-cluster jobs.
allow_incomplete (bool, optional, defaults to False) – Whether creating a cluster report with incomplete data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude from the report
- Returns
cluster report
- Return type
Response[DatabricksClusterReport]
- sync.awsdatabricks.get_project_cluster(cluster: dict, project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict] ¶
Apply project configuration to a cluster.
The cluster is updated with tags and a log configuration to facilitate project continuity.
- Parameters
cluster (dict) – Databricks cluster object
project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration
- Returns
project job object
- Return type
Response[dict]
- sync.awsdatabricks.get_project_cluster_settings(project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict] ¶
Gets cluster configuration for a project.
This configuration is intended to be used to update the cluster of a Databricks job so that its runs can be included in a Sync project.
- Parameters
project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration
- Returns
project cluster settings - a subset of a Databricks cluster object
- Return type
Response[dict]
- sync.awsdatabricks.get_project_job(job_id: str, project_id: str, region_name: Optional[str] = None) sync.api.projects.Response[dict] ¶
Apply project configuration to a job.
The job can only have tasks that run on the same job cluster. That cluster is updated with tags and a log configuration to facilitate project continuity. The result can be tested in a one-off run or applied to an existing job to surface run-time (see
run_job_object()
) or cost optimizations.- Parameters
job_id (str) – ID of basis job
project_id (str) – Sync project ID
region_name (str, optional) – region name, defaults to AWS configuration
- Returns
project job object
- Return type
Response[dict]
- sync.awsdatabricks.get_recommendation_job(job_id: str, project_id: str, recommendation_id: str) sync.api.projects.Response[dict] ¶
Apply the recommendation to the specified job.
The basis job can only have tasks that run on the same cluster. That cluster is updated with the configuration from the prediction and returned in the result job configuration. Use this function to apply a prediction to an existing job or test a prediction with a one-off run.
- Parameters
job_id (str) – basis job ID
project_id (str) – Sync project ID
recommendation_id (str) – recommendation ID
- Returns
job object with recommendation applied to it
- Return type
Response[dict]
- sync.awsdatabricks.handle_successful_job_run(job_id: str, run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[Dict[str, str]] ¶
Create’s Sync project submissions for each eligible cluster in the run (see
record_run()
)If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.
For projects with auto_apply_recs=True, apply latest recommended cluster configurations.
- Parameters
job_id (str) – Databricks job ID
run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional, defaults to None) – Sync project ID
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude
- Returns
map of project ID to submission or prediction ID
- Return type
Response[Dict[str, str]]
- sync.awsdatabricks.record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None, allow_incomplete_cluster_report: bool = False, exclude_tasks: Optional[Collection[str]] = None) sync._databricks.Response[Dict[str, str]] ¶
Create’s Sync project submissions for each eligible cluster in the run.
If project ID is provided only submit run data for the cluster tagged with it, or the only cluster if there is such. If no project ID is provided then submit run data for each cluster tagged with a project ID.
- Parameters
run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional, defaults to None) – Sync project ID
allow_incomplete_cluster_report (bool, optional, defaults to False) – Whether creating a prediction with incomplete cluster report data should be allowable
exclude_tasks (Collection[str], optional, defaults to None) – Keys of tasks (task names) to exclude
- Returns
map of project ID to submission or prediction ID
- Return type
Response[Dict[str, str]]
- sync.awsdatabricks.run_and_record_job(job_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str] ¶
Runs the specified job and creates a prediction based on the result.
If a project is specified the prediction is added to it.
- Parameters
job_id (str) – Databricks job ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.run_and_record_job_object(job: dict, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str] ¶
Creates a one-off Databricks run based on the provided job object.
Job tasks must use the same job cluster, and that cluster must be configured to store the event logs in S3.
- Parameters
job (dict) – Databricks job object
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.run_and_record_project_job(job_id: str, project_id: str, plan_type: str, compute_type: str, region_name: Optional[str] = None) sync.api.Response[str] ¶
Runs the specified job and adds the result to the project.
This function waits for the run to complete.
- Parameters
job_id (str) – Databricks job ID
project_id (str) – Sync project ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
region_name (str, optional) – region name, defaults to AWS configuration
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.run_job_object(job: dict) sync._databricks.Response[Tuple[str, str]] ¶
Create a Databricks one-off run based on the job configuration.
- Parameters
job (dict) – Databricks job object
- Returns
run ID, and optionally ID of newly created cluster
- Return type
Response[Tuple[str, str]]
- sync.awsdatabricks.terminate_cluster(cluster_id: str) sync.api.projects.Response[dict] ¶
Terminate Databricks cluster and wait to return final state.
- Parameters
cluster_id (str) – Databricks cluster ID
- Returns
Databricks cluster object with state: “TERMINATED”
- Return type
Response[str]
- sync.awsdatabricks.wait_for_and_record_run(run_id: str, plan_type: str, compute_type: str, project_id: Optional[str] = None) sync.api.Response[str] ¶
Waits for a run to complete before creating a prediction.
The run must save 1 event log to S3. If a project is specified the prediction is added to that project.
- Parameters
run_id (str) – Databricks run ID
plan_type (str) – either “Standard”, “Premium” or “Enterprise”
compute_type (str) – e.g. “Jobs Compute”
project_id (str, optional) – Sync project ID, defaults to None
- Returns
prediction ID
- Return type
Response[str]
- sync.awsdatabricks.wait_for_final_run_status(run_id: str) sync.api.Response[str] ¶
Waits for run returning final status.
- Parameters
run_id (str) – Databricks run ID
- Returns
result state, e.g. “SUCCESS”
- Return type
Response[str]
- sync.awsdatabricks.wait_for_run_and_cluster(run_id: str) sync.api.Response[str] ¶
Waits for final run status and returns it after terminating the cluster.
- Parameters
run_id (str) – Databricks run ID
- Returns
result state, e.g. “SUCCESS”
- Return type
Response[str]