Infra improvements (#66)
* set docker context to root of autogluon-bench project

to prepare for copying package setup files to docker

* install agbench according to local agbench version

* use static base dir in docker to increase caching

* Use /home as base dir for dependencies

* require IMDSv2 on instances

* use AWS Batch array jobs to avoid throttling

* raise lambda error

* use custom metrics with standard metrics

* use custom_configs/ for mounting

* handle empty params and default eval_metric for init

* add metrics support

* update tests

* lint

* update README
suzhoum authored Nov 9, 2023
1 parent 5bcd5bc commit fcf701e
Showing 26 changed files with 303 additions and 413 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
+*
+!.git/
+!src/
+!pyproject.toml
14 changes: 4 additions & 10 deletions README.md
@@ -33,12 +33,6 @@ cd autogluon-bench
pip install -e ".[tests]"
```

-For development, please be aware that `autogluon.bench` is installed as a dependency in certain places, such as the [Dockerfile](https://github.com/autogluon/autogluon-bench/blob/master/src/autogluon/bench/Dockerfile) and [Multimodal Setup](https://github.com/autogluon/autogluon-bench/blob/master/src/autogluon/bench/frameworks/multimodal/setup.sh). We made it possible to reflect the development changes by pushing the changes to a remote GitHub branch, and providing the URI when testing on benchmark runs:
-
-```
-agbench run sample_configs/multimodal_cloud_configs.yaml --dev-branch https://github.com/<username>/autogluon-bench.git#<dev_branch>
-```
-

## Run benchmarks locally

@@ -144,11 +138,11 @@ After having the configuration file ready, use the command below to initiate benchmark runs:
agbench run /path/to/cloud_config_file
```

-This command automatically sets up an AWS Batch environment using instance specifications defined in the [cloud config files](https://github.com/autogluon/autogluon-bench/tree/master/sample_configs). It also creates a lambda function named with your chosen `LAMBDA_FUNCTION_NAME`. This lambda function is automatically invoked with the cloud config file you provided, submitting multiple AWS Batch jobs to the job queue (named with the `PREFIX` you provided).
+This command automatically sets up an AWS Batch environment using instance specifications defined in the [cloud config files](https://github.com/autogluon/autogluon-bench/tree/master/sample_configs). It also creates a lambda function named with your chosen `LAMBDA_FUNCTION_NAME`. This lambda function is automatically invoked with the cloud config file you provided, submitting a single AWS Batch job or a parent job for [Array jobs](https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html) to the job queue (named with the `PREFIX` you provided).

-In order for the Lambda function to submit multiple jobs simultaneously, you need to specify a list of values for each module-specific key. Each combination of configurations is saved and uploaded to your specified `METRICS_BUCKET` in S3, stored under `S3://{METRICS_BUCKET}/configs/{BENCHMARK_NAME}_{timestamp}/{BENCHMARK_NAME}_split_{UID}.yaml`. Here, `UID` is a unique ID assigned to the split.
+In order for the Lambda function to submit multiple Array child jobs simultaneously, you need to specify a list of values for each module-specific key. Each combination of configurations is saved and uploaded to your specified `METRICS_BUCKET` in S3, stored under `S3://{METRICS_BUCKET}/configs/{module}/{BENCHMARK_NAME}_{timestamp}/{BENCHMARK_NAME}_split_{UID}.yaml`. Here, `UID` is a unique ID assigned to the split.

-The AWS infrastructure configurations and submitted job IDs are saved locally at `{WORKING_DIR}/{root_dir}/{module}/{benchmark_name}_{timestamp}/aws_configs.yaml`. You can use this file to check the job status at any time:
+The AWS infrastructure configurations and the submitted job ID are saved locally at `{WORKING_DIR}/{root_dir}/{module}/{benchmark_name}_{timestamp}/aws_configs.yaml`. You can use this file to check the job status at any time:

```bash
agbench get-job-status --config-file /path/to/aws_configs.yaml
@@ -272,5 +266,5 @@ agbench clean-amlb-results --help
Step 3: Run evaluation on multiple cleaned files from `Step 2`

```
-agbench evaluate-amlb-results --frameworks-run framework_1 --frameworks-run framework_2 --results-dir-input data/results/input/prepared/openml/ --paths file_name_1.csv --paths file_name_2.csv --no-clean-data
+agbench evaluate-amlb-results --frameworks-run framework_1 --frameworks-run framework_2 --results-dir-input data/results/input/prepared/openml/ --paths file_name_1.csv --paths file_name_2.csv --output-suffix benchmark_name --no-clean-data
```
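To make the array-job flow described above concrete, here is a minimal sketch (not code from this commit) of how an array child job can locate its own config split. It assumes the standard `AWS_BATCH_JOB_ARRAY_INDEX` variable that AWS Batch sets on each child, and that the `config_file` environment variable points at the uploaded YAML list; the bucket name is a placeholder.

```python
import os

import boto3
import yaml

# Hypothetical sketch: resolve this child job's config split.
# AWS_BATCH_JOB_ARRAY_INDEX is set automatically on array child jobs.
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
config_s3_path = os.environ["config_file"]  # e.g. s3://my-metrics-bucket/configs/...yaml
bucket, key = config_s3_path.replace("s3://", "").split("/", 1)

body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
job_configs = yaml.safe_load(body)  # the list uploaded by the Lambda function
my_config = job_configs[index]      # this child's combination
```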
1 change: 1 addition & 0 deletions pyproject.toml
@@ -108,3 +108,4 @@ xfail_strict = true

[tool.setuptools_scm]
write_to = "src/autogluon/bench/version.py"
+fallback_version = "0.0.1.dev0"
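The new `fallback_version` gives setuptools_scm a version to fall back on when git metadata is unavailable, and the Dockerfile change below keys its install path off whether the resolved version carries a dev tag. A rough Python rendering of that check, with the module path taken from the `write_to` setting above (treat the import as an assumption):

```python
# Hedged sketch: mirrors the Dockerfile's `grep -q "dev"` check in Python.
# setuptools_scm writes `version` into src/autogluon/bench/version.py.
from autogluon.bench.version import version  # e.g. "0.0.1.dev0"

if "dev" in version:
    print("install autogluon.bench from the local source tree (/app/)")
else:
    print(f"install autogluon.bench=={version} from PyPI")
```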
40 changes: 16 additions & 24 deletions src/autogluon/bench/Dockerfile
@@ -2,6 +2,8 @@ ARG AG_BENCH_BASE_IMAGE
FROM $AG_BENCH_BASE_IMAGE

ENV DEBIAN_FRONTEND=noninteractive
+ENV RUNNING_IN_DOCKER=true
+ENV AGBENCH_BASE=src/autogluon/bench/

# Install essential packages and Python 3.9
RUN apt-get update && \
@@ -22,48 +24,38 @@ RUN apt-get install -y python3-pip unzip curl git pciutils && \
rm -rf /var/lib/apt/lists/* /usr/local/aws

# Application-specific steps
-ARG AG_BENCH_DEV_URL
ARG AG_BENCH_VERSION
ARG CDK_DEPLOY_REGION
ARG FRAMEWORK_PATH
ARG GIT_URI
ARG GIT_BRANCH
-ARG BENCHMARK_DIR
ARG AMLB_FRAMEWORK
ARG AMLB_USER_DIR

WORKDIR /app/

-RUN if [ -n "$AG_BENCH_DEV_URL" ]; then \
-    echo "Cloning: $AG_BENCH_DEV_URL" \
-    && AG_BENCH_DEV_REPO=$(echo "$AG_BENCH_DEV_URL" | cut -d "#" -f 1) \
-    && AG_BENCH_DEV_BRANCH=$(echo "$AG_BENCH_DEV_URL" | cut -d "#" -f 2) \
-    && git clone --branch "$AG_BENCH_DEV_BRANCH" --single-branch "$AG_BENCH_DEV_REPO" /app/autogluon-bench \
-    && python3 -m pip install -e /app/autogluon-bench; \
+# Copying necessary files for autogluon-bench package
+COPY . /app/
+COPY ${AGBENCH_BASE}entrypoint.sh /app/
+COPY ${AGBENCH_BASE}custom_configs /app/custom_configs/

+# check if autogluon.bench version contains "dev" tag
+RUN if echo "$AG_BENCH_VERSION" | grep -q "dev"; then \
+    # install from local source
+    pip install /app/; \
 else \
    output=$(pip install autogluon.bench==$AG_BENCH_VERSION 2>&1) || true; \
    if echo $output | grep -q "No matching distribution"; then \
        echo -e "ERROR: No matching distribution found for autogluon.bench==$AG_BENCH_VERSION\n\
To resolve the issue, try 'agbench run <config_file> --dev-branch <autogluon_bench_fork_uri>#<git_branch>'"; \
        exit 1; \
    fi; \
    pip install autogluon.bench==$AG_BENCH_VERSION; \
 fi

-COPY entrypoint.sh utils/hardware_utilization.sh $FRAMEWORK_PATH/setup.sh custom_configs/ /app/

-RUN chmod +x setup.sh entrypoint.sh hardware_utilization.sh \
+RUN chmod +x entrypoint.sh \
    && if echo "$FRAMEWORK_PATH" | grep -q -E "tabular|timeseries"; then \
        if [ -n "$AMLB_USER_DIR" ]; then \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR $AMLB_FRAMEWORK $AMLB_USER_DIR; \
+            bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AMLB_FRAMEWORK $AMLB_USER_DIR; \
        else \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR $AMLB_FRAMEWORK; \
+            bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AMLB_FRAMEWORK; \
        fi; \
    elif echo "$FRAMEWORK_PATH" | grep -q "multimodal"; then \
-        if [ -n "$AG_BENCH_DEV_URL" ]; then \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR --AGBENCH_DEV_URL=$AG_BENCH_DEV_URL; \
-        else \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR --AG_BENCH_VER=$AG_BENCH_VERSION; \
-        fi; \
+        bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AG_BENCH_VERSION; \
    fi \
    && echo "CDK_DEPLOY_REGION=$CDK_DEPLOY_REGION" >> /etc/environment
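For reference, a hedged sketch of building this image with the arguments the revised Dockerfile expects. All values below are placeholders, and the actual invocation performed by agbench's deploy tooling is not shown in this diff:

```python
import subprocess

# Hypothetical build from the project root (the new Docker context);
# argument values are placeholders, not taken from this commit.
build_args = {
    "AG_BENCH_BASE_IMAGE": "ubuntu:20.04",
    "AG_BENCH_VERSION": "0.0.1.dev0",  # a "dev" version selects the local-source install branch
    "CDK_DEPLOY_REGION": "us-west-2",
    "FRAMEWORK_PATH": "frameworks/multimodal/",
    "GIT_URI": "https://github.com/autogluon/autogluon.git",
    "GIT_BRANCH": "master",
}
cmd = ["docker", "build", "-f", "src/autogluon/bench/Dockerfile", "-t", "agbench:dev"]
for name, value in build_args.items():
    cmd += ["--build-arg", f"{name}={value}"]
cmd.append(".")  # root context, so `COPY . /app/` sees the package setup files
subprocess.run(cmd, check=True)
```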


This file was deleted.

Empty file.
176 changes: 46 additions & 130 deletions src/autogluon/bench/cloud/aws/batch_stack/lambdas/lambda_function.py
@@ -2,7 +2,6 @@
import itertools
import logging
import os
-import uuid
import zipfile

import requests
@@ -18,7 +17,7 @@
AMLB_DEPENDENT_MODULES = ["tabular", "timeseries"]


-def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str):
+def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str, array_size: int):
"""
Submits a Batch job with the given environment variables, job name, job queue and job definition.
@@ -27,17 +26,23 @@ def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str):
        job_name (str): Name of the job.
        job_queue (str): Name of the job queue.
        job_definition (str): Name of the job definition.
+        array_size (int): Number of jobs to submit.

    Returns:
        str: Job ID.
    """
    container_overrides = {"environment": env}
-    response = aws_batch.submit_job(
-        jobName=job_name,
-        jobQueue=job_queue,
-        jobDefinition=job_definition,
-        containerOverrides=container_overrides,
-    )
+    job_params = {
+        "jobName": job_name,
+        "jobQueue": job_queue,
+        "jobDefinition": job_definition,
+        "containerOverrides": container_overrides,
+    }
+    if array_size > 1:
+        job_params["arrayProperties"] = {"size": array_size}
+
+    response = aws_batch.submit_job(**job_params)

logger.info("Job %s submitted to AWS Batch queue %s.", job_name, job_queue)
logger.info(response)
return response["jobId"]
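A brief usage sketch of the updated function (queue, definition, and bucket names are hypothetical). With `array_size > 1`, AWS Batch creates a parent job whose children all receive the same environment and differ only in `AWS_BATCH_JOB_ARRAY_INDEX`; with `array_size == 1` the call degrades to a plain single-job submission:

```python
# Hypothetical call; queue/definition names are placeholders.
parent_job_id = submit_batch_job(
    env=[{"name": "config_file", "value": "s3://my-metrics-bucket/configs/my_benchmark_job_configs.yaml"}],
    job_name="my-benchmark-array-job",
    job_queue="agbench-job-queue",
    job_definition="agbench-job-definition",
    array_size=8,  # spawns child jobs with indexes 0..7
)
```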
@@ -88,7 +93,7 @@ def download_dir_from_s3(s3_path: str, local_path: str) -> str:
    return local_path


-def upload_config(bucket: str, benchmark_name: str, file: str):
+def upload_config(config_list: list, bucket: str, benchmark_name: str):
    """
    Uploads a file to the given S3 bucket.
@@ -99,28 +104,9 @@ def upload_config(bucket: str, benchmark_name: str, file: str):
    Returns:
        str: S3 path of the uploaded file.
    """
-    file_name = f'{file.split("/")[-1].split(".")[0]}.yaml'
-    s3_path = f"configs/{benchmark_name}/{file_name}"
-    s3.upload_file(file, bucket, s3_path)
-    return f"s3://{bucket}/{s3_path}"
-
-
-def save_configs(configs: dict, uid: str):
-    """
-    Saves the given dictionary of configs to a YAML file with the given UID as a part of the filename.
-
-    Args:
-        configs (Dict[str, Any]): Dictionary of configurations to be saved.
-        uid (str): UID to be added to the filename of the saved file.
-
-    Returns:
-        str: Local path of the saved file.
-    """
-    benchmark_name = configs["benchmark_name"]
-    config_file_path = os.path.join("/tmp", f"{benchmark_name}_split_{uid}.yaml")
-    with open(config_file_path, "w+") as f:
-        yaml.dump(configs, f, default_flow_style=False)
-    return config_file_path
+    s3_key = f"configs/{benchmark_name}/{benchmark_name}_job_configs.yaml"
+    s3.put_object(Body=yaml.dump(config_list), Bucket=bucket, Key=s3_key)
+    return f"s3://{bucket}/{s3_key}"


def download_automlbenchmark_resources():
@@ -217,59 +203,37 @@ def process_benchmark_runs(module_configs: dict, amlb_benchmark_search_dirs: list):
                module_configs["fold_to_run"][benchmark][task] = amlb_task_folds[benchmark][task]


-def process_combination(configs, metrics_bucket, batch_job_queue, batch_job_definition):
-    """
-    Processes a combination of configurations by generating and submitting Batch jobs.
-
-    Args:
-        combination (Tuple): tuple of configurations to process.
-        keys (List[str]): list of keys of the configurations.
-        metrics_bucket (str): name of the bucket to upload metrics to.
-        batch_job_queue (str): name of the Batch job queue to submit jobs to.
-        batch_job_definition (str): name of the Batch job definition to use for submitting jobs.
-
-    Returns:
-        str: job id of the submitted batch job.
-    """
-    logger.info(f"Generating config with: {configs}")
-    config_uid = uuid.uuid1().hex
-    config_local_path = save_configs(configs=configs, uid=config_uid)
-    config_s3_path = upload_config(
-        bucket=metrics_bucket, benchmark_name=configs["benchmark_name"], file=config_local_path
-    )
-    job_name = f"{configs['benchmark_name']}-{configs['module']}-{config_uid}"
-    env = [{"name": "config_file", "value": config_s3_path}]
-
-    job_id = submit_batch_job(
-        env=env,
-        job_name=job_name,
-        job_queue=batch_job_queue,
-        job_definition=batch_job_definition,
-    )
-    return job_id, config_s3_path
+def get_cloudwatch_logs_url(region: str, job_id: str, log_group_name: str = "/aws/batch/job"):
+    base_url = f"https://console.aws.amazon.com/cloudwatch/home?region={region}"
+    job_response = aws_batch.describe_jobs(jobs=[job_id])
+    log_stream_name = job_response["jobs"][0]["attempts"][0]["container"]["logStreamName"]
+    return f"{base_url}#logsV2:log-groups/log-group/{log_group_name.replace('/', '%2F')}/log-events/{log_stream_name.replace('/', '%2F')}"
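A short usage sketch for the helper above (region and job ID are placeholders). Note that `attempts` is only populated once Batch has placed the job on a container, so the lookup is meaningful only after the job has started at least one attempt:

```python
# Hypothetical usage; assumes the job already has a recorded attempt.
url = get_cloudwatch_logs_url(region="us-west-2", job_id="00000000-aaaa-bbbb-cccc-dddddddddddd")
logger.info("CloudWatch logs for the job: %s", url)
```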


def generate_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
-    job_configs = {}
+    config.pop("cdk_context")
+    job_configs = []
if config["module"] in AMLB_DEPENDENT_MODULES:
job_configs = generate_amlb_module_config_combinations(
config, metrics_bucket, batch_job_queue, batch_job_definition
)
job_configs = generate_amlb_module_config_combinations(config)
elif config["module"] == "multimodal":
job_configs = generate_multimodal_config_combinations(
config, metrics_bucket, batch_job_queue, batch_job_definition
)
job_configs = generate_multimodal_config_combinations(config)
else:
raise ValueError("Invalid module. Choose either 'tabular', 'timeseries', or 'multimodal'.")

response = {
"job_configs": job_configs,
}
return response
+    benchmark_name = config["benchmark_name"]
+    config_s3_path = upload_config(config_list=job_configs, bucket=metrics_bucket, benchmark_name=benchmark_name)
+    env = [{"name": "config_file", "value": config_s3_path}]
+    job_name = f"{benchmark_name}-array-job"
+    parent_job_id = submit_batch_job(
+        env=env,
+        job_name=job_name,
+        job_queue=batch_job_queue,
+        job_definition=batch_job_definition,
+        array_size=len(job_configs),
+    )
+    return {parent_job_id: config_s3_path}


-def generate_multimodal_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
+def generate_multimodal_config_combinations(config):
    common_keys = []
    specific_keys = []
    for key in config.keys():
@@ -278,23 +242,21 @@ def generate_multimodal_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
        else:
            common_keys.append(key)

-    job_configs = {}
    specific_value_combinations = list(
        itertools.product(*(config[key] for key in specific_keys if key in config.keys()))
    ) or [None]

+    all_configs = []
    for combo in specific_value_combinations:
        new_config = {key: config[key] for key in common_keys}
        if combo is not None:
            new_config.update(dict(zip(specific_keys, combo)))
+        all_configs.append(new_config)

-        job_id, config_s3_path = process_combination(new_config, metrics_bucket, batch_job_queue, batch_job_definition)
-        job_configs[job_id] = config_s3_path
-
-    return job_configs
+    return all_configs
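To make the cross-product behavior concrete, a hedged example of the expansion. The config values are hypothetical, and it assumes list-valued module keys are what the hidden branch above classifies as `specific_keys`, matching the README's description of module-specific lists:

```python
# Hypothetical input; two datasets x two presets should expand to 4 configs
# if list-valued keys are the swept "specific" keys.
config = {
    "benchmark_name": "demo",
    "module": "multimodal",
    "dataset_name": ["melbourne_airbnb", "petfinder"],
    "presets": ["medium_quality", "best_quality"],
}
splits = generate_multimodal_config_combinations(config)
print(len(splits))  # expected: 4, one config per (dataset_name, presets) pair
```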


-def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
+def generate_amlb_module_config_combinations(config):
    specific_keys = ["git_uri#branch", "framework", "amlb_constraint", "amlb_user_dir"]
    exclude_keys = ["amlb_benchmark", "amlb_task", "fold_to_run"]
    common_keys = []
@@ -308,13 +270,13 @@ def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
        else:
            common_keys.append(key)

-    job_configs = {}
    specific_value_combinations = list(
        itertools.product(*(config[key] for key in specific_keys if key in config.keys()))
    ) or [None]

-    # Iterate through the combinations and the amlb benchmark task keys
+    # Generates a config for each combination of specific key and keys in `fold_to_run`
+    all_configs = []
    for combo in specific_value_combinations:
        for benchmark, tasks in config["fold_to_run"].items():
            for task, fold_numbers in tasks.items():
@@ -325,62 +287,16 @@ def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
                    new_config["amlb_benchmark"] = benchmark
                    new_config["amlb_task"] = task
                    new_config["fold_to_run"] = fold_num
-                    job_id, config_s3_path = process_combination(
-                        new_config, metrics_bucket, batch_job_queue, batch_job_definition
-                    )
-                    job_configs[job_id] = config_s3_path
-    return job_configs
+                    all_configs.append(new_config)

+    return all_configs


def handler(event, context):
    """
    Execution entrypoint for AWS Lambda.
    Triggers batch jobs with hyperparameter combinations.
    ENV variables are set by the AWS CDK infra code.
-    Sample of cloud_configs.yaml to be supplied by user
-    # Infra configurations
-    cdk_context:
-        CDK_DEPLOY_ACCOUNT: dummy
-        CDK_DEPLOY_REGION: dummy
-    # Benchmark configurations
-    module: multimodal
-    mode: aws
-    benchmark_name: test_yaml
-    metrics_bucket: autogluon-benchmark-metrics
-    # Module specific configurations
-    module_configs:
-        # Multimodal specific
-        multimodal:
-            git_uri#branch: https://github.com/autogluon/autogluon#master
-            dataset_name: melbourne_airbnb
-            presets: medium_quality
-            hyperparameters:
-                optimization.learning_rate: 0.0005
-                optimization.max_epochs: 5
-            time_limit: 10
-        # Tabular specific
-        # You can refer to AMLB (https://github.com/openml/automlbenchmark#quickstart) for more details
-        tabular:
-            framework:
-                - AutoGluon
-            label:
-                - stable
-            amlb_benchmark:
-                - test
-                - small
-            amlb_task:
-                test: null
-                small:
-                    - credit-g
-                    - vehicle
-            amlb_constraint:
-                - test
    """
if "config_file" not in event or not event["config_file"].startswith("s3"):
raise KeyError("S3 path of config file is required.")
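For context, a hedged example of the minimal event payload that passes this validation (bucket and key are placeholders):

```python
# Hypothetical Lambda test event; the handler only requires an S3 URI here.
event = {"config_file": "s3://my-metrics-bucket/configs/cloud_configs.yaml"}
```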