Infra improvements (#66)
* set docker context to root of autogluon-bench project

to prepare for copying package setup files to docker

* install agbench according to local agbench version

* use static base dir in docker to increase caching

* Use /home as base dir for dependencies

* require IMDSv2 on instances

* use AWS Batch array jobs to avoid throttling

* raise lambda error

* use custom metrics with standard metrics

* use custom_configs/ for mounting

* handle empty params and default eval_metric for init

* add metrics support

* update tests

* lint

* update README
suzhoum authored Nov 9, 2023
1 parent 5bcd5bc commit fcf701e
Showing 26 changed files with 303 additions and 413 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -0,0 +1,4 @@
+*
+!.git/
+!src/
+!pyproject.toml
14 changes: 4 additions & 10 deletions README.md
@@ -33,12 +33,6 @@ cd autogluon-bench
pip install -e ".[tests]"
```

-For development, please be aware that `autogluon.bench` is installed as a dependency in certain places, such as the [Dockerfile](https://github.com/autogluon/autogluon-bench/blob/master/src/autogluon/bench/Dockerfile) and [Multimodal Setup](https://github.com/autogluon/autogluon-bench/blob/master/src/autogluon/bench/frameworks/multimodal/setup.sh). We made it possible to reflect the development changes by pushing the changes to a remote GitHub branch, and providing the URI when testing on benchmark runs:
-
-```
-agbench run sample_configs/multimodal_cloud_configs.yaml --dev-branch https://github.com/<username>/autogluon-bench.git#<dev_branch>
-```
-

## Run benchmarks locally

@@ -144,11 +138,11 @@ After having the configuration file ready, use the command below to initiate benchmark runs:
agbench run /path/to/cloud_config_file
```

-This command automatically sets up an AWS Batch environment using instance specifications defined in the [cloud config files](https://github.com/autogluon/autogluon-bench/tree/master/sample_configs). It also creates a lambda function named with your chosen `LAMBDA_FUNCTION_NAME`. This lambda function is automatically invoked with the cloud config file you provided, submitting multiple AWS Batch jobs to the job queue (named with the `PREFIX` you provided).
+This command automatically sets up an AWS Batch environment using instance specifications defined in the [cloud config files](https://github.com/autogluon/autogluon-bench/tree/master/sample_configs). It also creates a lambda function named with your chosen `LAMBDA_FUNCTION_NAME`. This lambda function is automatically invoked with the cloud config file you provided, submitting a single AWS Batch job or a parent job for [Array jobs](https://docs.aws.amazon.com/batch/latest/userguide/array_jobs.html) to the job queue (named with the `PREFIX` you provided).

-In order for the Lambda function to submit multiple jobs simultaneously, you need to specify a list of values for each module-specific key. Each combination of configurations is saved and uploaded to your specified `METRICS_BUCKET` in S3, stored under `S3://{METRICS_BUCKET}/configs/{BENCHMARK_NAME}_{timestamp}/{BENCHMARK_NAME}_split_{UID}.yaml`. Here, `UID` is a unique ID assigned to the split.
+In order for the Lambda function to submit multiple Array child jobs simultaneously, you need to specify a list of values for each module-specific key. Each combination of configurations is saved and uploaded to your specified `METRICS_BUCKET` in S3, stored under `S3://{METRICS_BUCKET}/configs/{module}/{BENCHMARK_NAME}_{timestamp}/{BENCHMARK_NAME}_split_{UID}.yaml`. Here, `UID` is a unique ID assigned to the split.

-The AWS infrastructure configurations and submitted job IDs are saved locally at `{WORKING_DIR}/{root_dir}/{module}/{benchmark_name}_{timestamp}/aws_configs.yaml`. You can use this file to check the job status at any time:
+The AWS infrastructure configurations and the submitted job ID are saved locally at `{WORKING_DIR}/{root_dir}/{module}/{benchmark_name}_{timestamp}/aws_configs.yaml`. You can use this file to check the job status at any time:

```bash
agbench get-job-status --config-file /path/to/aws_configs.yaml
@@ -272,5 +266,5 @@ agbench clean-amlb-results --help
Step 3: Run evaluation on multiple cleaned files from `Step 2`

```
-agbench evaluate-amlb-results --frameworks-run framework_1 --frameworks-run framework_2 --results-dir-input data/results/input/prepared/openml/ --paths file_name_1.csv --paths file_name_2.csv --no-clean-data
+agbench evaluate-amlb-results --frameworks-run framework_1 --frameworks-run framework_2 --results-dir-input data/results/input/prepared/openml/ --paths file_name_1.csv --paths file_name_2.csv --output-suffix benchmark_name --no-clean-data
```
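To make the array-job flow described above concrete, here is a minimal sketch (not code from this commit) of how an array child job can locate its own config split. It assumes the standard `AWS_BATCH_JOB_ARRAY_INDEX` variable that AWS Batch sets on each child, and that the `config_file` environment variable points at the uploaded YAML list; the bucket name is a placeholder.

```python
import os

import boto3
import yaml

# Hypothetical sketch: resolve this child job's config split.
# AWS_BATCH_JOB_ARRAY_INDEX is set automatically on array child jobs.
index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
config_s3_path = os.environ["config_file"]  # e.g. s3://my-metrics-bucket/configs/...yaml
bucket, key = config_s3_path.replace("s3://", "").split("/", 1)

body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
job_configs = yaml.safe_load(body)  # the list uploaded by the Lambda function
my_config = job_configs[index]      # this child's combination
```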
1 change: 1 addition & 0 deletions pyproject.toml
@@ -108,3 +108,4 @@ xfail_strict = true

[tool.setuptools_scm]
write_to = "src/autogluon/bench/version.py"
+fallback_version = "0.0.1.dev0"
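The new `fallback_version` gives setuptools_scm a version to fall back on when git metadata is unavailable, and the Dockerfile change below keys its install path off whether the resolved version carries a dev tag. A rough Python rendering of that check, with the module path taken from the `write_to` setting above (treat the import as an assumption):

```python
# Hedged sketch: mirrors the Dockerfile's `grep -q "dev"` check in Python.
# setuptools_scm writes `version` into src/autogluon/bench/version.py.
from autogluon.bench.version import version  # e.g. "0.0.1.dev0"

if "dev" in version:
    print("install autogluon.bench from the local source tree (/app/)")
else:
    print(f"install autogluon.bench=={version} from PyPI")
```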
40 changes: 16 additions & 24 deletions src/autogluon/bench/Dockerfile
@@ -2,6 +2,8 @@ ARG AG_BENCH_BASE_IMAGE
FROM $AG_BENCH_BASE_IMAGE

ENV DEBIAN_FRONTEND=noninteractive
+ENV RUNNING_IN_DOCKER=true
+ENV AGBENCH_BASE=src/autogluon/bench/

# Install essential packages and Python 3.9
RUN apt-get update && \
@@ -22,48 +24,38 @@ RUN apt-get install -y python3-pip unzip curl git pciutils && \
rm -rf /var/lib/apt/lists/* /usr/local/aws

# Application-specific steps
-ARG AG_BENCH_DEV_URL
ARG AG_BENCH_VERSION
ARG CDK_DEPLOY_REGION
ARG FRAMEWORK_PATH
ARG GIT_URI
ARG GIT_BRANCH
-ARG BENCHMARK_DIR
ARG AMLB_FRAMEWORK
ARG AMLB_USER_DIR

WORKDIR /app/

-RUN if [ -n "$AG_BENCH_DEV_URL" ]; then \
-    echo "Cloning: $AG_BENCH_DEV_URL" \
-    && AG_BENCH_DEV_REPO=$(echo "$AG_BENCH_DEV_URL" | cut -d "#" -f 1) \
-    && AG_BENCH_DEV_BRANCH=$(echo "$AG_BENCH_DEV_URL" | cut -d "#" -f 2) \
-    && git clone --branch "$AG_BENCH_DEV_BRANCH" --single-branch "$AG_BENCH_DEV_REPO" /app/autogluon-bench \
-    && python3 -m pip install -e /app/autogluon-bench; \
+# Copying necessary files for autogluon-bench package
+COPY . /app/
+COPY ${AGBENCH_BASE}entrypoint.sh /app/
+COPY ${AGBENCH_BASE}custom_configs /app/custom_configs/

+# check if autogluon.bench version contains "dev" tag
+RUN if echo "$AG_BENCH_VERSION" | grep -q "dev"; then \
+    # install from local source
+    pip install /app/; \
 else \
    output=$(pip install autogluon.bench==$AG_BENCH_VERSION 2>&1) || true; \
    if echo $output | grep -q "No matching distribution"; then \
        echo -e "ERROR: No matching distribution found for autogluon.bench==$AG_BENCH_VERSION\n\
To resolve the issue, try 'agbench run <config_file> --dev-branch <autogluon_bench_fork_uri>#<git_branch>'"; \
        exit 1; \
    fi; \
    pip install autogluon.bench==$AG_BENCH_VERSION; \
 fi

-COPY entrypoint.sh utils/hardware_utilization.sh $FRAMEWORK_PATH/setup.sh custom_configs/ /app/

-RUN chmod +x setup.sh entrypoint.sh hardware_utilization.sh \
+RUN chmod +x entrypoint.sh \
    && if echo "$FRAMEWORK_PATH" | grep -q -E "tabular|timeseries"; then \
        if [ -n "$AMLB_USER_DIR" ]; then \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR $AMLB_FRAMEWORK $AMLB_USER_DIR; \
+            bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AMLB_FRAMEWORK $AMLB_USER_DIR; \
        else \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR $AMLB_FRAMEWORK; \
+            bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AMLB_FRAMEWORK; \
        fi; \
    elif echo "$FRAMEWORK_PATH" | grep -q "multimodal"; then \
-        if [ -n "$AG_BENCH_DEV_URL" ]; then \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR --AGBENCH_DEV_URL=$AG_BENCH_DEV_URL; \
-        else \
-            bash setup.sh $GIT_URI $GIT_BRANCH $BENCHMARK_DIR --AG_BENCH_VER=$AG_BENCH_VERSION; \
-        fi; \
+        bash ${AGBENCH_BASE}${FRAMEWORK_PATH}setup.sh $GIT_URI $GIT_BRANCH "/home" $AG_BENCH_VERSION; \
    fi \
    && echo "CDK_DEPLOY_REGION=$CDK_DEPLOY_REGION" >> /etc/environment
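For reference, a hedged sketch of building this image with the arguments the revised Dockerfile expects. All values below are placeholders, and the actual invocation performed by agbench's deploy tooling is not shown in this diff:

```python
import subprocess

# Hypothetical build from the project root (the new Docker context);
# argument values are placeholders, not taken from this commit.
build_args = {
    "AG_BENCH_BASE_IMAGE": "ubuntu:20.04",
    "AG_BENCH_VERSION": "0.0.1.dev0",  # a "dev" version selects the local-source install branch
    "CDK_DEPLOY_REGION": "us-west-2",
    "FRAMEWORK_PATH": "frameworks/multimodal/",
    "GIT_URI": "https://github.com/autogluon/autogluon.git",
    "GIT_BRANCH": "master",
}
cmd = ["docker", "build", "-f", "src/autogluon/bench/Dockerfile", "-t", "agbench:dev"]
for name, value in build_args.items():
    cmd += ["--build-arg", f"{name}={value}"]
cmd.append(".")  # root context, so `COPY . /app/` sees the package setup files
subprocess.run(cmd, check=True)
```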


This file was deleted.

Empty file.
176 changes: 46 additions & 130 deletions src/autogluon/bench/cloud/aws/batch_stack/lambdas/lambda_function.py
@@ -2,7 +2,6 @@
import itertools
import logging
import os
-import uuid
import zipfile

import requests
@@ -18,7 +17,7 @@
AMLB_DEPENDENT_MODULES = ["tabular", "timeseries"]


-def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str):
+def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str, array_size: int):
"""
Submits a Batch job with the given environment variables, job name, job queue and job definition.
@@ -27,17 +26,23 @@ def submit_batch_job(env: list, job_name: str, job_queue: str, job_definition: str):
        job_name (str): Name of the job.
        job_queue (str): Name of the job queue.
        job_definition (str): Name of the job definition.
+        array_size (int): Number of jobs to submit.

    Returns:
        str: Job ID.
    """
    container_overrides = {"environment": env}
-    response = aws_batch.submit_job(
-        jobName=job_name,
-        jobQueue=job_queue,
-        jobDefinition=job_definition,
-        containerOverrides=container_overrides,
-    )
+    job_params = {
+        "jobName": job_name,
+        "jobQueue": job_queue,
+        "jobDefinition": job_definition,
+        "containerOverrides": container_overrides,
+    }
+    if array_size > 1:
+        job_params["arrayProperties"] = {"size": array_size}
+
+    response = aws_batch.submit_job(**job_params)

logger.info("Job %s submitted to AWS Batch queue %s.", job_name, job_queue)
logger.info(response)
return response["jobId"]
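A brief usage sketch of the updated function (queue, definition, and bucket names are hypothetical). With `array_size > 1`, AWS Batch creates a parent job whose children all receive the same environment and differ only in `AWS_BATCH_JOB_ARRAY_INDEX`; with `array_size == 1` the call degrades to a plain single-job submission:

```python
# Hypothetical call; queue/definition names are placeholders.
parent_job_id = submit_batch_job(
    env=[{"name": "config_file", "value": "s3://my-metrics-bucket/configs/my_benchmark_job_configs.yaml"}],
    job_name="my-benchmark-array-job",
    job_queue="agbench-job-queue",
    job_definition="agbench-job-definition",
    array_size=8,  # spawns child jobs with indexes 0..7
)
```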
@@ -88,7 +93,7 @@ def download_dir_from_s3(s3_path: str, local_path: str) -> str:
    return local_path


-def upload_config(bucket: str, benchmark_name: str, file: str):
+def upload_config(config_list: list, bucket: str, benchmark_name: str):
    """
    Uploads a file to the given S3 bucket.
@@ -99,28 +104,9 @@ def upload_config(bucket: str, benchmark_name: str, file: str):
    Returns:
        str: S3 path of the uploaded file.
    """
-    file_name = f'{file.split("/")[-1].split(".")[0]}.yaml'
-    s3_path = f"configs/{benchmark_name}/{file_name}"
-    s3.upload_file(file, bucket, s3_path)
-    return f"s3://{bucket}/{s3_path}"
-
-
-def save_configs(configs: dict, uid: str):
-    """
-    Saves the given dictionary of configs to a YAML file with the given UID as a part of the filename.
-
-    Args:
-        configs (Dict[str, Any]): Dictionary of configurations to be saved.
-        uid (str): UID to be added to the filename of the saved file.
-
-    Returns:
-        str: Local path of the saved file.
-    """
-    benchmark_name = configs["benchmark_name"]
-    config_file_path = os.path.join("/tmp", f"{benchmark_name}_split_{uid}.yaml")
-    with open(config_file_path, "w+") as f:
-        yaml.dump(configs, f, default_flow_style=False)
-    return config_file_path
+    s3_key = f"configs/{benchmark_name}/{benchmark_name}_job_configs.yaml"
+    s3.put_object(Body=yaml.dump(config_list), Bucket=bucket, Key=s3_key)
+    return f"s3://{bucket}/{s3_key}"


def download_automlbenchmark_resources():
@@ -217,59 +203,37 @@ def process_benchmark_runs(module_configs: dict, amlb_benchmark_search_dirs: list):
                module_configs["fold_to_run"][benchmark][task] = amlb_task_folds[benchmark][task]


-def process_combination(configs, metrics_bucket, batch_job_queue, batch_job_definition):
-    """
-    Processes a combination of configurations by generating and submitting Batch jobs.
-
-    Args:
-        combination (Tuple): tuple of configurations to process.
-        keys (List[str]): list of keys of the configurations.
-        metrics_bucket (str): name of the bucket to upload metrics to.
-        batch_job_queue (str): name of the Batch job queue to submit jobs to.
-        batch_job_definition (str): name of the Batch job definition to use for submitting jobs.
-
-    Returns:
-        str: job id of the submitted batch job.
-    """
-    logger.info(f"Generating config with: {configs}")
-    config_uid = uuid.uuid1().hex
-    config_local_path = save_configs(configs=configs, uid=config_uid)
-    config_s3_path = upload_config(
-        bucket=metrics_bucket, benchmark_name=configs["benchmark_name"], file=config_local_path
-    )
-    job_name = f"{configs['benchmark_name']}-{configs['module']}-{config_uid}"
-    env = [{"name": "config_file", "value": config_s3_path}]
-
-    job_id = submit_batch_job(
-        env=env,
-        job_name=job_name,
-        job_queue=batch_job_queue,
-        job_definition=batch_job_definition,
-    )
-    return job_id, config_s3_path
+def get_cloudwatch_logs_url(region: str, job_id: str, log_group_name: str = "/aws/batch/job"):
+    base_url = f"https://console.aws.amazon.com/cloudwatch/home?region={region}"
+    job_response = aws_batch.describe_jobs(jobs=[job_id])
+    log_stream_name = job_response["jobs"][0]["attempts"][0]["container"]["logStreamName"]
+    return f"{base_url}#logsV2:log-groups/log-group/{log_group_name.replace('/', '%2F')}/log-events/{log_stream_name.replace('/', '%2F')}"
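A short usage sketch for the helper above (region and job ID are placeholders). Note that `attempts` is only populated once Batch has placed the job on a container, so the lookup is meaningful only after the job has started at least one attempt:

```python
# Hypothetical usage; assumes the job already has a recorded attempt.
url = get_cloudwatch_logs_url(region="us-west-2", job_id="00000000-aaaa-bbbb-cccc-dddddddddddd")
logger.info("CloudWatch logs for the job: %s", url)
```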


def generate_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
-    job_configs = {}
+    config.pop("cdk_context")
+    job_configs = []
if config["module"] in AMLB_DEPENDENT_MODULES:
job_configs = generate_amlb_module_config_combinations(
config, metrics_bucket, batch_job_queue, batch_job_definition
)
job_configs = generate_amlb_module_config_combinations(config)
elif config["module"] == "multimodal":
job_configs = generate_multimodal_config_combinations(
config, metrics_bucket, batch_job_queue, batch_job_definition
)
job_configs = generate_multimodal_config_combinations(config)
else:
raise ValueError("Invalid module. Choose either 'tabular', 'timeseries', or 'multimodal'.")

response = {
"job_configs": job_configs,
}
return response
+    benchmark_name = config["benchmark_name"]
+    config_s3_path = upload_config(config_list=job_configs, bucket=metrics_bucket, benchmark_name=benchmark_name)
+    env = [{"name": "config_file", "value": config_s3_path}]
+    job_name = f"{benchmark_name}-array-job"
+    parent_job_id = submit_batch_job(
+        env=env,
+        job_name=job_name,
+        job_queue=batch_job_queue,
+        job_definition=batch_job_definition,
+        array_size=len(job_configs),
+    )
+    return {parent_job_id: config_s3_path}


-def generate_multimodal_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
+def generate_multimodal_config_combinations(config):
    common_keys = []
    specific_keys = []
    for key in config.keys():
@@ -278,23 +242,21 @@ def generate_multimodal_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
        else:
            common_keys.append(key)

-    job_configs = {}
    specific_value_combinations = list(
        itertools.product(*(config[key] for key in specific_keys if key in config.keys()))
    ) or [None]

+    all_configs = []
    for combo in specific_value_combinations:
        new_config = {key: config[key] for key in common_keys}
        if combo is not None:
            new_config.update(dict(zip(specific_keys, combo)))
+        all_configs.append(new_config)

-        job_id, config_s3_path = process_combination(new_config, metrics_bucket, batch_job_queue, batch_job_definition)
-        job_configs[job_id] = config_s3_path
-
-    return job_configs
+    return all_configs
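To make the cross-product behavior concrete, a hedged example of the expansion. The config values are hypothetical, and it assumes list-valued module keys are what the hidden branch above classifies as `specific_keys`, matching the README's description of module-specific lists:

```python
# Hypothetical input; two datasets x two presets should expand to 4 configs
# if list-valued keys are the swept "specific" keys.
config = {
    "benchmark_name": "demo",
    "module": "multimodal",
    "dataset_name": ["melbourne_airbnb", "petfinder"],
    "presets": ["medium_quality", "best_quality"],
}
splits = generate_multimodal_config_combinations(config)
print(len(splits))  # expected: 4, one config per (dataset_name, presets) pair
```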


-def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
+def generate_amlb_module_config_combinations(config):
    specific_keys = ["git_uri#branch", "framework", "amlb_constraint", "amlb_user_dir"]
    exclude_keys = ["amlb_benchmark", "amlb_task", "fold_to_run"]
    common_keys = []
@@ -308,13 +270,13 @@ def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
        else:
            common_keys.append(key)

-    job_configs = {}
    specific_value_combinations = list(
        itertools.product(*(config[key] for key in specific_keys if key in config.keys()))
    ) or [None]

-    # Iterate through the combinations and the amlb benchmark task keys
+    # Generates a config for each combination of specific key and keys in `fold_to_run`
+    all_configs = []
    for combo in specific_value_combinations:
        for benchmark, tasks in config["fold_to_run"].items():
            for task, fold_numbers in tasks.items():
@@ -325,62 +287,16 @@ def generate_amlb_module_config_combinations(config, metrics_bucket, batch_job_queue, batch_job_definition):
                    new_config["amlb_benchmark"] = benchmark
                    new_config["amlb_task"] = task
                    new_config["fold_to_run"] = fold_num
-                    job_id, config_s3_path = process_combination(
-                        new_config, metrics_bucket, batch_job_queue, batch_job_definition
-                    )
-                    job_configs[job_id] = config_s3_path
-    return job_configs
+                    all_configs.append(new_config)

+    return all_configs


def handler(event, context):
    """
    Execution entrypoint for AWS Lambda.
    Triggers batch jobs with hyperparameter combinations.
    ENV variables are set by the AWS CDK infra code.
-    Sample of cloud_configs.yaml to be supplied by user
-    # Infra configurations
-    cdk_context:
-        CDK_DEPLOY_ACCOUNT: dummy
-        CDK_DEPLOY_REGION: dummy
-    # Benchmark configurations
-    module: multimodal
-    mode: aws
-    benchmark_name: test_yaml
-    metrics_bucket: autogluon-benchmark-metrics
-    # Module specific configurations
-    module_configs:
-        # Multimodal specific
-        multimodal:
-            git_uri#branch: https://github.com/autogluon/autogluon#master
-            dataset_name: melbourne_airbnb
-            presets: medium_quality
-            hyperparameters:
-                optimization.learning_rate: 0.0005
-                optimization.max_epochs: 5
-            time_limit: 10
-        # Tabular specific
-        # You can refer to AMLB (https://github.com/openml/automlbenchmark#quickstart) for more details
-        tabular:
-            framework:
-                - AutoGluon
-            label:
-                - stable
-            amlb_benchmark:
-                - test
-                - small
-            amlb_task:
-                test: null
-                small:
-                    - credit-g
-                    - vehicle
-            amlb_constraint:
-                - test
    """
if "config_file" not in event or not event["config_file"].startswith("s3"):
raise KeyError("S3 path of config file is required.")
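For context, a hedged example of the minimal event payload that passes this validation (bucket and key are placeholders):

```python
# Hypothetical Lambda test event; the handler only requires an S3 URI here.
event = {"config_file": "s3://my-metrics-bucket/configs/cloud_configs.yaml"}
```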