Update README and Tutorial #139

Merged: 4 commits, Nov 19, 2024
Changes from 3 commits
4 changes: 2 additions & 2 deletions README.md
@@ -4,9 +4,9 @@
[![GitHub license](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](./LICENSE)
[![Continuous Integration](https://github.com/autogluon/autogluon-assistant/actions/workflows/lint.yml/badge.svg)](https://github.com/autogluon/autogluon-assistant/actions/workflows/lint.yml)

AutoGluon Assistant (AG-A) provides users a simple interface where they can input their data, describe their problem, and receive a highly accurate and competitive ML solution — without writing any code. By leveraging the state-of-the-art AutoML capabilities of AutoGluon and integrating them with a Large Language Model (LLM), AG-A automates the entire data science pipeline. AG-A takes AutoGluons automation from three lines of code to zero, enabling users to solve new supervised learning tabular problems using only natural language descriptions.
AutoGluon Assistant (AG-A) provides users a simple interface where they can input their data, describe their problem, and receive a highly accurate and competitive ML solution — without writing any code. By leveraging the state-of-the-art AutoML capabilities of [AutoGluon](https://github.com/autogluon/autogluon) and integrating them with a Large Language Model (LLM), AG-A automates the entire data science pipeline. AG-A takes [AutoGluon](https://github.com/autogluon/autogluon)'s automation from three lines of code to zero, enabling users to solve new supervised learning tabular problems using only natural language descriptions.

## Setup
## 💾 Installation

Installing from source:

159 changes: 106 additions & 53 deletions docs/tutorials/autogluon-assistant-quick-start.ipynb
@@ -43,12 +43,29 @@
"!pip install autogluon.assistant"
]
},
{
"cell_type": "markdown",
"id": "3ea8b014",
"metadata": {},
"source": [
"*Warning: If you are using macOS, you may need to install libomp with:*\n",
"```bash\n",
"brew install libomp\n",
"pip install --upgrade lightgbm\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8d4f6834",
"metadata": {},
"source": [
"AutoGluon Assistant supports two LLM providers: AWS Bedrock (default) and OpenAI. Choose one of the following setups:"
"AutoGluon Assistant supports two LLM providers: AWS Bedrock (default) and OpenAI. You can configure one with the provided configuration tool:\n",
"```bash\n",
"wget https://raw.githubusercontent.com/autogluon/autogluon-assistant/refs/heads/main/tools/configure_llms.sh\n",
"source ./configure_llms.sh\n",
"```\n",
"Or choose one of the following setups:"
]
},
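For reference, credentials for these providers are usually supplied through environment variables. The snippet below is only a sketch using the standard AWS and OpenAI variable names; it is not taken from this diff, and the manual setups referenced above may differ:

```bash
# Sketch only: standard AWS Bedrock and OpenAI credential variables.
# AWS Bedrock (default provider):
export AWS_DEFAULT_REGION="us-west-2"      # choose a region with Bedrock access
export AWS_ACCESS_KEY_ID="<your-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret>"

# OpenAI (alternative provider):
export OPENAI_API_KEY="<your-openai-key>"
```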
{
@@ -234,50 +251,62 @@
"```\n",
"INFO:root:Starting AutoGluon-Assistant\n",
"INFO:root:Presets: medium_quality\n",
"INFO:root:Loading default config from: /media/deephome/autogluon-assistant/src/autogluon.assistant/configs/medium_quality.yaml\n",
"INFO:root:Loading default config from: /media/deephome/autogluon-assistant/src/autogluon/assistant/configs/default.yaml\n",
"INFO:root:Merging medium_quality config from: /media/deephome/autogluon-assistant/src/autogluon/assistant/configs/medium_quality.yaml\n",
"INFO:root:Successfully loaded config\n",
"🤖 Welcome to AutoGluon-Assistant \n",
"Will use task config:\n",
"{\n",
" 'infer_eval_metric': True,\n",
" 'detect_and_drop_id_column': False,\n",
" 'task_preprocessors_timeout': 3600,\n",
" 'time_limit': 600,\n",
" 'save_artifacts': {'enabled': False, 'append_timestamp': True, 'path': './aga-artifacts'},\n",
" 'feature_transformers': None,\n",
" 'autogluon': {'predictor_init_kwargs': {}, 'predictor_fit_kwargs': {'presets': 'medium_quality', 'time_limit': 600}},\n",
" 'llm': {\n",
" 'provider': 'bedrock',\n",
" 'model': 'anthropic.claude-3-5-sonnet-20241022-v2:0',\n",
" 'max_tokens': 512,\n",
" 'proxy_url': None,\n",
" 'temperature': 0,\n",
" 'verbose': True\n",
" }\n",
" 'feature_transformers': {\n",
" 'enabled_models': [],\n",
" 'models': {\n",
" 'CAAFE': {\n",
" '_target_': 'autogluon.assistant.transformer.feature_transformers.caafe.CAAFETransformer',\n",
" 'eval_model': 'lightgbm',\n",
" 'llm_provider': '${llm.provider}',\n",
" 'llm_model': '${llm.model}',\n",
" 'num_iterations': 5,\n",
" 'optimization_metric': 'roc'\n",
" },\n",
" 'OpenFE': {'_target_': 'autogluon.assistant.transformer.feature_transformers.openfe.OpenFETransformer', 'n_jobs': 1, 'num_features_to_keep': 10},\n",
" 'PretrainedEmbedding': {'_target_': 'autogluon.assistant.transformer.feature_transformers.scentenceFT.PretrainedEmbeddingTransformer', 'model_name': 'all-mpnet-base-v2'}\n",
" }\n",
" },\n",
" 'autogluon': {'predictor_init_kwargs': {}, 'predictor_fit_kwargs': {'presets': 'medium_quality'}},\n",
" 'llm': {'provider': 'bedrock', 'model': 'anthropic.claude-3-5-sonnet-20241022-v2:0', 'max_tokens': 512, 'proxy_url': None, 'temperature': 0, 'verbose': True}\n",
"}\n",
"Task path: /media/deephome/testdir/toy_data\n",
"Task path: /media/deephome/autogluon-assistant/toy_data_newest_backup\n",
"Task loaded!\n",
"TabularPredictionTask(name=toy_data, description=, 3 datasets)\n",
"INFO:botocore.credentials:Found credentials in environment variables.\n",
"INFO:autogluon.assistant.llm.llm:AGA is using model anthropic.claude-3-5-sonnet-20241022-v2:0 from Bedrock to assist you with the task.\n",
"INFO:autogluon.assistant.assistant:Task understanding starts...\n",
"INFO:autogluon.assistant.task_inference.task_inference:description: data_description_file: You are solving this data science tasks of binary classification: \\nThe dataset presented here (the spaceship dataset) comprises a lot of features, including both numerical and categorical features. Some of the features are missing, with nan value. We have splitted the dataset into three parts of train, valid and test. Your task is to predict the Transported item, which is a binary label with True and False. The evaluation metric is the classification accuracy.\\n\n",
"INFO:autogluon.assistant.task_inference.task_inference:train_data: /media/deephome/testdir/toy_data/train.csv\n",
"Loaded data from: /media/deephome/testdir/toy_data/train.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"INFO:autogluon.assistant.task_inference.task_inference:test_data: /media/deephome/testdir/toy_data/test.csv\n",
"Loaded data from: /media/deephome/testdir/toy_data/test.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"INFO:autogluon.assistant.task_inference.task_inference:WARNING: Failed to identify the sample_submission_data of the task, it is set to None.\n",
"INFO:autogluon.assistant.task_inference.task_inference:label_column: Transported\n",
"INFO:autogluon.assistant.task_inference.task_inference:problem_type: binary\n",
"INFO:autogluon.assistant.task_inference.task_inference:eval_metric: accuracy\n",
"INFO:autogluon.assistant.assistant:Total number of prompt tokens: 1582\n",
"INFO:autogluon.assistant.assistant:Total number of completion tokens: 155\n",
"INFO:autogluon.assistant.assistant:Task understanding complete!\n",
"INFO:autogluon.assistant.assistant:Automatic feature generation is disabled. \n",
"TabularPredictionTask(name=toy_data_newest_backup, description=, 3 datasets)\n",
"INFO:botocore.credentials:Found credentials from IAM Role: Bedrock_Access\n",
"AGA is using model anthropic.claude-3-5-sonnet-20241022-v2:0 from Bedrock to assist you with the task.\n",
"INFO:botocore.credentials:Found credentials from IAM Role: Bedrock_Access\n",
"INFO:root:It took 0.16 seconds initializing components. Time remaining: 599.83/600.00\n",
"Task understanding starts...\n",
"description: data_description_file: You are solving this data science tasks of binary classification: \\nThe dataset presented here (the spaceship dataset) comprises a lot of features, including both numerical and categorical features. Some of the features are missing, with nan value. We have splitted the dataset into three parts of train, valid and test. Your task is to predict the Transported item, which is a binary label with True and False. The evaluation metric is the classification accuracy.\\n\n",
"train_data: /media/deephome/autogluon-assistant/toy_data_newest_backup/train.csv\n",
"Loaded data from: /media/deephome/autogluon-assistant/toy_data_newest_backup/train.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"test_data: /media/deephome/autogluon-assistant/toy_data_newest_backup/test.csv\n",
"Loaded data from: /media/deephome/autogluon-assistant/toy_data_newest_backup/test.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"WARNING: Failed to identify the sample_submission_data of the task, it is set to None.\n",
"label_column: Transported\n",
"problem_type: binary\n",
"eval_metric: accuracy\n",
"Total number of prompt tokens: 1614\n",
"Total number of completion tokens: 179\n",
"Task understanding complete!\n",
"Automatic feature generation is disabled. \n",
"INFO:root:It took 17.31 seconds preprocessing task. Time remaining: 582.51/600.00\n",
"Model training starts...\n",
"INFO:autogluon.assistant.predictor:Fitting AutoGluon TabularPredictor\n",
"INFO:autogluon.assistant.predictor:predictor_init_kwargs: {'learner_kwargs': {'ignored_columns': []}, 'label': 'Transported', 'problem_type': 'binary', 'eval_metric': 'accuracy'}\n",
"INFO:autogluon.assistant.predictor:predictor_fit_kwargs: {'presets': 'medium_quality', 'time_limit': 600}\n",
"No path specified. Models will be saved in: \"AutogluonModels/ag-20241111_055131\"\n",
"Fitting AutoGluon TabularPredictor\n",
"predictor_init_kwargs: {'learner_kwargs': {'ignored_columns': []}, 'label': 'Transported', 'problem_type': 'binary', 'eval_metric': 'accuracy'}\n",
"predictor_fit_kwargs: {'presets': 'medium_quality'}\n",
"No path specified. Models will be saved in: \"AutogluonModels/ag-20241119_214901\"\n",
"Verbosity: 2 (Standard Logging)\n",
"=================== System Info ===================\n",
"AutoGluon Version: 1.1.1\n",
@@ -286,12 +315,12 @@
"Platform Machine: x86_64\n",
"Platform Version: #54~20.04.1-Ubuntu SMP Fri Oct 6 22:04:33 UTC 2023\n",
"CPU Count: 96\n",
"Memory Avail: 1030.28 GB / 1121.80 GB (91.8%)\n",
"Disk Space Avail: 64.75 GB / 860.63 GB (7.5%)\n",
"Memory Avail: 1024.83 GB / 1121.80 GB (91.4%)\n",
"Disk Space Avail: 63.91 GB / 860.63 GB (7.4%)\n",
"===================================================\n",
"Presets specified: ['medium_quality']\n",
"Beginning AutoGluon training ... Time limit = 600s\n",
"AutoGluon will save models to \"AutogluonModels/ag-20241111_055131\"\n",
"Beginning AutoGluon training ... Time limit = 583s\n",
"AutoGluon will save models to \"AutogluonModels/ag-20241119_214901\"\n",
"Train Data Rows: 1000\n",
"Train Data Columns: 15\n",
"Label Column: Transported\n",
@@ -300,7 +329,7 @@
"Selected class <--> label mapping: class 1 = True, class 0 = False\n",
"Using Feature Generators to preprocess the data ...\n",
"Fitting AutoMLPipelineFeatureGenerator...\n",
" Available Memory: 1055013.00 MB\n",
" Available Memory: 1049426.06 MB\n",
" Train Data (Original) Memory Usage: 0.48 MB (0.0% of available memory)\n",
" Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
" Stage 1 Generators:\n",
@@ -329,7 +358,7 @@
" 0.1s = Fit runtime\n",
" 14 features in original data used to generate 14 features in processed data.\n",
" Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)\n",
"Data preprocessing and feature engineering runtime = 0.1s ...\n",
"Data preprocessing and feature engineering runtime = 0.09s ...\n",
"AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
" To change this, specify the eval_metric parameter of Predictor()\n",
"Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200\n",
@@ -345,31 +374,53 @@
" 'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],\n",
"}\n",
"Fitting 13 L1 models ...\n",
"Fitting model: KNeighborsUnif ... Training model for up to 599.9s of the 599.9s of remaining time.\n",
"Fitting model: KNeighborsUnif ... Training model for up to 582.42s of the 582.42s of remaining time.\n",
" 0.805 = Validation score (accuracy)\n",
" 0.04s = Training runtime\n",
" 0.04s = Validation runtime\n",
"Fitting model: KNeighborsDist ... Training model for up to 599.82s of the 599.82s of remaining time.\n",
"Fitting model: KNeighborsDist ... Training model for up to 582.34s of the 582.33s of remaining time.\n",
" 0.79 = Validation score (accuracy)\n",
" 0.03s = Training runtime\n",
" 0.03s = Validation runtime\n",
"Fitting model: LightGBMXT ... Training model for up to 599.75s of the 599.75s of remaining time.\n",
" 0.04s = Validation runtime\n",
"Fitting model: LightGBMXT ... Training model for up to 582.27s of the 582.27s of remaining time.\n",
" 0.83 = Validation score (accuracy)\n",
" 0.87s = Training runtime\n",
" 0.01s = Validation runtime\n",
" 1.44s = Training runtime\n",
" 0.02s = Validation runtime\n",
"\n",
"......\n",
"\n",
"Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 581.72s of remaining time.\n",
"Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 556.21s of remaining time.\n",
" Ensemble Weights: {'LightGBMLarge': 0.4, 'NeuralNetTorch': 0.25, 'NeuralNetFastAI': 0.2, 'CatBoost': 0.15}\n",
" 0.855 = Validation score (accuracy)\n",
" 0.12s = Training runtime\n",
" 0.16s = Training runtime\n",
" 0.0s = Validation runtime\n",
"AutoGluon training complete, total runtime = 18.41s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4025.3 rows/s (200 batch size)\n",
"TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"AutogluonModels/ag-20241111_055131\")\n",
"AutoGluon training complete, total runtime = 26.47s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 2470.3 rows/s (200 batch size)\n",
"TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"AutogluonModels/ag-20241119_214901\")\n",
"Model training complete!\n",
"INFO:root:It took 26.84 seconds training model. Time remaining: 555.67/600.00\n",
"Prediction starts...\n",
"Prediction complete! Outputs written to aga-output-20241111_055149.csv\n",
"Prediction complete! Outputs written to aga-output-20241119_214928.csv\n",
"INFO:root:It took 0.15 seconds making predictions. Time remaining: 555.52/600.00\n",
"```"
]
},
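The log above reports the learned ensemble weights for `WeightedEnsemble_L2`. Conceptually, a weighted ensemble combines its members' predicted probabilities using those weights; the snippet below is an illustrative sketch of that idea (with made-up per-model probabilities), not AutoGluon's actual implementation:

```python
# Weights reported in the training log for WeightedEnsemble_L2.
weights = {"LightGBMLarge": 0.4, "NeuralNetTorch": 0.25,
           "NeuralNetFastAI": 0.2, "CatBoost": 0.15}

# Hypothetical per-model predicted probabilities of class True for one row.
probs = {"LightGBMLarge": 0.82, "NeuralNetTorch": 0.74,
         "NeuralNetFastAI": 0.69, "CatBoost": 0.77}

# The ensemble probability is the weight-averaged member probability;
# the final label comes from thresholding at 0.5.
ensemble_prob = sum(weights[m] * probs[m] for m in weights)
prediction = ensemble_prob >= 0.5
```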
{
"cell_type": "markdown",
"id": "43e89033",
"metadata": {},
"source": [
"You can override specific settings in the YAML configuration defined in the [config folder](https://github.com/autogluon/autogluon-assistant/tree/main/src/autogluon/assistant/configs) using\n",
"the `config_overrides` parameter, which takes overrides in the format `\"key1=value1, key2.nested=value2\"`, from the command line.\n",
"\n",
"Here are some example commands using configuration overrides:\n",
"\n",
"```bash\n",
"aga run toy_data --config_overrides \"feature_transformers.enabled_models=None, time_limit=3600\"\n",
"\n",
"# OR\n",
"\n",
"aga run toy_data --config_overrides \"feature_transformers.enabled_models=None\" --config_overrides \"time_limit=3600\"\n",
"```"
]
},
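To make the dotted-key syntax concrete: an override like `feature_transformers.enabled_models=None` addresses a nested entry in the config. The helper below is a hypothetical sketch of such parsing, not AG-A's actual parser; in particular, it leaves all values as strings, whereas a real parser likely type-converts them:

```python
# Hypothetical sketch: apply "key1=value1, key2.nested=value2" overrides
# to a nested config dict. Values remain strings in this simplified version.
def apply_overrides(config, overrides):
    for pair in overrides.split(","):
        key, value = pair.strip().split("=", 1)
        node = config
        parts = key.split(".")
        for part in parts[:-1]:          # walk/create intermediate dicts
            node = node.setdefault(part, {})
        node[parts[-1]] = value          # set the leaf value
    return config

cfg = {"time_limit": 600, "feature_transformers": {"enabled_models": ["OpenFE"]}}
apply_overrides(cfg, "feature_transformers.enabled_models=None, time_limit=3600")
```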
@@ -478,7 +529,9 @@
"id": "b20d780a",
"metadata": {},
"source": [
"AG-A Web UI should now be accessible in your web browser at http://localhost:8501 or the specified port."
"AG-A Web UI should now be accessible in your web browser at http://localhost:8501 or the specified port.\n",
"\n",
"*Note: It might take up to a few minutes to launch the Web UI for the first time, since the sample datasets are being downloaded.*"
]
},
{
Expand Down Expand Up @@ -512,7 +565,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
"version": "3.8.10"
}
},
"nbformat": 4,