Harness ML is an MCP server that gives you structured access to a full ML pipeline — data ingestion, feature engineering, model training, backtesting, calibration, ensembling, and experiment tracking. You never write training loops or ML code. You declare intent through tool calls, and Harness handles execution, validation, and logging.
7 tools. 80+ actions. 17 guardrails. 45 metrics across 6 task types. 12 model wrappers. 22 view transform steps. 4 calibration methods. 8 CV strategies.
Everything below is your complete protocol reference.
workflow.enforce_phases: true**Error**:project_dir (defaults to cwd)configure(action="init", project_name="my_project", task="regression", target_column="price")
data(action="add", data_path="data/raw/train.csv")
data(action="inspect")
features(action="discover")
models(action="add", name="xgb_main", model_type="xgboost")
models(action="add", name="lgb_main", model_type="lightgbm")
configure(action="backtest", cv_strategy="kfold", fold_values=[1,2,3,4,5], metrics=["rmse","mae","r2"])
configure(action="ensemble", method="stacked", calibration="spline")
pipeline(action="run_backtest")
pipeline(action="diagnostics")
init → project_name?, task?, target_column?, key_columns?, time_column?
update_data → target_column?, key_columns?, time_column?
ensemble → method?(stacked|average|weighted), temperature?, exclude_models?, calibration?(spline|isotonic|platt|none), pre_calibration?(JSON), spline_n_bins?
backtest → cv_strategy?, fold_values?(list), metrics?(list), min_train_folds?, fold_column?
show → detail?(summary|full), section?(models|ensemble|backtest)
check_guardrails →
exclude_columns → add_columns?(list), remove_columns?(list)
set_denylist → add_columns?, remove_columns?
add_target → name, target_column, task?(binary), metrics?
list_targets →
set_target → name
studio →
add → data_path, join_on?(list), prefix?, auto_clean?
validate → data_path
fill_nulls → column, strategy?(median|mean|mode|zero|value), value?
fill_nulls_batch → columns(JSON array of {column, strategy?, value?})
drop_duplicates → columns?(subset list)
drop_rows → column?, condition?("null" or pandas query)
rename → mapping(JSON {old: new})
derive_column → name, expression, group_by?, dtype?
inspect → column?
profile → category?
list_features → prefix?
status →
add_source → name, data_path, format?(auto|csv|parquet|excel)
add_sources_batch → sources(JSON array)
add_view → name, source, steps?(JSON array), description?
add_views_batch → views(JSON array)
update_view → name, source?, steps?, description?
remove_view → name
list_views →
preview_view → name, n_rows?(5)
set_features_view → name
view_dag →
list_sources →
check_freshness →
refresh → name
refresh_all →
validate_source → name
sample → fraction(0.0-1.0), stratify_column?, seed?(42)
restore →
fetch_url → data_path(URL), name?
upload_drive → files(list), folder_id?, folder_name?, name?
upload_kaggle → files, dataset_slug, title?
View step formats (for steps JSON array):
{op: "filter", expr: "col > 0"}
{op: "select", columns: ["a", "b"]}
{op: "derive", columns: {new_col: "a - b"}}
{op: "group_by", keys: ["k"], aggs: {col: "mean"}}
{op: "join", other: "view_name", on: {left_col: right_col}}
{op: "union", other: "view_name"}
{op: "rolling", keys: ["grp"], order_by: "col", window: 5, aggs: {new: "src:mean"}}
{op: "sort", by: ["col"], ascending: true}
{op: "head", keys: ["grp"], n: 5, order_by: "col"}
{op: "cast", columns: {col: "int"}}
{op: "distinct", columns: ["col"]}
{op: "rank", columns: {rank_col: "value_col"}, keys: ["grp"], ascending: false}
{op: "isin", column: "col", values: ["a", "b"]}
{op: "cond_agg", keys: ["k"], aggs: {new: "src:sum:condition_expr"}}
{op: "lag", keys: ["k"], order_by: "col", columns: {new: "src"}, periods: 1}
{op: "ewm", keys: ["k"], order_by: "col", columns: {new: "src"}, span: 10}
{op: "diff", keys: ["k"], order_by: "col", columns: {new: "src"}}
{op: "trend", keys: ["k"], order_by: "col", columns: {new: "src"}, window: 5}
{op: "encode", columns: ["cat_col"], method: "ordinal"}
{op: "bin", column: "col", bins: 5, labels: [...]}
{op: "datetime", column: "dt", extract: ["year", "month", "dow"]}
{op: "null_indicator", columns: ["col"]}
add → name, formula?|source?|condition?|type?, pairwise_mode?(diff|ratio|both|none), category?, description?
add_batch → features(JSON array of {name, formula?, type?, source?, condition?, pairwise_mode?, category?, description?})
test_transformations → features(list of column names), test_interactions?(true)
discover → top_n?(20), method?(xgboost|mutual_info)
diversity →
auto_search → features?(list), search_types?(interactions|lags|rolling), top_n?
Feature types: entity (auto-generates pairwise), pairwise (instance formula), instance (context column), regime (boolean flag).
add → name, model_type?|preset?, features?, params?(JSON), active?, include_in_ensemble?, mode?(classifier|regressor), prediction_type?, cdf_scale?, zero_fill_features?, class_weight?
update → name, features?, append_features?, remove_features?, params?, active?, include_in_ensemble?, mode?, prediction_type?, cdf_scale?, replace_params?(false)
remove → name, purge?(false)
list →
show → name
presets →
add_batch → items(JSON array)
update_batch → items(JSON array)
remove_batch → items(JSON array of {name, purge?})
clone → name(source), items(JSON {new_name, ...overrides})
Model types: xgboost, lightgbm, catboost, random_forest, logistic_regression, elastic_net, mlp, tabnet, svm, hist_gbm, gam, ngboost.
class_weight: “balanced” or JSON dict {“0”: 1.0, “1”: 2.5}. Regressors need cdf_scale for probability conversion. params default-merges unless replace_params=true.
create → description, hypothesis
write_overlay → experiment_id, overlay(JSON — supports dot-notation keys)
run → experiment_id, primary_metric?, variant?
promote → experiment_id, primary_metric?
quick_run → description, overlay(JSON), hypothesis, primary_metric?
explore → search_space(JSON {axes, budget, primary_metric}), detail?(summary|full)
promote_trial → experiment_id(exploration ID), trial?(best), primary_metric?, hypothesis?
compare → experiment_ids(list of exactly 2)
journal → last_n?(20)
log_result → experiment_id, description?, hypothesis?, conclusion?, verdict?(keep|revert|partial)
Search space format:
{"axes": {"learning_rate": {"type": "float", "low": 0.01, "high": 0.3}}, "budget": 10, "primary_metric": "rmse"}
run_backtest → experiment_id?, variant?
predict → fold_value, run_id?, variant?
diagnostics → run_id?, detail?(summary|full)
list_runs →
show_run → run_id?, detail?
compare_runs → run_ids(list of exactly 2)
compare_latest →
compare_targets →
explain → name?(model), run_id?, top_n?(10)
inspect_predictions → run_id?, mode?(worst|best|uncertain), top_n?(10)
export_notebook → destination(colab|kaggle|local), output_path?
progress →
create → config(JSON), name?
list_formats →
simulate → name?, n_sims?(10000), seed?(42)
standings → name?, top_n?(20)
round_probs → name?, top_n?
generate_brackets → pool_size, name?, n_brackets?(3), n_sims?(10000), seed?
score_bracket → picks(JSON), actuals(JSON), name?
adjust → adjustments(JSON), name?
explain → name?
profiles → name?, top_n?
confidence → name?
export → output_dir, name?, format_type?(json|markdown|csv)
list_strategies →
No config/ directory → Run configure(action="init") first
Empty feature store → Run data(action="add", data_path="...") first
Unknown model type → Lists valid types + fuzzy suggestion
Invalid action → Lists valid actions + "Did you mean?"
Missing required param → "**Error**: {param} is required"
Invalid JSON → "Invalid JSON input: {details}"
Guardrail violation → {blocked: bool, rule: str, message: str, overridable: bool}
Workflow gate → "**Blocked**: Complete {phase} before {action}"
Skills are structured prompts loaded into your context to enforce methodology.
| Metrics (45 across 6 task types): brier, accuracy, log_loss, ece, auroc, f1, precision, recall | macro/micro/weighted variants | rmse, mae, r2, mape | ndcg, mrr, map | concordance_index, brier_survival | crps, calibration, sharpness |
Calibration: spline (PCHIP), isotonic, platt, beta
CV strategies: leave_one_out, expanding_window, sliding_window, purged_kfold, stratified_kfold, group_kfold, bootstrap
Models: xgboost, lightgbm, catboost, random_forest, logistic_regression, elastic_net, mlp, tabnet, svm, hist_gbm, gam, ngboost
Meta-learner types: logistic (default), ridge, gbm
Preprocessing: zscore, robust, quantile scaling (leakage-safe); median, mean, zero, knn, iterative imputation; frequency encoding
Feature selection: k_best, rfe, correlation_cluster
Ensemble diversity: disagreement, q_statistic, kappa, correlation
Drift detection: KS test, PSI, multi-feature drift
Conformal prediction: split-conformal with finite-sample correction
Explainability: SHAP values, partial dependence plots, feature interactions
View steps (22): filter, select, derive, group_by, join, union, unpivot, sort, head, rolling, cast, distinct, rank, isin, cond_agg, lag, ewm, diff, trend, encode, bin, datetime, null_indicator