MLSchema
Lightweight SDK for turning pandas DataFrames into front-end-ready field schemas. Designed to integrate cleanly with mlform, but fully usable as a standalone schema inference layer.
Overview
MLSchema converts tabular data into validated, JSON-serialisable field definitions.
It is intended for projects where a pandas.DataFrame is already the source of truth, but the frontend still needs a stable contract to render inputs, validate values, or generate data-entry workflows. Instead of maintaining hand-written form schemas beside the data pipeline, MLSchema infers the first version of that contract directly from DataFrame columns and dtypes.
The default workflow is intentionally small: call infer_schema(df) and receive a list of fields ready to be consumed by a frontend layer.
import pandas as pd

from mlschema import infer_schema

df = pd.DataFrame(
    {
        "name": ["Ada", "Linus"],
        "score": [98.5, 86.0],
        "active": [True, False],
    }
)

schema = infer_schema(df)
The result is a validated field list:
[
  {
    "kind": "text",
    "label": "name",
    "required": true,
    "mappedTo": 0
  },
  {
    "kind": "number",
    "label": "score",
    "required": true,
    "mappedTo": 1,
    "step": 0.1
  },
  {
    "kind": "boolean",
    "label": "active",
    "required": true,
    "mappedTo": 2
  }
]
Why MLSchema
In many analytics and machine-learning systems, the same data contract is repeated across notebooks, APIs, dashboards, review tools, and prediction forms. That repetition is fragile: a column changes, a dtype is corrected, a categorical value is added, and the UI contract can silently drift away from the data.
MLSchema keeps the schema close to the DataFrame while still producing a frontend-friendly representation. The inferred schema is not a loose description; it is validated through Pydantic models before being returned.
The library is useful when the goal is not to build a form manually, but to derive a reliable baseline from data and then apply only the domain-specific refinements that matter.
What It Provides
| Capability | Detail |
|---|---|
| Automatic field inference | Detects text, number, category, boolean, date, and pair-shaped series columns. |
| Validated output | Field dictionaries are checked against strict Pydantic models before they are returned. |
| Frontend-ready contracts | Output is JSON-serialisable and designed to be consumed by UI libraries such as mlform. |
| Safe defaults | Builtin kinds are enabled by default; no registry setup is required for common DataFrame workflows. |
| Controlled extension | Custom builders and custom kinds allow domain-specific behaviour without changing consumer code. |
| Column-level refinement | Overrides provide final labels, bounds, defaults, placeholders, and units. |
Installation
Install MLSchema with uv:
uv add mlschema
For other package managers and environment details, see the Installation guide.
Quick Start
import pandas as pd

from mlschema import infer_schema

df = pd.read_csv("data.csv")
schema = infer_schema(df)
The generated schema can be passed to a frontend renderer, stored as a contract, inspected in tests, or post-processed before being sent to an API client.
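As a rough illustration of storing the schema as a contract and inspecting it in a test, the sketch below round-trips a field list through JSON. It assumes the fields are plain dictionaries, and the `schema` literal is copied from the example output in the Overview rather than produced by a live `infer_schema` call.

```python
import json

# Field list copied from the Overview example; in a real project this
# would come from infer_schema(df). Plain-dict fields are an assumption.
schema = [
    {"kind": "text", "label": "name", "required": True, "mappedTo": 0},
    {"kind": "number", "label": "score", "required": True, "mappedTo": 1, "step": 0.1},
    {"kind": "boolean", "label": "active", "required": True, "mappedTo": 2},
]

# Serialise with sorted keys so the stored contract diffs cleanly.
contract = json.dumps(schema, indent=2, sort_keys=True)

# Load it back, as a frontend client or a test would.
restored = json.loads(contract)

# A compact view that is convenient to assert on in tests.
kinds_by_label = {field["label"]: field["kind"] for field in restored}
```

A test can then compare `kinds_by_label` against the expected mapping, so that a changed column or dtype shows up as a failing assertion instead of silent drift.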
MLSchema works best when DataFrame dtypes are intentional. Numeric columns should use numeric dtypes, categorical columns should use category, date columns should use pandas datetime dtypes, and boolean columns should use boolean dtypes. Ambiguous object columns fall back to text.
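One way to make dtypes intentional before inference is to cast columns explicitly with standard pandas calls. The sketch below uses hypothetical column names and sample values; only the pandas functions are real API, and the result is what would then be passed to `infer_schema`.

```python
import pandas as pd

# Hypothetical raw data, as it might arrive from a CSV: mostly strings.
raw = pd.DataFrame(
    {
        "age": ["31", "45"],
        "plan": ["free", "pro"],
        "signup": ["2024-01-05", "2024-02-10"],
        "active": [True, False],
    }
)

# Cast each column to the dtype that matches its meaning.
typed = raw.assign(
    age=pd.to_numeric(raw["age"]),           # numeric dtype
    plan=raw["plan"].astype("category"),     # category dtype
    signup=pd.to_datetime(raw["signup"]),    # pandas datetime dtype
    active=raw["active"].astype("boolean"),  # nullable boolean dtype
)
```

Leaving `plan` or `signup` as strings would make them ambiguous object columns, which fall back to text.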
Builtin Field Kinds
MLSchema ships with builtin inference for the standard field types used in most form-generation workflows.
| Kind | Detection |
|---|---|
| series | Non-null cells are 2-element tuples, lists, or dictionaries. |
| onehot-category | 0/1 columns grouped by onehot_separator. |
| boolean | bool, boolean |
| category | category |
| date | datetime64[ns], datetime64[us], datetime64 |
| number | int64, int32, float64, float32 |
| text | Fallback for columns not claimed by an earlier kind. |
The order matters. More specific detections run first, one-hot encoded columns such as color__red are grouped before ordinary numeric fields, and text runs last as the safe fallback.
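The grouping of one-hot columns can be pictured with a small name-based sketch. This is an illustration of the idea only, not the library's implementation: `group_onehot_columns` is a hypothetical helper, the `"__"` default is an assumption taken from the `color__red` example, and the 0/1 value check that the builtin detection also performs is left out.

```python
from collections import defaultdict


def group_onehot_columns(columns, separator="__"):
    """Group names like 'color__red' by the prefix before the separator.

    Hypothetical helper for illustration; it inspects names only.
    """
    groups = defaultdict(list)
    for name in columns:
        prefix, found, value = name.partition(separator)
        # Keep only names with non-empty text on both sides of the separator.
        if found and prefix and value:
            groups[prefix].append(value)
    return dict(groups)


columns = ["color__red", "color__blue", "size__small", "score"]
groups = group_onehot_columns(columns)
```

Here `score` is left ungrouped, so it would remain available to the later number detection.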
Refining Inferred Schemas
Inference gives the structural baseline. Production interfaces usually need a more explicit product contract: clearer labels, minimum and maximum values, defaults, units, placeholders, or descriptions.
Use overrides for those final column-specific refinements.
schema = infer_schema(
    df,
    overrides={
        "score": {
            "label": "Quality score",
            "min": 0,
            "max": 100,
            "unit": "points",
        },
    },
)
Overrides are applied after inference and before final validation. Invalid constraints are rejected instead of being returned as a broken frontend contract.
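The merge-then-validate order described above can be sketched with plain dictionaries. This is a simplified stand-in, not MLSchema's code: `apply_override` is a hypothetical function, and the single `min`/`max` check only stands in for the Pydantic validation the library performs.

```python
def apply_override(field: dict, override: dict) -> dict:
    """Merge an override into an inferred field, then validate the result.

    Hypothetical helper; the real library validates with Pydantic models.
    """
    # Override values win over inferred ones; the input dict is not mutated.
    merged = {**field, **override}

    # Validate after merging, so a bad override cannot slip through.
    lower, upper = merged.get("min"), merged.get("max")
    if lower is not None and upper is not None and lower > upper:
        raise ValueError(f"min ({lower}) must not exceed max ({upper})")
    return merged


inferred = {"kind": "number", "label": "score", "required": True, "mappedTo": 1, "step": 0.1}
refined = apply_override(
    inferred,
    {"label": "Quality score", "min": 0, "max": 100, "unit": "points"},
)
```

Calling `apply_override(inferred, {"min": 10, "max": 0})` raises `ValueError` in this sketch, mirroring how an invalid constraint is rejected rather than returned.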
Extending Inference
Use a custom builder when an existing kind is correct, but a specific column needs domain-aware metadata.
from pandas import Series

from mlschema import FieldContext, infer_schema


def money_builder(series: Series, ctx: FieldContext) -> dict | None:
    # Decline every column except amount_eur by returning None.
    if ctx.name != "amount_eur":
        return None
    return {
        "kind": "number",
        "label": "Amount",
        "required": ctx.required,
        "mappedTo": ctx.mappedTo,
        "step": 0.01,
        "unit": "EUR",
        "min": 0,
    }


schema = infer_schema(df, builders=[money_builder])
Use a custom kind when the frontend needs a new field contract with its own discriminator and validation model.
from typing import Literal

from pandas import Series

from mlschema import BaseField, FieldContext, infer_schema, kind


class DurationField(BaseField):
    kind: Literal["duration"] = "duration"
    unit: Literal["seconds"] = "seconds"
    minSeconds: int
    maxSeconds: int


def duration_builder(series: Series, ctx: FieldContext) -> dict | None:
    # Only claim timedelta columns; decline everything else.
    if ctx.dtype not in {"timedelta64[ns]", "timedelta64[us]"}:
        return None
    return {
        "kind": "duration",
        "label": ctx.name,
        "required": ctx.required,
        "mappedTo": ctx.mappedTo,
        "unit": "seconds",
        "minSeconds": int(series.min().total_seconds()),
        "maxSeconds": int(series.max().total_seconds()),
    }


schema = infer_schema(
    df,
    kinds=[
        kind(model=DurationField, infer=duration_builder),
    ],
)
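To see what the bounds in `duration_builder` evaluate to, the arithmetic can be tried on a small timedelta column using pandas alone. The column name and values below are hypothetical sample data; the snippet does not call MLSchema.

```python
import pandas as pd

# Hypothetical sample data: durations stored as a timedelta column.
df = pd.DataFrame({"elapsed": pd.to_timedelta(["90s", "1h", "15min"])})
series = df["elapsed"]

# Same arithmetic as duration_builder: Timedelta -> whole seconds.
min_seconds = int(series.min().total_seconds())
max_seconds = int(series.max().total_seconds())
```

For this data the shortest duration is 90 seconds and the longest is one hour, so the field would carry `minSeconds` of 90 and `maxSeconds` of 3600.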
The extension model stays narrow by design: builtin inference handles the common path, builders handle reusable domain rules, custom kinds define new frontend contracts, and overrides apply final column-level decisions.
Designed For mlform
MLSchema pairs naturally with mlform. MLSchema infers and validates the field contract; mlform can consume that contract to render interactive forms.
This separation keeps the Python side focused on schema inference and the frontend side focused on rendering, interaction, and submission workflows.
MLSchema can also be used without mlform wherever a validated, JSON-serialisable field schema is useful.
Documentation
| Page | Purpose |
|---|---|
| Installation | Environment setup and package installation. |
| Usage Guide | Practical schema inference, overrides, builders, custom kinds, and series columns. |
| API Reference | Public symbols, signatures, and lower-level details. |
| Changelog | Version history and migration notes. |
Links
| Resource | URL |
|---|---|
| GitHub | github.com/UlloaSP/mlschema |
| mlform | github.com/UlloaSP/mlform |
| mlform docs | ulloasp.github.io/mlform |