Schema Standard
The MLSchema standard defines the JSON field contract emitted by infer_schema(): a validated, frontend-ready representation of tabular inputs derived from pandas.DataFrame columns.
Overview
MLSchema is not a generic JSON Schema generator. It defines a compact field standard for data-driven interfaces: prediction forms, review tools, annotation workflows, internal dashboards, and frontend libraries such as mlform.
The standard is intentionally close to the DataFrame. Column names become labels, pandas dtypes drive field kinds, nullability determines whether a field is required, and Pydantic models validate the final contract before it is returned.
The canonical public workflow is:
from mlschema import infer_schema
schema = infer_schema(df)
The result is always a list of field dictionaries.
[
{
"kind": "text",
"label": "name",
"required": true
},
{
"kind": "number",
"label": "score",
"required": true,
"step": 0.1
}
]
This page documents the schema contract itself: its payload shape, reserved attributes, builtin kinds, validation rules, extension model, and compatibility expectations.
Design Principles
MLSchema follows a small set of constraints to keep generated schemas predictable across Python backends and frontend consumers.
The top-level payload is always a field list. There is no envelope, model metadata block, version header, or output schema section in the emitted payload. Those concerns can be added by downstream applications, but the MLSchema contract remains focused on input fields.
Each field is discriminated by kind. Consumers should branch on this value to decide how to render, validate, or transform the field.
Every emitted field is validated before it is returned. Invalid constraints are not passed through as “best effort” JSON.
The output is JSON-serialisable. Pydantic serialisation is performed in JSON mode and None values are omitted, which keeps the payload compact while preserving explicit defaults and constraints.
Builtin inference is enabled by default. Common DataFrame workflows do not require manual registration.
Extension is explicit. Domain behaviour enters through custom builders, custom kinds, and overrides, not by mutating internal registries or relying on private classes.
Canonical Payload Shape
The MLSchema payload is a list of fields.
[
{
"kind": "text",
"label": "customer_name",
"required": true,
"mappedTo": "customer_name",
"minLength": 1,
"maxLength": 100,
"placeholder": "Enter full name"
},
{
"kind": "number",
"label": "satisfaction_score",
"required": true,
"mappedTo": "satisfaction_score",
"min": 0,
"max": 100,
"step": 1,
"unit": "points"
}
]
Each object in the list represents one frontend field. Field order follows DataFrame column order.
A field dictionary contains three layers of information:
| Layer | Purpose |
|---|---|
| Base attributes | Shared contract present across field kinds. |
| Kind-specific attributes | Constraints and metadata belonging to a specific field kind. |
| Presentation hints | Optional kind-specific hints that do not change the core data contract. |
The standard does not require consumers to understand every key. A renderer can safely use kind, label, and required as the minimum baseline, then progressively support kind-specific attributes.
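As a rough illustration of that baseline, the sketch below summarises any field from kind, label, and required alone, then layers on one optional attribute when it is present. The helper name describe_field and its output format are hypothetical; only the keys come from the contract.

```python
from typing import Any


def describe_field(field: dict[str, Any]) -> str:
    """Build a one-line summary from the baseline keys (hypothetical helper)."""
    # Baseline: these three keys are enough to render something useful.
    marker = "*" if field["required"] else ""
    summary = f"{field['label']}{marker} [{field['kind']}]"

    # Progressive support: an optional key is used only when it is present.
    if "unit" in field:
        summary += f" ({field['unit']})"
    return summary


fields = [
    {"kind": "text", "label": "customer_name", "required": True},
    {"kind": "number", "label": "satisfaction_score", "required": False, "unit": "points"},
]
summaries = [describe_field(f) for f in fields]
```

A consumer written this way keeps working when a producer starts emitting attributes it does not yet understand.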
Base Field Contract
All fields inherit from BaseField.
The required base attributes are:
| Attribute | Type | Meaning |
|---|---|---|
| kind | str | Field discriminator. |
| label | str | Human-readable label. Defaults to the column name. |
| required | bool | true when the source column contains no missing values. |
| mappedTo | str or int | Backend target. Named columns use their column name; positional columns use original model input position. |
Optional base attributes are omitted when not set.
| Attribute | Type | Meaning |
|---|---|---|
| description | str | Help text or domain explanation. |
| valuePath | str or list[str] | Alternative path used by consumers when reading or writing values. |
| defaultValue | any | Initial field value. |
Field models reject unknown attributes. This is deliberate: the schema should fail at generation time rather than leak unsupported keys into the frontend contract.
Builtin Kind Resolution
Builtin kinds are enabled by default and evaluated in a fixed order.
series
onehot-category
boolean
category
date
number
text
The order is part of the contract. More specific detections run before broader fallbacks. series runs first because it detects pair-shaped object cells by content. text runs last because it accepts any column that was not claimed earlier.
| Kind | Detection | Inferred metadata |
|---|---|---|
| series | Non-null cells are 2-element tuples, lists, or dictionaries. | field1, field2 |
| onehot-category | 0/1 columns grouped by onehot_separator. | options[].mappedTo |
| boolean | bool, boolean | Base field metadata |
| category | category | options |
| date | datetime64[ns], datetime64[us], datetime64 | Base field metadata |
| number | int64, int32, float64, float32 | step |
| text | Fallback | Base field metadata |
The contract assumes that DataFrame dtypes are meaningful. A numeric column stored as object is not treated as number unless a custom builder or preprocessing step handles it.
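The first-match behaviour can be sketched as an ordered list of rules. This is an illustration of the idea only, not MLSchema's internal code: the names DTYPE_RULES and resolve_kind are hypothetical, and only the dtype-driven builtins are modelled because series and onehot-category depend on cell content and column names.

```python
from typing import Callable

# Ordered (kind, predicate) pairs; the first predicate that accepts wins.
# series and onehot-category are omitted: they inspect content and column
# names rather than the dtype name alone.
DTYPE_RULES: list[tuple[str, Callable[[str], bool]]] = [
    ("boolean", lambda dtype: dtype in {"bool", "boolean"}),
    ("category", lambda dtype: dtype == "category"),
    ("date", lambda dtype: dtype in {"datetime64[ns]", "datetime64[us]", "datetime64"}),
    ("number", lambda dtype: dtype in {"int64", "int32", "float64", "float32"}),
    ("text", lambda dtype: True),  # fallback: claims whatever is left
]


def resolve_kind(dtype_name: str) -> str:
    """Return the first kind whose predicate accepts the dtype name."""
    for kind, accepts in DTYPE_RULES:
        if accepts(dtype_name):
            return kind
    raise AssertionError("unreachable: the text rule accepts everything")
```

Under these rules resolve_kind("object") falls through to text, which matches the note above about numeric data stored in object columns.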
Required Semantics
required is inferred from nullability.
A column with no missing values produces:
{
"kind": "text",
"label": "name",
"required": true,
"mappedTo": "name"
}
A column containing at least one missing value produces:
{
"kind": "text",
"label": "name",
"required": false,
"mappedTo": "name"
}
This rule is intentionally simple. MLSchema does not infer business-level optionality. If a nullable training column should still be required in a production form, set that decision with overrides.
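The nullability rule itself is small enough to sketch in plain Python. The helpers below are hypothetical and approximate pandas' notion of "missing" with None and float NaN; the library's own check may cover more cases.

```python
import math
from typing import Any


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an approximation of pandas)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_required(values: list[Any]) -> bool:
    """A field is required only when no value in the column is missing."""
    return not any(is_missing(v) for v in values)
```

A single missing value is enough to flip the field to optional, which is why business-level requirements belong in overrides rather than in the training data.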
Text Field
text represents free-form textual input and is also the fallback kind for columns not claimed by a more specific builtin.
Minimal inferred field:
{
"kind": "text",
"label": "customer_name",
"required": true,
"mappedTo": "customer_name"
}
Extended field:
{
"kind": "text",
"label": "Email",
"required": true,
"mappedTo": "email",
"description": "Primary contact email.",
"minLength": 5,
"maxLength": 254,
"pattern": "^[^@]+@[^@]+\\.[^@]+$",
"placeholder": "user@example.com",
"defaultValue": "ada@example.com"
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| minLength | int | Minimum accepted length. |
| maxLength | int | Maximum accepted length. |
| pattern | str | Validation pattern. |
| placeholder | str | Input placeholder. |
Validation rejects inconsistent length bounds, and default values must satisfy declared text constraints.
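On the consumer side, the same constraints can be applied to submitted values. The sketch below is a hypothetical helper, not part of MLSchema; it assumes pattern is a regular expression that Python's re module can evaluate with search semantics, which may differ from the dialect a frontend uses.

```python
import re
from typing import Any


def text_errors(field: dict[str, Any], value: str) -> list[str]:
    """Return the names of the text constraints that the value violates."""
    errors = []
    min_len = field.get("minLength")
    max_len = field.get("maxLength")
    pattern = field.get("pattern")

    # Absent keys mean "not configured", so each check is skipped when unset.
    if min_len is not None and len(value) < min_len:
        errors.append("minLength")
    if max_len is not None and len(value) > max_len:
        errors.append("maxLength")
    if pattern is not None and re.search(pattern, value) is None:
        errors.append("pattern")
    return errors


email = {
    "kind": "text",
    "label": "Email",
    "required": True,
    "minLength": 5,
    "maxLength": 254,
    "pattern": r"^[^@]+@[^@]+\.[^@]+$",
}
```

With this field, the declared default "ada@example.com" produces no errors, which is the same condition the standard places on defaultValue.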
Number Field
number represents integer and floating-point numeric input.
Integer columns infer step: 1.
{
"kind": "number",
"label": "age",
"required": true,
"mappedTo": "age",
"step": 1
}
Float columns infer step: 0.1.
{
"kind": "number",
"label": "score",
"required": true,
"mappedTo": "score",
"step": 0.1
}
Extended field:
{
"kind": "number",
"label": "Revenue",
"required": true,
"mappedTo": "revenue",
"description": "Revenue reported for the selected period.",
"min": 0,
"max": 1000000,
"step": 0.01,
"unit": "EUR",
"placeholder": "Enter amount",
"defaultValue": 0
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| min | number | Minimum accepted value. |
| max | number | Maximum accepted value. |
| step | number | Increment used by numeric controls. |
| placeholder | str | Input placeholder. |
| unit | str | Unit displayed by consumers. |
Validation rejects min > max. Default values must be inside the declared range.
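Those two rules can be expressed as a short consistency check. The function below is a hypothetical stand-in for the Pydantic validation that MLSchema performs; it assumes the range is inclusive at both ends.

```python
from typing import Any


def check_number_field(field: dict[str, Any]) -> None:
    """Raise ValueError when a number field's bounds or default are inconsistent."""
    lo = field.get("min")
    hi = field.get("max")
    default = field.get("defaultValue")

    if lo is not None and hi is not None and lo > hi:
        raise ValueError(f"min ({lo}) is greater than max ({hi})")
    if default is not None:
        # Assumption: both bounds are inclusive.
        if lo is not None and default < lo:
            raise ValueError(f"defaultValue ({default}) is below min ({lo})")
        if hi is not None and default > hi:
            raise ValueError(f"defaultValue ({default}) is above max ({hi})")
```

The extended Revenue field above passes this check because its default of 0 sits on the lower bound.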
Category Field
category represents a closed set of options.
It is inferred from pandas categorical columns.
import pandas as pd

df = pd.DataFrame(
    {"tier": pd.Categorical(["pro", "free"], categories=["free", "pro"])}
)
Generated field:
{
"kind": "category",
"label": "tier",
"required": true,
"mappedTo": "tier",
"options": ["free", "pro"]
}
Extended field:
{
"kind": "category",
"label": "Plan",
"required": true,
"mappedTo": "tier",
"options": ["free", "pro"],
"defaultValue": "pro"
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| options | list | Accepted values. Must contain at least one item. |
defaultValue must be one of the declared options.
Category options are taken from the categorical dtype categories when available. A defensive fallback may use non-null unique values, but production schemas should prefer explicit pandas categorical dtypes because they preserve the intended option set and order.
OneHot Category Field
onehot-category represents several strict 0/1 model inputs as one category control.
Columns are grouped only when their names follow the named encoded-feature pattern feature__value. Pass onehot_separator to infer_schema() to use a different separator. The parent field does not include mappedTo; instead, each option carries its own mappedTo pointing at the original encoded column.
{
"kind": "onehot-category",
"label": "color",
"required": true,
"options": [
{ "label": "red", "value": "red", "mappedTo": "color__red" },
{ "label": "blue", "value": "blue", "mappedTo": "color__blue" }
]
}
For positional DataFrame columns, MLSchema does not infer a one-hot group from binary values alone. It emits ordinary fields with generated labels such as feature_0 and numeric mappedTo positions.
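Returning to the named case, a consumer eventually has to turn one selected option back into the 0/1 inputs the model expects. The sketch below shows one way to do that from options[].mappedTo; the helper name expand_onehot is hypothetical and not part of MLSchema.

```python
from typing import Any


def expand_onehot(field: dict[str, Any], selected: str) -> dict[str, int]:
    """Map one selected option to the 0/1 inputs named by options[].mappedTo."""
    known = {option["value"] for option in field["options"]}
    if selected not in known:
        raise ValueError(f"{selected!r} is not a declared option")
    # Exactly one encoded column receives 1; every other column receives 0.
    return {
        option["mappedTo"]: int(option["value"] == selected)
        for option in field["options"]
    }


color = {
    "kind": "onehot-category",
    "label": "color",
    "required": True,
    "options": [
        {"label": "red", "value": "red", "mappedTo": "color__red"},
        {"label": "blue", "value": "blue", "mappedTo": "color__blue"},
    ],
}
encoded = expand_onehot(color, "blue")
```

Because every option names its own target, the consumer never needs to know which separator produced the grouping.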
Boolean Field
boolean represents true/false input.
{
"kind": "boolean",
"label": "active",
"required": true,
"mappedTo": "active"
}
Extended field:
{
"kind": "boolean",
"label": "Enabled",
"required": true,
"mappedTo": "enabled",
"trueLabel": "Yes",
"falseLabel": "No",
"defaultValue": true
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| trueLabel | str | Label for the true state. |
| falseLabel | str | Label for the false state. |
Boolean fields support defaultValue through the base contract.
Date Field
date represents date-like input inferred from pandas datetime columns.
{
"kind": "date",
"label": "created",
"required": true,
"mappedTo": "created"
}
Extended field:
{
"kind": "date",
"label": "Created at",
"required": true,
"mappedTo": "created",
"min": "2024-01-01",
"max": "2024-12-31",
"step": 1,
"defaultValue": "2024-01-01"
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| min | str | Minimum accepted date string. |
| max | str | Maximum accepted date string. |
| step | positive integer | Step used by date controls. |
Validation rejects min > max. Default values must be inside the declared range.
The standard expects date strings to be serialised in a frontend-compatible format. ISO-style date strings are the recommended representation for cross-language consumers.
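When the bounds are ISO-style strings, a consumer can check a value against them with the standard library alone. This is a sketch under assumptions: the helper date_in_range is hypothetical, the strings are plain YYYY-MM-DD dates, and both bounds are treated as inclusive.

```python
from datetime import date
from typing import Any


def date_in_range(field: dict[str, Any], value: str) -> bool:
    """Check an ISO date string against optional min/max bounds (inclusive)."""
    day = date.fromisoformat(value)
    lo = field.get("min")
    hi = field.get("max")
    if lo is not None and day < date.fromisoformat(lo):
        return False
    if hi is not None and day > date.fromisoformat(hi):
        return False
    return True


created = {
    "kind": "date",
    "label": "Created at",
    "required": True,
    "min": "2024-01-01",
    "max": "2024-12-31",
}
```

Parsing before comparing avoids relying on string ordering, which only happens to work for zero-padded ISO dates.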
Series Field
series represents a two-axis value stored in a single DataFrame column. Typical examples are timestamp-value readings, coordinate pairs, or ordered measurement pairs.
A series column is detected by content rather than dtype. Non-null cells must all be 2-element tuples, 2-element lists, or 2-key dictionaries.
import pandas as pd

df = pd.DataFrame(
    {
        "reading": [
            (pd.Timestamp("2024-01-01"), 23.5),
            (pd.Timestamp("2024-01-02"), 24.1),
        ]
    }
)
Generated field:
{
"kind": "series",
"label": "reading",
"required": true,
"mappedTo": "reading",
"field1": {
"kind": "date",
"label": "field1",
"required": true,
"mappedTo": "reading"
},
"field2": {
"kind": "number",
"label": "field2",
"required": true,
"mappedTo": "reading",
"step": 0.1
}
}
Supported cell shapes:
| Shape | Example | Subfield labels |
|---|---|---|
| Tuple | (x, y) | field1, field2 |
| List | [x, y] | field1, field2 |
| Dict | {"timestamp": x, "value": y} | Dictionary keys converted to strings. |
Extended field:
{
"kind": "series",
"label": "Sensor reading",
"required": true,
"mappedTo": "reading",
"field1": {
"kind": "date",
"label": "field1",
"required": true,
"mappedTo": "reading"
},
"field2": {
"kind": "number",
"label": "field2",
"required": true,
"mappedTo": "reading",
"step": 0.1
},
"minPoints": 1,
"maxPoints": 100
}
Supported kind-specific attributes:
| Attribute | Type | Meaning |
|---|---|---|
| field1 | field object | First inferred subfield. |
| field2 | field object | Second inferred subfield. |
| minPoints | positive integer | Minimum accepted number of points. |
| maxPoints | positive integer | Maximum accepted number of points. |
Before subfield inference, object subseries may be coerced where possible. Python dates and datetimes become datetime series, parseable date strings become datetime series, and numeric-looking strings become numeric series. Other object values remain object values and continue through normal inference.
Nested series are rejected. A series field cannot contain another series field as field1 or field2.
Invalid pair shapes are not claimed as series. Empty series, null-only series, malformed tuples, malformed lists, malformed dictionaries, scalar strings, and ordinary object values continue through the remaining builders.
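The shape rules above can be approximated with a content check like the following. It is a sketch, not the library's detector: the helper names are hypothetical, None stands in for every kind of missing value, and columns that mix tuples, lists, and dictionaries are accepted here even though the standard does not say how MLSchema treats them.

```python
from typing import Any


def is_pair_cell(cell: Any) -> bool:
    """True for 2-element tuples, 2-element lists, and 2-key dictionaries."""
    if isinstance(cell, (tuple, list, dict)):
        return len(cell) == 2
    return False


def looks_like_series(cells: list[Any]) -> bool:
    """All non-null cells must be pairs, and at least one must exist."""
    non_null = [cell for cell in cells if cell is not None]
    return bool(non_null) and all(is_pair_cell(cell) for cell in non_null)
```

Two-character strings are not claimed, because the check looks at the container type before it looks at the length.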
Determinism And Serialisation
The same DataFrame and the same inference configuration should produce the same field list.
MLSchema preserves DataFrame column order, normalises dtype names before matching, applies builders in a fixed order, validates fields with their registered Pydantic model, and serialises with JSON-compatible output.
None values are omitted. This keeps generated schemas compact and avoids forcing frontend consumers to distinguish between “not configured” and explicit null.
A minimal field therefore remains minimal:
{
"kind": "text",
"label": "name",
"required": true,
"mappedTo": "name"
}
An enriched field only includes the attributes that are actually set:
{
"kind": "text",
"label": "Full name",
"required": true,
"mappedTo": "name",
"description": "Visible customer name.",
"minLength": 1,
"maxLength": 80,
"placeholder": "Ada Lovelace",
"defaultValue": "Ada"
}
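The effect of omitting None values can be reproduced with a plain dictionary filter. The sketch below stands in for the Pydantic serialisation step and is not MLSchema's code; in Pydantic v2 the comparable call would be model_dump(mode="json", exclude_none=True), though whether the library uses exactly that call is an assumption.

```python
import json
from typing import Any


def compact(field: dict[str, Any]) -> dict[str, Any]:
    """Drop keys whose value is None; keep falsy values such as 0 or False."""
    return {key: value for key, value in field.items() if value is not None}


raw = {
    "kind": "number",
    "label": "age",
    "required": False,
    "mappedTo": "age",
    "step": 1,
    "min": None,          # not configured, so it is omitted
    "defaultValue": 0,    # explicit default, so it is kept
    "description": None,
}
payload = json.dumps(compact(raw))
```

Filtering on `is not None` rather than on truthiness matters: a default of 0 or a required value of false must survive serialisation.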
Validation Model
Validation happens before the schema is returned.
The relevant validation layers are:
| Layer | Responsibility |
|---|---|
| DataFrame validation | Rejects empty DataFrames. |
| Kind registration | Rejects duplicate kind names and invalid custom kind models. |
| Builder validation | Rejects invalid builder return values, missing kinds, and unknown kinds. |
| Field model validation | Enforces Pydantic constraints for each field kind. |
| Override validation | Ensures patched fields still satisfy the target field model. |
This makes the generated schema suitable for frontend consumers that expect a stable contract. Broken fields fail during generation instead of failing later in rendering or submission.
Error Contract
MLSchema exposes library-level exceptions through mlschema.core.exceptions and re-exports them from mlschema.core.
| Error | Meaning |
|---|---|
| MLSchemaError | Base package exception. |
| InvalidValueError | Base class for configuration or user input violations. |
| FieldServiceError | Base runtime input inference error. |
| EmptyDataFrameError | The DataFrame has no rows or no columns. |
| FieldKindError | kind() received an invalid model, a model without kind, or a None kind default. |
| FieldBuilderError | A builder returned an invalid payload or omitted kind, no builder matched, or overrides targeted missing columns. |
| FieldKindAlreadyRegisteredError | Duplicate kind names were registered. |
| UnknownFieldKindError | A builder emitted a kind with no registered field model. |
| pydantic.ValidationError | A field payload violated its Pydantic model constraints. |
Applications that only need a broad failure boundary can catch MLSchemaError for library-level failures and pydantic.ValidationError for contract validation failures.
Compatibility Expectations
The MLSchema standard is intended to be consumed outside Python. Frontend consumers should treat the payload as a discriminated field list.
A consumer should:
- Preserve field order.
- Branch on kind.
- Treat unknown kinds as unsupported unless explicitly registered.
- Respect required, defaultValue, and kind-specific constraints.
- Ignore unknown attributes only after validating against a compatible schema version.
- Avoid relying on absent optional keys.
- Treat missing optional keys as “not configured”.
A consumer should not infer hidden semantics from labels or column order alone. The explicit field contract is the source of truth.
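Several of these expectations fit in a small dispatch table. The sketch below uses Python for consistency with the rest of this page, although real consumers are often written in other languages; the RENDERERS registry and render_all are hypothetical, and the renderers return strings where a real consumer would build widgets.

```python
from typing import Any, Callable

Renderer = Callable[[dict[str, Any]], str]

# Explicit registry: a kind is supported only if it appears here.
RENDERERS: dict[str, Renderer] = {
    "text": lambda f: f"text input '{f['label']}'",
    # A missing optional key is read as "not configured", never as an error.
    "number": lambda f: f"number input '{f['label']}' step={f.get('step', 'unset')}",
}


def render_all(fields: list[dict[str, Any]]) -> list[str]:
    """Render fields in payload order, rejecting kinds that are not registered."""
    rendered = []
    for field in fields:  # iteration preserves the emitted field order
        renderer = RENDERERS.get(field["kind"])
        if renderer is None:
            raise ValueError(f"unsupported kind: {field['kind']!r}")
        rendered.append(renderer(field))
    return rendered
```

Failing loudly on an unregistered kind keeps a consumer from silently dropping an input the model depends on.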
Practical Schema Design
Good schemas start with deliberate DataFrames.
Use pandas numeric dtypes for numeric fields, categorical dtypes for closed option sets, boolean dtypes for boolean controls, and datetime dtypes for date controls. Object columns are acceptable, but they are ambiguous and usually fall back to text.
Use overrides for final product decisions: labels, descriptions, bounds, defaults, placeholders, units, boolean labels, and point limits.
Use custom builders for reusable rules that still map to existing kinds.
Use custom kinds when a new frontend contract is required.
Keep custom kinds small and explicit. A new kind should exist because a consumer needs to render or validate it differently, not because a column needs a nicer label.
Do not depend on internal registry, service, or strategy classes. The supported contract is built around infer_schema(), optional builders, optional kinds, and optional overrides.
Boundary Of The Standard
MLSchema describes fields. It does not describe model weights, prediction responses, explanations, evaluation metrics, feature importance, transport protocols, authentication, storage, or UI layout.
Those concerns belong to the application layer.
This boundary is intentional. Keeping the standard limited to field contracts makes the payload stable, easy to validate, and simple to consume from Python, TypeScript, or any system that can read JSON.