The MLSchema Standard: An Extensible Contract for Machine Learning¶
This section explains the foundational design principles behind the MLSchema JSON format, how it achieves vendor-agnostic compatibility, and the deliberate constraints that enable safe extensibility.
1. Executive Overview¶
MLSchema is not merely a pandas → JSON converter. It establishes a data contract between machine learning models and consumer applications—ensuring that model inputs and outputs can be reliably described, validated, and rendered across heterogeneous systems.
Unlike ad-hoc form definitions or proprietary model serialization formats, MLSchema proposes a standard that is:
- Reproducible: The same DataFrame always produces identical schemas.
- Validated: Every field complies with domain-specific constraints (ranges, patterns, cardinality).
- Extensible: New field types can be registered without breaking existing consumers.
- Transport-agnostic: JSON serialization permits language-agnostic integration with web frameworks, microservices, and low-code platforms.
Why a Standard Matters¶
When a machine learning team deploys a model, downstream consumers must understand:
- Which inputs the model accepts.
- What types and ranges each input expects.
- How to present those inputs in a UI (text box, slider, dropdown, date picker).
- What the output schema looks like.
Without a formal specification, teams waste cycles reverse-engineering model contracts or hand-rolling form definitions. A standard eliminates that friction and unlocks reproducible, governable inference pipelines.
2. Core Principles¶
2.1 Strategy-Driven Architecture¶
MLSchema adopts the strategy pattern as its foundational design principle. Each pandas dtype is mapped to exactly one field type via a pluggable strategy:
pandas dtype (e.g., "int64")
↓
Strategy registry lookup
↓
Pydantic BaseField subclass
↓
JSON schema
Why strategies?
- Single Responsibility: Each strategy owns one problem domain (text encoding, numeric validation, categorical enumerations).
- Hot-Swap Extensibility: Register custom strategies without modifying core code.
- Forward Compatibility: Introduce domain-specific controls (geospatial, IoT widgets) as standalone strategies.
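The lookup step in the diagram can be sketched with the standard library alone. This is a minimal illustration of the strategy pattern, not MLSchema's implementation: SketchStrategy, SketchRegistry, and their methods are hypothetical names, and dtypes are represented as plain strings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SketchStrategy:
    """Hypothetical stand-in for a strategy: one kind, several dtype names."""
    type_name: str
    dtypes: tuple


@dataclass
class SketchRegistry:
    """Hypothetical registry mapping each dtype name to exactly one strategy."""
    by_dtype: dict = field(default_factory=dict)
    fallback: Optional[SketchStrategy] = None

    def register(self, strategy: SketchStrategy) -> None:
        for dtype in strategy.dtypes:
            self.by_dtype[dtype] = strategy

    def resolve(self, dtype_name: str) -> SketchStrategy:
        # Fall back only when no strategy claims the dtype.
        strategy = self.by_dtype.get(dtype_name, self.fallback)
        if strategy is None:
            raise LookupError(f"no strategy for dtype {dtype_name!r}")
        return strategy


registry = SketchRegistry()
registry.register(SketchStrategy("number", ("int64", "float64")))
registry.register(SketchStrategy("text", ("object", "string")))
kinds = [registry.resolve(d).type_name for d in ("int64", "object", "float64")]
```

Because consumers only ever call resolve(), a new strategy can be added by registering it, without touching the lookup code.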
2.2 Mandatory Field Attributes¶
Every field in an MLSchema schema carries reserved attributes defined by the BaseField Pydantic model and always present in generated output:
| Attribute | Type | Constraint | Example |
|---|---|---|---|
| label | str | 1–100 characters | "Age" |
| required | bool | Derived from nullability | true |
| kind | str | Strategy-specific | "text", "number" |
Optional attributes (omitted when None):
| Attribute | Type | Description |
|---|---|---|
| description | str \| None | Help text (max 500 chars) |
| disabled | bool \| None | Field is disabled |
| hidden | bool \| None | Field is hidden |
| readOnly | bool \| None | Field is read-only |
| disabledWhen | Any \| None | Declarative condition to disable the field |
| hiddenWhen | Any \| None | Declarative condition to hide the field |
| readOnlyWhen | Any \| None | Declarative condition to make read-only |
| asyncValidationDebounceMs | int \| None | Debounce in ms for async validation |
| inactiveFieldPolicy | "include" \| "omit" \| "reset-on-hide" | Behaviour when field becomes inactive |
| valuePath | str \| list[str] \| None | Key path for reading the value on submit |
| defaultValue | Any \| None | Initial value for the field |
| ui | dict[str, Any] \| None | Arbitrary UI-layer props |
These attributes are reserved. Custom strategies must not emit them via attributes_from_series().
2.3 Domain-Specific Extensions¶
Each strategy introduces optional attributes that refine the field contract:
- NumberField: defaultValue (inherited from BaseField), min, max, step, unit, placeholder
- TextField: defaultValue (inherited from BaseField), minLength, maxLength, pattern, placeholder
- CategoryField: defaultValue (inherited from BaseField), options
- BooleanField: defaultValue (inherited from BaseField), trueLabel, falseLabel
- DateField: defaultValue (inherited from BaseField), min, max, step
- SeriesField: field1, field2, minPoints, maxPoints
These extensions are not free-form; they are rigorously typed and validated by Pydantic models.
2.4 Deterministic Output¶
The same DataFrame always produces identical JSON. This is guaranteed by:
- Normalizing pandas dtypes (e.g., np.int64 → "int64").
- Preserving column order and names.
- Using Pydantic's model_dump() with consistent serialization settings (mode="json", exclude_none=True).
Deterministic output is critical for CI/CD pipelines, caching, and contract versioning.
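Why determinism helps caching and contract versioning can be shown with a stdlib-only sketch. It assumes a simplified payload built from (name, dtype) pairs; KIND_BY_DTYPE and serialize_schema are hypothetical names and do not reproduce MLSchema's internals.

```python
import hashlib
import json

# Hypothetical dtype-to-kind table; the real mapping lives in the strategies.
KIND_BY_DTYPE = {"int64": "number", "float64": "number", "object": "text"}


def serialize_schema(columns):
    """Serialize (name, dtype) pairs with fixed settings, keeping column order."""
    fields = [
        {"label": name, "kind": KIND_BY_DTYPE[str(dtype)], "required": True}
        for name, dtype in columns
    ]
    # No key sorting and fixed separators: output depends only on the input.
    return json.dumps({"fields": fields}, separators=(",", ":"))


columns = [("age", "int64"), ("name", "object")]
first = serialize_schema(columns)
second = serialize_schema(list(columns))

# A stable byte string gives a stable fingerprint, usable as a cache key
# or as a contract version identifier.
fingerprint = hashlib.sha256(first.encode("utf-8")).hexdigest()
```

Since column order is preserved rather than sorted, reordering the columns yields a different payload and therefore a different fingerprint.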
3. The MLSchema JSON Format¶
3.1 Canonical Structure¶
MLSchema generates JSON payloads with the following canonical shape:
{
  "fields": [
    {
      "label": "customer_name",
      "kind": "text",
      "required": true,
      "minLength": 1,
      "maxLength": 100,
      "placeholder": "Enter full name"
    },
    {
      "label": "satisfaction_score",
      "kind": "number",
      "required": true,
      "min": 0,
      "max": 100,
      "step": 1,
      "unit": "points"
    }
  ],
  "reports": [],
  "explanations": []
}
The top-level envelope (fields, reports, explanations) provides logical separation between model parameters, expected predictions, and explanation metadata.
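A consumer can check this shape before rendering anything. The following is a minimal sketch, assuming the payload has already been parsed into a Python dict; envelope_problems is a hypothetical helper, not part of MLSchema.

```python
REQUIRED_ENVELOPE_KEYS = ("fields", "reports", "explanations")
REQUIRED_FIELD_KEYS = ("label", "kind", "required")


def envelope_problems(payload: dict) -> list:
    """Return readable problems; an empty list means the shape looks right."""
    problems = []
    for key in REQUIRED_ENVELOPE_KEYS:
        if not isinstance(payload.get(key), list):
            problems.append(f"top-level '{key}' must be a list")
    fields = payload.get("fields")
    if isinstance(fields, list):
        for index, entry in enumerate(fields):
            if not isinstance(entry, dict):
                problems.append(f"fields[{index}] must be an object")
                continue
            for key in REQUIRED_FIELD_KEYS:
                if key not in entry:
                    problems.append(f"fields[{index}] is missing '{key}'")
    return problems


good = {
    "fields": [{"label": "age", "kind": "number", "required": True}],
    "reports": [],
    "explanations": [],
}
bad = {"fields": [{"label": "age"}], "reports": []}
```

Here envelope_problems(good) returns an empty list, while envelope_problems(bad) reports the missing explanations list and the missing kind and required keys.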
3.2 Field Type Taxonomy¶
MLSchema ships with six built-in field types:
Kind: text¶
{
  "kind": "text",
  "label": "email",
  "required": true,
  "minLength": 5,
  "maxLength": 254,
  "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
  "placeholder": "user@example.com"
}
Supported pandas dtypes: object, string
Kind: number¶
{
  "kind": "number",
  "label": "revenue",
  "required": true,
  "min": 0,
  "max": 1000000,
  "step": 0.01,
  "unit": "USD",
  "placeholder": "Enter amount"
}
Supported pandas dtypes: int64, float64, int32, float32
Kind: category¶
The options key is mandatory and is automatically derived from the DataFrame's unique categorical values.
{
  "kind": "category",
  "label": "customer_segment",
  "required": true,
  "options": ["Bronze", "Silver", "Gold"]
}
Supported pandas dtypes: category
Kind: boolean¶
{
  "kind": "boolean",
  "label": "is_active",
  "required": true,
  "trueLabel": "Yes",
  "falseLabel": "No"
}
Supported pandas dtypes: bool, boolean
Kind: date¶
min and max are ISO date strings (YYYY-MM-DD). Backend validation ensures min ≤ max via lexicographic comparison (valid for ISO format).
{
  "kind": "date",
  "label": "contract_renewal",
  "required": false,
  "min": "2024-01-01",
  "max": "2026-12-31",
  "step": 7
}
Supported pandas dtypes: datetime64[ns], datetime64
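The lexicographic comparison mentioned for min and max works because zero-padded YYYY-MM-DD strings sort in the same order as the dates they represent. A small stdlib check of that claim (iso_bounds_ordered is a hypothetical helper, not MLSchema's validator):

```python
from datetime import date


def iso_bounds_ordered(min_iso: str, max_iso: str) -> bool:
    """Compare zero-padded YYYY-MM-DD strings as plain text."""
    return min_iso <= max_iso


pairs = [
    ("2024-01-01", "2026-12-31"),
    ("2024-10-01", "2024-09-30"),
    ("2025-02-03", "2025-02-03"),
]

# For every pair, the string comparison agrees with a real date comparison.
agreement = [
    iso_bounds_ordered(lo, hi) == (date.fromisoformat(lo) <= date.fromisoformat(hi))
    for lo, hi in pairs
]
```

The equivalence depends on zero padding: an unpadded value such as "2024-9-30" compares as greater than "2024-10-01" even though the date is earlier, so inputs should be normalized to the ISO form first.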
Kind: series¶
Represents a two-axis column where each cell is a 2-element compound value. Sub-fields are inferred automatically from element dtypes; nesting series inside series is explicitly rejected.
{
  "kind": "series",
  "label": "readings",
  "required": true,
  "field1": {
    "kind": "date",
    "label": "field1",
    "required": true
  },
  "field2": {
    "kind": "number",
    "label": "field2",
    "required": true,
    "step": 0.1
  },
  "minPoints": 10,
  "maxPoints": 1000
}
Detection: Content-based (not dtype-based). SeriesStrategy claims any object column whose non-null cells are all 2-element tuples, lists, or dicts.
Supported cell formats:
| Format | Example | Sub-field labels |
|---|---|---|
| Tuple | (v1, v2) | field1, field2 |
| List | [v1, v2] | field1, field2 |
| Dict | {"k1": v1, "k2": v2} | dict keys |
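The content-based detection rule and the labelling rule from the table can be sketched on a plain list of cells. This is an illustration under stated assumptions, not SeriesStrategy's code: nulls are represented as None, an all-null column is treated as not matching, and both function names are hypothetical.

```python
def looks_like_series(cells: list) -> bool:
    """True when every non-null cell is a 2-element tuple, list, or dict."""
    non_null = [cell for cell in cells if cell is not None]
    if not non_null:
        return False  # assumption: an all-null column is not claimed
    return all(
        isinstance(cell, (tuple, list, dict)) and len(cell) == 2
        for cell in non_null
    )


def sub_field_labels(cell) -> tuple:
    """Dict cells contribute their keys; tuples and lists use field1/field2."""
    if isinstance(cell, dict):
        first, second = cell.keys()
        return str(first), str(second)
    return "field1", "field2"


readings = [("2024-01-01", 0.5), None, ["2024-01-02", 0.7]]
keyed = [{"t": "2024-01-01", "v": 0.5}]
```

Two-character strings are deliberately excluded: they have length 2 but are not tuples, lists, or dicts, so an ordinary text column is not mistaken for a series.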
Constraints:
| Constraint | Rule | Error |
|---|---|---|
| field1 / field2 not series | No nesting | PydanticCustomError("no_series_nesting") |
| Sub-field kind known | Must be registered via add_series_sub_field() | PydanticCustomError("unknown_sub_field_type") |
| minPoints / maxPoints ≥ 1 | PositiveInt | Pydantic validation error |
| minPoints ≤ maxPoints | Model validator | PydanticCustomError("series_points_constraint") |
3.3 Report Type Taxonomy¶
MLSchema ships with two built-in report types for describing model outputs:
Kind: regressor¶
{
  "kind": "regressor",
  "label": "Predicted price",
  "source": "model_output",
  "unit": "EUR",
  "precision": 2
}
| Attribute | Type | Description |
|---|---|---|
| unit | str \| None | Unit label (e.g. "€", "kg") |
| precision | int \| None | Decimal places shown (mlform default: 2) |
| explanations | bool \| None | Show feature-importance explanations |
Kind: classifier¶
{
  "kind": "classifier",
  "label": "Predicted class",
  "source": "model_output",
  "labels": ["cat", "dog", "bird"],
  "details": true
}
| Attribute | Type | Description |
|---|---|---|
| labels | list[str] \| None | Ordered class labels |
| details | bool \| None | Show per-class breakdown (mlform default: true) |
| explanations | bool \| None | Show feature-importance explanations |
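How a consumer might apply these report attributes is sketched below. The rendering rules are assumptions made for illustration (the unit is appended after a space, and the classifier output is a class index into labels); render_regressor and render_classifier are hypothetical and do not describe mlform's actual behaviour.

```python
def render_regressor(value: float, report: dict) -> str:
    """Format a numeric prediction using the report's precision and unit."""
    precision = report.get("precision")
    if precision is None:
        precision = 2  # mirrors the documented mlform default
    text = f"{value:.{precision}f}"
    unit = report.get("unit")
    return f"{text} {unit}" if unit else text


def render_classifier(class_index: int, report: dict) -> str:
    """Map a predicted class index to its label when labels are provided."""
    labels = report.get("labels")
    if labels is None:
        return str(class_index)
    return labels[class_index]


price_report = {"kind": "regressor", "unit": "EUR", "precision": 2}
class_report = {"kind": "classifier", "labels": ["cat", "dog", "bird"]}
price_text = render_regressor(1234.5678, price_report)
class_text = render_classifier(1, class_report)
```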
4. Design Decisions & Rationale¶
4.1 Why Pydantic?¶
Pydantic v2 provides:
- Type safety: Schemas are validated at construction time, not at serialization.
- Composability: Custom models inherit from BaseField, enabling incremental extension.
- Standard format: Pydantic models emit JSON in a deterministic, language-agnostic format.
- Validators: Embedded, reusable validation logic (e.g., min ≤ max, regex patterns).
4.2 Why a Literal Type Annotation?¶
The kind field in each Pydantic model uses Python's Literal type:
class NumberField(BaseField):
    kind: Literal[FieldTypes.NUMBER] = FieldTypes.NUMBER
This ensures:
- Type narrowing: IDEs and static analyzers can discriminate on the kind field.
- Exhaustiveness: Consumer code can enforce complete handling of all field types.
- No collisions: Only one schema matches a given kind string.
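On the consumer side, the same idea can be sketched with typing.Literal and TypedDict from the standard library. The dict types and widget names below are hypothetical and cover only two of the six kinds; a static type checker can narrow the union in each branch, and the final raise guards against unhandled kinds at runtime.

```python
from typing import Literal, TypedDict, Union


class TextFieldDict(TypedDict):
    kind: Literal["text"]
    label: str
    required: bool


class NumberFieldDict(TypedDict):
    kind: Literal["number"]
    label: str
    required: bool


AnyField = Union[TextFieldDict, NumberFieldDict]


def widget_for(field: AnyField) -> str:
    """Pick a UI control by discriminating on the literal kind value."""
    if field["kind"] == "text":
        return "text-box"
    if field["kind"] == "number":
        return "numeric-input"
    # Reached only if a new kind is added without updating this function.
    raise ValueError(f"unhandled kind: {field['kind']!r}")


email_widget = widget_for({"kind": "text", "label": "email", "required": True})
revenue_widget = widget_for({"kind": "number", "label": "revenue", "required": True})
```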
4.3 Why Reserved Keys?¶
The reserved keys (label, kind, required, description) are always populated by the base Strategy class. Custom strategies cannot override them via attributes_from_series(). This ensures:
- Predictability: Consumers know these keys will always be present and meaningful.
- Schema integrity: The contract is never violated by careless implementations.
- Versioning safety: Future MLSchema versions can extend reserved keys safely.
4.4 Why exclude_none=True?¶
Passing exclude_none=True to Pydantic's model_dump() drops every attribute whose value is None from the output:
{
  "kind": "text",
  "label": "email",
  "required": true,
  "minLength": 1
  // "placeholder": null is excluded
}
Optional attributes are omitted when not set, keeping payloads compact.
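The effect on a payload can be imitated with a few lines of plain Python. This is a stand-in for illustration only (drop_none is a hypothetical helper); in MLSchema the filtering is done by Pydantic during serialization.

```python
def drop_none(value):
    """Recursively remove dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(item) for item in value]
    return value


raw = {
    "kind": "text",
    "label": "email",
    "required": True,
    "minLength": 1,
    "placeholder": None,
    "ui": {"width": 6, "icon": None},
}
compact = drop_none(raw)
```

Consumers should therefore treat a missing key and an unset attribute as the same thing, for example by reading optional attributes with dict.get().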
5. Achieving Safe Extensibility¶
5.1 The Extensibility Contract¶
✅ Safe to extend:
- Register custom strategies for new pandas dtypes.
- Create custom Pydantic models that inherit from BaseField or BaseReport.
- Override attributes_from_series() to inject domain-specific metadata.
❌ Do not modify:
- The reserved attributes (label, kind, required, description).
- The core Strategy class API (build_dict(), dtypes, type_name).
- The shape of the top-level envelope ({"fields": [...], "reports": [...], "explanations": [...]}).
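One way to picture the reserved-attribute rule is a guard that refuses custom attributes which collide with reserved keys. This is a hedged sketch: merge_custom_attributes is hypothetical, and whether MLSchema raises, ignores, or overwrites such keys is not stated here, so raising is an assumption.

```python
RESERVED_KEYS = frozenset({"label", "kind", "required", "description"})


def merge_custom_attributes(base: dict, custom: dict) -> dict:
    """Combine base attributes with strategy-specific ones, rejecting clashes."""
    clashes = sorted(RESERVED_KEYS.intersection(custom))
    if clashes:
        raise ValueError(f"custom attributes may not set reserved keys: {clashes}")
    return {**base, **custom}


base = {"label": "store_location", "kind": "location", "required": True}
merged = merge_custom_attributes(base, {"latitude": 40.7128, "longitude": -74.006})
```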
5.2 Example: Custom Strategy for Geospatial Data¶
from typing import Literal

from pandas import Series

from mlschema import MLSchema  # import path assumed; the original snippet omits it
from mlschema.core import BaseField, Strategy


# 1️⃣ Define the Pydantic schema
class LocationField(BaseField):
    kind: Literal["location"] = "location"
    latitude: float | None = None
    longitude: float | None = None
    zoom: int = 10


# 2️⃣ Define the Strategy
class LocationStrategy(Strategy):
    def __init__(self) -> None:
        super().__init__(
            type_name="location",
            schema_cls=LocationField,
            dtypes=("object",),
        )

    def attributes_from_series(self, series: Series) -> dict:
        # extract_lat / extract_lon are user-supplied helpers (not defined here)
        # that pull one coordinate out of each cell.
        return {
            "latitude": series.apply(extract_lat).mean(),
            "longitude": series.apply(extract_lon).mean(),
        }


# 3️⃣ Register it
mls = MLSchema()
mls.register(LocationStrategy())
Resulting schema:
{
  "kind": "location",
  "label": "store_location",
  "required": true,
  "latitude": 40.7128,
  "longitude": -74.0060,
  "zoom": 10
}
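The example calls extract_lat and extract_lon without defining them. A possible shape for these helpers is sketched below, assuming each cell holds a (latitude, longitude) pair; both functions are hypothetical, and the averaging is done here with the standard library instead of pandas so the snippet runs on its own.

```python
from statistics import fmean


def extract_lat(cell) -> float:
    """Assumes each cell is a (latitude, longitude) pair."""
    return float(cell[0])


def extract_lon(cell) -> float:
    """Second element of the assumed (latitude, longitude) pair."""
    return float(cell[1])


cells = [(40.0, -74.0), (41.0, -75.0), (42.0, -73.0)]

# Mean coordinate of the column, analogous to series.apply(...).mean().
center = {
    "latitude": fmean([extract_lat(c) for c in cells]),
    "longitude": fmean([extract_lon(c) for c in cells]),
}
```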
6. Validation & Error Handling¶
6.1 Multi-Layer Validation¶
MLSchema enforces validation at three points:
- Strategy registration: Ensures no dtype collisions or duplicate type_names.
- DataFrame inspection: Confirms all columns carry supported dtypes; raises if fallback is missing.
- Pydantic instantiation: Validates field constraints (e.g., min ≤ max).
6.2 Exception Hierarchy¶
All library-level exceptions inherit from mlschema.core.MLSchemaError:
try:
    schema = mls.build(df)
except mlschema.core.MLSchemaError as e:
    log.error(f"Schema generation failed: {e}")
Specific exceptions:
- EmptyDataFrameError – DataFrame has no rows or columns.
- FallbackStrategyMissingError – Unsupported dtype with no fallback.
- StrategyNameAlreadyRegisteredError – Duplicate type_name on register().
- StrategyDtypeAlreadyRegisteredError – Duplicate dtype on register().
- ValidationError (Pydantic) – Field constraint violation.
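The registration-time checks and the shared base class can be illustrated with stand-in classes. The exception names below mirror the list above, but the classes are local stubs defined for this sketch, and GuardedRegistry is a hypothetical registry, not the library's.

```python
class MLSchemaError(Exception):
    """Local stand-in for the library's base exception."""


class StrategyNameAlreadyRegisteredError(MLSchemaError):
    """Local stand-in: a type_name was registered twice."""


class StrategyDtypeAlreadyRegisteredError(MLSchemaError):
    """Local stand-in: a dtype is already owned by another strategy."""


class GuardedRegistry:
    def __init__(self) -> None:
        self.names = set()
        self.dtype_owner = {}

    def register(self, type_name: str, dtypes: tuple) -> None:
        if type_name in self.names:
            raise StrategyNameAlreadyRegisteredError(type_name)
        for dtype in dtypes:
            if dtype in self.dtype_owner:
                raise StrategyDtypeAlreadyRegisteredError(
                    f"{dtype!r} already handled by {self.dtype_owner[dtype]!r}"
                )
        # Mutate state only after every check has passed.
        self.names.add(type_name)
        for dtype in dtypes:
            self.dtype_owner[dtype] = type_name


registry = GuardedRegistry()
registry.register("number", ("int64", "float64"))
caught = []
for name, dtypes in [("number", ("int32",)), ("amount", ("float64",))]:
    try:
        registry.register(name, dtypes)
    except MLSchemaError as exc:  # one handler covers both specific errors
        caught.append(type(exc).__name__)
```

Catching the base class is enough for logging, while the specific subclasses remain available when a caller needs to react differently to each collision.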
7. Best Practices for Schema Design¶
7.1 Pre-validation Checklist¶
Before calling mls.build(df):
- ✅ Confirm dtypes: Inspect df.dtypes to ensure columns carry the expected types.
- ✅ Handle nulls: Decide whether columns should be required: true or required: false.
- ✅ Register strategies: Only register the field types your application needs.
- ✅ Test edge cases: Empty DataFrames, single rows, all-null columns.
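Most of the checklist can be automated with a few pandas calls before the DataFrame is handed to mls.build(). The preflight function below is a hypothetical helper that only gathers facts; which findings should block schema generation is left to the caller.

```python
import pandas as pd


def preflight(df: pd.DataFrame) -> dict:
    """Summarize what the checklist asks about before schema generation."""
    return {
        "empty": df.empty,
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "has_nulls": {col: bool(df[col].isna().any()) for col in df.columns},
        "all_null": [col for col in df.columns if df[col].isna().all()],
    }


df = pd.DataFrame(
    {
        "age": pd.Series([31, 45, 28], dtype="int64"),
        "segment": pd.Categorical(["Gold", None, "Silver"]),
        "notes": pd.Series([None, None, None], dtype="object"),
    }
)
report = preflight(df)
```

In this sample, segment contains a null and notes is entirely null, which are the cases the checklist suggests reviewing before deciding on required: true or required: false.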
7.2 Custom Strategy Patterns¶
- Inherit from BaseField: Ensure your Pydantic model extends BaseField.
- Use Literal for kind: Set kind: Literal["custom_type"] = "custom_type".
- Override attributes_from_series() only: Do not override build_dict() or other base methods.
- Validate early: Use Pydantic @model_validator decorators to catch constraint violations at construction time.
- Respect reserved keys: Do not emit label, kind, required, or description from attributes_from_series().
8. Summary¶
MLSchema establishes a data contract that bridges machine learning models and consumer applications. By combining a strategy-driven architecture with rigorous Pydantic validation, the library achieves:
- Reproducibility: Same DataFrame → identical schema.
- Extensibility: Custom strategies plug in without core modifications.
- Type Safety: Literal annotations, Pydantic models, and static analysis.
- Simplicity: Sensible defaults; no configuration required for common dtypes.
Next: Refer to the Usage Guide to implement your first custom strategy, or see the API Reference for exhaustive method signatures.