SchemaFieldPath Specification (Version 2)
This document outlines the formal specification for the fieldPath member of the SchemaField model. This specification (version 2) takes into account the unique requirements of supporting a wide variety of nested types, unions and optional fields and is a substantial improvement over the current implementation (version 1).
Requirements
The fieldPath field is currently used by DataHub not just for rendering the schema fields in the UI, but also as a primary identifier of a field in other places such as EditableSchemaFieldInfo, usage stats and data profiles. Therefore, it must satisfy the following requirements.
- must be unique across all fields within a schema.
- make schema navigation in the UI more intuitive.
- allow for identifying the type of schema the field is part of, such as a key-schema or a value-schema.
- allow for future evolution.
Existing Convention (v1)
The existing convention is to simply use the field's name as the fieldPath for simple fields, and use the dot delimited names for nested fields. This scheme does not satisfy the requirements stated above. The following example illustrates where the uniqueness requirement is not satisfied.
Example: Ambiguous field path
Consider the following Avro schema which is a union of two record types A and B, each having a simple field with the same name f that is of type string. The v1 naming scheme cannot differentiate whether fieldPath=f refers to the record type A or B.
[
{
"type": "record",
"name": "A",
"fields": [{ "name": "f", "type": "string" } ]
}, {
"type": "record",
"name": "B",
"fields": [{ "name": "f", "type": "string" } ]
}
]
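The collision can be reproduced with a few lines of Python. This is only a sketch: v1_field_paths is a hypothetical helper written for this illustration, not the converter DataHub actually uses, and it handles just records and unions.

```python
import json
from collections import Counter

avro_schema = json.loads("""
[
  {"type": "record", "name": "A", "fields": [{"name": "f", "type": "string"}]},
  {"type": "record", "name": "B", "fields": [{"name": "f", "type": "string"}]}
]
""")


def v1_field_paths(schema, prefix=""):
    """Hypothetical helper: emit dot-delimited v1 paths, ignoring all type information."""
    paths = []
    if isinstance(schema, list):
        # Union: v1 has no way to record which member a field came from.
        for member in schema:
            paths.extend(v1_field_paths(member, prefix))
    elif isinstance(schema, dict) and schema.get("type") == "record":
        for field in schema["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            paths.append(path)
            paths.extend(v1_field_paths(field["type"], path))
    return paths


v1_paths = v1_field_paths(avro_schema)
# Any path that occurs more than once violates the uniqueness requirement.
duplicates = [path for path, count in Counter(v1_paths).items() if count > 1]
print(v1_paths, duplicates)
```

Both records contribute the same path, so the duplicates list is non-empty for this schema.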
The FieldPath encoding scheme (v2)
The syntax for V2 encoding of the fieldPath is captured in the following grammar. The FieldPathSpec is essentially the type annotated path of the member, with each token along the path representing one level of nested member, starting from the most-enclosing type, leading up to the member. In the case of unions that have one-of semantics, the corresponding field will be emitted once for each member of the union as its type, along with one path corresponding to the union itself.
Formal Spec:
<SchemaFieldPath> := <VersionToken>.<PartOfKeySchemaToken>.<FieldPathSpec> // when part of a key-schema
| <VersionToken>.<FieldPathSpec> // when part of a value schema
<VersionToken> := [version=<VersionId>] // [version=2.0] for v2
<PartOfKeySchemaToken> := [key=True] // when part of a key schema
<FieldPathSpec> := <FieldToken>+ // this is the type prefixed path field (nested if repeats).
<FieldToken> := <TypePrefixToken>.<name_of_the_field> // type prefixed path of a field.
<TypePrefixToken> := <NestedTypePrefixToken>.<SimpleTypeToken> | <SimpleTypeToken>
<NestedTypePrefixToken> := [type=<NestedType>]
<SimpleTypeToken> := [type=<SimpleType>]
<NestedType> := <name of a struct/record> | union | array | map
<SimpleType> := int | long | float | double | string | fixed | enum
For the example above, this encoding would produce the following 2 unique paths corresponding to the A.f and B.f fields.
unique_v2_field_paths = [
"[version=2.0].[type=union].[type=A].[type=string].f",
"[version=2.0].[type=union].[type=B].[type=string].f"
]
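As a rough illustration of how such a path might be consumed, the sketch below splits a v2 fieldPath into its version, key-schema flag and remaining tokens. The names ParsedFieldPath, split_tokens and parse_field_path are assumptions made for this example and are not part of any DataHub API.

```python
import re
from dataclasses import dataclass
from typing import List

# A bracketed token such as [version=2.0], [key=True] or [type=string].
_BRACKET_TOKEN = re.compile(r"\[(\w+)=([^\]]+)\]")


@dataclass
class ParsedFieldPath:
    version: str
    is_key_schema: bool
    segments: List[str]  # remaining tokens in order, e.g. "[type=A]" or "f"


def split_tokens(field_path: str) -> List[str]:
    """Split on '.', keeping bracketed tokens whole (they may contain dots, e.g. 2.0)."""
    return re.findall(r"\[[^\]]*\]|[^.\[\]]+", field_path)


def parse_field_path(field_path: str) -> ParsedFieldPath:
    tokens = split_tokens(field_path)
    match = _BRACKET_TOKEN.fullmatch(tokens[0]) if tokens else None
    if match is None or match.group(1) != "version":
        raise ValueError(f"not a versioned fieldPath: {field_path!r}")
    rest = tokens[1:]
    # The key-schema token, when present, directly follows the version token.
    is_key = bool(rest) and rest[0] == "[key=True]"
    if is_key:
        rest = rest[1:]
    return ParsedFieldPath(match.group(2), is_key, rest)


parsed = parse_field_path("[version=2.0].[type=union].[type=A].[type=string].f")
print(parsed)
```

A plain str.split(".") would not work here, because the version token [version=2.0] itself contains a dot; that is why the sketch tokenizes with a regular expression instead.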
NOTE:
- this encoding always ensures uniqueness within a schema since the full type annotation leading to a field is encoded in the fieldPath itself.
- processing a fieldPath, such as from the UI, reduces to walking each token along the path from left to right.
- adding PartOfKeySchemaToken allows for identifying if the field is part of key-schema.
- adding VersionToken allows for future evolvability.
- to represent optional fields, which sometimes are modeled as unions in formats like Avro, instead of treating it as a union member, set the nullable member of SchemaField to True.
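The last note can be sketched as follows. unwrap_optional is a hypothetical helper, and the handling of a union that contains null alongside several other members is an assumption of this sketch, since the text above does not cover that case.

```python
from typing import Any, Tuple


def unwrap_optional(avro_type: Any) -> Tuple[Any, bool]:
    """Return (effective_type, nullable) for a parsed Avro type.

    A union of "null" and exactly one other type is treated as an optional
    field rather than as a union; every other type is returned unchanged.
    """
    if isinstance(avro_type, list) and "null" in avro_type:
        non_null = [member for member in avro_type if member != "null"]
        if len(non_null) == 1:
            return non_null[0], True
    # Assumption: unions with several non-null members are left as-is here.
    return avro_type, False


print(unwrap_optional(["null", "string"]))
print(unwrap_optional("string"))
```

With this approach, a field typed ["null", "string"] would get a [type=string] token and nullable=True instead of a [type=union] token.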
Examples
Primitive types
avro_schema = """
{
"type": "string"
}
"""
unique_v2_field_paths = [
"[version=2.0].[type=string]"
]
Records
Simple Record
avro_schema = """
{
"type": "record",
"name": "some.event.E",
"namespace": "some.event.N",
"doc": "this is the event record E",
"fields": [
{
"name": "a",
"type": "string",
"doc": "this is string field a of E"
},
{
"name": "b",
"type": "string",
"doc": "this is string field b of E"
}
]
}
"""
unique_v2_field_paths = [
"[version=2.0].[type=E].[type=string].a",
"[version=2.0].[type=E].[type=string].b",
]
Nested Record
avro_schema = """
{
"type": "record",
"name": "SimpleNested",
"namespace": "com.linkedin",
"fields": [{
"name": "nestedRcd",
"type": {
"type": "record",
"name": "InnerRcd",
"fields": [{
"name": "aStringField",
"type": "string"
} ]
}
}]
}
"""
unique_v2_field_paths = [
"[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd",
"[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd.[type=string].aStringField",
]
Recursive Record
avro_schema = """
{
"type": "record",
"name": "Recursive",
"namespace": "com.linkedin",
"fields": [{
"name": "r",
"type": {
"type": "record",
"name": "R",
"fields": [
{ "name" : "anIntegerField", "type" : "int" },
{ "name": "aRecursiveField", "type": "com.linkedin.R"}
]
}
}]
}
"""
unique_v2_field_paths = [
"[version=2.0].[type=Recursive].[type=R].r",
"[version=2.0].[type=Recursive].[type=R].r.[type=int].anIntegerField",
"[version=2.0].[type=Recursive].[type=R].r.[type=R].aRecursiveField"
]
avro_schema = """
{
"type": "record",
"name": "TreeNode",
"fields": [
{
"name": "value",
"type": "long"
},
{
"name": "children",
"type": { "type": "array", "items": "TreeNode" }
}
]
}
"""
unique_v2_field_paths = [
"[version=2.0].[type=TreeNode].[type=long].value",
"[version=2.0].[type=TreeNode].[type=array].[type=TreeNode].children",
]
Unions
avro_schema = """
{
"type": "record",
"name": "ABUnion",
"namespace": "com.linkedin",
"fields": [{
"name": "a",
"type": [{
"type": "record",
"name": "A",
"fields": [{ "name": "f", "type": "string" } ]
}, {
"type": "record",
"name": "B",
"fields": [{ "name": "f", "type": "string" } ]
}
]
}]
}
"""
unique_v2_field_paths: List[str] = [
"[version=2.0].[key=True].[type=ABUnion].[type=union].a",
"[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a",
"[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a.[type=string].f",
"[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a",
"[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a.[type=string].f",
]
Arrays
avro_schema = """
{
"type": "record",
"name": "NestedArray",
"namespace": "com.linkedin",
"fields": [{
"name": "ar",
"type": {
"type": "array",
"items": {
"type": "array",
"items": [
"null",
{
"type": "record",
"name": "Foo",
"fields": [ {
"name": "a",
"type": "long"
} ]
}
]
}
}
}]
}
"""
unique_v2_field_paths: List[str] = [
"[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar",
"[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar.[type=long].a",
]
Maps
avro_schema = """
{
"type": "record",
"name": "R",
"namespace": "some.namespace",
"fields": [
{
"name": "a_map_of_longs_field",
"type": {
"type": "map",
"values": "long"
}
}
]
}
"""
unique_v2_field_paths = [
"[version=2.0].[type=R].[type=map].[type=long].a_map_of_longs_field",
]
Mixed Complex Type Examples
# Combines arrays, unions and records.
avro_schema = """
{
"type": "record",
"name": "ABFooUnion",
"namespace": "com.linkedin",
"fields": [{
"name": "a",
"type": [ {
"type": "record",
"name": "A",
"fields": [{ "name": "f", "type": "string" } ]
}, {
"type": "record",
"name": "B",
"fields": [{ "name": "f", "type": "string" } ]
}, {
"type": "array",
"items": {
"type": "array",
"items": [
"null",
{
"type": "record",
"name": "Foo",
"fields": [{ "name": "f", "type": "long" }]
}
]
}
}]
}]
}
"""
unique_v2_field_paths: List[str] = [
"[version=2.0].[type=ABFooUnion].[type=union].a",
"[version=2.0].[type=ABFooUnion].[type=union].[type=A].a",
"[version=2.0].[type=ABFooUnion].[type=union].[type=A].a.[type=string].f",
"[version=2.0].[type=ABFooUnion].[type=union].[type=B].a",
"[version=2.0].[type=ABFooUnion].[type=union].[type=B].a.[type=string].f",
"[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a",
"[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a.[type=long].f",
]
For more examples, see the unit-tests for AvroToMceSchemaConverter.
Backward-compatibility
While this format is not directly compatible with the v1 format, the v1 equivalent can easily be constructed from the v2 encoding by stripping away all the v2 tokens enclosed in the square-brackets [<new_in_v2>].
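A minimal sketch of that conversion is shown below. The function name v2_to_v1 is an assumption for illustration; the only logic taken from the text is that every square-bracketed token is dropped and the remaining names are joined with dots.

```python
import re


def v2_to_v1(field_path: str) -> str:
    """Drop every [bracketed] v2 token and join the remaining field names with dots."""
    # Tokenize first so that the dot inside [version=2.0] is not treated as a separator.
    tokens = re.findall(r"\[[^\]]*\]|[^.\[\]]+", field_path)
    return ".".join(token for token in tokens if not token.startswith("["))


v1_path = v2_to_v1("[version=2.0].[type=ABUnion].[type=union].[type=A].a.[type=string].f")
print(v1_path)
```

Note that the conversion is lossy in one direction only: the A and B variants of the union example both map back to the same v1 path, which is exactly the ambiguity the v2 encoding was introduced to remove.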