
doc(schema): schema doc updates

Signed-off-by: Jiyong Huang <huangjy@emqx.io>
Jiyong Huang · 2 years ago · parent commit 2eef9a9e31

File diff suppressed because it is too large
+ 3 - 1
docs/en_US/concepts/sources/overview.md


+ 12 - 0
docs/en_US/operation/restapi/schemas.md

@@ -26,12 +26,24 @@ Schema content in a file:
 }
 ```
 
+Schema with static plugin:
+
+```json
+{
+  "name": "schema2",
+  "file": "file:///tmp/ekuiper/internal/schema/test/test2.proto",
+  "soFile": "file:///tmp/ekuiper/internal/schema/test/so.so"
+}
+```
+
+
 ### Parameters
 
 1. name: the unique name of the schema.
 2. schema content: use the `file` or `content` parameter to specify it. After the schema is created, the schema content will be written into the file `data/schemas/$schema_type/$schema_name`.
    - file: the URL of the schema file. The URL can use the `http`, `https`, or `file` scheme; the `file` scheme refers to a local file path on the eKuiper server. The schema file must match the schema type. For example, a protobuf schema file's extension must be .proto.
    - content: the text content of the schema.
+3. soFile: the .so file of the static plugin. For details about plugin creation, please check [format extension](../../rules/codecs.md#format-extension).
 
 ## Show schemas
 

+ 41 - 0
docs/en_US/operation/restapi/streams.md

@@ -65,6 +65,47 @@ Response Sample:
 }
 ```
 
+## Get stream schema
+
+The API is used to get the schema of the stream. The schema is inferred by merging the physical and logical schema definitions.
+
+```shell
+GET http://localhost:9081/streams/{id}/schema
+```
+
+The response format is similar to JSON schema:
+
+```json
+{
+    "id": {
+        "type": "bigint"
+    },
+    "name": {
+        "type": "string"
+    },
+    "age": {
+        "type": "bigint"
+    },
+    "hobbies": {
+        "type": "struct",
+        "properties": {
+          "indoor": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            }
+          },
+          "outdoor": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            }
+          }
+        }
+    }
+}
+```
+
 ## Update a stream
 
 The API is used to update the stream definition.

+ 9 - 0
docs/en_US/operation/restapi/tables.md

@@ -71,6 +71,15 @@ Response Sample:
 }
 ```
 
+## Get table schema
+
+The API is used to get the schema of the table. The schema is inferred by merging the physical and logical schema definitions.
+
+```shell
+GET http://localhost:9081/tables/{id}/schema
+```
+
+
 ## Update a table
 
 The API is used to update the table definition.

+ 82 - 2
docs/en_US/rules/codecs.md

@@ -4,7 +4,7 @@ The eKuiper uses a map based data structure internally during computation, so so
 
 ## Format
 
-There are two types of formats for codecs: schema and schema-less formats. The formats currently supported by eKuiper are `json`, `binary` and `protobuf`. Among them, `protobuf` is the schema format.
+There are two types of formats for codecs: schema and schema-less formats. The formats currently supported by eKuiper are `json`, `binary`, `protobuf` and `custom`. Among them, `protobuf` and `custom` are schema formats.
 The schema format requires registering the schema first, and then setting the referenced schema along with the format. For example, when using mqtt sink, the format and schema can be configured as follows
 
 ```json
@@ -18,9 +18,89 @@ The schema format requires registering the schema first, and then setting the re
 }
 ```
 
+All formats provide codec capabilities and, optionally, a schema definition. The codec computation can be built in, such as JSON parsing; it can parse a schema dynamically, as Protobuf does with `*.proto` files; or a user-defined static plugin (`*.so`) can be used for parsing. Among these, static parsing has the best performance, but it requires writing additional code and compiling it into a plugin, which makes changes harder. Dynamic parsing is more flexible.
+
+All currently supported formats, together with their codec methods and schema support, are shown in the following table.
+
+
+| Format   | Codec        | Custom Codec           | Schema                 |
+|----------|--------------|------------------------|------------------------|
+| json     | Built-in     | Unsupported            | Unsupported            |
+| binary   | Built-in     | Unsupported            | Unsupported            |
+| protobuf | Built-in     | Supported              | Supported and required |
+| custom   | Not Built-in | Supported and required | Supported and optional |
+
+### Format Extension
+
+When using the `custom` format or the `protobuf` format, the user can customize the codec and schema as a Go language plugin. Note that `protobuf` only supports a custom codec; its schema must be defined in a `*.proto` file. The steps to customize a format are as follows:
+
+1. Implement the codec-related interfaces. The Encode function encodes the incoming data (currently always a `map[string]interface{}`) into a byte array, while the Decode function decodes a byte array into a `map[string]interface{}`. The decode function is called in the source; the encode function is called in the sink.
+    ```go
+    // Converter converts bytes & map or []map according to the schema
+    type Converter interface {
+        Encode(d interface{}) ([]byte, error)
+        Decode(b []byte) (interface{}, error)
+    }
+    ```
+2. Implement the schema description interface (optional for the `custom` format). If the custom format is strongly typed, this interface can be implemented. It returns a JSON-schema-like string for the source to use. The returned data structure is used as the physical schema, which helps eKuiper implement capabilities such as SQL validation and optimization during the parse and load phases.
+    ```go
+    type SchemaProvider interface {
+	    GetSchemaJson() string
+    }
+    ```
+3. Compile the plugin into an so file. Usually, format extensions do not need to depend on the main eKuiper project. However, due to the limitations of the Go plugin system, the plugin must be compiled in the same environment as the main eKuiper program, including the same operating system, Go version, etc. If you need to deploy to the official Docker image, you can use the corresponding Docker image for compilation.
+    ```shell
+    go build -trimpath --buildmode=plugin -o data/test/myFormat.so internal/converter/custom/test/*.go
+    ```
+4. Register the schema by REST API.
+    ```shell
+    ###
+    POST http://{{host}}/schemas/custom
+    Content-Type: application/json
+    
+    {
+      "name": "custom1",
+      "soFile": "file:///tmp/custom1.so"
+    }
+    ```
+5. Use the custom format in a source or sink with the `format` and `schemaId` parameters.
+
+A complete custom format example can be found in [myFormat.go](https://github.com/lf-edge/ekuiper/blob/master/internal/converter/custom/test/myformat.go). It defines a simple custom format whose codec just calls JSON serialization, and it returns a schema that eKuiper can use to infer the data structure of the source.
+
+### Static Protobuf
+
+When using the Protobuf format, both dynamic and static parsing are supported. With dynamic parsing, the user only needs to specify the proto file when registering the schema. When parsing performance matters more, the user can adopt static parsing instead. Static parsing requires developing a parsing plugin, as follows:
+
+1. Assume we have a proto file helloworld.proto. Use the official protoc tool to generate Go code. Check the [Protocol Buffer Doc](https://developers.google.com/protocol-buffers/docs/reference/go-generated) for details.
+   ```shell
+   protoc --go_opt=Mhelloworld.proto=com.main --go_out=. helloworld.proto
+   ```
+2. Move the generated code helloworld.pb.go into a Go language project and rename its package to main.
+3. Create a wrapper struct for each message type and implement three methods: `Encode`, `Decode`, and `GetXXX`. Encoding and decoding mainly convert between the message struct and map types. Note that, to ensure performance, reflection should not be used.
+4. Compile the plugin into an so file. Usually, format extensions do not need to depend on the main eKuiper project. However, due to the limitations of the Go plugin system, the plugin must be compiled in the same environment as the main eKuiper program, including the same operating system, Go version, etc. If you need to deploy to the official Docker image, you can use the corresponding Docker image for compilation.
+   ```shell
+    go build -trimpath --buildmode=plugin -o data/test/helloworld.so internal/converter/protobuf/test/*.go
+   ```
+5. Register the schema by REST API. Note that both the proto file and the so file are required.
+    ```shell
+    ###
+    POST http://{{host}}/schemas/protobuf
+    Content-Type: application/json
+    
+    {
+      "name": "helloworld",
+      "file": "file:///tmp/helloworld.proto",
+      "soFile": "file:///tmp/helloworld.so"
+    }
+    ```
+6. Use the format in a source or sink with the `format` and `schemaId` parameters.
+
+The complete static protobuf plugin can be found in [helloworld protobuf](https://github.com/lf-edge/ekuiper/tree/master/internal/converter/protobuf/test).
+
+
 ## Schema
 
-A schema is a set of metadata that defines the data structure. For example, the .proto file is used in the Protobuf format as the data format for schema definition transfers. Currently, eKuiper supports only one schema type Protobuf.
+A schema is a set of metadata that defines the data structure. For example, the Protobuf format uses .proto files to define the structure of the transferred data. Currently, eKuiper supports two schema types: protobuf and custom.
 
 ### Schema Registry
 

+ 14 - 14
docs/en_US/sqls/streams.md

@@ -92,22 +92,22 @@ demo (
 	) WITH (DATASOURCE="test", FORMAT="JSON", KEY="USERID", SHARED="true");
 ```
 
+## Schema
+
+The schema of a stream has two parts. One is the data structure given in the data source definition, i.e. the logical schema; the other is the schema referenced by the SchemaId specified when using a strongly typed data format such as Protobuf or Custom, i.e. the physical schema.
+
+Overall, we will support three progressive levels of schema:
+
+1. Schemaless: the user does not define any schema. This is mainly used for weakly structured data streams, or for data whose structure changes frequently.
+2. Logical schema only: the user defines the schema at the source level. This is mostly used with weakly typed encodings such as the JSON format, when the data has a fixed or roughly fixed structure but the user does not want a strongly typed codec. In this case, the StrictValidation parameter configures whether to perform data validation and conversion.
+3. Physical schema: the user uses the protobuf or custom format and specifies a schemaId. Validation of the data structure is then done by the format implementation.
+
+Both the logical and physical schema definitions are used for SQL syntax validation during the parse and load phases of rule creation, as well as for runtime optimization. The inferred schema of a stream can be obtained via the [Schema API](../operation/restapi/streams.md#get-stream-schema).
+
+
 ### Strict Validation
 
-```
-The value of StrictValidation can be true or false.
-1) True: Drop the message if the message  is not satisfy with the stream definition.
-2) False: Keep the message, but fill the missing field with default empty value.
-
-bigint: 0
-float: 0.0
-string: ""
-datetime: the current time
-boolean: false
-bytea: nil
-array: zero length array
-struct: null value
-```
+Applies only to streams with a logical schema. If strict validation is enabled, the rule verifies the existence of each field and validates its type against the schema. If the data is known to be well formed, it is recommended to turn validation off.
 
 ### Schema-less stream
 If the data type of the stream is unknown or varying, we can define it without the fields. This is called schema-less. It is defined by leaving the fields empty.

File diff suppressed because it is too large
+ 3 - 1
docs/zh_CN/concepts/sources/overview.md


+ 11 - 0
docs/zh_CN/operation/restapi/schemas.md

@@ -26,12 +26,23 @@ POST http://localhost:9081/schemas/protobuf
 }
 ```
 
+模式包含静态插件示例:
+
+```json
+{
+  "name": "schema2",
+  "file": "file:///tmp/ekuiper/internal/schema/test/test2.proto",
+  "soFile": "file:///tmp/ekuiper/internal/schema/test/so.so"
+}
+```
+
 ### 参数
 
 1. name:模式的唯一名称。
 2. 模式的内容,可选用 file 或 content 参数来指定。模式创建后,模式内容将写入 `data/schemas/$schema_type/$schema_name` 文件中。
    - file:模式文件的 URL。URL 支持 http 和 https 以及 file 模式。当使用 file 模式时,该文件必须在 eKuiper 服务器所在的机器上。它必须是模式类型对应的格式。例如 protobuf 模式的文件扩展名应为 .proto。
    - content:模式文件的内容。
+3. soFile:静态插件的 so 文件。插件创建详情请看[格式扩展](../../rules/codecs.md#格式扩展)。
 
 ## 显示模式
 

+ 42 - 0
docs/zh_CN/operation/restapi/streams.md

@@ -66,6 +66,48 @@ GET http://localhost:9081/streams/{id}}
 }
 ```
 
+## 获取数据结构
+
+该 API 用于获取流的数据结构,该数据结构为合并物理 schema 和逻辑 schema 后推断出的实际定义结构。
+
+```shell
+GET http://localhost:9081/streams/{id}/schema
+```
+
+数据格式为类 Json Schema 的结构。示例如下:
+
+```json
+{
+    "id": {
+        "type": "bigint"
+    },
+    "name": {
+        "type": "string"
+    },
+    "age": {
+        "type": "bigint"
+    },
+    "hobbies": {
+        "type": "struct",
+        "properties": {
+          "indoor": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            }
+          },
+          "outdoor": {
+            "type": "array",
+            "items": {
+              "type": "string"
+            }
+          }
+        }
+    }
+}
+```
+
+
 ## 更新流
 
 该 API 用于更新流定义。

+ 8 - 0
docs/zh_CN/operation/restapi/tables.md

@@ -65,6 +65,14 @@ GET http://localhost:9081/tables/{id}}
 }
 ```
 
+## 获取数据结构
+
+该 API 用于获取表的数据结构,该数据结构为合并物理 schema 和逻辑 schema 后推断出的实际定义结构。
+
+```shell
+GET http://localhost:9081/tables/{id}/schema
+```
+
 ## 更新表
 
 该 API 用于更新表的定义。

+ 84 - 4
docs/zh_CN/rules/codecs.md

@@ -4,7 +4,7 @@ eKuiper 计算过程中使用的是基于 Map 的数据结构,因此 source/si
 
 ## 格式
 
-编解码的格式分为两种:有模式和无模式的格式。当前 eKuiper 支持的格式有 `json`, `binary` 和 `protobuf`。其中,`protobuf` 为有模式的格式。
+编解码的格式分为两种:有模式和无模式的格式。当前 eKuiper 支持的格式有 `json`, `binary`, `protobuf` 和 `custom`。其中,`protobuf` 和 `custom` 为有模式的格式。
 有模式的格式需要先注册模式,然后在设置格式的同时,设置引用的模式。例如,在使用 mqtt sink 时,可配置格式和模式:
 
 ```json
@@ -18,9 +18,90 @@ eKuiper 计算过程中使用的是基于 Map 的数据结构,因此 source/si
 }
 ```
 
+所有格式都提供了编解码的能力,同时也可选地提供了数据结构的定义,即模式。编解码的计算可内置,如 JSON 解析;可动态解析模式进行编解码,如 Protobuf 解析 `*.proto` 文件;也可使用用户自定义的静态插件(`*.so`)进行解析。其中,静态解析的性能最好,但是需要另外编写代码并编译成插件,变更较为困难。动态解析使用更为灵活。
+
+当前所有支持的格式,及其支持的编解码方法和模式如下表所示:
+
+| 格式       | 编解码 | 自定义编解码 | 模式    |
+|----------|-----|--------|-------|
+| json     | 内置  | 不支持    | 不支持   |
+| binary   | 内置  | 不支持    | 不支持   |
+| protobuf | 内置  | 支持     | 支持且必需 |
+| custom   | 无内置 | 支持且必需  | 支持且可选 |
+
+
+### 格式扩展
+
+当用户使用 `custom` 格式或者 `protobuf` 格式时,可采用 go 语言插件的形式自定义格式的编解码和模式。其中,`protobuf` 仅支持自定义编解码,模式需要通过 `*.proto` 文件定义。自定义格式的步骤如下:
+
+1. 实现编解码相关接口。其中,Encode 编码函数将传入的数据(当前总是为 map[string]interface{}) 编码为字节数组。而 Decode 解码函数则相反,将字节数组解码为 map[string]interface{}。解码函数在 source 中被调用,而编码函数将在 sink 中调用。
+    ```go
+    // Converter converts bytes & map or []map according to the schema
+    type Converter interface {
+        Encode(d interface{}) ([]byte, error)
+        Decode(b []byte) (interface{}, error)
+    }
+    ```
+2. 实现数据结构描述接口(格式为 custom 时可选)。若自定义的格式为强类型,则可实现该接口。接口返回一个类 JSON schema 的字符串,供 source 使用。返回的数据结构将作为一个物理 schema 使用,帮助 eKuiper 实现编译解析阶段的 SQL 验证和优化等能力。
+    ```go
+    type SchemaProvider interface {
+	    GetSchemaJson() string
+    }
+    ```
+3. 编译为插件 so 文件。通常格式的扩展无需依赖 eKuiper 的主项目。由于 Go 语言插件系统的限制,插件的编译仍然需要在与 eKuiper 主程序相同的编译环境中进行,包括相同的操作系统、Go 语言版本等。若需要部署到官方 docker 中,则可使用对应的 docker 镜像进行编译。
+    ```shell
+    go build -trimpath --buildmode=plugin -o data/test/myFormat.so internal/converter/custom/test/*.go
+    ```
+4. 通过 REST API 进行模式注册。
+    ```shell
+    ###
+    POST http://{{host}}/schemas/custom
+    Content-Type: application/json
+    
+    {
+      "name": "custom1",
+      "soFile": "file:///tmp/custom1.so"
+    }
+    ```
+5. 在 source 或者 sink 中,通过 `format` 和 `schemaId` 参数使用自定义格式。
+
+完整的自定义格式可参考 [myFormat.go](https://github.com/lf-edge/ekuiper/blob/master/internal/converter/custom/test/myformat.go)。该文件定义了一个简单的自定义格式,编解码实际上仅调用 JSON 进行序列化。它返回了一个数据结构,可用于 eKuiper source 的数据结构推断。
+
+### 静态 Protobuf
+
+使用 Protobuf 格式时,我们支持动态解析和静态解析两种方式。使用动态解析时,用户仅需要在注册模式时指定 proto 文件。在解析性能要求更高的条件下,用户可采用静态解析的方式。静态解析需要开发解析插件,其步骤如下:
+
+1. 已有 proto 文件 helloworld.proto, 使用官方 protoc 工具生成 go 代码。详情参见[ Protocol Buffer 文档](https://developers.google.com/protocol-buffers/docs/reference/go-generated)。
+   ```shell
+   protoc --go_opt=Mhelloworld.proto=com.main --go_out=. helloworld.proto
+   ```
+2. 将生成的代码 helloworld.pb.go 移动到 go 语言项目(此处名为 test)中,包名重命名为 main 。
+3. 创建包装类。对于每个消息类型,实现 3 个方法 `Encode`, `Decode`, `GetXXX`。编解码中主要是进行消息的 struct 与 map 类型的转换。需要注意的是,为了保证性能,不要使用反射。
+4. 编译为插件 so 文件。通常格式的扩展无需依赖 eKuiper 的主项目。由于 Go 语言插件系统的限制,插件的编译仍然需要在与 eKuiper 主程序相同的编译环境中进行,包括相同的操作系统、Go 语言版本等。若需要部署到官方 docker 中,则可使用对应的 docker 镜像进行编译。
+   ```shell
+    go build -trimpath --buildmode=plugin -o data/test/helloworld.so internal/converter/protobuf/test/*.go
+   ```
+5. 通过 REST API 进行模式注册。需要注意的是,proto 文件和 so 文件都需要指定。
+    ```shell
+    ###
+    POST http://{{host}}/schemas/protobuf
+    Content-Type: application/json
+    
+    {
+      "name": "helloworld",
+      "file": "file:///tmp/helloworld.proto",
+      "soFile": "file:///tmp/helloworld.so"
+    }
+    ```
+6. 在 source 或者 sink 中,通过 `format` 和 `schemaId` 参数使用自定义格式。
+
+完整的静态 protobuf 插件可参考 [helloworld protobuf](https://github.com/lf-edge/ekuiper/tree/master/internal/converter/protobuf/test)。
+
+
 ## 模式
 
-模式是一套元数据,用于定义数据结构。例如,Protobuf 格式中使用 .proto 文件作为模式定义传输的数据格式。目前,eKuiper 仅支持 Protobuf 这一种模式。
+模式是一套元数据,用于定义数据结构。例如,Protobuf 格式中使用 .proto 文件定义传输数据的结构。目前,eKuiper 支持 protobuf 和 custom 两种模式。
+
 
 ### 模式注册
 
@@ -33,5 +114,4 @@ eKuiper 启动时,将会扫描该配置文件夹并自动注册里面的模式
 用户可使用模式注册表 API 在运行时对模式进行增删改查。详情请参考:
 
 - [模式注册表 REST API](../operation/restapi/schemas.md)
-- [模式注册表 CLI](../operation/cli/schemas.md)
-
+- [模式注册表 CLI](../operation/cli/schemas.md)

+ 13 - 13
docs/zh_CN/sqls/streams.md

@@ -94,21 +94,21 @@ demo (
 	) WITH (DATASOURCE="test", FORMAT="JSON", KEY="USERID", SHARED="true");
 ```
 
+## 数据结构
+
+流的数据结构(schema) 包含两个部分。一个是在数据源定义中定义的数据结构,即逻辑数据结构;另一个是在使用强类型数据格式时指定的 SchemaId 即物理数据结构,例如 Protobuf 和 Custom 格式定义的数据结构。
+
+整体上,我们将支持3种递进的数据结构方式:
+
+1. Schemaless,用户无需定义任何形式的 schema,主要用于弱结构化数据流,或数据结构经常变化的情况。
+2. 仅逻辑结构,用户在 source 层定义 schema,多用于弱类型的编码方式,例如最常用的 JSON。适用于用户的数据有固定或大致固定的格式,同时不想使用强类型的数据编解码格式的情况。使用这种方式时,可通过 StrictValidation 参数配置是否进行数据验证和转换。
+3. 物理结构,用户使用 protobuf 或者 custom 格式,并定义 schemaId。此时,数据结构的验证将由格式来实现。
+
+逻辑结构和物理结构定义都用于规则创建的解析和载入阶段的 SQL 语法验证以及运行时优化等。推断后的数据流的数据结构可通过 [Schema API](../operation/restapi/streams.md#获取数据结构)获取。
+
 ### Strict Validation
 
-```
-StrictValidation 的值可以为 true 或 false。
-1)True:如果消息不符合流定义,则删除消息。
-2)False:保留消息,但用默认的空值填充缺少的字段。
-
-bigint: 0
-float: 0.0
-string: ""
-datetime: (NOT support yet)
-boolean: false
-array: zero length array
-struct: null value
-```
+仅用于逻辑结构的数据流。若设置 strict validation,则规则运行中将根据逻辑结构对字段存在与否以及字段类型进行校验。若数据格式完好,建议关闭验证。
 
 ### Schema-less 流