
doc: file source

Signed-off-by: Jiyong Huang <huangjy@emqx.io>
Jiyong Huang 2 years ago
parent
commit
9cdad5c65d
2 changed files with 92 additions and 35 deletions
  1. 73 2
      docs/en_US/guide/sources/builtin/file.md
  2. 19 33
      docs/zh_CN/guide/sources/builtin/file.md

+ 73 - 2
docs/en_US/guide/sources/builtin/file.md

@@ -4,7 +4,7 @@
 <span style="background:green;color:white">scan table source</span>
 
 eKuiper provides built-in support for reading file content into the eKuiper processing pipeline. The file source is
-usually used as a [table](../../../sqls/tables.md) and it is the default type for create table statement. File sources
+usually used as a [table](../../../sqls/tables.md), and it is the default type for the create table statement. File sources
 are also supported as streams, where it is usually necessary to set the `interval` parameter to pull updates at regular
 intervals.
 
@@ -48,4 +48,75 @@ default:
   ignoreEndLines: 0
 ```
 
-With this yaml file, the table will refer to the file *${eKuiper}/data/lookup.json* and read it in json format.
+### File Types
+
+The file source supports monitoring files or folders. If the monitored location is a folder, all files in the folder are required to be of the same type. When monitoring a folder, it reads the files in alphabetical order by file name.
+
+The supported file types are
+
+- json: standard JSON array format files,
+  see [example](https://github.com/lf-edge/ekuiper/tree/master/internal/topo/source/test/test.json). If the file content is line-separated JSON strings, it needs to be defined in `lines` format.
+- csv: comma-separated csv files are supported, as well as files with custom separators.
+- lines: line-separated file. The decoding method of each line can be defined by the `format` parameter in the stream definition. For example, for line-separated JSON strings, set the file type to `lines` and the format to `json`.
+
+Some files may have most of their data in standard format but contain some metadata in the opening and closing lines. The user can use the `ignoreStartLines` and `ignoreEndLines` parameters to remove these non-standard beginning and ending lines so that the above file types can be parsed.
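+
+For instance, assuming a hypothetical file whose first two lines and last line contain metadata rather than data, a configuration such as the following could skip them:
+
+```yaml
+csv_with_meta:
+  fileType: csv
+  hasHeader: true
+  # skip two metadata lines at the start of the file
+  ignoreStartLines: 2
+  # skip one summary line at the end of the file
+  ignoreEndLines: 1
+```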
+
+### Example
+
+Parsing a file source involves both the file type and the format definition of the data stream. The following examples
+describe how to combine file types and formats to parse file sources.
+
+#### Read a csv with a custom separator
+
+The standard csv separator is a comma, but many files use a csv-like format with custom separators. In addition, some
+csv-like files define column names in the first line instead of data, as in the following example.
+
+```csv
+id name age
+1 John 56
+2 Jane 34
+```
+
+To read such a file, the configuration is as follows, specifying that the file has a header.
+
+```yaml
+csv:
+  fileType: csv
+  hasHeader: true
+```
+
+In the stream definition, set the stream data to `DELIMITED` format, specifying the separator with the `DELIMITER`
+parameter.
+
+```SQL
+create stream cscFileDemo () WITH (FORMAT="DELIMITED", DATASOURCE="abc.csv", TYPE="file", DELIMITER=" ", CONF_KEY="csv");
+```
+
+#### Read multi-line JSON data
+
+For a standard JSON file, the entire file should be a single JSON object or an array. In practice, we often need to
+parse files that contain multiple JSON objects. Such files are not valid JSON as a whole, but each line is valid JSON,
+so they can be treated as multi-line JSON data.
+
+```text
+{"id": 1, "name": "John Doe"}
+{"id": 2, "name": "Jane Doe"}
+{"id": 3, "name": "John Smith"}
+```
+
+When reading this file, the configuration file is as follows, specifying the file type as lines.
+
+```yaml
+jsonlines:
+  fileType: lines
+```
+
+In the stream definition, set the stream data to be in `JSON` format.
+
+```SQL
+create stream linesFileDemo () WITH (FORMAT="JSON", TYPE="file", CONF_KEY="jsonlines");
+```
+
+Moreover, the lines file type can be combined with any format. For example, if you set the format to protobuf and
+configure the schema, it can be used to parse data that contains multiple Protobuf-encoded lines.
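+
+As a sketch of that combination (the conf key and schema name below are hypothetical, assuming a registered protobuf schema referenced via `SCHEMAID`), the configuration would still only declare the file type:
+
+```yaml
+protolines:
+  fileType: lines
+```
+
+while the stream definition would set the format to protobuf and reference the schema:
+
+```SQL
+create stream protoFileDemo () WITH (FORMAT="PROTOBUF", SCHEMAID="demo.Person", TYPE="file", CONF_KEY="protolines");
+```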

+ 19 - 33
docs/zh_CN/guide/sources/builtin/file.md

@@ -46,34 +46,25 @@ default:
   ignoreEndLines: 0
 ```
 
-### File Types
+### File Types
 
-The file source supports monitoring files or folders. If the monitored location is a folder, all files in the folder are
-required to be of the same type. When monitoring a folder, it will read in files order by file name alphabetically.
+The file source supports monitoring files or folders. If the monitored location is a folder, all files in the folder must be of the same type. When monitoring a folder, it reads the files in alphabetical order by file name.
 
-The supported file types are
+The supported file types are:
 
-- json: standard JSON array format files,
-  see [example](https://github.com/lf-edge/ekuiper/tree/master/internal/topo/source/test/test.json). If the file format
-  is a line-separated JSON string, it needs to be defined in lines format.
-- csv: comma-separated csv files are supported, as well as custom separators.
-- lines: line-separated file. The decoding method of each line can be defined by the format parameter in the stream
-  definition. For example, for a line-separated JSON string, the file type is set to lines and the format is set to
-  json.
+- json: standard JSON array format files. See the [example](https://github.com/lf-edge/ekuiper/tree/master/internal/topo/source/test/test.json). If the file content is line-separated JSON strings, it needs to be defined in the `lines` format.
+- csv: comma-separated csv files are supported, as well as files with custom separators.
+- lines: line-separated file. The decoding method of each line can be defined by the `format` parameter in the stream definition. For example, for a file of line-separated JSON strings, set the file type to `lines` and the format to `json`, indicating that each line is in json format.
 
-Some files may have most of the data in standard format, but have some metadata in the opening and closing lines of the
-file. The user can use the `ignoreStartLines` and `ignoreEndLines` arguments to remove the non-standard parts of the
-beginning and end so that the above file types can be parsed.
+Some files may have most of their data in standard format but contain some metadata in the opening and closing lines. The user can use the `ignoreStartLines` and `ignoreEndLines` parameters to remove these non-standard beginning and ending lines so that the above file types can be parsed.
 
-### Example
+### Example
 
-File sources involve the parsing of file contents and intersect with format-related definitions in data streams. We
-describe with some examples how to combine file types and formats for parsing file sources.
+Parsing a file source involves both the file type and the format definition of the data stream. The following examples describe how to combine file types and format settings to parse file sources.
 
-#### Read a csv with a custom separator
+#### Read a csv with a custom separator
 
-The standard csv separator is a comma, but there are a large number of files that use the csv-like format with custom
-separators. Some csv-like files have column names defined in the first line instead of data.
+In a standard csv file the separator is a comma, but many files use a csv-like format with custom separators. In addition, some csv-like files define column names in the first line instead of data, as in the following example.
 
 ```csv
 id name age
@@ -81,7 +72,7 @@ id name age
 2 Jane 34
 ```
 
-When the file is read, the configuration file is as follows, specifying that the file has a header.
+The first line of this file is the header, which defines the column names. To read such a file, the configuration is as follows, specifying that the file has a header.
 
 ```yaml
 csv:
@@ -89,19 +80,16 @@ csv:
   hasHeader: true
 ```
 
-In the stream definition, set the stream data to ``DELIMITED`` format, specifying the separator with the ``DELIMITER``
-parameter.
+In the stream definition, set the stream data to `DELIMITED` format, and use the `DELIMITER` parameter to specify the separator as a space.
 
 ```SQL
 create stream cscFileDemo () WITH (FORMAT="DELIMITED", DATASOURCE="abc.csv", TYPE="file", DELIMITER=" ", CONF_KEY="csv");
 ```
 
-#### Read multi-line JSON data
+#### Read multi-line JSON data
 
-With a standard JSON file, the entire file should be a JSON object or an array. In practice, we often need to parse
-files that contain multiple JSON objects. These files are not actually JSON themselves, but are considered to be
-multiple lines of JSON data, assuming that each JSON object is a single line.
+For a standard JSON file, the entire file should be a single JSON object or an array. In practice, we often need to parse files that contain multiple JSON objects. Such files are not valid JSON as a whole, but each line is valid JSON, so they can be treated as multi-line JSON data.
 
 ```text
 {"id": 1, "name": "John Doe"}
@@ -109,19 +97,17 @@ multiple lines of JSON data, assuming that each JSON object is a single line.
 {"id": 3, "name": "John Smith"}
 ```
 
-When reading this file, the configuration file is as follows, specifying the file type as lines.
+When reading a file in this format, set the file type in the configuration to `lines`.
 
 ```yaml
 jsonlines:
   fileType: lines
 ```
 
-In the stream definition, set the stream data to be in ``JSON`` format.
+In the stream definition, set the stream data to `JSON` format.
 
 ```SQL
-create
-stream linesFileDemo () WITH (FORMAT="JSON", TYPE="file", CONF_KEY="jsonlines"
+create stream linesFileDemo () WITH (FORMAT="JSON", TYPE="file", CONF_KEY="jsonlines");
 ```
 
-Moreover, the lines file type can be combined with any format. For example, if you set the format to protobuf and
-configure the schema, it can be used to parse data that contains multiple Protobuf encoded lines.
+Moreover, the lines file type can be combined with any format. For example, if you set the format to protobuf and configure the schema, it can be used to parse data that contains multiple Protobuf-encoded lines.