Skip to main content

GitHub 中的扁平数据处理

git 是一种流行的系统,用于组织文本文件和更改的历史记录。Git 还可以存储和跟踪电子表格。

¥Git is a popular system for organizing a historical record of text files and changes. Git can also store and track spreadsheets.

GitHub 托管 Git 存储库并提供执行工作流程的基础设施。"扁平数据" 项目 探索使用 GitHub 基础设施存储和比较结构化数据的版本。

¥GitHub hosts Git repositories and provides infrastructure to execute workflows. The "Flat Data" project explores storing and comparing versions of structured data using GitHub infrastructure.

SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。

¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.

该演示使用 GitHub 中的 SheetJS 来处理电子表格。我们将探讨如何定期获取和处理电子表格,以及如何跟踪一段时间内的变化。

¥This demo uses SheetJS in GitHub to process spreadsheets. We'll explore how to fetch and process spreadsheets at regular intervals, and how to keep track of changes over time.

"Excel 转 CSV" 是一个官方示例,它从端点拉取 XLSX 工作簿并使用 SheetJS 解析工作簿并生成 CSV 文件。

¥"Excel to CSV" is an official example that pulls XLSX workbooks from an endpoint and uses SheetJS to parse the workbooks and generate CSV files.

下图描述了数据舞蹈:

¥The following diagram depicts the data dance:

扁平数据

¥Flat Data

政府和组织发布的许多官方数据都包含 XLSX 或 XLS 文件。不幸的是,某些数据源不保留旧版本。

¥Many official data releases by governments and organizations include XLSX or XLS files. Unfortunately some data sources do not retain older versions.

软件开发者通常使用版本控制系统(例如 Git)来跟踪源代码的更改。

¥Software developers typically use version control systems such as Git to track changes in source code.

"扁平数据" 项目的出发点是可以使用相同的版本控制系统来跟踪数据的更改。可以定期对第三方数据源进行快照并存储在 Git 存储库中。

¥The "Flat Data" project starts from the idea that the same version control systems can be used to track changes in data. Third-party data sources can be snapshotted at regular intervals and stored in Git repositories.

成分

¥Components

作为公司的一个项目,整个生命周期都使用 GitHub 产品:

¥As a project from the company, the entire lifecycle uses GitHub offerings:

  • GitHub.com[^1] 为 Git 存储库提供免费托管

    ¥GitHub.com[^1] offers free hosting for Git repositories

  • GitHub Actions[^2] 基础设施定期运行任务

    ¥GitHub Actions[^2] infrastructure runs tasks at regular intervals

  • githubocto/flat[^3] 库有助于获取数据并自动进行后处理

    ¥githubocto/flat[^3] library helps fetch data and automate post-processing

  • flat-postprocessing[^4] 库提供后处理辅助函数

    ¥flat-postprocessing[^4] library provides post-processing helper functions

  • "扁平查看器"[^5] 显示来自 Git 存储库的结构化 CSV 和 JSON 数据

    ¥"Flat Viewer"[^5] displays structured CSV and JSON data from Git repositories

需要 GitHub 账户。上次测试演示时,"GitHub 免费" 账户对公共存储库 [^6] 没有操作使用限制。

¥A GitHub account is required. When the demo was last tested, "GitHub Free" accounts had no Actions usage limits for public repositories[^6].

私有 GitHub 存储库可用于处理数据,但 Flat Viewer 将无法显示私有数据。

¥Private GitHub repositories can be used for processing data, but the Flat Viewer will not be able to display private data.

数据源

¥Data Source

任何公开可用的电子表格都可以是有效的数据源。该过程将按指定的时间间隔或事件获取数据。

¥Any publicly available spreadsheet can be a valid data source. The process will fetch the data on specified intervals or events.

对于本演示,将使用 https://xlsx.nodejs.cn/pres.xlsx

¥For this demo, https://xlsx.nodejs.cn/pres.xlsx will be used.

行动

¥Action

githubocto/flat 操作可以添加为工作流程中的一个步骤:

¥The githubocto/flat action can be added as a step in a workflow:

      - name: Fetch data
uses: githubocto/flat@v3
with:
http_url: https://xlsx.nodejs.cn/pres.xlsx
downloaded_filename: data.xlsx
postprocess: ./postprocess.ts

此操作执行以下步骤:

¥This action performs the following steps:

  1. http_url 将被提取并保存到存储库中的 downloaded_filename。这可以通过以下命令来近似:

    ¥http_url will be fetched and saved to downloaded_filename in the repo. This can be approximated with the following command:

curl -L -o data.xlsx https://xlsx.nodejs.cn/pres.xlsx
  1. 保存后,将运行 postprocess 脚本。当 .ts 文件是脚本时,它将在 Deno 运行时运行该脚本。postprocess 脚本预计会读取下载的文件并在存储库中创建或覆盖文件。这可以通过以下命令来近似:

    ¥After saving, the postprocess script will be run. When a .ts file is the script, it will run the script in the Deno runtime. The postprocess script is expected to read the downloaded file and create or overwrite files in the repo. This can be approximated with the following command:

deno run -A ./postprocess.ts data.xlsx
  1. 该操作将比较存储库的内容,如果 postprocess 脚本中的源数据或工件发生更改,则创建新的提交。

    ¥The action will compare the contents of the repo, creating a new commit if the source data or artifacts from the postprocess script changed.

后处理数据

¥Post-Processing Data

flat-postprocessing 库包含许多适用于不同数据格式的实用程序。readXLSX 助手在底层使用了 SheetJS。

¥The flat-postprocessing library includes a number of utilities for different data formats. The readXLSX helper uses SheetJS under the hood.

该库使用旧版本的 SheetJS 库。要使用最新版本,请从 SheetJS CDN 导入示例:

¥The library uses an older version of the SheetJS library. To use the latest releases, the examples import from the SheetJS CDN:

// @deno-types="https://cdn.sheetjs.com/xlsx-0.20.3/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/xlsx.mjs';

详细信息请参见 "Deno" 安装部分

¥See the "Deno" installation section for more details.

后处理脚本

¥Post-Process Script

后处理脚本的第一个参数是文件名。

¥The first argument to the post-processing script is the filename.

SheetJS readFile 方法 [^7] 将读取该文件并生成 SheetJS 工作簿对象 [^8]。提取第一个工作表后,sheet_to_csv[^9] 生成一个 CSV 字符串。

¥The SheetJS readFile method[^7] will read the file and generate a SheetJS workbook object[^8]. After extracting the first worksheet, sheet_to_csv[^9] generates a CSV string.

生成 CSV 字符串后,应使用 Deno.writeFileSync[^10] 将字符串写入文件系统。按照约定,CSV 应保留文件名主干并将扩展名替换为 .csv

¥After generating a CSV string, the string should be written to the filesystem using Deno.writeFileSync[^10]. By convention, the CSV should preserve the file name stem and replace the extension with .csv:

postprocess.ts
// @deno-types="https://cdn.sheetjs.com/xlsx-0.20.3/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);

/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");

/* read file */
const workbook = XLSX.readFile(in_file);

/* generate CSV from first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);

/* write CSV */
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));

完整示例

¥Complete Example

测试部署

SheetJS 用户于 2024 年 6 月 7 日对此进行了最后一次测试。

¥This was last tested by SheetJS users on 2024 June 07.

https://github.com/SheetJS/flat-sheet 是之前测试的示例。存储库的 Flat Viewer URL 是 https://flatgithub.com/SheetJS/flat-sheet/

¥https://github.com/SheetJS/flat-sheet is an example from a previous test. The Flat Viewer URL for the repo is https://flatgithub.com/SheetJS/flat-sheet/

创建项目

¥Create Project

  1. 创建免费的 GitHub 账户或登录 GitHub Web 界面。

    ¥Create a free GitHub account or sign into the GitHub web interface.

  2. 创建新的存储库(单击右上角的 "*" 图标)。

    ¥Create a new repository (click the "+" icon in the upper-right corner).

  • 出现提示时,输入你选择的存储库名称。

    ¥When prompted, enter a repository name of your choosing.

  • 确保选择了 "公开"

    ¥Ensure "Public" is selected

  • 检查 "添加自述文件"

    ¥Check "Add a README file"

  • 单击底部的 "创建存储库"。

    ¥Click "Create repository" at the bottom.

你将被重定向到新项目。

¥You will be redirected to the new project.

添加代码

¥Add Code

  1. 在浏览器 URL 栏中,将 "github.com" 更改为 "github.dev"。例如,如果 URL 原来是 https://github.com/SheetJS/flat-sheet ,则新 URL 应为 https://github.dev/SheetJS/flat-sheet 。 按 Enter 键。

    ¥In the browser URL bar, change "github.com" to "github.dev". For example, if the URL was originally https://github.com/SheetJS/flat-sheet , the new URL should be https://github.dev/SheetJS/flat-sheet . Press Enter.

  2. 在左侧 "EXPLORER" 面板中,双击 README.md 正下方。README 上方将出现一个文本框。输入 postprocess.ts 并按 Enter 键。

    ¥In the left "EXPLORER" panel, double-click just below README.md. A text box will appear just above README. Type postprocess.ts and press Enter.

    主面板将显示 postprocess.ts 选项卡。将以下代码复制到主编辑器窗口:

    ¥The main panel will show a postprocess.ts tab. Copy the following code to the main editor window:

postprocess.ts
// @deno-types="https://cdn.sheetjs.com/xlsx-0.20.3/package/types/index.d.ts"
import * as XLSX from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/xlsx.mjs';
/* load the codepage support library for extended support with older formats */
import * as cptable from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/cpexcel.full.mjs';
XLSX.set_cptable(cptable);

/* get the file path for the downloaded file and generate the CSV path */
const in_file = Deno.args[0];
const out_file = in_file.replace(/.xlsx$/, ".csv");

/* read file */
const workbook = XLSX.readFile(in_file);

/* generate CSV */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];
const csv = XLSX.utils.sheet_to_csv(first_sheet);

/* write CSV */
Deno.writeFileSync(out_file, new TextEncoder().encode(csv));
  1. 在左侧 "EXPLORER" 面板中,双击 README.md 正下方。将出现一个文本框。输入 .github/workflows/data.yaml 并按 Enter 键。

    ¥In the left "EXPLORER" panel, double-click just below README.md. A text box will appear. Type .github/workflows/data.yaml and press Enter.

    将以下代码复制到主区域。它将创建一个大约每小时运行一次的操作:

    ¥Copy the following code into the main area. It will create an action that runs roughly once an hour:

.github/workflows/data.yaml
name: flatsheet

on:
workflow_dispatch:
schedule:
- cron: '0 * * * *'

jobs:
scheduled:
runs-on: ubuntu-latest
steps:
- name: Setup deno
uses: denoland/setup-deno@main
with:
deno-version: v1.x
- name: Check out repo
uses: actions/checkout@v2
- name: Fetch data
uses: githubocto/flat@v3
with:
http_url: https://xlsx.nodejs.cn/pres.xlsx
downloaded_filename: data.xlsx
postprocess: ./postprocess.ts
  1. 单击源代码控制图标(带有数字 2 的蓝色小圆圈)。在左侧面板中,选择消息框,键入 init,然后按 CTRL + Enter(在 Windows 上)(在 MacOS 上按 Command+Enter)。

    ¥Click on the source control icon (a little blue circle with the number 2). In the left panel, select Message box, type init and press CTRL +Enter on Windows (Command+Enter on MacOS).

  2. 单击 图标,然后单击 "转到存储库" 返回存储库页面。

    ¥Click the icon and click "Go to Repository" to return to the repo page.

测试行动

¥Test Action

  1. 单击 "设置" 查看存储库设置。在左栏中,单击 "行动" 展开子菜单,然后单击 "一般的"。

    ¥Click "Settings" to see the repository settings. In the left column, click "Actions" to expand the submenu and click "General".

    向下滚动到 "工作流程权限",如果未选择,请选择 "读写权限"。单击 "保存"。

    ¥Scroll down to "Workflow permissions" and select "Read and write permissions" if it is not selected. Click "Save".

  2. 单击 "行动" 查看工作流程。在左栏中,单击 flatsheet

    ¥Click "Actions" to see the workflows. In the left column, click flatsheet.

    这是该操作的页面。每次运行该操作时,都会将一个新条目添加到列表中。

    ¥This is the page for the action. Every time the action is run, a new entry will be added to the list.

    单击 "运行工作流程",然后单击弹出窗口中的 "运行工作流程" 按钮。这将开始新的运行。大约 30 秒后,主区域中应出现新行。该图标应该是绿色圆圈中的白色

    ¥Click "Run workflow", then click the "Run workflow" button in the popup. This will start a new run. After about 30 seconds, a new row should show up in the main area. The icon should be a white in a green circle.

  3. 单击 "代码" 返回主视图。它应该有一个文件列表,其中包括 data.xlsx(下载的文件)和 data.csv(生成的数据)

    ¥Click "Code" to return to the main view. It should have a file listing that includes data.xlsx (downloaded file) and data.csv (generated data)

  4. 重复步骤 8 以第二次运行该操作。再次单击 "代码"。

    ¥Repeat step 8 to run the action a second time. Click "Code" again.

查看器

¥Viewer

  1. 转到地址栏并将 "github.com" 更改为 "flatgithub.com"。例如,如果 URL 原来是 https://github.com/SheetJS/flat-sheet ,则新 URL 应为 https://flatgithub.com/SheetJS/flat-sheet 。 按 Enter 键。

    ¥Go to the URL bar and change "github.com" to "flatgithub.com". For example, if the URL was originally https://github.com/SheetJS/flat-sheet , the new URL should be https://flatgithub.com/SheetJS/flat-sheet . Press Enter.

你将看到 "扁平查看器"。在顶部栏中,"提交" 选项允许切换到旧版本的数据。

¥You will see the "Flat Viewer". In the top bar, the "Commit" option allows for switching to an older version of the data.

以下屏幕截图显示了查看器的运行情况:

¥The following screenshot shows the viewer in action:

Flat Viewer for SheetJS/flat-sheet

Index 列中的柱形图是直方图。

¥The column chart in the Index column is a histogram.

[^1]: 请参阅 GitHub 文档中的 "存储库文档"

¥See "Repositories documentation" in the GitHub documentation.

[^2]: 请参阅 GitHub 文档中的 "GitHub Actions 文档"

¥See "GitHub Actions documentation" in the GitHub documentation.

[^3]: 请参阅 GitHub 上的 githubocto/flat 存储库。

¥See githubocto/flat repo on GitHub.

[^4]: 请参阅 GitHub 上的 githubocto/flat-postprocessing 存储库。

¥See githubocto/flat-postprocessing repo on GitHub.

[^5]: 托管版本在 https://flatgithub.com/ 可用

¥The hosted version is available at https://flatgithub.com/

[^6]: 请参阅 GitHub 文档中的 "关于 GitHub Actions 的计费"

¥See "About billing for GitHub Actions" in the GitHub documentation.

[^7]: 见 readFile 于 "读取文件"

¥See readFile in "Reading Files"

[^8]: 见 "工作簿对象"

¥See "Workbook Object"

[^9]: 见 sheet_to_csv 于 "CSV 和文本"

¥See sheet_to_csv in "CSV and Text"

[^10]: 请参阅 Deno 运行时 API 文档中的 Deno.writeFileSync

¥See Deno.writeFileSync in the Deno Runtime APIs documentation.