Python 中的电子表格数据

熊猫是一个用于数据分析的 Python 库。

¥Pandas is a Python library for data analysis.

SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。

¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.

该演示使用 SheetJS 处理电子表格中的数据并将其转换为 Pandas DataFrame 格式。我们将探索如何从 Python 脚本加载 SheetJS、从工作簿生成 DataFrame 以及将 DataFrame 写回工作簿。

¥This demo uses SheetJS to process data from a spreadsheet and translate to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.

"完整示例" 包括一个封装库，可简化电子表格的导入和导出。

¥The "Complete Example" includes a wrapper library that simplifies importing and exporting spreadsheets.

Pandas 对读取电子表格 (pandas.from_excel) 和编写 XLSX 电子表格 (pandas.DataFrame.to_excel) 提供有限支持。

¥Pandas includes limited support for reading spreadsheets (pandas.from_excel) and writing XLSX spreadsheets (pandas.DataFrame.to_excel).

SheetJS 支持许多 Pandas 无法处理的常见电子表格格式。

¥SheetJS supports many common spreadsheet formats that Pandas cannot process.

SheetJS 操作还为处理复杂工作表提供了更大的灵活性。

¥SheetJS operations also offer more flexibility in processing complex worksheets.

测试环境

该演示在以下部署中进行了测试：

¥This demo was tested in the following deployments:

架构	JS 引擎	熊猫	Python	日期
`darwin-x64`	杜克胶带 `2.7.0`	2.2.1	3.12.2	2024-03-15
`darwin-arm`	杜克胶带 `2.7.0`	2.2.2	3.12.3	2024-06-30
`win10-x64`	杜克胶带 `2.7.0`	2.2.1	3.12.2	2024-03-25
`win11-arm`	杜克胶带 `2.7.0`	2.2.2	3.11.5	2024-06-20
`linux-x64`	杜克胶带 `2.7.0`	1.5.3	3.11.3	2024-03-21
`linux-arm`	杜克胶带 `2.7.0`	1.5.3	3.11.2	2024-06-20

集成详情

¥Integration Details

sheetjs.py 是一个封装脚本，提供用于读取和写入电子表格的辅助方法。安装说明包含在 "完整示例" 部分中。

¥sheetjs.py is a wrapper script that provides helper methods for reading and writing spreadsheets. Installation notes are included in the "Complete Example" section.

Python 中的 JS

¥JS in Python

JS 代码无法在 Python 实现中直接计算。

¥JS code cannot be directly evaluated in Python implementations.

要从 Python 运行 JS 代码，JavaScript 引擎 [^1] 可以嵌入到 Python 模块中或使用 ctypes 外部函数库 [^2] 动态加载。该演示使用 ctypes 和杜克胶带发动机。

¥To run JS code from Python, JavaScript engines[^1] can be embedded in Python modules or dynamically loaded using the ctypes foreign function library[^2]. This demo uses ctypes with the Duktape engine.

封装器

¥Wrapper

该脚本导出一个名为 SheetJSWrapper 的类。它是一个上下文管理器，用于初始化 Duktape 引擎并在入口处执行 SheetJS 脚本。所有工作都应在以下上下文中进行：

¥The script exports a class named SheetJSWrapper. It is a context manager that initializes the Duktape engine and executes SheetJS scripts on entrance. All work should be performed in the context:

Complete Example
#!/usr/bin/env python3
from sheetjs import SheetJSWrapper

with SheetJSWrapper() as sheetjs:

  # Parse file
  wb = sheetjs.read_file("pres.numbers")
  print("Loaded file pres.numbers")

  # Get first worksheet name
  first_ws_name = wb.get_sheet_names()[0]
  print(f"Reading from sheet {first_ws_name}")

  # Generate DataFrame from first worksheet
  df = wb.get_df(first_ws_name)
  print(df.info())

  # Export DataFrame to XLSB
  sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")

读取文件

¥Reading Files

sheetjs.read_file 接受电子表格文件的路径。它将解析文件并返回代表工作簿的对象。

¥sheetjs.read_file accepts a path to a spreadsheet file. It will parse the file and return an object representing the workbook.

工作簿的 get_sheet_names 方法返回工作表名称列表。

¥The get_sheet_names method of the workbook returns a list of sheet names.

工作簿的 get_df 方法从工作簿生成一个 DataFrame。可以通过传递名称来选择特定的工作表。

¥The get_df method of the workbook generates a DataFrame from the workbook. The specific sheet can be selected by passing the name.

例如，以下代码读取 pres.numbers 并从第二个工作表生成一个 DataFrame：

¥For example, the following code reads pres.numbers and generates a DataFrame from the second worksheet:

Generating a DataFrame from the second worksheet
with SheetJSWrapper() as sheetjs:
  # Parse file
  wb = sheetjs.read_file(path)

  # Generate DataFrame from second worksheet
  ws_name = wb.get_sheet_names()[1]
  df = wb.get_df(ws_name)

  # Print metadata
  print(df.info())

在幕后，sheetjs.py 执行以下步骤：

¥Under the hood, sheetjs.py performs the following steps:

纯 Python 操作读取电子表格文件并生成字节字符串。

¥Pure Python operations read the spreadsheet file and generate a byte string.
SheetJS 库解析字符串并生成干净的 CSV。

¥SheetJS libraries parse the string and generate a clean CSV.

read 方法 [^3] 将文件字节解析为 SheetJS 工作簿对象 [^4]

¥The read method[^3] parses file bytes into a SheetJS workbook object[^4]
选择工作表后，sheet_to_csv[^5] 生成 CSV 字符串

¥After selecting a worksheet, sheet_to_csv[^5] generates a CSV string

Python 操作将 CSV 字符串转换为流对象。[^6]

¥Python operations convert the CSV string to a stream object.[^6]
Pandas read_csv 方法 [^7] 摄取流并生成 DataFrame。

¥The Pandas read_csv method[^7] ingests the stream and generate a DataFrame.

写入文件

¥Writing Files

sheetjs.write_df 接受 DataFrame 和路径。它将尝试将数据导出到电子表格文件。

¥sheetjs.write_df accepts a DataFrame and a path. It will attempt to export the data to a spreadsheet file.

例如，以下代码将 DataFrame 导出到 SheetJSPandas.xlsb：

¥For example, the following code exports a DataFrame to SheetJSPandas.xlsb:

Exporting a DataFrame to XLSB
with SheetJSWrapper() as sheetjs:
  # Export DataFrame to XLSB
  sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")

在幕后，sheetjs.py 执行以下步骤：

¥Under the hood, sheetjs.py performs the following steps:

Pandas DataFrame to_json method[^8] 生成一个 JSON 字符串。

¥The Pandas DataFrame to_json method[^8] generates a JSON string.
JS 引擎操作将 JSON 字符串转换为对象数组。

¥JS engine operations translate the JSON string to an array of objects.
SheetJS 库处理数据数组并生成文件字节。

¥SheetJS libraries process the data array and generate file bytes.

json_to_sheet 方法 [^9] 从数据创建一个 SheetJS 工作表对象。

¥The json_to_sheet method[^9] creates a SheetJS sheet object from the data.
book_new 方法 [^10] 创建一个包含该工作表的 SheetJS 工作簿。

¥The book_new method[^10] creates a SheetJS workbook that includes the sheet.
write 方法 [^11] 生成电子表格文件字节。

¥The write method[^11] generates the spreadsheet file bytes.

纯 Python 操作将字节写入文件。

¥Pure Python operations write the bytes to file.

完整示例

¥Complete Example

此示例将从 Apple Numbers 电子表格中提取数据并生成 DataFrame。DataFrame 将导出为二进制 XLSB 电子表格格式。

¥This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.

Windows 构建需要带有 "使用 C++ 进行桌面开发" 的 Visual Studio。命令必须在 "原生工具命令提示符" 会话中运行。

¥The Windows build requires Visual Studio with "Desktop development with C++". Commands must be run in a "Native Tools Command Prompt" session.

安装熊猫：

¥Install Pandas:

python3 -m pip install pandas

在 macOS 和 Linux 上，安装命令可能需要 root 访问权限：

¥On macOS and Linux, the install command may require root access:

sudo python3 -m pip install pandas

当 pip 没有安装时，命令会失败：

¥When pip is not installed, the command will fail:

/usr/bin/python3: No module named pip

必须安装 pip。在基于 Arch Linux 的平台（包括 Steam Deck）上，可以通过包管理器安装 python-pip：

¥pip must be installed. On Arch Linux-based platforms including the Steam Deck, python-pip can be installed through the package manager:

sudo pacman -Syu python-pip

在一些本地测试中，安装失败并出现以下错误：

¥In some local tests, the install failed with the following error:

error: externally-managed-environment

必须通过包管理器安装 Pandas：

¥Pandas must be installed through the package manager:

Debian 和 Ubuntu 发行版：

¥Debian and Ubuntu distributions:

sudo apt-get install python3-pandas

基于 Arch Linux 的平台，包括 Steam Deck：

¥Arch Linux-based platforms including the Steam Deck:

sudo pacman -Syu python-pandas

带有 Homebrew 的 Python 版本的 macOS 系统：

¥macOS systems with a Python version from Homebrew:

sudo python3 -m pip install pandas --break-system-packages

构建 Duktape 共享库：

¥Build the Duktape shared library:

MacOS
Linux
Windows

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..

下载并解压源 tarball。命令必须在 WSL bash 中运行：

¥Download and extract the source tarball. Commands must be run in WSL bash:

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz

（运行 bash，然后运行上述命令，然后运行 exit 退出 WSL）

¥(Run bash, then run the aforementioned commands, then run exit to exit WSL)

进入源文件夹：

¥Enter the source folder:

cd duktape-2.7.0

编辑 src\duk_config.h 并将高亮的行添加到文件末尾：

¥Edit src\duk_config.h and add the highlighted lines to the end of the file:

src\duk_config.h (add highlighted lines)
#endif  /* DUK_CONFIG_H_INCLUDED */

#define DUK_EXTERNAL_DECL  extern __declspec(dllexport)
#define DUK_EXTERNAL       __declspec(dllexport)

构建 Duktape DLL：

¥Build the Duktape DLL:

cl /O2 /W3 /Isrc /LD /DDUK_SINGLE_FILE /DDUK_F_DLL_BUILD /DDUK_F_WINDOWS /DDUK_COMPILING_DUKTAPE src\\duktape.c

上移到父目录：

¥Move up to the parent directory:

cd ..

将共享库复制到当前文件夹。上次测试演示时，共享库文件名因平台而异：

¥Copy the shared library to the current folder. When the demo was last tested, the shared library file name differed by platform:

OS	name
苹果系统	`libduktape.207.20700.so`
Linux	`libduktape.so.207.20700`
Windows	`duktape.dll`

MacOS
Linux
Windows

cp duktape-*/libduktape.* .

cp duktape-*/libduktape.* .

copy duktape-2.7.0\duktape.dll .

下载 SheetJS Standalone 脚本并移至项目目录：

¥Download the SheetJS Standalone script and move to the project directory:

curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js

下载以下测试脚本和文件：

¥Download the following test scripts and files:

curl -LO https://xlsx.nodejs.cn/pres.numbers
curl -LO https://xlsx.nodejs.cn/pandas/sheetjs.py
curl -LO https://xlsx.nodejs.cn/pandas/SheetJSPandas.py

编辑 sheetjs.py 脚本。

¥Edit the sheetjs.py script.

lib 变量声明库的路径：

¥The lib variable declares the path to the library:

sheetjs.py (edit highlighted line)
lib = "libduktape.207.20700.so"

MacOS
Linux
Windows

库的名称是 libduktape.207.20700.so：

¥The name of the library is libduktape.207.20700.so:

sheetjs.py (change highlighted line)
lib = "libduktape.207.20700.so"

库的名称是 libduktape.so.207.20700：

¥The name of the library is libduktape.so.207.20700:

sheetjs.py (change highlighted line)
lib = "./libduktape.so.207.20700"

库的名称是 duktape.dll：

¥The name of the library is duktape.dll:

sheetjs.py (change highlighted line)
lib = ".\\duktape.dll"

此外，还必须进行以下更改：

¥In addition, the following changes must be made:

str_to_c 必须定义如下：

¥str_to_c must be defined as follows:

sheetjs.py (replace str_to_c function)
def str_to_c(s):
  b = s
  if type(b) == str: b = s.encode("latin1")
  return [c_char_p(b), len(b)]

eval_file 必须 open 且模式为 rb：

¥eval_file must open with mode rb:

sheetjs.py (edit highlighted line)
def eval_file(ctx, path):
  with open(path, "rb") as f:
    code = f.read()

运行脚本：

¥Run the script:

python3 SheetJSPandas.py pres.numbers

如果成功，脚本将显示 DataFrame 元数据：

¥If successful, the script will display DataFrame metadata:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Index   5 non-null      int64
dtypes: int64(1), object(1)

它还会将 DataFrame 导出到 SheetJSPandas.xlsb。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。

¥It will also export the DataFrame to SheetJSPandas.xlsb. The file can be inspected with a spreadsheet editor that supports XLSB files.

其他库

¥Other Libraries

其他 Python DataFrame 库镜像 Pandas DataFrame API。

¥Other Python DataFrame libraries mirror the Pandas DataFrame API.

Polars

Polars 是一个类似的 DataFrame 库，提供 Pandas DataFrames 的许多功能。

¥Polars is a similar DataFrame library that offers many features from Pandas DataFrames.

Polars 通过封装第三方库对读取和写入电子表格提供有限的支持。在实践中，Polars 使用中间 CSV 文件与第三方库进行通信。[^12]

¥Polars includes limited support for reading and writing spreadsheets by wrapping third-party libraries. In practice, Polars communicates with the third-party libraries using intermediate CSV files.[^12]

SheetJS 支持 Polars 无法处理的许多常见电子表格格式。

¥SheetJS supports many common spreadsheet formats that Polars cannot process.

SheetJS 操作还为处理复杂工作表提供了更大的灵活性。

¥SheetJS operations also offer more flexibility in processing complex worksheets.

Pandas 示例需要进行一些细微的更改才能与 Polars 配合使用：

¥The Pandas example requires a few slight changes to work with Polars:

Polars DataFrames 公开 write_json 而不是 to_json：

¥Polars DataFrames expose write_json instead of to_json:

-  json = df.to_json(orient="records")
+  json = df.write_json(row_oriented=True)

Polars DataFrames 不公开 info

¥Polars DataFrames do not expose info

Polars 演示

¥Polars Demo

测试环境

该演示在以下部署中进行了测试：

¥This demo was tested in the following deployments:

架构	JS 引擎	Polars	Python	日期
`darwin-x64`	杜克胶带 `2.7.0`	0.20.15	3.12.2	2024-03-15
`darwin-arm`	杜克胶带 `2.7.0`	0.20.31	3.12.3	2024-06-30
`win10-x64`	杜克胶带 `2.7.0`	0.20.16	3.12.2	2024-03-25
`win10-arm`	杜克胶带 `2.7.0`	0.20.31	3.11.5	2024-06-20
`linux-x64`	杜克胶带 `2.7.0`	0.20.16	3.11.3	2024-03-21
`linux-arm`	杜克胶带 `2.7.0`	0.20.31	3.11.2	2024-06-20

沿着熊猫 "完整示例" 到底。

¥Follow the Pandas "Complete Example" through the end.
编辑 sheetjs.py。

¥Edit sheetjs.py.

在脚本顶部附近，将导入从 pandas 更改为 polars：

¥Near the top of the script, change the import from pandas to polars:

sheetjs.py (edit highlighted line)
from io import StringIO
from polars import read_csv

duk = CDLL(lib)

在 export_df_to_wb 函数中，更改 df.to_json 行：

¥Within the export_df_to_wb function, change the df.to_json line:

sheetjs.py (edit highlighted line)
def export_df_to_wb(ctx, df, path, sheet_name="Sheet1", book_type=None):
  json = df.write_json(row_oriented=True)

编辑 SheetJSPandas.py。

¥Edit SheetJSPandas.py.

在 process 函数中，将 df.info() 更改为 df：

¥In the process function, change df.info() to df:

SheetJSPandas.py (edit highlighted line)
    # Generate DataFrame from first worksheet
    df = wb.get_df()
    print(df)

将导出文件名从 SheetJSPandas.xlsb 更改为 SheetJSPolars.xlsb：

¥Change the export filename from SheetJSPandas.xlsb to SheetJSPolars.xlsb:

SheetJSPandas.py (edit highlighted line)
    # Export DataFrame to XLSB
    sheetjs.write_df(df, "SheetJSPolars.xlsb", sheet_name="DataFrame")

安装极地：

¥Install Polars:

python3 -m pip install polars

在 macOS 和 Linux 上，安装命令可能需要 root 访问权限：

¥On macOS and Linux, the install command may require root access:

sudo python3 -m pip install pandas

在 Windows 上，必须通过 Visual Studio 安装程序安装 C++ Clang Compiler for Windows 组件。

¥On Windows, the C++ Clang Compiler for Windows component must be installed through the Visual Studio installer.

在基于 Arch Linux 的平台（包括 Steam Deck）上，安装可能会失败：

¥On Arch Linux-based platforms including the Steam Deck, the install may fail:

error: externally-managed-environment

建议对 Polars 使用虚拟环境。

¥It is recommended to use a virtual environment for Polars.

venv 必须通过系统包管理器安装：

¥venv must be installed through the system package manager:

Debian 和 Ubuntu 发行版：

¥Debian and Ubuntu distributions:

sudo apt-get install python3.11-venv

venv 包含在基于 Arch Linux 的平台的 python 包中。

¥venv is included in the python package in Arch Linux-based platforms.
带有 Homebrew 的 Python 版本的 macOS 系统：

¥macOS systems with a Python version from Homebrew:

brew install pyenv-virtualenv

安装 venv 后，以下命令设置虚拟环境：

¥After installing venv, the following commands set up the virtual environment:

mkdir sheetjs-polars
cd sheetjs-polars
python3 -m venv .
./bin/pip install polars
cp ../libduktape.* ../SheetJSPandas.py ../sheetjs.py ../*.js ../*.numbers .

运行脚本：

¥Run the script:

python3 SheetJSPandas.py pres.numbers

如果上一步配置了虚拟环境，则运行：

¥If the virtual environment was configured in the previous step, run:

./bin/python3 SheetJSPandas.py pres.numbers

如果成功，脚本将显示 DataFrame 数据：

¥If successful, the script will display DataFrame data:

shape: (5, 2)
┌──────────────┬───────┐
│ Name         ┆ Index │
│ ---          ┆ ---   │
│ str          ┆ i64   │
╞══════════════╪═══════╡
│ Bill Clinton ┆ 42    │
│ GeorgeW Bush ┆ 43    │
│ Barack Obama ┆ 44    │
│ Donald Trump ┆ 45    │
│ Joseph Biden ┆ 46    │
└──────────────┴───────┘

它还会将 DataFrame 导出到 SheetJSPolars.xlsb。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。

¥It will also export the DataFrame to SheetJSPolars.xlsb. The file can be inspected with a spreadsheet editor that supports XLSB files.

[^1]: 更多示例请参见 "其他语言"。

¥See "Other Languages" for more examples.

[^2]: 请参阅 Python 文档中的 ctypes。

¥See ctypes in the Python documentation.

[^3]: 见 read 于 "读取文件"

¥See read in "Reading Files"

[^4]: 见 "工作簿对象"

¥See "Workbook Object"

[^5]: 见 sheet_to_csv 于 "实用工具"

¥See sheet_to_csv in "Utilities"

[^6]: 请参阅 Pandas 文档中的 "IO 工具" 中的例子。

¥See the examples in "IO tools" in the Pandas documentation.

[^7]: 请参阅 Pandas 文档中的 pandas.read_csv。

¥See pandas.read_csv in the Pandas documentation.

[^8]: 请参阅 Pandas 文档中的 pandas.DataFrame.to_json。

¥See pandas.DataFrame.to_json in the Pandas documentation.

[^9]: 见 json_to_sheet 于 "实用工具"

¥See json_to_sheet in "Utilities"

[^10]: 见 book_new 于 "实用工具"

¥See book_new in "Utilities"

[^11]: 见 write 于 "写入文件"

¥See write in "Writing Files"

[^12]: 正如在 Polars 文档中所解释的，“...目标 Excel 工作表首先转换为 CSV ...然后用 Polars 的 read_csv() 函数解析。”

¥As explained in the Polars documentation, "... the target Excel sheet is first converted to CSV ... and then parsed with Polars’ read_csv() function."

集成详情​

Python 中的 JS​

封装器​

读取文件​

写入文件​

完整示例​

其他库​

Polars​

Polars 演示​