Skip to main content

Python 中的电子表格数据

熊猫 是一个用于数据分析的 Python 库。

¥Pandas is a Python library for data analysis.

SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。

¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.

该演示使用 SheetJS 处理电子表格中的数据并将其转换为 Pandas DataFrame 格式。我们将探索如何从 Python 脚本加载 SheetJS、从工作簿生成 DataFrame 以及将 DataFrame 写回工作簿。

¥This demo uses SheetJS to process data from a spreadsheet and translate to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.

"完整示例" 包括一个封装库,可简化电子表格的导入和导出。

¥The "Complete Example" includes a wrapper library that simplifies importing and exporting spreadsheets.

Pandas 对读取电子表格 (pandas.from_excel) 和编写 XLSX 电子表格 (pandas.DataFrame.to_excel) 提供有限支持。

¥Pandas includes limited support for reading spreadsheets (pandas.from_excel) and writing XLSX spreadsheets (pandas.DataFrame.to_excel).

SheetJS 支持许多 Pandas 无法处理的常见电子表格格式。

¥SheetJS supports many common spreadsheet formats that Pandas cannot process.

SheetJS 操作还为处理复杂工作表提供了更大的灵活性。

¥SheetJS operations also offer more flexibility in processing complex worksheets.

测试环境

该演示在以下部署中进行了测试:

¥This demo was tested in the following deployments:

架构JS 引擎熊猫Python日期
darwin-x64杜克胶带 2.7.02.2.13.12.22024-03-15
darwin-arm杜克胶带 2.7.02.2.23.12.32024-06-30
win10-x64杜克胶带 2.7.02.2.13.12.22024-03-25
win11-arm杜克胶带 2.7.02.2.23.11.52024-06-20
linux-x64杜克胶带 2.7.01.5.33.11.32024-03-21
linux-arm杜克胶带 2.7.01.5.33.11.22024-06-20

集成详情

¥Integration Details

sheetjs.py 是一个封装脚本,提供用于读取和写入电子表格的辅助方法。安装说明包含在 "完整示例" 部分中。

¥sheetjs.py is a wrapper script that provides helper methods for reading and writing spreadsheets. Installation notes are included in the "Complete Example" section.

Python 中的 JS

¥JS in Python

JS 代码无法在 Python 实现中直接计算。

¥JS code cannot be directly evaluated in Python implementations.

要从 Python 运行 JS 代码,JavaScript 引擎 [^1] 可以嵌入到 Python 模块中或使用 ctypes 外部函数库 [^2] 动态加载。该演示使用 ctypes杜克胶带发动机

¥To run JS code from Python, JavaScript engines[^1] can be embedded in Python modules or dynamically loaded using the ctypes foreign function library[^2]. This demo uses ctypes with the Duktape engine.

封装器

¥Wrapper

该脚本导出一个名为 SheetJSWrapper 的类。它是一个上下文管理器,用于初始化 Duktape 引擎并在入口处执行 SheetJS 脚本。所有工作都应在以下上下文中进行:

¥The script exports a class named SheetJSWrapper. It is a context manager that initializes the Duktape engine and executes SheetJS scripts on entrance. All work should be performed in the context:

Complete Example
#!/usr/bin/env python3
from sheetjs import SheetJSWrapper

with SheetJSWrapper() as sheetjs:

# Parse file
wb = sheetjs.read_file("pres.numbers")
print("Loaded file pres.numbers")

# Get first worksheet name
first_ws_name = wb.get_sheet_names()[0]
print(f"Reading from sheet {first_ws_name}")

# Generate DataFrame from first worksheet
df = wb.get_df(first_ws_name)
print(df.info())

# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")

读取文件

¥Reading Files

sheetjs.read_file 接受电子表格文件的路径。它将解析文件并返回代表工作簿的对象。

¥sheetjs.read_file accepts a path to a spreadsheet file. It will parse the file and return an object representing the workbook.

工作簿的 get_sheet_names 方法返回工作表名称列表。

¥The get_sheet_names method of the workbook returns a list of sheet names.

工作簿的 get_df 方法从工作簿生成一个 DataFrame。可以通过传递名称来选择特定的工作表。

¥The get_df method of the workbook generates a DataFrame from the workbook. The specific sheet can be selected by passing the name.

例如,以下代码读取 pres.numbers 并从第二个工作表生成一个 DataFrame:

¥For example, the following code reads pres.numbers and generates a DataFrame from the second worksheet:

Generating a DataFrame from the second worksheet
with SheetJSWrapper() as sheetjs:
# Parse file
wb = sheetjs.read_file(path)

# Generate DataFrame from second worksheet
ws_name = wb.get_sheet_names()[1]
df = wb.get_df(ws_name)

# Print metadata
print(df.info())

在幕后,sheetjs.py 执行以下步骤:

¥Under the hood, sheetjs.py performs the following steps:

  1. 纯 Python 操作读取电子表格文件并生成字节字符串。

    ¥Pure Python operations read the spreadsheet file and generate a byte string.

  2. SheetJS 库解析字符串并生成干净的 CSV。

    ¥SheetJS libraries parse the string and generate a clean CSV.

  • read 方法 [^3] 将文件字节解析为 SheetJS 工作簿对象 [^4]

    ¥The read method[^3] parses file bytes into a SheetJS workbook object[^4]

  • 选择工作表后,sheet_to_csv[^5] 生成 CSV 字符串

    ¥After selecting a worksheet, sheet_to_csv[^5] generates a CSV string

  1. Python 操作将 CSV 字符串转换为流对象。[^6]

    ¥Python operations convert the CSV string to a stream object.[^6]

  2. Pandas read_csv 方法 [^7] 摄取流并生成 DataFrame。

    ¥The Pandas read_csv method[^7] ingests the stream and generate a DataFrame.

写入文件

¥Writing Files

sheetjs.write_df 接受 DataFrame 和路径。它将尝试将数据导出到电子表格文件。

¥sheetjs.write_df accepts a DataFrame and a path. It will attempt to export the data to a spreadsheet file.

例如,以下代码将 DataFrame 导出到 SheetJSPandas.xlsb

¥For example, the following code exports a DataFrame to SheetJSPandas.xlsb:

Exporting a DataFrame to XLSB
with SheetJSWrapper() as sheetjs:
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")

在幕后,sheetjs.py 执行以下步骤:

¥Under the hood, sheetjs.py performs the following steps:

  1. Pandas DataFrame to_json method[^8] 生成一个 JSON 字符串。

    ¥The Pandas DataFrame to_json method[^8] generates a JSON string.

  2. JS 引擎操作将 JSON 字符串转换为对象数组。

    ¥JS engine operations translate the JSON string to an array of objects.

  3. SheetJS 库处理数据数组并生成文件字节。

    ¥SheetJS libraries process the data array and generate file bytes.

  • json_to_sheet 方法 [^9] 从数据创建一个 SheetJS 工作表对象。

    ¥The json_to_sheet method[^9] creates a SheetJS sheet object from the data.

  • book_new 方法 [^10] 创建一个包含该工作表的 SheetJS 工作簿。

    ¥The book_new method[^10] creates a SheetJS workbook that includes the sheet.

  • write 方法 [^11] 生成电子表格文件字节。

    ¥The write method[^11] generates the spreadsheet file bytes.

  1. 纯 Python 操作将字节写入文件。

    ¥Pure Python operations write the bytes to file.

完整示例

¥Complete Example

此示例将从 Apple Numbers 电子表格中提取数据并生成 DataFrame。DataFrame 将导出为二进制 XLSB 电子表格格式。

¥This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.

Windows 构建需要带有 "使用 C++ 进行桌面开发" 的 Visual Studio。命令必须在 "原生工具命令提示符" 会话中运行。

¥The Windows build requires Visual Studio with "Desktop development with C++". Commands must be run in a "Native Tools Command Prompt" session.

  1. 安装熊猫:

    ¥Install Pandas:

python3 -m pip install pandas

在 macOS 和 Linux 上,安装命令可能需要 root 访问权限:

¥On macOS and Linux, the install command may require root access:

sudo python3 -m pip install pandas

pip 没有安装时,命令会失败:

¥When pip is not installed, the command will fail:

/usr/bin/python3: No module named pip

必须安装 pip。在基于 Arch Linux 的平台(包括 Steam Deck)上,可以通过包管理器安装 python-pip

¥pip must be installed. On Arch Linux-based platforms including the Steam Deck, python-pip can be installed through the package manager:

sudo pacman -Syu python-pip

在一些本地测试中,安装失败并出现以下错误:

¥In some local tests, the install failed with the following error:

error: externally-managed-environment

必须通过包管理器安装 Pandas:

¥Pandas must be installed through the package manager:

  • Debian 和 Ubuntu 发行版:

    ¥Debian and Ubuntu distributions:

sudo apt-get install python3-pandas
  • 基于 Arch Linux 的平台,包括 Steam Deck:

    ¥Arch Linux-based platforms including the Steam Deck:

sudo pacman -Syu python-pandas
  • 带有 Homebrew 的 Python 版本的 macOS 系统:

    ¥macOS systems with a Python version from Homebrew:

sudo python3 -m pip install pandas --break-system-packages
  1. 构建 Duktape 共享库:

    ¥Build the Duktape shared library:

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
  1. 将共享库复制到当前文件夹。上次测试演示时,共享库文件名因平台而异:

    ¥Copy the shared library to the current folder. When the demo was last tested, the shared library file name differed by platform:

OSname
苹果系统libduktape.207.20700.so
Linuxlibduktape.so.207.20700
Windowsduktape.dll
cp duktape-*/libduktape.* .
  1. 下载 SheetJS Standalone 脚本并移至项目目录:

    ¥Download the SheetJS Standalone script and move to the project directory:

curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
  1. 下载以下测试脚本和文件:

    ¥Download the following test scripts and files:

curl -LO https://xlsx.nodejs.cn/pres.numbers
curl -LO https://xlsx.nodejs.cn/pandas/sheetjs.py
curl -LO https://xlsx.nodejs.cn/pandas/SheetJSPandas.py
  1. 编辑 sheetjs.py 脚本。

    ¥Edit the sheetjs.py script.

lib 变量声明库的路径:

¥The lib variable declares the path to the library:

sheetjs.py (edit highlighted line)
lib = "libduktape.207.20700.so"

库的名称是 libduktape.207.20700.so

¥The name of the library is libduktape.207.20700.so:

sheetjs.py (change highlighted line)
lib = "libduktape.207.20700.so"
  1. 运行脚本:

    ¥Run the script:

python3 SheetJSPandas.py pres.numbers

如果成功,脚本将显示 DataFrame 元数据:

¥If successful, the script will display DataFrame metadata:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Index 5 non-null int64
dtypes: int64(1), object(1)

它还会将 DataFrame 导出到 SheetJSPandas.xlsb。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。

¥It will also export the DataFrame to SheetJSPandas.xlsb. The file can be inspected with a spreadsheet editor that supports XLSB files.

其他库

¥Other Libraries

其他 Python DataFrame 库镜像 Pandas DataFrame API。

¥Other Python DataFrame libraries mirror the Pandas DataFrame API.

Polars

Polars 是一个类似的 DataFrame 库,提供 Pandas DataFrames 的许多功能。

¥Polars is a similar DataFrame library that offers many features from Pandas DataFrames.

Polars 通过封装第三方库对读取和写入电子表格提供有限的支持。在实践中,Polars 使用中间 CSV 文件与第三方库进行通信。[^12]

¥Polars includes limited support for reading and writing spreadsheets by wrapping third-party libraries. In practice, Polars communicates with the third-party libraries using intermediate CSV files.[^12]

SheetJS 支持 Polars 无法处理的许多常见电子表格格式。

¥SheetJS supports many common spreadsheet formats that Polars cannot process.

SheetJS 操作还为处理复杂工作表提供了更大的灵活性。

¥SheetJS operations also offer more flexibility in processing complex worksheets.

Pandas 示例需要进行一些细微的更改才能与 Polars 配合使用:

¥The Pandas example requires a few slight changes to work with Polars:

  • Polars DataFrames 公开 write_json 而不是 to_json

    ¥Polars DataFrames expose write_json instead of to_json:

-  json = df.to_json(orient="records")
+ json = df.write_json(row_oriented=True)
  • Polars DataFrames 不公开 info

    ¥Polars DataFrames do not expose info

Polars 演示

¥Polars Demo

测试环境

该演示在以下部署中进行了测试:

¥This demo was tested in the following deployments:

架构JS 引擎PolarsPython日期
darwin-x64杜克胶带 2.7.00.20.153.12.22024-03-15
darwin-arm杜克胶带 2.7.00.20.313.12.32024-06-30
win10-x64杜克胶带 2.7.00.20.163.12.22024-03-25
win10-arm杜克胶带 2.7.00.20.313.11.52024-06-20
linux-x64杜克胶带 2.7.00.20.163.11.32024-03-21
linux-arm杜克胶带 2.7.00.20.313.11.22024-06-20
  1. 沿着 熊猫 "完整示例" 到底。

    ¥Follow the Pandas "Complete Example" through the end.

  2. 编辑 sheetjs.py

    ¥Edit sheetjs.py.

  • 在脚本顶部附近,将导入从 pandas 更改为 polars

    ¥Near the top of the script, change the import from pandas to polars:

sheetjs.py (edit highlighted line)
from io import StringIO
from polars import read_csv

duk = CDLL(lib)
  • export_df_to_wb 函数中,更改 df.to_json 行:

    ¥Within the export_df_to_wb function, change the df.to_json line:

sheetjs.py (edit highlighted line)
def export_df_to_wb(ctx, df, path, sheet_name="Sheet1", book_type=None):
json = df.write_json(row_oriented=True)
  1. 编辑 SheetJSPandas.py

    ¥Edit SheetJSPandas.py.

  • process 函数中,将 df.info() 更改为 df

    ¥In the process function, change df.info() to df:

SheetJSPandas.py (edit highlighted line)
    # Generate DataFrame from first worksheet
df = wb.get_df()
print(df)

将导出文件名从 SheetJSPandas.xlsb 更改为 SheetJSPolars.xlsb

¥Change the export filename from SheetJSPandas.xlsb to SheetJSPolars.xlsb:

SheetJSPandas.py (edit highlighted line)
    # Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPolars.xlsb", sheet_name="DataFrame")
  1. 安装极地:

    ¥Install Polars:

python3 -m pip install polars

在 macOS 和 Linux 上,安装命令可能需要 root 访问权限:

¥On macOS and Linux, the install command may require root access:

sudo python3 -m pip install pandas

在 Windows 上,必须通过 Visual Studio 安装程序安装 C++ Clang Compiler for Windows 组件。

¥On Windows, the C++ Clang Compiler for Windows component must be installed through the Visual Studio installer.

在基于 Arch Linux 的平台(包括 Steam Deck)上,安装可能会失败:

¥On Arch Linux-based platforms including the Steam Deck, the install may fail:

error: externally-managed-environment

建议对 Polars 使用虚拟环境。

¥It is recommended to use a virtual environment for Polars.

venv 必须通过系统包管理器安装:

¥venv must be installed through the system package manager:

  • Debian 和 Ubuntu 发行版:

    ¥Debian and Ubuntu distributions:

sudo apt-get install python3.11-venv
  • venv 包含在基于 Arch Linux 的平台的 python 包中。

    ¥venv is included in the python package in Arch Linux-based platforms.

  • 带有 Homebrew 的 Python 版本的 macOS 系统:

    ¥macOS systems with a Python version from Homebrew:

brew install pyenv-virtualenv

安装 venv 后,以下命令设置虚拟环境:

¥After installing venv, the following commands set up the virtual environment:

mkdir sheetjs-polars
cd sheetjs-polars
python3 -m venv .
./bin/pip install polars
cp ../libduktape.* ../SheetJSPandas.py ../sheetjs.py ../*.js ../*.numbers .
  1. 运行脚本:

    ¥Run the script:

python3 SheetJSPandas.py pres.numbers

如果上一步配置了虚拟环境,则运行:

¥If the virtual environment was configured in the previous step, run:

./bin/python3 SheetJSPandas.py pres.numbers

如果成功,脚本将显示 DataFrame 数据:

¥If successful, the script will display DataFrame data:

shape: (5, 2)
┌──────────────┬───────┐
│ Name ┆ Index │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════╪═══════╡
│ Bill Clinton ┆ 42 │
│ GeorgeW Bush ┆ 43 │
│ Barack Obama ┆ 44 │
│ Donald Trump ┆ 45 │
│ Joseph Biden ┆ 46 │
└──────────────┴───────┘

它还会将 DataFrame 导出到 SheetJSPolars.xlsb。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。

¥It will also export the DataFrame to SheetJSPolars.xlsb. The file can be inspected with a spreadsheet editor that supports XLSB files.

[^1]: 更多示例请参见 "其他语言"

¥See "Other Languages" for more examples.

[^2]: 请参阅 Python 文档中的 ctypes

¥See ctypes in the Python documentation.

[^3]: 见 read 于 "读取文件"

¥See read in "Reading Files"

[^4]: 见 "工作簿对象"

¥See "Workbook Object"

[^5]: 见 sheet_to_csv 于 "实用工具"

¥See sheet_to_csv in "Utilities"

[^6]: 请参阅 Pandas 文档中的 "IO 工具" 中的例子

¥See the examples in "IO tools" in the Pandas documentation.

[^7]: 请参阅 Pandas 文档中的 pandas.read_csv

¥See pandas.read_csv in the Pandas documentation.

[^8]: 请参阅 Pandas 文档中的 pandas.DataFrame.to_json

¥See pandas.DataFrame.to_json in the Pandas documentation.

[^9]: 见 json_to_sheet 于 "实用工具"

¥See json_to_sheet in "Utilities"

[^10]: 见 book_new 于 "实用工具"

¥See book_new in "Utilities"

[^11]: 见 write 于 "写入文件"

¥See write in "Writing Files"

[^12]: 正如 在 Polars 文档中 所解释的,“...目标 Excel 工作表首先转换为 CSV ...然后用 Polars 的 read_csv() 函数解析。”

¥As explained in the Polars documentation, "... the target Excel sheet is first converted to CSV ... and then parsed with Polars’ read_csv() function."