Python 中的电子表格数据
熊猫 是一个用于数据分析的 Python 库。
¥Pandas is a Python library for data analysis.
SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。
¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.
该演示使用 SheetJS 处理电子表格中的数据并将其转换为 Pandas DataFrame 格式。我们将探索如何从 Python 脚本加载 SheetJS、从工作簿生成 DataFrame 以及将 DataFrame 写回工作簿。
¥This demo uses SheetJS to process data from a spreadsheet and translate to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.
"完整示例" 包括一个封装库,可简化电子表格的导入和导出。
¥The "Complete Example" includes a wrapper library that simplifies importing and exporting spreadsheets.
Pandas 对读取电子表格 (pandas.from_excel
) 和编写 XLSX 电子表格 (pandas.DataFrame.to_excel
) 提供有限支持。
¥Pandas includes limited support for reading spreadsheets (pandas.from_excel
)
and writing XLSX spreadsheets (pandas.DataFrame.to_excel
).
SheetJS 支持许多 Pandas 无法处理的常见电子表格格式。
¥SheetJS supports many common spreadsheet formats that Pandas cannot process.
SheetJS 操作还为处理复杂工作表提供了更大的灵活性。
¥SheetJS operations also offer more flexibility in processing complex worksheets.
该演示在以下部署中进行了测试:
¥This demo was tested in the following deployments:
架构 | JS 引擎 | 熊猫 | Python | 日期 |
---|---|---|---|---|
darwin-x64 | 杜克胶带 2.7.0 | 2.2.1 | 3.12.2 | 2024-03-15 |
darwin-arm | 杜克胶带 2.7.0 | 2.2.2 | 3.12.3 | 2024-06-30 |
win10-x64 | 杜克胶带 2.7.0 | 2.2.1 | 3.12.2 | 2024-03-25 |
win11-arm | 杜克胶带 2.7.0 | 2.2.2 | 3.11.5 | 2024-06-20 |
linux-x64 | 杜克胶带 2.7.0 | 1.5.3 | 3.11.3 | 2024-03-21 |
linux-arm | 杜克胶带 2.7.0 | 1.5.3 | 3.11.2 | 2024-06-20 |
集成详情
¥Integration Details
sheetjs.py
是一个封装脚本,提供用于读取和写入电子表格的辅助方法。安装说明包含在 "完整示例" 部分中。
¥sheetjs.py
is a wrapper script that provides
helper methods for reading and writing spreadsheets. Installation notes are
included in the "Complete Example" section.
Python 中的 JS
¥JS in Python
JS 代码无法在 Python 实现中直接计算。
¥JS code cannot be directly evaluated in Python implementations.
要从 Python 运行 JS 代码,JavaScript 引擎 [^1] 可以嵌入到 Python 模块中或使用 ctypes
外部函数库 [^2] 动态加载。该演示使用 ctypes
和 杜克胶带发动机。
¥To run JS code from Python, JavaScript engines[^1] can be embedded in Python
modules or dynamically loaded using the ctypes
foreign function library[^2].
This demo uses ctypes
with the Duktape engine.
封装器
¥Wrapper
该脚本导出一个名为 SheetJSWrapper
的类。它是一个上下文管理器,用于初始化 Duktape 引擎并在入口处执行 SheetJS 脚本。所有工作都应在以下上下文中进行:
¥The script exports a class named SheetJSWrapper
. It is a context manager that
initializes the Duktape engine and executes SheetJS scripts on entrance. All
work should be performed in the context:
#!/usr/bin/env python3
from sheetjs import SheetJSWrapper
with SheetJSWrapper() as sheetjs:
# Parse file
wb = sheetjs.read_file("pres.numbers")
print("Loaded file pres.numbers")
# Get first worksheet name
first_ws_name = wb.get_sheet_names()[0]
print(f"Reading from sheet {first_ws_name}")
# Generate DataFrame from first worksheet
df = wb.get_df(first_ws_name)
print(df.info())
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
读取文件
¥Reading Files
sheetjs.read_file
接受电子表格文件的路径。它将解析文件并返回代表工作簿的对象。
¥sheetjs.read_file
accepts a path to a spreadsheet file. It will parse the file
and return an object representing the workbook.
工作簿的 get_sheet_names
方法返回工作表名称列表。
¥The get_sheet_names
method of the workbook returns a list of sheet names.
工作簿的 get_df
方法从工作簿生成一个 DataFrame。可以通过传递名称来选择特定的工作表。
¥The get_df
method of the workbook generates a DataFrame from the workbook. The
specific sheet can be selected by passing the name.
例如,以下代码读取 pres.numbers
并从第二个工作表生成一个 DataFrame:
¥For example, the following code reads pres.numbers
and generates a DataFrame
from the second worksheet:
with SheetJSWrapper() as sheetjs:
# Parse file
wb = sheetjs.read_file(path)
# Generate DataFrame from second worksheet
ws_name = wb.get_sheet_names()[1]
df = wb.get_df(ws_name)
# Print metadata
print(df.info())
在幕后,sheetjs.py
执行以下步骤:
¥Under the hood, sheetjs.py
performs the following steps:
-
纯 Python 操作读取电子表格文件并生成字节字符串。
¥Pure Python operations read the spreadsheet file and generate a byte string.
-
SheetJS 库解析字符串并生成干净的 CSV。
¥SheetJS libraries parse the string and generate a clean CSV.
-
read
方法 [^3] 将文件字节解析为 SheetJS 工作簿对象 [^4]¥The
read
method[^3] parses file bytes into a SheetJS workbook object[^4] -
选择工作表后,
sheet_to_csv
[^5] 生成 CSV 字符串¥After selecting a worksheet,
sheet_to_csv
[^5] generates a CSV string
-
Python 操作将 CSV 字符串转换为流对象。[^6]
¥Python operations convert the CSV string to a stream object.[^6]
-
Pandas
read_csv
方法 [^7] 摄取流并生成 DataFrame。¥The Pandas
read_csv
method[^7] ingests the stream and generate a DataFrame.
写入文件
¥Writing Files
sheetjs.write_df
接受 DataFrame 和路径。它将尝试将数据导出到电子表格文件。
¥sheetjs.write_df
accepts a DataFrame and a path. It will attempt to export
the data to a spreadsheet file.
例如,以下代码将 DataFrame 导出到 SheetJSPandas.xlsb
:
¥For example, the following code exports a DataFrame to SheetJSPandas.xlsb
:
with SheetJSWrapper() as sheetjs:
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
在幕后,sheetjs.py
执行以下步骤:
¥Under the hood, sheetjs.py
performs the following steps:
-
Pandas DataFrame
to_json
method[^8] 生成一个 JSON 字符串。¥The Pandas DataFrame
to_json
method[^8] generates a JSON string. -
JS 引擎操作将 JSON 字符串转换为对象数组。
¥JS engine operations translate the JSON string to an array of objects.
-
SheetJS 库处理数据数组并生成文件字节。
¥SheetJS libraries process the data array and generate file bytes.
-
json_to_sheet
方法 [^9] 从数据创建一个 SheetJS 工作表对象。¥The
json_to_sheet
method[^9] creates a SheetJS sheet object from the data. -
book_new
方法 [^10] 创建一个包含该工作表的 SheetJS 工作簿。¥The
book_new
method[^10] creates a SheetJS workbook that includes the sheet. -
write
方法 [^11] 生成电子表格文件字节。¥The
write
method[^11] generates the spreadsheet file bytes.
-
纯 Python 操作将字节写入文件。
¥Pure Python operations write the bytes to file.
完整示例
¥Complete Example
此示例将从 Apple Numbers 电子表格中提取数据并生成 DataFrame。DataFrame 将导出为二进制 XLSB 电子表格格式。
¥This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.
Windows 构建需要带有 "使用 C++ 进行桌面开发" 的 Visual Studio。命令必须在 "原生工具命令提示符" 会话中运行。
¥The Windows build requires Visual Studio with "Desktop development with C++". Commands must be run in a "Native Tools Command Prompt" session.
-
安装熊猫:
¥Install Pandas:
python3 -m pip install pandas
在 macOS 和 Linux 上,安装命令可能需要 root 访问权限:
¥On macOS and Linux, the install command may require root access:
sudo python3 -m pip install pandas
当 pip
没有安装时,命令会失败:
¥When pip
is not installed, the command will fail:
/usr/bin/python3: No module named pip
必须安装 pip
。在基于 Arch Linux 的平台(包括 Steam Deck)上,可以通过包管理器安装 python-pip
:
¥pip
must be installed. On Arch Linux-based platforms including the Steam Deck,
python-pip
can be installed through the package manager:
sudo pacman -Syu python-pip
在一些本地测试中,安装失败并出现以下错误:
¥In some local tests, the install failed with the following error:
error: externally-managed-environment
必须通过包管理器安装 Pandas:
¥Pandas must be installed through the package manager:
-
Debian 和 Ubuntu 发行版:
¥Debian and Ubuntu distributions:
sudo apt-get install python3-pandas
-
基于 Arch Linux 的平台,包括 Steam Deck:
¥Arch Linux-based platforms including the Steam Deck:
sudo pacman -Syu python-pandas
-
带有 Homebrew 的 Python 版本的 macOS 系统:
¥macOS systems with a Python version from Homebrew:
sudo python3 -m pip install pandas --break-system-packages
-
构建 Duktape 共享库:
¥Build the Duktape shared library:
- MacOS
- Linux
- Windows
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
-
下载并解压源 tarball。命令必须在 WSL
bash
中运行:¥Download and extract the source tarball. Commands must be run in WSL
bash
:
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
(运行 bash
,然后运行上述命令,然后运行 exit
退出 WSL)
¥(Run bash
, then run the aforementioned commands, then run exit
to exit WSL)
-
进入源文件夹:
¥Enter the source folder:
cd duktape-2.7.0
-
编辑
src\duk_config.h
并将高亮的行添加到文件末尾:¥Edit
src\duk_config.h
and add the highlighted lines to the end of the file:
#endif /* DUK_CONFIG_H_INCLUDED */
#define DUK_EXTERNAL_DECL extern __declspec(dllexport)
#define DUK_EXTERNAL __declspec(dllexport)
-
构建 Duktape DLL:
¥Build the Duktape DLL:
cl /O2 /W3 /Isrc /LD /DDUK_SINGLE_FILE /DDUK_F_DLL_BUILD /DDUK_F_WINDOWS /DDUK_COMPILING_DUKTAPE src\\duktape.c
-
上移到父目录:
¥Move up to the parent directory:
cd ..
-
将共享库复制到当前文件夹。上次测试演示时,共享库文件名因平台而异:
¥Copy the shared library to the current folder. When the demo was last tested, the shared library file name differed by platform:
OS | name |
---|---|
苹果系统 | libduktape.207.20700.so |
Linux | libduktape.so.207.20700 |
Windows | duktape.dll |
- MacOS
- Linux
- Windows
cp duktape-*/libduktape.* .
cp duktape-*/libduktape.* .
copy duktape-2.7.0\duktape.dll .
-
下载 SheetJS Standalone 脚本并移至项目目录:
¥Download the SheetJS Standalone script and move to the project directory:
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
-
下载以下测试脚本和文件:
¥Download the following test scripts and files:
curl -LO https://xlsx.nodejs.cn/pres.numbers
curl -LO https://xlsx.nodejs.cn/pandas/sheetjs.py
curl -LO https://xlsx.nodejs.cn/pandas/SheetJSPandas.py
-
编辑
sheetjs.py
脚本。¥Edit the
sheetjs.py
script.
lib
变量声明库的路径:
¥The lib
variable declares the path to the library:
lib = "libduktape.207.20700.so"
- MacOS
- Linux
- Windows
库的名称是 libduktape.207.20700.so
:
¥The name of the library is libduktape.207.20700.so
:
lib = "libduktape.207.20700.so"
库的名称是 libduktape.so.207.20700
:
¥The name of the library is libduktape.so.207.20700
:
lib = "./libduktape.so.207.20700"
库的名称是 duktape.dll
:
¥The name of the library is duktape.dll
:
lib = ".\\duktape.dll"
此外,还必须进行以下更改:
¥In addition, the following changes must be made:
-
str_to_c
必须定义如下:¥
str_to_c
must be defined as follows:
def str_to_c(s):
b = s
if type(b) == str: b = s.encode("latin1")
return [c_char_p(b), len(b)]
-
eval_file
必须open
且模式为rb
:¥
eval_file
mustopen
with moderb
:
def eval_file(ctx, path):
with open(path, "rb") as f:
code = f.read()
-
运行脚本:
¥Run the script:
python3 SheetJSPandas.py pres.numbers
如果成功,脚本将显示 DataFrame 元数据:
¥If successful, the script will display DataFrame metadata:
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Index 5 non-null int64
dtypes: int64(1), object(1)
它还会将 DataFrame 导出到 SheetJSPandas.xlsb
。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。
¥It will also export the DataFrame to SheetJSPandas.xlsb
. The file can be
inspected with a spreadsheet editor that supports XLSB files.
其他库
¥Other Libraries
其他 Python DataFrame 库镜像 Pandas DataFrame API。
¥Other Python DataFrame libraries mirror the Pandas DataFrame API.
Polars
Polars 是一个类似的 DataFrame 库,提供 Pandas DataFrames 的许多功能。
¥Polars is a similar DataFrame library that offers many features from Pandas DataFrames.
Polars 通过封装第三方库对读取和写入电子表格提供有限的支持。在实践中,Polars 使用中间 CSV 文件与第三方库进行通信。[^12]
¥Polars includes limited support for reading and writing spreadsheets by wrapping third-party libraries. In practice, Polars communicates with the third-party libraries using intermediate CSV files.[^12]
SheetJS 支持 Polars 无法处理的许多常见电子表格格式。
¥SheetJS supports many common spreadsheet formats that Polars cannot process.
SheetJS 操作还为处理复杂工作表提供了更大的灵活性。
¥SheetJS operations also offer more flexibility in processing complex worksheets.
Pandas 示例需要进行一些细微的更改才能与 Polars 配合使用:
¥The Pandas example requires a few slight changes to work with Polars:
-
Polars DataFrames 公开
write_json
而不是to_json
:¥Polars DataFrames expose
write_json
instead ofto_json
:
- json = df.to_json(orient="records")
+ json = df.write_json(row_oriented=True)
-
Polars DataFrames 不公开
info
¥Polars DataFrames do not expose
info
Polars 演示
¥Polars Demo
该演示在以下部署中进行了测试:
¥This demo was tested in the following deployments:
架构 | JS 引擎 | Polars | Python | 日期 |
---|---|---|---|---|
darwin-x64 | 杜克胶带 2.7.0 | 0.20.15 | 3.12.2 | 2024-03-15 |
darwin-arm | 杜克胶带 2.7.0 | 0.20.31 | 3.12.3 | 2024-06-30 |
win10-x64 | 杜克胶带 2.7.0 | 0.20.16 | 3.12.2 | 2024-03-25 |
win10-arm | 杜克胶带 2.7.0 | 0.20.31 | 3.11.5 | 2024-06-20 |
linux-x64 | 杜克胶带 2.7.0 | 0.20.16 | 3.11.3 | 2024-03-21 |
linux-arm | 杜克胶带 2.7.0 | 0.20.31 | 3.11.2 | 2024-06-20 |
-
沿着 熊猫 "完整示例" 到底。
¥Follow the Pandas "Complete Example" through the end.
-
编辑
sheetjs.py
。¥Edit
sheetjs.py
.
-
在脚本顶部附近,将导入从
pandas
更改为polars
:¥Near the top of the script, change the import from
pandas
topolars
:
from io import StringIO
from polars import read_csv
duk = CDLL(lib)
-
在
export_df_to_wb
函数中,更改df.to_json
行:¥Within the
export_df_to_wb
function, change thedf.to_json
line:
def export_df_to_wb(ctx, df, path, sheet_name="Sheet1", book_type=None):
json = df.write_json(row_oriented=True)
-
编辑
SheetJSPandas.py
。¥Edit
SheetJSPandas.py
.
-
在
process
函数中,将df.info()
更改为df
:¥In the
process
function, changedf.info()
todf
:
# Generate DataFrame from first worksheet
df = wb.get_df()
print(df)
将导出文件名从 SheetJSPandas.xlsb
更改为 SheetJSPolars.xlsb
:
¥Change the export filename from SheetJSPandas.xlsb
to SheetJSPolars.xlsb
:
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPolars.xlsb", sheet_name="DataFrame")
-
安装极地:
¥Install Polars:
python3 -m pip install polars
在 macOS 和 Linux 上,安装命令可能需要 root 访问权限:
¥On macOS and Linux, the install command may require root access:
sudo python3 -m pip install pandas
在 Windows 上,必须通过 Visual Studio 安装程序安装 C++ Clang Compiler for Windows
组件。
¥On Windows, the C++ Clang Compiler for Windows
component must be installed
through the Visual Studio installer.
在基于 Arch Linux 的平台(包括 Steam Deck)上,安装可能会失败:
¥On Arch Linux-based platforms including the Steam Deck, the install may fail:
error: externally-managed-environment
建议对 Polars 使用虚拟环境。
¥It is recommended to use a virtual environment for Polars.
venv
必须通过系统包管理器安装:
¥venv
must be installed through the system package manager:
-
Debian 和 Ubuntu 发行版:
¥Debian and Ubuntu distributions:
sudo apt-get install python3.11-venv
-
venv
包含在基于 Arch Linux 的平台的python
包中。¥
venv
is included in thepython
package in Arch Linux-based platforms. -
带有 Homebrew 的 Python 版本的 macOS 系统:
¥macOS systems with a Python version from Homebrew:
brew install pyenv-virtualenv
安装 venv
后,以下命令设置虚拟环境:
¥After installing venv
, the following commands set up the virtual environment:
mkdir sheetjs-polars
cd sheetjs-polars
python3 -m venv .
./bin/pip install polars
cp ../libduktape.* ../SheetJSPandas.py ../sheetjs.py ../*.js ../*.numbers .
-
运行脚本:
¥Run the script:
python3 SheetJSPandas.py pres.numbers
如果上一步配置了虚拟环境,则运行:
¥If the virtual environment was configured in the previous step, run:
./bin/python3 SheetJSPandas.py pres.numbers
如果成功,脚本将显示 DataFrame 数据:
¥If successful, the script will display DataFrame data:
shape: (5, 2)
┌──────────────┬───────┐
│ Name ┆ Index │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════════╪═══════╡
│ Bill Clinton ┆ 42 │
│ GeorgeW Bush ┆ 43 │
│ Barack Obama ┆ 44 │
│ Donald Trump ┆ 45 │
│ Joseph Biden ┆ 46 │
└──────────────┴───────┘
它还会将 DataFrame 导出到 SheetJSPolars.xlsb
。可以使用支持 XLSB 文件的电子表格编辑器检查该文件。
¥It will also export the DataFrame to SheetJSPolars.xlsb
. The file can be
inspected with a spreadsheet editor that supports XLSB files.
[^1]: 更多示例请参见 "其他语言"。
¥See "Other Languages" for more examples.
[^2]: 请参阅 Python 文档中的 ctypes
。
¥See ctypes
in the Python documentation.
[^3]: 见 read
于 "读取文件"
[^4]: 见 "工作簿对象"
¥See "Workbook Object"
[^5]: 见 sheet_to_csv
于 "实用工具"
¥See sheet_to_csv
in "Utilities"
[^6]: 请参阅 Pandas 文档中的 "IO 工具" 中的例子。
¥See the examples in "IO tools" in the Pandas documentation.
[^7]: 请参阅 Pandas 文档中的 pandas.read_csv
。
¥See pandas.read_csv
in the Pandas documentation.
[^8]: 请参阅 Pandas 文档中的 pandas.DataFrame.to_json
。
¥See pandas.DataFrame.to_json
in the Pandas documentation.
[^9]: 见 json_to_sheet
于 "实用工具"
¥See json_to_sheet
in "Utilities"
[^10]: 见 book_new
于 "实用工具"
[^11]: 见 write
于 "写入文件"
[^12]: 正如 在 Polars 文档中 所解释的,“...目标 Excel 工作表首先转换为 CSV ...然后用 Polars 的 read_csv()
函数解析。”
¥As explained in the Polars documentation, "... the target Excel sheet is first converted to CSV ... and then parsed with Polars’ read_csv()
function."