浏览器自动化
无头自动化涉及控制 "无头浏览器" 访问网站并提交或下载数据。还可以使用自定义浏览器扩展来自动化浏览器。
¥Headless automation involves controlling "headless browsers" to access websites and submit or download data. It is also possible to automate browsers using custom browser extensions.
通过插入 SCRIPT
标签,可以将 SheetJS 独立脚本 添加到任何网站。无头浏览器通常提供实用函数,用于在浏览器中运行自定义片段并将数据传递回自动化脚本。
¥The SheetJS standalone scripts
can be added to any website by inserting a SCRIPT
tag. Headless browsers
usually provide utility functions for running custom snippets in the browser and
passing data back to the automation script.
使用案例
¥Use Case
此演示重点是将表数据导出到工作簿。无头浏览器通常不支持在浏览器上下文和自动化脚本之间传递对象,因此必须在浏览器上下文中生成文件数据并将其发送回自动化脚本以保存在文件系统中。
¥This demo focuses on exporting table data to a workbook. Headless browsers do not generally support passing objects between the browser context and the automation script, so the file data must be generated in the browser context and sent back to the automation script for saving in the file system.
该演示从 https://sheetjs.com/demos/table 导出数据。
¥This demo exports data from https://sheetjs.com/demos/table.
也可以从浏览器上下文解析文件,但从自动化上下文解析更有效,强烈建议这样做。
¥It is also possible to parse files from the browser context, but parsing from the automation context is more efficient and strongly recommended.
关键步骤
¥Key Steps
-
启动无头浏览器并加载目标站点。
¥Launch the headless browser and load the target site.
-
将独立的 SheetJS 构建添加到页面的
SCRIPT
标记中。¥Add the standalone SheetJS build to the page in a
SCRIPT
tag. -
将脚本添加到页面(在浏览器上下文中),该脚本将:
¥Add a script to the page (in the browser context) that will:
-
使用 SheetJS
table_to_book
[^2] 方法从第一个表创建 SheetJS 工作簿对象 [^1]。¥Make a SheetJS workbook object[^1] from the first table using the SheetJS
table_to_book
[^2] method. -
使用 SheetJS
write
[^3] 方法为 XLSB 文件生成字节。¥Generate the bytes for an XLSB file using the SheetJS
write
[^3] method. -
将字节发送回自动化脚本
¥Send the bytes back to the automation script
-
当自动化上下文接收到数据时,保存到文件中
¥When the automation context receives data, save to a file
Puppeteer
Puppeteer 为 NodeJS 和 BunJS 启用了无头 Chromium 自动化。发布附带安装无头浏览器的脚本。
¥Puppeteer enables headless Chromium automation for NodeJS and BunJS. Releases ship with a script that installs a headless browser.
二进制字符串是最受欢迎的数据类型。它们可以安全地从浏览器上下文传递到自动化脚本。NodeJS 提供了一个 API 将二进制字符串写入文件(fs.writeFileSync
使用编码 binary
)。
¥Binary strings are the favored data type. They can be safely passed from the
browser context to the automation script. NodeJS provides an API to write
binary strings to file (fs.writeFileSync
using encoding binary
).
关键步骤注释如下:
¥The key steps are commented below:
const fs = require("fs");
const puppeteer = require('puppeteer');
(async () => {
/* (1) Load the target page */
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewport({width: 1920, height: 1080});
await page.goto('https://sheetjs.com/demos/table');
/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js' });
/* (3) Run the snippet in browser and return data */
const bin = await page.evaluate(() => {
/* NOTE: this function will be evaluated in the browser context.
`page`, `fs` and `puppeteer` are not available.
`XLSX` will be available thanks to step 2 */
/* find first table */
var table = document.body.getElementsByTagName('table')[0];
/* call table_to_book on first table */
var wb = XLSX.utils.table_to_book(table);
/* generate XLSB and return binary string */
return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
});
/* (4) write data to file */
fs.writeFileSync("SheetJSPuppeteer.xlsb", bin, { encoding: "binary" });
await browser.close();
})();
演示
¥Demo
该演示在以下部署中进行了测试:
¥This demo was tested in the following deployments:
Puppeteer | 日期 |
---|---|
23.11.1 | 2024-12-31 |
22.15.0 | 2024-12-31 |
21.11.0 | 2024-12-31 |
20.9.0 | 2024-12-31 |
15.5.0 | 2024-12-31 |
10.4.0 | 2024-12-31 |
-
安装 SheetJS 和 Puppeteer:
¥Install SheetJS and Puppeteer:
- NodeJS
- BunJS
npm i --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz puppeteer@23.11.1
bun install https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz puppeteer@23.11.1
-
将
SheetJSPuppeteer.js
代码片段保存到SheetJSPuppeteer.js
。¥Save the
SheetJSPuppeteer.js
code snippet toSheetJSPuppeteer.js
. -
运行脚本:
¥Run the script:
- NodeJS
- BunJS
node SheetJSPuppeteer.js
bun SheetJSPuppeteer.js
脚本完成后,将创建文件 SheetJSPuppeteer.xlsb
。可以使用支持 XLSB 工作簿的电子表格编辑器打开此文件。
¥When the script finishes, the file SheetJSPuppeteer.xlsb
will be created.
This file can be opened with a spreadsheet editor that supports XLSB workbooks.
Playwright
Playwright 为 Chromium、WebKit 和其他浏览器提供了统一的脚本框架。它从 Puppeteer 中汲取灵感。事实上,示例代码几乎相同!
¥Playwright presents a unified scripting framework for Chromium, WebKit, and other browsers. It draws inspiration from Puppeteer. In fact, the example code is almost identical!
下面重点介绍了与 Puppeteer 示例的差异:
¥Differences from the Puppeteer example are highlighted below:
const fs = require("fs");
const { webkit } = require('playwright'); // import desired browser
(async () => {
/* (1) Load the target page */
const browser = await webkit.launch(); // launch desired browser
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewportSize({width: 1920, height: 1080}); // different name :(
await page.goto('https://sheetjs.com/demos/table');
/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js' });
/* (3) Run the snippet in browser and return data */
const bin = await page.evaluate(() => {
/* NOTE: this function will be evaluated in the browser context.
`page`, `fs` and the browser engine are not available.
`XLSX` will be available thanks to step 2 */
/* find first table */
var table = document.body.getElementsByTagName('table')[0];
/* call table_to_book on first table */
var wb = XLSX.utils.table_to_book(table);
/* generate XLSB and return binary string */
return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
});
/* (4) write data to file */
fs.writeFileSync("SheetJSPlaywright.xlsb", bin, { encoding: "binary" });
await browser.close();
})();
演示
¥Demo
该演示在以下部署中进行了测试:
¥This demo was tested in the following deployments:
Playwright | 日期 |
---|---|
1.49.1 | 2024-12-31 |
-
安装 SheetJS 和 Playwright:
¥Install SheetJS and Playwright:
- NodeJS
- BunJS
npm i --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz playwright@1.49.1
bun install https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz playwright@1.49.1
-
将
SheetJSPlaywright.js
代码片段保存到SheetJSPlaywright.js
。¥Save the
SheetJSPlaywright.js
code snippet toSheetJSPlaywright.js
. -
运行脚本
¥Run the script
- NodeJS
- BunJS
node SheetJSPlaywright.js
bun SheetJSPlaywright.js
脚本完成后,将创建文件 SheetJSPlaywright.xlsb
。可以使用支持 XLSB 工作簿的电子表格编辑器打开此文件。
¥When the script finishes, the file SheetJSPlaywright.xlsb
will be created.
This file can be opened with a spreadsheet editor that supports XLSB workbooks.
该命令可能会失败并显示如下消息:
¥The commmand may fail with a message such as:
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers: ║
║ ║
║ npx playwright install ║
║ ║
║ <3 Playwright Team ║
╚═════════════════════════════════════════════════════════════════════════╝
运行推荐的命令将下载并安装浏览器引擎:
¥Running the recommended command will download and install browser engines:
npx playwright install
安装引擎后,重新运行脚本。
¥After installing engines, re-run the script.
PhantomJS
PhantomJS 是一款由 WebKit 提供支持的无头 Web 浏览器。
¥PhantomJS is a headless web browser powered by WebKit.
此信息是为旧部署提供的。PhantomJS 开发已暂停,并且存在已知漏洞,因此新项目应该使用替代方案。对于 WebKit 自动化,新项目应使用 Playwright。
¥This information is provided for legacy deployments. PhantomJS development has been suspended and there are known vulnerabilities, so new projects should use alternatives. For WebKit automation, new projects should use Playwright.
二进制字符串是最受欢迎的数据类型。它们可以安全地从浏览器上下文传递到自动化脚本。PhantomJS 提供了一个 API 将二进制字符串写入文件(fs.write
使用模式 wb
)。
¥Binary strings are the favored data type. They can be safely passed from the
browser context to the automation script. PhantomJS provides an API to write
binary strings to file (fs.write
using mode wb
).
本 demo 在以下环境下进行了测试:
¥This demo was tested in the following environments:
架构 | PhantomJS | 日期 |
---|---|---|
darwin-x64 | 2.1.1 | 2024-12-17 |
win11-x64 | 2.1.1 | 2025-01-19 |
linux-x64 | 2.1.1 | 2025-01-07 |
Integration Details and Demo (click to show)
The steps are marked in the comments:
var page = require('webpage').create();
page.onConsoleMessage = function(msg) { console.log(msg); };
/* (1) Load the target page */
page.open('https://sheetjs.com/demos/table', function() {
/* (2) Load the standalone SheetJS build from the CDN */
page.includeJs("https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js", function() {
/* (3) Run the snippet in browser and return data */
var bin = page.evaluateJavaScript([ "function(){",
/* find first table */
"var table = document.body.getElementsByTagName('table')[0];",
/* call table_to_book on first table */
"var wb = XLSX.utils.table_to_book(table);",
/* generate XLSB file and return binary string */
"return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});",
"}" ].join(""));
/* (4) write data to file */
require("fs").write("SheetJSPhantomJS.xlsb", bin, "wb");
phantom.exit();
});
});
PhantomJS is very finicky and will hang if there are script errors. It is strongly recommended to add verbose logging and to lint scripts before use.
Demo
-
Save the
SheetJSPhantom.js
code snippet toSheetJSPhantom.js
. -
Run the
phantomjs
program and pass the script as the first argument.
For example, if the macOS Archive Utility unzipped the 2.1.1
release, binaries
will be placed in phantomjs-2.1.1-macosx/bin/
and the command will be:
./phantomjs-2.1.1-macosx/bin/phantomjs SheetJSPhantom.js
When the script finishes, the file SheetJSPhantomJS.xlsb
will be created.
This file can be opened with Excel.
When this demo was last tested on Linux, there were multiple errors.
This application failed to start because it could not find or load the Qt platform plugin "xcb".
The environment variable QT_QPA_PLATFORM=phantom
resolves the issue. There is
a different error after assignment:
140412268664640:error:25066067:DSO support routines:DLFCN_LOAD:could not load the shared library:dso_dlfcn.c:185:filename(libproviders.so): libproviders.so: cannot open shared object file: No such file or directory
140412268664640:error:25070067:DSO support routines:DSO_load:could not load the shared library:dso_lib.c:244:
140412268664640:error:0E07506E:configuration file routines:MODULE_LOAD_DSO:error loading dso:conf_mod.c:285:module=providers, path=providers
140412268664640:error:0E076071:configuration file routines:MODULE_RUN:unknown module name:conf_mod.c:222:module=providers
This error is resolved by ignoring SSL errors. The complete command is:
env OPENSSL_CONF=/dev/null QT_QPA_PLATFORM=phantom ./phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ignore-ssl-errors=true test.js
[^1]: 有关 SheetJS 工作簿对象的更多详细信息,请参阅 "工作簿对象"。
¥See "Workbook Object" for more details about the SheetJS workbook object.
[^2]: 见 "HTML" 实用程序中的 table_to_book
¥See table_to_book
in "HTML" Utilities
[^3]: 见 write
于 "写入文件"