Skip to main content

大型数据集

为了实现最大兼容性,SheetJS API 函数将整个文件读入内存并将文件写入内存。浏览器和其他 JS 引擎强制执行严格的内存限制。该库提供了优化内存使用的替代策略。

¥For maximal compatibility, SheetJS API functions read entire files into memory and write files in memory. Browsers and other JS engines enforce tight memory limits. The library offers alternate strategies to optimize for memory usage.

密集模式

¥Dense Mode

密集模式工作表 将单元格存储在数组的数组中,旨在解决 Google Chrome 性能回归问题。为了向后兼容,默认情况下不会创建密集模式工作表。

¥Dense mode worksheets, which store cells in arrays of arrays, are designed to work around Google Chrome performance regressions. For backwards compatibility, dense mode worksheets are not created by default.

readreadFileaoa_to_sheet 接受 dense 选项。启用后,这些方法将创建将单元格存储在数组数组中的工作表对象:

¥read, readFile and aoa_to_sheet accept the dense option. When enabled, the methods create worksheet objects that store cells in arrays of arrays:

var dense_wb = XLSX.read(ab, {dense: true});

var dense_sheet = XLSX.utils.aoa_to_sheet(aoa, {dense: true});
Historical Note (click to show)

The earliest versions of the library aimed for IE6+ compatibility. In early testing, both in Chrome 26 and in IE6, the most efficient worksheet storage for small sheets was a large object whose keys were cell addresses.

Over time, V8 (the engine behind Chrome and NodeJS) evolved in a way that made the array of arrays approach more efficient but reduced the performance of the large object approach.

In the interest of preserving backwards compatibility, the library opts to make the array of arrays approach available behind a special dense option.

各种 API 函数将无缝处理密集和稀疏的工作表。

¥The various API functions will seamlessly handle dense and sparse worksheets.

流式写入

¥Streaming Write

XLSX.stream 对象中提供了流式写入功能。它们采用与普通写入函数相同的参数:

¥The streaming write functions are available in the XLSX.stream object. They take the same arguments as the normal write functions:

  • XLSX.stream.to_csvXLSX.utils.sheet_to_csv 的流式版本。

    ¥XLSX.stream.to_csv is the streaming version of XLSX.utils.sheet_to_csv.

  • XLSX.stream.to_htmlXLSX.utils.sheet_to_html 的流式版本。

    ¥XLSX.stream.to_html is the streaming version of XLSX.utils.sheet_to_html.

  • XLSX.stream.to_jsonXLSX.utils.sheet_to_json 的流式版本。

    ¥XLSX.stream.to_json is the streaming version of XLSX.utils.sheet_to_json.

  • XLSX.stream.to_xlml 是流式 SpreadsheetML2003 工作簿编写器。

    ¥XLSX.stream.to_xlml is the streaming SpreadsheetML2003 workbook writer.

这些功能在 "流导出" 部分中介绍。

¥These functions are covered in the "Stream Export" section.

此功能在版本 0.20.3 中进行了扩展。强烈推荐给 升级到最新版本

¥This feature was expanded in version 0.20.3. It is strongly recommended to upgrade to the latest version.

NodeJS

在 CommonJS 上下文中,NodeJS Streams 和 fs 立即与 SheetJS 一起使用:

¥In a CommonJS context, NodeJS Streams and fs immediately work with SheetJS:

const XLSX = require("xlsx"); // "just works"
ECMAScript 模块机制

在 NodeJS ESM 中,必须手动加载依赖:

¥In NodeJS ESM, the dependency must be loaded manually:

import * as XLSX from 'xlsx';
import { Readable } from 'stream';

XLSX.stream.set_readable(Readable); // manually load stream helpers

此外,对于 NodeJS ESM 中与文件相关的操作,必须加载 fs

¥Additionally, for file-related operations in NodeJS ESM, fs must be loaded:

import * as XLSX from 'xlsx';
import * as fs from 'fs';

XLSX.set_fs(fs); // manually load fs helpers

强烈建议尽可能在 NodeJS 中使用 CommonJS。

¥It is strongly encouraged to use CommonJS in NodeJS whenever possible.

文本流

¥Text Streams

to_csvto_htmlto_xlml 触发字符串。数据可直接推送到 Writable 流。fs.createWriteStream[^1] 是 NodeJS 中流式传输到文件的推荐方法。

¥to_csv, to_html, and to_xlml emit strings. The data can be directly pushed to a Writable stream. fs.createWriteStream[^1] is the recommended approach for streaming to a file in NodeJS.

此示例读取作为参数传递给脚本的工作表,提取第一个工作表,转换为 CSV 并写入 SheetJSNodeJStream.csv

¥This example reads a worksheet passed as an argument to the script, pulls the first worksheet, converts to CSV and writes to SheetJSNodeJStream.csv:

var XLSX = require("xlsx"), fs = require("fs");

/* read file */
var wb = XLSX.readFile(process.argv[2]), {dense: true};

/* get first worksheet */
var ws = wb.Sheets[wb.SheetNames[0]];

/* create CSV stream */
var csvstream = XLSX.stream.to_csv(ws);

/* create output stream */
var ostream = fs.createWriteStream("SheetJSNodeJStream.csv");

/* write data from CSV stream to output file */
csvstream.pipe(ostream);

对象流

¥Object Streams

to_json 使用对象模式流 [^2]。Transform 流 [^3] 可用于生成文本流以流式传输到文件或屏幕。

¥to_json uses Object-mode streams[^2]. A Transform stream[^3] can be used to generate a text stream for streaming to a file or the screen.

以下示例通过写入 process.stdout 流来打印数据:

¥The following example prints data by writing to the process.stdout stream:

var XLSX = require("xlsx"), Transform = require("stream").Transform;

/* read file */
var wb = XLSX.readFile(process.argv[2], {dense: true});

/* get first worksheet */
var ws = wb.Sheets[wb.SheetNames[0]];

/* this Transform stream converts JS objects to text */
var conv = new Transform({writableObjectMode:true});
conv._transform = function(obj, e, cb){ cb(null, JSON.stringify(obj) + "\n"); };

/* pipe `to_json` -> transformer -> standard output */
XLSX.stream.to_json(ws, {raw: true}).pipe(conv).pipe(process.stdout);

BunJS

BunJS 与 NodeJS 流直接兼容。

¥BunJS is directly compatible with NodeJS streams.

Bun 支持被认为是实验性的。

优秀的开源软件会随着用户测试和报告而不断成长。任何问题都应报告给 Bun 项目以进行进一步诊断。

¥Great open source software grows with user tests and reports. Any issues should be reported to the Bun project for further diagnosis.

NodeJS 演示

¥NodeJS Demo

测试部署

该演示在以下部署中进行了测试:

¥This demo was tested in the following deployments:

节点版本日期测试时节点状态
0.12.182024-07-18生命尽头
4.9.12024-07-18生命尽头
6.17.12024-07-18生命尽头
8.17.02024-07-18生命尽头
10.24.12024-07-18生命尽头
12.22.122024-07-18生命尽头
14.21.32024-07-18生命尽头
16.20.22024-07-18生命尽头
18.20.42024-07-18维护长期支持
20.15.12024-07-18活跃的长期支持
22.5.02024-07-18当前的

虽然流方法在 NodeJS 的生命周期结束版本中工作,但生产部署应该升级到 NodeJS 的当前版本或 LTS 版本。

¥While streaming methods work in End-of-Life versions of NodeJS, production deployments should upgrade to a Current or LTS version of NodeJS.

此演示还于 2024-07-18 针对 BunJS 1.1.18 进行了测试。

¥This demo was also tested against BunJS 1.1.18 on 2024-07-18.

  1. 安装 NodeJS 模块

    ¥Install the NodeJS module

npm i --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz
  1. 下载 SheetJSNodeJStream.js

    ¥Download SheetJSNodeJStream.js:

curl -LO https://xlsx.nodejs.cn/stream/SheetJSNodeJStream.js
  1. 下载 测试文件

    ¥Download the test file:

curl -LO https://xlsx.nodejs.cn/pres.xlsx
  1. 运行脚本:

    ¥Run the script:

node SheetJSNodeJStream.js pres.xlsx
Expected Output (click to show)

The console will display a list of objects:

{"Name":"Bill Clinton","Index":42}
{"Name":"GeorgeW Bush","Index":43}
{"Name":"Barack Obama","Index":44}
{"Name":"Donald Trump","Index":45}
{"Name":"Joseph Biden","Index":46}

The script will also generate SheetJSNodeJStream.csv:

Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46

浏览器

¥Browser

测试部署

每个浏览器演示都在以下环境中进行了测试:

¥Each browser demo was tested in the following environments:

浏览器日期
Chrome 1262024-07-18
Safari 17.42024-07-18

NodeJS 流 API 在浏览器中不可用。以下函数提供与 to_csv 函数兼容的伪流对象:

¥NodeJS streaming APIs are not available in the browser. The following function supplies a pseudo stream object compatible with the to_csv function:

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
XLSX.stream.set_readable(() => ({
__done: false,
// this function will be assigned by the SheetJS stream methods
_read: function() { this.__done = true; },
// this function is called by the stream methods
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
}));
return XLSX.stream.to_csv(ws, opts);
}

// assuming `workbook` is a workbook, stream the first sheet
const ws = workbook.Sheets[workbook.SheetNames[0]];
const strm = sheet_to_csv_cb(ws, (csv)=>{ if(csv != null) console.log(csv); });
strm.resume();

Web Worker

为了在浏览器中处理大文件,强烈建议使用 Web Workers。工作线程演示 包括使用文件系统访问 API 的示例。

¥For processing large files in the browser, it is strongly encouraged to use Web Workers. The Worker demo includes examples using the File System Access API.

Web Worker Details (click to show)

Typically, the file and stream processing occurs in the Web Worker. CSV rows can be sent back to the main thread in the callback:

worker.js
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
XLSX.stream.set_readable(() => ({
__done: false,
// this function will be assigned by the SheetJS stream methods
_read: function() { this.__done = true; },
// this function is called by the stream methods
push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
}));
return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
try {
postMessage({state: "fetching " + e.data.url});
/* Fetch file */
const res = await fetch(e.data.url);
const ab = await res.arrayBuffer();

/* Parse file */
postMessage({state: "parsing"});
const wb = XLSX.read(ab, {dense: true});
const ws = wb.Sheets[wb.SheetNames[0]];

/* Generate CSV rows */
postMessage({state: "csv"});
const strm = sheet_to_csv_cb(ws, (csv) => {
if(csv != null) postMessage({csv});
else postMessage({state: "done"});
});
strm.resume();
} catch(e) {
/* Pass the error message back */
postMessage({error: String(e.message || e) });
}
}, false);

The main thread will receive messages with CSV rows for further processing:

main.js
worker.onmessage = function(e) {
if(e.data.error) { console.error(e.data.error); /* show an error message */ }
else if(e.data.state) { console.info(e.data.state); /* current state */ }
else {
/* e.data.csv is the row generated by the stream */
console.log(e.data.csv);
}
};

在线演示

¥Live Demo

以下在线演示在 Web Worker 中获取并解析文件。to_csv 流函数用于生成 CSV 行并传回主线程进行进一步处理。

¥The following live demo fetches and parses a file in a Web Worker. The to_csv streaming function is used to generate CSV rows and pass back to the main thread for further processing.

对于 Chromium 浏览器,文件系统访问 API 提供了一种现代的仅限工作线程的方法。Web Workers 演示 包含 CSV 流式写入的实时示例。

¥For Chromium browsers, the File System Access API provides a modern worker-only approach. The Web Workers demo includes a live example of CSV streaming write.

该演示有一个 URL 输入框。请随意更改 URL。例如,

¥The demo has a URL input box. Feel free to change the URL. For example,

https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls 是一个超过 50 MB 的 XLS 文件

¥https://raw.githubusercontent.com/SheetJS/test_files/master/large_strings.xls is an XLS file over 50 MB

https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx 是一个 XLSX 文件,包含 300000 行(约 20 MB)

¥https://raw.githubusercontent.com/SheetJS/libreoffice_test-files/master/calc/xlsx-import/perf/8-by-300000-cells.xlsx is an XLSX file with 300000 rows (approximately 20 MB)

Result
Loading...
Live Editor
function SheetJSFetchCSVStreamWorker() {
  const [__html, setHTML] = React.useState("");
  const [state, setState] = React.useState("");
  const [cnt, setCnt] = React.useState(0);
  const [url, setUrl] = React.useState("https://xlsx.nodejs.cn/test_files/large_strings.xlsx");

  return ( <>
    <b>URL: </b><input type="text" value={url} onChange={(e) => setUrl(e.target.value)} size="80"/>
    <button onClick={() => {
      /* this mantra embeds the worker source in the function */
      const worker = new Worker(URL.createObjectURL(new Blob([`\
/* load standalone script from CDN */
importScripts("https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js");

function sheet_to_csv_cb(ws, cb, opts, batch = 1000) {
  XLSX.stream.set_readable(() => ({
    __done: false,
    // this function will be assigned by the SheetJS stream methods
    _read: function() { this.__done = true; },
    // this function is called by the stream methods
    push: function(d) { if(!this.__done) cb(d); if(d == null) this.__done = true; },
    resume: function pump() { for(var i = 0; i < batch && !this.__done; ++i) this._read(); if(!this.__done) setTimeout(pump.bind(this), 0); }
  }));
  return XLSX.stream.to_csv(ws, opts);
}

/* this callback will run once the main context sends a message */
self.addEventListener('message', async(e) => {
  try {
    postMessage({state: "fetching " + e.data.url});
    /* Fetch file */
    const res = await fetch(e.data.url);
    const ab = await res.arrayBuffer();

    /* Parse file */
    let len = ab.byteLength;
    if(len < 1024) len += " bytes"; else { len /= 1024;
      if(len < 1024) len += " KB"; else { len /= 1024; len += " MB"; }
    }
    postMessage({state: "parsing " + len});
    const wb = XLSX.read(ab, {dense: true});
    const ws = wb.Sheets[wb.SheetNames[0]];

    /* Generate CSV rows */
    postMessage({state: "csv"});
    const strm = sheet_to_csv_cb(ws, (csv) => {
      if(csv != null) postMessage({csv});
      else postMessage({state: "done"});
    });
    strm.resume();
  } catch(e) {
    /* Pass the error message back */
    postMessage({error: String(e.message || e) });
  }
}, false);
      `])));
      /* when the worker sends back data, add it to the DOM */
      worker.onmessage = function(e) {
        if(e.data.error) return setHTML(e.data.error);
        else if(e.data.state) return setState(e.data.state);
        setHTML(e.data.csv);
        setCnt(cnt => cnt+1);
      };
      setCnt(0); setState("");
      /* post a message to the worker with the URL to fetch */
      worker.postMessage({url});
    }}><b>Click to Start</b></button>
    <pre>State: <b>{state}</b><br/>Number of rows: <b>{cnt}</b></pre>
    <pre dangerouslySetInnerHTML={{ __html }}/>
  </> );
}

Deno

Deno 在正常执行中不支持 NodeJS 流,因此使用了封装器:

¥Deno does not support NodeJS streams in normal execution, so a wrapper is used:

// @deno-types="https://cdn.sheetjs.com/xlsx-0.20.3/package/types/index.d.ts"
import { stream } from 'https://cdn.sheetjs.com/xlsx-0.20.3/package/xlsx.mjs';

/* Callback invoked on each row (string) and at the end (null) */
const csv_cb = (d:string|null) => {
if(d == null) return;
/* The strings include line endings, so raw write ops should be used */
Deno.stdout.write(new TextEncoder().encode(d));
};

/* Prepare `Readable` function */
const Readable = () => ({
__done: false,
// this function will be assigned by the SheetJS stream methods
_read: function() { this.__done = true; },
// this function is called by the stream methods
push: function(d: any) {
if(!this.__done) csv_cb(d);
if(d == null) this.__done = true;
},
resume: function pump() {
for(var i = 0; i < 1000 && !this.__done; ++i) this._read();
if(!this.__done) setTimeout(pump.bind(this), 0);
}
})
/* Wire up */
stream.set_readable(Readable);

/* assuming `workbook` is a workbook, stream the first sheet */
const ws = workbook.Sheets[workbook.SheetNames[0]];
stream.to_csv(wb.Sheets[wb.SheetNames[0]]).resume();
测试部署

此演示最后一次测试是在 2024-07-18,针对 Deno 1.45.2

¥This demo was last tested on 2024-07-18 against Deno 1.45.2.

SheetJSDenoStream.ts 是一个小示例脚本,用于下载 https://xlsx.nodejs.cn/pres.numbers 并打印 CSV 行对象。

¥SheetJSDenoStream.ts is a small example script that downloads https://xlsx.nodejs.cn/pres.numbers and prints CSV row objects.

  1. 运行脚本:

    ¥Run the script:

deno run -A https://xlsx.nodejs.cn/stream/SheetJSDenoStream.ts

该脚本将获取 pres.numbers 并生成 CSV 行。结果将打印到终端窗口。

¥This script will fetch pres.numbers and generate CSV rows. The result will be printed to the terminal window.

[^1]: 请参阅 NodeJS 文档中的 fs.createWriteStream

¥See fs.createWriteStream in the NodeJS documentation.

[^2]: 请参阅 NodeJS 文档中的 "对象模式"

¥See "Object mode" in the NodeJS documentation.

[^3]: 请参阅 NodeJS 文档中的 Transform

¥See Transform in the NodeJS documentation.