Skip to main content

加载器教程

许多现有系统和平台都支持从 CSV 文件加载数据。许多用户更喜欢使用电子表格软件和多表文件格式(包括 XLSX)。SheetJS 库通过将复杂的工作簿转换为简单的 CSV 数据来帮助弥补差距。

¥Many existing systems and platforms include support for loading data from CSV files. Many users prefer to work in spreadsheet software and multi-sheet file formats including XLSX. SheetJS libraries help bridge the gap by translating complex workbooks to simple CSV data.

此示例的目标是将电子表格数据加载到向量存储中,并使用大型语言模型根据英语输入生成查询。现有工具支持 CSV,但不支持真正的电子表格。

¥The goal of this example is to load spreadsheet data into a vector store and use a large language model to generate queries based on English language input. The existing tooling supports CSV but does not support real spreadsheets.

"SheetJS 转换" 中,我们将使用 SheetJS 库为 LangChain CSV 加载器生成 CSV 文件。这些转换可以在预处理步骤中运行,而不会破坏现有的 CSV 工作流程。

¥In "SheetJS Conversion", we will use SheetJS libraries to generate CSV files for the LangChain CSV loader. These conversions can be run in a preprocessing step without disrupting existing CSV workflows.

"SheetJS 加载器" 中,我们将在自定义 LoadOfSheet 数据加载器中使用 SheetJS 库直接生成文档和元数据。

¥In "SheetJS Loader", we will use SheetJS libraries in a custom LoadOfSheet data loader to directly generate documents and metadata.

"SheetJS 加载器演示" 是一个完整的演示,它使用 SheetJS Loader 根据 XLS 工作簿中的数据回答问题。

¥"SheetJS Loader Demo" is a complete demo that uses the SheetJS Loader to answer questions based on data from a XLS workbook.

测试部署

此演示在以下配置中进行了测试:

¥This demo was tested in the following configurations:

日期平台
2024-08-09NVIDIA RTX 4090(24 GB VRAM)+ i9-10910(128 GB RAM)
2024-08-09NVIDIA RTX 4080 SUPER(16 GB VRAM)+ i9-10910(128 GB RAM)
2024-07-15Apple M2 Max 12 核 CPU + 30 核 GPU(32 GB 统一内存)

SheetJS 用户已在其他配置中验证了此演示:

¥SheetJS users have verified this demo in other configurations:

Other tested configurations (click to show)
DemoPlatform
LangChainJSNVIDIA RTX 4070 Ti (12 GB VRAM) + Ryzen 7 5800x (64 GB RAM)
LangChainJSNVIDIA RTX 4060 (8 GB VRAM) + Ryzen 7 5700g (32 GB RAM)
LangChainJSNVIDIA RTX 3090 (24 GB VRAM) + Ryzen 9 3900XT (128 GB RAM)
LangChainJSNVIDIA RTX 3080 (12 GB VRAM) + Ryzen 7 5800X (32 GB RAM)
LangChainJSNVIDIA RTX 3070 (8 GB VRAM) + Ryzen Z1 Extreme (16 GB RAM)
LangChainJSNVIDIA RTX 3060 (12 GB VRAM) + i5-11400 (32 GB RAM)
LangChainJSNVIDIA RTX 2080 (12 GB VRAM) + i7-9700K (16 GB RAM)
LangChainJSNVIDIA RTX 2060 (6 GB VRAM) + Ryzen 5 3600 (32 GB RAM)
LangChainJSNVIDIA GTX 1080 (8 GB VRAM) + Ryzen 7 5800x (64 GB RAM)
LangChainJSNVIDIA GTX 1070 (8 GB VRAM) + Ryzen 7 7700x (32 GB RAM)

特别感谢:

¥Special thanks to:

CSV 加载器

¥CSV Loader

此说明已针对 LangChain 0.2 进行了验证。

¥This explanation was verified against LangChain 0.2.

文档加载器从数据源生成数据对象("documents")和相关元数据。

¥Document loaders generate data objects ("documents") and associated metadata from data sources.

LangChain 提供 CSVLoader[^1] 组件,用于从文件加载 CSV 数据:

¥LangChain offers a CSVLoader[^1] component for loading CSV data from a file:

Generating Documents from a CSV file
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

const loader = new CSVLoader("pres.csv");
const docs = await loader.load();

console.log(docs);

CSV 加载器使用第一行确定列标题并为每个数据行生成一个文档。例如,以下 CSV 保存总统数据:

¥The CSV loader uses the first row to determine column headers and generates one document per data row. For example, the following CSV holds Presidential data:

Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46

每个数据行都转换为一个文档,其内容是属性和值的列表。例如,第三个数据行如下所示:

¥Each data row is translated to a document whose content is a list of attributes and values. For example, the third data row is shown below:

CSV RowDocument Content
Name,Index
Barack Obama,44
Name: Barack Obama
Index: 44

LangChain CSV 加载器将在文档中包含源元数据:

¥The LangChain CSV loader will include source metadata in the document:

Document generated by the CSV loader
Document {
pageContent: 'Name: Barack Obama\nIndex: 44',
metadata: { source: 'pres.csv', line: 3 }
}

SheetJS 转换

¥SheetJS Conversion

SheetJS NodeJS 模块 可以在使用 LangChain 和其他 JavaScript 库的 NodeJS 脚本中导入。

¥The SheetJS NodeJS module can be imported in NodeJS scripts that use LangChain and other JavaScript libraries.

一个简单的预处理步骤可以将工作簿转换为可以由现有 CSV 工具处理的 CSV 文件:

¥A simple pre-processing step can convert workbooks to CSV files that can be processed by the existing CSV tooling:

从文件系统解析文件

¥Parsing files from the filesystem

SheetJS readFile 方法 [^2] 可以读取工作簿文件。该方法接受路径并返回符合 SheetJS 数据模型 [^3] 的工作簿对象。

¥The SheetJS readFile method[^2] can read workbook files. The method accepts a path and returns a workbook object that conforms to the SheetJS data model[^3].

/* Load SheetJS Libraries */
import { readFile, set_fs } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");

检查 SheetJS 工作簿和工作表对象

¥Inspecting SheetJS workbook and worksheet objects

Workbook 对象表示多工作表工作簿文件。它们存储单个工作表对象和其他元数据。

¥Workbook objects represent multi-sheet workbook files. They store individual worksheet objects and other metadata.

与此讨论相关,工作簿对象使用以下键 [^7]:

¥Relevant to this discussion, the workbook object uses the following keys[^7]:

  • SheetNames 是一个工作表名称数组

    ¥SheetNames is an array of worksheet names

  • Sheets 是一个对象,其键是工作表名称,其值是工作表对象。

    ¥Sheets is an object whose keys are sheet names and whose values are sheet objects.

SheetNames[0] 是第一个工作表名称,因此以下代码片段将从工作簿中提取第一个工作表:

¥SheetNames[0] is the first worksheet name, so the following snippet will pull the first worksheet from the workbook:

const first_ws = wb.Sheets[wb.SheetNames[0]];

将 SheetJS 工作表导出到 CSV

¥Exporting SheetJS worksheets to CSV

可以使用 SheetJS sheet_to_csv[^4] 方法将工作簿中的每个工作表写入 CSV 文本。该方法接受 SheetJS 工作表对象并返回字符串。

¥Each worksheet in the workbook can be written to CSV text using the SheetJS sheet_to_csv[^4] method. The method accepts a SheetJS worksheet object and returns a string.

const csv = utils.sheet_to_csv(first_ws);

完成脚本

¥Complete Script

例如,以下 NodeJS 脚本读取 pres.xlsx 并显示第一个工作表中的 CSV 行:

¥For example, the following NodeJS script reads pres.xlsx and displays CSV rows from the first worksheet:

Print CSV data from the first worksheet
/* Load SheetJS Libraries */
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");

/* Print CSV rows from first worksheet */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
console.log(csv);

许多演示涵盖了精神上相似的工作流程:

¥A number of demos cover spiritually similar workflows:

  • StataMATLABMaple 支持 XLSX 数据导入。SheetJS 集成从用户提供的电子表格生成干净的 XLSX 工作簿。

    ¥Stata, MATLAB and Maple support XLSX data import. The SheetJS integrations generate clean XLSX workbooks from user-supplied spreadsheets.

  • TensorFlow.js熊猫Mathematica 支持 CSV 数据导入。SheetJS 集成生成干净的 CSV 并使用内置的 CSV 处理器。

    ¥TensorFlow.js, Pandas and Mathematica support CSV data import. The SheetJS integrations generate clean CSVs and use built-in CSV processors.

  • "命令行工具" 演示涵盖了制作用于文件转换的独立命令行工具的技术。

    ¥The "Command-Line Tools" demo covers techniques for making standalone command-line tools for file conversion.

单一工作表

¥Single Worksheet

对于单个工作表,SheetJS 预处理步骤可以将 CSV 行写入文件,CSVLoader 可以加载新写入的文件。

¥For a single worksheet, a SheetJS pre-processing step can write the CSV rows to file and the CSVLoader can load the newly written file.

Code example (click to hide)
Pulling data from the first worksheet of a workbook
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx`` */
const wb = readFile("pres.xlsx");

/* Generate CSV and write to `pres.xlsx.csv` */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
fs.writeFileSync("pres.xlsx.csv", csv);

/* Create documents with CSVLoader */
const loader = new CSVLoader("pres.xlsx.csv");
const docs = await loader.load();

console.log(docs);
// ...

工作簿

¥Workbook

工作簿是工作表的集合。每个工作表都可以导出到单独的 CSV。如果 CSV 写入子文件夹,DirectoryLoader[^5] 可以一步处理文件。

¥A workbook is a collection of worksheets. Each worksheet can be exported to a separate CSV. If the CSVs are written to a subfolder, a DirectoryLoader[^5] can process the files in one step.

Code example (click to hide)

在此示例中,脚本创建了一个名为 csv 的子文件夹。将处理工作簿中的每个工作表,并将生成的 CSV 存储到编号文件中。第一个工作表将存储到 csv/0.csv

¥In this example, the script creates a subfolder named csv. Each worksheet in the workbook will be processed and the generated CSV will be stored to numbered files. The first worksheet will be stored to csv/0.csv.

Pulling data from the each worksheet of a workbook
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx`` */
const wb = readFile("pres.xlsx");

/* Create a folder `csv` */
try { fs.mkdirSync("csv"); } catch(e) {}

/* Generate CSV data for each worksheet */
wb.SheetNames.forEach((name, idx) => {
const ws = wb.Sheets[name];
const csv = utils.sheet_to_csv(ws);
fs.writeFileSync(`csv/${idx}.csv`, csv);
});

/* Create documents with DirectoryLoader */
const loader = new DirectoryLoader("csv", {
".csv": (path) => new CSVLoader(path)
});
const docs = await loader.load();

console.log(docs);
// ...

SheetJS 加载器

¥SheetJS Loader

LangChain 附带的 CSVLoader 不添加任何文档元数据,也不生成任何属性。自定义加载器可以绕过 CSV 工具中的限制,并可能包含没有 CSV 等效项的元数据。

¥The CSVLoader that ships with LangChain does not add any Document metadata and does not generate any attributes. A custom loader can work around limitations in the CSV tooling and potentially include metadata that has no CSV equivalent.

演示 LoadOfSheet 加载器 将在所有工作表中为每个数据行生成一个文档。它还将尝试构建元数据和属性以用于自查询检索器。

¥The demo LoadOfSheet loader will generate one Document per data row across all worksheets. It will also attempt to build metadata and attributes for use in self-querying retrievers.

Sample usage
/* read and parse `data.xlsb` */
const loader = new LoadOfSheet("./data.xlsb");

/* generate documents */
const docs = await loader.load();

/* synthesized attributes for the SelfQueryRetriever */
const attributes = loader.attributes;
Sample SheetJS Loader (click to show)

This example loader pulls data from each worksheet. It assumes each worksheet includes one header row and a number of data rows.

loadofsheet.mjs
import { Document } from "@langchain/core/documents";
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";

/**

* Document loader that uses SheetJS to load documents.

* * Each worksheet is parsed into an array of row objects using the SheetJS

* `sheet_to_json` method and projected to a `Document`. Metadata includes

* original sheet name, row data, and row index
*/
export default class LoadOfSheet extends BufferLoader {
/** @type {import("langchain/chains/query_constructor").AttributeInfo[]} */
attributes = [];

/**

* Document loader that uses SheetJS to load documents.

* * @param {string|Blob} filePathOrBlob Source Data
*/
constructor(filePathOrBlob) {
super(filePathOrBlob);
this.attributes = [];
}

/**

* Parse document

* * NOTE: column labels in multiple sheets are not disambiguated!

* * @param {Buffer} raw Raw data Buffer

* @param {Document["metadata"]} metadata Document metadata

* @returns {Promise<Document[]>} Array of Documents
*/
async parse(raw, metadata) {
/** @type {Document[]} */
const result = [];

this.attributes = [
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
{ name: "rowNum", description: "Row index", type: "number" }
];

const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
const fields = {};
const ws = wb.Sheets[name];
if(!ws) return;

const aoo = utils.sheet_to_json(ws);
aoo.forEach((row, idx) => {
result.push({
pageContent: "Row " + (idx + 1) + " has the following content: \n" + Object.entries(row).map(kv => `- ${kv[0]}: ${kv[1]}`).join("\n") + "\n",
metadata: {
worksheet: name,
rowNum: row["__rowNum__"],
...metadata,
...row
}
});
Object.entries(row).forEach(([k,v]) => { if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true } );
});
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
name: k, description: k, type: Object.keys(v).join(" or ")
}));
}

return result;
}
};

从文本到二进制

¥From Text to Binary

许多库和平台提供通用 "text" 加载器,这些加载器假定采用 UTF8 编码来处理文件。这会破坏许多电子表格格式,包括 XLSX、XLSB、XLSM 和 XLS。

¥Many libraries and platforms offer generic "text" loaders that process files assuming the UTF8 encoding. This corrupts many spreadsheet formats including XLSX, XLSB, XLSM and XLS.

此问题影响许多 JavaScript 工具。各种演示涵盖解决方法:

¥This issue affects many JavaScript tools. Various demos cover workarounds:

CSVLoader 扩展了一个特殊的 TextLoader,强制进行 UTF8 文本解析。

¥The CSVLoader extends a special TextLoader that forces UTF8 text parsing.

PDF 加载器使用一个单独的 BufferLoader 类,它使用 NodeJS Buffer 对象传递原始数据。

¥There is a separate BufferLoader class, used by the PDF loader, that passes the raw data using NodeJS Buffer objects.

BinaryText
pdf.ts (structure)
export class PDFLoader extends BufferLoader {
// ...
public async parse(
raw: Buffer,
metadata: Document["metadata"]
): Promise<Document[]> {
// ...
}
// ...
}
csv.ts (structure)
export class CSVLoader extends TextLoader {
// ...
protected async parse(
raw: string

): Promise<string[]> {
// ...
}
// ...
}

NodeJS 缓冲区

¥NodeJS Buffers

SheetJS read 方法直接支持 NodeJS Buffer 对象 [^6]:

¥The SheetJS read method supports NodeJS Buffer objects directly[^6]:

Parsing a workbook in a BufferLoader
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";

export default class LoadOfSheet extends BufferLoader {
// ...
async parse(raw, metadata) {
const wb = read(raw, {type: "buffer"});
// At this point, `wb` is a SheetJS workbook object
// ...
}
}

read 方法返回 SheetJS 工作簿对象 [^7]。

¥The read method returns a SheetJS workbook object[^7].

生成内容

¥Generating Content

SheetJS sheet_to_json 方法 [^8] 返回一个数据对象数组,其键从工作表的第一行中提取。

¥The SheetJS sheet_to_json method[^8] returns an array of data objects whose keys are drawn from the first row of the worksheet.

SpreadsheetArray of Objects

pres.xlsx data

[
{ Name: "Bill Clinton", Index: 42 },
{ Name: "GeorgeW Bush", Index: 43 },
{ Name: "Barack Obama", Index: 44 },
{ Name: "Donald Trump", Index: 45 },
{ Name: "Joseph Biden", Index: 46 }
]

原始 CSVLoader 为每个键值对写入一行。可以通过循环数据行对象的键和值来生成此文本。Object.entries 辅助函数简化了转换:

¥The original CSVLoader wrote one row for each key-value pair. This text can be generated by looping over the keys and values of the data row object. The Object.entries helper function simplifies the conversion:

function make_csvloader_doc_from_row_object(row) {
return Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n");
}

生成文档

¥Generating Documents

加载器必须为工作簿中的每个工作表生成行对象。

¥The loader must generate row objects for each worksheet in the workbook.

在 SheetJS 数据模型中,工作簿对象有两个相关字段:

¥In the SheetJS data model, the workbook object has two relevant fields:

  • SheetNames 是一个工作表名称数组

    ¥SheetNames is an array of sheet names

  • Sheets 是一个对象,其键是工作表名称,其值是工作表对象。

    ¥Sheets is an object whose keys are sheet names and values are sheet objects.

for..of 循环可以遍历工作表:

¥A for..of loop can iterate across the worksheets:

Looping over a workbook (skeleton)
    const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
const ws = wb.Sheets[name];
const aoa = utils.sheet_to_json(ws);
// at this point, `aoa` is an array of objects
}

此简化的 parse 函数使用上一节中的代码片段:

¥This simplified parse function uses the snippet from the previous section:

BufferLoader parse function (skeleton)
  async parse(raw, metadata) {
/* array to hold generated documents */
const result = [];

/* read workbook */
const wb = read(raw, {type: "buffer", WTF:1});

/* loop over worksheets */
for(let name of wb.SheetNames) {
const ws = wb.Sheets[name];
const aoa = utils.sheet_to_json(ws);

/* loop over data rows */
aoa.forEach((row, idx) => {
/* generate a new document and add to the result array */
result.push({
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n")
});
});
}

return result;
}

元数据和属性

¥Metadata and Attributes

强烈建议为自查询检索应用生成额外的元数据和属性。

¥It is strongly recommended to generate additional metadata and attributes for self-query retrieval applications.

Implementation Details (click to show)

Metadata

Metadata is attached to each document object. The following example appends the raw row data to the document metadata:

Document with metadata (snippet)
        /* generate a new document and add to the result array */
result.push({
pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n"),
metadata: {
worksheet: name, // name of the worksheet
rowNum: idx, // data row index
...row // raw row data
}
});

Attributes

Each attribute object specifies three properties:

  • name corresponds to the field in the document metadata
  • description is a description of the field
  • type is a description of the data type.

While looping through data rows, a simple type check can keep track of the data type for each column:

Tracking column types (sketch)
    for(let name of wb.SheetNames) {
/* track column types */
const fields = {};
// ...

aoo.forEach((row, idx) => {
result.push({/* ... */});
/* Check each property */
Object.entries(row).forEach(([k,v]) => {
/* Update fields entry to reflect the new data point */
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
});
});
// ...
}

Attributes can be generated after writing the worksheet data. Storing attributes in a loader property will make it accessible to scripts that use the loader.

Adding Attributes to a Loader (sketch)
export default class LoadOfSheet extends BufferLoader {
attributes = [];
// ...

async parse(raw, metadata) {
// Add the worksheet name and row index attributes
this.attributes = [
{ name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
{ name: "rowNum", description: "Row index", type: "number" }
];
const wb = read(raw, {type: "buffer", WTF:1});
for(let name of wb.SheetNames) {
const fields = {};
// ...
const aoo = utils.sheet_to_json(ws);
aoo.forEach((row, idx) => {
result.push({/* ... */});
/* Check each property */
Object.entries(row).forEach(([k,v]) => {
/* Update fields entry to reflect the new data point */
if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
});
});
/* Add one attribute per metadata field */
Object.entries(fields).forEach(([k,v]) => this.attributes.push({
name: k, description: k,
/* { number: true, string: true } -> "number or string" */
type: Object.keys(v).join(" or ")
}));
}
// ...
}

SheetJS 加载器演示

¥SheetJS Loader Demo

演示针对 示例汽车数据集 执行查询 "哪些行每加仑行驶里程超过 40 英里?" 并显示结果。

¥The demo performs the query "Which rows have over 40 miles per gallon?" against a sample cars dataset and displays the results.

SheetJS 团队成员已使用 PowerShell 和 Ollama for Windows 在 Windows 10 和 Windows 11 上测试了此演示。

¥SheetJS team members have tested this demo on Windows 10 and Windows 11 using PowerShell and Ollama for Windows.

SheetJS 用户还在 Windows Subsystem for Linux 中测试了此演示。

¥SheetJS users have also tested this demo within Windows Subsystem for Linux.

此演示使用 Ollama 中的 ChatQA-1.5 model[^9] 进行了测试。

¥This demo was tested using the ChatQA-1.5 model[^9] in Ollama.

测试模型使用了高达 9.2GB 的 VRAM。强烈建议在较新的 Apple Silicon Mac 或配备 Nvidia GPU(至少 12GB VRAM)的 PC 上运行演示。SheetJS 用户已在 VRAM 低至 6GB 的系统上测试了此演示。

¥The tested model used up to 9.2GB VRAM. It is strongly recommended to run the demo on a newer Apple Silicon Mac or a PC with an Nvidia GPU with at least 12GB VRAM. SheetJS users have tested the demo on systems with as little as 6GB VRAM.

  1. 安装先决条件:

    ¥Install pre-requisites:

Ollama 应安装在与 NodeJS 相同的平台上。如果在 WSL 中运行 NodeJS,则还应在 WSL 中安装 Ollama。

¥Ollama should be installed on the same platform as NodeJS. If NodeJS is run within WSL, Ollama should also be installed within WSL.

安装依赖后,启动新的终端会话。

¥After installing dependencies, start a new terminal session.

  1. 创建一个新项目:

    ¥Create a new project:

mkdir sheetjs-loader
cd sheetjs-loader
npm init -y
  1. 下载演示脚本:

    ¥Download the demo scripts:

curl -LO https://xlsx.nodejs.cn/loadofsheet/query.mjs
curl -LO https://xlsx.nodejs.cn/loadofsheet/loadofsheet.mjs

在 PowerShell 中,命令可能会因参数错误而失败:

¥In PowerShell, the command may fail with a parameter error:

Invoke-WebRequest : A parameter cannot be found that matches parameter name 'LO'.

curl.exe 必须直接调用:

¥curl.exe must be invoked directly:

curl.exe -LO https://xlsx.nodejs.cn/loadofsheet/query.mjs
curl.exe -LO https://xlsx.nodejs.cn/loadofsheet/loadofsheet.mjs
  1. 安装 SheetJS NodeJS 模块:

    ¥Install the SheetJS NodeJS module:

npm i --save https://sheet.lol/balls/xlsx-0.20.3.tgz
  1. 安装依赖:

    ¥Install dependencies:

npm i --save @langchain/community@0.2.18 @langchain/core@0.2.15 langchain@0.2.9 peggy@4.0.3 --force

上次测试此演示时,出现与依赖和对等依赖版本相关的错误消息。--force 标志是必需的。

¥When this demo was last tested, there were error messages relating to dependency and peer dependency versions. The --force flag was required.

  1. 下载 汽车数据集

    ¥Download the cars dataset:

curl -LO https://xlsx.nodejs.cn/cd.xls

在 PowerShell 中,命令可能会因参数错误而失败:

¥In PowerShell, the command may fail with a parameter error:

Invoke-WebRequest : A parameter cannot be found that matches parameter name 'LO'.

curl.exe 必须直接调用:

¥curl.exe must be invoked directly:

curl.exe -LO https://xlsx.nodejs.cn/cd.xls
  1. 使用 Ollama 安装 llama3-chatqa:8b-v1.5-q8_0 模型:

    ¥Install the llama3-chatqa:8b-v1.5-q8_0 model using Ollama:

ollama pull llama3-chatqa:8b-v1.5-q8_0
  1. 运行演示脚本

    ¥Run the demo script

node query.mjs

演示执行查询 "哪些行每加仑行驶里程超过 40 英里?"。它将打印以下九个结果:

¥The demo performs the query "Which rows have over 40 miles per gallon?". It will print the following nine results:

Expected output (order of lines may differ)
{ Name: 'volkswagen rabbit custom diesel', MPG: 43.1 }
{ Name: 'vw rabbit c (diesel)', MPG: 44.3 }
{ Name: 'renault lecar deluxe', MPG: 40.9 }
{ Name: 'honda civic 1500 gl', MPG: 44.6 }
{ Name: 'datsun 210', MPG: 40.8 }
{ Name: 'vw pickup', MPG: 44 }
{ Name: 'mazda glc', MPG: 46.6 }
{ Name: 'vw dasher (diesel)', MPG: 43.4 }
{ Name: 'vw rabbit', MPG: 41.5 }

一些使用较旧 GPU 的 SheetJS 用户报告了错误。

¥Some SheetJS users with older GPUs have reported errors.

如果命令失败,请尝试再次运行脚本。

¥If the command fails, please try running the script a second time.

要找到预期结果:

¥To find the expected results:

  • 在 Excel 中打开 cd.xls 电子表格

    ¥Open the cd.xls spreadsheet in Excel

  • 在功能区中选择主页 > 排序和筛选 > 筛选

    ¥Select Home > Sort & Filter > Filter in the Ribbon

  • 选择 B 列 (Miles_per_Gallon) 的筛选选项

    ¥Select the filter option for column B (Miles_per_Gallon)

  • 在弹出窗口中,在 Filter 下拉列表中选择 "大于" 并输入 40

    ¥In the popup, select "Greater Than" in the Filter dropdown and type 40

过滤后的结果应与以下屏幕截图匹配:

¥The filtered results should match the following screenshot:

Expected Results

SheetJS 模型 公开 formulae 和其他功能。

¥The SheetJS model exposes formulae and other features.

SheetJS 专业版 构建公开了单元格样式、图片、图表、表格和其他功能。

¥SheetJS Pro builds expose cell styling, images, charts, tables, and other features.

[^1]: 有关 LangChain 文档中的 "如何加载 CSV 数据"

¥See "How to load CSV data" in the LangChain documentation

[^2]: 见 readFile 于 "读取文件"

¥See readFile in "Reading Files"

[^3]: 见 "SheetJS 数据模型"

¥See "SheetJS Data Model"

[^4]: 见 sheet_to_csv 于 "CSV 和文本"

¥See sheet_to_csv in "CSV and Text"

[^5]: 有关 LangChain 文档中的 "包含多个文件的文件夹"

¥See "Folders with multiple files" in the LangChain documentation

[^6]: 见 "写入文件" 中的 "支持的输出格式" 类型

¥See "Supported Output Formats" type in "Writing Files"

[^7]: 见 "工作簿对象"

¥See "Workbook Object"

[^8]: 见 sheet_to_json 于 "实用工具"

¥See sheet_to_json in "Utilities"

[^9]: 有关 ChatQA 论文和其他模型详细信息,请参阅 官方 ChatQA 网站

¥See the official ChatQA website for the ChatQA paper and other model details.