PST 邮箱中的工作表
PST(个人存储表)是一种用于存储消息的常见文件格式。电子取证通常涉及从 PST 档案中存储的电子邮件中附加的电子表格中提取数据。
¥PST (Personal Storage Table) is a common file format for storing messages. Electronic discovery commonly involves extracting data from attached spreadsheets in e-mail messages stored in PST archives.
pst-extractor
[^1] 是一个 NodeJS 模块,设计用于从 PST 文件中提取对象。它已被用于从 Enron Corpus[^2] 和其他大型邮箱中提取电子表格。
¥pst-extractor
[^1] is a NodeJS module designed for extracting objects from PST
files. It has been used to extract spreadsheets from the Enron Corpus[^2] and
other large mailboxes.
SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。
¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.
该演示使用 pst-extractor
和 SheetJS 来读取电子表格。我们将探讨如何在 NodeJS 脚本或网站中加载 SheetJS、提取电子表格文件以及生成基础数据的 HTML 和 CSV 视图。
¥This demo uses pst-extractor
and SheetJS to read spreadsheets. We'll explore
how to load SheetJS in a NodeJS script or website, extract spreadsheets files,
and generate HTML and CSV views of the underlying data.
"在线演示" 读取 PST 文件。文件中的各个电子表格可以下载或在浏览器中预览。
¥The "Live Demo" reads PST files. Individual spreadsheets within the file can be downloaded or previewed in the browser.
该演示在以下部署中进行了测试:
¥This demo was tested in the following deployments:
平台 | 版本 | 日期 |
---|---|---|
Chrome 131 | 1.9.0 | 2024-12-22 |
NodeJS 20 | 1.10.0 | 2024-12-22 |
BunJS 1.1 | 1.10.0 | 2024-12-22 |
概述
¥Overview
可以从使用 pst-extractor
的脚本导入 SheetJS NodeJS 模块。
¥The SheetJS NodeJS module can be
imported from scripts that use pst-extractor
.
解析 PST 文件
¥Parsing PST Files
pst-extractor
模块公开了 PSTFile
类。构造函数需要一个适当的 NodeJS 缓冲区。
¥The pst-extractor
module exposes a PSTFile
class. The constructor requires a
proper NodeJS buffer.
以下代码片段从本地文件系统读取并解析 enron.pst
。fs.readFileSync
[^3] 接受文件名并返回缓冲区:
¥The following snippet reads and parses enron.pst
from the local filesystem.
fs.readFileSync
[^3] accepts a filename and returns a Buffer:
const fs = require("fs"), PSTExtractor = require("pst-extractor");
const file = fs.readFileSync("enron.pst");
const pst = new (PSTExtractor.PSTFile)(file);
走树
¥Walking the Tree
pst-extractor
提供了一个树状结构来检查 PST 文件的内容。建议使用递归函数来遍历树。
¥pst-extractor
presents a tree-like structure to inspect the contents of the
PST file. It is recommended to use recursive functions to walk the tree.
以下树步行者将收集所有 XLSX 和 XLS 附件:
¥The following tree walker will collect all XLSX and XLS attachments:
/* walk the PST file and add all attachments to the specified array */
function walk(f,arr) {
if(f.hasSubfolders) for(let sf of f.getSubFolders()) walk(sf,arr);
if(f.contentCount <= 0) return;
for(let e = f.getNextChild(); e != null; e = f.getNextChild()) {
for(let i = 0; i < e.numberOfAttachments; ++i) {
var a = e.getAttachment(i);
/* XLS spreadsheet test by filename */
if(/.xls[xmb]?$/.test(a.filename)) arr.push(a);
}
}
}
/* generate a list of attachments */
const files = [];
walk(pst.getRootFolder(), files);
生成缓冲区
¥Generating Buffers
PSTAttachment
类保存附件元数据。为了避免将所有内容加载到内存中,原始数据作为自定义流对象公开。由于 SheetJS read
函数需要 Buffer
或 Uint8Array
中的数据,因此使用辅助函数来收集数据:
¥The PSTAttachment
class holds attachment metadata. To avoid loading everything
in memory, the raw data is exposed as a custom stream object. Since the SheetJS
read
function requires data in a Buffer
or Uint8Array
, a helper function
is used to collect the data:
/* collect data from the attachment into a "Buffer" */
function collect(file) {
const strm = file.fileInputStream;
const data = Buffer.alloc(strm._length.low);
strm.readCompletely(data);
return data;
}
/* collect data from the first attachment */
const buf0 = collect(files[0]);
处理附件
¥Processing Attachments
给定 NodeJS 缓冲区,SheetJS read
方法 [^4] 解析数据并返回工作簿对象 [^5]。可以从工作簿中提取各个工作表并将其转换为 CSV[^6] 或 HTML[^7]。
¥Given a NodeJS Buffer, the SheetJS read
method[^4] parses the data and returns
a workbook object[^5]. Individual worksheets can be extracted from the workbook
and converted to CSV[^6] or HTML[^7].
以下示例以 CSV 形式打印每个工作表的内容:
¥The following example prints the contents of each worksheet in CSV form:
const XLSX = require("xlsx");
/* parse workbook and print CSV contents of each sheet */
const wb = XLSX.read(buf0);
wb.SheetNames.forEach(n => {
const ws = wb.Sheets[n];
const csv = XLSX.utils.sheet_to_csv(ws);
console.log(`#### ${file.filename} ! ${n}`);
console.log(csv);
});
浏览器注意事项
¥Browser Caveats
SheetJS 独立脚本 可以通过 SCRIPT
标签加载。
¥The SheetJS Standalone scripts
can be loaded through a SCRIPT
tag.
该演示使用 特殊的 pst-extractor
版本 网页版。
¥This demo uses a special pst-extractor
build for the web.
与 NodeJS 构建相比,浏览器脚本需要特殊的缓冲区封装器。例如,由于库不支持 ArrayBuffer
对象,以下函数将失败:
¥Compared to the NodeJS build, browser scripts require special Buffer wrappers.
For example, the following function will fail since the library does not support
ArrayBuffer
objects:
async function error_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
// this will throw an error
return new (PSTExtractor.PSTFile)(ab);
}
浏览器版本在 PSTExtractor
全局中公开 Buffer
对象:
¥The browser build exposes the Buffer
object in the PSTExtractor
global:
async function correct_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
const buf = new PSTExtractor.Buffer(ab);
return new (PSTExtractor.PSTFile)(buf);
}
浏览器构建
¥Browser Build
pst-extractor
库是为 NodeJS 设计的。该库的某些部分需要 NodeJS Buffer
,而浏览器中不存在该版本。可以在脚本中添加并公开假的 Buffer
。
¥The pst-extractor
library is designed for NodeJS. Parts of the library expect
a NodeJS Buffer
, which does not exist in the browser. A fake Buffer
can be
added and exposed in a script.
pstextractor.js
已加载到演示页面中。
¥pstextractor.js
is loaded in the demo page.
Build instructions (click to show)
- Initialize a new NodeJS project and install the dependency:
mkdir pstextract
cd pstextract
npm init -y
npm i --save pst-extractor@1.9.0
- Save the following to
shim.js
:
const PSTExtractor = require("pst-extractor");
module.exports = PSTExtractor;
module.exports.Buffer = Buffer;
- Build the script:
npx browserify@17.0.0 -s PSTExtractor -o pstextractor.js shim.js
演示
¥Demos
NodeJS
该演示将获取 测试太平洋标准时间 并提取所有嵌入的电子表格。该脚本可用于读取本地 PST 文件或从不同的 URL 提取 PST 文件。
¥This demo will fetch a test PST and extract all embedded spreadsheets. The script can be adapted to read local PST files or pull PST files from a different URL.
该演示使用 fetch
并需要 NodeJS 18 或更高版本。
¥The demo uses fetch
and requires NodeJS 18 or later.
-
初始化一个新项目:
¥Initialize a new project:
mkdir sheetjs-pst
cd sheetjs-pst
npm init -y
-
安装 SheetJS NodeJS 模块和
pst-extractor
:¥Install the SheetJS NodeJS module and
pst-extractor
:
- npm
- pnpm
- Yarn
npm i --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
pnpm install --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
yarn add https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
-
下载
SheetJSPST.js
到项目文件夹中:¥Download
SheetJSPST.js
into project folder:
curl -LO https://xlsx.nodejs.cn/pst/SheetJSPST.js
-
运行脚本:
¥Run the script:
node SheetJSPST.js
该过程将获取 测试 PST 并提取嵌入的电子表格。终端将显示导出文件的信息。
¥The process will fetch the test PST and extract the embedded spreadsheets. The terminal will display info on the exported files.
以 saving file
开头的行显示附件如何对应于文件。以下行指出第一个附件(索引 0
)最初命名为 RedRockA.xls
,并保存到文件系统上的 file0.xls
:
¥Lines starting with saving file
show how attachments correspond to files. The
following line states that the first attachment (index 0
) was originally named
RedRockA.xls
and was saved to file0.xls
on the file system:
saving file 0 |RedRockA.xls| to file0.xls
以 ####
开头的行显示附件文件名和工作表名称。以下行说明文件 RedRockA.xls
中有一个名为 "Oneok at 2500"
的工作表:
¥Lines starting with ####
show the attachment file name and the worksheet name.
The following line explains that there is a worksheet named "Oneok at 2500"
in
the file RedRockA.xls
:
#### RedRockA.xls ! Oneok at 2500
每隔一行都是指定工作表中的一个 CSV 行。例如,RedRockA.xls
中的工作表 "Oneok at 2500"
的前四行如下所示:
¥Every other line is a CSV row from the named worksheet. For example, the first
four lines of worksheet "Oneok at 2500"
in RedRockA.xls
are shown below:
#### RedRockA.xls ! Oneok at 2500
RED ROCK EXPANSION PROJECT,,,,,,,,
,,,,,,,,
,,REQUESTED,REQUESTED,,,,,
,,RECEIPT,DELIVERY,,,Allocation,,
在线演示
¥Live Demo
该演示读取 PST 邮箱。由于浏览器限制,大于 100 MB 的 PST 文件可能会使浏览器崩溃。
¥This demo reads PST mailboxes. Due to browser limitations, PST files larger than 100 MB may crash the browser.
解析 PST 文件后,"附件" 表将列出文件中附加的 XLSX 和 XLS 电子表格。"preview" 链接将显示一个包含电子表格中数据的 HTML 表。"download" 链接将下载附件。
¥After parsing the PST file, the "Attachments" table will list attached XLSX and XLS spreadsheets in the file. The "preview" link will display an HTML table with the data in the spreadsheet. The "download" link will download the attachment.
测试文件 基于 "安然语料库" 的 EDRM 干净提取,并包含一些 XLS 附件。
¥The test file was based on the EDRM clean extract from the "Enron Corpus" and includes a few XLS attachments.
如果在线演示显示一条消息
¥If the live demo shows a message
Please reload the page
请刷新页面。这是文档生成器中的一个已知错误。
¥please refresh the page. This is a known bug in the documentation generator.
function SheetJSPreviewPSTSheets() { const [ files, setFiles ] = React.useState([]); const [ __html, setHTML ] = React.useState(""); /* recursively walk PST and collect attachments */ const walk = (f,arr) => { if(f.hasSubfolders) for(let sf of f.getSubFolders()) walk(sf,arr); if(f.contentCount <= 0) return; for(let e = f.getNextChild(); e != null; e = f.getNextChild()) { for(let i = 0; i < e.numberOfAttachments; ++i) { var a = e.getAttachment(i); /* XLS spreadsheet test by filename */ if(/.xls[xmb]?$/.test(a.filename)) arr.push(a); } } } /* collect data from the attachment into a "Buffer" */ const collect = (j) => { const strm = files[j].fileInputStream; const data = new PSTExtractor.Buffer(strm._length.low); strm.readCompletely(data); return data; } /* view selected attachment */ const view = (j) => { const data = collect(j); /* parse */ const wb = XLSX.read(data); /* convert first sheet to HTML */ const ws = wb.Sheets[wb.SheetNames[0]]; setHTML(XLSX.utils.sheet_to_html(ws)); } /* process array buffer */ const process_ab = (ab) => { const pst = new (PSTExtractor.PSTFile)(new PSTExtractor.Buffer(ab)); const data = []; walk(pst.getRootFolder(), data); setFiles(data); }; /* on click, fetch and process file */ const doit = async() => { const ab = await (await fetch("/pst/enron.pst")).arrayBuffer(); process_ab(ab); }; const chg = async(e) => process_ab(await e.target.files[0].arrayBuffer()); /* download selected attachment */ const dl = (j) => { const a = document.createElement("a"); a.download = files[j].filename; a.href = URL.createObjectURL(new Blob([collect(j)])); document.body.appendChild(a); a.click(); document.body.removeChild(a); } if(typeof PSTExtractor == "undefined") return <b>Please reload the page</b>; return ( <> <p>Use the file input to select a file, or click "Use a Sample PST"</p> <input type="file" accept=".pst" onChange={chg}/> <button onClick={doit}>Use a Sample PST!</button><br/><br/> <table><thead><tr><th colspan="3">Attachments</th></tr></thead> <tbody>{files.map((f,j) => ( <tr key={j}><th>{f.filename}</th> <td><a onClick={()=>view(j)}>(preview)</a></td> <td><a onClick={()=>dl(j)}>(download)</a></td> </tr> ))}</tbody> </table> <b>Preview of first worksheet</b><br/> <div dangerouslySetInnerHTML={{__html}}></div> </> ); }
[^1]: 该项目没有官方网站。官方 repository 托管在 GitHub 上。
¥The project has no official website. The official repository is hosted on GitHub.
[^2]: 提取的电子表格为 可以在 GitHub 上找到
¥Extracted spreadsheets are available on GitHub
[^3]: 请参阅 NodeJS 文档中的 fs.readFileSync
¥See fs.readFileSync
in the NodeJS documentation
[^4]: 见 read
于 "读取文件"
[^5]: 见 "工作簿对象"
¥See "Workbook Object"
[^6]: 见 sheet_to_csv
于 "CSV 和文本"
¥See sheet_to_csv
in "CSV and Text"
[^7]: 见 sheet_to_html
于 "实用工具"