PST 邮箱中的工作表
PST(个人存储表)是一种用于存储消息的常见文件格式。电子取证通常涉及从 PST 档案中存储的电子邮件中附加的电子表格中提取数据。
¥PST (Personal Storage Table) is a common file format for storing messages. Electronic discovery commonly involves extracting data from attached spreadsheets in e-mail messages stored in PST archives.
pst-extractor
[^1] 是一个 NodeJS 模块,设计用于从 PST 文件中提取对象。它已被用于从 Enron Corpus[^2] 和其他大型邮箱中提取电子表格。
¥pst-extractor
[^1] is a NodeJS module designed for extracting objects from PST
files. It has been used to extract spreadsheets from the Enron Corpus[^2] and
other large mailboxes.
SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。
¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.
该演示使用 pst-extractor
和 SheetJS 来读取电子表格。我们将探讨如何在 NodeJS 脚本或网站中加载 SheetJS、提取电子表格文件以及生成基础数据的 HTML 和 CSV 视图。
¥This demo uses pst-extractor
and SheetJS to read spreadsheets. We'll explore
how to load SheetJS in a NodeJS script or website, extract spreadsheets files,
and generate HTML and CSV views of the underlying data.
"在线演示" 读取 PST 文件。文件中的各个电子表格可以下载或在浏览器中预览。
¥The "Live Demo" reads PST files. Individual spreadsheets within the file can be downloaded or previewed in the browser.
该演示最后一次测试是在 2024 年 3 月 11 日,针对 pst-extractor
1.9.0
¥This demo was last tested on 2024 March 11 against pst-extractor
1.9.0
概述
¥Overview
可以从使用 pst-extractor
的脚本导入 SheetJS NodeJS 模块。
¥The SheetJS NodeJS module can be
imported from scripts that use pst-extractor
.
解析 PST 文件
¥Parsing PST Files
pst-extractor
模块公开了 PSTFile
类。构造函数需要一个适当的 NodeJS 缓冲区。
¥The pst-extractor
module exposes a PSTFile
class. The constructor requires a
proper NodeJS buffer.
以下代码片段从本地文件系统读取并解析 enron.pst
。fs.readFileSync
[^3] 接受文件名并返回缓冲区:
¥The following snippet reads and parses enron.pst
from the local filesystem.
fs.readFileSync
[^3] accepts a filename and returns a Buffer:
const fs = require("fs"), PSTExtractor = require("pst-extractor");
const file = fs.readFileSync("enron.pst");
const pst = new (PSTExtractor.PSTFile)(file);
走树
¥Walking the Tree
pst-extractor
提供了一个树状结构来检查 PST 文件的内容。建议使用递归函数来遍历树。
¥pst-extractor
presents a tree-like structure to inspect the contents of the
PST file. It is recommended to use recursive functions to walk the tree.
以下树步行者将收集所有 XLSX 和 XLS 附件:
¥The following tree walker will collect all XLSX and XLS attachments:
/* walk the PST file and add all attachments to the specified array */
function walk(f,arr) {
if(f.hasSubfolders) for(let sf of f.getSubFolders()) walk(sf,arr);
if(f.contentCount <= 0) return;
for(let e = f.getNextChild(); e != null; e = f.getNextChild()) {
for(let i = 0; i < e.numberOfAttachments; ++i) {
var a = e.getAttachment(i);
/* XLS spreadsheet test by filename */
if(/.xls[xmb]?$/.test(a.filename)) arr.push(a);
}
}
}
/* generate a list of attachments */
const files = [];
walk(pst.getRootFolder(), files);
生成缓冲区
¥Generating Buffers
PSTAttachment
类保存附件元数据。为了避免将所有内容加载到内存中,原始数据作为自定义流对象公开。由于 SheetJS read
函数需要 Buffer
或 Uint8Array
中的数据,因此使用辅助函数来收集数据:
¥The PSTAttachment
class holds attachment metadata. To avoid loading everything
in memory, the raw data is exposed as a custom stream object. Since the SheetJS
read
function requires data in a Buffer
or Uint8Array
, a helper function
is used to collect the data:
/* collect data from the attachment into a "Buffer" */
function collect(file) {
const strm = file.fileInputStream;
const data = Buffer.alloc(strm._length.low);
strm.readCompletely(data);
return data;
}
/* collect data from the first attachment */
const buf0 = collect(files[0]);
处理附件
¥Processing Attachments
给定 NodeJS 缓冲区,SheetJS read
方法 [^4] 解析数据并返回工作簿对象 [^5]。可以从工作簿中提取各个工作表并将其转换为 CSV[^6] 或 HTML[^7]。
¥Given a NodeJS Buffer, the SheetJS read
method[^4] parses the data and returns
a workbook object[^5]. Individual worksheets can be extracted from the workbook
and converted to CSV[^6] or HTML[^7].
以下示例以 CSV 形式打印每个工作表的内容:
¥The following example prints the contents of each worksheet in CSV form:
const XLSX = require("xlsx");
/* parse workbook and print CSV contents of each sheet */
const wb = XLSX.read(buf0);
wb.SheetNames.forEach(n => {
const ws = wb.Sheets[n];
const csv = XLSX.utils.sheet_to_csv(ws);
console.log(`#### ${file.filename} ! ${n}`);
console.log(csv);
});
浏览器注意事项
¥Browser Caveats
SheetJS 独立脚本 可以通过 SCRIPT
标签加载。
¥The SheetJS Standalone scripts
can be loaded through a SCRIPT
tag.
该演示使用 特殊的 pst-extractor
版本 网页版。
¥This demo uses a special pst-extractor
build for the web.
与 NodeJS 构建相比,浏览器脚本需要特殊的缓冲区封装器。例如,由于库不支持 ArrayBuffer
对象,以下函数将失败:
¥Compared to the NodeJS build, browser scripts require special Buffer wrappers.
For example, the following function will fail since the library does not support
ArrayBuffer
objects:
async function error_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
// this will throw an error
return new (PSTExtractor.PSTFile)(ab);
}
浏览器版本在 PSTExtractor
全局中公开 Buffer
对象:
¥The browser build exposes the Buffer
object in the PSTExtractor
global:
async function correct_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
const buf = new PSTExtractor.Buffer(ab);
return new (PSTExtractor.PSTFile)(buf);
}
浏览器构建
¥Browser Build
pst-extractor
库是为 NodeJS 设计的。该库的某些部分需要 NodeJS Buffer
,而浏览器中不存在该版本。可以在脚本中添加并公开假的 Buffer
。
¥The pst-extractor
library is designed for NodeJS. Parts of the library expect
a NodeJS Buffer
, which does not exist in the browser. A fake Buffer
can be
added and exposed in a script.
pstextractor.js
已加载到演示页面中。
¥pstextractor.js
is loaded in the demo page.
Build instructions (click to show)
- Initialize a new NodeJS project and install the dependency:
mkdir pstextract
cd pstextract
npm init -y
npm i --save pst-extractor@1.9.0
- Save the following to
shim.js
:
const PSTExtractor = require("pst-extractor");
module.exports = PSTExtractor;
module.exports.Buffer = Buffer;
- Build the script:
npx browserify@17.0.0 -s PSTExtractor -o pstextractor.js shim.js
演示
¥Demos
NodeJS
该演示将获取 测试太平洋标准时间 并提取所有嵌入的电子表格。该脚本可用于读取本地 PST 文件或从不同的 URL 提取 PST 文件。
¥This demo will fetch a test PST and extract all embedded spreadsheets. The script can be adapted to read local PST files or pull PST files from a different URL.
该演示使用 fetch
并需要 NodeJS 18 或更高版本。
¥The demo uses fetch
and requires NodeJS 18 or later.
-
初始化一个新项目:
¥Initialize a new project:
mkdir sheetjs-pst
cd sheetjs-pst
npm init -y
-
安装 SheetJS NodeJS 模块和
pst-extractor
:¥Install the SheetJS NodeJS module and
pst-extractor
:
- npm
- pnpm
- Yarn
npm i --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
pnpm install --save https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
yarn add https://cdn.sheetjs.com/xlsx-0.20.3/xlsx-0.20.3.tgz pst-extractor
-
下载
SheetJSPST.js
到项目文件夹中:¥Download
SheetJSPST.js
into project folder:
curl -LO https://xlsx.nodejs.cn/pst/SheetJSPST.js
-
运行脚本:
¥Run the script:
node SheetJSPST.js
该过程将获取 测试 PST 并提取嵌入的电子表格。终端将显示导出文件的信息。
¥The process will fetch the test PST and extract the embedded spreadsheets. The terminal will display info on the exported files.
以 saving file
开头的行显示附件如何对应于文件。以下行指出第一个附件(索引 0
)最初命名为 RedRockA.xls
,并保存到文件系统上的 file0.xls
:
¥Lines starting with saving file
show how attachments correspond to files. The
following line states that the first attachment (index 0
) was originally named
RedRockA.xls
and was saved to file0.xls
on the file system:
saving file 0 |RedRockA.xls| to file0.xls
以 ####
开头的行显示附件文件名和工作表名称。以下行说明文件 RedRockA.xls
中有一个名为 "Oneok at 2500"
的工作表:
¥Lines starting with ####
show the attachment file name and the worksheet name.
The following line explains that there is a worksheet named "Oneok at 2500"
in
the file RedRockA.xls
:
#### RedRockA.xls ! Oneok at 2500
每隔一行都是指定工作表中的一个 CSV 行。例如,RedRockA.xls
中的工作表 "Oneok at 2500"
的前四行如下所示:
¥Every other line is a CSV row from the named worksheet. For example, the first
four lines of worksheet "Oneok at 2500"
in RedRockA.xls
are shown below:
#### RedRockA.xls ! Oneok at 2500
RED ROCK EXPANSION PROJECT,,,,,,,,
,,,,,,,,
,,REQUESTED,REQUESTED,,,,,
,,RECEIPT,DELIVERY,,,Allocation,,
在线演示
¥Live Demo
该演示读取 PST 邮箱。由于浏览器限制,大于 100 MB 的 PST 文件可能会使浏览器崩溃。
¥This demo reads PST mailboxes. Due to browser limitations, PST files larger than 100 MB may crash the browser.
解析 PST 文件后,"附件" 表将列出文件中附加的 XLSX 和 XLS 电子表格。"preview" 链接将显示一个 HTML 表格,其中包含电子表格中的数据。"download" 链接将下载附件。
¥After parsing the PST file, the "Attachments" table will list attached XLSX and XLS spreadsheets in the file. The "preview" link will display a HTML table with the data in the spreadsheet. The "download" link will download the attachment.
测试文件 基于 "安然语料库" 的 EDRM 干净提取,并包含一些 XLS 附件。
¥The test file was based on the EDRM clean extract from the "Enron Corpus" and includes a few XLS attachments.
如果在线演示显示一条消息
¥If the live demo shows a message
Please reload the page
请刷新页面。这是文档生成器中的一个已知错误。
¥please refresh the page. This is a known bug in the documentation generator.
function SheetJSPreviewPSTSheets() { const [ files, setFiles ] = React.useState([]); const [ __html, setHTML ] = React.useState(""); /* recursively walk PST and collect attachments */ const walk = (f,arr) => { if(f.hasSubfolders) for(let sf of f.getSubFolders()) walk(sf,arr); if(f.contentCount <= 0) return; for(let e = f.getNextChild(); e != null; e = f.getNextChild()) { for(let i = 0; i < e.numberOfAttachments; ++i) { var a = e.getAttachment(i); /* XLS spreadsheet test by filename */ if(/.xls[xmb]?$/.test(a.filename)) arr.push(a); } } } /* collect data from the attachment into a "Buffer" */ const collect = (j) => { const strm = files[j].fileInputStream; const data = new PSTExtractor.Buffer(strm._length.low); strm.readCompletely(data); return data; } /* view selected attachment */ const view = (j) => { const data = collect(j); /* parse */ const wb = XLSX.read(data); /* convert first sheet to HTML */ const ws = wb.Sheets[wb.SheetNames[0]]; setHTML(XLSX.utils.sheet_to_html(ws)); } /* process array buffer */ const process_ab = (ab) => { const pst = new (PSTExtractor.PSTFile)(new PSTExtractor.Buffer(ab)); const data = []; walk(pst.getRootFolder(), data); setFiles(data); }; /* on click, fetch and process file */ const doit = async() => { const ab = await (await fetch("/pst/enron.pst")).arrayBuffer(); process_ab(ab); }; const chg = async(e) => process_ab(await e.target.files[0].arrayBuffer()); /* download selected attachment */ const dl = (j) => { const a = document.createElement("a"); a.download = files[j].filename; a.href = URL.createObjectURL(new Blob([collect(j)])); document.body.appendChild(a); a.click(); document.body.removeChild(a); } if(typeof PSTExtractor == "undefined") return <b>Please reload the page</b>; return ( <> <p>Use the file input to select a file, or click "Use a Sample PST"</p> <input type="file" accept=".pst" onChange={chg}/> <button onClick={doit}>Use a Sample PST!</button><br/><br/> <table><thead><tr><th colspan="3">Attachments</th></tr></thead> <tbody>{files.map((f,j) => ( <tr key={j}><th>{f.filename}</th> <td><a onClick={()=>view(j)}>(preview)</a></td> <td><a onClick={()=>dl(j)}>(download)</a></td> </tr> ))}</tbody> </table> <b>Preview of first worksheet</b><br/> <div dangerouslySetInnerHTML={{__html}}></div> </> ); }
[^1]: 该项目没有官方网站。官方 repository 托管在 GitHub 上。
¥The project has no official website. The official repository is hosted on GitHub.
[^2]: 提取的电子表格为 可以在 GitHub 上找到
¥Extracted spreadsheets are available on GitHub
[^3]: 请参阅 NodeJS 文档中的 fs.readFileSync
¥See fs.readFileSync
in the NodeJS documentation
[^4]: 见 read
于 "读取文件"
[^5]: 见 "工作簿对象"
¥See "Workbook Object"
[^6]: 见 sheet_to_csv
于 "CSV 和文本"
¥See sheet_to_csv
in "CSV and Text"
[^7]: 见 sheet_to_html
于 "实用工具"