Skip to main content

Stata 中的现代电子表格

Stata 是一个统计软件包。它提供了一个强大的基于 C 的扩展系统。

¥Stata is a statistical software package. It offers a robust C-based extension system.

SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。

¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.

该演示使用 SheetJS 从电子表格中提取数据,以便在 Stata 中进行进一步分析。我们将创建一个 Stata 原生扩展,加载 Duktape JavaScript 引擎并使用 SheetJS 库从电子表格读取数据并转换为 Stata 友好的格式。

¥This demo uses SheetJS to pull data from a spreadsheet for further analysis within Stata. We'll create a Stata native extension that loads the Duktape JavaScript engine and uses the SheetJS library to read data from spreadsheets and converts to a Stata-friendly format.

该演示将读取 数字工作簿 并为每列生成变量。示例 Stata 会话如下所示:

¥The demo will read a Numbers workbook and generate variables for each column. A sample Stata session is shown below:

Stata commands

该演示涵盖了 Stata 扩展。为了直接处理 Stata DTA 文件,"Stata DTA 编解码器" 在浏览器或 NodeJS 中工作。

¥This demo covers Stata extensions. For directly processing Stata DTA files, the "Stata DTA Codec" works in the browser or NodeJS.

测试部署

此演示由 SheetJS 用户在以下部署中进行了测试:

¥This demo was tested by SheetJS users in the following deployments:

架构日期
darwin-x642024-04-10
win10-x642024-04-10
linux-x642024-04-25

Stata 对通过 import excel 命令 [^1] 处理电子表格的支持有限。在撰写本文时,它缺乏对 XLSB、NUMBERS 和其他常见电子表格格式的支持。

¥Stata has limited support for processing spreadsheets through the import excel command[^1]. At the time of writing, it lacked support for XLSB, NUMBERS, and other common spreadsheet formats.

SheetJS 库通过将电子表格标准化为 Stata 可以理解的形式来帮助填补这一空白。

¥SheetJS libraries help fill the gap by normalizing spreadsheets to a form that Stata can understand.

集成详情

¥Integration Details

当前的建议涉及一个原生插件,该插件可以读取任意文件并生成 Stata 可以导入的干净的 XLSX 文件。

¥The current recommendation involves a native plugin that reads arbitrary files and generates clean XLSX files that Stata can import.

扩展函数最终将 SheetJS read[^2] 和 write[^3] 方法配对,从旧文件中读取数据并写入新文件:

¥The extension function ultimately pairs the SheetJS read[^2] and write[^3] methods to read data from the old file and write a new file:

var wb = XLSX.read(original_file_data, {type: "buffer"});
var new_file_data = XLSX.write(wb, {type: "array", bookType: "xlsx"});

扩展函数 cleanfile 将采用一两个参数:

¥The extension function cleanfile will take one or two arguments:

plugin call cleanfile, "pres.numbers" 将从第一个参数 ("pres.numbers") 生成 sheetjs.tmp.xlsx 并打印加载文件的指令。

¥plugin call cleanfile, "pres.numbers" will generate sheetjs.tmp.xlsx from the first argument ("pres.numbers") and print instructions to load the file.

plugin call cleanfile, "pres.numbers" verbose 将另外打印工作簿中每个工作表的 CSV 内容。

¥plugin call cleanfile, "pres.numbers" verbose will additionally print CSV contents of each worksheet in the workbook.

C 扩展

¥C Extensions

Stata C 扩展是共享库或 DLL,它们使用特殊的 Stata 方法来解析参数和返回值。

¥Stata C extensions are shared libraries or DLLs that use special Stata methods for parsing arguments and returning values.

结构

¥Structure

参数被传递到插件中的 stata_call 函数。[^4] 函数接收参数计数和一个 C 字符串数组:

¥Arguments are passed to the stata_call function in the plugin.[^4] The function receives the argument count and an array of C strings:

stata_call declaration
STDLL stata_call(int argc, char *argv[]);

例如,以下命令中 argc 为 2,argv 有两个 C 字符串:

¥For example, argc is 2 and argv has two C strings in the following command:

plugin call cleanfile, "pres.numbers" verbose

* arguments start

* argv[0] ^^^^^^^^^^^^

* argv[1] ^^^^^^^

* argc = 2

沟通

¥Communication

SF_displaySF_error 分别显示文本和错误消息。

¥SF_display and SF_error display text and error messages respectively.

消息文本遵循 "Stata 标记和控制语言"[^5]。

¥Message text follows the "Stata Markup and Control Language"[^5].

{stata ...} 是一个特殊指令,用于显示参数并创建可单击的链接。单击该链接将运行该字符串。

¥{stata ...} is a special directive that displays the arguments and creates a clickable link. Clicking the link will run the string.

例如,插件可能会尝试打印链接:

¥For example, a plugin may attempt to print a link:

SF_display C plugin example
SF_display("{stata import excel \"sheetjs.tmp.xlsx\", firstrow} will read the first sheet and use headers\n");

该函数会将以下文本打印到终端:

¥The function will print the following text to the terminal:

import excel "sheetjs.tmp.xlsx", firstrow will read the first sheet and use headers

蓝色文本是可点击的。当用户点击文本时,命令 import excel "sheetjs.tmp.xlsx", firstrow 将被执行。

¥The blue text is clickable. When a user clicks the text, the command import excel "sheetjs.tmp.xlsx", firstrow will be executed.

Duktape JS 引擎

¥Duktape JS Engine

该演示使用 Duktape JavaScript 引擎。SheetJS + Duktape 演示更详细地介绍了引擎集成细节。

¥This demo uses the Duktape JavaScript engine. The SheetJS + Duktape demo covers engine integration details in more detail.

可以通过从文件系统读取源代码将 SheetJS 独立脚本 加载到 Duktape 中。

¥The SheetJS Standalone scripts can be loaded in Duktape by reading the source from the filesystem.

完整演示

¥Complete Demo

此演示在 Windows x64 和 macOS x64 上进行了测试。路径名称和构建命令在其他平台和操作系统中会有所不同。

¥This demo was tested in Windows x64 and macOS x64. The path names and build commands will differ in other platforms and operating systems.

cleanfile.c 扩展定义了一个插件函数。它可以与 import excel 链接:

¥The cleanfile.c extension defines one plugin function. It can be chained with import excel:

program cleanfile, plugin
plugin call cleanfile, "pres.numbers" verbose
program drop cleanfile
import excel "sheetjs.tmp.xlsx", firstrow

创建插件

¥Create Plugin

  1. 确保安装了兼容的 C 编译器(macOS 上的 Xcode)。

    ¥Ensure a compatible C compiler (Xcode on macOS) is installed.

  2. 打开 Stata 并运行以下命令:

    ¥Open Stata and run the following command:

pwd

输出将是默认数据目录。在 macOS 上,这通常是 ~/Documents/Stata

¥The output will be the default data directory. On macOS this is typically ~/Documents/Stata

  1. 打开终端窗口并在 Stata 数据目录中创建项目文件夹 sheetjs-stata

    ¥Open a terminal window and create a project folder sheetjs-stata within the Stata data directory:

# `cd` to the Stata data directory
cd ~/Documents/Stata
mkdir sheetjs-stata
cd sheetjs-stata
  1. 从 Stata 网站下载 stplugin.cstplugin.h

    ¥Download stplugin.c and stplugin.h from the Stata website:

curl -LO https://www.stata.com/plugins/stplugin.c
curl -LO https://www.stata.com/plugins/stplugin.h
  1. 下载 Duktape。在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。

    ¥Download Duktape. In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.

curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
mv duktape-2.7.0/src/*.{c,h} .
  1. 下载 cleanfile.c

    ¥Download cleanfile.c.

在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。

¥In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.

curl -LO https://xlsx.nodejs.cn/stata/cleanfile.c
  1. 观察 macOS 不需要 "Linux 子系统",然后转到步骤 7。

    ¥Observe that macOS does not need a "Linux Subsystem" and move to Step 7.

  2. 构建插件:

    ¥Build the plugin:

gcc -shared -fPIC -DSYSTEM=APPLEMAC stplugin.c duktape.c cleanfile.c -lm -std=c99 -Wall -ocleanfile.plugin

安装插件

¥Install Plugin

  1. 将插件复制到 Stata 数据目录:

    ¥Copy the plugin to the Stata data directory:

cp cleanfile.plugin ../

下载 SheetJS 脚本

¥Download SheetJS Scripts

  1. 移动到 Stata 数据目录:

    ¥Move to the Stata data directory:

cd ..
  1. 观察 macOS 不需要 "Linux 子系统",然后转到步骤 11。

    ¥Observe that macOS does not need a "Linux Subsystem" and move to Step 11.

  1. 下载 SheetJS 脚本和测试文件。

    ¥Download SheetJS scripts and the test file.

在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。

¥In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.

curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
curl -LO https://xlsx.nodejs.cn/pres.numbers

状态测试

¥Stata Test

简介中的屏幕截图显示了步骤 13 的结果 - 19

¥The screenshot in the introduction shows the result of steps 13 - 19

  1. 如果当前未运行,请启动 Stata 应用。

    ¥If it is not currently running, start the Stata application.

  1. 在 Stata 中运行以下命令:

    ¥Run the following command in Stata:

dir

检查输出并确认 cleanfile.plugin 已列出。

¥Inspect the output and confirm that cleanfile.plugin is listed.

  1. 加载 cleanfile 插件:

    ¥Load the cleanfile plugin:

program cleanfile, plugin
  1. 读取 pres.numbers 测试文件:

    ¥Read the pres.numbers test file:

plugin call cleanfile, "pres.numbers" verbose

结果将显示 pres.numbers 的数据:

¥The result will show the data from pres.numbers:

. plugin call cleanfile, "pres.numbers" verbose Worksheet 0 Name: Sheet1 Name,Index Bill Clinton,42 GeorgeW Bush,43 Barack Obama,44 Donald Trump,45 Joseph Biden,46

Saved to sheetjs.tmp.xlsx import excel "sheetjs.tmp.xlsx", firstrow will read the first sheet and use headers for more help, see import excel

  1. 关闭插件:

    ¥Close the plugin:

program drop cleanfile
  1. 清除当前会话:

    ¥Clear the current session:

clear
  1. In the result of Step 16, click the link on import excel "sheetjs.tmp.xlsx", firstrow

或者,手动键入命令:

¥Alternatively, manually type the command:

import excel "sheetjs.tmp.xlsx", firstrow

输出将显示导入结果:

¥The output will show the import result:

. import excel "sheetjs.tmp.xlsx", firstrow (2 vars, 5 obs)

  1. 打开数据编辑器(在浏览或编辑模式下)并与屏幕截图进行比较:

    ¥Open the Data Editor (in Browse or Edit mode) and compare to the screenshot:

browse Name Index

Data Editor showing data from the file

在 Stata 的终端版本中,browse 不起作用:

¥In the terminal version of Stata, browse does not work:

. browse Name Index
command browse is unrecognized
r(199);

codebook 命令将显示详细信息。

¥The codebook command will display details.

Expected Output (click to show)
-------------------------------------------------------------------------------
Name Name
-------------------------------------------------------------------------------

Type: String (str12)

Unique values: 5 Missing "": 0/5

Tabulation: Freq. Value
1 "Barack Obama"
1 "Bill Clinton"
1 "Donald Trump"
1 "GeorgeW Bush"
1 "Joseph Biden"

Warning: Variable has embedded blanks.

-------------------------------------------------------------------------------
Index Index
-------------------------------------------------------------------------------

Type: Numeric (byte)

Range: [42,46] Units: 1
Unique values: 5 Missing .: 0/5

Tabulation: Freq. Value
1 42
1 43
1 44
1 45
1 46

[^1]: 在 Stata 中运行 help import excel 或参阅 Stata 文档中的 "导入 Excel"

¥Run help import excel in Stata or see "import excel" in the Stata documentation.

[^2]: 见 read 于 "读取文件"

¥See read in "Reading Files"

[^3]: 见 write 于 "写入文件"

¥See write in "Writing Files"

[^4]: 请参阅 Stata 网站中的 "创建和使用 Stata 插件"

¥See "Creating and using Stata plugins" in the Stata website

[^5]: 在 Stata 中运行 help smcl 或参阅 Stata 文档中的 "smcl"

¥run help smcl in Stata or see "smcl" in the Stata documentation.