Stata 中的现代电子表格
Stata 是一个统计软件包。它提供了一个强大的基于 C 的扩展系统。
¥Stata is a statistical software package. It offers a robust C-based extension system.
SheetJS 是一个用于从电子表格读取和写入数据的 JavaScript 库。
¥SheetJS is a JavaScript library for reading and writing data from spreadsheets.
该演示使用 SheetJS 从电子表格中提取数据,以便在 Stata 中进行进一步分析。我们将创建一个 Stata 原生扩展,加载 Duktape JavaScript 引擎并使用 SheetJS 库从电子表格读取数据并转换为 Stata 友好的格式。
¥This demo uses SheetJS to pull data from a spreadsheet for further analysis within Stata. We'll create a Stata native extension that loads the Duktape JavaScript engine and uses the SheetJS library to read data from spreadsheets and converts to a Stata-friendly format.
该演示将读取 数字工作簿 并为每列生成变量。示例 Stata 会话如下所示:
¥The demo will read a Numbers workbook and generate variables for each column. A sample Stata session is shown below:
该演示涵盖了 Stata 扩展。为了直接处理 Stata DTA 文件,"Stata DTA 编解码器" 在浏览器或 NodeJS 中工作。
¥This demo covers Stata extensions. For directly processing Stata DTA files, the "Stata DTA Codec" works in the browser or NodeJS.
此演示由 SheetJS 用户在以下部署中进行了测试:
¥This demo was tested by SheetJS users in the following deployments:
架构 | 日期 |
---|---|
darwin-x64 | 2024-04-10 |
win10-x64 | 2024-04-10 |
linux-x64 | 2024-04-25 |
Stata 对通过 import excel
命令 [^1] 处理电子表格的支持有限。在撰写本文时,它缺乏对 XLSB、NUMBERS 和其他常见电子表格格式的支持。
¥Stata has limited support for processing spreadsheets through the import excel
command[^1]. At the time of writing, it lacked support for XLSB, NUMBERS, and
other common spreadsheet formats.
SheetJS 库通过将电子表格标准化为 Stata 可以理解的形式来帮助填补这一空白。
¥SheetJS libraries help fill the gap by normalizing spreadsheets to a form that Stata can understand.
集成详情
¥Integration Details
当前的建议涉及一个原生插件,该插件可以读取任意文件并生成 Stata 可以导入的干净的 XLSX 文件。
¥The current recommendation involves a native plugin that reads arbitrary files and generates clean XLSX files that Stata can import.
扩展函数最终将 SheetJS read
[^2] 和 write
[^3] 方法配对,从旧文件中读取数据并写入新文件:
¥The extension function ultimately pairs the SheetJS read
[^2] and write
[^3]
methods to read data from the old file and write a new file:
var wb = XLSX.read(original_file_data, {type: "buffer"});
var new_file_data = XLSX.write(wb, {type: "array", bookType: "xlsx"});
扩展函数 cleanfile
将采用一两个参数:
¥The extension function cleanfile
will take one or two arguments:
plugin call cleanfile, "pres.numbers"
将从第一个参数 ("pres.numbers"
) 生成 sheetjs.tmp.xlsx
并打印加载文件的指令。
¥plugin call cleanfile, "pres.numbers"
will generate sheetjs.tmp.xlsx
from
the first argument ("pres.numbers"
) and print instructions to load the file.
plugin call cleanfile, "pres.numbers" verbose
将另外打印工作簿中每个工作表的 CSV 内容。
¥plugin call cleanfile, "pres.numbers" verbose
will additionally print CSV
contents of each worksheet in the workbook.
C 扩展
¥C Extensions
Stata C 扩展是共享库或 DLL,它们使用特殊的 Stata 方法来解析参数和返回值。
¥Stata C extensions are shared libraries or DLLs that use special Stata methods for parsing arguments and returning values.
结构
¥Structure
参数被传递到插件中的 stata_call
函数。[^4] 函数接收参数计数和一个 C 字符串数组:
¥Arguments are passed to the stata_call
function in the plugin.[^4] The
function receives the argument count and an array of C strings:
STDLL stata_call(int argc, char *argv[]);
例如,以下命令中 argc
为 2,argv
有两个 C 字符串:
¥For example, argc
is 2 and argv
has two C strings in the following command:
plugin call cleanfile, "pres.numbers" verbose
* arguments start
* argv[0] ^^^^^^^^^^^^
* argv[1] ^^^^^^^
* argc = 2
沟通
¥Communication
SF_display
和 SF_error
分别显示文本和错误消息。
¥SF_display
and SF_error
display text and error messages respectively.
消息文本遵循 "Stata 标记和控制语言"[^5]。
¥Message text follows the "Stata Markup and Control Language"[^5].
{stata ...}
是一个特殊指令,用于显示参数并创建可单击的链接。单击该链接将运行该字符串。
¥{stata ...}
is a special directive that displays the arguments and creates a
clickable link. Clicking the link will run the string.
例如,插件可能会尝试打印链接:
¥For example, a plugin may attempt to print a link:
SF_display("{stata import excel \"sheetjs.tmp.xlsx\", firstrow} will read the first sheet and use headers\n");
该函数会将以下文本打印到终端:
¥The function will print the following text to the terminal:
import excel "sheetjs.tmp.xlsx", firstrow will read the first sheet and use headers
蓝色文本是可点击的。当用户点击文本时,命令 import excel "sheetjs.tmp.xlsx", firstrow
将被执行。
¥The blue text is clickable. When a user clicks the text, the command
import excel "sheetjs.tmp.xlsx", firstrow
will be executed.
Duktape JS 引擎
¥Duktape JS Engine
该演示使用 Duktape JavaScript 引擎。SheetJS + Duktape 演示更详细地介绍了引擎集成细节。
¥This demo uses the Duktape JavaScript engine. The SheetJS + Duktape demo covers engine integration details in more detail.
可以通过从文件系统读取源代码将 SheetJS 独立脚本 加载到 Duktape 中。
¥The SheetJS Standalone scripts can be loaded in Duktape by reading the source from the filesystem.
完整演示
¥Complete Demo
此演示在 Windows x64 和 macOS x64 上进行了测试。路径名称和构建命令在其他平台和操作系统中会有所不同。
¥This demo was tested in Windows x64 and macOS x64. The path names and build commands will differ in other platforms and operating systems.
cleanfile.c
扩展定义了一个插件函数。它可以与 import excel
链接:
¥The cleanfile.c
extension defines one plugin
function. It can be chained with import excel
:
program cleanfile, plugin
plugin call cleanfile, "pres.numbers" verbose
program drop cleanfile
import excel "sheetjs.tmp.xlsx", firstrow
创建插件
¥Create Plugin
- Linux/MacOS
- Windows
-
确保安装了兼容的 C 编译器(macOS 上的 Xcode)。
¥Ensure a compatible C compiler (Xcode on macOS) is installed.
-
打开 Stata 并运行以下命令:
¥Open Stata and run the following command:
pwd
输出将是默认数据目录。在 macOS 上,这通常是 ~/Documents/Stata
¥The output will be the default data directory. On macOS this is typically
~/Documents/Stata
-
打开终端窗口并在 Stata 数据目录中创建项目文件夹
sheetjs-stata
:¥Open a terminal window and create a project folder
sheetjs-stata
within the Stata data directory:
# `cd` to the Stata data directory
cd ~/Documents/Stata
mkdir sheetjs-stata
cd sheetjs-stata
-
确保已安装 "Linux 的 Windows 子系统" (WSL) 和 Visual Studio。
¥Ensure "Windows Subsystem for Linux" (WSL) and Visual Studio are installed.
-
打开一个新的 "x64 原生工具命令提示符" 窗口并创建一个项目文件夹
c:\sheetjs-stata
:¥Open a new "x64 Native Tools Command Prompt" window and create a project folder
c:\sheetjs-stata
:
cd c:\
mkdir sheetjs-stata
cd sheetjs-stata
-
输入 WSL:
¥Enter WSL:
bash
-
从 Stata 网站下载
stplugin.c
和stplugin.h
:¥Download
stplugin.c
andstplugin.h
from the Stata website:
curl -LO https://www.stata.com/plugins/stplugin.c
curl -LO https://www.stata.com/plugins/stplugin.h
-
下载 Duktape。在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。
¥Download Duktape. In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
mv duktape-2.7.0/src/*.{c,h} .
-
下载
cleanfile.c
。¥Download
cleanfile.c
.
在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。
¥In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://xlsx.nodejs.cn/stata/cleanfile.c
- Linux/MacOS
- Windows
-
观察 macOS 不需要 "Linux 子系统",然后转到步骤 7。
¥Observe that macOS does not need a "Linux Subsystem" and move to Step 7.
-
构建插件:
¥Build the plugin:
gcc -shared -fPIC -DSYSTEM=APPLEMAC stplugin.c duktape.c cleanfile.c -lm -std=c99 -Wall -ocleanfile.plugin
-
退出 WSL:
¥Exit WSL:
exit
窗口将返回到命令提示符。
¥The window will return to the command prompt.
-
构建 DLL:
¥Build the DLL:
cl /LD cleanfile.c stplugin.c duktape.c
安装插件
¥Install Plugin
- Linux/MacOS
- Windows
-
将插件复制到 Stata 数据目录:
¥Copy the plugin to the Stata data directory:
cp cleanfile.plugin ../
-
将 DLL 复制到 Stata 数据目录中的
cleanfile.plugin
。例如,对于共享数据目录c:\data
:¥Copy the DLL to
cleanfile.plugin
in the Stata data directory. For example, with a shared data directoryc:\data
:
mkdir c:\data
copy cleanfile.dll c:\data\cleanfile.plugin
下载 SheetJS 脚本
¥Download SheetJS Scripts
- Linux/MacOS
- Windows
-
移动到 Stata 数据目录:
¥Move to the Stata data directory:
cd ..
-
观察 macOS 不需要 "Linux 子系统",然后转到步骤 11。
¥Observe that macOS does not need a "Linux Subsystem" and move to Step 11.
-
移动到
c:\data
目录:¥Move to the
c:\data
directory:
cd c:\data
-
进入 WSL
¥Enter WSL
bash
-
下载 SheetJS 脚本和测试文件。
¥Download SheetJS scripts and the test file.
在 Windows 中,应在 WSL 中运行以下命令。在 macOS 中,命令应在同一终端会话中运行。
¥In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
curl -LO https://xlsx.nodejs.cn/pres.numbers
状态测试
¥Stata Test
简介中的屏幕截图显示了步骤 13 的结果 - 19
¥The screenshot in the introduction shows the result of steps 13 - 19
-
如果当前未运行,请启动 Stata 应用。
¥If it is not currently running, start the Stata application.
- Linux/MacOS
- Windows
-
在 Stata 中运行以下命令:
¥Run the following command in Stata:
dir
检查输出并确认 cleanfile.plugin
已列出。
¥Inspect the output and confirm that cleanfile.plugin
is listed.
-
移动到 Stata 中的
c:\data
目录:¥Move to the
c:\data
directory in Stata:
cd c:\data
-
加载
cleanfile
插件:¥Load the
cleanfile
plugin:
program cleanfile, plugin
-
读取
pres.numbers
测试文件:¥Read the
pres.numbers
test file:
plugin call cleanfile, "pres.numbers" verbose
结果将显示 pres.numbers
的数据:
¥The result will show the data from pres.numbers
:
. plugin call cleanfile, "pres.numbers" verbose Worksheet 0 Name: Sheet1 Name,Index Bill Clinton,42 GeorgeW Bush,43 Barack Obama,44 Donald Trump,45 Joseph Biden,46
Saved to
sheetjs.tmp.xlsx
import excel "sheetjs.tmp.xlsx", firstrow will read the first sheet and use headers for more help, see import excel
-
关闭插件:
¥Close the plugin:
program drop cleanfile
-
清除当前会话:
¥Clear the current session:
clear
In the result of Step 16, click the link on
import excel "sheetjs.tmp.xlsx", firstrow
或者,手动键入命令:
¥Alternatively, manually type the command:
import excel "sheetjs.tmp.xlsx", firstrow
输出将显示导入结果:
¥The output will show the import result:
. import excel "sheetjs.tmp.xlsx", firstrow (2 vars, 5 obs)
-
打开数据编辑器(在浏览或编辑模式下)并与屏幕截图进行比较:
¥Open the Data Editor (in Browse or Edit mode) and compare to the screenshot:
browse Name Index
在 Stata 的终端版本中,browse
不起作用:
¥In the terminal version of Stata, browse
does not work:
. browse Name Index
command browse is unrecognized
r(199);
codebook
命令将显示详细信息。
¥The codebook
command will display details.
Expected Output (click to show)
-------------------------------------------------------------------------------
Name Name
-------------------------------------------------------------------------------
Type: String (str12)
Unique values: 5 Missing "": 0/5
Tabulation: Freq. Value
1 "Barack Obama"
1 "Bill Clinton"
1 "Donald Trump"
1 "GeorgeW Bush"
1 "Joseph Biden"
Warning: Variable has embedded blanks.
-------------------------------------------------------------------------------
Index Index
-------------------------------------------------------------------------------
Type: Numeric (byte)
Range: [42,46] Units: 1
Unique values: 5 Missing .: 0/5
Tabulation: Freq. Value
1 42
1 43
1 44
1 45
1 46
[^1]: 在 Stata 中运行 help import excel
或参阅 Stata 文档中的 "导入 Excel"。
¥Run help import excel
in Stata or see "import excel" in the Stata documentation.
[^2]: 见 read
于 "读取文件"
[^3]: 见 write
于 "写入文件"
[^4]: 请参阅 Stata 网站中的 "创建和使用 Stata 插件"
¥See "Creating and using Stata plugins" in the Stata website
[^5]: 在 Stata 中运行 help smcl
或参阅 Stata 文档中的 "smcl"。
¥run help smcl
in Stata or see "smcl" in the Stata documentation.