Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It


Disclaimer: The code links provided for this paper are external links. Science Nest takes no responsibility for the accuracy, legality, or content of these links. By downloading this code, you agree to comply with the terms of use set out by the author(s) of the code.

Authors Matthew J. Denny, Arthur Spirling
Journal/Conference Name Political Analysis
Paper Category
Paper Abstract Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability that changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts. We make easy-to-use software available for this purpose.
preText software available: github.com/matthewjdenny/preText
First version: September 2016. This version: January 24, 2017.
Date of publication 2017
Code Programming Language R
Comment
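The sensitivity-analysis procedure described in the abstract is implemented in the authors' preText R package (linked above). A minimal sketch of the documented workflow follows; function names and arguments are taken from the preText vignette, and the corpus used here is purely illustrative, so adapt the inputs to your own data.

```r
# Install from CRAN if needed: install.packages("preText")
library(preText)
library(quanteda)  # supplies the illustrative inaugural-address corpus

# Any character vector or quanteda corpus of documents will work.
documents <- quanteda::data_corpus_inaugural

# Step 1: build document-term matrices under all 128 combinations of
# seven common preprocessing choices (punctuation, numbers, lowercasing,
# stemming, stopword removal, n-grams, infrequent-term removal).
preprocessed <- factorial_preprocessing(
  documents,
  use_ngrams = TRUE,
  infrequent_term_threshold = 0.2,
  verbose = FALSE)

# Step 2: compute preText scores, which characterize how unusual the
# pairwise document distances from each preprocessing specification are
# relative to the other specifications for this particular dataset.
results <- preText(
  preprocessed,
  dataset_name = "Inaugural speeches",
  distance_method = "cosine",
  num_comparisons = 50,
  verbose = FALSE)

# Visualize preText scores and per-step regression estimates to see
# which preprocessing decisions your findings are sensitive to.
preText_score_plot(results)
regression_coefficient_plot(results, remove_intercept = TRUE)
```

Higher preText scores flag preprocessing specifications that produce atypical results for the dataset at hand, which is the signal the paper suggests researchers inspect before committing to a single preprocessing regime.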

Copyright Researcher 2022