Methodology
Showing new listings for Friday, 28 March 2025
- [1] arXiv:2503.20882 [pdf, other]
Title: Assessing Bias and Precision in State Policy Evaluations: A Comparative Analysis of Time-Varying Estimators Using Policy Simulations
Authors: Max Griswold, Beth Ann Griffin, Max Rubinstein, Mincen Liu, Megan Schuler, Elizabeth Stone, Pedro Nascimento de Lima, Bradley D. Stein, Elizabeth A. Stuart
Subjects: Methodology (stat.ME)
Using state-level opioid overdose mortality data from 1999-2016, we simulated four time-varying treatment scenarios, which correspond to real-world policy dynamics (ramp up, ramp down, temporary and inconsistent). We then evaluated seven commonly used policy evaluation methods: two-way fixed effects event study, debiased autoregressive model, augmented synthetic control, difference-in-differences with staggered adoption, event study with heterogeneous treatment, two-stage differences-in-differences and differences-in-differences imputation. Statistical performance was assessed by comparing bias, standard errors, coverage, and root mean squared error over 1,000 simulations.
Results: Our findings indicate that estimator performance varied across policy scenarios. In settings where policy effectiveness diminished over time, synthetic control methods recovered effects with lower bias and higher variance. Difference-in-differences approaches, while offering reasonable coverage under some scenarios, struggled when effects were non-monotonic. Autoregressive methods, although demonstrating lower variability, underestimated uncertainty. Overall, a clear bias-variance tradeoff emerged, underscoring that no single method uniformly excelled across scenarios.
Conclusions: This study highlights the importance of tailoring the choice of estimator to the expected trajectory of policy effects. In dynamic time-varying settings, particularly when a policy has an anticipated diminishing impact, methods like augmented synthetic controls may offer advantages despite reduced precision. Researchers should carefully consider these tradeoffs to ensure robust and credible state-policy evaluations.
- [2] arXiv:2503.20935 [pdf, html, other]
Title: Sensitivity analysis for nonignorable missing values in blended analysis framework: a study on the effect of bariatric surgery via electronic health records
Comments: 36 pages, 11 figures
Subjects: Methodology (stat.ME)
This paper establishes a series of sensitivity analyses to investigate the impact of missing values in electronic health records (EHRs) that are possibly missing not at random (MNAR). EHRs have gained tremendous interest due to their cost-effectiveness, but their use in research involves numerous challenges, such as selection bias due to missing data. The blended analysis has been suggested to overcome such challenges; it decomposes the data provenance into a sequence of sub-mechanisms and uses a combination of inverse-probability weighting (IPW) and multiple imputation (MI) under the missing at random (MAR) assumption. In this paper, we extend the blended analysis to the MNAR setting and present a sensitivity analysis framework to investigate the effect of MNAR missing values on the analysis results. We illustrate the performance of our proposed framework via numerical studies and conclude with strategies for interpreting the results of sensitivity analyses. In addition, we present an application of our framework to the DURABLE data set, an EHR from a study examining long-term outcomes of patients who underwent bariatric surgery.
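As a rough illustration of this kind of MNAR sensitivity analysis (not the blended-analysis framework itself), the sketch below tilts a fitted missing-at-random propensity by a fixed sensitivity parameter delta times the outcome and tracks how an inverse-probability-weighted mean moves; all names and the simulated data are my own.

```python
# Hypothetical illustration: exponential-tilting sensitivity analysis for an
# IPW mean under MNAR. Not the paper's blended-analysis framework.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)           # outcome of interest
# Missingness depends on y itself -> MNAR.
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 0.3 * x - 0.8 * y)))
r = rng.binomial(1, p_obs)                        # r = 1 if y is observed

# Step 1: fit a MAR missingness model P(R = 1 | X) only.
mar_fit = LogisticRegression().fit(x.reshape(-1, 1), r)
logit_pi = mar_fit.decision_function(x.reshape(-1, 1))

# Step 2: tilt the fitted logit by delta * y (delta is a fixed sensitivity
# parameter, not estimated) and recompute the IPW estimate of E[Y].
def ipw_mean(delta):
    tilted = 1.0 / (1.0 + np.exp(-(logit_pi[r == 1] + delta * y[r == 1])))
    w = 1.0 / tilted
    return np.sum(w * y[r == 1]) / np.sum(w)

for delta in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(f"delta = {delta:+.1f}  ->  IPW estimate of E[Y] = {ipw_mean(delta):.3f}")
print(f"complete-case mean = {y[r == 1].mean():.3f}, true mean = {y.mean():.3f}")
```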
- [3] arXiv:2503.20940 [pdf, html, other]
Title: A Restricted Latent Class Hidden Markov Model for Polytomous Responses, Polytomous Attributes, and Covariates: Identifiability and Application
Comments: 38 pages, 2 figures, 13 tables
Subjects: Methodology (stat.ME)
We introduce a restricted latent class exploratory model for longitudinal data with ordinal attributes and respondent-specific covariates. Responses follow a hidden Markov model where the probability of a particular latent state at a time point is conditional on values at the previous time point of the respondent's covariates and latent state. We prove that the model is identifiable, state a Bayesian formulation, and demonstrate its efficacy in a variety of scenarios through a simulation study. As a real-world demonstration, we apply the model to response data from a mathematics examination, and compare the results to a previously published confirmatory analysis.
- [4] arXiv:2503.20962 [pdf, html, other]
Title: Probabilistic Downscaling for Flood Hazard Models
Subjects: Methodology (stat.ME); Applications (stat.AP)
Riverine flooding poses significant risks. Developing strategies to manage flood risks requires flood projections at decision-relevant scales with well-characterized uncertainties, often at high spatial resolutions. However, calibrating high-resolution flood models can be computationally prohibitive. To address this challenge, we propose a probabilistic downscaling approach that maps low-resolution model projections onto higher-resolution grids. The existing literature presents two distinct types of downscaling approaches: (1) probabilistic methods, which are versatile and applicable across various physics-based models, and (2) deterministic downscaling methods, specifically tailored for flood hazard models. Both types of downscaling approaches come with their own set of mutually exclusive advantages. Here we introduce a new approach, PDFlood, that combines the advantages of existing probabilistic and flood model-specific downscaling approaches, mainly (1) spatial flooding probabilities and (2) improved accuracy from approximating physical processes. Compared to the state-of-the-art deterministic downscaling approach for flood hazard models, PDFlood allows users to consider previously neglected uncertainties while providing comparable accuracy, thereby better informing the design of risk management strategies. While we develop PDFlood for flood models, the general concepts translate to other applications such as wildfire models.
- [5] arXiv:2503.20965 [pdf, html, other]
Title: Least Squares as Random Walks
Comments: to appear in Physics Letters A
Subjects: Methodology (stat.ME); Statistical Mechanics (cond-mat.stat-mech); Probability (math.PR); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
Linear least squares (LLS) is perhaps the most common method of data analysis, dating back to Legendre, Gauss and Laplace. Framed as linear regression, LLS is also a backbone of mathematical statistics. Here we report on an unexpected new connection between LLS and random walks. To that end, we introduce the notion of a random walk based on a discrete sequence of data samples (data walk). We show that the slope of a straight line which annuls the net area under a residual data walk equals the one found by LLS. For equidistant data samples this result is exact and holds for an arbitrary distribution of steps.
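The stated result is easy to check numerically. The sketch below (my own, not from the paper) picks the intercept so the residual data walk returns to zero, finds the slope that annuls the net area under the walk, and compares it with the ordinary least-squares slope for equidistant samples.

```python
# Numerical check (my own sketch): for equidistant samples, the slope that
# annuls the net area under the residual data walk matches the OLS slope.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n = 200
x = np.arange(1, n + 1, dtype=float)              # equidistant samples
y = 2.0 + 0.3 * x + rng.standard_normal(n) * 5.0  # arbitrary step distribution

def net_area(slope):
    # Intercept chosen so the residual walk ends at zero (residuals sum to 0).
    intercept = y.mean() - slope * x.mean()
    walk = np.cumsum(y - intercept - slope * x)   # residual data walk W_k
    return walk.sum()                             # discrete "net area" under W

slope_walk = brentq(net_area, -10.0, 10.0)        # root: net area = 0
slope_ols = np.polyfit(x, y, 1)[0]
print(slope_walk, slope_ols)                      # agree to numerical precision
```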
- [6] arXiv:2503.21081 [pdf, html, other]
Title: Leveraging External Controls in Clinical Trials: Estimands, Estimation, Assumptions
Subjects: Methodology (stat.ME)
It is increasingly common to augment randomized controlled trials with external controls from observational data to evaluate the treatment effect of an intervention. Traditional approaches to treatment effect estimation involve ambiguous estimands and unrealistic or strong assumptions, such as mean exchangeability. We introduce a double-indexed notation for potential outcomes to define causal estimands transparently and clarify distinct sources of implicit bias. We show that the concurrent control arm is critical in assessing the plausibility of assumptions and providing unbiased causal estimation. We derive a consistent and locally efficient estimator for a class of weighted average treatment effect estimands that combines concurrent and external data without assuming mean exchangeability. This estimator incorporates an estimate of the systematic difference in outcomes between the concurrent and external units, which we propose to obtain via a Frisch-Waugh-Lovell style partial regression method. We compare the proposed methods with existing methods using extensive simulations and apply them to cardiovascular clinical trials.
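For readers unfamiliar with the partial-regression idea invoked here, the sketch below illustrates the generic Frisch-Waugh-Lovell result on simulated data (it is not the paper's estimator): the coefficient on an external-source indicator from a full regression equals the slope from regressing residuals on residuals after partialling out the covariates.

```python
# Generic Frisch-Waugh-Lovell partial-regression sketch (not the paper's
# estimator); all variable names are my own.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + covariates
s = rng.binomial(1, 0.4, size=n).astype(float)              # 1 = external control
y = X @ np.array([1.0, 0.5, -0.2]) + 0.7 * s + rng.normal(size=n)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Full regression: y ~ X + s; keep the coefficient on s.
full_coef = ols(np.column_stack([X, s]), y)[-1]

# FWL: residualize both y and s on X, then regress residual on residual.
y_res = y - X @ ols(X, y)
s_res = s - X @ ols(X, s)
fwl_coef = (s_res @ y_res) / (s_res @ s_res)

print(full_coef, fwl_coef)   # identical up to numerical precision
```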
- [7] arXiv:2503.21091 [pdf, html, other]
Title: Integrate Meta-analysis into Specific Study (InMASS) for Estimating Conditional Average Treatment Effect
Comments: 37 pages, 12 figures
Subjects: Methodology (stat.ME)
Randomized controlled trials are the standard method for estimating causal effects, ensuring sufficient statistical power and confidence through adequate sample sizes. However, achieving such sample sizes is often challenging. This study proposes a novel method for estimating the average treatment effect (ATE) in a target population by integrating and reconstructing information from previous trials using only summary statistics of outcomes and covariates through meta-analysis. The proposed approach combines meta-analysis, transfer learning, and weighted regression. Unlike existing methods that estimate the ATE based on the distribution of source trials, our method directly estimates the ATE for the target population. The proposed method requires only the means and variances of outcomes and covariates from the source trials and is theoretically valid under the covariate shift assumption, regardless of the covariate distribution in the source trials. Simulations and real-data analyses demonstrate that the proposed method yields a consistent estimator and achieves higher statistical power than the estimator derived solely from the target trial.
- [8] arXiv:2503.21358 [pdf, html, other]
Title: Inference in stochastic differential equations using the Laplace approximation: Demonstration and examples
Comments: 25 pages, 6 figures, 2 tables
Subjects: Methodology (stat.ME); Probability (math.PR)
We consider the problem of estimating states and parameters in a model based on a system of coupled stochastic differential equations, based on noisy discrete-time data. Special attention is given to nonlinear dynamics and state-dependent diffusivity, where transition densities are not available in closed form. Our technique adds states between times of observations, approximates transition densities using, e.g., the Euler-Maruyama method and eliminates unobserved states using the Laplace approximation. Using case studies, we demonstrate that transition probabilities are well approximated, and that inference is computationally feasible. We discuss limitations and potential extensions of the method.
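A minimal sketch of the building block mentioned above, assuming a scalar SDE with hypothetical drift and diffusion functions: the Euler-Maruyama approximation treats each transition as Gaussian, and it is these densities over imputed intermediate states that the Laplace approximation integrates out.

```python
# Minimal sketch of the Euler-Maruyama transition-density approximation for a
# scalar SDE dX = f(X) dt + g(X) dW (my own illustration, not the paper's code).
import numpy as np

def f(x):            # drift, e.g. a mean-reverting term
    return -x

def g(x):            # state-dependent diffusivity
    return 0.5 + 0.1 * x**2

def em_log_transition(x_next, x_curr, dt):
    """log p(x_next | x_curr) under the Euler-Maruyama Gaussian approximation."""
    mean = x_curr + f(x_curr) * dt
    var = g(x_curr) ** 2 * dt
    return -0.5 * (np.log(2.0 * np.pi * var) + (x_next - mean) ** 2 / var)

# Subdividing an observation interval into finer steps tightens the
# approximation; the unobserved intermediate states are then integrated out.
print(em_log_transition(0.1, 0.0, dt=0.01))
```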
- [9] arXiv:2503.21388 [pdf, other]
Title: Simulation-based assessment of a Bayesian survival model with flexible baseline hazard and time-dependent effects
Subjects: Methodology (stat.ME); Computation (stat.CO)
There is increasing interest in flexible parametric models for the analysis of time-to-event data, yet Bayesian approaches that allow the incorporation of prior knowledge remain underused. A flexible Bayesian parametric model has recently been proposed that uses M-splines to model the hazard function. We conducted a simulation study to assess the statistical performance of this model, which is implemented in the survextrap R package. Our simulation uses data-generating mechanisms for realistic survival data based on two oncology clinical trials. Statistical performance is compared across a range of flexible models, varying the M-spline specification, smoothing procedure, priors, and other computational settings. We demonstrate good performance across realistic scenarios, including good fit of complex baseline hazard functions and time-dependent covariate effects. This work helps inform key considerations to guide model selection, as well as identify appropriate default model settings in the software that should perform well in a broad range of applications.
- [10] arXiv:2503.21428 [pdf, html, other]
Title: Compositional Outcomes and Environmental Mixtures: the Dirichlet Bayesian Weighted Quantile Sum Regression
Subjects: Methodology (stat.ME); Applications (stat.AP)
Environmental mixture approaches do not accommodate compositional outcomes, which consist of vectors constrained onto the unit simplex. This limitation poses challenges in effectively evaluating the associations between multiple concurrent environmental exposures and their respective impacts on this type of outcome. As a result, there is a pressing need for analytical methods that can more accurately assess the complexity of these relationships. Here, we extend the Bayesian weighted quantile sum regression (BWQS) framework to jointly model compositional outcomes and environmental mixtures using a Dirichlet distribution with a multinomial logit link function. The proposed approach, named Dirichlet-BWQS (DBWQS), allows for the simultaneous estimation of the mixture weights associated with each exposure mixture component as well as the association between the overall exposure mixture index and each outcome proportion. We assess the performance of DBWQS regression on extensive simulated data and in a real scenario where we investigate the associations between environmental chemical mixtures and DNA methylation-derived placental cell composition, using publicly available data (GSE75248). We also compare our findings with results from approaches that consider environmental mixtures and each outcome component separately. Finally, we developed an R package, xbwqs, which makes our proposed method publicly available (this https URL).
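The sketch below is a hedged, generative illustration of the ingredients described (quantile-scored exposures, simplex-constrained weights, a multinomial-logit link, and Dirichlet-distributed compositional outcomes); it simulates from such a model rather than reproducing the authors' Bayesian fitting procedure, and all names and parameter values are hypothetical.

```python
# Generative sketch of a Dirichlet / weighted-quantile-sum style model (my own
# illustration of the ingredients, not the authors' fitting procedure).
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 500, 5, 3                          # subjects, exposures, outcome parts

# Exposures converted to quantile scores (0..3 for quartiles).
expo = rng.lognormal(size=(n, p))
q = np.column_stack([np.digitize(expo[:, j], np.quantile(expo[:, j], [0.25, 0.5, 0.75]))
                     for j in range(p)]).astype(float)

w = np.array([0.4, 0.3, 0.2, 0.05, 0.05])    # mixture weights on the simplex
wqs_index = q @ w                             # weighted quantile sum index

# Multinomial-logit link: last outcome component is the reference category.
beta0 = np.array([0.2, -0.1])
beta1 = np.array([0.5, -0.3])
eta = np.column_stack([beta0 + beta1 * s for s in wqs_index]).T   # (n, d-1)
eta = np.column_stack([eta, np.zeros(n)])
mu = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)         # Dirichlet means

phi = 30.0                                    # precision parameter
Y = np.vstack([rng.dirichlet(phi * m) for m in mu])
print(Y[:3])                                  # compositional outcomes on the simplex
```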
- [11] arXiv:2503.21443 [pdf, html, other]
Title: Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRI
Authors: Felix Terhag, Philipp Knechtges, Achim Basermann, Anja Bach, Darius Gerlach, Jens Tank, Raúl Tempone
Subjects: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Statistics Theory (math.ST); Applications (stat.AP)
Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable.
To address this challenge, this work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via the type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance.
This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).
- [12] arXiv:2503.21534 [pdf, html, other]
Title: Inequality Restricted Minimum Density Power Divergence Estimation in Panel Count Data
Comments: 35 pages, 12 figures, 7 tables
Subjects: Methodology (stat.ME); Applications (stat.AP)
Analysis of panel count data has garnered a considerable amount of attention in the literature, leading to the development of multiple statistical techniques. In inferential analysis, most of the works focus on leveraging estimating equations-based techniques or conventional maximum likelihood estimation. However, the robustness of these methods is largely questionable. In this paper, we present the robust density power divergence estimation for panel count data arising from nonhomogeneous Poisson processes, correlated through a latent frailty variable. In order to cope with real-world incidents, it is often desired to impose certain inequality constraints on the parameter space, giving rise to the restricted minimum density power divergence estimator. The significant contribution of this study lies in deriving its asymptotic properties. The proposed method ensures high efficiency in the model estimation while providing reliable inference despite data contamination. Moreover, the density power divergence measure is governed by a tuning parameter \(\gamma\), which controls the trade-off between robustness and efficiency. To effectively determine the optimal value of \(\gamma\), this study employs a generalized score-matching technique, marking considerable progress in the data analysis. Simulation studies and real data examples are provided to illustrate the performance of the estimator and to substantiate the theory developed.
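For orientation, the sketch below shows the generic minimum density power divergence objective with tuning parameter gamma for a plain i.i.d. Poisson sample; it is far simpler than the paper's constrained panel-count frailty model, and all names are my own.

```python
# Sketch of the minimum density power divergence (DPD) objective with tuning
# parameter gamma, for an i.i.d. Poisson sample (my own illustration; no
# inequality restrictions or frailty structure as in the paper).
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=300)
x[:15] = 50                                    # inject outliers to contaminate

def dpd_objective(lam, gamma, support=np.arange(0, 200)):
    pmf_support = poisson.pmf(support, lam)
    term1 = np.sum(pmf_support ** (1.0 + gamma))            # "integral" term
    term2 = (1.0 + 1.0 / gamma) * np.mean(poisson.pmf(x, lam) ** gamma)
    return term1 - term2

for gamma in [0.1, 0.3, 0.5]:
    fit = minimize_scalar(dpd_objective, bounds=(0.1, 20.0), args=(gamma,),
                          method="bounded")
    print(f"gamma = {gamma}: lambda_hat = {fit.x:.2f}")     # robust to outliers
print(f"MLE (sample mean, non-robust): {x.mean():.2f}")
```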
- [13] arXiv:2503.21715 [pdf, html, other]
Title: A Powerful Bootstrap Test of Independence in High Dimensions
Subjects: Methodology (stat.ME); Econometrics (econ.EM)
This paper proposes a nonparametric test of independence of one random variable from a large pool of other random variables. The test statistic is the maximum of several Chatterjee's rank correlations and critical values are computed via a block multiplier bootstrap. The test is shown to asymptotically control size uniformly over a large class of data-generating processes, even when the number of variables is much larger than sample size. The test is consistent against any fixed alternative. It can be combined with a stepwise procedure for selecting those variables from the pool that violate independence, while controlling the family-wise error rate. All formal results leave the dependence among variables in the pool completely unrestricted. In simulations, we find that our test is very powerful, outperforming existing tests in most scenarios considered, particularly in high dimensions and/or when the variables in the pool are dependent.
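The building block of the test statistic, Chatterjee's rank correlation, is simple to compute; the sketch below (my own) evaluates it for each variable in a pool and takes the maximum, omitting the block multiplier bootstrap used for critical values.

```python
# Sketch of the building block of the test statistic: Chatterjee's rank
# correlation of Y with each variable in the pool, maximized over the pool.
# The block multiplier bootstrap calibration is omitted here.
import numpy as np

def chatterjee_xi(x, y, rng):
    n = len(x)
    order = np.lexsort((rng.random(n), x))        # sort by x, random tie-breaking
    y_sorted = y[order]
    r = np.array([(y_sorted <= yi).sum() for yi in y_sorted])   # ranks of y
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)    # assumes no ties in y

rng = np.random.default_rng(5)
n, d = 500, 50
X = rng.normal(size=(n, d))                       # pool of candidate variables
y = np.sin(2.0 * X[:, 7]) + 0.3 * rng.normal(size=n)   # depends only on column 7

xi_all = np.array([chatterjee_xi(X[:, j], y, rng) for j in range(d)])
print("max statistic:", xi_all.max(), "achieved at column", xi_all.argmax())
```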
- [14] arXiv:2503.21719 [pdf, html, other]
Title: The Principle of Redundant Reflection
Comments: 8 pages, 0 figures
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)
The fact that redundant information does not change a rational belief after Bayesian updating implies uniqueness of Bayes rule. In fact, any updating rule is uniquely specified by this principle. This is true for the classical setting, as well as settings with improper or continuous priors. We prove this result and illustrate it with two examples.
New submissions (showing 14 of 14 entries)
- [15] arXiv:2503.21298 (cross-list from eess.SP) [pdf, other]
Title: Génération de Matrices de Corrélation avec des Structures de Graphe par Optimisation Convexe (Generation of Correlation Matrices with Graph Structures via Convex Optimization)
Authors: Ali Fahkar (STATIFY, LJK), Kévin Polisano (SVH, LJK), Irène Gannaz (G-SCOP_GROG, G-SCOP), Sophie Achard (STATIFY, LJK)
Comments: in French
Subjects: Signal Processing (eess.SP); Optimization and Control (math.OC); Statistics Theory (math.ST); Methodology (stat.ME)
This work deals with the generation of theoretical correlation matrices with specific sparsity patterns, associated with graph structures. We present a novel approach based on convex optimization, offering greater flexibility compared to existing techniques, notably by controlling the mean of the entry distribution in the generated correlation matrices. This allows for the generation of correlation matrices that better represent realistic data and can be used to benchmark statistical methods for graph inference.
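A sketch in the spirit of the abstract (my own, not the authors' formulation): convex optimization can search for a positive semidefinite correlation matrix whose zero pattern matches a given graph while steering the mean of the free off-diagonal entries toward a target value.

```python
# Sketch (not the authors' exact formulation): find a PSD correlation matrix
# with a prescribed sparsity pattern while steering the mean of the free
# off-diagonal entries toward a target value.
import cvxpy as cp
import numpy as np

p = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]          # graph structure
target_mean = 0.4                                          # desired mean correlation

R = cp.Variable((p, p), PSD=True)
constraints = [cp.diag(R) == 1]
for i in range(p):
    for j in range(i + 1, p):
        if (i, j) not in edges and (j, i) not in edges:
            constraints.append(R[i, j] == 0)               # zeros off the graph

edge_vals = cp.hstack([R[i, j] for (i, j) in edges])
objective = cp.Minimize(cp.square(cp.sum(edge_vals) / len(edges) - target_mean))
prob = cp.Problem(objective, constraints)
prob.solve()
print(np.round(R.value, 3))
```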
- [16] arXiv:2503.21399 (cross-list from math.PR) [pdf, html, other]
Title: Transition probabilities for stochastic differential equations using the Laplace approximation: Analysis of the continuous-time limit
Comments: 25 pages, 2 figures
Subjects: Probability (math.PR); Methodology (stat.ME)
We recently proposed a method for estimation of states and parameters in stochastic differential equations, which included intermediate time points between observations and used the Laplace approximation to integrate out these intermediate states. In this paper, we establish a Laplace approximation for the transition probabilities in the continuous-time limit where the computational time step between intermediate states vanishes. Our technique views the driving Brownian motion as a control, casts the problem as one of minimum-effort control between two states, and employs a Girsanov shift of probability measure as well as a weak-noise approximation to obtain the Laplace approximation. We demonstrate the technique with examples: one where the approximation is exact due to a property of coordinate transforms, and one where contributions from non-near paths impair the approximation. We assess the order of the discrete-time scheme and demonstrate that Strang splitting leads to higher order and accuracy than Euler-type discretizations. Finally, we investigate numerically how the accuracy of the approximation depends on the noise intensity and the length of the time interval.
- [17] arXiv:2503.21639 (cross-list from math.ST) [pdf, html, other]
Title: Locally minimax optimal and dimension-agnostic discrete argmin inference
Subjects: Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
We revisit the discrete argmin inference problem in high-dimensional settings. Given $n$ observations of a $d$-dimensional vector, the goal is to test whether the $r$th component of the mean vector is the smallest among all components. We propose dimension-agnostic tests that maintain validity regardless of how $d$ scales with $n$, and regardless of arbitrary ties in the mean vector. Notably, our validity holds under mild moment conditions, requiring little more than finiteness of a second moment, and permitting possibly strong dependence between coordinates. In addition, we establish the local minimax separation rate for this problem, which adapts to the cardinality of a confusion set, and show that the proposed tests attain this rate. Our method uses the sample-splitting and self-normalization approach of Kim and Ramdas (2024). Our tests can be easily inverted to yield confidence sets for the argmin index. Empirical results illustrate the strong performance of our approach in terms of type I error control and power compared to existing methods.
- [18] arXiv:2503.21673 (cross-list from stat.CO) [pdf, html, other]
Title: A friendly introduction to triangular transport
Comments: 46 pages, 17 figures
Subjects: Computation (stat.CO); Atmospheric and Oceanic Physics (physics.ao-ph); Methodology (stat.ME); Machine Learning (stat.ML)
Decision making under uncertainty is a cross-cutting challenge in science and engineering. Most approaches to this challenge employ probabilistic representations of uncertainty. In complicated systems accessible only via data or black-box models, however, these representations are rarely known. We discuss how to characterize and manipulate such representations using triangular transport maps, which approximate any complex probability distribution as a transformation of a simple, well-understood distribution. The particular structure of triangular transport guarantees many desirable mathematical and computational properties that translate well into solving practical problems. Triangular maps are actively used for density estimation, (conditional) generative modelling, Bayesian inference, data assimilation, optimal experimental design, and related tasks. While there is ample literature on the development and theory of triangular transport methods, this manuscript provides a detailed introduction for scientists interested in employing measure transport without assuming a formal mathematical background. We build intuition for the key foundations of triangular transport, discuss many aspects of its practical implementation, and outline the frontiers of this field.
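As a toy illustration of the core idea (my own, not from the manuscript), the lower-triangular Knothe-Rosenblatt map for a correlated two-dimensional Gaussian is affine and can be written down from the marginal of the first coordinate and the conditional of the second; it pushes samples onto independent standard normals.

```python
# Minimal sketch of a lower-triangular (Knothe-Rosenblatt) map for a
# correlated 2-D Gaussian: the first map component depends on x1 only, the
# second on (x1, x2); together they push samples onto independent N(0, 1).
import numpy as np

rng = np.random.default_rng(6)
mean = np.array([1.0, -2.0])
rho, s1, s2 = 0.8, 1.5, 0.7
cov = np.array([[s1**2, rho * s1 * s2],
                [rho * s1 * s2, s2**2]])
X = rng.multivariate_normal(mean, cov, size=100_000)

def triangular_map(x):
    x1, x2 = x[:, 0], x[:, 1]
    z1 = (x1 - mean[0]) / s1                             # depends on x1 only
    cond_mean = mean[1] + rho * s2 / s1 * (x1 - mean[0])
    cond_std = s2 * np.sqrt(1.0 - rho**2)
    z2 = (x2 - cond_mean) / cond_std                     # depends on (x1, x2)
    return np.column_stack([z1, z2])

Z = triangular_map(X)
print(np.round(Z.mean(axis=0), 3), np.round(np.cov(Z.T), 3))  # ~0 mean, ~identity cov
```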
Cross submissions (showing 4 of 4 entries)
- [19] arXiv:2102.09412 (replaced) [pdf, html, other]
Title: Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding
Journal-ref: Journal of Machine Learning Research 26 (2025) 1-60
Subjects: Methodology (stat.ME)
Recent work has focused on the potential and pitfalls of causal identification in observational studies with multiple simultaneous treatments. Building on previous work, we show that even if the conditional distribution of unmeasured confounders given treatments were known exactly, the causal effects would not in general be identifiable, although they may be partially identified. Given these results, we propose a sensitivity analysis method for characterizing the effects of potential unmeasured confounding, tailored to the multiple treatment setting, that can be used to characterize a range of causal effects that are compatible with the observed data. Our method is based on a copula factorization of the joint distribution of outcomes, treatments, and confounders, and can be layered on top of arbitrary observed data models. We propose a practical implementation of this approach making use of the Gaussian copula, and establish conditions under which causal effects can be bounded. We also describe approaches for reasoning about effects, including calibrating sensitivity parameters, quantifying robustness of effect estimates, and selecting models that are most consistent with prior hypotheses.
- [20] arXiv:2312.06204 (replaced) [pdf, html, other]
Title: Multilayer Network Regression with Eigenvector Centrality and Community Structure
Subjects: Methodology (stat.ME)
In the analysis of complex networks, centrality measures and community structures play pivotal roles. For multilayer networks, a critical challenge lies in effectively integrating information across diverse layers while accounting for the dependence structures both within and between layers. We propose an innovative two-stage regression model for multilayer networks, combining eigenvector centrality and network community structure within fourth-order tensor-like multilayer networks. We develop new community-based centrality measures, integrated into a regression framework. To address the inherent noise in network data, we conduct separate analyses of centrality measures with and without measurement errors and establish consistency for the least squares estimates in the regression model. The proposed methodology is applied to the world input-output dataset, investigating how input-output network data among different countries and industries influence the gross output of each industry.
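For readers unfamiliar with the centrality measure involved, the sketch below computes eigenvector centrality by power iteration on a single weighted layer (my own illustration; the paper's community-based, cross-layer measures are not reproduced).

```python
# Sketch: eigenvector centrality via power iteration on one layer of a
# multilayer network (the community-based measures of the paper are omitted).
import numpy as np

def eigenvector_centrality(A, tol=1e-10, max_iter=1_000):
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(max_iter):
        v_new = A @ v
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    return v_new

rng = np.random.default_rng(7)
A = rng.random((6, 6))
A = (A + A.T) / 2.0                  # symmetric weighted adjacency (one layer)
np.fill_diagonal(A, 0.0)
print(np.round(eigenvector_centrality(A), 3))
```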
- [21] arXiv:2402.16725 (replaced) [pdf, html, other]
Title: Inference on the proportion of variance explained in principal component analysis
Subjects: Methodology (stat.ME)
Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored.
In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
- [22] arXiv:2407.01883 (replaced) [pdf, html, other]
Title: Robust Linear Mixed Models using Hierarchical Gamma-Divergence
Comments: 36 pages (main) + 14 pages (supplement)
Subjects: Methodology (stat.ME)
Linear mixed models (LMMs) are a popular class of methods for analyzing longitudinal and clustered data. However, such models can be sensitive to outliers, and this can lead to biased inference on model parameters and inaccurate prediction of random effects if the data are contaminated. We propose a new approach to robust estimation and inference for LMMs using a hierarchical gamma-divergence, which offers an automated, data-driven approach to downweight the effects of outliers occurring in both the error and the random effects, using normalized powered density weights. For estimation and inference, we develop a computationally scalable minorization-maximization algorithm for the resulting objective function, along with a clustered bootstrap method for uncertainty quantification and a Hyvarinen score criterion for selecting a tuning parameter controlling the degree of robustness. Under suitable regularity conditions, we show the resulting robust estimates can be asymptotically controlled even under a heavy level of (covariate-dependent) contamination. Simulation studies demonstrate hierarchical gamma-divergence consistently outperforms several currently available methods for robustifying LMMs. We also illustrate the proposed method using data from a multi-center AIDS cohort study.
- [23] arXiv:2410.21858 (replaced) [pdf, html, other]
Title: Joint Estimation of Conditional Mean and Covariance for Unbalanced Panels
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
We develop a nonparametric, kernel-based joint estimator for conditional mean and covariance matrices in large and unbalanced panels. The estimator is supported by rigorous consistency results and finite-sample guarantees, ensuring its reliability for empirical applications. We apply it to an extensive panel of monthly US stock excess returns from 1962 to 2021, using macroeconomic and firm-specific covariates as conditioning variables. The estimator effectively captures time-varying cross-sectional dependencies, demonstrating robust statistical and economic performance. We find that idiosyncratic risk explains, on average, more than 75% of the cross-sectional variance.
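A hedged Nadaraya-Watson style sketch of kernel-weighted conditional moments (my own illustration, not the authors' panel estimator): given a scalar conditioning variable, weight observations by a Gaussian kernel and form the weighted mean and covariance of the return vector.

```python
# Nadaraya-Watson style sketch of kernel-weighted conditional mean and
# covariance (my own illustration, not the authors' panel estimator).
import numpy as np

def conditional_mean_cov(Z, R, z0, bandwidth):
    """Kernel-weighted mean and covariance of the rows of R given Z ~ z0."""
    w = np.exp(-0.5 * ((Z - z0) / bandwidth) ** 2)      # Gaussian kernel weights
    w /= w.sum()
    mu = w @ R                                          # conditional mean
    Rc = R - mu
    Sigma = (Rc * w[:, None]).T @ Rc                    # conditional covariance
    return mu, Sigma

rng = np.random.default_rng(8)
T, N = 2_000, 3
Z = rng.normal(size=T)                                  # conditioning variable
R = np.column_stack([0.5 * Z + rng.normal(size=T) * (1.0 + 0.3 * np.abs(Z))
                     for _ in range(N)])                # returns with state-dependent risk

mu, Sigma = conditional_mean_cov(Z, R, z0=1.0, bandwidth=0.3)
print(np.round(mu, 3), np.round(Sigma, 3))
```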
- [24] arXiv:2502.17827 (replaced) [pdf, html, other]
Title: DPGLM: A Semiparametric Bayesian GLM with Inhomogeneous Normalized Random Measures
Subjects: Methodology (stat.ME)
We introduce a novel varying-weight dependent Dirichlet process (DDP) model that extends a recently developed semi-parametric generalized linear model (SPGLM) by adding a nonparametric Bayesian prior on the baseline distribution of the GLM. We show that the resulting model takes the form of an inhomogeneous completely random measure that arises from exponential tilting of a normalized completely random measure. Building on familiar posterior sampling methods for mixtures with respect to normalized random measures, we introduce posterior simulation in the resulting model. We validate the proposed methodology through extensive simulation studies and illustrate its application using data from a speech intelligibility study.
- [25] arXiv:2412.15239 (replaced) [pdf, html, other]
Title: Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN); Methodology (stat.ME)
Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs of what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.