publications
2025
- Optimizing Hidden Markov Language Models: An Empirical Study of Reparameterization and Initialization Techniques
  Ivan Lee and Taylor Berg-Kirkpatrick
  Findings of NAACL 2025
Hidden Markov models (HMMs) are valuable for their ability to provide exact and tractable inference. However, learning an HMM in an unsupervised manner involves a non-convex optimization problem that is plagued by poor local optima. Recent work on scaling HMMs has shown this challenge only intensifies as the number of hidden states grows. We provide a comprehensive empirical analysis of two approaches to enhance HMM optimization: reparameterization and initialization of HMM transition and emission parameters using neural networks. Through extensive experiments on language modeling, we find that (1) these techniques enable effective training of large-scale HMMs, (2) simple linear reparameterizations of HMM parameters perform as well as more complex neural ones, and (3) the two approaches are complementary, yielding the best results when combined.
@inproceedings{lee-berg-kirkpatrick-2025-optimizing,
  title     = {Optimizing Hidden {M}arkov Language Models: An Empirical Study of Reparameterization and Initialization Techniques},
  author    = {Lee, Ivan and Berg-Kirkpatrick, Taylor},
  booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL Findings)},
  month     = may,
  year      = {2025},
  address   = {Albuquerque, New Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.findings-naacl.429},
  doi       = {10.18653/v1/2025.findings-naacl.429},
  pages     = {7712--7723},
}
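To make the reparameterization idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: transition and emission logits are derived from shared state embeddings through learned linear maps rather than learned directly. The embedding size and the bilinear form for transitions are illustrative assumptions.

import torch
import torch.nn as nn

class ReparameterizedHMM(nn.Module):
    def __init__(self, num_states: int, vocab_size: int, embed_dim: int = 64):
        super().__init__()
        # Shared state embeddings; a neural network could initialize these,
        # echoing the initialization technique the paper studies.
        self.state_emb = nn.Parameter(torch.randn(num_states, embed_dim))
        self.trans_proj = nn.Linear(embed_dim, embed_dim)  # linear reparameterization
        self.emit_proj = nn.Linear(embed_dim, vocab_size)

    def transition_matrix(self) -> torch.Tensor:
        # logits[i, j] = <W e_i, e_j>; softmax over next states keeps rows stochastic
        logits = self.trans_proj(self.state_emb) @ self.state_emb.T
        return torch.softmax(logits, dim=-1)

    def emission_matrix(self) -> torch.Tensor:
        # One distribution over the vocabulary per hidden state
        return torch.softmax(self.emit_proj(self.state_emb), dim=-1)

Per the abstract, simple linear forms like this matched more complex neural parameterizations in the paper's experiments.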
- Readability ≠ Learnability: Rethinking the Role of Simplicity in Training Small Language Models
  Ivan Lee and Taylor Berg-Kirkpatrick
  COLM 2025 (Oral Spotlight, top 5.7%)
Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability—characterized by accessible vocabulary, familiar narrative structure, and simple syntax—plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training—drawing parallels to human cognitive development without empirical basis—and argue for more precise reasoning about what properties actually support capability emergence in small models.
@inproceedings{lee-berg-kirkpatrick-2025-readability,
  title     = {Readability ≠ Learnability: Rethinking the Role of Simplicity in Training Small Language Models},
  author    = {Lee, Ivan and Berg-Kirkpatrick, Taylor},
  booktitle = {Conference on Language Modeling (COLM)},
  month     = oct,
  year      = {2025},
  publisher = {OpenReview},
}
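As a concrete illustration of the statistical-simplicity measure mentioned above, here is a small Python sketch of n-gram diversity as a distinct-to-total n-gram ratio. The paper's exact metric, tokenization, and n may differ; treat the details as assumptions.

from itertools import islice

def ngram_diversity(tokens: list[str], n: int = 2) -> float:
    # Slide a window of length n over the token sequence
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    # Lower values mean more repetition, i.e., a statistically simpler corpus
    return len(set(grams)) / len(grams)

print(ngram_diversity("the cat sat on the mat and the cat ran".split()))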
- Pragmatic Structured Generation: A Case Study on JSON Schema
  Ivan Lee, Loris D’Antoni, and Taylor Berg-Kirkpatrick
  DL4C @ NeurIPS 2025 (Workshop)
Grammar-constrained decoding—which masks invalid tokens during generation to guarantee outputs stay within a specified formal language—promises to eliminate structural errors in language model outputs. Yet when tested on JSON Schema (the most common application of grammar-constrained decoding), popular implementations achieve only 50% coverage on real-world schemas. Through experiments on 10,000 real-world JSON schemas, we find that treating validation as an external tool—using validation failures as feedback for runtime alignment—outperforms sophisticated constrained decoding methods, achieving 95% coverage with a modest latency increase (typically 1-2 additional seconds per schema). This gap stems from multiple issues: grammar-constrained decoding is theoretically limited to context-free grammars, real-world schemas often require context-sensitive validation, and even within context-free constraints, implementations struggle with token-boundary misalignment and state explosion. While our analysis focuses specifically on JSON Schema—where language models may excel due to extensive training exposure—it raises questions about whether increasingly complex decoding algorithms are the right approach. As language models improve, treating validation as a separate feedback tool in an agentic loop may prove more practical than embedding constraints into the decoding process itself.
@inproceedings{lee-etal-2025-pragmatic,
  title     = {Pragmatic Structured Generation: A Case Study on {JSON} Schema},
  author    = {Lee, Ivan and D'Antoni, Loris and Berg-Kirkpatrick, Taylor},
  booktitle = {Deep Learning For Code in the Agentic Era (DL4C) @ NeurIPS},
  year      = {2025},
  pubstate  = {inpress},
  keywords  = {workshop},
}
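The "validation as an external tool" loop described above can be sketched in a few lines. The `generate` callable is a hypothetical stand-in for any LLM call; the validator is the real `jsonschema` package, and the retry budget is an illustrative assumption.

import json
import jsonschema

def generate_valid_json(generate, schema: dict, max_retries: int = 3) -> dict:
    prompt = f"Produce a JSON object matching this schema:\n{json.dumps(schema)}"
    for _ in range(max_retries):
        candidate = generate(prompt)
        try:
            obj = json.loads(candidate)
            jsonschema.validate(obj, schema)  # raises on schema violations
            return obj
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            # Feed the failure back for runtime alignment instead of
            # masking tokens during decoding
            prompt += f"\nYour last output was invalid ({err}). Please fix it."
    raise RuntimeError("no valid output within retry budget")

Because the validator runs outside the decoder, it handles context-sensitive schema features that context-free grammar masking cannot express.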
2024
- Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability
  Ivan Lee, Nan Jiang, and Taylor Berg-Kirkpatrick
  ICLR 2024
What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state-space-model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying the number of in-context examples and task difficulty. We also measure each architecture’s predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with, or even better than, transformers as in-context learners. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.
@inproceedings{lee2024exploring,
  title     = {Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability},
  author    = {Lee, Ivan and Jiang, Nan and Berg-Kirkpatrick, Taylor},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024},
}
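For readers unfamiliar with synthetic in-context learning suites, a prompt is typically a sequence of (input, label) pairs drawn from a freshly sampled task, followed by a query the model must answer from context alone. The NumPy sketch below shows one such task, in-context linear regression; it is illustrative, not the paper's exact task suite.

import numpy as np

def make_icl_example(num_shots: int = 8, dim: int = 4, seed: int = 0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)                   # a new task is sampled per prompt
    xs = rng.normal(size=(num_shots + 1, dim))
    ys = xs @ w
    context = list(zip(xs[:-1], ys[:-1]))      # in-context (input, label) pairs
    query, target = xs[-1], ys[-1]             # model must infer w from context
    return context, query, target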
2022
- Masked Measurement Prediction: Learning to Jointly Predict Quantities and Units from Textual Context
  Daniel Spokoyny, Ivan Lee, Zhao Jin, and Taylor Berg-Kirkpatrick
  Findings of NAACL 2022
Physical measurements constitute a large portion of numbers in academic papers, engineering reports, and web tables. Current benchmarks fall short of properly evaluating numeracy of pretrained language models on measurements, hindering research on developing new methods and applying them to numerical tasks. To that end, we introduce a novel task, Masked Measurement Prediction (MMP), where a model learns to reconstruct a number together with its associated unit given masked text. MMP is useful for both training new numerically informed models as well as evaluating numeracy of existing systems. To address this task, we introduce a new Generative Masked Measurement (GeMM) model that jointly learns to predict numbers along with their units. We perform fine-grained analyses comparing our model with various ablations and baselines. We use linear probing of traditional pretrained transformer models (RoBERTa) to show that they significantly underperform jointly trained number-unit models, highlighting the difficulty of this new task and the benefits of our proposed pretraining approach. We hope this framework accelerates the progress towards building more robust numerical reasoning systems in the future.
@inproceedings{spokoyny-etal-2022-masked,
  title     = {Masked Measurement Prediction: Learning to Jointly Predict Quantities and Units from Textual Context},
  author    = {Spokoyny, Daniel and Lee, Ivan and Jin, Zhao and Berg-Kirkpatrick, Taylor},
  booktitle = {North American Chapter of the Association for Computational Linguistics (NAACL Findings)},
  month     = jul,
  year      = {2022},
  address   = {Seattle, United States},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.findings-naacl.2},
  doi       = {10.18653/v1/2022.findings-naacl.2},
  pages     = {17--29},
}
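A toy sketch of how an MMP training instance could be constructed follows: find a quantity-unit span, mask it, and keep the gold number and unit as joint prediction targets. The unit inventory and regex here are invented for illustration and are far simpler than what the paper handles.

import re

# Hypothetical toy unit list; the paper covers a much richer inventory.
UNIT = r"(?:mm|cm|m|km|g|kg|s|ms|Hz|V|W)"
MEASUREMENT = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNIT})\b")

def make_mmp_instance(text: str):
    match = MEASUREMENT.search(text)
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    # Replace the whole measurement span with a single mask token
    masked = text[:match.start()] + "[MASK]" + text[match.end():]
    return {"masked_text": masked, "number": number, "unit": unit}

print(make_mmp_instance("The beam spans 12.5 m across the channel."))
# {'masked_text': 'The beam spans [MASK] across the channel.', 'number': 12.5, 'unit': 'm'}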
- HeLo: Learning-Free Lookahead Decoding for Conversation Infilling
  Ivan Lee and Taylor Berg-Kirkpatrick
  Findings of EMNLP 2022
We propose Heuristic Guided Lookahead Decoding (HeLo), a novel decoding strategy for conversation infilling. Conversation infilling aims to generate a seamless bridge of utterances connecting a given pair of source and target utterances. HeLo requires neither fine-tuning nor auxiliary models; it relies only on the generating model itself. Instead, HeLo leverages a greedy lookahead phase before committing to any token. The HeLo framework is simple and can augment conventional decoding strategies paired with any autoregressive language model. Smooth transitions between utterances are encouraged with an annealing schedule. Our experiments show HeLo outperforms several baselines when evaluated with both automatic and human evaluation metrics, which, we argue, are appropriate for the task.
@inproceedings{lee-berg-kirkpatrick-2022-helo,
  title     = {{H}e{L}o: Learning-Free Lookahead Decoding for Conversation Infilling},
  author    = {Lee, Ivan and Berg-Kirkpatrick, Taylor},
  editor    = {Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)},
  month     = dec,
  year      = {2022},
  address   = {Abu Dhabi, United Arab Emirates},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2022.findings-emnlp.367},
  doi       = {10.18653/v1/2022.findings-emnlp.367},
  pages     = {4996--5008},
}
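A schematic of the greedy lookahead phase, assuming a Hugging-Face-style `model(...).logits` interface; the candidate set size, rollout horizon, and cosine-similarity heuristic toward the target utterance are illustrative assumptions, not the paper's exact heuristic or annealing schedule.

import torch

@torch.no_grad()
def lookahead_step(model, input_ids, target_emb, embed_fn, top_k=5, horizon=4):
    # Score each candidate next token by greedily rolling out a short
    # continuation and measuring how close it steers toward the target.
    logits = model(input_ids).logits[0, -1]
    candidates = torch.topk(logits, top_k).indices
    best_token, best_score = None, float("-inf")
    for tok in candidates:
        seq = torch.cat([input_ids, tok.view(1, 1)], dim=1)
        for _ in range(horizon):                        # greedy rollout
            nxt = model(seq).logits[0, -1].argmax()
            seq = torch.cat([seq, nxt.view(1, 1)], dim=1)
        # Heuristic: cosine similarity between rollout and target embeddings
        score = torch.cosine_similarity(embed_fn(seq), target_emb, dim=-1)
        if score > best_score:
            best_token, best_score = tok, score
    return best_token  # commit only this one token, then repeat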