Daily Papers

by AK and the research community

Jan 1

Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high-dimensional embeddings. While LLM embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster, the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data, using scientific abstracts as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.

  • 2 authors
·
Dec 29, 2025
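
The nesting idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' method: it swaps in a simple stand-in density criterion (points closer than a radius are linked, and clusters are the connected components, which is equivalent to cutting a single-linkage hierarchy). The function name `nested_density_tree`, the `radii` parameter, and the toy 2D points standing in for LLM embeddings are all assumptions made for illustration.

```python
import math
from itertools import combinations


def nested_density_tree(points, radii):
    """Return one clustering per radius, from densest to most diffuse.

    Assumption: "density" is approximated by a fixed linking radius.
    Each level is a list of frozensets of point indices; every cluster
    at one level is contained in a cluster at the next level.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    levels = []
    k = 0
    for radius in sorted(radii):
        # Relax the criterion: link every pair within the current radius.
        while k < len(pairs) and pairs[k][0] <= radius:
            _, i, j = pairs[k]
            parent[find(i)] = find(j)
            k += 1
        groups = {}
        for i in range(len(points)):
            groups.setdefault(find(i), set()).add(i)
        levels.append(sorted((frozenset(g) for g in groups.values()), key=min))
    return levels


# Toy stand-ins for embeddings: two tight pairs and one outlier.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
toy_levels = nested_density_tree(toy_points, radii=[1.5, 8, 30])
for radius, level in zip([1.5, 8, 30], toy_levels):
    print(radius, [sorted(cluster) for cluster in level])
```

Reading the levels from last to first gives the tree: the final level is the root, and each earlier level lists the children nested inside it. With real embeddings, a cosine-based distance and a proper density estimate would likely be more appropriate than this Euclidean radius.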

The FRB20190520B Sightline Intersects Foreground Galaxy Clusters

The repeating fast radio burst FRB20190520B is an anomaly among the FRB population thanks to its high dispersion measure ($\mathrm{DM}=1205\,\mathrm{pc/cc}$) despite its low redshift of $z_\mathrm{frb}=0.241$. This excess has been attributed to a large host contribution of $\mathrm{DM}_\mathrm{host}\approx 900\,\mathrm{pc/cc}$, far larger than that of any other known FRB. In this paper, we describe spectroscopic observations of the FRB20190520B field obtained as part of the FLIMFLAM survey, which yielded 701 galaxy redshifts in the field. We find multiple foreground galaxy groups and clusters, for which we then estimated halo masses by comparing their richness with numerical simulations. We discover two separate $M_\mathrm{halo} > 10^{14}\,M_\odot$ galaxy clusters, at $z=0.1867$ and $z=0.2170$, respectively, that are directly intersected by the FRB sightline within their characteristic halo radius. Subtracting off their estimated DM contributions as well as that of the diffuse intergalactic medium, we estimate a host contribution of $\mathrm{DM}_\mathrm{host}=430^{+140}_{-220}\,\mathrm{pc/cc}$ or $\mathrm{DM}_\mathrm{host}=280^{+140}_{-170}\,\mathrm{pc/cc}$ (observed frame), depending on whether we assume the halo gas extends to $r_{200}$ or $2\times r_{200}$. This significantly smaller $\mathrm{DM}_\mathrm{host}$, no longer the largest known value, is now consistent with H$\alpha$ emission measures of the host galaxy without invoking unusually high gas temperatures. Combined with the observed FRB scattering timescale, we estimate the turbulent fluctuation and geometric amplification factor of the scattering layer to be $F G\approx 4.5 - 11\,(\mathrm{pc^2\,km})^{-1/3}$, suggesting most of the gas is close to the FRB host. This result illustrates the importance of incorporating foreground data for FRB analyses, both for understanding the nature of FRBs and for realizing their potential as a cosmological probe.

  • 10 authors
·
Jun 8, 2023
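
The subtraction described in the abstract above can be sketched as simple bookkeeping. The helper names `residual_host_dm` and `to_rest_frame` are hypothetical, and the abstract does not give the individual foreground terms, so none are filled in here. The rest-frame conversion assumes the common convention that an observed-frame host DM is multiplied by $(1 + z)$; the abstract itself only quotes observed-frame values.

```python
def residual_host_dm(dm_observed, foreground_dms):
    """Observed-frame host DM (pc/cc) left after removing foreground terms.

    Assumption: foreground_dms lists every non-host contribution along
    the sightline (e.g. intervening clusters, diffuse intergalactic medium).
    """
    return dm_observed - sum(foreground_dms)


def to_rest_frame(dm_host_observed, z_host):
    """Convert an observed-frame host DM to the host rest frame.

    Assumption: the standard (1 + z) scaling applies.
    """
    return dm_host_observed * (1.0 + z_host)


Z_FRB = 0.241  # host redshift quoted in the abstract

# Central observed-frame estimates quoted in the abstract (pc/cc).
for dm_obs_frame in (430.0, 280.0):
    print(dm_obs_frame, round(to_rest_frame(dm_obs_frame, Z_FRB), 1))
```

Under that assumption, the two central estimates correspond to roughly 534 and 347 pc/cc in the host rest frame; the quoted uncertainties would scale by the same factor.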