I propose to investigate the properties of LLMs and their embedding spaces using clustering algorithms inspired by physics. Embedding is a technique that maps variable-length text into a fixed-dimension vector space. It has been well known since the Word2Vec era (Mikolov, 2013) that embeddings encode a “world model.” Current models such as CLIP embed text and/or images into the same vector space and use distance functions (e.g., cosine similarity) to find nearest neighbors for search (Radford, 2021). However, existing approaches only use the “direction” of the unitary embedding vectors.
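As a concrete illustration of the “direction only” point: cosine similarity divides out the vector norms before comparing, so any magnitude information is discarded. A minimal sketch (the vectors here are made-up placeholders, not real embeddings):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Both vectors are normalized by their lengths, so only the
    # direction matters; rescaling either vector by any positive
    # constant leaves the score unchanged.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 0.5])
print(cosine_similarity(u, v))          # some value in [-1, 1]
print(cosine_similarity(10.0 * u, v))   # identical: magnitude is ignored
```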
The anti-kT clustering algorithm (Cacciari, 2008) is widely used in particle physics. In a collision, many partons are produced, and each parton further decays into many more final-state particles. The anti-kT algorithm can accurately cluster these final-state particles back to their common “source” parton. Anti-kT is effective because it cleverly incorporates the energies of the particles into its distance measure. This leads me to propose using the pre-embedding text length as the “energy” of each vector and applying a modified anti-kT algorithm to these embedding vectors to probe the properties of a class of Natural Language (NL) tasks in the embedding space.
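For readers outside particle physics, the standard anti-kT distance measures are reproduced below; the mapping of token length to $p_T$ and embedding-space angle to $\Delta R$ in the last sentence is the proposed adaptation, not part of the original algorithm:

$$
d_{ij} = \min\!\left(p_{T,i}^{-2},\, p_{T,j}^{-2}\right)\frac{\Delta R_{ij}^{2}}{R^{2}},
\qquad
d_{iB} = p_{T,i}^{-2},
$$

where $\Delta R_{ij}^{2} = (y_i - y_j)^2 + (\phi_i - \phi_j)^2$, $p_T$ is the transverse momentum, and $R$ is the radius parameter. The algorithm repeatedly merges the pair with the smallest $d_{ij}$, or declares entity $i$ a final jet when its $d_{iB}$ is the smallest distance. In the proposed adaptation, $p_T$ would be replaced by the pre-embedding token length and $\Delta R_{ij}$ by an angular (e.g., cosine-based) distance between embedding vectors.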
An outline of the research procedures is:
- Start with N “source” documents (e.g., news articles)
- Calculate the embedding vector for each of the source documents.
- Give the LLM a task (e.g., summarization), repeated M times for each source document while varying the temperature (i.e., the randomness of the LLM output) or the target length setting.
- Calculate the embedding vectors for all N*M derived texts.
- Assign each of the N*(M+1) embedding vectors an “energy” component proportional to its pre-embedding token length.
- Use anti-kT or its variants for clustering, and observe the clustering accuracy (the percentage of derived texts clustered with their “source”); a sketch of this step is given after the list.
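A minimal sketch of the last two steps, under stated assumptions: the energy-weighted recombination scheme, the choice of angular distance, and the radius parameter `R` below are my own placeholder choices (the “variants” mentioned above), not settled details of the proposal.

```python
import numpy as np

def angular_distance(u, v):
    # Embedding-space angle, playing the role of Delta R in anti-kT.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def anti_kt_cluster(vectors, energies, R=0.4):
    """Sequential-recombination clustering with anti-kT-style measures.

    vectors  : embedding vectors, one per text (sources and derived texts)
    energies : pre-embedding token lengths, standing in for particle energy
    R        : radius parameter controlling how large clusters can grow
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    # Each pseudo-entity: (member indices, energy-weighted mean vector, total energy).
    entities = [([i], np.asarray(v, float), float(e))
                for i, (v, e) in enumerate(zip(vectors, energies))]
    clusters = []
    while entities:
        # "Beam" distance d_iB = 1 / E_i^2: smallest for the highest-energy entity.
        d_beam = [1.0 / e ** 2 for _, _, e in entities]
        best_i = min(range(len(entities)), key=d_beam.__getitem__)
        best_d, best = d_beam[best_i], ("beam", best_i)
        # Pairwise distance d_ij = min(1/E_i^2, 1/E_j^2) * theta_ij^2 / R^2.
        for i in range(len(entities)):
            for j in range(i + 1, len(entities)):
                theta = angular_distance(entities[i][1], entities[j][1])
                d = min(1.0 / entities[i][2] ** 2,
                        1.0 / entities[j][2] ** 2) * theta ** 2 / R ** 2
                if d < best_d:
                    best_d, best = d, ("pair", (i, j))
        if best[0] == "beam":
            # Promote this entity to a final cluster.
            clusters.append(entities.pop(best[1])[0])
        else:
            i, j = best[1]
            mi, vi, ei = entities[i]
            mj, vj, ej = entities[j]
            merged = (mi + mj, (ei * vi + ej * vj) / (ei + ej), ei + ej)
            entities.pop(j)  # pop j first so index i stays valid
            entities.pop(i)
            entities.append(merged)
    return clusters
```

Clustering accuracy would then simply be the fraction of the N*M derived texts that land in the same cluster as their source document.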
In particle physics, a fragmenting parton typically produces many low-energy particles surrounding a high-energy core. Here, we use the analogous intuition that NL tasks performed by humans or computers often take snippets from, or summarize, a “source” document that is usually longer and therefore carries higher “energy.”
Compared to previous proposals of customizing the metric space for the data domain or using an LLM directly for clustering (Viswanathan, 2023), the proposed clustering scheme does not require complex changes to the embedding space. It can also cluster hundreds of thousands of text entities as long as each individual text fits in the embedding context window, even if the combined size of all the texts is much larger than the LLM context window.
I will investigate the relation between clustering accuracy and the following:
- Initial separations between the source documents in the embedding space
- The “temperature” and “summary length” parameters of the LLM summary tasks
- Parameters of the clustering algorithm
I'm pretty good at making things work in software, and the clustering algorithm comes from my own domain (particle physics). As a 4th-year PhD student, I'd also give myself a passing grade as a research problem solver.
GitHub: @Moelf
ORCiD: https://orcid.org/my-orcid?orcid=0000-0002-3359-0380
Twitter: @l_II_llI
$5,000
No response.
Success := we show that this family of clustering algorithms is a viable way to do text search and clustering beyond the currently possible text corpus size, and we determine whether the clustering is provenance-aware.
Key software packages and their usage are understood, and concrete, detailed steps are mapped out. I give it an 85% chance (±5% at 95% C.L.) that we can conduct the experiment and conclude one way or the other, as outlined in the project proposal.
Joel Burget
about 1 year ago
However, existing approaches only use the “direction” of the unitary embedding vectors.
This is the biggest crux for me. It sounds like this project is largely a bet that magnitude is also important. Do you have evidence for that?