CompSys
May 26, 2024
- This is the summer of getting good at Systems, and my internship on a data engineering team is underway. I want to devote myself to learning a lot on my own. Eric Zhang's reading group syllabus looks great. Each session is reasonably self-contained, so I should be able to hop around. Simon Boehm's blog is also great.
- When this all started, I was vaguely interested in "interoperability." I latched on to the science, the meta-science, and the data the research process produces. The US Year of Open Science was cool, but there was still work to be done.
- Interoperability is a Computer Systems word. Data is field-agnostic. ML is hot. ML is significant. And then the rabbit holes began. I have written takes on specializing foundation models, and while I'm less optimistic than most on certain things, I think there is good work to be done on data management and optimization.
- While I'm not one to be finessing pipelines, since I'm not running any HPC experiments, it has been interesting to look at cool things. For example, this paper on addressing hardware non-determinism for "verifiable training," and even thinking about multi-layer transformers and whether behavioral discrepancies can be managed from a low-level perspective. At the higher levels, that verifiable training paper handles hardware non-determinism in service of data integrity.
- Things like homomorphic authenticated encryption in federated learning make it important that fine-tuning data (for example) is verifiable. This is not really intuitive for me, because the benefit of federated learning is the heterogeneous data, but anyway, it's hard to assess the source of an output if we don't have good provenance, etc. (a toy version of the provenance idea is sketched just below).
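- A toy sketch of the bare provenance idea (my own illustration, nothing to do with homomorphic encryption or any real federated learning library; the client names and records are made up): content-hash each client's fine-tuning shard into a manifest so a later audit can at least check that an update was computed over the data it claims.

```python
# Toy provenance sketch: hash each (hypothetical) client's fine-tuning shard
# into a manifest, then recompute and compare during an audit. No crypto beyond
# a plain content hash; just the provenance idea at its most bare.
import hashlib
import json

def shard_digest(records: list[str]) -> str:
    """Hash a client's records in a canonical order so the digest is stable."""
    h = hashlib.sha256()
    for r in sorted(records):
        h.update(r.encode("utf-8"))
    return h.hexdigest()

# Hypothetical client shards.
shards = {
    "client_a": ["example 1", "example 2"],
    "client_b": ["example 3"],
}

manifest = {name: shard_digest(records) for name, records in shards.items()}
print(json.dumps(manifest, indent=2))

# Later audit: recompute the digest from the claimed data and compare.
assert shard_digest(["example 2", "example 1"]) == manifest["client_a"]
```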
- I don't do mechanistic interpretability, but someone was replicating Anthropic's monosemanticity work and the features were consistent across different random seeds. Hyperparameters are a negligible source of variation for an experiment like this: for one, they used a one-layer model, and two, it's not a matter of controlling those variables, because they would just be the same across runs.
- Anyways, the features were the same, but the direction of the features was unidentifiable. For this, I'm thinking maybe there are small but stacked dependencies: even if you turn off cuDNN non-determinism, floating-point results, etc. are not guaranteed to be bitwise identical. And so, maybe there are just approximation errors that persist and have some significant impact. IDK (see the float32 sketch below).
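- To make the "small but stacked" point concrete, here's a minimal sketch in plain NumPy (not from the replication, just an illustration): the same million float32 values, summed under different groupings and orderings, usually disagree in the last few digits. In isolation that's nothing, but in a training loop it feeds into the next step, so discrepancies can compound instead of cancelling.

```python
# Same values, different accumulation orders: the float32 sums usually differ
# slightly because floating-point addition is not associative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_once = float(x.sum())                                  # one accumulation order
s_split = float(x[:300_000].sum() + x[300_000:].sum())   # different grouping
rng.shuffle(x)                                           # same values, new order
s_shuffled = float(x.sum())

print(s_once, s_split, s_shuffled)
```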
- Even if this is cool to think about, I'm not too sure it's worthwhile for this specific use case, but at the same time, low-frequency features seem sensitive enough that this sort of thing would be helpful. Performance factors like speed operate on much larger scales, so caring about this isn't super important. Ramblings. Karan Goel's work looks cool!
- This is something I care about less, but I feel like systems aren't that sexy lmao. Maybe that's why I like it. Discreteness. Nodes. Number go up, Number go down. Fast. It's optimization. It's economics. All roads lead back to economics.
- A field to dedicate myself to. Can't really think of it. But I like things being better. I like things being useful. I like spotting peculiar things, knowing I'm well behind the curve, but adjacent knowledge is cool. It's fun to apply. I can chase the feeling and probably be okay. A bonis ad meliora.