TT Self Study Journal #2

This is an artifact of my self-study. I am using it to remember links and help manage my focus. As such, I don't expect anyone to read it in full. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even if just to say "good work/good luck". I'm hoping for a feeling of accountability and would like input from peers and mentors. This may also serve as a guide for others who wish to study in a similar way.

List of Acronyms

Mechanistic Interpretability (MI)
AI Alignment (AIA)
Outcome Influencing System (OIS)
Vanessa Kosoy's Learning Theoretic Agenda (VK LTA)
Large Language Model (LLM)
n-Dimensional Scatter Plot (NDSP)

My Goals for this Sprint

SSJ--1 -- Finish writing the first draft of the definition section of my OIS article. I wrote a good amount in the "Definitions and Properties" section, particularly in the "OIS: Outcome Influencing System" subsection, but there is still a lot more to do in the rest of the section. I've been considering how I am defining "system", and I would like to review how the term is used in other contexts.

I've been torn between wanting it to be broad enough to include objects like Outcome Pumps and wanting it to be concrete enough to usefully constrain thinking and to base deductive predictions on. I think in many places I am going to attempt a Bostromian exploration and naming of possibilities: making my definitions as broad as possible, then giving names for, and hopefully examples of, possible properties and categories within the broad definition.

I think this approach gives the sense of overview and context that I want, but may suffer from creating overly cumbersome names, in which case it is probably valuable to find more concise names for particularly relevant objects, even if they can be described with broader terminology.

Even though the OIS document is far from complete, I would be grateful for any early readers who want to offer input, whether on the ideas, the document structure, or the grammar and flow of the writing. I think in the next sprint I will continue working on the definitions section and may also start working on the interdisciplinary section, to help me consider how I want to approach the "system" and "substrate" sections of the definition.

SSJ--2 -- Read VK LTA and write a small summary with my thoughts. I did a small amount of reading for this. Nothing much to say yet.

SSJ--3a -- Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable. I didn't get around to this, but it still seems like a good idea to me.

SSJ--3b -- Start studying Topoi and Linear Algebra textbooks. I started reading Topoi. I went through the definition of a category and started looking at the list of weird categories that build intuition for the definition. I think it is fun when they draw a digraph and ask whether it commutes.
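To make "does this diagram commute?" concrete, here is a minimal sketch in Python. It assumes a toy encoding of my own, not anything from Topoi: in the category of finite sets, a morphism is a dict sending each element of its domain to an element of its codomain. A square commutes when both paths from the top-left object to the bottom-right object compose to the same function. The helper names `compose` and `square_commutes` are hypothetical.

```python
# Toy encoding (an assumption for illustration, not the book's notation):
# in the category of finite sets, a morphism A -> B is a dict sending
# each element of A to an element of B.

def compose(g, f):
    """Return the composite g ∘ f: apply f first, then g."""
    return {x: g[f[x]] for x in f}


def square_commutes(f, g, h, k):
    """Check whether the square

        A --f--> B
        |        |
        h        g
        v        v
        C --k--> D

    commutes, i.e. whether g ∘ f and k ∘ h agree on every element of A.
    """
    return compose(g, f) == compose(k, h)


# A = C = {0, 1, 2, 3}, B = {0, 1}, D = {"even", "odd"}
f = {a: a % 2 for a in range(4)}            # A -> B: reduce mod 2
g = {0: "even", 1: "odd"}                   # B -> D: name the parity
h = {a: a for a in range(4)}                # A -> C: identity
k = {c: "even" if c % 2 == 0 else "odd" for c in range(4)}  # C -> D

print(square_commutes(f, g, h, k))          # True: both paths compute parity

k_bad = {c: "even" for c in range(4)}       # C -> D: constant map
print(square_commutes(f, g, h, k_bad))      # False: paths disagree on odd numbers
```

Dicts only work for small finite sets, but the sketch shows that "commutes" is just a claim that two composites are equal as functions.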

SSJ--4a -- Read Neel’s “Mech Interp Prereqs”. I have read Neel's “Concrete Steps to Get Started in Transformer Mechanistic Interpretability”, in which he walks through what he thinks an MI researcher should be familiar with.

Neel lays out some broad categories to learn, then goes on to recommend three paths for exploration. The broad categories (with my explanations):

ML: I think I have a fairly good base in terms of ML knowledge. Admittedly, I did fail a course on theoretical ML last semester (luckily it wasn't required for my program), but that was because it focused on proving various bounds in online learning contexts (e.g., applying Rademacher complexity or Hoeffding's Lemma to prove regret bounds).

Transformers: This is one of the main focuses for SSJ--4, so it's good that it comes up here. It looks like Callum McDougall has released many educational resources. Thanks Callum! I think I will take the recommendation to go through Transformers From Scratch.

Tooling: I am very interested in tooling. Building tooling, especially for working with high-dimensional data, is something I would like to contribute to. With that in mind, my familiarity with existing tooling is shockingly shallow.

Neel mentions some existing tools. I'd like to review those, and maybe look around for other tools to see what people are saying about them and what they want out of their tools.
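As a concrete example of the kind of online-learning result mentioned under ML above (a standard textbook result, not necessarily one from that course): the Hedge, or exponential weights, algorithm over N experts with losses in [0, 1] has regret at most (ln N)/η + ηT/8 after T rounds, where the ηT/8 term comes from applying Hoeffding's Lemma to each round's loss. Choosing η = sqrt(8 ln N / T) gives regret at most sqrt((T/2) ln N). The sketch below runs Hedge on randomly generated losses (made-up data, for illustration only) and compares the regret to that bound. The function name `hedge` is hypothetical.

```python
import math
import random


def hedge(losses, eta):
    """Run Hedge (exponential weights) over N experts.

    losses: a list of T rounds, each a list of N losses in [0, 1].
    Returns (learner's cumulative expected loss, best expert's cumulative loss).
    """
    n = len(losses[0])
    cum = [0.0] * n      # cumulative loss of each expert so far
    learner = 0.0        # learner's cumulative expected loss
    for round_losses in losses:
        # Weight each expert by exp(-eta * cumulative loss). Subtracting the
        # minimum first avoids underflow and does not change the normalized p.
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        total = sum(w)
        p = [wi / total for wi in w]
        # The learner suffers the expected loss of the mixed strategy p.
        learner += sum(pi * li for pi, li in zip(p, round_losses))
        cum = [c + li for c, li in zip(cum, round_losses)]
    return learner, min(cum)


T, N = 1000, 10
random.seed(0)
losses = [[random.random() for _ in range(N)] for _ in range(T)]  # made-up data

eta = math.sqrt(8 * math.log(N) / T)      # tuned learning rate
learner, best = hedge(losses, eta)
regret = learner - best
bound = math.sqrt(T * math.log(N) / 2)    # sqrt((T/2) ln N)
print(f"regret = {regret:.2f}, bound = {bound:.2f}")
```

The bound holds for every loss sequence in [0, 1], not just random ones, which is what makes it a worst-case (adversarial) guarantee.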

Current Literature

I think I'm well covered for this kind of thing with what I'm doing for SSJ--2. However, I like the suggestion of doing a literature review (combining SSJ--1 and SSJ--2), and I am fond of the questions offered to keep in mind while doing one.

My question from SSJ--2, “what is the current state of RSI (recursive self-improvement) criticality threshold knowledge”, seems like it might be a good topic for a literature review, although it is outside of my normal focus area.

NDSP and My Goals

SSJ--4b -- Do some research and write a little bit about my plans for messing around with LLMs in some capacity. I think this was well covered by SSJ--4a for now: I'm going to go through Transformers From Scratch.

SSJ--5a -- Review my NDSP notes. I gathered and sorted my notes. I think going through them and keeping them in mind while also reviewing other MI tools would be a good idea.

I didn't get around to this. During the last sprint I had trouble actually making the time to sit down and work on this. For that reason, in the next update I want to include a section logging each day I work on this, with a brief explanation of what I focused on.

My goal is to have an entry for at least 4 days of the week, but more than that is even better. I'm hoping this will help motivate me, or at least let me see how I'm doing in terms of putting in the time to work on this.
