Chunk Observations

Sam Crawford edited this page Jan 17, 2023 · 8 revisions

Migrated from #3196.


To go deeper: the 'fundamental knowledge' is not really at the Chunk level, it is what is inside chunks:

  • name
  • abbreviation
  • domains
  • term
  • symbol
  • space / type
  • constraints
  • reasonable value
  • unit
  • uncertainty
  • definition
  • notes
  • defining expression
  • and so on.

Obvious question: are these really the 'fundamental knowledge'? We don't have a good answer to that. But it has been sufficient for us up to now.

So where do Chunks come in? Well, if you consider the above as atoms, then chunks are more like molecules, i.e. collections of atoms. Like molecules, some can arise, and some cannot. So there's "order" in how things assemble. The molecules that interest us are the ones that end up getting defined. This process was quite ad hoc: when a bunch of facts about a thing we were interested in occurred together in practice a bunch of times, we named it.

The classes that arise from that allow you to see two kinds of things:

  • particular atoms that make sense on their own
  • particular sub-molecules that make sense on their own

The underlying theory we should be using is that of Formal Concept Analysis (FCA). The attributes here would be "has information X in it", with X from the list above. Our Chunks are then the nodes of the lattice that occur in practice. Our classes help us navigate the lattice.

Note that there are other analytical techniques (including those listed on that Wikipedia page) that might make sense for us to use. FCA just makes sense to me.

An understanding of FCA also makes it clear that using Maybe is a hack: a proper concept should have an exact list of attributes that it embodies.
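To make the FCA idea concrete, here is a minimal sketch of a formal context and its concepts. It is not Drasil code: the `Attr` type, the sample data, and the function names `intent`, `extent` and `concepts` are all assumptions made up for illustration. It relies on the standard fact that the concept intents are exactly the intersections of object attribute sets (plus the full attribute set).

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Hypothetical attribute universe: "has information X in it".
data Attr = Name | Abbreviation | Domains | Symbol | Unit | Definition
  deriving (Eq, Ord, Show, Enum, Bounded)

-- A formal context: each object (a piece of knowledge) with its attributes.
type Context = Map.Map String (Set.Set Attr)

-- Made-up sample data, for illustration only.
sampleContext :: Context
sampleContext = Map.fromList
  [ ("length",  Set.fromList [Name, Symbol, Unit, Definition])
  , ("time",    Set.fromList [Name, Symbol, Unit, Definition])
  , ("SRS",     Set.fromList [Name, Abbreviation, Domains])
  , ("program", Set.fromList [Name])
  ]

-- Every attribute in the universe.
allAttrs :: Set.Set Attr
allAttrs = Set.fromList [minBound .. maxBound]

-- Attributes shared by every object in the given set (the "intent").
intent :: Context -> Set.Set String -> Set.Set Attr
intent ctx objs = foldr Set.intersection allAttrs
  [ as | (o, as) <- Map.toList ctx, o `Set.member` objs ]

-- Objects that have every attribute in the given set (the "extent").
extent :: Context -> Set.Set Attr -> Set.Set String
extent ctx attrs = Map.keysSet (Map.filter (attrs `Set.isSubsetOf`) ctx)

-- All formal concepts as (extent, intent) pairs. The intents are built by
-- closing the object attribute sets under intersection.
concepts :: Context -> [(Set.Set String, Set.Set Attr)]
concepts ctx = [ (extent ctx i, i) | i <- Set.toList intents ]
  where
    intents = foldr addObject (Set.singleton allAttrs) (Map.elems ctx)
    addObject as acc = acc `Set.union` Set.map (Set.intersection as) acc
```

On this sample, `concepts sampleContext` produces four concepts; one of them has extent {length, time} and intent {Name, Symbol, Unit, Definition}, which is the kind of node we would end up naming. Note that each concept carries an exact attribute list, with no optional fields.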

To bring it full-circle: there is all sorts of knowledge that exists that is well-defined, but doesn't possess an abbreviation. So we can't make abbreviation mandatory, as that would undermine our whole system. But when abbreviations exist, they should be used. From a pure programming point-of-view, that screams for Maybe, doesn't it? We've learned that this is not a good solution. A better solution seems to lie at the "knowledge retrieval" stage, where we can have functions that retrieve abbreviations if they exist, and our code should deal with the fact that abbreviations are not always present.
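One possible reading of "retrieval stage" is sketched below: the idea itself has no abbreviation field, abbreviations are stored as separate facts, and a lookup function reports whether one exists. The types `Idea` and `AbbrevTable` and the functions `abbreviationOf` and `shortName` are hypothetical, not existing Drasil names.

```haskell
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)

type UID = String

-- An idea that stands on its own; there is no abbreviation field at all.
data Idea = Idea { ideaUid :: UID, ideaName :: String }

-- Abbreviations are kept as separate facts keyed by UID (an assumption).
type AbbrevTable = Map.Map UID String

-- Retrieval: the answer may be absent, and the return type says so.
abbreviationOf :: AbbrevTable -> Idea -> Maybe String
abbreviationOf table idea = Map.lookup (ideaUid idea) table

-- Callers decide what to do when no abbreviation exists;
-- here we fall back to the full name.
shortName :: AbbrevTable -> Idea -> String
shortName table idea = fromMaybe (ideaName idea) (abbreviationOf table idea)

-- Made-up sample data.
sampleTable :: AbbrevTable
sampleTable = Map.fromList [("srs", "SRS")]

srs, program :: Idea
srs     = Idea "srs" "Software Requirements Specification"
program = Idea "program" "program"
```

With this data, `shortName sampleTable srs` uses the abbreviation and `shortName sampleTable program` falls back to the name. The Maybe lives only in what the accessor returns, not in the representation of `Idea`.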

What we need to do:

  1. settle on an analysis technique for concepts,
  2. list all the attributes we have,
  3. derive the concepts we need (by using co-occurrence in our actual knowledge database),
  4. give names to the concepts we've thus extracted
  5. create data-structures for those concepts
  6. create accessors for all that information.
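Steps 2 to 4 could be started very cheaply by counting which attribute sets actually co-occur in the knowledge base. The sketch below assumes each entry has already been reduced to a name and a set of attribute labels; `Entry`, `coOccurrence`, `candidates` and the sample data are made up for illustration.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.List (sortOn)
import Data.Ord (Down (..))

type Attr = String

-- An entry in the knowledge base, reduced to its name and attribute set.
type Entry = (String, Set.Set Attr)

-- Count how many entries carry exactly the same attribute set.
coOccurrence :: [Entry] -> Map.Map (Set.Set Attr) Int
coOccurrence entries =
  Map.fromListWith (+) [ (attrs, 1) | (_, attrs) <- entries ]

-- Attribute sets seen at least n times, most frequent first.
-- These are the candidates that deserve a name (step 4).
candidates :: Int -> [Entry] -> [(Set.Set Attr, Int)]
candidates n =
  sortOn (Down . snd) . filter ((>= n) . snd) . Map.toList . coOccurrence

-- Made-up sample data.
sampleEntries :: [Entry]
sampleEntries =
  [ ("length",  Set.fromList ["name", "symbol", "unit"])
  , ("time",    Set.fromList ["name", "symbol", "unit"])
  , ("mass",    Set.fromList ["name", "symbol", "unit"])
  , ("SRS",     Set.fromList ["name", "abbreviation"])
  , ("program", Set.fromList ["name"])
  ]
```

Here `candidates 2 sampleEntries` keeps only the {name, symbol, unit} combination, which occurs three times. Exact-match counting is the crudest possible analysis; a proper FCA pass would also surface shared sub-molecules.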

In practice, although the above steps should be done from the beginning, I'm quite confident that a lot of what we currently have will stay as is, or with minor modifications.

In our re-design, I really like the idea of starting from the fundamental knowledge that is inside the chunks. We should also try to brainstorm knowledge that we think will be relevant in the future. We won't be able to make a perfect prediction, but I'll start a list of brainstormed thoughts below. In my list, I won't worry about whether the knowledge will end up inside a chunk, or possibly be tracked in a different way.

  • the local symbol used to represent a quantity. I think we currently "bake" the symbol into the quantity, making it difficult to change, but symbols aren't universal; they can change. In some cases, symbols are changed to avoid clashes between conventions when different domains are mixed (for instance, sigma is used for standard deviation, stress, and the Stefan-Boltzmann constant). In other cases, symbols are changed because of author/community preferences.
  • unit system. We implicitly (I believe) assume SI for everything, but we will also want to be able to use imperial units.
  • rationale information. For example, for constraints, we may want to include a rationale for the constraint. Our "detailed derivations" currently provide rationale information for how we combine theories and assumptions to come up with a new theory.
  • refinement traceability information. Many theories will depend on other theories for their justification (rationale).
  • theory pre-conditions. Conditions that will need to be true to invoke a theory. That is, you can only use a theory if you can satisfy the pre-conditions. The pre-conditions will be assumptions.
  • theory post-conditions. The conditions that have to be true once a theory has been invoked.
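As a sketch of the first item in the list above, local symbols could live in a table separate from the quantities, with per-document overrides layered over defaults and a check for clashes. Everything here (`SymbolTable`, `localSymbols`, `clashes`, the UIDs and the symbol strings) is hypothetical.

```haskell
import qualified Data.Map as Map

type UID = String
type Symbol = String

-- Symbols are stored outside the quantity, keyed by the quantity's UID.
type SymbolTable = Map.Map UID Symbol

-- Local overrides win over defaults (Map.union is left-biased).
localSymbols :: SymbolTable -> SymbolTable -> SymbolTable
localSymbols overrides defaults = Map.union overrides defaults

-- Symbols that stand for more than one quantity in a given table.
clashes :: SymbolTable -> Map.Map Symbol [UID]
clashes table = Map.filter ((> 1) . length) bySymbol
  where
    bySymbol = Map.fromListWith (++) [ (s, [u]) | (u, s) <- Map.toList table ]

-- Made-up defaults: three quantities that all want "sigma".
defaults :: SymbolTable
defaults = Map.fromList
  [ ("stdDev", "sigma"), ("stress", "sigma"), ("stefanBoltzmann", "sigma") ]

-- Made-up overrides for a document that mixes the three domains.
overrides :: SymbolTable
overrides = Map.fromList [ ("stdDev", "s"), ("stefanBoltzmann", "sigma_SB") ]
```

With these tables, `clashes defaults` reports "sigma" as shared by three quantities, while `clashes (localSymbols overrides defaults)` is empty.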

There are two places where Maybe can be used:

  1. in the data representation,
  2. in what the data accessors return.

So we'd have HasX classy-lenses and MayHaveX classy-lenses. We could have instances of MayHaveX for all sorts of things where we already know there is no X but where asking the question isn't silly. We do need to be careful to not implement MayHaveX where the question should not be asked.

From the point of view of our usage, lenses are just polymorphic getters. We want to be able to "get X" from some representation without caring how X is embedded in the data we've been handed, as long as we're promised that X is in there somewhere.
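Here is a minimal sketch of the HasX / MayHaveX split, using a hand-written van Laarhoven lens so that it needs nothing beyond base. The classes `HasTerm` and `MayHaveAbbreviation` and the types `SimpleIdea` and `AbbreviatedIdea` are stand-ins invented for this example, not the actual Drasil definitions, and the MayHaveX side is shown as a plain getter since getting is all we use lenses for.

```haskell
{-# LANGUAGE RankNTypes #-}
import Data.Functor.Const (Const (..))

-- A van Laarhoven lens, spelled out so no lens library is required.
type Lens' s a = forall f. Functor f => (a -> f a) -> s -> f s

-- Read through a lens (what `view` does in lens libraries).
view :: Lens' s a -> s -> a
view l = getConst . l Const

-- HasX: we are promised that a term is in there somewhere.
class HasTerm c where
  term :: Lens' c String

-- MayHaveX: asking for an abbreviation is sensible, but it may be absent.
class MayHaveAbbreviation c where
  abbreviation :: c -> Maybe String

-- Two hypothetical representations with exact attribute lists.
data SimpleIdea = SimpleIdea String
data AbbreviatedIdea = AbbreviatedIdea String String

instance HasTerm SimpleIdea where
  term f (SimpleIdea t) = SimpleIdea <$> f t

instance HasTerm AbbreviatedIdea where
  term f (AbbreviatedIdea t a) = (\t' -> AbbreviatedIdea t' a) <$> f t

-- We already know there is no abbreviation here, but the question is not silly.
instance MayHaveAbbreviation SimpleIdea where
  abbreviation _ = Nothing

instance MayHaveAbbreviation AbbreviatedIdea where
  abbreviation (AbbreviatedIdea _ a) = Just a

-- Works for any representation, without caring how the data is embedded.
describe :: (HasTerm c, MayHaveAbbreviation c) => c -> String
describe c = case abbreviation c of
  Just a  -> view term c ++ " (" ++ a ++ ")"
  Nothing -> view term c
```

Neither data type contains a Maybe; the optionality shows up only in what the `MayHaveAbbreviation` accessor returns. A type for which the question should not be asked simply gets no instance, and `describe` will not compile for it.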

A Side Effect

Reconstructing our thinking from ~6-7 (!!!) years ago, we noticed that many important 'concepts' (where I use the term informally) had a tell-tale sign that they were more important than others: they came with an abbreviation. This was, of course, purely an observation on the sample that we had, though it does still seem to hold. Where we seem to have made an error was in enshrining this in our data representation.

Taking a step back, it does seem odd to enforce the existence of an abbreviation. An abbreviation really is something that may exist. Therefore, the distinction between NamedChunks and IdeaDicts might not be useful.

We really do need to go back to the blackboard (perhaps even literally!) and revisit all our chunks (their contents, their name, their intent, their constructors). An in-person design meeting is likely needed.

Beyond this, if we decide that attaching a domain at the Idea level is a code smell, then I propose we merge the two chunks. Otherwise, I think that keeping them separate makes sense: both would have a Maybe String for an abbreviation, and CI would also contain a list of domains (we could even make this more explicit by having a CI be an IdeaDict and a list of domains).
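The "keep them separate" option could look roughly like the sketch below. The field names and the `toIdeaDict` helper are assumptions for illustration; the real IdeaDict and CI carry more than is shown here.

```haskell
type UID = String

-- Every idea has a term and may have an abbreviation.
data IdeaDict = IdeaDict
  { ideaTerm   :: String
  , ideaAbbrev :: Maybe String
  }

-- A CI is an IdeaDict together with the list of domains it belongs to.
data CI = CI
  { ciIdea    :: IdeaDict
  , ciDomains :: [UID]
  }

-- Any CI can be viewed as an IdeaDict by forgetting its domains.
toIdeaDict :: CI -> IdeaDict
toIdeaDict = ciIdea

-- Made-up sample value.
srs :: CI
srs = CI (IdeaDict "Software Requirements Specification" (Just "SRS")) ["softEng"]
```

Making the containment explicit means anything written against `IdeaDict` works on a `CI` via `toIdeaDict`, and merging the two chunks later amounts to moving `ciDomains` into `IdeaDict`.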
