Date: Fri, 22 Mar 2024 03:30:34 +0000 (UTC) From: David Fuchs Subject: Re: [james@jpfennell.com: [tex-k] Phantom ligature bug in tftopl] [In response to: https://tug.org/pipermail/tex-k/2024-March/004031.html] Well, I've looked closely, and James is pretty much right, with some caveats. TLDR: I have a proposed fix and two additional textual changes down at the end. A little background: The encoding of the ligature / kern stuff is indeed a bit weird, and some of the explanation provided in the various web files (tftopl, mf, tex, and, surprisingly, gftodvi) could be worded better. And it sure doesn't help that the different programs refer to the four individual bytes in a "ligature/kern program step" differently in the code (for instance, the op_byte is variously |op_byte|, |c|, |tfm[lig_step(r)+2]| or what James uses, |b2|, which is what |op_byte| expands into by the time you Tangle things). I'm going to use skip_byte=b0, next_char=b1, op_byte=b2, and remainder=b3 to avoid complete obfuscation. Anyway, speaking slightly sloppily, each character in a font has only a single byte to point into the lig/kern table, so each character's little lig/kern program can only start at location 0 up to 255. That's potentially not enough, since each character's lig/kern program can be many steps long (one step per kern or ligature), so there's a special hack: If a character's initial pointer into the lig/kern table points to a lig/kern with skip_byte that's >128, this is a super-duper special flag that says "hey, I really mean that you immediately go to location 256*b2+b3 instead, and start there." Great, now characters' lig/kern programs can start in 64K different locations. (Foreshadowing: Note well that it's ">128", not ">=128".) Of course, one of the things that gets checked when tftopl or TeX reads in a tfm file, is that all such special hack first-instruction pointers are actually "in bounds"; that is, 256*b2+b3 has to be less than the size of the lig/kern table declared earlier in the tfm file (a value called |nl| in all the web programs, meaning "number of words in the lig/kern table"). OK, now looking at each regular (not special-hack) lig/kern step, whenever the skip_byte is >=128, then this particular lig/kern step is the final one for the character in question. Here's the part that isn't explained well, in my opinion: If skip_byte=128, then the rest of the fields encode the actual final step, which is to be considered like all of the other steps; skip_byte=128 simply means "there is no next step after this one." However, when skip_byte>128, the comments call it an "unconditional halt" which means "don't consider the rest of the fields as a potential lig/kern; do stop immediately." Essentially "well, I meant to make the previous step the final step". In fact, pltotf and mf write out tfm files that always use 128 (|stop_flag|) to indicate the end of a lig/kern program. So, what is the special >128 unconditional halt rule for? Well, Metafont does some error-correction of any malformed lig/kern steps in the mf file by using stop_flag+1 on output, but that's not the main motivation. There's a hint in the tftopl.web et. al. comments, where it says that "Any instruction with |skip_byte>128| in the |lig_kern| array must have |256*op_byte+remainder128 in the first part of the lig/kern section. But it's a little tricky to figure out how many of the first 256 slots that might be. And, what do you know, we already have a for-loop that going to check all the "regular" steps in the entire table, including any that might be in the first 256 slots. So, we can save a few bytes of object code by just having a single for-loop that goes over all the steps in the entire table, and ANY that have skip_byte>128 will be required to pass that check. See : If skip_byte ("a") is >128 we do the check, and that covers first-instruction hack cases (where it matters) and unconditional halt cases (where it doesn't), with no need to try to differentiate them. Without going into details, the left/right boundary pseudo-character slots in lig/kern table also rely on the details of this trickery. That all really begs the question of why unconditional halt is defined as an unconditional halt? I think it's only to keep things well-defined, and so we know TeX's behavior in all situations, even those that are really hard to reach. And it was easier to define the behavior, rather than preclude it by having it cause an un-loadable tfm, just because it would be hard to detect (and waste space and time to do so) in TeX. And, defined this way, lig/kern searches in TeX can be pretty optimized anyway; in and and and make_ord (where it's a little spread out), it's always the same pattern: Fetch the first instruction; if its skip_byte>stop_flag, then fetch the real first instruction; now loop around, seeing if the current instruction is a match for the present hlist character pair and if this isn't an unconditional halt (skip_byte<=stop_flag), in which case we have a hit; otherwise, if this was a final step or unconditional halt (done in a single compare of skip_byte>=stop_flag!), then exit the loop. So, at worst it only costs a single extra compare for each successful "hit", and stuff is already in registers, so costs a cycle at most, and maybe it's free on modern machines. Or, maybe we're lucky and the compiler realizes that the CPU already has the condition flags set, and so an extra compare isn't even needed. Asides: 1) Interestingly, TeX doesn't spend the code space and time to look for ligature loops and declare them evil when opening tfms like tftopl does; it just takes care to allow the user to interrupt such loops (see "infinite ligature loop" comment on a few check_interrupt calls in tex.web). This keeps code size down, I suppose; tftopl spends a bit of effort to implement it. 2) I can't help but notice that "unconditional halt" is sometimes referred to as "unconditional stop" (or is it vice-versa?) Perhaps this wants to be fixed. OK, so now what's up with tftopl? Well, while I'd quibble with his 3-case if/then/else explainer, and his "begin if tfm[k] >= stop_flag ..." has an extra "begin" and an extra "=" and also isn't quite right because |k| isn't set up to start with, James is close to a working solution. I think it would be better placed (and perhaps be more Knuthianly expressed), as: @x 1450 @ @= repeat hash_input; k:=tfm[lig_step(i)]; if k>=stop_flag then i:=nl else i:=i+1+k; until i>=nl @y @ @= repeat r:=tfm[lig_step(i)]; if r<=stop_flag then hash_input; if r>=stop_flag then i:=nl else i:=i+1+r; until i>=nl @z since that really brings out what it means to be below, at, or above stop_flag (do this step and continue, do this step and stop, don't do this step but still stop), and does so all in one spot, uniquely. (I used |r| rather than |k| since |k| gets clobbered down in hash_input.) The first of James's test tfms does have an error in it, and TeX does reject it, as does my fixed tftopl. I can't help but notice that the pl file he links to has a "2" in its name, so maybe it's the wrong one? [Note from Karl: indeed; see sibling tftopl-stop.zip for James' files.] In any case, another possible reason for confusion is that the error message is "Bad TFM file: Ligature unconditional stop command address is too big" which is a bit misleading; it should perhaps say "... Ligature/Kern ..." because the bad unconditional stop can be on either kind of step. The second one, TeX accepts, as does my modified tftopl, which creates a pl file containing one ligature and three characters. I hope I'm right about all this; I had only the vaguest background in it, so won't bet my life. In addition to his two tfm files, I also tried it out on trip.tfm, trap.tfm, and cmr10.tfm. And my two added comments for DEK stand: The "unconditional halt/stop" nomenclature should be conformized; and the "unconditional stop command address is too big" error should mention ligature/kern. Well, that was fun,-drf