LLMs are fundamentally matching the patterns they’ve seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extend their ability to do more sophisticated pattern matching. The mathematical results imply that you can always find compositional tasks whose complexity lies beyond a given system’s abilities.

LLMs are still very useful in a bunch of domains, but here’s an article explaining (based on a paper with a novel bound on computational complexity) why improvements in reasoning seem to have run out of steam.

This is exactly the discussion I had yesterday: when teams reach higher levels of maturity, they may catch and fix issues internally before they become full-blown incidents.

So how do we then make sure we are not over-indexing purely on the operational surprises where we ran the heavy incident machinery, and how do we get the right learnings and improvements disseminated throughout the organisation?

(I know how. The “how” here is how to get there.)

This is a very good framework from Matt Webb for how organizations can do strategic pathfinding when it comes to AI. In Germany especially, lots of orgs would benefit from doing this instead of whatever it is they are busy with right now.

Interesting to see the word “overhang” used here (originally by Nat Friedman), which I normally use when talking about tech “debt”, but a capability overhang is of course also possible.

https://interconnected.org/home/2023/12/08/ai-pathfinding

[…] the abandonment of responsibility in two dimensions. Firstly, and following on from what was already happening in ‘big data’, the world stopped caring about where AI got its data — fitting in nicely with ‘surveillance capitalism’. And secondly, contrary to what professional organisations like BCS and ACM had been preaching for years, the outcomes of AI algorithms were no longer viewed as the responsibility of their designers — or anybody, really.

That’s a fairer and more informed take than most. AI systems can be very useful in limited contexts but you wouldn’t want one to decide anything material about your life.

https://www.bcs.org/articles-opinion-and-research/does-current-ai-represent-a-dead-end

Artemis, fostered with Apollo, virgin who delights in arrows, far-shooting goddess, who swiftly drives her all-golden chariot through Smyrna to vine-clad Claros, I ask that you establish a loop counting from ninety-nine to one called beerLoop.

A programming language where you make things happen by invoking the gods. Nice to see that classics upbringing put to some professional use.

https://github.com/rottytooth/Olympus

After reading about it, I started trying out Sonshi style myself. RMS is a disgusting character so let’s just pretend the guru is somebody else.

I do this during extended work sessions away from my desk.

The setup is really easy with Karabiner-Elements and a new profile that disables the laptop’s built-in keyboard when my Contra is plugged in.

https://xn--gckvb8fzb.com/sonshi-style-aka-keyboard-on-laptop/

A roller coaster of an interview with a series of FizzBuzz extensions that are honestly not even that bad, but it’s poor form from the interviewers’ side not to recognize the absolute balling brilliance on display here.

I can just about follow along with this level of TypeScript type-level programming, but to be able to whip this out during an interview is a testament to mastery.

https://kranga.notion.site/The-fizzbuzz-that-did-not-get-me-the-job-180e7c22ef3b80c3a386f7f8de720ac7

Do not fall into the trap of anthropomorphising Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn, you stick your hand in there and it’ll chop it off, the end. You don’t think ‘oh, the lawnmower hates me’ – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.

I think this is a well argued plea by Ken Shirriff to stop using the term “cargo cult” and I agree with it but would like to add two things.

With a non-English-speaking audience that does not have the same priors, nobody will have any idea what you are talking about if you use the term “cargo cult”. You’ll be stuck explaining the term in a ham-fisted way that fails to convey the huge amount of history and social science involved.

One problem with rejecting the term is that it lets software engineers off the hook and allows them to pretend the way they work is different from that of the tribal inhabitants of Pacific islands. I’d argue that most software engineering practice is based on folklore and is deeply tribalistic.

https://www.righto.com/2025/01/its-time-to-abandon-cargo-cult-metaphor.html

No port/adapter terms to learn, no unnecessary layers of horizontal abstractions, no extraneous cognitive load.

Reducing cognitive load is a continuous battle. The entropy direction of a software system is always towards more complexity.

People add a lot of this stuff out of either inexperience or a need to look smart. Simple code that gets the job done is often looked down upon.

https://minds.md/zakirullin/cognitive

For those who want to introduce some whimsy into their programming, and for whom using a variable-width font in your code editor is a bit too far, there is now Comic Mono (via). It doesn’t even look all that terrible.

(After using Iosevka and Inconsolata for a long time, I’m now, as are many people, a happy JetBrains Mono user.)

https://dtinth.github.io/comic-mono-font

I love GitHub Projects for tracking work. It’s close to the code and engineers understand it natively. I feel you can deliver features of any size with it if you work with the tool.

The only thing that’s a bit annoying is the lack of improvements from GitHub. There are a bunch of quality-of-life features I’m used to from other tools that would really make a difference. But now, with LLMs, we don’t have to settle.

I asked Cursor to write me a user script that adds a “Create Follow-up Task” button (which I used a lot in Asana) to issues on GitHub. It did a reasonable enough job that I could tweak the result and then have something working for me. I could write this myself of course, but the hurdle of figuring out the format and the wiring felt like a blocker.

https://github.com/alper/user-scripts/blob/main/github-followup-issue/github-followup.user.js

I’ve had to spend more time than I’d like thinking about how datetimes are stored in databases, and even the commonly accepted practice of storing UTC does not work for all cases.

Specifically, when you store something that will happen in the future, you need to store the location of the event as well. Otherwise any daylight-savings change will shift your event around. This applies not just to single events but also to, say, order cut-off times, which aren’t pinned to a single date.
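Here’s a minimal sketch in Rust with chrono and chrono-tz (my own toy example, not from any real codebase) of why a pre-computed UTC instant goes wrong: the same wall-clock time in Berlin maps to different UTC offsets on either side of a DST switch.

use chrono::TimeZone;
use chrono_tz::Europe::Berlin;

fn main() {
    // "09:00 in Berlin" on either side of the October 2025 DST change:
    let before = Berlin.with_ymd_and_hms(2025, 10, 25, 9, 0, 0).unwrap();
    let after = Berlin.with_ymd_and_hms(2025, 10, 27, 9, 0, 0).unwrap();

    // The UTC offsets differ, so a UTC timestamp computed in advance for the
    // second date would be an hour off relative to the local event time.
    println!("{}", before.to_rfc3339()); // 2025-10-25T09:00:00+02:00
    println!("{}", after.to_rfc3339());  // 2025-10-27T09:00:00+01:00
}

Hence the safer storage format for future events: local wall-clock time plus an IANA zone name, with the UTC instant derived at read time.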

A very useful thought experiment whenever anybody tries to pretend LLMs are ‘human’ because they sound human.

Here's why "alignment research" when it comes to LLMs is a big mess, as I see it. Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".

Colin (@colin-fraser.net) 2024-12-19T23:15:38.459Z

RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models

Medical QA depends heavily on domain-specific knowledge that is not always available within pre-trained models, necessitating knowledge-based retrieval from external sources.

In addition, medical knowledge evolves rapidly, and new treatments or updated guidelines may not be included in the model’s pretrained corpus.

The question example for the reasoning process in Figure 1 is a multiple-choice question. That seems overly simple.

In parallel, Commonsense Question Answering shares similar complexities with Medical QA, particularly in its reliance on structured multi-step reasoning and iterative evidence retrieval.

rStar (Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers)

The rStar approach seems worth diving into. That will be the paper I read next.

Monte Carlo Tree Search

enabling the open source LLMs (LLAMA3.1) to achieve competitive performance with top closed-source LLMs like GPT-4 and GPT-4o.

We’ll come to this later in the paper. Their conclusion is that they can trick out LLAMA to get similar performance to GPT-4 in these domains.

Upper Confidence Bound applied on trees (UCT)
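For context, since the paper name-drops it: UCT is the standard MCTS selection rule (textbook material, not anything specific to this paper). At each node you descend into the child $j$ maximizing

$$\mathrm{UCT}_j = \bar{X}_j + C\sqrt{\frac{\ln N}{n_j}}$$

where $\bar{X}_j$ is the child’s average reward, $n_j$ its visit count, $N$ the parent’s visit count, and $C$ a constant trading off exploitation against exploration.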

In contrast, rStar incorporates five distinct actions that enable more adaptive exploration:

A1: Propose a One-Step Thought. This action generates the next reasoning step based on previous steps, allowing the LLM to build the solution incrementally.

A2: Propose Remaining Thought Steps. This action enables the LLM to produce all remaining reasoning steps in one inference, similar to CoT, for simpler questions.

A3: Generate Next Sub-question and Answer. This action decomposes the main problem into a sequence of sub-questions, each solved in turn.

A4: Re-answer Sub-question. This action allows the LLM to re-answer a previously generated sub-question, increasing accuracy by using few-shot prompting.

A5: Rephrase Question/Sub-question. This action rephrases the question to clarify conditions and reduce misunderstandings, enhancing the LLM’s interpretation of the problem.
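To keep the five actions straight for myself, here they are as a Rust enum (purely my shorthand, not code from the paper):

enum RStarAction {
    ProposeOneStepThought,     // A1: extend the reasoning chain by one step
    ProposeRemainingThoughts,  // A2: finish the whole chain in one go, CoT-style
    GenerateSubquestionAnswer, // A3: decompose into the next sub-question and solve it
    ReanswerSubquestion,       // A4: retry an earlier sub-question with few-shot prompting
    RephraseQuestion,          // A5: restate the question to clarify its conditions
}

As I understand it, at every node the MCTS rollout picks among these instead of a single “generate next step” action.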

I need to trace the rStar algorithm after reading the original paper. The explanation here is too short.

These queries target information that can either support or refute the content of each statement, ensuring comprehensive factual verification.

How does this approach deal with (non-)negation, which LLMs often have a lot of trouble with? From a language perspective it could just as easily say I can or can’t eat grapefruit (iykyk) based on the temperature that day, but especially in a medical context these kinds of errors can be catastrophic.

RARE achieves substantial gains, outperforming rStar by 5.17% on MedQA, 2.19% on MedMCQA and 2.39% on MMLU-Medical.

Even if these numbers are statistically significant (which they don’t say), these increases are really modest. I would not call this in any way “substantial”.

Looking at Table 1, RARE is as much of an increase over rStar as rStar is over the next best approach so from that perspective maybe you could call it significant. The difference between worst and best framework here is around 10% across CoT, RAG, SC, rStar, RARE.

evaluated on StrategyQA (SQA), CommonsenseQA (CQA), Social IQA (SIQA) and Physical IQA (PIQA)

The main question I have is at what accuracy such a system would become reasonable to use in a real-world context. Even 90-95% seems like it would be too low to rely on when the stakes are high.

By enhancing LLMs with retrieval-augmented reasoning, RARE bridges the gap between open source models and state-of-the-art proprietary systems.

The framework has only been tested on open source models like LLaMA 3.1 and not on larger proprietary models such as GPT-4. This is due to the high number of API calls required by RARE’s iterative retrieval and reasoning process, making evaluation on closed source models prohibitively costly.

So here they repeat the claim that they’ve bridged the gap, yet they say they haven’t used this approach with a model like GPT-4 because the number of API calls would make it too expensive.

That leaves on the table that this kind of many-call approach is open to OpenAI, because they can do these numbers of calls much more affordably in-house. No real gap has been closed here, and it shows again how big an advantage OpenAI has.

That raises the question: What makes GPT-4 so good? Why does it perform so much better than open source models?

RARE is designed to identify a single reasoning trajectory that leads to a correct answer but does not necessarily optimise for the best or shortest path that maximises robustness.

Any integration into medical workflows must be supervised by qualified practitioners to ensure patient safety and ethical use.

The shell I use daily (because it’s the best, really), fish, has been rewritten entirely in Rust, because it’s nice and more fun: “For one, fish is a hobby project, and that means we want it to be fun for us. Nobody is being paid to work on fish, so we need it to be fun. Being fun and interesting also attracts contributors.”

I can testify to this because when most of the code was rewritten I checked it out, built it and poked around a bunch to see how it works. I don’t think I would have done that, or enjoyed doing it, if it had been a C++ codebase. That was also when I was confronted with the fact that what makes a shell really complicated is not the language in which it is programmed, but the underlying system that it is an interface to.

The story of the port and its success is legendary as far as these things go.

https://fishshell.com/blog/rustport

Another take on the old adage that writing the code is the easy part of software engineering. The real work is figuring out what has to be built and how. Once that is clear, the actual building can be done relatively quickly and linearly.

I think the notion of a dead program is useful though it’s not always that clear cut:

The death of a program happens when the programmer team possessing its theory is dissolved. A dead program may continue to be used for execution in a computer and to produce useful results.

https://olano.dev/blog/software-design-is-knowledge-building

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

A paper where they fine-tune an LLM to answer some questions itself and to figure out for which questions it needs to use a specialized tool. Intelligent tool usage seems like it would expand the use cases for LLM-driven systems much more than any kind of scaling (real or imagined).

However, scholars note that their abilities are capped at approximately high-school levels

That seems like a noteworthy statement, especially if you are looking to LLMs to provide “novel thinking”. It seems much more likely that high-school problems are abundantly available and relatively trivial, so they get a specific focus.

For numerical answers in the MATH and SciBench datasets, we consider answers correct if they fall within a small tolerance range of the true value, specifically within ±5%.

I don’t really see why you could not get exact answers in a mathematical domain.

This performance gap on public benchmarks is likely due to the larger parameter count and specific optimization of state-of-the-art models on the open-source datasets.

Same as with the high-school questions: these datasets are easily available and draw attention, so the models overfit on them.

The model Ours-Pn demonstrates performance comparable to Base-Pf, both showing a significant improvement over the base model. This similarity indicates successful internalization of distilled knowledge from tools. The transition from Ours-Pn to Ours-Pi showcases further improvement in answer accuracy, resulting from the model’s enhanced ability to intelligently switch to tools for harder questions.

This is the core proposition of the paper. Looking at Table 1 with the accuracy percentages there is something of an improvement but it does not really look dramatic or so convincing that you could use these systems in any critical context.

We’re looking at increases of 10-20% and an accuracy that’s still well under 90% (which I’m also not convinced would be usable).

We introduced a novel two-component fine-tuning approach to enhance Large Language Models (LLMs) in solving scientific problems of varying complexity.

One of the key issues I have with the paper is how much work the term “scientific problems” is doing. If this is published, people are going to think the LLM is solving actual novel issues, whereas in this case it’s just filling in relatively basic question/answer pairs that are well understood. Calling them problems is problematic.

The most interesting part of the paper is the appendix, where you can see the actual questions and answers in the various datasets and the prompts they used (with example responses). The answers are mostly multiple choice, which already influences how many of them you should expect to be correct.

Monolith: Real Time Recommendation System With Collisionless Embedding Table

I didn’t get that much from this paper, probably because it’s pretty high level and I don’t have a strong background in recommendation systems.

The core of the approach is their collisionless cuckoo hashmap for embeddings, from which they can update parameters on the fly using existing data-engineering pipeline technology.

Instead of reading mini-batch examples from the storage, a training worker consumes realtime data on-the-fly and updates the training PS. The training PS periodically synchronizes its parameters to the serving PS, which will take effect on the user side immediately. This enables our model to interactively adapt itself according to a user’s feedback in realtime.

Eight Things to Know about Large Language Models

A bunch of stuff that maybe was somewhat surprising a year ago but by now should be common knowledge for anybody even half following the developments in this field.

Some interesting bits in there but for the rest it’s a bit rah-rah because the author works at Anthropic.

In particular, models can misinterpret ambiguous prompts or incentives in unreasonable ways, including in situations that appear unambiguous to humans, leading them to behave unexpectedly.

Our techniques for controlling systems are weak and are likely to break down further when applied to highly capable models. Given all this, it is reasonable to expect a substantial increase and a substantial qualitative change in the range of misuse risks and model misbehaviors that emerge from the development and deployment of LLMs.

The recent trend toward limiting access to LLMs and treating the details of LLM training as proprietary information is also an obstacle to scientific study.

The Digital Patient Record system in Germany is built on smart cards and hardware which make it impossible to update and keep secure.

Of course a company like Gematik can’t update algorithms and keys on such a widespread, heterogeneous system. This is a competency that is impossible to organise except at the largest scales, and even then companies like Microsoft will routinely leak their root keys.

The ‘hackers’ who made this presentation also can’t make something better than this and their culture is what led us to this point in the first place. It’s the same story with the German digital ID card which nobody uses.

The recipe is simple:

  • Demand absurd levels of security for threat models that are outlandish and paranoid
  • Have those demands complicate your architecture with security measures that look good but are impossible to maintain
  • Reap the exploits that you can run against that architecture and score publicity
  • <repeat>

It’s a great way to make sure that everybody loses in the German IT landscape.

Solution: Simplify the architecture to a server model with a normal 2FA login and keep that server secure. Done.

https://www.golem.de/news/elektronische-patientenakte-so-laesst-sich-auf-die-epa-aller-versicherten-zugreifen-2412-192003.html

The low-latency user wants Bigtable’s request queues to be (almost always) empty so that the system can process each outstanding request immediately upon arrival. (Indeed, inefficient queuing is often a cause of high tail latency.) The user concerned with offline analysis is more interested in system throughput, so that user wants request queues to never be empty. To optimize for throughput, the Bigtable system should never need to idle while waiting for its next request.

This is also my abject suffering at the moment: we have lots of shared resources which need to stay available but can also be hammered by various parties.

Good to read about that in this piece by Dan Slimmon: The Latency/Throughput Tradeoff: Why Fast Services Are Slow And Vice Versa. I read the SRE book as well, but that part did not register with me back then.
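The textbook queueing picture (standard M/M/1 material, not from either source) makes the tradeoff concrete: with arrival rate $\lambda$ and service rate $\mu$, the mean time a request spends in the system is

$$W = \frac{1}{\mu - \lambda} = \frac{1}{\mu(1-\rho)}, \qquad \rho = \frac{\lambda}{\mu}$$

so as utilization $\rho \to 1$ (the throughput user’s dream of never-empty queues), latency grows without bound.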

I use AI tools to help me program despite them being mostly very disappointing. They save me some typing once in a while.

At least, now that I have switched from Perplexity to Cursor, I can ask my questions in my editor directly without having to open a browser search tab. I pass through a lot of different technologies in a given workday, so I have a lot of questions to ask.

For my use cases, it’s rare that Cursor can do even a halfway decent code change, even in domains where there is a bunch of prior art (“convert this file from using alpine.js to htmx”). I know people who say they have generated thousands of LoC with LLMs that they actively use, but there the old adage comes in: “We can generate as much code as you want, if only all the code is allowed to be shit.”

The position below is one of the more charitable takes on how AI can help a programmer, and even that I don’t find particularly convincing.

https://www.geoffreylitt.com/2024/12/22/making-programming-more-fun-with-an-ai-generated-debugger.html

Attention Is All You Need

I thought I’d dive back into history and read the original paper that started it all. It’s somewhat technical about encoder/decoder layouts and matrix multiplications. None of the components are super exciting for somebody who’s been looking at neural networks for the past decade.
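For reference, the heart of the paper is a single equation, scaled dot-product attention (this one is the paper’s own):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the key dimension; the $\sqrt{d_k}$ scaling keeps the dot products from saturating the softmax.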

What’s exciting is that such a simplification generates results that are that much better, and how they came up with it. Unfortunately, they don’t write how they found this out.

The paper itself is a bit too abstract so I’m going to look for some of those YouTube videos that explain what is actually going on here and why it’s such a big deal. I’ll update this later.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

I came across this paper after the recent o3 high score on the ARC-AGI-PUB test. It’s a quick read and details how to scale LLMs at inference time by generating new states at every node, creating a tree on which to run DFS/BFS search algorithms.

A specific instantiation of ToT involves answering four questions: 1. How to decompose the intermediate process into thought steps; 2. How to generate potential thoughts from each state; 3. How to heuristically evaluate states; 4. What search algorithm to use.

For each of these steps they can deploy the LLM to generate the desired results, which, scaled over the search space, balloons the number of calls that need to be made (costing almost 200x the compute).
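A toy sketch of the breadth-first variant in Rust (my own pseudocode-with-types, with stubs where the paper makes LLM calls; none of these names come from the paper):

struct State {
    thoughts: Vec<String>, // the partial reasoning chain so far
}

// Stub for the LLM call that samples k candidate next thoughts.
fn propose(state: &State, k: usize) -> Vec<String> {
    (0..k).map(|i| format!("thought {i} at depth {}", state.thoughts.len())).collect()
}

// Stub for the LLM call that scores a partial solution.
fn evaluate(state: &State) -> f64 {
    state.thoughts.len() as f64
}

fn tot_bfs(root: State, depth: usize, k: usize, beam: usize) -> Vec<State> {
    let mut frontier = vec![root];
    for _ in 0..depth {
        // Expand every surviving state with k sampled thoughts...
        let mut candidates = Vec::new();
        for s in &frontier {
            for t in propose(s, k) {
                let mut thoughts = s.thoughts.clone();
                thoughts.push(t);
                candidates.push(State { thoughts });
            }
        }
        // ...then score each candidate (another LLM call per candidate)
        // and keep only the best `beam` of them.
        let mut scored: Vec<(f64, State)> =
            candidates.into_iter().map(|s| (evaluate(&s), s)).collect();
        scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        frontier = scored.into_iter().take(beam).map(|(_, s)| s).collect();
    }
    frontier
}

Every level costs roughly beam × k propose calls plus as many evaluate calls, which is where that compute multiplier comes from.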

This isn’t your normal LLM stochastic parrot anymore. We’ve gone one step up the abstraction chain: here we have a computer science algorithm running with LLM calls as its basic atoms.

December Adventure

So I felt I couldn’t really bring myself to do Advent of Code this year: I have more than enough other things to do (and watch and play), and with work and the kids it’s always pretty miserable to keep up.

I saw this thing called December Adventure, though, and that fits in nicely with my current push to release a major update for Cuppings. If I’m going to be programming until late this month, then I’d prefer it to be on something I can release.

I can’t promise that I won’t do any AoC (Factor is looking mighty cool) but I won’t force myself to do anything. With that, let’s get going.

1/12

I started working on the map view, which from clicking around looked like it could be really annoying. I hit some dead ends and was afraid I’d have to hack in Leaflet support myself, but then I found a Dioxus example hidden in the leaflet-rs repository.

Yes, I’m writing this website in Rust/WASM, why do you ask?

That example required a bunch of fiddling with the configuration and a couple of false starts, but now I have a vanilla map view.

I can say that I’m amazed that in this ecosystem 1. an example exists 2. that example works 3. it works in my project with a bit of diffing and 4. it seems to do what I need.

I raised a PR to the project to advertise this example in its README just like it does the others, so that others wouldn’t have to search like I did. That PR got merged:

https://github.com/slowtec/leaflet-rs/pull/36

2/12

Today I’ll see if I can tweak the map view to show the location of the cafe we tapped and get things to a point where I can commit the change.

To do this I need to figure out how to pass information along to a router when we tap a venue. That should be easy enough, but the Dioxus documentation is in between 0.5 and 0.6 right now and a lot of it is broken.

A tip from the Discord said I need to put the data into a context in a parent component and then get it out again in the child. It’s a bit roundabout and required some refactoring, but it works.
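For my own notes, the shape of that dance in Dioxus (a minimal sketch against the 0.6-era hooks API; the SelectedVenue type and the values are made up by me, not the actual Cuppings code):

use dioxus::prelude::*;

#[derive(Clone, Copy, PartialEq)]
struct SelectedVenue(i64);

#[component]
fn Parent() -> Element {
    // The parent provides the value into context...
    use_context_provider(|| Signal::new(SelectedVenue(176)));
    rsx! { Child {} }
}

#[component]
fn Child() -> Element {
    // ...and the child pulls it back out again.
    let venue = use_context::<Signal<SelectedVenue>>();
    let id = venue.read().0;
    rsx! { "venue id: {id}" }
}

fn main() {
    dioxus::launch(Parent);
}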

Even done in time for a reasonable bedtime.

3/12

Turns out my changes from yesterday did not make it to the staging server. I’ll fix that and manually run the job again.

That’s those annoying wasm-bindgen version errors that keep happening and that require a reinstall of this: cargo install -f wasm-bindgen-cli --version 0.2.97, plus the dioxus-cli. Dioxus, by the way, is preparing its long-awaited 0.6.0 release.

Yes, I build this on the same Hetzner box that hosts it. So here you go: https://staging.cuppin.gs

Other than that, not that much will happen today since I spent most of the evening noodling around with Factor (despite my intention not to do any weird programming). It’s a nice language that’s very similar to Uiua, which I tried out a while back, but not being an array programming language makes it feel somewhat more ergonomic.

4/12

I can’t describe how nice it is to wake up and not have to deal with a mediocre story line involving elves and try to find time to attack a programming problem.

After today, I’m going to need that quiet morning, because I spent until 01:30 debugging an issue: going to a detail view from the front page worked, but loading a detail view directly would throw an error.

There were two issues at play here:

Leaflet maps don’t deal well with being created multiple times, so either we have to call map.remove() or we have to check whether the map has already been created and keep a reference to it somehow.

I solved it by pushing the map into a global variable:

thread_local!(static MAP: RefCell<Option<Map>> = RefCell::new(None));

These are Rust constructs I would normally never use, so that’s interesting. More interesting is that they worked in one go, and that they work on the WASM target.

Then the error was gone, but the page was blank. Not entirely sure what was happening, I poked at the DOM and saw all the map elements there, just not visible. Turns out that because of the different path, the relative path for the stylesheet was being resolved against the URL like this: http://127.0.0.1:8080/venue/176/main.css

It just has these two lines:

#map {
    width: 100%;
    height: 100vh;
}

But without a height the map is invisible.

Both issues are solved but not committed. I’ll see tomorrow whether I’m happy with the solution and how to package it up. Also, I’m not sure how main.css is being served in production and whether the same fix will work there.

5/12

I couldn’t help but noodle on Advent of Code a bit. Here’s my day 1 part 1 in Factor: https://github.com/alper/advent-of-code/blob/main/2024/day-01/day-01.factor

I like Factor the programming language. It’s like Lisp or Haskell but without all the annoying bits.

The environment that’s provided with it I’m not so keen on. It’s annoying to use and has lots of weird conventions that aren’t very ergonomic.

6/12

I’ve been bad and I’ve finished part 2 of day 1 of the Advent of Code: https://github.com/alper/advent-of-code/blob/main/2024/day-01/day-01.factor#L27

Not so December Adventure after all, maybe. I promise I’ll finish the mapping improvements I was working on tomorrow.

7/12

Went on my weekly long bike ride. Then in the evening I didn’t have that much energy for programming other than finishing Advent of Code day 3 part 1: https://github.com/alper/advent-of-code/commit/0a74c38e7641141e10b4c48203c9e414cc492e1c

(I looked at day 2 part 2 but that just looked very tedious.)

8/12

Got in a ton of commits on Cuppin.gs today. After fixing the map, I wanted to see what would happen if I added all 2000 markers to the map.

Performance seems doable, but this is probably not ideal for a webpage. Dynamically rendering the venues is something for later. For now I can probably get away with filtering for the 100-200 nearest locations by distance and dumping those into the map view.

Now I’m back to debugging GitHub Actions. I’m splitting up the build and deploy of the backend and the frontend into separate actions. Compiling dioxus-cli takes forever, a step I hope I can skip with cargo-binstall.

Iterating on GitHub Actions takes forever, and there really doesn’t seem to be a better way to develop this, or a better CI solution that everybody is willing to use.

10/12

Spent some hours massaging the data that goes into the app. I had to add all the new venues, and after that I wanted to check whether any place in our 2k venue set had closed so we can take them off the display. This is a somewhat tedious multi-step process.

I have an admin binary that calls the Google Maps API for each venue to check the venue data and the business status (CLOSED_TEMPORARILY and such). But to be able to do that you have to feed each place ID into the API. The only issue with place IDs is that they expire from time to time. There’s a free API call that you can use to refresh them.

That expiration does not happen that often. What happens more, I found, is that a place will disappear entirely off Google Maps. For some reason it will be deleted. I don’t handle that case yet, so there my updaters break entirely, and the quickest fix is to delete the venue from the database and restart.

The only data issue I still have outstanding is venues moving to a different address. There’s a place around here that I think is still showing at its old spot.

11/12

Tried to run Cuppings in Xcode only to be met with some weird compilation errors. Turns out that there’s an Expression type in Foundation that’s shadowing my SQLite.swift Expression. It’s a pretty silly reason for code to be broken: Expression – name space conflict with Xcode 16/iOS 18

Also still fighting with the frontend deployments, which seem to need a --frozen passed to them so they don’t proactively go update package versions.

14/12

Love to have a crash on startup in the Cuppings TestFlight build, then sit down today to bake a new one, upload it, and have that one work. No clue what the issue was, even though I took a look at the crash log (that I sent in myself).

I’ve also automated building the iOS app with Xcode Cloud, which should make shipping new versions (whenever the database is updated) a lot easier.

16/12

Upgraded the frontend to Dioxus 0.6.0, which just came out and has lots of quality-of-life improvements. In my case, I did not need to change a single line of code, just bump some version numbers and build a new dioxus-cli.

Nice TUI for serving the frontend

I hope that maybe solves the wasm-bindgen issues on the frontend deploy. The annoying part about the build is that it takes so long that it’s very hard to iterate on.

It’s too late even for me to see what this does. I’m off to bed. You may or may not get a new version of the website by tomorrow morning.

18/12

Spent some iterations running the frontend deploy and rerunning it but now it should be working.

22/12

I spent the evening doing manual data munging and correcting some venue locations that hadn’t been updated correctly through my data life cycle.

That forced me to clarify the two name fields the venues table has.

  • name was the original name field and was pulled from the Foursquare metadata
  • google_name is the name field that’s pulled from Google Maps; it’s effectively the leading field but wasn’t yet being updated correctly when refreshing the data

So to figure that out I did a bunch of auditing in the list, looking for venues with a large discrepancy between the two names. Something that happens is that a place will change its name but keep the same location and Google Maps place.

I also added a label to the iOS app to indicate whether it’s a DEBUG build, but that messed up the layout, and I guess I might as well remove it. Sometimes I get confused about what I’m running, but since it’s just me running DEBUG builds on my phone, I think I can do without.

I also started a rewrite that I’m not sure I’m going to pull over the line: I wanted to remove the search dependency on Alpine.js and replace it with htmx. For this I asked Cursor to do the translation, which it took a stab at but ultimately failed to do even the basic steps of. Then I did it myself, and while htmx is super easy to set up, the data juggling I have to do with what I get from Google Maps is very fragile and needs to be cleaned up (which I may or may not do, given that things are working right now).

23/12

Working with the backend was very annoying because every time the server restarted, it would log me out. To fix that I changed the persistence of tower-sessions from MemoryStore to FileSessionStorage, and that fixed it without issues. There is now a .sessions folder in the backend which needs to be ignored for cargo watch, but other than that it’s a drop-in replacement.

That means I will need to write a logout view at some point.

Seeing if I can move from Arc to Vivaldi, but there are half a dozen radical improvements in Arc that *make* the experience. It just shows how much innovation and solid thinking was packed into all of that frivolous design.

Vivaldi on the other hand has a million settings, which mostly show that nobody knows what this app is supposed to be doing. There are entire note-taking apps and e-mail clients in there, but none of them fun or nice to use.

Products truly live and die in the pixels.

I painstakingly built a bespoke Rust web application to host the Cuppings venue data and to add Google place_ids to almost 2000 Foursquare locations. That’s been done for a while now, but now we have the announcement of Foursquare open-sourcing their location dataset.

That has two direct consequences for me:

  • I was going to scrub the Foursquare data out of the database as a clean-up, but that’s something I won’t do for now. In fact, I may recode the venues so I have IDs in both worlds.
  • I was toying around with the idea of building a next generation Foursquare/Dopplr on top of atproto which is something that I think is a lot more feasible now.

https://simonwillison.net/2024/Nov/20/foursquare-open-source-places/

Reading these database migration stories is usually interesting, but what I found especially noteworthy here is that all of the Django features they used to make it easy for themselves have been in there for more than a decade.

That’s the kind of maturity that maybe makes a technology less appealing for new developers to work with, but it is also the maturity that lets you get real work done.

https://engineeringblog.yelp.com/2024/10/migrating-from-postgres-to-mysql.html

I’ve been waiting for an updated edition of “Designing Data-Intensive Applications” and now I see that Kleppmann is working on it. Reading the first edition has given me such an outsized advantage when architecting and building systems.

https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058

MySQL encoding mistakes corrupting data in this decade?!?

Turns out I hadn’t noticed that my hosting provider Vimexx has their MySQL databases on latin1 encoding, and this blog was running on that, happily corrupting data.

Given how common an occurrence this is—MySQL very regularly will throw your shit into the street and set fire to it—I had expected there to be scripts or resources to fix this. Of course nothing was to be found anywhere.

I asked the Mastodon MySQL expert who did have a resource on the exact problem: https://blog.koehntopp.info/2020/08/18/mysql-character-sets.html

The way I fixed it was a bit more manual than I’d have liked but where I got is good enough and I’m not sure I’ll go for anything perfect:

Go to phpMyAdmin and audit all the database tables.

My tables are in a mix of InnoDB and MyISAM, which seems weird but not really problematic. I also had some Yoast tables lingering there, which I dropped.

Find the setting and convert all tables and their columns to the collation utf8mb4_unicode_ci. A collation implies the character set that is its prefix (utf8mb4), so you don’t have to change the character set separately.

Now all your stuff is in UTF-8, but because of the coding error a lot of your content is messed up. A Unicode character can be more than one byte, but in latin1 each character is exactly one byte. So if your Unicode character is two bytes, it gets interpreted as two latin1 characters, which is why you end up with stuff like “î“.
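A tiny Rust illustration of exactly this double-decoding (my own toy, not part of the actual fix):

fn main() {
    let original = "ü";                   // U+00FC, two bytes in UTF-8
    let utf8_bytes = original.as_bytes(); // [0xC3, 0xBC]

    // latin1 maps each byte straight to U+0000..U+00FF, so decoding the two
    // UTF-8 bytes as latin1 produces two separate characters:
    let mojibake: String = utf8_bytes.iter().map(|&b| b as char).collect();
    println!("{mojibake}"); // prints "ü"
}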

Maybe there would have been a clean automatic way to convert the data, but I felt it was fiddly enough as it was, so I opted for a manual fix. I identified where the corruption happened:

  • wp_posts columns post_content and post_title
  • wp_comments column comment_content
  • wp_usermeta column meta_value

Then I just ran queries to fix all the mismatches:

ü → ü
Ãœ → Ü
é → é
É → É
ÄŸ → ğ
Ç → Ç
etc.

Luckily, in almost all cases the wrongly coded string is unique and can simply be replaced with the right character.

Check if a string is in the column:
SELECT post_content from wp_posts where post_content LIKE BINARY '%Ç%' and post_status='publish'

Later on check for specific characters and their environment in what can be very long post bodies:
SELECT SUBSTRING(post_content, LOCATE('Ã', post_content)-15, 40), post_content from wp_posts where post_content LIKE BINARY '%Ã%' and post_status='publish'

Replace the wrong string sequence with the correct character:
UPDATE wp_posts SET post_content = REPLACE(post_content, 'Ç', 'Ç') WHERE INSTR(post_content, 'Ç') > 0

After some hours of auditing and pounding SQL most of the things should be fixed and whatever’s left I can live with.

Conclusion

The moral of this story is that the entire complex of WordPress/PHP/MySQL is a pile of shit that should be burnt off the face of the planet. The fact that we can have these kinds of encoding issues in the year 2024 shows what an absolute joke these systems are. Especially with the Mullenweg meltdowns, anybody who can get out of WordPress should do so.

This blog hasn’t received a comment or other bit of interactivity in years, so I think I could also rip all the content out (effectively just two columns in wp_posts) and host it on something statically built. No reason to pay for a shit hosting provider like Vimexx anymore either.

Thoroughness unpacked in three dimensions like this by James Stanier is so good and gives a much better way to think and talk about issues of velocity:

Scope is what you’re building.
Scalability is how well it will work as you grow.
Sustainability is how well it will work over time.

https://theengineeringmanager.substack.com/p/scope-hmm

As somebody who has worked in platform for the past years, I’ve become very familiar with the different dimensions of this debate around productivity, and John here unpacks the topic in a way that’s really useful. I used the nails analogy just yesterday.

Even more than the nuanced understanding of why developer productivity is so challenging to improve, the last bit of the piece is on the money, because it’s what drives decision making in most companies (tech companies are no exception):

“Can you imagine how hard it would be to walk into a meeting with investors, whoever, and say, ‘um, you thought you had a 30mpg car, and it is a 15mpg car?’”

https://cutlefish.substack.com/p/tbm-304-losing-a-day-a-week-to-inefficiencies

Always nice to be able to write up what the team has been doing and share it with the world.

Kubernetes is a very maligned technology but if properly managed it can be part of an entirely boring infrastructure portfolio. Realistically it’s not doing that much more than running docker on a bunch of machines and pulling images. React has a similarly bad reputation which is not stopping lots of developers from getting tons of work done with it.

https://choco.com/us/stories/life-at-choco/journey-to-kubernetes

A piece about real-world Rust development that struck a chord with many people. Most of the issues listed here are valid and longstanding, to the point that you have to wonder if they’ll ever be fixed.

I have a similar feeling about Rust web development: for all the good building blocks, it doesn’t really seem to get off the ground. At the same time, Go has been going really hard for ages. Maybe spending all your time getting the types to line up doesn’t leave room for building?

https://loglog.games/blog/leaving-rust-gamedev/

That popular open source package managers will at some point all get owned is so inevitable that it’s hardly worth mentioning.

CocoaPods in this case is a bit of an outlier because the entire setup has been so broken to begin with. iOS development never really allowed for dependency management, so CocoaPods did it in a very hacky way, and it was written in Ruby, a relatively niche language that would have no chance of being blessed by Apple and shouldn’t be used for anything serious to begin with. (Don’t even get me started on Carthage.)

Swift Package Manager was released years ago, but lots of projects of course never manage to switch. I believe the best thing a project can do in such a situation is to terminate itself for the greater good.

https://www.theregister.com/2024/07/02/cocoapods_vulns_supply_chain_potential/

More people have mentioned it, and I think it should be part of every Rust tutorial: encourage people to just clone() whenever they get in a jam and get their stuff done: “keep calm, clone and move on”. I think that one thing would make it possible to onboard any team onto Rust quickly and get them shipping.

Performance will still be better than in most other languages, and you can optimize the clones away after you’ve got things working.
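A toy example of the kind of jam I mean (mine, not from the linked post): the borrow checker won’t let you mutate a Vec while iterating over it, but a clone() gets you moving again.

fn main() {
    let mut names = vec!["ada".to_string(), "grace".to_string()];

    // Won't compile: `names` is borrowed by the loop while we push into it.
    // for n in &names { names.push(n.to_uppercase()); }

    // Clone and move on; optimize later if profiling ever tells you to.
    for n in names.clone() {
        names.push(n.to_uppercase());
    }

    println!("{names:?}"); // ["ada", "grace", "ADA", "GRACE"]
}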

https://blog.sdf.com/p/fast-development-in-rust-part-one

The Netherlands is facing similar problems: depressed salaries, a lack of housing and rampant overt racism are making it difficult to attract digital talent from all over the world.

You know, countries could have promoted STEM education decades ago, but given the state of things, nothing is getting done in technology without people from outside of Europe. Let’s see whether we make the smart choice this time round, or whether we’ll see countries ‘cutting off their nose to spite their face’, as the saying goes.

https://www.golem.de/news/tech-standort-ostdeutschland-als-waere-das-image-nicht-schon-schlecht-genug-2403-182921.html

Amos’s style of software engineering historiography, accompanied by snide commentary on the state of the art, is both educational and entertaining. The weird factoids about GitHub Actions are the main act here, but don’t miss out on the introduction on software delivery or the lead-out on capitalism.

(Also I’m in the credits on this one!)

I haven’t tried it out yet but seeing the collaboration features in Zed described here, that sounds pretty much like my ideal workflow.

Chat channels including voice and screen sharing integrated directly into a lightning-fast editor, enabling seamless collaboration and visibility into who is working on what together. Unscheduled calls instead of endless calendar invites that don’t fit the shape of the work anyway.

https://registerspill.thorstenball.com/p/the-lightness-of-unscheduled-calls

This piece about moving away from CDK is a bit overly dramatic and comes down to “CloudFormation sucks”, which is something anybody who’s worked with it can testify to.

That said if you’re committed to AWS as your cloud provider, CDK is an amazing piece of technology that bridges the worlds of infrastructure operations and programming.

If the concept of bridged Terraform providers for Pulumi proves itself, that would of course be great, but I’d say it’s still pretty uncertain.

https://sst.dev/blog/moving-away-from-cdk.html

It’s much healthier for Germany if digital issues have an answer that goes beyond “Let’s see what the CCC has to say!” The CCC is a shady organization that is good at taking things apart but does not have much constructive to offer.

A broader social discussion would reveal that security and privacy are not the only two dimensions on which digital solutions can or should be measured.

https://www.golem.de/news/37c3-der-hackerkongress-fast-unter-ausschluss-der-oeffentlichkeit-2401-180837.html

Late to the party, but I very much love this interview with Karri Saarinen, the co-founder of Linear. Their way of working, “The Linear Method”, will be waved away by companies (“we can’t do that because…”), but with leadership with the right mentality and experience I don’t think it’s that far off at all. Ask your leadership how you can work like this.

Also I already know I’m going to use the term “side quest” a lot.

We don’t use Linear but we recently moved all our stuff from Jira to Github Projects which—even though it is mostly abandoned—is Linear-enough.

Most importantly, it is right on top of our codebase which is where I believe all engineering work should happen anyway.

This article is a wild premise, a wild ride and a wild conclusion (also I’m increasingly warming to the idea of htmx).

“Every cloud-pilled, react-vue-braindead, click-to-deploy developer actually thinks web views require 7 minutes to “compile for production,” then when live require 5-15 second “skeleton loaders” on entry is just a fact of life nobody can question or ever improve on modern 5 GHz machines with 5 Gbps network connections. Developers, at the median, have been getting less capable and more focused on made up silo/cult/trendy dead-end fads for 10 years and the entire world suffers daily.”

https://matt.sh/htmx-is-a-erlang

Notion has formulas now (!), and here’s a formula to calculate a Cost of Delay column based on two other columns:

if(Value == "Killer" && Urgency == "ASAP", "1 Very High",
  if((Value == "Killer" && Urgency == "Soon") || (Value == "Bonus" && Urgency == "ASAP"), "2 High",
    if((Value == "Killer" && Urgency == "Whenever") || (Value == "Bonus" && Urgency == "Soon") || (Value == "Meh" && Urgency == "ASAP"), "3 Medium",
      if((Value == "Bonus" && Urgency == "Whenever") || (Value == "Meh" && Urgency == "Soon"), "4 Low",
        "5 Very Low"))))

“Squashing destroys this information. I’ll take a merge with 1000 +50/-50 commits over 1 squash every. single. day.”

I’ve been hammering on this as well: it’s silly to use git and then throw away so much information that you could use later. But then again, most people don’t know git bisect exists.

I get asked pretty regularly what my opinion is on merge commits vs rebasing vs squashing. I’ve typed up this response so many times that I’ve decided to just put it in a gist so I can reference it whenever it comes up again.

I use merge, squash, rebase all situationally. I believe they all have their merits but their usage depends on the context. I think anyone who says any particular strategy is the right answer 100% of the time is wrong, but I think there is considerable acceptable leeway in when you use each. What follows is my personal and professional opinion:

I prefer merge and creating a merge commit because I think it best represents true history. You can see the merge point, you can see all the WIP commits the developer went through. You can revert the whole merge easily (git revert -mN <merge commit>). I create merge commits more than 9 out of every 10 PRs.

I also believe having more commits makes git bisect better, as long as every commit builds. I hate hate hate when I bisect a project only to land on a single squashed commit from a single PR that is like +2000/-500. That is… not helpful at all. I want to bisect and land on a commit that’s at worst like +500/-500. At worst. Ideally I land on a commit that’s more like +50/-50. Then I can say “ah hah, the bug is there.” Squashing destroys this information. I’ll take a merge with 1000 +50/-50 commits over 1 squash every. single. day.

This strategy depends on good hygiene by the developer keeping every commit building. I follow this rule 99% of the time (I make mistakes, but I try very hard not to). In OSS, you can’t really control this and I’ll sometimes end up fixing up commits for people (using interactive rebase prior to making a merge commit). In a professional environment when I was an engineering leader, I would generally expect engineers I worked with to keep every commit buildable.

I do squash though when a PR has a bajillion tiny “WIP” “WIP” “WIP” commits but is really aiming towards one goal with a relatively small diff. That’s my squash use case. I’m careful when squashing to rewrite the commit message so it is descriptive. The default squash commit message created by Git and GitHub is not good (it just concatenates all the squashed commit messages, usually a series of “WIP”).

If you have a big diff AND a lot of “WIP”, then I rebase (interactively), and selectively squash and reorder commits where it makes sense. I tend to expect developers to do this and care about their commit hygiene, but unfortunately a lot of developers aren’t that comfortable with Git. In the OSS world, I do it for them. When I was an engineering manager back in the day, I’d expect engineers I worked with to have this knowledge.

On this last point, I also tend to use a Git GUI client for large interactive rebases. I’m extremely comfortable with the Git CLI but when I’m interactively rebasing a very large PR (say, 50+ commits) with a large number of changed lines, I find using a GUI to be helpful. I’m on macOS so I use Tower. This is the only situation I actually use a GUI, though.