I find that limited-run podcast series diving deep into a specific topic are some of the best things the medium brings forth.

Here’s one from Escape Collective that covers the intersection of a bunch of special interests: bike manufacturing, the pandemic, supply chains and how it all went to hell.

I bought a road bike myself at the end of the pandemic, and the supply chain shortages meant I ended up on a Cube CX bike, which is fine but definitely not what I would have gotten if there had been a wide range of bikes to choose from.

https://escapecollective.com/how-did-the-bike-industry-get-into-such-deep-trouble

A roller coaster of an interview with a series of FizzBuzz extensions that are honestly not even that bad, but poor form from the interviewers’ side not to recognize the absolute balling brilliance on display here.

I can just about follow along with this level of TypeScript type-level programming, but to be able to whip this out during an interview is a testament to mastery.
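For a flavour of what type-level programming looks like, here is a toy of my own making (far simpler than the interview solution): divisibility by three computed entirely by the compiler, counting with tuples.

```typescript
// Compute N mod 3 in the type system: T counts up to N while C cycles 0,1,2.
type Mod3<N extends number, T extends unknown[] = [], C extends unknown[] = []> =
  T["length"] extends N
    ? C["length"]
    : Mod3<N, [...T, unknown], C["length"] extends 2 ? [] : [...C, unknown]>;

// "Fizz" if N is divisible by 3, otherwise N itself.
type Fizz<N extends number> = Mod3<N> extends 0 ? "Fizz" : N;

// Checked entirely at compile time:
const a: Fizz<9> = "Fizz";
const b: Fizz<7> = 7;
```

The interview version goes much further than this, but the counting-with-tuples trick is the basic building block.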

https://kranga.notion.site/The-fizzbuzz-that-did-not-get-me-the-job-180e7c22ef3b80c3a386f7f8de720ac7

Do not fall into the trap of anthropomorphising Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn, you stick your hand in there and it’ll chop it off, the end. You don’t think ‘oh, the lawnmower hates me’ – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.

I think this is a well-argued plea by Ken Shirriff to stop using the term “cargo cult”, and I agree with it, but I would like to add two things.

First, with a non-English-speaking audience that does not share the same priors, nobody will have any idea what you are talking about if you use the term “cargo cult”. You’ll be stuck explaining it in a ham-fisted way that fails to convey the huge amount of history and social science involved.

Second, one problem with rejecting the term is that it lets software engineers off the hook and allows them to pretend the way they work is different from that of the tribal inhabitants of Pacific islands. I would argue that most software engineering practice is based on folklore and is deeply tribalistic.

https://www.righto.com/2025/01/its-time-to-abandon-cargo-cult-metaphor.html

No port/adapter terms to learn, no unnecessary layers of horizontal abstractions, no extraneous cognitive load.

Reducing cognitive load is a continuous battle; the entropy of a software system always pulls towards more complexity.

People add a lot of this stuff out of either inexperience or a need to look smart. Simple code that gets the job done is often looked down upon.
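A toy contrast of my own (not from the article) of what this looks like in practice:

```typescript
// The "smart" version: an interface, an implementation and a service layer
// for what is a single function.
interface GreetingProvider {
  provide(name: string): string;
}

class EnglishGreetingProvider implements GreetingProvider {
  provide(name: string): string {
    return `Hello, ${name}`;
  }
}

class GreetingService {
  constructor(private provider: GreetingProvider) {}
  greet(name: string): string {
    return this.provider.provide(name);
  }
}

// The simple code that gets the job done:
const greet = (name: string) => `Hello, ${name}`;
```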

https://minds.md/zakirullin/cognitive

What’s happening in Flint is a good example of how the social effects of institutional lapses are much more difficult to fix than just replacing a bunch of pipes. We’re seeing the same dynamic around COVID and its aftermath, and we’ll see many more examples of government distrust and chaotic confusion in the coming decades.

https://www.politico.com/news/magazine/2020/12/23/flint-water-crisis-2020-post-coronavirus-america-445459

For those who want to introduce some whimsy into their programming, and for whom using a variable-width font in their code editor is a step too far, there is now Comic Mono (via). It doesn’t even look all that terrible.

(After using Iosevka and Inconsolata for a long time, I’m now, like many people, a happy JetBrains Mono user.)

https://dtinth.github.io/comic-mono-font

I love GitHub Projects for tracking work. It’s close to the code and engineers understand it natively. I feel you can deliver features of any size with it if you work with the tool.

The only thing that’s a bit annoying is the lack of improvements from GitHub. There are a bunch of quality-of-life features I’m used to from other tools that would really make a difference. But now, with LLMs, we don’t have to settle.

I asked Cursor to write me a user script that adds a “Create Follow-up Task” button (something I used a lot in Asana) to issues on GitHub. It did a reasonable enough job that I could tweak the result and end up with something working for me. I could have written this myself, of course, but the hurdle of figuring out the format and the wiring felt like a blocker.
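The skeleton of such a script looks roughly like this (a hedged sketch of my own, not the linked code; the GitHub class names are assumptions and may well change):

```typescript
// ==UserScript==
// @name   GitHub "Create Follow-up Task" button
// @match  https://github.com/*/*/issues/*
// ==/UserScript==

(function () {
  // The header action area on an issue page (selector is an assumption).
  const header = document.querySelector(".gh-header-actions");
  if (!header) return;

  const button = document.createElement("button");
  button.textContent = "Create Follow-up Task";
  button.className = "btn btn-sm";
  button.addEventListener("click", () => {
    const title =
      document.querySelector(".js-issue-title")?.textContent?.trim() ?? "";
    const [, owner, repo] = window.location.pathname.split("/");
    // GitHub pre-fills the new-issue form from query parameters.
    const params = new URLSearchParams({
      title: `Follow-up: ${title}`,
      body: `Follow-up to ${window.location.href}`,
    });
    window.open(`https://github.com/${owner}/${repo}/issues/new?${params}`);
  });
  header.appendChild(button);
})();
```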

https://github.com/alper/user-scripts/blob/main/github-followup-issue/github-followup.user.js

I think Facebook rolled back their block of Pixelfed but they’re right to be spooked. Showing a bunch of pictures in a stream hardly seems like a technological challenge. And what are all the people working at Instagram doing other than figuring out novel ways to track you and serve you ads?

You should definitely try out Pixelfed, which is more than usable.

https://www.heise.de/en/news/Unprecedented-growth-Facebook-blocks-links-to-Instagram-alternative-Pixelfed-10237928.html

Maps for where you’ve been in Europe and the US (via). Stayed means having spent a night there, which means the only places I’ve visited without spending the night are Slovenia (I went over the border to hike a mountain there) and the Vatican.

I should probably upgrade Turkey to lived, depending on the definition: stayed in your own house, or registered as a resident.

My score for the US is negligible and I don’t see this changing any time soon (maybe ever).

In Germany, many transactions require proof of address, which a Personalausweis (the German identity card) provides for German citizens. We foreigners don’t get one, however often we ask for it. In the Netherlands, moving through the country without a battery of chip cards (OV-chipkaart, Bonuskaart, OV-fiets etc.), apps and associated services is costly and annoying.

The signs have been there for a while, but China seems to be pushing this much further along. The question is whether it’s a deliberate move or whether the number of people affected is so small that they’re a negligible edge case for the policymakers over there.

https://substack.com/home/post/p-136339096

It’s rare to find writing in German as lithe and delightful as what Christoph Rauscher puts out. The monthly lists are one particularly good example. I still learn new and interesting words from most of his pieces.

I totally agree that “Writing = Design” and you should hire him for Design/Writing/Illustration: https://christophrauscher.de/writing/

I’ve had to spend more time than I’d like thinking about how datetimes are stored in databases, and even the commonly accepted practice of storing UTC does not work for all cases.

Specifically, when you store something that will happen in the future, you need to store the location of the event as well. Otherwise any daylight saving time change will shift your event around. This applies not just to single events but also to, say, order cut-off times that aren’t pinned to a single date.
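A minimal demonstration of the failure mode (runs anywhere with Intl support):

```typescript
// The same UTC instant lands on different Berlin wall-clock times depending
// on DST, so a single stored UTC value cannot represent "09:00 in Berlin"
// for a future date if the rules shift in the meantime.
const fmt = new Intl.DateTimeFormat("de-DE", {
  timeZone: "Europe/Berlin",
  hour: "2-digit",
  minute: "2-digit",
});

console.log(fmt.format(new Date("2025-01-15T08:00:00Z"))); // 09:00 (UTC+1)
console.log(fmt.format(new Date("2025-07-15T08:00:00Z"))); // 10:00 (UTC+2)

// For future events, store the wall-clock time plus the IANA zone and
// resolve to a concrete instant per occurrence, as late as possible:
interface FutureEvent {
  localTime: string; // e.g. "09:00"
  zone: string;      // e.g. "Europe/Berlin"
}
```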

A very useful thought experiment whenever anybody tries to pretend LLMs are ‘human’ because they sound human.

Here's why "alignment research" when it comes to LLMs is a big mess, as I see it.Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".

Colin (@colin-fraser.net) 2024-12-19T23:15:38.459Z

[…] the Rands Leadership Slack is the most impactful thing I’ve built outside my family and job.

I can testify to the Rands Leadership Slack being an impactful thing. I joined it a long time ago, and more or less everything I know about engineering leadership I learned there. I’m eternally grateful for all the hard work the people there put into making it a nice place to be.

https://randsinrepose.com/archives/just-hard-work

RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models

Medical QA depends heavily on domain-specific knowledge that is not always available within pre-trained models, necessitating knowledge-based retrieval from external sources.

In addition, medical knowledge evolves rapidly, and new treatments or updated guidelines may not be included in the model’s pre-trained corpus.

The example question for the reasoning process in Figure 1 is a multiple-choice question. That seems overly simple.

In parallel, Commonsense Question Answering shares similar complexities with Medical QA, particularly in its reliance on structured multi-step reasoning and iterative evidence retrieval.

rStar (Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers)

The rStar approach seems worth diving into. That will be the paper I read next.

Monte Carlo Tree Search

enabling the open source LLMs (LLAMA3.1) to achieve competitive performance with top closed-source LLMs like GPT-4 and GPT-4o.

We’ll come to this later in the paper. Their conclusion is that they can trick out LLaMA to get similar performance to GPT-4 in these domains.

Upper Confidence Bound applied on trees (UCT)
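For reference, the standard UCT rule from the MCTS literature (presumably this or a close variant is what’s used here) selects the child that maximises

$$\mathrm{UCT}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}$$

where $Q(s,a)$ is the average reward of action $a$ in state $s$, $N(s)$ and $N(s,a)$ are visit counts, and $c$ trades off exploration against exploitation.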

In contrast, rStar incorporates five distinct actions that enable more adaptive exploration:

A1: Propose a One-Step Thought. This action generates the next reasoning step based on previous steps, allowing the LLM to build the solution incrementally.

A2: Propose Remaining Thought Steps. This action enables the LLM to produce all remaining reasoning steps in one inference, similar to CoT, for simpler questions.

A3: Generate Next Sub-question and Answer. This action decomposes the main problem into a sequence of sub-questions, each solved in turn.

A4: Re-answer Sub-question. This action allows the LLM to re-answer a previously generated sub-question, increasing accuracy by using few-shot prompting.

A5: Rephrase Question/Sub-question. This action rephrases the question to clarify conditions and reduce misunderstandings, enhancing the LLM’s interpretation of the problem.
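To keep the action space straight, here is my own shorthand for it (the type names are mine, not the paper’s):

```typescript
// The five rStar actions an MCTS expansion step can choose from.
type RStarAction =
  | { kind: "A1_ProposeOneStepThought" }     // one incremental reasoning step
  | { kind: "A2_ProposeRemainingThoughts" }  // finish the chain in one go, CoT-style
  | { kind: "A3_NextSubquestionAndAnswer" }  // decompose into sub-questions
  | { kind: "A4_ReanswerSubquestion" }       // retry a sub-question with few-shot prompting
  | { kind: "A5_RephraseQuestion" };         // clarify conditions by rephrasing
```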

I need to trace the rStar algorithm after reading the original paper. The explanation here is too short.

These queries target information that can either support or refute the content of each statement, ensuring comprehensive factual verification.

How does this approach deal with (non-)negation, which LLMs often have a lot of trouble with? From a language perspective it could just as easily say I can or can’t eat grapefruit (iykyk) based on the temperature that day, but especially in a medical context these kinds of errors can be catastrophic.

RARE achieves substantial gains, outperforming rStar by 5.17% on MedQA, 2.19% on MedMCQA and 2.39% on MMLU-Medical.

Even if these numbers are statistically significant (which they don’t say), these increases are really modest. I would not call this in any way “substantial”.

Looking at Table 1, RARE is as much of an increase over rStar as rStar is over the next-best approach, so from that perspective maybe you could call it significant. The difference between the worst and best framework here is around 10% across CoT, RAG, SC, rStar and RARE.

evaluated on StrategyQA (SQA), CommonsenseQA (CQA), Social IQA (SIQA) and Physical IQA (PIQA)

The main question I have is at what accuracy such a system would become reasonable to use in a real-world context. Even 90-95% would seem too low to rely on when the stakes are high.

By enhancing LLMs with retrieval-augmented reasoning, RARE bridges the gap between open source models and state-of-the-art proprietary systems.

The framework has only been tested on open source models like LLaMA 3.1 and not on larger proprietary models such as GPT-4. This is due to the high number of API calls required by RARE’s iterative retrieval and reasoning process, making evaluation on closed source models prohibitively costly.

So here they repeat the statement that they’ve bridged the gap, but they say they haven’t used this approach with a model like GPT-4 because the number of API calls would make it too expensive.

That leaves on the table that these kinds of many-call approaches are open to OpenAI, because they can run these numbers of calls much more affordably in-house. No real gap has been closed here, and it shows again how big an advantage OpenAI has.

That raises the question: What makes GPT-4 so good? Why does it perform so much better than open source models?

RARE is designed to identify a single reasoning trajectory that leads to a correct answer but does not necessarily optimise for the best or shortest path that maximises robustness.

Any integration into medical workflows must be supervised by qualified practitioners to ensure patient safety and ethical use.

Year in Review 2024

It’s been a bit of a grab bag year but overall not as bad as 2023 and a bunch of things seem to be on track.

Health

I got on the neurodiversity bandwagon this year.

First I got myself a self-paid diagnosis for ADHD. The result should not surprise anybody who knows me. I’ve forced myself to be very high-functioning throughout my life, but it can’t be denied that there were always some underlying issues. I’ve been on medication since the end of the year and have gone off caffeine.

I also got myself tested for giftedness and got a positive result there as well.

Both of these results were validating if nothing else and put a lot of things that happened in my life in a different perspective.

For anybody who’s not sure whether they should pursue this, my recommendation would be: You will only know how differently you can feel if you do.

I got a mole cut out of my skin. It’s a nice scar to have.

I’m fully vaxxed against FSME (tick-borne encephalitis) and got a COVID booster in November. That brings me to six jabs in total.

Sports and Injuries

It could have been a great year for sports. After having a great time on our yearly trip to the Alps, I came back to Berlin and badly sprained my ankle falling down some stairs. I didn’t need surgery, thankfully, but it did set me back some eight weeks of physical therapy and building back up to walking.

That notwithstanding, I managed to participate in three road cycling group rides this year. MAAP opening up a store here and organising open weekly rides has been really cool. The cycling and the coffee were lit. 🔥

I cycled up the Brocken, my first-ever mountain, and clocked 4201 km in 2024 on Strava.

It’s my goal to weigh 75 kg and I’m still as far away from that as I ever was.

Movies

Letterboxd does a good job tracking this, and it was a pretty good year for movies. I review all of them over there in detail, but I can say the non-Potter kids’ movies we watched were nice and the Japanese cinema was on the whole excellent. I saw Evil Does Not Exist twice, the second time in the local theatre live-scored by its composer Eiko Ishibashi.

  • Harry Potter and the Philosopher’s Stone
  • Dune: Part Two
  • Curious Tobi and the Treasure Hunt to the Flying Rivers
  • Glass Onion
  • Frozen
  • Tangled
  • Raya and the Last Dragon
  • Shoplifters
  • Luca
  • Harry Potter and the Chamber of Secrets
  • Yojimbo
  • Drive My Car
  • Perfect Days
  • John Wick: Chapter 4
  • Evil Does Not Exist
  • How to Blow Up a Pipeline
  • Harakiri
  • Evil Does Not Exist
  • Die Hard

Television

Trakt is doing a great job keeping track of which episodes of which television series I need to watch. It’s the only way I can possibly stay on top of this.

  • The Last of Us
  • Spy x Family S2
  • Death Note
  • Frieren
  • Tour de France: Unchained S2
  • Vigil
  • The Peripheral
  • Kaiju No 8
  • Bluey
  • Arcane S2

Looks like I’m turning into a weeb just like everybody else in the culture. I watch anime partly as light entertainment and partly for Japanese immersion. It’s very hard to find anime with any kind of thematic depth. Frieren comes closest because of how it twists the standard fantasy trope into a story about loss and reminiscence.

Books

It was a fair though not great year for reading.

  • Sheaf Theory through Examples, Daniel Rosiak
  • Bring Up the Bodies, Hilary Mantel
  • Min kamp 2, Karl Ove Knausgård
  • Maria Stuart, Friedrich Schiller
  • Arkada Yaylılar Çalıyor, Melikşah Altuntaş
  • My Tender Matador, Pedro Lemebel
  • Kafka Connect: Build and Run Data Pipelines, Mickael Maison
  • Let Us Believe in the Beginning of the Cold Season, Forugh Farrokhzad
  • Discipline and Punish: The Birth of the Prison, Michel Foucault
  • The Kubernetes Book: 2024 Edition, Nigel Poulton
  • Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-premises, Elad Eldor
  • Conversational Capacity: The Secret to Building Successful Teams That Perform When the Pressure Is on, Craig Weber

I’m continuing my trend of reading one Knausgård and one Mantel book each year. No reason not to do that again this year.

I picked up some poetry at Perdu during my visit to Amsterdam and have been enjoying reading that.

Every time I see Maria Stuart (which I was put on to by Past Present Future’s fantastic Great Political Fictions series) in the list, I think, “I need to read more Schiller,” but then I keep forgetting to get the files off Gutenberg. Germans sure knew how to write back in the day.

Trips

Besides the trip to the Alps, I went to the Netherlands once in 2024 for Kars’s viva and we took a trip to idyllic Hiddensee after my foot was healed. Much more travel is slated for next year!

Other Culture

I don’t go to exhibitions for lack of time. Besides seeing Evil Does Not Exist in the theatre I managed to burn a ticket to the opera and one to a dance show due to conflicting commitments and forgetfulness. I’m not sure whether I’m going to retry this.

I took the kids to see Ronja at an open air show which was fun.

Miscellaneous

I was a member of the Greens but I cancelled that because even if they’re the least bad political party in Germany, they have been doing a lot of things that I do not wish to support from the inside. I wrote about that here.

I continued to learn and maintain my Japanese level in preparation for my trip in 2025.

I learned a bunch around Kubernetes and Kafka but would have liked to do more programming. I refreshed my algorithms a bit and picked up Factor to play with.

The shell I use daily (because it’s the best, really), fish, has been rewritten entirely in Rust, because it’s nice and more fun: “For one, fish is a hobby project, and that means we want it to be fun for us. Nobody is being paid to work on fish, so we need it to be fun. Being fun and interesting also attracts contributors.”

I can testify to this because when most of the code was rewritten I checked it out, built it and poked around a bunch to see how it works. I don’t think I would have done that, or enjoyed doing it, if it had been a C++ codebase. That was also when I was confronted with the fact that what makes a shell really complicated is not the language in which it is programmed, but the underlying system that it is an interface to.

The story of the port and its success is legendary as far as these things go.

https://fishshell.com/blog/rustport

There was a brief period when Foursquare-based recommendations were good and drawn from your wider social graph. Now we’ve gone back to Yelp and Google Maps, where reviews and ratings don’t mean anything. A lower-than-four-star review on Google Maps has netted me a cease-and-desist e-mail for defamation.

That puts personally curated travel docs and word of mouth back in play, as Thrillist describes here. Every Dutch person has, or knows somebody who has, a Berlin Google Doc with all the Geheimtipps (insider tips). Dutch people’s tastes are fairly predictable and pedestrian, so these’ll mostly be cheap Asian eateries in Prenzlauer Berg, but that’s also fine.

For me the most interesting recommendations for Berlin but also for other cities come through TikTok. The algorithm is well tuned to my type of person and in the short videos it’s pretty easy to size up whether somebody knows what they’re talking about or not.

https://www.thrillist.com/travel/nation/google-docs-are-the-ideal-travel-guides

Hans de Zwart’s end-of-the-year media overviews are one of the highlights of what still happens on personal blogs for me. He’s a voracious reader and one of the rare people who act on their moral clarity. Also, Hans is a great guy and I had the chance to briefly catch up with him last year.

I’ll see if I can pull something together, but definitely go through his list. I always pick up more than a couple of interesting things to explore.

Another take on the old adage that writing the code is the easy part of software engineering. The real work is figuring out what has to be built and how. Once that is clear, the actual building can be done relatively quickly and linearly.

I think the notion of a dead program is useful, though it’s not always that clear cut:

The death of a program happens when the programmer team possessing its theory is dissolved. A dead program may continue to be used for execution in a computer and to produce useful results.

https://olano.dev/blog/software-design-is-knowledge-building

Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

A paper where they fine-tune an LLM to answer some questions itself and to figure out for which questions it needs a specialized tool. Intelligent tool usage seems like it would expand the use cases for LLM-driven systems much more than any kind of scaling (real or imagined).

However, scholars note that their abilities are capped at approximately high-school levels

That seems like a noteworthy statement, especially if you are looking to LLMs for “novel thinking”. It seems much more likely that high-school problems are abundantly available and relatively trivial, so they receive specific focus.

For numerical answers in the MATH and SciBench datasets, we consider answers correct if they fall within a small tolerance range of the true value, specifically within ±5%.
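In code form, my reading of that grading rule:

```typescript
// A numeric answer counts as correct when it lies within ±5% of the
// reference value (my sketch of the rule as stated).
function isCorrectNumeric(pred: number, truth: number, tol = 0.05): boolean {
  return Math.abs(pred - truth) <= tol * Math.abs(truth);
}
```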

I don’t really see why you could not get exact answers in a mathematical domain.

This performance gap on public benchmarks is likely due to the larger parameter count and specific optimization of state-of-the-art models on the open-source datasets.

Same as with the high-school questions: these datasets are easily available and draw attention, so the models overfit on them.

The model Ours-Pn demonstrates performance comparable to Base-Pf, both showing a significant improvement over the base model. This similarity indicates successful internalization of distilled knowledge from tools. The transition from Ours-Pn to Ours-Pi showcases further improvement in answer accuracy, resulting from the model’s enhanced ability to intelligently switch to tools for harder questions.

This is the core proposition of the paper. Looking at Table 1 with the accuracy percentages, there is something of an improvement, but it does not look dramatic or convincing enough that you could use these systems in any critical context.

We’re looking at increases of 10-20% and an accuracy that’s still well under 90% (which I’m also not convinced would be usable).
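As I understand it, the behaviour they fine-tune into the model amounts to something like this (all names are hypothetical, and in the paper the policy lives inside the model rather than in outer code):

```typescript
type Answer = { value: string; usedTool: boolean };

// Placeholder stubs standing in for model calls and a domain tool.
const modelJudgesAnswerable = async (q: string) => q.length < 80;
const modelAnswersDirectly = async (q: string) => `direct answer to: ${q}`;
const callSpecializedTool = async (q: string) => `tool output for: ${q}`;
const modelAnswersWithTool = async (q: string, t: string) =>
  `answer to "${q}" using ${t}`;

async function answer(question: string): Promise<Answer> {
  // First the model assesses whether its internalized (distilled)
  // knowledge suffices.
  if (await modelJudgesAnswerable(question)) {
    return { value: await modelAnswersDirectly(question), usedTool: false };
  }
  // Otherwise it emits a tool call and conditions the answer on the output.
  const toolResult = await callSpecializedTool(question);
  return {
    value: await modelAnswersWithTool(question, toolResult),
    usedTool: true,
  };
}
```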

We introduced a novel two-component fine-tuning approach to enhance Large Language Models (LLMs) in solving scientific problems of varying complexity.

One of the key issues I have with the paper is how much work the term “scientific problems” is doing. If this gets published, people are going to think the LLM is solving actual novel issues, when in this case it’s just filling in relatively basic question/answer pairs that are well understood. Calling them problems is problematic.

The most interesting part of the paper is the appendix, where you can see the actual questions and answers in the various datasets and the prompts they used (with example responses). The answers are mostly multiple choice, which already influences how many of them you should expect to be correct by chance.