We can probably look forward to seeing more large American companies destabilize under a regime of financialization, deregulation and competitive pressure.
https://www.cnbc.com/2025/01/23/boeing-details-losses-from-labor-strike-production-issues.html
A pretty good explanation of all the weird acronyms that make modern e-mail infrastructure absolutely inscrutable.
That’s not really different from deep dives into DNS, TLS, CORS etc. The things we use to build the internet have become (out of necessity) incredibly specialized and complicated.
https://www.mythic-beasts.com/blog/2025/01/29/the-death-of-email-forwarding
I find that limited-run podcast series diving deep into a specific topic are some of the best things the medium brings forth.
Here’s one from Escape Collective that covers the intersection of a bunch of special interests: bike manufacturing, the pandemic, supply chains and how it all went to hell.
I bought a road bike myself at the end of the pandemic, and the supply chain shortages meant I’m on a Cube CX bike, which is fine but definitely not what I would have gotten if there had been a wide range of bikes to choose from.
https://escapecollective.com/how-did-the-bike-industry-get-into-such-deep-trouble
A roller coaster of an interview with a series of FizzBuzz extensions that are honestly not even that bad, but poor form from the interviewers’ side not to recognize the absolute balling brilliance on display here.
I can just about follow along with this level of TypeScript type-level programming, but being able to whip this out during an interview is a testament to mastery.
https://kranga.notion.site/The-fizzbuzz-that-did-not-get-me-the-job-180e7c22ef3b80c3a386f7f8de720ac7
Sam Altman and OpenAI are consistently unreliable. One more example that by this point should not surprise anybody.
Let’s hope for a quick ending to Twitter.
https://www.theverge.com/2025/1/24/24351317/elon-musk-x-twitter-bank-debt-stagnant-growth
Do not fall into the trap of anthropomorphising Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn, you stick your hand in there and it’ll chop it off, the end. You don’t think ‘oh, the lawnmower hates me’ – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.
Trashfuture were really cooking when they taped “The Sulla of Suburbia” with Patrick Wyman.
They want Sulla:
They have a hierarchical view of the world:
November saying that they really want Hitler:
The op-ed departments love this stuff:
I think this is a well argued plea by Ken Shirriff to stop using the term “cargo cult” and I agree with it but would like to add two things.
With a non-English speaking audience that does not have the same priors, nobody will have an idea what you are talking about if you use the term “cargo cult”. You’ll be stuck explaining the term in a ham-fisted way that will fail to convey the huge amount of history and social science involved.
One problem with rejecting the term is that it lets software engineers off the hook and allows them to pretend the way they work is different from the tribal inhabitants of Pacific islands. I argue that most software engineering practice is based on folklore and is deeply tribalistic.
https://www.righto.com/2025/01/its-time-to-abandon-cargo-cult-metaphor.html
No port/adapter terms to learn, no unnecessary layers of horizontal abstractions, no extraneous cognitive load.
Reducing cognitive load is a continuous battle. The entropy of a software system always points towards more complexity.
People add a lot of this stuff out of either inexperience or a need to look smart. Simple code that gets the job done is often looked down upon.
What’s happening in Flint is a good example of how the social effects of institutional lapses are much more difficult to fix than just replacing a bunch of pipes. We’re seeing the same effects happening around COVID and its after effects and we’ll see many more examples of government distrust and chaotic confusion in the coming decades.
For those who want to introduce some whimsy into their programming and for whom using a variable-width font in your code editor is a bit too far, there is now Comic Mono (via). It doesn’t even look all that terrible.
(After using Iosevka and Inconsolata for a long time, I’m now, as are many people, a happy JetBrains Mono user.)
I would have been surprised if Devin had performed even 1/10th as well as it was hyped. This is a good clean write-up.
Social media excitement and company valuations have minimal relationship to real-world utility. We’ve found the most reliable signal comes from detailed stories of users shipping products and services.
I love Github Projects for tracking work. It’s close to the code and engineers understand it natively. I feel you can deliver features of any size with it if you work with the tool.
The only thing that’s a bit annoying is the lack of improvements from Github. There’s a bunch of quality of life features I’m used to from other tools that would really make a difference. But now with LLMs we don’t have to settle.
I asked Cursor to write me a user script that adds a “Create Follow-up Task” button (which I used a lot on Asana) to issues on Github. It did a reasonable enough job that I could tweak it and then have something working for me. I could have written this myself, of course, but the hurdle of figuring out the format and the wiring felt like a blocker.
https://github.com/alper/user-scripts/blob/main/github-followup-issue/github-followup.user.js
As if it were possible for me to love htmx more, they post a governance statement that the project is stable and there will be no more new features.
This is so much better than the endless revamps and scope creep that other projects suffer from.
I’ll have to agree here that for normal engineers it’s better to not really engage with Glue work.
Just drop it. If it’s important, the org will figure out a way to pick it up. That’s why you are part of an org in the first place.
Imagine shipping an app that is so bad that the CEO has to step down. Sonos has made it happen.
https://www.theverge.com/2025/1/13/24342179/sonos-ceo-patrick-spence-resignation-reason-app
Running a LLM on a Nintendo Switch is a marvellous little hack and it’s also a testament to the strength of NVIDIA’s platform.
I think Facebook rolled back their block of Pixelfed but they’re right to be spooked. Showing a bunch of pictures in a stream hardly seems like a technological challenge. And what are all the people working at Instagram doing other than figuring out novel ways to track you and serve you ads?
You should definitely try out Pixelfed which is more than usable.
A lack of knowledge about queueing theory and a dash of wishful thinking lead many developers into the trap of believing that, as long as you add bigger queues, you can wiggle your way out of any scaling problem.
You can’t.
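A toy back-of-the-envelope shows why: when arrivals outpace service, the backlog grows without bound no matter how big the queue is. A sketch with made-up rates:

```python
# Toy discrete-time simulation: 10 requests/sec arrive, 8/sec get served.
# No queue size fixes this; the backlog grows by 2 every second.
arrival_rate = 10
service_rate = 8
queue = 0
for second in range(60):
    queue += arrival_rate
    queue -= min(queue, service_rate)
print(queue)  # 120 requests backed up after one minute
```

A bigger queue only changes when you fall over, not whether.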
Brazil still remembers that it’s a state with state power and has found a delightful habit of pouncing on social media sites. Other states should follow suit.
A more than accurate description of my entire LinkedIn experience.
https://matduggan.com/stop-trying-to-schedule-a-call-with-me
The two entities arguably most responsible for keeping Germany in the digital dark ages, the CSU and Telekom, have found each other.
The Agents chapter from Chip Huyen’s book “AI Engineering” is clear and enjoyable to read. She’s right that “the concept of an agent is fairly simple” but building something functional still looks like a massive lift.
Maps for where you’ve been in Europe and the US (via). Stayed means having spent a night there which means the only places I’ve visited and not spent the night are Slovenia (went over the border to hike a mountain there) and the Vatican.
I should probably upgrade Turkey to lived, depending on your definition: staying in your own house or being registered as a resident.
My score for the US is negligible and I don’t see this changing any time soon (maybe ever).
A live coding environment to create 3D graphics using signed distance functions written in the Janet programming language. Click through and edit some of the embedded examples to get a feel for how amazing this is.
A thought-provoking article about how to counteract car bloat. I would add that driving an SUV is even worse than smoking because it mostly harms others, not the person doing the driving.
In Germany, many transactions require a proof of address, which a Personalausweis provides for German citizens. We foreigners don’t get one, however often we ask for it. In the Netherlands, moving through the country without a battery of chip cards (OV-chipkaart, Bonuskaart, OV-fiets etc.), apps and associated services is costly and annoying.
The signs have been there for a while, but China seems to be pushing this much further along. The question is whether it’s a deliberate move or whether the number of people affected is so small that they’re a negligible edge case for the policy makers over there.
It’s rare to find writing in German as lithe and delightful as what Christoph Rauscher puts out. The monthly lists are one particularly good example. I’m learning new and interesting words still in most of his pieces.
I totally agree that “Writing = Design” and you should hire him for Design/Writing/Illustration: https://christophrauscher.de/writing/
A teardown of how women still get erased from narratives, like Toshi here in the new Bob Dylan movie.
https://merrillmarkoe.substack.com/p/a-complete-unknown-the-ballad-of
I think Home Row Mods for your keyboard are far too complicated to use, but I’m glad these kinds of comprehensive guides exist.
The lamb ad does a good job showing the madness that is online comments sections. Also it made me want to eat a nice piece of lamb.
I’ve had to spend more time than I like thinking about how datetimes are stored in databases and even the commonly accepted practice of storing UTC does not work for all cases.
Specifically, when you store something that will happen in the future, you need to store the location of the event as well. Otherwise any daylight savings change will shift your event around. This applies not just to single events but also to, say, order cut-off times, which aren’t pinned to a single date.
Let’s see how this develops but good to see a positive healthcare story for AIs.
A very useful thought experiment whenever anybody tries to pretend LLMs are ‘human’ because they sound human.
[…] the Rands Leadership Slack is the most impactful thing I’ve built outside my family and job.
I can testify to the Rands Leadership Slack being an impactful thing. I joined it a long time ago and more or less everything I know about engineering leadership I’ve learned there. I’m eternally grateful for all the hard work the people there put into making it a nice place to be.
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models
Medical QA depends heavily on domain-specific knowledge that is not always available within pre-trained models, necessitating knowledge-based retrieval from external sources.
In addition, medical knowledge evolves rapidly, and new treatments or updated guidelines may not be included in the model’s pre-trained corpus.
The question example for the reasoning process in Figure 1 is a multiple-choice question. That seems overly simple.
In parallel, Commonsense Question Answering shares similar complexities with Medical QA, particularly in its reliance on structured multi-step reasoning and iterative evidence retrieval.
rStar (Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers)
The rStar approach seems worth diving into. That will be the paper I read next.
Monte Carlo Tree Search
enabling the open source LLMs (LLAMA3.1) to achieve competitive performance with top closed-source LLMs like GPT-4 and GPT-4o.
We’ll come to this later in the paper. Their conclusion is that they can trick out LLAMA to get similar performance to GPT-4 in these domains.
Upper Confidence Bound applied on trees (UCT)
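For reference, the UCT score trades off exploitation against exploration when picking which node to expand; a minimal sketch (the constant c is a typical choice, not necessarily the paper’s):

```python
import math

def uct(wins, visits, parent_visits, c=1.41):
    """UCT score: average reward (exploitation) plus an exploration
    bonus that shrinks as a node gets visited more often."""
    if visits == 0:
        return math.inf  # unvisited nodes get explored first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

The search repeatedly descends to the child with the highest UCT score, so rarely visited branches still get a look-in.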
In contrast, rStar incorporates five distinct actions that enable more adaptive exploration:
A1: Propose a One-Step Thought. This action generates the next reasoning step based on previous steps, allowing the LLM to build the solution incrementally.
A2: Propose Remaining Thought Steps. This action enables the LLM to produce all remaining reasoning steps in one inference, similar to CoT, for simpler questions.
A3: Generate Next Sub-question and Answer. This action decomposes the main problem into a sequence of sub-questions, each solved in turn.
A4: Re-answer Sub-question. This action allows the LLM to re-answer a previously generated sub-question, increasing accuracy by using few-shot prompting.
A5: Rephrase Question/Sub-question. This action rephrases the question to clarify conditions and reduce misunderstandings, enhancing the LLM’s interpretation of the problem.
I need to trace the rStar algorithm after reading the original paper. The explanation here is too short.
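The five rStar actions could be sketched as a simple action space for the tree search; the identifier names below are my own shorthand, not from the paper:

```python
from enum import Enum, auto

class RStarAction(Enum):
    PROPOSE_ONE_STEP = auto()      # A1: add the next reasoning step
    PROPOSE_REMAINING = auto()     # A2: finish all remaining steps at once (CoT-like)
    NEXT_SUBQUESTION = auto()      # A3: decompose into sub-questions
    REANSWER_SUBQUESTION = auto()  # A4: retry a sub-question with few-shot prompting
    REPHRASE_QUESTION = auto()     # A5: restate the question to clarify conditions

print(len(RStarAction))  # 5
```

Each node expansion in the MCTS then samples one of these instead of just “generate next token of reasoning”, which is where the extra adaptivity comes from.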
These queries target information that can either support or refute the content of each statement, ensuring comprehensive factual verification.
How does this approach deal with (non-)negation, which LLMs often have a lot of trouble with? From a language perspective it could just as easily say I can or can’t eat grapefruit (iykyk) based on the temperature that day, but especially in a medical context these kinds of errors can be catastrophic.
RARE achieves substantial gains, outperforming rStar by 5.17% on MedQA, 2.19% on MedMCQA and 2.39% on MMLU-Medical.
Even if these numbers are statistically significant (which they don’t say), these increases are really modest. I would not call this in any way “substantial”.
Looking at Table 1, RARE is as much of an increase over rStar as rStar is over the next best approach so from that perspective maybe you could call it significant. The difference between worst and best framework here is around 10% across CoT, RAG, SC, rStar, RARE.
evaluated on StrategyQA (SQA), CommonsenseQA (CQA), Social IQA (SIQA) and Physical IQA (PIQA)
The main question I have is from what percentage accuracy such a system would be reasonably possible to use in a real world context. Even at 90-95% that would seem like it would be too low to rely on when the stakes are high.
By enhancing LLMs with retrieval-augmented reasoning, RARE bridges the gap between open source models and state-of-the-art proprietary systems.
The framework has only been tested on open source models like LLaMA 3.1 and not on larger proprietary models such as GPT-4. This is due to the high number of API calls required by RARE’s iterative retrieval and reasoning process, making evaluation on closed source models prohibitively costly.
So here they repeat the statement that they’ve bridged the gap but they say they haven’t used this approach with a model like GPT-4 because the number of API calls would make it too expensive.
That leaves open that these kinds of many-call approaches are available to OpenAI, because they can make that number of calls much more affordably in-house. No real gap has been closed here, and it shows again how big an advantage OpenAI has.
That raises the question: What makes GPT-4 so good? Why does it perform so much better than open source models?
RARE is designed to identify a single reasoning trajectory that leads to a correct answer but does not necessarily optimise for the best or shortest path that maximises robustness.
Any integration into medical workflows must be supervised by qualified practitioners to ensure patient safety and ethical use.
It’s been a bit of a grab bag year but overall not as bad as 2023 and a bunch of things seem to be on track.
I got on the neurodiversity bandwagon this year.
First I got myself a self-paid diagnosis for ADHD. This result should not surprise anybody who knows me. I’ve forced myself to be very high-functioning throughout my life, but it can’t be denied that there were always some underlying issues. I’ve been on medication since the end of the year and have gone off caffeine.
I also got myself tested for giftedness and got a positive result there as well.
Both of these results were validating if nothing else and put a lot of things that happened in my life in a different perspective.
For anybody who’s not sure whether they should pursue this, my recommendation would be: You will only know how differently you can feel if you do.
I got a mole cut out of my skin. It’s a nice scar to have.
I’m fully vaxxed against FSME and got a booster for COVID in November. That brings me to six jabs in total.
It could have been a great year for sports. After having a great time on our yearly trip to the Alps, I came back to Berlin and badly sprained my ankle falling down some stairs. I didn’t need any surgery, thankfully, but it did set me back some eight weeks of physical therapy and building back up to walking.
That notwithstanding, I managed to participate in three road cycling group rides this year. MAAP opening up a store here and organising open weekly rides has been really cool. The cycling and the coffee were lit. 🔥
I cycled up the Brocken for my first ever mountain and clocked 4201km in 2024 on Strava.
It’s my goal to weigh 75kgs and I’m still as far away from that as I ever was.
Letterboxd does a good job tracking this, and it was a pretty good year for movies. I review all of them over there in detail, but I can say the non-Potter kids’ movies we watched were nice and the Japanese cinema on the whole was excellent. I saw Evil Does Not Exist twice, the second time in the local theatre, live-scored by its composer Eiko Ishibashi.
Trakt is doing a great job keeping track of which episodes of which television series I need to watch. It’s the only way I can possibly stay on top of this.
Looks like I’m turning into a weeb just like everybody else in the culture. I watch anime in part as light entertainment and in part as Japanese immersion. It’s very hard to find anime that has any kind of thematic depth. Frieren comes closest because of how it twists the standard fantasy trope into a story about loss and reminiscence.
It was a fair though not great year for reading.
I’m continuing my trend of reading one Knausgård and one Mantel book each year. No reason not to do that again this year.
I picked up some poetry at Perdu during my visit to Amsterdam and have been enjoying reading that.
Every time I see Maria Stuart (which I got put on to by Past Present Future’s fantastic Great Political Fictions series) in the list, I think: “I need to read more Schiller.” but then I keep forgetting to get the files off Gutenberg. Germans sure knew how to write back in the day.
Besides the trip to the Alps, I went to the Netherlands once in 2024 for Kars’s viva and we took a trip to idyllic Hiddensee after my foot was healed. Much more travel is slated for next year!
I don’t go to exhibitions for lack of time. Besides seeing Evil Does Not Exist in the theatre I managed to burn a ticket to the opera and one to a dance show due to conflicting commitments and forgetfulness. I’m not sure whether I’m going to retry this.
I took the kids to see Ronja at an open air show which was fun.
I was a member of the Greens but I cancelled that because even if they’re the least bad political party in Germany, they have been doing a lot of things that I do not wish to support from the inside. I wrote about that here.
I continued to learn and maintain my Japanese level in preparation for my trip in 2025.
I learned a bunch around Kubernetes and Kafka but would have liked to do more programming. I refreshed my algorithms a bit and picked up Factor to play with.
The shell I use daily (because it’s the best, really), fish, has been rewritten entirely in Rust, because it’s nice and more fun: “For one, fish is a hobby project, and that means we want it to be fun for us. Nobody is being paid to work on fish, so we need it to be fun. Being fun and interesting also attracts contributors.”
I can testify to this because when most of the code was rewritten I checked it out, built it and poked around a bunch to see how it works. I don’t think I would have done that or enjoyed doing it if it had been a C++ codebase. That was also when I was confronted with the fact that what makes a terminal really complicated is not the language in which it is programmed, but the underlying system that it is an interface to.
The story of the port and its success is legendary as far as these things go.
There was a brief period where Foursquare based recommendations were good and drawn from your wider social graph. Now we’ve gone back to Yelp and Google Maps where reviews and ratings don’t mean anything. A lower than 4 star review on GMaps has netted me a cease-and-desist e-mail for defamation.
That puts personally curated travel docs and word of mouth back in play, as Thrillist describes here. Every Dutch person has, or knows somebody who has, a Berlin Google Doc with all the Geheimtipps. Dutch people’s tastes are fairly predictable and pedestrian, so these’ll mostly be cheap Asian eateries in Prenzlauer Berg, but that’s also fine.
For me the most interesting recommendations for Berlin but also for other cities come through TikTok. The algorithm is well tuned to my type of person and in the short videos it’s pretty easy to size up whether somebody knows what they’re talking about or not.
https://www.thrillist.com/travel/nation/google-docs-are-the-ideal-travel-guides
As a parent and as a social media user, I don’t buy that something that’s harmful to adults is not EVEN MORE harmful to children.
The platforms need to be curtailed and this entire situation has to be shut down as soon as possible. We can keep our kids off smartphones, but what about others?
Musk’s attack on Wikipedia is another step in getting rid of information sources where they can’t control the narrative and the “truth”. Everything they’re doing is built on lies.
https://www.citationneeded.news/elon-musk-and-the-rights-war-on-wikipedia
Hans de Zwart’s end-of-the-year media overviews are one of the highlights of what still happens on personal blogs for me. He’s a voracious reader and one of the rare people who acts on his moral clarity. Also, Hans is a great guy and I had the chance to briefly catch up with him last year.
I’ll see if I can pull something together, but definitely go through his list. I always pick up more than a couple of interesting things to explore.
The o3 AGI result looked so noteworthy that I dove into it. I read one of the papers that’s at the base of the approach and thought it was pretty interesting.
Turns out that it was mostly bullshit and everybody was doing another round of “let’s pretend that AGI is real”. What a shambles.
Trust Laura Olin, nobody has to stay on Twitter. It’s a bad place that’s only getting worse.
An overview of the year in databases by Andy Pavlo that does not pull any punches. I learned a bunch of things (and I somewhat keep up with this area).
https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html
I think we’ve for sure built an Erlang, but then again Erlang is such an esoteric environment that I would barely consider using it an alternative. Where would you start and how would you get other people onboarded?
A cautionary tale of how you can try to avoid Kubernetes only to build the same thing yourself, step by step, but poorly.
https://www.macchaffee.com/blog/2024/you-have-built-a-kubernetes
Another take on the old adage that writing the code is the easy part of software engineering. The real work is figuring out what has to be built and how. Once that is clear, the actual building can be done relatively quickly and linearly.
I think the notion of a dead program is useful though it’s not always that clear cut:
The death of a program happens when the programmer team possessing its theory is dissolved. A dead program may continue to be used for execution in a computer and to produce useful results.
https://olano.dev/blog/software-design-is-knowledge-building
The MSF year in pictures is an unsparing overview of the circumstances that the rest of the world has to live in. Grim.
A paper where they fine-tune an LLM to be able to answer some questions itself and figure out for which questions it needs to use a specialized tool. Intelligent tool usage seems like it would expand the use cases for LLM-driven systems much more than any kind of scaling (real or imagined).
However, scholars note that their abilities are capped at approximately high-school levels
That seems like a noteworthy statement, especially if you are looking to LLMs to provide “novel thinking”. It seems more likely that high-school problems are abundantly available and relatively trivial, so they receive specific focus.
For numerical answers in the MATH and SciBench datasets, we consider answers correct if they fall within a small tolerance range of the true value, specifically within ±5%.
I don’t really see why you could not get exact answers in a mathematical domain.
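For what it’s worth, the check they describe amounts to a relative comparison, something like this sketch (the function name is my own):

```python
import math

def within_tolerance(predicted, true_value, rel_tol=0.05):
    # Accept answers within ±5% of the true value, as in the paper's setup.
    return math.isclose(predicted, true_value, rel_tol=rel_tol)

print(within_tolerance(104.9, 100))  # True
print(within_tolerance(106.0, 100))  # False
```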
This performance gap on public benchmarks is likely due to the larger parameter count and specific optimization of state-of-the-art models on the open-source datasets.
Same as with the high school questions. These datasets are easily available and draw attention so the models overfit on them.
The model Ours-Pn demonstrates performance comparable to Base-Pf, both showing a significant improvement over the base model. This similarity indicates successful internalization of distilled knowledge from tools. The transition from Ours-Pn to Ours-Pi showcases further improvement in answer accuracy, resulting from the model’s enhanced ability to intelligently switch to tools for harder questions.
This is the core proposition of the paper. Looking at Table 1 with the accuracy percentages there is something of an improvement but it does not really look dramatic or so convincing that you could use these systems in any critical context.
We’re looking at increases of 10-20% and an accuracy that’s still well under 90% (which I’m also not convinced would be usable).
We introduced a novel two-component fine-tuning approach to enhance Large Language Models (LLMs) in solving scientific problems of varying complexity.
One of the key issues I have with the paper is how much work the term “scientific problems” is doing. If this is published, people are going to think the LLM is solving actual novel issues, when in this case it’s just filling in relatively basic question/answer pairs that are well understood. Calling them problems is problematic.
The most interesting part of the paper is the appendix, where you can see the actual questions and answers in the various datasets and the prompts they used (with example responses). The answers are mostly multiple choice, which already influences how many of them you should expect to be correct.