LLMs are fundamentally matching the patterns they've seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extend their ability to do more sophisticated pattern matching. The mathematical results imply that you can always find compositional tasks whose complexity lies beyond a given system's abilities.
LLMs are still very useful in a bunch of domains, but here's an article explaining (based on a paper with a novel bound on computational complexity) why improvements in reasoning seem to have run out of steam.
It’s a shame these people are so thoroughly psychopathic (from this epic Guardian profile) because this piece about pronatalism is nothing if not well argued.
This was exactly the discussion I had yesterday: when teams reach higher levels of maturity, they may catch and fix issues internally before they become full-blown incidents.
So how do we then make sure we are not over-indexing purely on the operational surprises where we ran the heavy incident machinery, and how do we get the right learnings and improvements disseminated throughout the organisation?
Even the most basic functions in foundational libraries can be iterated on and improved by reading papers and implementing novel algorithmic approaches. Here’s a breakdown of how the time crate got faster (even though “no one has complained”).
Good luck hiring talent for your struggling economy with racism becoming even more rampant than it normally is. Not that that is the only problem facing Germany, but it’s a huge distraction for all parties while nothing material has improved for decades.
I enjoy James Stanier’s newsletter on engineering management a lot. This one is about information processing in organizations, something which will make or break you.
From my observation, the correlation between high functioning people in engineering leadership and usage of Obsidian is not 1.0 but it is very high. I use a fairly tricked out setup myself, but there’s value to be had here at all levels.
It's an interesting oversight that the most important step of the OODA loop sketched out in the article, "Orientation", is omitted. Orientation is elusive and hard to pin down, which is exactly why it is so key.
SQLite is a tiny, somewhat quirky but incredibly convenient and fast database. For side projects I definitely don’t bother with a “real” database anymore and for many of them I just check the entire .sqlite file into version control for easier deployments.
Here are some tuning tips to get really blazing fast performance out of it.
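To give a flavour of what's in there, this is roughly the set of pragmas I reach for first in my own projects (my defaults, not necessarily the article's), run once per connection right after opening it; the exact driver call varies:

```rust
// My usual SQLite tuning block (assumptions from my own side projects,
// the linked article goes deeper). Execute these once per connection.
const SQLITE_TUNING: &str = "
    PRAGMA journal_mode = WAL;      -- readers no longer block the writer
    PRAGMA synchronous = NORMAL;    -- plenty safe in WAL mode, much faster
    PRAGMA cache_size = -64000;     -- negative means KiB, so ~64 MB of page cache
    PRAGMA temp_store = MEMORY;     -- keep temp tables and indices in RAM
    PRAGMA busy_timeout = 5000;     -- wait up to 5s instead of failing on a lock
";
```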
After missing the previous edition, I’m ‘attending’ HYTRADBOI 2025. Databases are the bread and butter of our team and I enjoy keeping up with the cutting edge of the field while using the most boring possible one at work.
Imagine being this rich and having this much free time and using all of those capabilities to drive yourself mad with ignorance. Nothing bad has happened to these people yet they are convinced the entire world is out to get them.
As if it wasn’t obvious that the New York Times is a pile of crap, here’s them getting rid of Paul Krugman for views that aren’t Trump positive enough.
This is a very good framework from Matt Webb for how organizations can do strategic pathfinding when it comes to AI. In Germany especially, lots of orgs would benefit from doing this instead of whatever it is that they are busy with right now.
Interesting to see the word overhang used here (originally by Nat Friedman). I normally use it when talking about tech "debt", but a capability overhang is of course also possible.
I don’t think there’s been a Python developer tool that’s been adopted as quickly as uv is being right now. Dealing with dependencies and running Python code has never been easier.
Quick turnarounds on fixing vulnerabilities usually correlated with general engineering operational excellence. The best cases were clients who asked us to just give them a constant feed of anything we found, and they’d fix it right away.
Lots of interesting tidbits here from dozens of startup code audits.
[…] the abandonment of responsibility in two dimensions. Firstly, and following on from what was already happening in ‘big data’, the world stopped caring about where AI got its data — fitting in nicely with ‘surveillance capitalism’. And secondly, contrary to what professional organisations like BCS and ACM had been preaching for years, the outcomes of AI algorithms were no longer viewed as the responsibility of their designers — or anybody, really.
That’s a fairer and more informed take than most. AI systems can be very useful in limited contexts but you wouldn’t want one to decide anything material about your life.
Not politicians, just career bureaucrats deep in the system. I ask them what their favorite part of the job is. They all say “stability” or “job security” as their #1. It takes 18 months to get the city to permit your shed? They. Do. Not. Care.
All of these examples are things I struggle with also on a daily basis. I live in Berlin where nobody gives a shit about anything. The place is a dump and everybody pretends that it can’t be any other way.
It’s not hard to see why people don’t care. Most people barely have the capacity to get through the day. They don’t have it in them to care about something more. The way to get people to care more is to force them with social pressure.
Artemis, fostered with Apollo, virgin who delights in arrows, far-shooting goddess, who swiftly drives her all-golden chariot through Smyrna to vine-clad Claros, I ask that you establish a loop counting from ninety-nine to one called beerLoop.
A programming language where you make things happen by invoking the gods. Nice to see a classics upbringing be of some professional use.
Blamed on everything being difficult or complicated
With a tendency to find artificial prerequisite activities that sound plausible, but on further examination aren’t.
Things are as complex as we want to make them. Most of the time complexity is an excuse for either not wanting to do something or not knowing how to do something.
I agree with this with two additions:
Most of the time things are not at all complex, they are complicated. Those are two very different things.
Complications can be a symptom of avoidance but maybe oftentimes the work is complicated. If all the work was simple, any idiot could do it. Appreciate the real complexity of the work while you are in the act of cutting through it.
We can probably look forward to seeing more large American companies destabilize under a regime of financialization, deregulation and competitive pressure.
A pretty good explanation of all the weird acronyms that make modern e-mail infrastructure absolutely inscrutable.
That’s not really different from deep dives into DNS, TLS, CORS etc. The things we use to build the internet have become (out of necessity) incredibly specialized and complicated.
I find that limited-run podcast series that dive deep into a specific topic are some of the best things the medium brings forth.
Here’s one from Escape Collective that covers the intersection of a bunch of special interests: bike manufacturing, the pandemic, supply chains and how it all went to hell.
I bought a road bike myself at the end of the pandemic and the supply chain shortages made it so that I’m on a Cube CX bike which is fine but definitely not what I would have gotten if there had been a wide range of bikes to choose from.
A roller coaster of an interview with a series of FizzBuzz extensions that are honestly not even that bad, but poor form from the interviewers' side not to recognize the absolute balling brilliance on display here.
I can just about follow along with this level of TypeScript type-level programming, but to be able to whip this out during an interview is a testament to mastery.
Do not fall into the trap of anthropomorphising Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn, you stick your hand in there and it’ll chop it off, the end. You don’t think ‘oh, the lawnmower hates me’ – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.
Trashfuture were really cooking when they taped “The Sulla of Suburbia” with Patrick Wyman.
They want Sulla:
They have a hierarchical view of the world:
November saying that they really want Hitler:
The op-ed departments love this stuff:
I think this is a well argued plea by Ken Shirriff to stop using the term “cargo cult” and I agree with it but would like to add two things.
With a non-English speaking audience that does not have the same priors, nobody will have an idea what you are talking about if you use the term “cargo cult”. You’ll be stuck explaining the term in a ham-fisted way that will fail to convey the huge amount of history and social science involved.
One problem with rejecting the term is that it lets software engineers off the hook and allows them to pretend the way they work is different from that of the tribal inhabitants of Pacific islands. I argue that most software engineering practice is based on folklore and is deeply tribalistic.
No port/adapter terms to learn, no unnecessary layers of horizontal abstractions, no extraneous cognitive load.
Reducing cognitive load is a continuous battle. The entropy direction of a software system is always towards more complexity.
People add a lot of this stuff out of either inexperience or because they need to look smart. The simple code that gets the job done is often looked down upon.
What’s happening in Flint is a good example of how the social effects of institutional lapses are much more difficult to fix than just replacing a bunch of pipes. We’re seeing the same effects happening around COVID and its after effects and we’ll see many more examples of government distrust and chaotic confusion in the coming decades.
For those who want to introduce some whimsy into their programming and for whom using a variable width font in your code editor is a bit too far, there is now Comic Mono (via). It doesn’t even look all too terrible.
(After using Iosevka and Inconsolata for a long time, I'm now, as are many people, a happy JetBrains Mono user.)
I would have been surprised if Devin had performed even 1/10th as well as it was hyped. This is a good clean write-up.
Social media excitement and company valuations have minimal relationship to real-world utility. We’ve found the most reliable signal comes from detailed stories of users shipping products and services.
I love Github Projects for tracking work. It’s close to the code and engineers understand it natively. I feel you can deliver features of any size with it if you work with the tool.
The only thing that’s a bit annoying is the lack of improvements from Github. There’s a bunch of quality of life features I’m used to from other tools that would really make a difference. But now with LLMs we don’t have to settle.
I asked Cursor to write me a user script that adds a “Create Follow-up Task” button (that I used a lot on Asana) to issues on Github. It did a reasonable enough job that I could tweak and then have something working for me. I could write this myself of course but the hurdle of figuring out the format and the wiring felt like a blocker.
I think Facebook rolled back their block of Pixelfed but they’re right to be spooked. Showing a bunch of pictures in a stream hardly seems like a technological challenge. And what are all the people working at Instagram doing other than figuring out novel ways to track you and serve you ads?
You should definitely try out Pixelfed which is more than usable.
A lack of knowledge about queueing theory and a dash of wishful thinking lead many developers into a common trap: believing that as long as you add bigger queues, you can wiggle your way out of any scaling problem.
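A back-of-the-envelope illustration (my own numbers, not from the linked piece): in a simple M/M/1 model the average backlog depends only on utilisation, so a bigger buffer doesn't buy you capacity, it just lets the backlog and the latency grow further before requests get dropped.

```rust
// Mean number of requests in an M/M/1 system: rho / (1 - rho), where
// rho = arrival rate / service rate. Note that buffer size appears nowhere.
fn mean_in_system(arrival_rate: f64, service_rate: f64) -> f64 {
    let rho = arrival_rate / service_rate;
    assert!(rho < 1.0, "at rho >= 1 the backlog grows without bound");
    rho / (1.0 - rho)
}

fn main() {
    let service_rate = 100.0; // what the server can actually handle per second
    for arrival_rate in [50.0, 90.0, 99.0] {
        println!(
            "offered load {arrival_rate}/s -> ~{:.0} requests sitting in the system",
            mean_in_system(arrival_rate, service_rate)
        );
    }
}
```

Once the arrival rate exceeds the service rate, no queue is big enough; the only real fixes are more capacity or load shedding.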
Brazil still remembers that it’s a state with state power and has found a delightful habit of pouncing on social media sites. Other states should follow suit.
The Agents chapter from Chip Huyen’s book “AI Engineering” is clear and enjoyable to read. She’s right that “the concept of an agent is fairly simple” but building something functional still looks like a massive lift.
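To illustrate how simple the concept is (and how much is left unsaid), a bare agent loop is little more than the sketch below; the names and types are mine, not the book's, and all the hard parts hide inside the two stubs.

```rust
// Alternate model calls with tool calls until the model says it is done.
enum ModelStep {
    UseTool { name: String, input: String },
    FinalAnswer(String),
}

// Placeholder for an actual LLM call that returns structured output.
fn call_model(_history: &[String]) -> ModelStep {
    ModelStep::FinalAnswer("stub".into())
}

// Placeholder for web search, a calculator, a SQL query, etc.
fn run_tool(name: &str, input: &str) -> String {
    format!("result of {name}({input})")
}

fn agent(task: &str) -> String {
    let mut history = vec![task.to_string()];
    loop {
        match call_model(&history) {
            ModelStep::UseTool { name, input } => history.push(run_tool(&name, &input)),
            ModelStep::FinalAnswer(answer) => return answer,
        }
    }
}
```

The massive lift is everything around this loop: reliable structured output, tool sandboxing, recovering from bad tool calls, and knowing when to stop.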
Maps for where you’ve been in Europe and the US (via). Stayed means having spent a night there which means the only places I’ve visited and not spent the night are Slovenia (went over the border to hike a mountain there) and the Vatican.
I should probably upgrade Turkey to lived, depending on your definition: staying in your own house versus being registered as a resident.
My score for the US is negligible and I don’t see this changing any time soon (maybe ever).
A live coding environment to create 3D graphics using signed distance functions written in the Janet programming language. Click through and edit some of the embedded examples to get a feel for how amazing this is.
A thought-provoking article about how to counteract car bloat. I would add that driving an SUV is even worse than smoking because it mostly harms others, not the person doing the driving.
In Germany, for many transactions you need a proof of address, which a Personalausweis provides for German citizens. We foreigners don't get one, however often we ask for it. In the Netherlands, moving through the country without a battery of chip cards (OV-chipkaart, Bonuskaart, OV-fiets etc.), apps and associated services is costly and annoying.
The signs have been there for a while, but China seems to be pushing this much further along. The question is whether it's a deliberate move or whether the number of people affected is so small that they're a negligible edge case for the policymakers over there.
It’s rare to find writing in German as lithe and delightful as what Christoph Rauscher puts out. The monthly lists are one particularly good example. I’m learning new and interesting words still in most of his pieces.
The lamb ad does a good job showing the madness that is online comments sections. Also it made me want to eat a nice piece of lamb.
I’ve had to spend more time than I like thinking about how datetimes are stored in databases and even the commonly accepted practice of storing UTC does not work for all cases.
Specifically, when you store something that will happen in the future, you need to store the location (or at least the time zone) of the event as well. Otherwise any daylight saving time change will shift your event around. This is not just an issue for single events; it can also happen for, say, ordering cut-off times, which aren't pinned to a single date.
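A minimal sketch of what I mean, using chrono and chrono-tz (the crates are my choice here, nothing prescribed): persist the wall-clock time plus the IANA zone, and only resolve to UTC at the moment you need the instant.

```rust
use chrono::{NaiveDate, NaiveTime, TimeZone, Utc};
use chrono_tz::Europe::Berlin;

fn main() {
    // What you store: the local wall-clock cut-off and the zone name,
    // e.g. ("17:00", "Europe/Berlin"), not a precomputed UTC timestamp.
    let cutoff = NaiveTime::from_hms_opt(17, 0, 0).unwrap();
    let order_day = NaiveDate::from_ymd_opt(2025, 10, 27).unwrap(); // day after DST ends

    // Resolve to an instant only at use time, under whatever DST rule applies then.
    let local = Berlin
        .from_local_datetime(&order_day.and_time(cutoff))
        .single()
        .expect("ambiguous or non-existent local time (DST edge)");

    println!("cut-off as a UTC instant: {}", local.with_timezone(&Utc));
}
```

If you had stored that cut-off as a UTC instant back in the summer, it would now be an hour off.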
A very useful thought experiment whenever anybody tries to pretend LLMs are ‘human’ because they sound human.
Here's why "alignment research" when it comes to LLMs is a big mess, as I see it. Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".
[…] the Rands Leadership Slack is the most impactful thing I’ve built outside my family and job.
I can testify to the Rands Leadership Slack being an impactful thing. I joined it a long time ago, and more or less everything I know about engineering leadership I've learned there. I'm eternally grateful for all the hard work the people there put into making it a nice place to be.
Medical QA depends heavily on domain-specific knowledge that is not always available within pre-trained models, necessitating knowledge-based retrieval from external sources.
In addition, medical knowledge evolves rapidly, and new treatments or updated guidelines may not be included in the model's pretrained corpus.
The example of the reasoning process in Figure 1 is a multiple-choice question. That seems overly simple.
In parallel, Commonsense Question Answering shares similar complexities with Medical QA, particularly in its reliance on structured multi-step reasoning and iterative evidence retrieval.
The rStar approach seems worth diving into. That will be the paper I read next.
Monte Carlo Tree Search
enabling the open source LLMs (LLAMA3.1) to achieve competitive performance with top closed-source LLMs like GPT-4 and GPT-4o.
We’ll come to this later in the paper. Their conclusion is that they can trick out LLAMA to get similar performance to GPT-4 in these domains.
Upper Confidence Bound applied to Trees (UCT)
In contrast, rStar incorporates five distinct actions that enable more adaptive exploration:
A1: Propose a One-Step Thought. This action generates the next reasoning step based on previous steps, allowing the LLM to build the solution incrementally.
A2: Propose Remaining Thought Steps. This action enables the LLM to produce all remaining reasoning steps in one inference, similar to CoT, for simpler questions.
A3: Generate Next Sub-question and Answer. This action decomposes the main problem into a sequence of sub-questions, each solved in turn.
A4: Re-answer Sub-question. This action allows the LLM to re-answer a previously generated sub-question, increasing accuracy by using few-shot prompting.
A5: Rephrase Question/Sub-question. This action rephrases the question to clarify conditions and reduce misunderstandings, enhancing the LLM’s interpretation of the problem.
I need to trace the rStar algorithm after reading the original paper. The explanation here is too short.
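Until I do, this is my rough mental model of how those five actions plug into an MCTS-style search, as a sketch of my own (almost certainly wrong in the details):

```rust
// The five rStar actions are the branching factor of a tree search over
// reasoning states; the rest is regular MCTS bookkeeping.
#[derive(Clone, Copy)]
enum Action {
    ProposeOneStep,        // A1
    ProposeRemainingSteps, // A2
    NextSubquestion,       // A3
    ReanswerSubquestion,   // A4
    RephraseQuestion,      // A5
}

struct Node {
    state: String, // the reasoning trace so far
    visits: u32,
    value: f64, // accumulated reward from rollouts
    children: Vec<Node>,
}

// Placeholder for the LLM call that applies an action to a reasoning state.
fn apply(action: Action, state: &str) -> String {
    let _ = action;
    state.to_string()
}

// UCT score used during selection to pick which child to descend into.
fn uct(child: &Node, parent_visits: u32, c: f64) -> f64 {
    if child.visits == 0 {
        return f64::INFINITY;
    }
    child.value / child.visits as f64
        + c * ((parent_visits as f64).ln() / child.visits as f64).sqrt()
}

// Expansion: every action spawns a candidate child state.
fn expand(node: &mut Node) {
    use Action::*;
    for action in [
        ProposeOneStep,
        ProposeRemainingSteps,
        NextSubquestion,
        ReanswerSubquestion,
        RephraseQuestion,
    ] {
        node.children.push(Node {
            state: apply(action, &node.state),
            visits: 0,
            value: 0.0,
            children: Vec::new(),
        });
    }
}
// The full loop would select via uct, expand, roll out to an answer with the
// LLM, score it, and backpropagate the reward up the visited path.
```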
These queries target information that can either support or refute the content of each statement, ensuring comprehensive factual verification.
How does this approach deal with (non-)negation, which LLMs often have a lot of trouble with? From a language perspective it could just as easily say I can or can't eat grapefruit (iykyk) based on the temperature that day, but especially in a medical context these kinds of errors can be catastrophic.
RARE achieves substantial gains, outperforming rStar by 5.17% on MedQA, 2.19% on MedMCQA and 2.39% on MMLU-Medical.
Even if these numbers are statistically significant (which they don’t say), these increases are really modest. I would not call this in any way “substantial”.
Looking at Table 1, RARE is as much of an increase over rStar as rStar is over the next best approach so from that perspective maybe you could call it significant. The difference between worst and best framework here is around 10% across CoT, RAG, SC, rStar, RARE.
evaluated on StrategyQA (SQA), CommonsenseQA (CQA), Social IQA (SIQA) and Physical IQA (PIQA)
The main question I have is at what accuracy such a system would become reasonable to use in a real-world context. Even 90-95% would seem too low to rely on when the stakes are high.
By enhancing LLMs with retrieval-augmented reasoning, RARE bridges the gap between open source models and state-of-the-art proprietary systems.
The framework has only been tested on open source models like LLaMA 3.1 and not on larger proprietary models such as GPT-4. This is due to the high number of API calls required by RARE’s iterative retrieval and reasoning process, making evaluation on closed source models prohibitively costly.
So here they repeat the statement that they’ve bridged the gap but they say they haven’t used this approach with a model like GPT-4 because the number of API calls would make it too expensive.
That leaves on the table the fact that these kinds of many-call approaches are open to OpenAI, because they can make that number of calls much more affordably from inside the house. No real gap has been closed here, and it shows again how big of an advantage OpenAI has.
That raises the question: What makes GPT-4 so good? Why does it perform so much better than open source models?
RARE is designed to identify a single reasoning trajectory that leads to a correct answer but does not necessarily optimise for the best or shortest path that maximises robustness.
Any integration into medical workflows must be supervised by qualified practitioners to ensure patient safety and ethical use.
It’s been a bit of a grab bag year but overall not as bad as 2023 and a bunch of things seem to be on track.
Health
I got on the neurodiversity bandwagon this year.
First I got myself a self-paid diagnosis for ADHD. This result should not surprise anybody who knows me. I've forced myself to be very high functioning throughout my life, but it can't be denied that there were always some underlying issues. I've been on medication since the end of the year and have gone off caffeine.
I also got myself tested for giftedness and got a positive result there as well.
Both of these results were validating if nothing else and put a lot of things that happened in my life in a different perspective.
For anybody who’s not sure whether they should pursue this, my recommendation would be: You will only know how differently you can feel if you do.
I got a mole cut out of my skin. It’s a nice scar to have.
I’m fully vaxxed against FSME and got a booster for COVID in November. That brings me to six jabs in total.
Sports and Injuries
It could have been a great year for sports. After having a great time on our yearly trip to the Alps, I came back to Berlin and badly sprained my ankle after falling off some stairs. I didn’t need any surgery, thankfully, but it did set me back some 8 weeks of physical therapy and having to build up to walking again.
That notwithstanding, I managed to participate in three road cycling group rides this year. MAAP opening up a store here and organising open weekly rides has been really cool. The cycling and the coffee were lit. 🔥
I cycled up the Brocken for my first ever mountain and clocked 4201km in 2024 on Strava.
It’s my goal to weigh 75kgs and I’m still as far away from that as I ever was.
Movies
Letterboxd does a good job tracking this, and it was a pretty good year for movies. I review all of them over there in detail, but I can say the non-Potter kids' movies we watched were nice and the Japanese cinema on the whole was excellent. I saw Evil Does Not Exist two times, the second time in the local theatre, live-scored by its composer Eiko Ishibashi.
Harry Potter and the Philosopher’s Stone
Dune: Part Two
Curious Tobi and the Treasure Hunt to the Flying Rivers
Glass Onion
Frozen
Tangled
Raya and the Last Dragon
Shoplifters
Luca
Harry Potter and the Chamber of Secrets
Yojimbo
Drive My Car
Perfect Days
John Wick: Chapter 4
Evil Does Not Exist
How to Blow Up a Pipeline
Harakiri
Evil Does Not Exist
Die Hard
Television
Trakt is doing a great job keeping track of which episodes of which television series I need to watch. It’s the only way I can possibly stay on top of this.
The Last of Us
Spy x Family S2
Death Note
Frieren
Tour de France: Unchained S2
Vigil
The Peripheral
Kaiju No 8
Bluey
Arcane S2
Looks like I’m turning into a weeb just like everybody else in the culture. I watch anime in part as light entertainment and in part as Japanese immersion. It’s very hard to find anime that has any kind of thematic depth. Frieren comes closest because of how it twists the standard fantasy trope into a story about loss and reminiscence.
Books
It was a fair though not great year for reading.
Sheaf Theory through Examples, Daniel Rosiak
Bring Up the Bodies, Hilary Mantel
Min kamp 2, Karl Ove Knausgård
Maria Stuart, Friedrich Schiller
Arkada Yaylılar Çalıyor, Melikşah Altuntaş
My Tender Matador, Pedro Lemebel
Kafka Connect: Build and Run Data Pipelines, Mickael Maison
Let Us Believe in the Beginning of the Cold Season, Forugh Farrokhzad
Discipline and Punish: The Birth of the Prison, Michel Foucault
The Kubernetes Book: 2024 Edition, Nigel Poulton
Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-premises, Elad Eldor
Conversational Capacity: The Secret to Building Successful Teams That Perform When the Pressure Is on, Craig Weber
I’m continuing my trend of reading one Knausgård and one Mantel book each year. No reason not to do that again this year.
I picked up some poetry at Perdu during my visit to Amsterdam and have been enjoying reading that.
Every time I see Maria Stuart in the list (which I got put on to by Past Present Future's fantastic Great Political Fictions series), I think: "I need to read more Schiller", but then I keep forgetting to get the files off Gutenberg. Germans sure knew how to write back in the day.
Trips
Besides the trip to the Alps, I went to the Netherlands once in 2024 for Kars’s viva and we took a trip to idyllic Hiddensee after my foot was healed. Much more travel is slated for next year!
Other Culture
I don’t go to exhibitions for lack of time. Besides seeing Evil Does Not Exist in the theatre I managed to burn a ticket to the opera and one to a dance show due to conflicting commitments and forgetfulness. I’m not sure whether I’m going to retry this.
I took the kids to see Ronja at an open air show which was fun.
Miscellaneous
I was a member of the Greens but I cancelled that because even if they’re the least bad political party in Germany, they have been doing a lot of things that I do not wish to support from the inside. I wrote about that here.
I continued to learn and maintain my Japanese level in preparation for my trip in 2025.
I learned a bunch around Kubernetes and Kafka but would have liked to do more programming. I refreshed my algorithms a bit and picked up Factor to play with.
The shell I use daily (because it's the best really), fish, has been rewritten entirely in Rust, because it's nice and more fun: "For one, fish is a hobby project, and that means we want it to be fun for us. Nobody is being paid to work on fish, so we need it to be fun. Being fun and interesting also attracts contributors."
I can testify to this because when most of the code was rewritten I checked it out, built it and poked around a bunch to see how it works. I don't think I would have done that, or enjoyed doing it, if it had been a C++ codebase. That was also when I was confronted with the fact that what makes a shell really complicated is not the language in which it is programmed, but the underlying system that it is an interface to.
The story of the port and its success is legendary as far as these things go.
There was a brief period where Foursquare based recommendations were good and drawn from your wider social graph. Now we’ve gone back to Yelp and Google Maps where reviews and ratings don’t mean anything. A lower than 4 star review on GMaps has netted me a cease-and-desist e-mail for defamation.
That puts personally curated travel docs and word of mouth back in play, as Thrillist describes here. Every Dutch person has, or knows somebody who has, a Berlin Google Doc with all the Geheimtipps. Dutch people's tastes are fairly predictable and pedestrian, so these'll mostly be cheap Asian eateries in Prenzlauer Berg, but that's also fine.
For me the most interesting recommendations for Berlin but also for other cities come through TikTok. The algorithm is well tuned to my type of person and in the short videos it’s pretty easy to size up whether somebody knows what they’re talking about or not.
As a parent and as a social media user, I don’t buy that something that’s harmful to adults is not EVEN MORE harmful to children.
The platforms need to be curtailed and this entire situation has to be shut down as soon as possible. We can keep our kids off smartphones, but what about others?
Musk's attack on Wikipedia is another step in getting rid of information sources where they can't control the narrative and the "truth". Everything they're doing is built on lies.
Hans de Zwart's end-of-the-year media overviews are, for me, one of the highlights of what still happens on personal blogs. He's a voracious reader and one of the rare people who acts on his moral clarity. Also, Hans is a great guy and I had the chance to briefly catch up with him last year.
I’ll see if I can pull something together, but definitely go through his list. I always pick up more than a couple of interesting things to explore.
The o3 AGI result looked so noteworthy that I dove into it. I read one of the papers that’s at the base of the approach and thought it was pretty interesting.
Turns out that it was mostly bullshit and everybody was doing another round of “let’s pretend that AGI is real”. What a shambles.
An overview of the year in databases by Andy Pavlo that does not pull any punches. I learned a bunch of things (and I somewhat keep up with this area).
I think we've certainly built an Erlang, but then again Erlang is such an esoteric environment that I would barely consider using it an alternative. Where would you start and how would you get other people onboarded?
Another take on the old adage that writing the code is the easy part of software engineering. The real work is figuring out what has to be built and how. Once that is clear, the actual building can be done relatively quickly and linearly.
I think the notion of a dead program is useful though it’s not always that clear cut:
The death of a program happens when the programmer team possessing its theory is dissolved. A dead program may continue to be used for execution in a computer and to produce useful results.
A paper where they fine tune an LLM to be able to answer some questions itself and figure out for which questions it needs to use a specialized tool. Intelligent tool usage seems like it would expand the use cases for LLM driven systems much more than any kind of scaling (real or imagined).
However, scholars note that their abilities are capped at approximately high-school levels
That seems like a noteworthy statement, especially if you are looking to LLMs to provide "novel thinking". It seems much more likely that high-school problems are abundantly available and relatively trivial, so they get a specific focus.
For numerical answers in the MATH and SciBench datasets, we consider answers correct if they fall within a small tolerance range of the true value, specifically within ±5%.
Don’t really see why you could not get exact answers in a mathematical domain.
This performance gap on public benchmarks is likely due to the larger parameter count and specific optimization of state-of-the-art models on the open-source datasets.
Same as with the high school questions. These datasets are easily available and draw attention so the models overfit on them.
The model Ours-Pn demonstrates performance comparable to Base-Pf, both showing a significant improvement over the base model. This similarity indicates successful internalization of distilled knowledge from tools. The transition from Ours-Pn to Ours-Pi showcases further improvement in answer accuracy, resulting from the model's enhanced ability to intelligently switch to tools for harder questions.
This is the core proposition of the paper. Looking at Table 1 with the accuracy percentages there is something of an improvement but it does not really look dramatic or so convincing that you could use these systems in any critical context.
We’re looking at increases of 10-20% and an accuracy that’s still well under 90% (which I’m also not convinced would be usable).
We introduced a novel two-component fine-tuning approach to enhance Large Language Models (LLMs) in solving scientific problems of varying complexity.
One of the key issues I have with the paper is how much work the term "scientific problems" is doing. If this is published, people are going to think that the LLM is solving actual novel issues, whereas in this case it's just filling in relatively basic question/answer pairs that are well understood. Calling them problems is problematic.
The most interesting part of the paper is the appendix where you can see the actual questions and answers in the various datasets and the prompts they used (with example responses). The answers mostly are multiple choice which already influences how many of them you should expect to be correct.
I didn’t get that much from this paper, probably because it’s pretty high level and I don’t have a strong background in recommendation systems.
The core of the approach is their cuckoo hashmap for embeddings, whose parameters they can update on the fly using existing data engineering pipeline technology.
Instead of reading mini-batch examples from the storage, a training worker consumes realtime data on-the-fly and updates the training PS. The training PS periodically synchronizes its parameters to the serving PS, which will take effect on the user side immediately. This enables our model to interactively adapt itself according to a user's feedback in realtime.
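For reference, this is roughly what cuckoo hashing buys you (my own toy sketch, not the paper's implementation): every key has two candidate slots, inserts evict and relocate instead of chaining, and lookups touch at most two slots, which keeps an embedding table that is updated continuously both dense and predictable.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy cuckoo map from feature id to embedding vector.
struct CuckooMap {
    tables: [Vec<Option<(u64, Vec<f32>)>>; 2],
}

impl CuckooMap {
    fn new(capacity: usize) -> Self {
        Self { tables: [vec![None; capacity], vec![None; capacity]] }
    }

    // Each of the two tables gets its own hash function by salting with its index.
    fn slot(&self, table: usize, key: u64) -> usize {
        let mut h = DefaultHasher::new();
        (table as u64, key).hash(&mut h);
        h.finish() as usize % self.tables[table].len()
    }

    // A lookup only ever inspects two slots.
    fn get(&self, key: u64) -> Option<&[f32]> {
        (0..2).find_map(|t| match &self.tables[t][self.slot(t, key)] {
            Some((k, v)) if *k == key => Some(v.as_slice()),
            _ => None,
        })
    }

    fn insert(&mut self, mut key: u64, mut value: Vec<f32>) -> bool {
        let mut table = 0;
        for _ in 0..32 {
            let i = self.slot(table, key);
            match self.tables[table][i].replace((key, value)) {
                None => return true,
                Some((evicted_key, evicted_value)) => {
                    // Whatever we displaced gets pushed to its other candidate slot.
                    key = evicted_key;
                    value = evicted_value;
                    table = 1 - table;
                }
            }
        }
        false // a real implementation rehashes or grows at this point
    }
}
```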
A bunch of stuff that maybe was somewhat surprising a year ago but by now should be common knowledge for anybody even half following the developments in this field.
Some interesting bits in there but for the rest it’s a bit rah-rah because the author works at Anthropic.
In particular, models can misinterpret ambiguous prompts or incentives in unreasonable ways, including in situations that appear unambiguous to humans, leading them to behave unexpectedly.
Our techniques for controlling systems are weak and are likely to break down further when applied to highly capable models. Given all this, it is reasonable to expect a substantial increase and a substantial qualitative change in the range of misuse risks and model misbehaviors that emerge from the development and deployment of LLMs.
The recent trend toward limiting access to LLMs and treating the details of LLM training as proprietary information is also an obstacle to scientific study.
Wikipedia does not really come to mind when I think of a place that’s really left-wing, but maybe that’s just me?
I do something along similar lines here. I share links to various things that I find interesting and try to add what I think is interesting about them. From here I then schedule posts to Bluesky, Mastodon and LinkedIn using Buffer.
I’m not sure who reads my stuff here but I know for sure that people see the exhaust on those platforms. The main reason why I blog them here is to have my own repository of knowledge and links for if I ever have to refer back to it. For that I annotate things in a way where I hopefully can find it again and use site search to find ‘that one link about X I shared a while back’.
So yes, WordPress works just fine as a personal knowledge management system.
The Digital Patient Record system in Germany is built on smart cards and hardware which make it impossible to update and keep secure.
Of course a company like Gematik can't update algorithms and keys on such a widespread, heterogeneous system. This is a competency that is impossible to organise except at the largest scales, and even then companies like Microsoft will routinely leak their root keys.
The ‘hackers’ who made this presentation also can’t make something better than this and their culture is what led us to this point in the first place. It’s the same story with the German digital ID card which nobody uses.
The recipe is simple:
Demand absurd levels of security for threat models that are outlandish and paranoid
Have those demands complicate your architecture with security measures that look good but are impossible to maintain
Reap the exploits that you can run against that architecture and score publicity
<repeat>
It’s a great way to make sure that everybody loses in the German IT landscape.
Solution: Simplify the architecture to a server model with a normal 2FA login and keep that server secure. Done.
Riemann-Roch theorem: the latest craze: the diagram […] is commutative! To give this statement about f:X->Y even an approximate meaning, I had to abuse the patience of my audience for nearly two hours. In black and white (in Springer Lecture Notes) it comes to maybe 400, 500 pages. A gripping example of how our drive for knowledge and discovery plays itself out more and more in a logical delirium removed from life, while life itself goes to the devil a thousandfold, and is threatened with final annihilation. High time to change our course!
—Alexander Grothendieck
The low-latency user wants Bigtable’s request queues to be (almost always) empty so that the system can process each outstanding request immediately upon arrival. (Indeed, inefficient queuing is often a cause of high tail latency.) The user concerned with offline analysis is more interested in system throughput, so that user wants request queues to never be empty. To optimize for throughput, the Bigtable system should never need to idle while waiting for its next request.
This is also my abject suffering at the moment: we have lots of shared resources which need to stay available but can also be hammered by various parties.
I use AI tools to help me program despite them being mostly very disappointing. They save me some typing once in a while.
At least, now that I have switched from Perplexity to Cursor, I can ask my questions in my editor directly without having to open a browser search tab. I pass through a lot of different technologies in a given workday, so I have a lot of questions to ask.
For my use cases, it’s rare that Cursor can do even a halfway decent code change even in domains where there is a bunch of prior art (“convert this file from using alpine.js to htmx”). I know people who say they have generated thousands of LoC using LLMs that they actively use but there the old adage comes in: “We can generate as much code as you want, if only all the code is allowed to be shit.”
The position below is one of the more charitable positions of how AI can help a programmer and even that I don’t think is particularly convincing.
I thought I'd dive back into history and read the original paper that started it all. It's somewhat technical, about encoder/decoder layouts and matrix multiplications. None of the components are super exciting for somebody who's been looking at neural networks for the past decade.
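For reference, the one operation everything else in the paper hangs off is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q, K and V are the query, key and value matrices and d_k is the key dimension; multi-head attention just runs this several times in parallel over learned projections and concatenates the results.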
What's exciting is that such a simplification generates results that are so much better, and how they came up with it. Unfortunately, they don't write about how they found this out.
The paper itself is a bit too abstract so I’m going to look for some of those YouTube videos that explain what is actually going on here and why it’s such a big deal. I’ll update this later.
I came across this paper after the recent o3 high score on the ARC-AGI-PUB test. It's a quick read and details how to scale LLMs at inference time by generating new states at every node, creating a tree on which to perform DFS/BFS search algorithms.
A specific instantiation of ToT involves answering four questions: 1. How to decompose the intermediate process into thought steps; 2. How to generate potential thoughts from each state; 3. How to heuristically evaluate states; 4. What search algorithm to use.
For each of these steps they can deploy the LLM to generate the desired results, which, scaled over the search space, balloons the number of calls that need to be made (costing almost 200x the compute).
This isn’t your normal LLM stochastic parrot anymore. We’ve gone one up the abstraction chain and here we have a computer science algorithm running with LLM calls as its basic atoms.
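A rough sketch of that shape (mine, not the paper's code), with the LLM reduced to two stubs, one that proposes candidate next thoughts and one that scores a state:

```rust
// Stand-ins for the two kinds of LLM calls the tree is built from.
fn propose(state: &str, k: usize) -> Vec<String> {
    // would ask the model for k candidate next thoughts given `state`
    (0..k).map(|i| format!("{state}\n[thought {i}]")).collect()
}

fn evaluate(state: &str) -> f64 {
    // would ask the model how promising `state` looks
    state.len() as f64
}

// Breadth-first Tree of Thoughts: expand every frontier state, score all
// candidates, keep the best `beam`, repeat. Every level multiplies the number
// of model calls, which is where the ~200x compute figure comes from.
fn tot_bfs(problem: &str, depth: usize, k: usize, beam: usize) -> Option<String> {
    let mut frontier = vec![problem.to_string()];
    for _ in 0..depth {
        let mut scored: Vec<(f64, String)> = frontier
            .iter()
            .flat_map(|s| propose(s, k))
            .map(|candidate| (evaluate(&candidate), candidate))
            .collect();
        scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        scored.truncate(beam);
        frontier = scored.into_iter().map(|(_, s)| s).collect();
    }
    frontier.into_iter().next()
}

fn main() {
    println!("{:?}", tot_bfs("toy problem statement", 3, 4, 2));
}
```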
I get to have a lot of conversations around compliance and this is as good a “SOC2 for tech people” guide as I could have asked for by the good people at Fly.
Something that Ed Zitron has already mentioned: internet users are rapidly being trained to suspect everything of being a scam. In the case of Honey, it's a PayPal-owned online shopping extension that lies and deceives.
I made the mistake of opening the desktop Spotify app which does not really work anymore. The UI is broken and there’s lots of irrelevant stuff going on.
Technology should become more useful, not more exploitative. It’s a simple thing to ask for.
Judging from this article, it will probably not be good for the acceptance of new technologies long term to build data centers in drought struck areas.
Postgres seems to reign as the database solution of choice but there are lots of new specialised databases that are worth looking at. All of these can be used in production at scale for the right application.
That’s an amazing overview of all the things that can and will go wrong in online PvP gaming. It covers the range from networking exploits to all the in-game ways that people try to grief or abuse others.
So I couldn't really bring myself to do Advent of Code this year: I have more than enough other things to do (and watch and play), and with work and the kids it's always pretty miserable to keep up.
I saw this thing called December Adventure though and that fits in nicely with my current push to release a major update for Cuppings. If I’m going to be programming until late this month, then I’d prefer it to be on something that I can release.
I can’t promise that I won’t do any AoC (Factor is looking mighty cool) but I won’t force myself to do anything. With that, let’s get going.
1/12
I started working on the map view which, from clicking around, looked like it could be really annoying. I found some dead ends and was afraid I'd have to hack in Leaflet support myself, but then I found a dioxus example hidden in the leaflet-rs repository.
Yes, I’m writing this website in Rust/WASM, why do you ask?
That example required a bunch of fiddling with the configuration and a couple of false starts, but now I have a vanilla map view.
I can say that I’m amazed that in this ecosystem 1. an example exists 2. that example works 3. it works in my project with a bit of diffing and 4. it seems to do what I need.
I raised a PR to the project to advertise this example on its README just like it does the others so that others wouldn’t have to search like I did. That PR got merged:
Today I’ll see if I can tweak the map view to show the location of the cafe we tapped and get things to a point where I can commit the change.
To do this I need to figure out how to pass information along to a router when we tap a venue. That should be easy enough but the Dioxus documentation is between 0.5 and 0.6 now and a lot of it is broken.
A tip from the Discord said I need to put the data into a context from a parent and then get it out again in a child. It’s a bit roundabout and required some refactoring, but it works.
Done on time even for a reasonable bed time.
3/12
Turns out my changes from yesterday did not make it to the staging server. I’ll fix that and manually run the job again.
Those are the annoying wasm-bindgen version errors that keep happening and that require a reinstall of `cargo install -f wasm-bindgen-cli --version 0.2.97` and of the dioxus-cli. Dioxus, by the way, is preparing its long-awaited 0.6.0 release.
Other than that, not much will happen today since I spent most of the evening noodling around with Factor (despite my intention not to do any weird programming). It's a nice language that's very similar to Uiua, which I tried out a while back, but not being an array programming language makes it feel somewhat more ergonomic.
4/12
I can’t describe how nice it is to wake up and not have to deal with a mediocre story line involving elves and try to find time to attack a programming problem.
After today, I’m going to need that quiet morning, because I spent until 01:30 debugging an issue: Going to a detail view from the frontpage worked, but loading a detail view directly would throw an error.
There were two issues at play here:
Leaflet maps don't deal well with being created multiple times, so either we have to call `map.remove()` or we have to check whether the map has already been created and keep a reference to it somehow.
I solved it by pushing the map into a global variable:
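The shape of it is roughly this (a reconstructed sketch, not the exact code; I'm assuming leaflet-rs's Map type here): WASM runs single-threaded, so a thread_local behaves like a plain global.

```rust
use std::cell::RefCell;
use leaflet::Map;

thread_local! {
    // WASM is single-threaded, so a thread_local is effectively just a global.
    static MAP: RefCell<Option<Map>> = RefCell::new(None);
}

// Build the Leaflet map only once; later calls reuse the stored instance.
fn with_map(create: impl FnOnce() -> Map, use_map: impl FnOnce(&Map)) {
    MAP.with(|slot| {
        let mut slot = slot.borrow_mut();
        let map = slot.get_or_insert_with(create);
        use_map(map);
    });
}
```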
These are Rust constructs I would normally never use so that’s interesting. More interesting is that they work in one go and that they work on the WASM target.
Then the error was gone but the page was blank. Not entirely sure what was happening, I poked at the DOM and saw all the map elements there, just not visible. Turns out that because of the different path, the relative path for the stylesheet was being resolved against the page URL like this: http://127.0.0.1:8080/venue/176/main.css
It just has these two lines:
#map {
width: 100%;
height: 100vh;
}
But without a height the map is invisible.
Both issues are solved but not committed. I’ll see tomorrow whether I’m happy with the solution and how to package this up. Also I’m not sure how main.css is being served on production and whether the same fix will work there.
(I looked at day 2 part 2 but that just looked very tedious.)
8/12
Got in a ton of commits on Cuppin.gs today. After fixing the map, I wanted to see what would happen if I added all 2000 markers to the map.
Performance seems to be doable but this is probably not ideal for a webpage. Dynamically rendering the venues is something for later. For now I can probably get away with filtering for the 100-200 nearest locations by distance and dumping those into the map view.
Now I’m back debugging Github Actions. I’m splitting up the build and deploy of the backend and the frontend into separate actions. Compiling dioxus-cli takes forever which is a step I hope I can skip with cargo-binstall.
Iterating on Github Actions takes forever and there really doesn’t seem to be a better way to develop this or a better CI solution that everybody is willing to use.
10/12
Spent some hours massaging the data that goes into the app. I had to add all new venues and after that I wanted to check whether any place in our 2k venue set had closed so we can take them off the display. This is a somewhat tedious multi-step process.
I have an admin binary that calls the Google Maps API for each venue to check the venue data and the business status (CLOSED_TEMPORARILY and such). But to be able to do that you have to feed each place ID into the API. The only issue with place IDs is that they expire from time to time. There’s a free API call that you can use to refresh them.
That expiration does not happen that often. What happens more, I found, is that a place will disappear entirely off Google Maps; for some reason it will just be deleted. I don't handle that case yet, so there my updaters break entirely and the quickest fix around it is to delete the venue from the database and restart.
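For context, the refresh is just a Place Details request (asking only for place_id back is the variant Google documents as free), and business_status comes from the same endpoint. A hypothetical sketch of the check, with my own struct names:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Details {
    place_id: Option<String>,
    business_status: Option<String>, // e.g. "CLOSED_TEMPORARILY"
}

#[derive(Deserialize)]
struct DetailsResponse {
    status: String, // "OK", or "NOT_FOUND" when the place has been deleted
    result: Option<Details>,
}

fn check_venue(
    client: &reqwest::blocking::Client,
    api_key: &str,
    place_id: &str,
) -> Result<DetailsResponse, reqwest::Error> {
    client
        .get("https://maps.googleapis.com/maps/api/place/details/json")
        .query(&[
            ("place_id", place_id),
            ("fields", "place_id,business_status"),
            ("key", api_key),
        ])
        .send()?
        .json()
}
```

The NOT_FOUND case is exactly the deleted-venue situation that still breaks my updater.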
The only data issue that I still have outstanding is when venues move their location to a different address. I have a place around here that I think is still showing on its old spot.
11/12
Tried to run Cuppings in Xcode to be met with some weird compilation errors. Turns out that there’s an Expression type in Foundation that’s overriding my SQLite.swift Expression. It’s a pretty silly reason for code to be broken: Expression – name space conflict with Xcode 16/iOS 18
Also still fighting with the frontend deployments which seem to need a --frozen passed to them to not proactively go update package versions.
14/12
Love to have a crash on startup for the Cuppings TestFlight build and then sit down today to bake a new one and upload that and for that one to work. No clue what the issue was even though I took a look at the crashlog (that I sent in myself).
I've also automated building the iOS app with Xcode Cloud, which should make baking new versions (whenever the database is updated) a lot easier.
16/12
Upgraded the frontend to Dioxus 0.6.0, which just came out and has lots of quality-of-life improvements. For my case, I did not need to change a single line of code, just bump some version numbers and build a new dioxus-cli.
Nice TUI for serving the frontend
I hope that maybe solves the wasm-bindgen issues on the frontend deploy. The annoying part about the build is that it takes so long that it’s very hard to iterate on.
It’s too late even for me to see what this does. I’m off to bed. You may or may not get a new version of the website by tomorrow morning.
18/12
Spent some iterations running the frontend deploy and rerunning it but now it should be working.
22/12
I spent the evening doing manual data munging and correcting some venue locations that hadn’t been updated correctly through my data life cycle.
That forced me to clarify the two name fields the venues table has.
name was the original name field and was pulled from the Foursquare metadata
google_name is the name field that’s pulled from Google Maps and was effectively leading but not updated correctly yet when refreshing the data
So to figure that out I did a bunch of auditing in the list to see venues where there was a large discrepancy between the names. Something that happens is that a place will change its name but keep the same location and Google Maps place.
I also added a label to the iOS app to indicate whether it's a DEBUG build, but that messed up the layout and I guess I might as well remove it. Sometimes I get confused about what I'm running, but since it's just me running DEBUG builds on my phone, I think I can do without.
I also started a rewrite that I’m not sure I’m going to pull over the line: I wanted to remove the search dependency on Alpine.js and replace it with htmx. For this I asked Cursor to do the translation which it did a stab at but ultimately rather failed to do even the basic steps for it. Then I did it myself and while htmx is super easy to setup, the data juggling I have to do with what I get from Google Maps is very fragile and needs to be cleaned up (which I may or may not do given that things are working right now).
23/12
Working with the backend was very annoying because every time the server restarts, it would log me out. To fix that I changed the persistence of tower-sessions from MemoryStore to FileSessionStorage, which worked without issues. There is now a .sessions folder in the backend which needs to be ignored for cargo watch, but other than that it's a drop-in replacement.
That means I will need to write a logout view at some point.
Old people’s brains have been entirely cooked by the slop feeds that Meta produces. No parent worth their salt would trust these people to protect their kids. The EU should follow suit.
the law may infringe on the rights of young people and reduce their ability to participate in society
Since when is being spoon fed the worst advertising and content created by awful people looking to make a quick buck “participating in society”?
Seeing if I can move from Arc to Vivaldi but there are half a dozen radical improvements in Arc that *make* the experience. It just shows how much innovation and solid thinking was packed in all of that frivolous design.
Vivaldi on the other hand has a million settings, which mostly show that nobody knows what this app is supposed to be doing. There are entire note-taking apps and e-mail clients in there, but none of them are fun or nice to use.
Products truly live and die in the pixels.
Chuffed by the strength of my team that we have most of the easy decisions from these Node.js pillars in place and we are making headway on all of them to the extent that’s appropriate for us.
What China has done in industry after industry is to flatten the supply curve by subsidizing hordes of producers. This spurs innovation, increases output and crushes margins. Value is not being destroyed; it’s accruing to consumers as lower prices, higher quality and/or more innovative products and services.
If you are looking for returns in the financial statements of China’s subsidized companies, you are doing it wrong. If China’s subsidized industries are generating massive profits, policymakers should be investigated for corruption.
A piece well worth reading about China’s economic policies if only for the fact that their flattening of supply curves is the only thing that is really fighting climate change.