Matt Stockton

A Year Of Building With LLMs - The Paper You Need To Read

2024-09-25T00:00:00+00:00

If you have an hour, click the below link and then just close out my website. If you don’t have an hour, you can read below, but it wont be as good.

What We’ve Learned From A Year of Building with LLMs

The above paper is such a fantastic paper that I am almost hesitant to post my summary notes. It’s well-worth your time to read the paper itself, and you’ll get much less value from my very rough bullet-points below. But I’ve found I get a lot more personal value and information recall when I take notes, and if I have the notes — why not share them here?

I am grateful to all of the authors of this paper for sharing their hard-won insights for free, in public. This paper is super-approachable, yet only takes an hour or so to read. If you’re technical and looking for a hook to dig into LLMs more deeply, this is the way.

Take-aways

Learn how to prompt effectively. If you learn anything, learn about “N-shot prompting” (which is really just providing examples inline), and “Chain Of Thought” (which is really just providing a detailed list of steps you want the LLM to take, sort of like you would to a brand new college grad at your company)
Structure and label your prompts. Typing a big paragraph wont work as well as labeling different parts of your query. Providing some background information and then a source document? Label it as such using headers within your prompt
Keep prompts small, focused on one task. Just like in traditional software, break things into chunks that do one thing. If your prompt is doing a lot of things, break it into multiple prompts
There’s this famous quote that I just looked up. I hope it’s correct. It’s good advice for writing prompts. Just because it’s a computer, doesn’t mean it wants all your extra fluff: Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
There are some really naive ways to do RAG, and you shouldn’t ignore tried-and-true retrieval techniques like keyword-based search, even if it’s really ‘old’ technology. Vector embeddings are really cool and useful, but have their failure modes, and are best used along with techniques like BM25, not as a replacement for it.
Fine-tuning is useful, and is getting easier and cheaper to do — but really refining your prompts, attempting RAG, and checking different models — is usually a better place to start. Also, you need to have a systematic way to evaluate your prompt’s effectiveness. Don’t just type stuff into the UI, look at the outputs, and assume it’s going to work well for all cases.
Long context windows are really cool, but it’s not an excuse to be diligent about what you put in the window. Just because you add more stuff doesn’t mean the result is going to be better. Also, stuffing the context full is going to make things slow at inference time.
This is a bit of a repeat, but it’s really good advice. Break problems down into units, and design your system around those units. I loved this quote from the paper:

The most successful agent builders may be those with strong experience managing junior engineers because the process of generating plans is similar to how we instruct and manage juniors. We give juniors clear goals and concrete plans, instead of vague open-ended directions, and we should do the same for our agents too.
You need to test your prompts. The baseline is sample inputs and outputs, with unit-test-like assertions. Just because the output is unstructured text doesn’t mean you cant build meaningful assertions and automate their execution.
LLMs can be used for testing as well (e.g. feed the inputs / outputs of a prompt into another prompt that ‘judges’ how good the output was). This is a valid approach, but there’s some diligence you need to apply to the judge prompt, and you should dig into the valid techniques here.
LLMs seem like magic, but they aren’t. If you’re asking it to do something that an intern couldn’t do with the data you gave it, then you might need to revise your approach.
Feedback loops in your system are an absolute necessity. Figure out a way to get your users to label and provide feedback on your outputs. Also, having a user compare two things (e.g is A better than B?) is more reliable than asking them to rate something (e.g. rate A on a scale of 1-5)
LLMs will almost never say no, even when we want them to. This is where you can get into the most trouble with them making stuff up, so test your systems accordingly. Build it into the evals.
Log your prompt inputs and outputs - You’re going to want to look at your data, and you need the lowest-level data - so orchestrate your system such that you have easy access to the actual API inputs and outputs - and then look at it.
Look at your data - This is almost a meme at this point. But it’s true. It’s not always fun, but — look at your data, see what looks good, see what looks bad — stare at it for a while, and then figure out what’s going wrong and what you need to change.
There are some really good ways to get models to return structured outputs like JSON. Instructor is an awesome library to do this, and it’s really easy to integrate
You can’t just change models, even if it’s a model upgrade. Things break. Stuff that used to work breaks, and vice-versa. This is why you need a systematic way to test your prompts. You should be able to swap out the model, and then have a quantitative view on how this new model affects your specific use-cases’ performance. If you don’t have that, you gotta get there. There are some really light-weight tools that are easy to start with for Evals, like PromptFoo
Inference is getting cheaper, but it’s not free. It’s worth experimenting with the smallest model that gets the job done.
You cant just let LLMs go wild, they need some help from people. The best experience right now is a Human In The Loop approach. This means that you find the most seamless, least friction, most valuable way to integrate the LLM capabilities into the user’s application experience. Without duplicating a bunch of stuff here, this part of the paper gives a great example
Like anything you’re building, you gotta know what your priorities are. You can’t have everything, and if you think you can — you’re gonna build a bad product. Interesting priority dimensions to consider for LLM products are: Reliability, Harmlessness, Factual Consistency, Usefulness, Scalability, Cost
Process over Tools - There are like 1000 new tools being launched every day in this space. So many shiny objects. But what are you really trying to do? What problem do you need to solve? Ignore the shiny objects and solve the problems, in the most pragmatic way possible.
Experimentation is more necessary than ever. We’re all still figuring out how a lot of these models really work — so sometimes you need to go wide to figure out the best approach. The best way to do this is to have systematic ways to test your experiments (got evals? or no? go get them…)
Get everyone involved across the entire org. Domain expertise has never been more valuable. Now that the form factor for instructing computers is just english, many more people at your company can do it. Show them the way, don’t gate-keep. This is how your team will win.
Don’t build something that is everything to everyone - it’ll be nothing to no one instead. Find a niche and crush that niche first - I think this is true for any software product, but probably more true for LLMs, because if you reduce the quality of the response by expanding the target use-cases, it becomes apparent very quickly. Focus is everything. Find your beach-head first, and make it awesome.
Look for tools you can assemble, don’t build them yourselves - There are so many great building blocks in this space - treat them like legos — don’t become a lego factory. What’s your core value proposition or defensible moat? Make sure you’re building that, not a thing someone else is better at building. Just use their stuff.
Prompting and Evals and In-Application Data Flywheels - This is where to focus your time.

Testing Your Prompts - Writing LLM Evals

2024-09-24T00:00:00+00:00

Anthropic created a set of python notebooks detailing how to create prompt evaluations. The GitHub repo is here. If you’re using a language model in your application, you should test the results of your prompts in a systematic way. It’s not much work to do this, and it’s tremendously important to do so. There are tools to help you run the evaluations, and they don’t take much time to set up or use. I’d encourage folks to read through the notebooks linked above to build appropriate context. You can read through all of them within an hour or two. As a quick overview, I wrote a few take-aways below.

The key point to make is that like any software you’re building, you need to test it. Testing is a bit different than classical unit / regression testing, but the patterns are similar. The patterns and tools to create evals are approachable and easy to understand. If you’re building with LLMs, it’s worth your time to understand and apply these concepts — it’ll save you a ton of time in the long-run!

My Notes

The key components within prompt evaluation are (1) the example input (e.g the prompt to the model), (2) the golden answer (e.g. what is considered a correct answer from a model), (3) what the model actually returned, and (4) the grade (e.g. given the golden answer and the model output, how would you grade the result?)
There are a few categories of grading. The two most approachable categories are (1) code-based grading, where you make assertions in your code (e.g. exact string matches — this is analogous to unit testing in classical software development), and (2) Using another LLM to score the output (e.g. build a different prompt that ‘grades’ the result of the first prompt)
Anthropic provides a workbench tool that allows you to prototype evaluations. It’s a good start, but there are also custom-built tools that give you more customization and options.
Prompt evaluation is iterative. The point of having a systematic way to score your prompts with a quick feedback loop is to give you a clear path to improve your prompts - It also allows you to test different models quickly and easy (e.g. want to swap out Claude 3.5 for GPT4o?, the right prompt evaluation framework makes that really easy)
There are a bunch of tools to help you build evaluations. A decent, light-weight open-source option is PromptFoo. PromptFoo requires very little customization to get started, and relies on approachable configuration patterns. You can do almost everything in a yaml file if you’d like, but there are lots of hooks to do things like write custom evaluation logic. The UI is quite nice as well. When you run an evaluation, it’s really easy to dig into specific examples (e.g. if a specific output didn’t align with the expected result, it’s easy to click around and see why)
For code-graded evaluations, PromptFoo has a bunch of built-in metrics. For example, if you want the model output to classify some text, and each piece of text can have multiple labels, there is a built-in contains-all construct.
In addition to the built-in metrics, it’s easy to write custom grading functionality. You can define a python function that takes in the ‘current sample’, as well as the model output, and you can return your own score. The example given in the notebook is if you wanted the LLM to use a specific word an exact number of times in its output, you could process the output text from the model, and assert that the word appeared that number of times.
Lots of LLM use-cases require more nuanced evaluation. For example, you might want to assert that a model response has a certain tone, or that the model response is very concise. For this use-case, PromptFoo has a built-in llm-rubric evaluation, where you can specify a written criteria to judge the model output, and PromptFoo will use a separate LLM prompt to check this criteria. As an example, this is the PromptFoo yaml that you can use to ensure that a model response isn’t too apologetic:
```
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: Is not apologetic
```
When using llm-rubric you can click into specific samples to see why a sample passed or failed the rubric. This allows you to iterate on prompts and approaches.
PromptFoo allows you to test multiple prompts against multiple models at the same time. So this allows you to compare prompt performance, whether you’re changing the underlying model, changing the prompt, or both.
In addition to the built-in llm-rubric , there are hooks within PromptFoo to construct your own evaluation prompt. So if you have very detailed instructions on how a separate LLM prompt should be used to judge the output of the prompt you’re testing, you can code this up in python, and have PromptFoo use it.

Final Thoughts

LLMs can feel like magic and can do things that were impossible a few years ago. They are an incredible tool to augment classical software development and machine learning. LLMs are also, by design, non-deterministic. There are some great patterns and tools to systematically test your prompting, and it doesn’t take a ton of time to get started. It’s worth your time to spin through a tutorial or two, and just give one of these tools a try.

A Firehose of LLM / AI Tidbits

2024-09-19T00:00:00+00:00

I’ve been pretty far down the rabbit hole of Large Language Models for a while, yet there’s still so much for me to learn. I feel so early in the journey. At the same time, I realize I’ve built up a lot of knowledge worthy of sharing. I enjoy writing about what I’m learning here on my site. It’s been a fun way to experiment with tools, get some writing practice in, and connect with others who are curious too. I have a large backlog of topics I’d like to write about. For this post, I’m trying something new. Instead of going deep on a single thing, here’s a firehose of quick tidbits. My goal in sharing this is twofold: (1) to categorize and solidify things I’ve found interesting over the past month or so, and (2) to inspire a few of you to go deeper in this space. Maybe you’ll find something here that is the nudge you need to go deeper. Happy building and exploring.

The Firehose

When you’re doing something manual or time-consuming, always be thinking about how ChatGPT might help. Like this guy who created calendar invites from a picture of his kid’s school calendar - I tried this too and it worked. You can use these tools for tedious things. Sometimes they just work.

LLMs will continue to get better at dealing with ambiguity in data inputs. But it’s never a bad idea to structure your input data intentionally. Intentional design around your company’s documentation formats, and making them ‘LLM friendly’, will make adopting these tools easier for you in the long run.

LLM hallucination is a feature, not a bug. It’s by design non-deterministic. It can do amazing things for you, if you work with it the right way. Don’t reject it too early, and learn about what it’s good at and why. There are techniques to make it more deterministic and more reliable. Learning these techniques are valuable, but it’s ultimately more valuable to accept the uncertainty and build with it in mind.

Don’t think of AI tools as ‘it will do the work for me’ - think of them as thought partners to get better at your work. Treat it like a work partner, an intern, a teacher. It doesn’t do your work, it helps you get better at it.

The AI ecosystem is like a set of lego building blocks. There is a huge set of tools to consider using, at multiple abstraction levels (e.g. build from scratch, use some libraries, use one tool, use a suite of tools), and also with different modality focuses (e.g. text, video, voice, etc.). It’s really easy to feel overwhelmed, but building a mental model for what you use when, and why — is an extremely valuable exercise. It’s also iterative.

There are truly amazing breakthroughs happening in the vision and image space. Things that weren’t possible a year or two ago are trivial now. This is going to lead to some wildly novel products — and understanding the ‘new possible’ is a really valuable thing. What can you build with it now? So much new opportunity space. As example, this is wild to me: predicting rainfall not based on weather data, but on actual past radar weather maps

There’s no excuse now to get started learning. There is a huge amount of high quality, free content to help you learn . Go to bensbites.com → Course Catalog → Filter For Free → Choose Difficulty Beginner. 68 courses to choose from. Choose one that catches your eye and just get started.

The big players like Anthropic Claude have fantastic documentation for stuff like learning how to prompt. Better yet, there’s incredibly talented people who are summarizing the key take-aways for you - Find the right people and follow them on Twitter.

It’s not just Twitter though. There are some truly incredible people sharing their knowledge for free on their personal site. For example, if you spent a day or two reading all of what Eugene Yan writes, you’d be so much further ahead than almost everybody. You don’t even have to read everything. The first link on this page is ‘Lessons from a year of Building with LLMs’ - Read that alone and you’re probably ahead of 90% of people in terms of knowledge. Find the ‘Eugene Yans’ that write about what you’re trying to learn.

You can get a lot of interesting ideas for products to build by just looking at different datasets. LLMs give us an entirely new way to use data. If you’re itching to build a product, spend a day just looking through free datasets with the background thread of ‘what product could I build with this?’

Most people I know who are excited about the space and are eager to learn more had 1 or 2 - ‘Holy Sh*t, you can do that?’ moments. It’s actually not that hard to find one, where you can try it out end-to-end yourself. Maybe that’s the nudge you need to get started. As an example, here’s a quick tutorial on how to make AI images of yourself using Flux

Circling back to video and image stuff. The things you can do out of the box now are just mind boggling. Multi-modal models can just do things. Incredible things

If someone is talking to you about why AI is all good or all bad, don’t listen. Like anything, there is nuance. Lacking nuance while learning is just going to block your exploration. Like any tool, understanding how it works, when it should be used, and what it’s good at — is going to make you better as using it.

I said above that you can think of AI tools as an intern, a teacher, a co-worker, a professor. Another way to think about it: It’s a map

There are so many tools in this space, and it’s easy to get bogged down, but it’s also good to know what’s out there. Let other people do the work for you, and keep a pulse on what others are using - But don’t make that the focus. You could spend all your time keeping track of all the things.

There’s a lot of noise on prompting techniques, but there’s some gold in the hills. Rely on the documentation provided by the big players (e.g. Anthropic), and also people that really have deep experience playing with these tools. There’s huge alpha in learning prompting, and continuing to learn and iterate on what works for you

Writing Evals is a big topic right now for folks building LLM systems. There’s some really interesting and helpful emerging content on this. But the key thing to remember is that you’re building software, and you need to test it — just like every other piece of software you build. LLM outputs are probabilistic, but there are still plenty of techniques you can use to build robust and tested systems.

You should try out NotebookLM to create a podcast. If you haven’t had your ‘Holy Sh*t’ moment, let this be it. It’s free and takes 10 minutes

Keep close tabs on OpenAI’s o1 models. It feels like we’re really on the verge of a step-level change in what’s possible. Here are a few interesting tidbits about that:

Writing code from a technical paper
Oscar Health’s experience so far
What an extremely smart person has learned about it so far about the different models
Some pretty cool (and short) video demos

If you build stuff, you should try out Cursor or Replit. I know there’s been some amount of resistance to using these tools, particularly for experienced folks — but this is where things are going. My honest take is that these tools are even more valuable for experienced folks, because they still can use all that hard-won knowledge to fix the errors the tools make.

Finally, it’s very hard to keep up in this space. You could accidentally make that your full-time job. You need to find the balance of ‘keeping up’ but also actually trying things and building. My hard and fast rule: Don’t overthink it too much, don’t be too anxious if you feel like you’re missing new things, and if you are uncertain about how to spend your time, bias towards building. The best way to learn is to put in the reps.

In terms of keeping up, you’ll notice that almost all of the above links are to Twitter. Twitter is by far the best way to keep up on the space. You just have to follow the right people, ignore the noise, and know when to put it aside.

What Are You Building? How Can I Help?

Hope you found this helpful. If you have good resources, notes, knowledge to share, send them my way. If you or your company need help building or learning more — I can help. More about how you can work with me here

Building small-scale Custom Software with LLMs and Gumloop

2024-09-16T00:00:00+00:00

The cost to prototype custom software that solves niche problems can be substantially lower with LLM capabilities. Aspects of a software system that previously required deep technical knowledge to build can now often be implemented cheaply and easily with the help of LLMs.

Not every problem you or your company has is worthy of being solved by software. The normal filters still apply, like: “How much time/money will we save?”, or “How does the quality of the output change?”, or “How long is it going to take to diligence vendors, de-risking that they can actually solve this problem?” - These aren’t the only questions, but the point is that there are base costs for consideration when building a thing, and often times these considerations lead you to just do what you were always doing in the first place.

The default assumption until recently is that a problem’s scale or impact has to be somewhat large before you consider trying to build something yourself. With emerging LLM capabilities and the tools being built around them, that cost equation is rapidly changing. The base costs have been drastically reduced. It makes more sense now to prototype things sooner, because its easier to see if they work well in a very short period of time.

Here’s a small (somewhat toy) example, but it’s a real problem that I want to solve in my life.

Problem

I listen to a lot of podcasts on my daily run. Most podcasts are technical in nature, and often have a few key take-aways or concepts I want to follow up on. I try to jot down notes on my run, but this doesn’t always work well. I want a succinct way to keep track of the key highlights from these podcasts.

What I did

I used a tool called Gumloop to protoype this. Gumloop is a tool that enables you to automate AI workflows with a drag-and-drop UX. I built a simple workflow that did the following:

Scraped the website of a recent podcast I listened to (the one I tried was this Latent Space episode about Brightwave, which was fantastic)
Extracted the full transcript from the website, using a Gumloop node that made an LLM query to GPT-4o
Used Claude to generate an LLM prompt that gave detailed instructions on how to summarize the transcript
Plugged that prompt in as a new Gumloop node, asking it to run this summarization prompt
Added a Gumloop ‘Generate File’ node, to take the output from the above step and create a PDF from it.

You can see the final result here in PDF form, which is a pretty good summary of the key take-aways from the episode (given the constraints I wanted in the prompt generated in step 3 above)

Pretty amazing that you can prototype all of this in under an hour.

Video Walk-Through

I added a lot more context and narrative to what I did in the below YouTube video. If you watch it, you’ll get a much better feel for how tools like Gumloop work, and how you might be able to apply it to problems you’re trying to solve.

Time To Build

It’s never been easier to get started building. I’m eager to hear what you’re trying to build, and where you could use some assistance. I love hearing how others are getting value out of these tools, and what they are struggling with. Let me know what you’re working on, and how I can help. I offer consulting services to companies who want to figure out how to adopt these emerging LLM capabilities. If you want to learn more about that, check out this page, and reach out to me.

Custom Podcasts with NotebookLM

2024-09-12T00:00:00+00:00

I’ve written before that self-guided learning is an area where LLM-based applications excel. It’s never been easier to simply open up ChatGPT and have a conversation about a topic you’re curious about – on your own terms, using your own language, knowledge, and workflow. I’ve recently found myself using the voice functionality of ChatGPT – with lots of back and forth – on my phone. I can do it anywhere – I can be on a walk, working in the yard, or just between other tasks. The form factor is amazing and convenient. There’s now really no such thing as a dumb question, as long as you’re willing to keep the conversation going.

I also love the podcast form factor. When I go for my daily run, I’m generally listening to a podcast about a topic I want to learn more about. I have a solid queue of episodes to listen to, but often times find myself looking for content on very specific topics. Recently, Google updated their NotebookLM product to let you create custom audio based on a set of documents you provide. I tried it out, and it’s fantastic.

What I did to create the podcast

I wanted to learn more about Nuclear Energy
I found a few different documents via web search that seemed high quality / a good starting point
I uploaded them to NotebookLM
I got a podcast

The Podcast

This took all of about 5 minutes. If you want to listen to ‘the podcast’, here it is:

Your browser does not support the audio element.

How I Did It

If you want to see a video of how I did it, here that is as well. In the video, I also show how I used the tool to create a quick FAQ doc from the weekly newsletter my kid’s school sends out. NotebookLM has a bunch of other use-cases that are evident once you dig into the tool – it’s very easy to use. It’s free to use, and another key benefit is that it just focuses on the content that you give to it (e.g. it’s not going to hallucinate an answer — or at least will be more accurate than just using a typical LLM tool)

Learning and Building

It’s never been easier to learn on your own terms. Whatever content you want, whatever medium you want – it’s there for you to assemble. I know it sounds a bit hyperbolic, but I truly think we’re living in the future right now. People who are willing to engage with these tools, and find ways that these tools help their workflows, while identifying their rough edges and working around those rough edges, are going to win.

I love hearing how others are getting value out of these tools, and what they are struggling with. Let me know what you’re working on, and how I can help. I offer consulting services to companies who want to figure out how to adopt these emerging LLM capabilities. If you want to learn more about that, check out this page, and reach out to me.

You can build a RAG system too

2024-08-22T00:00:00+00:00

In software engineering, we have the tendency to invent complex terminology to describe relatively simple concepts. I’ve experienced that as I’ve learned more about Large Language Models and what we can do with them. Here’s an example: Retrieval Augmented Generation (RAG). What’s that? To me, it is: Using a custom search engine to find information, and including that information in your initial model prompt before running it.

Even better than my definition — you can actually use LLMs to explain stuff like this to you. I sent the following prompt to ChatGPT:

Prompt: “Explain RAG to me in 2 sentences assuming I work in a technology company but am not a software engineer. Explain it simply, without any acronyms, and in a casual manner”

Response: “RAG is a smart way for AI to first look up the best information from a huge collection of data, and then use that info to create accurate and detailed responses. It’s like having a supercharged assistant that quickly finds and shares the right answers, making tasks like support or writing much easier.”

Not bad. This use-case of ‘explain it to me like this…’ is one I love to use for concepts I’m eager to learn about. You can have fun with it too:

Prompt: “Explain how the planets move. I am 5 years old’

Response: “The planets move around the Sun like a big merry-go-round. They keep spinning in circles, each one on its own path, and they never bump into each other because they stay in their own lanes.”

It’s always refreshing to me when I find content online that describes complex subjects in very accessible ways. For RAG, I found one recently, that I absolutely love. It’s a 50 minute YouTube video called ‘A gentle introduction to RAG’ paired with a Github Repo. It’s by Santiago - who puts out a lot of other high-quality content as well.

It walks you through step-by-step on how to build a baseline RAG demo. It’s under 100 lines of code. It walks you through all of the key building blocks, with great diagrams. The explanations are clear and succinct. If you do it, you’ll even learn about other tools like Llama, LangChain, Vector Stores, and Embeddings. You are well on your way to building something in under an hour — so much signal to noise here. I truly think anyone, with enough persistence, could work through this example in an hour or two. Like I always say, if you want to learn - Just Get Started.

The cool thing is, once you learn the baselines and build some curiosity — it opens up all of these other paths to explore. Watched the video, and think RAG is cool? You can go as far down the rabbit hole as you’d like to.

For example, this is a recent Podcast I listened to: Contrarian Guide to AI: Jason Liu on Betting Against Agents while Doubling Down on RAG & Fine-Tuning

Jason is truly an expert in this space, and shares a ton of great knowledge on how you can improve your RAG implementation. He briefly mentions Hypothetical Document Embeddings in the podcast. Sounds scary right? Let me check:

Prompt: “What does Hypothetical Document Embeddings mean for RAG? I am new to LLMs”

Response: “Hypothetical Document Embeddings in the context of RAG (Retrieval-Augmented Generation) refers to a technique where the AI creates “imaginary” summaries or representations of possible documents, even if they don’t exist in the database. These summaries help the AI think about what information could be out there and improve its ability to find or generate accurate and relevant responses, especially when the exact data isn’t readily available. It’s like giving the AI a better sense of what to look for when answering complex questions.”

Okay, I think I understand it a bit better now! And this podcast has like 5 other concepts that you might be eager to explore once you listen. It all branches out if you have curiosity and time. Jason also says something that I firmly believe:

There is a discussion about what is constant about skills that will be valuable, regardless of the current tech hype (I don’t recall the exact context but it was something like this). The response from Jason was something like: Can you clearly describe what you want? Can you write well? - I think this is more true than ever — even more so with the emergence of LLMs. Honestly, they make it more accessible for a broad range of skill-sets to build a thing. If you think you can’t get started because you don’t have the background of the skills, I think you’re wrong. Go try a thing, you’ll be pleasantly surprised.

It’s an amazing time to be building in this space. Are you or your company curious? Do you need some help, advice, or some cheering on to get you started? Don’t hesitate to reach out: mattstockton@gmail.com

Hello - You can build now

2024-08-21T00:00:00+00:00

There’s a video making waves this week in the AI rabbit hole community. If you haven’t yet taken the time to fully understand what’s become possible in the last year or so, maybe this video will be the nudge you need. I’ll summarize here, but I think it’s worth watching the video. It’s only 10 minutes.

Please Watch This

An 8-year old builds a legit working Harry Potter chatbot app in 45 minutes, from scratch.
The app connects to external APIs, and uses LLMs to chat back with you
She walks you through how she built it in real-time
She incrementally makes it more interesting and feature-rich, iterating on it and tweaking it to align with her vision.
Again, this is an 8 year old — building working software

So how is this happening? It’s because Large Language Models have unlocked an entirely new set of building blocks for people to make cool stuff — and these building blocks are accessible to an audience that — 2 years ago — would have experienced too much friction to get started. That friction is entirely gone. It’s never been easier to just get started.

She is using a tool called Cursor.AI, connected to a Language Model called Claude Sonnet 3.5. For about $40/month, you could be doing the exact same experimentation she is doing — and you can even experiment for free. When I say you, I really mean you — you, the one reading this.

I have so many thoughts on this, but overall — it distills down to excitement, awe, and just being incredibly inspired by what’s happening. There’s never been a better time to start building.

Here are some other reasons why I think this is exciting

People with problems to solve ‘in the real world’ have often been constrained by the tools accessible to them to solve these problems. Building software is now accessible to many more people. Thus, people with the true domain expertise and ability to deeply understand the problems, can now build tools to solve them. This ‘translation layer’ between domain experts and people who can build software has always caused friction. This puts more tools in the hands of domain experts.
The base costs for building a thing have been drastically reduced. In the past, some problems weren’t even worth solving with software, because getting started was too costly. These new technologies reduce the base costs, thus they change the cost equation related to when and why software can / should be built.
This opens up an entirely new way to learn. Many people are turned off from trying to build software, because ‘starting with code’ is just too overwhelming. Everyone’s brain works differently. Being able to learn in an applied way with no background knowledge necessary is incredible. You just need to be able to write clearly - that’s the first step. And this first step is accessible / achievable to so many more people.
It changes the skills that are valuable in the workforce. I’ve been in software for 20 years, but I’ll be the first to admit that there is a lot of unintentional (and some intentional) gate-keeping through complexity and obfuscation. I’ve always really valued communication skills. Being able to think and write clearly give you a huge advantage. Now, people who work on and build these skills can directly apply the skill themselves, instead of having to work through others.
The world is changing, and jobs are changing – and the major failure mode is not being willing to understand how.

I could go on and on here, but I hope you get the main point. A whole world of opportunity has been opened up for people to build cool stuff. Maybe this is you. If it is, then my only question is — did you watch the video? It’s 10 minutes, just get started.

I’m eager to hear how you or your company is trying to use these new tools. Need help getting started? Want some advice? I’d love to hear from you: mattstockton@gmail.com

Analyzing companies with LLMs - A simulated investor conversation

2024-08-20T00:00:00+00:00

Note: I’m trying something new in this post. I’ve included a PDF document at the bottom that contains some full input and output prompts from the experiments described below. It’s a lot of text, so I didn’t want to include it inline. Let me know if you read through that piece, and what you think!

Previous Experiment With Company Annual Report PDFs

LLM Long context windows allow you to analyze unstructured data with limited effort. As I’ve written before, I’m interested to research how this capability enables us to summarize and extract signals from company annual reports for investing purposes. In this previous post, I wrote about how I:

Used an LLM to build an attribute-based stock screening criteria
Identified 10 stocks matching the criteria using a free screening tool
Downloaded the annual report PDFs for the stocks
Used an LLM to build a detailed prompt to analyze the reports
Used Google Gemini’s Long Context Window capabilities to run the prompt and make some recommendations

Why you should care about LLMs - creativity

LLMs can be helpful in summarizing dense information and extracting knowledge. That’s essentially what the above is doing. This is useful, but I think the really cool stuff is within more creative utilizations of these tools. There are a few things I love about LLMs:

Their ‘brain’ has read mostly everything on the internet, and can use that information to generate responses
They can ‘think’ faster than everyone else
They are up to almost any challenge, however unique or absurd it is, and will never say no. They are willing to try anything.

Getting Creative With Company Analysis

To me, #3 is where the real fun stuff can happen. It’s just up to you to play along and have a bit of creativity as well. With that viewpoint, I followed up on the Small Cap Investing exercise and did the following:

I focused on the company that Gemini was most excited about (Esquire Financial Holdings)
I used Claude Sonnet to build a detailed LLM prompt that simulates a conversation about ESQ amongst well-known investors. Using prompts to build prompts continues to be one of my favorite use-cases. Here’s the prompt I used:

You are an expert prompt designer. I want you to design a prompt for the following problem:

I want to be able to upload the annual report of a company. I want to build a prompt that instructs the large language model to pretend to have a conversation between Warren Buffett and Joel Greenblatt. The large language model should take on the persona of these two very notable investors. The goal of the conversation between them would be to analyze and talk about the company from the annual report. Ultimately, I want them to come up with a recommendation to either invest or not invest in the company and give their reasonings why
I used the output of step 2 above, along with the ESQ annual report PDF, and plugged it into Gemini. The results were interesting. Here’s a snippet (and you can see the full conversation in the assets attached below)

Greenblatt: I agree. Their balance sheet is also quite robust, with a healthy liquidity position, low debt, and strong capital reserves. Their Tier 1 capital ratio is 14.13%, well above the regulatory requirements. That’s a good indicator of their financial strength and their ability to withstand economic shocks. But let’s see if their business has any staying power.

Buffett: That’s where I start getting curious. What’s their competitive advantage? Do they have an economic moat?

Greenblatt: They’ve got a couple. Firstly, they’ve carved out a niche within the legal industry, focusing on litigation-related lending. This gives them a specialized expertise and a deep understanding of their client base, which has allowed them to build strong relationships and capture a large market share. Second, their payment processing business is also profitable and provides a stable source of revenue. I’d say they have a decent moat in those areas.

What Else You Can Try

Fascinating stuff. And this is just scraping the surface with 10 minutes of work. There are many directions you can go from here. As an example, I asked the LLM to re-generate the conversation with Howard Marks and David Einhorn included as participants, which added some differeing views. As another quick follow-on, you could consider building a new prompt with the following instructions:

You are an expert prompt designer. I have used the prompt below (between double quotes) to generate a conversation between well-known investors about a small company. I have the output of the conversation, and I want to generate a new LLM prompt to analyze the conversation. The analysis should be conducted by a third-party, who is an expert at picking small companies to invest in. They should analyze the conversation between the well-known investors and try to poke holes in their analysis. They should be as detailed as possible and ultimately give their own assessment based on both the feedback of these key investors and the information they know about the company from the annual report

I tried the above, and had interesting results — but I leave it as an exercise to the reader instead of posting the outputs here. Because ultimately, the best way to truly learn about these tools and what they are capable of, is to use them. As I’ve said before — if you are curious about how LLMs can help you: Just Get Started

Need help with these technologies? Curious but not sure where to start? Think your company could use some help? Send me a note: mattstockton@gmail.com

Prompt Inputs / Output

I built a PDF that contains some of the inputs and outputs I mentioned above, you can download it here:

Stock Research Inputs and Ouputs

Disclaimer

The content provided in this post is for informational and educational purposes only and should not be construed as financial or investment advice. The opinions and analyses presented are those of the author and do not constitute recommendations to buy, sell, or hold any securities or investments. Please consult with a qualified financial advisor or conduct your own research before making any investment decisions. Investing involves risk, including the potential loss of principal.

Investing With Language Models

2024-08-19T00:00:00+00:00

How we can use information found in unstructured data to drive investment ideas is an area I’m really interested in. When used appropriately, language models can be great at summarization and categorization tasks. For those who want to build a better mental model of ‘what LLMs are good at’, I highly recommend Andrew Ng’s Prompt Engineering course or just take a spin through this Prompt Engineering Guide

I’ve written a few pieces about how we can use LLMs for investing over the last few months. One example I am excited about is leveraging long context windows to process unstructured company reports. Another example is the ability to spin up tools to visualize economic data, without needing to know how to code.

Today, I saw a really compelling use case, highlighted by Verdad in their newsletter. It’s a quick read so I encourage you to read it in full — but in summary:

Verdad Overview / What They Did

Japanese stocks have been undervalued for a long period of time, using standardized financial metrics like Price / Book. The drivers behind this are numerous and detailed. That’s a book in and of itself.
The Japanese government and stock exchanges have an interest to change this (lots to be read on this as well) — and took action, requiring companies to define their plans to increase company valuation.
Thousands of companies submitted plans on what they will do to increase valuation.
The plans submitted by companies — although likely structured differently — have a finite set of ‘menu options’ that range from ‘passive / do nothing’ to something more assertive and meaningful (e.g. allocating a specific dollar amount for dividends and repurchases)
Verdad used Language Model tools to:
1. Scrape, extract, and summarize information from the several thousand reports submitted to the stock exchange
2. Determine the type of action planned by each company (e.g. do nothing, repurchase shares, issue dividends, sell strategic holdings)
3. Assess the ‘robustness’ of the claim by the company (for example: is the repurchase plan actually meaningful in $ amount compared to company market cap?), and use this robustness to potentially re-categorize (e.g. if the company says it’s going to re-purchase, but the plans are too high-level, don’t give them ‘credit’ for being in this category)
4. Build a final categorization of the several thousand companies, and calculate the average stock return for each company in each category since the plans were filed

Verdad’s Findings

Not surprisingly, the companies with the most detailed and robust plans had the highest average returns. The companies who failed to disclose or lacked clarity in their actual plan had the lowest average returns. In hindsight, this result seems really predictable.

Why It Matters

So why does this matter? Because you could easily imagine having this investment hypothesis on day 1, when the above reports were submitted. It’s not a far leap to think that companies with more specific and detailed plans to increase valuation might perform better than companies that just say ‘we’re not sure yet.’ So if you had this hypothesis on day 1, and wanted to invest on it, how would you do it? This is why Language Models are so interesting. When orchestrated appropriately, Language Models can be great at the following, and can do them incredibly quickly:

Translation
Summarization
Categorization

These are exactly the 3 core steps that Verdad had to do in this analysis. Perhaps the companies in the highest returning bucket (e.g. ‘selling cross-share and strategic holdings’) could have been identified as soon as the reports were submitted. The average return is meaningful — the top bucket is 56%, the bottom bucket 21%

Where We Go Next

Before Language Models, you could have done this type of analysis (it’s not impossible) - I just think it would have taken a ton more time and would have involved a lot of manual / tedious work. I am not sure how long this analysis took, but if I had to place a bet, I think the ‘time-to-value’ here is a lot shorter than most people think. These models can be incredibly good at extracting information from large unstructured data, given good prompting techniques. You could imagine going a step further here as well — after identifying the companies with the most detailed and actionable plans, you could prompt the Language Model to do a discrete ranking of the companies based on some other investment hypothesis you wanted to mix in. There are so many other ways to augment this core idea.

I think we’re just scraping the surface of how we can apply these technologies, and it’s super-fun to see folks like Verdad using these tools, and getting meaningful / interesting results. What are you building? What are you interested in related to AI and Language Models? How can I help you? Reach out at mattstockton@gmail.com

Inviting AI to the table - Building Software Edition

2024-08-15T00:00:00+00:00

I’ve written in the past on how I love Ethan Mollick’s idea of “always inviting AI to to the table”. He wrote the following in this post:

I know this may not seem particularly profound, but “always invite AI to the table” is the principle in my book that people tell me had the biggest impact on them. You won’t know what AI can (and can’t) do for you until you try to use it for everything you do.

It’s energizing to come across people who are really trying to live this idea — particularly people that are building software. I was speaking with Alex Bartling recently, and he explained to me a way he and his team are using LLMs to help build software. The software itself is really incredible, and is making a huge positive impact for mental health professionals. It’s an AI-driven scribe tool to help providers optimize and structure the note-taking process. I am not well-versed in this space, but based on the customer-testimonials I’ve learned about, it is an incredible product. It fundamentally changes how a mental health professional can spend their time. It allows them to spend more time on the deep activities that really matter with their patients — and less time on operations and overhead. It’s a perfect use-case for Large Language Models.

But Alex and his team have gone far beyond using LLMs for its voice transcription and summarization capabilities. They are using LLMs to help customers adapt the product to their own needs, and are even using LLMs to help understand and categorize their customer’s most desired features. So how are they doing this? In two creative ways:

Human In The Loop Prompt Feedback / Re-Execution

When the product uses LLMs and prompting to create an output for the user, there is a step where the user can provide unstructured feedback on that output. The UI gives the user guidance on what feedback is helpful via a few structured examples — but ultimately, the user can type in whatever they want.
Then, the prompts are creatively re-run through the LLM with this extra guidance (with some good prompt engineering of course). So basically:
1. Application has an initial prompt that incorporates the user data (e.g. transcribed audio)
2. Application generates output via LLM and a chain of prompts
3. User sees output, and provides feedback on that output via unstructured text with guidance in the UI
4. Application re-runs prompt –with additional creative prompting and including the user’s guidance – and updates the output
You can think of this as allowing the user to behave as the ‘follow-up prompt engineer’ with some additional guardrails.

Feedback Classification / Clustering via LLMs -> Product Feature Feedback Loop

What’s more awesome is this aspect, which truly represents trying to “invite AI to the table.” Alex and his team are periodically processing the unstructured feedback (step 3 above) at scale, and clustering it / classifying it using LLMs. From this, they can automatically see patterns in the feedback, and then they modify the software to allow the user to select this feedback inline versus having to type it in an unstructured text box. An example here might be clarifying:

The software receives 1000s of unstructured chunks of feedback from users. Let’s hypothetically say a number of them are as follows:
1. “I want the final text to be shorter”
2. “This is too long of a note for section 2 of the document”
3. “The output is too wordy”
The unstructured output is run through a creative prompt that is intended to distill the feedback into a taxonomy of feedback areas, ranked by frequency
From this, the LLM recognizes that a new toggle could be added to the UI, which has the label “Summary Text Length” and the following options.
1. Brief
2. Default
3. Descriptive
The team updates the software with this new toggle, and ships the new version. Now the user can just click a button to provide the feedback next time (or even select it before the LLM runs the first time)

What Alex Had To Say

Amazing stuff. If you have that background thread running of ‘how might AI help me here?’ you can discover some really cool things. I asked Alex to describe how he thinks about these concepts. I consider him to be on the leading edge of building software that uses AI and LLMs. Here’s what he told me below (I think he might have used ChatGPT to help him clarify his thinking — he is deep down the rabbit hole 🙂)

The concept of “human in the loop” influences how users interact with AI-powered software. Unlike traditional software with heavily parameterized options, AI allows for unbounded, natural language instructions. However, this new interaction style can be confusing. Initially, when we provided a textbox for users to input additional context, they found it unfamiliar. To help, we introduced example prompts, which users often relied on at first. Over time, they realized they could say anything, creating a liberating and magical experience.

It’s interesting because, naturally, you start to collect all this unstructured feedback, which previously would have been challenging to make sense of. With AI, we can analyze and classify the text, extracting the most common “settings” to present to users in the UI.

The role of a software engineer in the age of AI has evolved. Even before AI, it was never just about sitting at a desk and cranking out code—it was always about solving problems. The code is just the tool; the real value has always been in understanding the problem and finding innovative solutions. Now, with AI, that value lies even more in being a visionary, a shepherd guiding the potential of this transformative technology. It’s about collaborating with domain experts to solve their toughest challenges. Many don’t yet grasp the true capabilities of AI, and it’s our responsibility to illuminate that path.

When I joined my current company, no one was tapping into AI. But as people started to see what I was doing with it, the enthusiasm became infectious, and gradually more began to adopt it. This is vital because when a team embraces an AI-first mindset, you get a richer diversity of perspectives. People begin to discover new possibilities and share their insights, creating a cycle of innovation. The mindset shifts from ‘how can we do this?’ to ’how can we leverage AI to do this?’—and that’s where real breakthroughs happen.

Final Thoughts

I love how Alex thinks about this stuff. Having the background thread running at all times has allowed him and his team to achieve incredible results with a small number of people in a short amount of time. I think this is the future of building. What are you or your company trying to build? How are you using LLMs? What are you or your team most curious about or where do you need help? I’d love to hear from you: mattstockton@gmail.com