Videos within multi-modal models - A whole new world of opportunity
Yesterday I wrote about how long context windows are game-changers, enabling us to solve problems that were essentially unsolvable before.
Another capability I’m really excited about is video within LLMs. You can now drop a video into an LLM, and the LLM can do mind-boggling things with it. People aren’t thinking enough about what this means in terms of the task categories that can now be augmented with LLMs.
An idea I’ve been playing around with: can you use LLM video capabilities to optimize the quoting process for electricians and other home service providers?
Here’s a video of me going from an iPhone video walk-through of a hypothetical electrical project (an EV charger install) at my house to a detailed quote (including scope of work, materials, labor, etc.) in under 10 minutes. I’ve also included the detailed inputs / outputs from the LLM steps at the end of this post.
What I Did
1. Created the video, walking through what I thought the scope of the project might be, and showing areas in my yard/house/garage where components might need to be installed
2. Used Claude to create a detailed prompt from a bulleted list of instructions. Using LLMs as prompt-builders is a really awesome use-case. They can build better prompts than we can.
3. Fed the video and prompt from steps 1 and 2 into Google Gemini 1.5, via Google AI Studio
4. Gemini outputs a detailed quote, which includes:
   - Project Overview
   - Step-by-Step project plan
   - Materials List broken down by project phase (even estimating length of conduit required based on just the video)
   - Labor Estimate
   - Follow-up questions and concerns
   - A total estimated cost for the project (including costs associated with local codes and permits)
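To make the prompt-building step concrete, here is a minimal Python sketch of turning a bulleted list of requirements into a request for a prompt-builder model. The function name `build_meta_prompt` and the wording of the request are my own assumptions for illustration, not the prompt used in this post; the section names come from the quote contents listed above.

```python
# Sections the final quote contained, taken from the list above.
QUOTE_SECTIONS = [
    "Project overview",
    "Step-by-step project plan",
    "Materials list broken down by project phase",
    "Labor estimate",
    "Follow-up questions and concerns",
    "Total estimated cost, including local codes and permits",
]


def build_meta_prompt(project: str, sections: list[str]) -> str:
    """Turn a bulleted list of requirements into a request for a prompt-builder LLM.

    Hypothetical wording, for illustration only.
    """
    # One "- item" line per requested quote section.
    bullets = "\n".join(f"- {section}" for section in sections)
    return (
        "Write a detailed prompt for a video-capable model. "
        f"The model will watch a walk-through video of this project: {project}.\n"
        "The prompt should ask for a quote with these sections:\n"
        f"{bullets}\n"
    )


meta_prompt = build_meta_prompt("EV charger install", QUOTE_SECTIONS)
print(meta_prompt)
```

The printed text is what you would paste into Claude. The prompt Claude writes back, together with the video, is what goes into Gemini; that part happened in the Google AI Studio interface and is not shown here.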
Final Thoughts
If not the quoting process, are there other aspects of these jobs (or adjacent jobs) that can be made more effective with multi-modal models? I have not done any product discovery on this, and have not talked to any potential customers, so it’s possible that this idea doesn’t make sense and wouldn’t gain traction. But I believe there are many real use-cases to discover where video makes jobs and tasks better. I’m passionate about connecting the dots between people experiencing these problems (e.g. electricians) and people who can help them build a thing to make it easier (e.g. me).
It’s mind-boggling to me that this is possible in under 10 minutes with just a video and some detailed prompting. If you were to build a product around this idea, you could add a lot more user control and figure out exactly what the user wants to do. That’s all hard work, and likely to lead to a lot of dead ends. But the point is that the capability itself unlocks a whole set of product ideas that weren’t possible before. As a meta-point, building a product around this is also a lot easier now, because you can lean on LLMs to help you do it. It’s truly an incredible time to be building and learning in this space.
What are you building? Where do you or your company need help? I’d love to hear from you, and figure out a way to work together: mattstockton@gmail.com
Detailed Materials
- The final quote output from Google Gemini
- The prompt generated by Claude
- The input to Claude to generate the above prompt: