Introduction
In this video, Han outlines three simple techniques to massively cut AI automation costs, focusing on input and output token costs. He notes that processing costs will be covered in a separate video.
Key takeaways
- Big savings come from reducing input prompts and avoiding repeated system prompts. 💡
- Batch inputs to fit within context windows and save tokens without sacrificing quality. 🧠
- Filtering, summarizing, and data compression can dramatically shrink inputs for reuse. 🪶
- Tracking tokens and costs is essential and doable without special tools. 💳
- Real-world Reddit workflow shows how batching and aggregation affect cost and context limits. 🧰
Main problem and cost levers discussed
- The largest cost is prompt/input tokens, since the system prompt is re-sent with every call. Output and processing costs exist, but this video emphasizes input and output token efficiency; AI processing costs are deferred to a separate video.
- Cost references include per-token pricing (input and output) and how batching/system prompts change total spend. Practical context window limits cap how much data can be processed meaningfully.
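The per-token cost lever above can be sketched with simple arithmetic. This is a minimal estimate, not the video's exact figures: the prices and token counts below are placeholder assumptions (actual rates vary by model and provider), but the structure shows why re-sending a long system prompt on every call dominates spend.

```python
# Hypothetical per-token prices -- actual rates vary by model and provider.
PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000   # e.g. $0.15 per 1M input tokens
PRICE_PER_OUTPUT_TOKEN = 0.60 / 1_000_000  # e.g. $0.60 per 1M output tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single model call."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# Illustrative numbers (assumed, not from the video):
system_prompt_tokens = 1_500   # re-sent on every individual call
per_item_tokens = 300          # one Reddit post's worth of input
per_item_output = 200          # tokens generated per item
n_items = 126

# One call per item: the system prompt is paid for 126 times.
individual = n_items * call_cost(system_prompt_tokens + per_item_tokens,
                                 per_item_output)

# Batches of 4: the system prompt is paid for only once per batch.
batched = (n_items / 4) * call_cost(system_prompt_tokens + 4 * per_item_tokens,
                                    4 * per_item_output)
```

Even with made-up prices, `batched` comes out well below `individual`, because the fixed system-prompt tokens are amortized across the batch.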
Techniques and steps to reduce token costs
- Minimize input tokens
- Use shorter prompts; reduce repeated system prompts; combine where appropriate.
- Use input batching
- Batch sizes of about 3–4 items per call; balance with the model’s context window limits.
- Data filtering and summarization to shrink inputs
- Filter data by time/score when possible; if not, summarize inputs to core insights.
- Caution on aggregation vs batching
- Aggregating all data can hit context-window tipping points and degrade quality; batching preserves quality while saving tokens.
- When and how to summarize inputs
- Summarize inputs to keep outputs useful and reusable for future tasks (e.g., turning raw posts into concise highlights or insights).
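The batching technique above (one system prompt plus several user prompts per call) can be sketched as follows. This is an illustrative outline, not the video's actual workflow; the function names and message format are assumptions modeled on the common chat-message structure.

```python
def batch(items, size=4):
    """Split items into chunks so each API call carries one system prompt."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_messages(system_prompt, chunk):
    """One system message, then each batched item as a numbered user message."""
    messages = [{"role": "system", "content": system_prompt}]
    for n, item in enumerate(chunk, 1):
        messages.append({"role": "user", "content": f"Item {n}: {item}"})
    return messages

# Usage: iterate over batches instead of calling the model once per item.
# for chunk in batch(posts, size=4):
#     messages = build_messages("Classify each item...", chunk)
#     ...send messages to the model...
```

Numbering the items inside the prompt makes it easier to ask the model for per-item answers that can be split apart afterwards.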
Real-world example from the transcript (Reddit workflow)
- 126 Reddit threads were scanned to find relevant posts for promotion.
- Costs illustrated: per-call input token counts, and how consolidating repeated system prompts into a single prompt per batched call reduces spend (roughly $0.50–$1.00 saved per batch versus individual calls).
- Context window limits highlighted: GPT-4 is quoted at 128,000 tokens total, with only about 64,000 tokens effective in practice; exceeding that tipping point can hurt quality. Smaller models have far lower limits (e.g., roughly 4,000 usable tokens for Llama 2 7B), and effective lengths vary by model.
- Batch approach: use one system prompt plus several user prompts per call; the ideal batch size balances token savings with context-window limits.
- Data filtering example: filtering Reddit data by time and score reduced 120 posts to 14 relevant ones, yielding big token savings.
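The time/score filter described above can be sketched in a few lines. This is a hypothetical reconstruction, assuming each post carries Reddit-style `score` and `created_utc` fields; the thresholds are placeholders to be tuned per workflow.

```python
import time

def filter_posts(posts, min_score=50, max_age_days=7):
    """Keep only recent, high-scoring posts before sending anything to the model.

    Filtering cheaply in code first means the model never sees
    (and you never pay tokens for) irrelevant posts.
    """
    cutoff = time.time() - max_age_days * 86_400  # seconds in a day
    return [p for p in posts
            if p["score"] >= min_score and p["created_utc"] >= cutoff]
```

Dropping irrelevant posts before the API call is where the bulk of the savings comes from, since those tokens are never sent at all.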
Practical tips for cost tracking
- Simple approach: track input tokens, output tokens, and total cost (no tools required).
- Two-tab structure: tokens (workflow ID, execution ID, client ID, model, input tokens, output tokens, total tokens, cost) and observability (which tools were called, results).
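The tokens tab above can be kept as a plain CSV appended to after every call, with no special tooling. A minimal sketch, assuming the column names from the two-tab structure (the function name and file path are illustrative):

```python
import csv
import os

# Columns mirroring the "tokens" tab described above.
TOKEN_FIELDS = ["workflow_id", "execution_id", "client_id", "model",
                "input_tokens", "output_tokens", "total_tokens", "cost"]

def log_call(path, row):
    """Append one call's token usage to a CSV log, writing a header on first use."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=TOKEN_FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```

A second file with the same pattern can serve as the observability tab (which tools were called, and their results).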
Takeaways and next steps
- Start by trimming input prompts and eliminating repeated system prompts; then layer in batching and selective filtering.
- Maintain a lightweight token/cost log to monitor savings and iterate on batch sizes and filtering rules.
- Consider summarization to reuse outputs and reduce repeated token usage.
Caveats and disclaimers
- Some optimizations depend on data availability (e.g., time/score filters) and model context-window behavior; future videos may cover deeper processing optimizations.