Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains. You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection. Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows: Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it. Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. 
And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement. Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses. Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend: - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023) - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023) - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024) [Original text: https://lnkd.in/g4bTuWtU ]
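The generate, critique, rewrite loop described in the Reflection post above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the author's implementation: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and the `reflect` function, the rewrite prompt wording, and the `rounds` parameter are illustrative names that do not appear in the post.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to model text.
LLM = Callable[[str], str]

# Critique prompt, following the wording quoted in the post.
CRITIQUE_PROMPT = (
    "Here's code intended for task {task}:\n{code}\n"
    "Check the code carefully for correctness, style, and efficiency, "
    "and give constructive criticism for how to improve it."
)

# Rewrite prompt: an assumed wording that passes back the code and the feedback.
REWRITE_PROMPT = (
    "Here's code intended for task {task}:\n{code}\n"
    "Here is feedback on it:\n{feedback}\n"
    "Use the feedback to rewrite the code."
)


def reflect(llm: LLM, task: str, rounds: int = 2) -> str:
    """Generate a draft, then alternate critique and rewrite `rounds` times."""
    code = llm(f"Write code to carry out task {task}.")
    for _ in range(rounds):
        feedback = llm(CRITIQUE_PROMPT.format(task=task, code=code))
        code = llm(REWRITE_PROMPT.format(task=task, code=code, feedback=feedback))
    return code
```

Each round costs two extra model calls, so `rounds` trades latency and cost against the chance of further improvement. The same loop shape would apply if the critique step were replaced by unit-test output or by a second, separately prompted agent.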
Performance Optimization Techniques
-
Imagine using video game technology to solve one of the toughest challenges in nuclear fusion — detecting high-speed particle collisions inside a reactor with lightning-fast precision. A team of researchers at UNIST has developed a groundbreaking algorithm inspired by collision detection in video games. This new method dramatically speeds up identifying particle impacts inside fusion reactors, essential for improving reactor stability and design. By cutting down unnecessary calculations, the algorithm enables real-time visualization and analysis, paving the way for safer and more efficient fusion energy development. 🎮 Gaming tech meets fusion science: The algorithm borrows from video game bullet-hit detection to track particle collisions. ⚡ 15x faster detection: It outperforms traditional methods by speeding up collision detection by up to fifteen times. 🔍 Smart calculation: Eliminates 99.9% of unnecessary computations with simple arithmetic shortcuts. 🌐 3D digital twin: Applied in the Virtual KSTAR, a detailed Korean fusion reactor virtual model. 🚀 Future-ready: Plans to leverage GPU supercomputers for faster processing and enhanced reactor simulations #FusionEnergy #VideoGameTech #ParticleDetection #NuclearFusion #Innovation #AIAlgorithm #VirtualKSTAR #CleanEnergy #ScientificBreakthrough #HighSpeedComputing https://lnkd.in/gfcssNTC
-
I've recently suffered a major career setback. Since I teach about high performance and career growth, I want to share how I am addressing it. One day you will need this recipe yourself! My goal in my current "career" is to reach as many people as I can, and to help them achieve career success and satisfaction. For the last three years, the way to do this has been through LinkedIn. Unfortunately, LinkedIn recently made some unknown changes to their algorithm. Other Top Voices and I have noticed a drop of 70% to 80% in the reach of our posts. Since my goal is to share my knowledge with more people, that means my goal just took an 80% hit. In general, setbacks in performance are either due to: A) Something we did Or B) Something external, outside our direct control Mistakes, poor decisions, and missed deadlines are examples of A. They are in our control. Things like Covid, high interest rates, and reorganizations at work are examples of B, outside our control. LinkedIn's change is also case B, outside my control. When a setback comes from something in your control, you know clearly what you did wrong and what you need to change to restore your performance and progress. Fixing your own issues may take time and be difficult, but you know what to do. When the setback is due to something outside your control, you do not know how to fix the issue. So, how can we react when our performance is shattered and we do not know why? Here is my recipe: 1. Allow yourself a fixed amount of time to grieve (and complain if you wish). Emotions are real, and before you can move on you will need to sit with those emotions. But, do not get stuck in them. Curse your bad luck, pout for a minute, etc. Then, move to the next step. 2. Refocus on your core value. Whatever happened, go back to how you define high performance to ensure it is still relevant. I admit, I slipped into defining my own performance by how many people viewed my LinkedIn posts. This was a mistake. 
My mission is to help others, so getting views is a proxy, not a result. And, using LinkedIn is just a method for the mission, not the mission itself. 3. Adapt your core value if you must (if its value has decreased). In my case, the value of what I offer hasn't changed, the external delivery system has. 4. Once you adapt and/or increase your value, find new ways to deliver it if necessary. Luckily, I have other options for reaching people: my Substack newsletter, YouTube, etc. Since Substack has been such a good partner recently, I will start there. I have also refocused how I write on LinkedIn to make every post focused on my goal. 5. Test, measure, adapt, repeat! Really, this step is everything. Once you get past the grief, jump into action in this loop. Nothing can stop you if you keep working to refine, deliver, and showcase your core value. Comments? Here's my newsletter, which is my next area of investment: https://lnkd.in/gXh2pdK2
-
Quantization alone is not enough when fine-tuning a model! Even at the lowest precisions, most of the memory is taken by the optimizer state during training. One great strategy that emerged recently is QLoRA. The idea is to apply LoRA adapters to a quantized model. The optimizer state is then computed only for the adapter parameters instead of the whole model, which saves a large amount of memory! The base model parameters are converted from BFloat16 / Float16 to 4-bit NormalFloat (NF4). This quantization strategy comes from the realization that trained model weights tend to be normally distributed, so we can place the quantization buckets using that fact. This compresses the model parameters without too much information loss. When we quantize a model, we need to store the quantization constants to be able to dequantize it. We usually store them in Float32 to minimize dequantization error. To compress the model further, we perform a double quantization that quantizes the quantization constants themselves to 8-bit floats. During the forward pass, because the input tensors are in BFloat16 / Float16, we need to dequantize the quantized parameters to perform the operations. During the backward pass, the weights are dequantized in the same way to propagate gradients through each layer, but gradients are computed only for the adapter parameters, so the base weights are never updated and stay stored in 4 bits.
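The idea of placing quantization buckets according to a normal distribution can be illustrated with a small, simplified sketch. It is not the exact NF4 codebook used by QLoRA: the levels here are simply evenly spaced quantiles of a standard normal, rescaled to [-1, 1], and the function names are hypothetical. Each block of weights stores one code per weight plus one scale (the block's absolute maximum), which plays the role of the quantization constant mentioned above.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def normal_quantile_levels(bits: int = 4) -> List[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    dist = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    raw = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(weights: Sequence[float],
                   levels: Sequence[float]) -> Tuple[List[int], float]:
    """Return one level index per weight plus the block's scale (absmax)."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard for an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes: Sequence[int], absmax: float,
                     levels: Sequence[float]) -> List[float]:
    """Map codes back to approximate weights for use in a matmul."""
    return [levels[c] * absmax for c in codes]
```

Because the levels are denser near zero, where most normally distributed weights sit, typical weights are reconstructed with smaller error than a uniform 16-level grid would give. Double quantization would apply a second, coarser quantizer to the list of `absmax` values across blocks.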
-
A sluggish API isn't just a technical hiccup – it's the difference between retaining and losing users to competitors. Let me share some battle-tested strategies that have helped many achieve 10x performance improvements: 1. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆 Not just any caching – but strategic implementation. Think Redis or Memcached for frequently accessed data. The key is identifying what to cache and for how long. We've seen response times drop from seconds to milliseconds by implementing smart cache invalidation patterns and cache-aside strategies. 2. 𝗦𝗺𝗮𝗿𝘁 𝗣𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 Large datasets need careful handling. Whether you're using cursor-based or offset pagination, the secret lies in optimizing page sizes and implementing infinite scroll efficiently. Pro tip: Always include total count and metadata in your pagination response for better frontend handling. 3. 𝗝𝗦𝗢𝗡 𝗦𝗲𝗿𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 This is often overlooked, but crucial. Using efficient serializers (like MessagePack or Protocol Buffers as alternatives), removing unnecessary fields, and implementing partial response patterns can significantly reduce payload size. I've seen API response sizes shrink by 60% through careful serialization optimization. 4. 𝗧𝗵𝗲 𝗡+𝟭 𝗤𝘂𝗲𝗿𝘆 𝗞𝗶𝗹𝗹𝗲𝗿 This is the silent performance killer in many APIs. Using eager loading, implementing GraphQL for flexible data fetching, or utilizing batch loading techniques (like DataLoader pattern) can transform your API's database interaction patterns. 5. 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲𝘀 GZIP or Brotli compression isn't just about smaller payloads – it's about finding the right balance between CPU usage and transfer size. Modern compression algorithms can reduce payload size by up to 70% with minimal CPU overhead. 6. 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻 𝗣𝗼𝗼𝗹 A well-configured connection pool is your API's best friend. 
Whether it's database connections or HTTP clients, maintaining an optimal pool size based on your infrastructure capabilities can prevent connection bottlenecks and reduce latency spikes. 7. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗟𝗼𝗮𝗱 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻 Beyond simple round-robin – implement adaptive load balancing that considers server health, current load, and geographical proximity. Tools like Kubernetes horizontal pod autoscaling can help automatically adjust resources based on real-time demand. In my experience, implementing these techniques reduces average response times from 800ms to under 100ms and helps handle 10x more traffic with the same infrastructure. Which of these techniques made the most significant impact on your API optimization journey?
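The cache-aside strategy from item 1 of the API post above can be sketched as follows. This is a minimal in-process sketch, assuming a plain dictionary in place of Redis or Memcached. The `CacheAside` class, its `ttl_seconds` parameter, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class CacheAside:
    """Tiny in-process cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, load: Callable[[], Any]) -> Any:
        """Return the cached value, calling `load` only on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = load()  # e.g. the slow database query or remote call
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry after the underlying data changes."""
        self._entries.pop(key, None)
```

The application, not the cache, owns the loading logic, which is what distinguishes cache-aside from read-through caching. Choosing `ttl_seconds` and calling `invalidate` on writes are the "what to cache and for how long" decisions the post refers to.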
-
💥 Data Engineer Interview Killer: Handling 500 GB Daily with PySpark Data pros, have you ever been asked this in an interview? 👉 “How would you efficiently process a 500 GB dataset in PySpark, and how would you size your cluster?” It’s one of my favorite questions because it blends architecture, optimization, and cost awareness into one real-world scenario. Here’s how I’d break it down 👇 💡 The 5-Step Optimization Blueprint 1️⃣ Format First: The Foundation of Speed 🚀 Action: Convert raw data (CSV/JSON) into Parquet or Delta Lake right away. Why: Columnar storage, compression, and predicate pushdown drastically cut I/O. 👉 This single step often gives the biggest performance boost. 2️⃣ Partitioning Math: Define Your Parallelism 🧮 Each Spark task should process around 128 MB. Calculation: 500 GB × 1024 MB/GB ÷ 128 MB/partition = 4,000 partitions ➡️ Spark now has ~4,000 tasks to parallelize, enough to keep every core busy. 3️⃣ Cluster Sizing: Predictable Execution 🧠 Let’s assume: 10 worker nodes, 8 cores & 32 GB RAM per node. Parallelism: 10 nodes × 8 cores = 80 cores total. With Spark's default of one core per task, 80 tasks run concurrently. (The common "2–3 tasks per core" guideline is about how many partitions to create, not how many run at once.) Total time: 4,000 ÷ 80 = 50 waves of execution. At ~1–2 min per wave → ~50–100 minutes total runtime. That’s how you explain both scaling and efficiency in an interview. 4️⃣ Memory Management: Avoid the Spill 💾 Plan for roughly 3× data size during joins and shuffles. Estimate: (500 GB × 3) ÷ 10 nodes = 150 GB per node. With only 32 GB per node, Spark will spill to disk, which is fine if SSD-backed. For critical workloads, upgrade to 64 GB nodes to reduce spilling. 5️⃣ Performance Tweaks: Fine-Tuning ⚙️ spark.sql.shuffle.partitions = 400 spark.sql.adaptive.enabled = true ✅ Use Broadcast Joins for small lookup tables. ✅ Implement Incremental Loads (Delta Lake makes this easy). ✅ Avoid full reloads: only process what’s changed. 
🧭 The Real Data Engineering Challenge Optimizing Spark isn’t about adding more compute — it’s about finding the sweet spot between performance, cost, and scalability. 🔥 Question for you: If you got this same question in an interview — how would you size your cluster or optimize it differently? 👇 I’ll be sharing my cost–benefit breakdown in the next post — how to choose between scaling up vs scaling out for real workloads. #PySpark #ApacheSpark #Databricks #BigData #DataEngineering #Optimization #InterviewPrep #Azure
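The partition and wave arithmetic from the PySpark post above can be written as a small back-of-envelope calculator. This is a sketch under stated assumptions: it assumes one running task per core (Spark's default when `spark.task.cpus` is 1), uses 1.5 minutes per wave as the midpoint of the post's 1 to 2 minute range, and ignores shuffles, skew, and scheduling overhead. The `size_job` function and its parameter names are hypothetical.

```python
import math


def size_job(data_gb: float, partition_mb: int = 128, nodes: int = 10,
             cores_per_node: int = 8, minutes_per_wave: float = 1.5) -> dict:
    """Back-of-envelope partition, wave, and runtime estimate."""
    partitions = math.ceil(data_gb * 1024 / partition_mb)
    slots = nodes * cores_per_node  # assumption: one running task per core
    waves = math.ceil(partitions / slots)
    return {
        "partitions": partitions,
        "concurrent_tasks": slots,
        "waves": waves,
        "est_minutes": waves * minutes_per_wave,
    }


print(size_job(500))
# {'partitions': 4000, 'concurrent_tasks': 80, 'waves': 50, 'est_minutes': 75.0}
```

Changing `nodes` or `cores_per_node` shows how scaling out shortens the number of waves, which is the trade-off between cost and runtime that the interview question is probing.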
-
I spent 17 hours optimizing an API endpoint to make it 15x faster. Here's a breakdown of what I did. I worked on an e-commerce application. One endpoint was crunching some heavy numbers. And it wasn't scaling well. The endpoint calculated a report. It needed data from several services to perform the calculations. This is the high-level process I took: - Identify the bottlenecks - Fix the database queries - Fix the external API calls - Add caching as a final touch 𝗦𝗼, 𝗵𝗼𝘄 𝗱𝗼 𝘆𝗼𝘂 𝗶𝗱𝗲𝗻𝘁𝗶𝗳𝘆 𝘁𝗵𝗲 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸𝘀 𝗶𝗻 𝘆𝗼𝘂𝗿 𝘀𝘆𝘀𝘁𝗲𝗺? If you know the slowest piece of code, you will know what to fix. The 80/20 rule works wonders here. Improving 20% of the slowest code can yield an 80% improvement. The fun doesn't stop here. Performance optimization is a continuous process and requires constant monitoring and improvements. Fixing one problem will reveal the next one. The problems I found were: - Calling the database from a loop - Calling an external service many times - Duplicate calculations with the same parameters Measuring performance is also a crucial step in the optimization process: - Logging execution times with a Timer/Stopwatch - If you have detailed application metrics, even better - Use a performance profiler tool to find slow code 𝗙𝗶𝘅𝗶𝗻𝗴 𝘀𝗹𝗼𝘄 𝗱𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗾𝘂𝗲𝗿𝗶𝗲𝘀 A round trip between your application and a database or service can last 5-10ms (or more). The more round trips you have, the more it adds up. Here are a few things you can do to improve this: - Don't call the database from a loop - Return multiple results in one query 𝗖𝗼𝗻𝗰𝘂𝗿𝗿𝗲𝗻𝘁 𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗶𝘀 𝘆𝗼𝘂𝗿 𝗳𝗿𝗶𝗲𝗻𝗱 I had multiple asynchronous calls to different services. These services were independent of each other. So, I called these services concurrently and aggregated the results. This simple technique helped me achieve significant performance improvement. 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗮𝘀 𝗮 𝗹𝗮𝘀𝘁 𝗿𝗲𝘀𝗼𝗿𝘁 Caching is an effective way to speed up an application. But it can introduce bugs when the data is stale. Is this tradeoff worth it? 
In my case, achieving the desired performance was critical. You also have to consider the cache expiration and eviction strategies. A few caching options in ASP.NET: - IMemoryCache (uses server RAM) - IDistributedCache (Redis, Azure Cache for Redis) What do you think of my process? Would you do something differently? --- Subscribe to my weekly newsletter to accelerate your .NET skills: https://bit.ly/3R9JnT5
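The "call independent services concurrently and aggregate the results" step from the post above can be sketched like this. The original work was done in .NET; this is a Python analogue using `asyncio.gather`, and the `fetch_all` helper and the stand-in service functions are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def fetch_all(
    calls: Dict[str, Callable[[], Awaitable[Any]]]
) -> Dict[str, Any]:
    """Start every independent call at once and collect results by name."""
    names = list(calls)
    # gather schedules all coroutines together and preserves argument order.
    results = await asyncio.gather(*(calls[name]() for name in names))
    return dict(zip(names, results))


if __name__ == "__main__":
    async def prices() -> list:  # hypothetical stand-in for a pricing service
        await asyncio.sleep(0.05)
        return [10, 20]

    async def stock() -> dict:  # hypothetical stand-in for an inventory service
        await asyncio.sleep(0.05)
        return {"sku-1": 3}

    print(asyncio.run(fetch_all({"prices": prices, "stock": stock})))
```

With sequential awaits the total wait is roughly the sum of the call latencies; with concurrent calls it is closer to the slowest single call. This only helps when the calls do not depend on each other's results.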
-
Study: Generators May Provide a Faster Path to Power A new study by energy researchers suggests that data centers could get faster access to power by adopting load flexibility, agreeing to briefly curtail utility usage and shift to generator power. In an in-depth analysis of the U.S. power grid, researchers at Duke University estimate that this approach could tap existing headroom in the system to more quickly integrate at least 76 gigawatts of new loads, arguing that even a small reduction in peak demand could reduce the need for new investments in transmission and generation capacity - as well as the need to pass on those investments to ratepayers. Data centers are all about uptime, and thus have been resistant to innovations that create additional risk around reliability. But current power constraints in key markets, along with growing demand for AI training workloads (which may be more interruptible than cloud or colocation), have prompted the industry to explore load flexibility options. Last year the Electric Power Research Institute (EPRI) launched the DCFlex project to work with utilities and a number of data center operators - including Compass Datacenters, QTS Data Centers, Google and Meta - on pilot projects for load flexibility. The Duke study, titled "Rethinking Load Growth," puts some interesting numbers on the upside potential. Their findings: - 76 gigawatts of new load could be enabled by an annual load curtailment rate of 0.25% of maximum uptime, with curtailment events lasting 1.7 hours on average. - An annual curtailment rate of 0.5% (events averaging 2.1 hours) could enable 98 GW of new load, while a rate of 1.0% (2.5 hours) could boost that to 126 GW. - A 0.5% curtailment rate could enable 18 GW in PJM and 10 GW in ERCOT, the research finds. At least one hyperscaler seems open to the idea. 
“This is a promising tool for managing large new energy loads without adding new generating capacity and should be part of every conversation about load growth,” said Michael Terrell, Senior Director of Clean Energy and Carbon Reduction at Google, in a LinkedIn post. With the acceleration of the AI arms race, speed-to-market is now a top priority, along with a competitive opportunity cost for companies that are unable to deploy new capacity. There are tradeoffs to consider (including more emissions), but the Duke paper will likely advance the conversation. Duke study: https://lnkd.in/eS3s_pvk Background on DCFlex: https://lnkd.in/euK746Zy
-
OpenAI released a guide on how to improve LLMs’ accuracy and consistency. Here are some lesser-known tactics I found very interesting: 1. Prompt Baking: This involves logging inputs and outputs during a pilot phase to identify the most effective examples. You can then refine and prune that data into a more efficient training set, which helps improve the model's performance. 2. Scaling prompting for long contexts: With a long context, the LLM can struggle to attend to all the tokens in the input, especially if the instructions are very complex. In such cases it’s important to evaluate your LLM on its ability to retrieve information from varying depths in long-context documents. Needle in a Haystack is one such evaluation you can use. 3. Fine-Tuning with RAG Examples: They recommend incorporating your RAG context examples into the fine-tuning process. This teaches the model to leverage retrieved information effectively and generate more relevant outputs. The guide also mentions common recommendations like: - Splitting complex tasks into separate calls - Using chain-of-thought prompting (you can use: https://lnkd.in/gN5eHby5) - Using GPT-4 itself to evaluate and score its outputs for iterative improvement Here's the full guide: https://lnkd.in/gAzjKdyp #AI #LLMs #OpenAI
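The long-context check mentioned in tactic 2 above can be sketched as a small harness. This is a simplified illustration of the needle-in-a-haystack idea, not the guide's own code: `ask` is a hypothetical callable that sends a prompt to a model and returns its answer, and the function names, prompt layout, and substring-based scoring are assumptions.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def needle_eval(ask: Callable[[str], str], filler: List[str], needle: str,
                question: str, expected: str,
                depths: List[float]) -> Dict[float, bool]:
    """Report, per depth, whether the model's answer contains `expected`."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask(f"{context}\n\nQuestion: {question}")
        results[depth] = expected.lower() in answer.lower()
    return results
```

Running this across several context lengths as well as depths would give a grid of pass and fail results that shows where retrieval starts to degrade. Substring matching is a crude scorer and would likely need replacing for free-form answers.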
-
I found one small function in Cleartrip’s sourcemap that quietly fixes a performance problem most frontend teams don’t even know they’re causing. requestIdleCallback(() => { cb() }) At first glance, it looks basic. But the idea behind it is extremely powerful. The real bottleneck in modern frontends isn’t React, Vue, or whatever framework you love blaming. It’s bad task scheduling: code that works, but runs at the wrong time. This fixes that without touching a single line of business logic. What it actually achieves: it makes all non-urgent work wait until the browser is genuinely free. Not kind of free or maybe free. requestIdleCallback: → Browser idle? Run the task. Fallback setTimeout(..., 0): → Push it out of the current frame so it doesn’t fight the UI. Result: the UI stays focused on rendering, scrolling, input, and paint. Everything else waits its turn like a good citizen. Where this matters the most: you throw these into idle time: • analytics • logs • caching writes • JSON parsing • data massaging • preloading future screens • any important-but-not-right-now job These don’t belong next to your render cycle. Frontend performance isn’t “optimize code.” It’s “run the right code at the right time.”