Evaluating Project Performance Metrics


  • Lynn Loo

    CEO, Global Centre for Maritime Decarbonisation | Professor, Princeton University | Energy Transition and Shipping

    44,486 followers

    Have I mentioned we are data geeks? 🤓🤓

    Performance uncertainty remains one of the biggest barriers to wider uptake of #energy #efficiency technologies. 💡 #Wind-assisted propulsion, 💨 air-lubrication systems 🫧 and other proven #retrofits can cut fuel use by double-digit percentages. 📉 But real-world savings swing with weather, routing and operations. Without clarity on a retrofit’s actual contribution, neither shipowners nor charterers can forecast returns with confidence. 🤷

    And because we’ve always believed that #data 📊 can give us the clearest truth, we set out to address this challenge. 👊🏻

    Our friends at Eastern Pacific Shipping Pte. Ltd. gave us access to the Pacific Sentinel, on which we installed a high-frequency data acquisition system as three suction #sails ⛵️ were retrofitted onboard the MR tanker in March 2025. Calibrated sensors captured #power consumption, vessel speed, engine load, heading and wind conditions every 15 seconds. Over four months as the vessel traded spot around the Americas, 🌎 we saw #weather and #performance at a fidelity far beyond the single daily datapoint in a noon report.

    Building on #ITTC and DNV methodologies, Global Centre for Maritime Decarbonisation (GCMD) and EPS implemented an “on-off” testing protocol, 🎛️ comparing power consumption with the sails activated and deactivated under otherwise similar environmental and operational conditions to isolate the sails’ true contribution.

    Under the predominantly near-headwind conditions sampled, the vessel saw an average instantaneous power saving ⚡️ of 7.2%, with a 95% confidence interval between 6.2% and 8.2%. Instantaneous savings ranged from +28% to –14%. These extremes are rare outliers, but they highlight just how sensitive power savings are to wind speed and direction, and underscore the importance of tracking dynamic operational data. ⚠️

    Access report here: https://lnkd.in/g_dRFtJp

    If we want to scale energy-efficiency retrofits, we must tackle performance uncertainty head-on. Shipowners won’t invest, and charterers won’t commit, if they can’t trust that the #savings will show up in their fuel bills. 💵 We therefore developed a power savings polar heat map to predict energy and fuel savings as a function of wind conditions. With 3rd-party verification, this will enable performance-linked financing of the retrofits. 💰

    This case study is but a first step in building that validation layer. And it ladders 🪜 up to what we launched last week: #FEET, the world’s first blended-finance fund designed to support energy-efficiency retrofits through a pay-as-you-save repayment structure.

    Progress is incremental, and this marks a big step in the right direction. 👊🏻 Together, we are stronger; together, we can 💪🏻

    Shane Balani, Zheng Yang Cheng 钟正扬, Bhushan Taskar, Goh Wan Ni, Pavlos Karagiannidis, Mirtcho Spassov, CFA, Mike Wilson, Rashim Berry, Cyril Ducau
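The arithmetic behind a figure such as "7.2%, with a 95% confidence interval between 6.2% and 8.2%" can be sketched as follows. This is only an illustration of a paired on/off comparison, not the ITTC/DNV-based procedure used in the report: the function name, the normal-approximation interval (z = 1.96) and the sample readings are all assumptions made for the example.

```python
import math
import statistics

def savings_summary(pairs, z=1.96):
    """Mean percentage power saving with a normal-approximation confidence interval.

    pairs: (power_sails_off_kw, power_sails_on_kw) tuples taken under
    similar conditions. z=1.96 corresponds to roughly 95% coverage and
    is only a reasonable approximation for large samples.
    """
    savings = [100.0 * (off - on) / off for off, on in pairs]
    mean = statistics.mean(savings)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(savings) / math.sqrt(len(savings))
    return mean, mean - z * sem, mean + z * sem

# Hypothetical readings in kW (NOT data from the Pacific Sentinel trial).
# The fourth pair shows a negative saving, as can happen in unfavourable wind.
pairs = [(5000, 4600), (5200, 4900), (4800, 4500), (5100, 5250), (4950, 4400)]
mean, low, high = savings_summary(pairs)
print(f"mean saving {mean:.1f}% (approx. 95% CI {low:.1f}% to {high:.1f}%)")
```

With 15-second sampling over four months the sample is large, which is what makes a narrow interval possible even though individual readings swing widely.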

  • Andrew Ng

    DeepLearning.AI, AI Fund and AI Aspire

    2,491,246 followers

    I’ve noticed that many GenAI application projects put in automated evaluations (evals) of the system’s output probably later, and rely on humans to manually examine and judge outputs longer, than they should. This is because building evals is viewed as a massive investment (say, creating 100 or 1,000 examples, and designing and validating metrics) and there’s never a convenient moment to put in that up-front cost. Instead, I encourage teams to think of building evals as an iterative process. It’s okay to start with a quick-and-dirty implementation (say, 5 examples with unoptimized metrics) and then iterate and improve over time. This allows you to gradually shift the burden of evaluations away from humans and toward automated evals.

    I wrote previously in The Batch about the importance and difficulty of creating evals. Say you’re building a customer-service chatbot that responds to users in free text. There’s no single right answer, so many teams end up having humans pore over dozens of example outputs with every update to judge if it improved the system. While techniques like LLM-as-judge are helpful, the details of getting this to work well (such as what prompt to use, what context to give the judge, and so on) are finicky to get right. All this contributes to the impression that building evals requires a large up-front investment, and thus on any given day, a team can make more progress by relying on human judges than figuring out how to build automated evals.

    I encourage you to approach building evals differently. It’s okay to build quick evals that are only partial, incomplete, and noisy measures of the system’s performance, and to iteratively improve them. They can be a complement to, rather than replacement for, manual evaluations. Over time, you can gradually tune the evaluation methodology to close the gap between the evals’ output and human judgments. For example:

    - It’s okay to start with very few examples in the eval set, say 5, and gradually add to them over time, or subtract them if you find that some examples are too easy or too hard, and not useful for distinguishing between the performance of different versions of your system.
    - It’s okay to start with evals that measure only a subset of the dimensions of performance you care about, or measure narrow cues that you believe are correlated with, but don’t fully capture, system performance. For example if, at a certain moment in the conversation, your customer-support agent is supposed to (i) call an API to issue a refund and (ii) generate an appropriate message to the user, you might start off measuring only whether or not it calls the API correctly and not worry about the message. Or if, at a certain moment, your chatbot should recommend a specific product, a basic eval could measure whether or not the chatbot mentions that product without worrying about what it says about it.

    [Truncated due to length limit. Full text: https://lnkd.in/gygj3y7w ]
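The "5 examples, narrow cue" starting point described above might look like the sketch below. Everything in it is hypothetical: `toy_chatbot` stands in for a real system, and the product names and prompts are invented. The only check is whether the expected product is mentioned at all, which is deliberately partial and noisy.

```python
def mentions_product(reply: str, product: str) -> bool:
    """Narrow cue: does the reply mention the product? Says nothing about tone or accuracy."""
    return product.lower() in reply.lower()

def run_eval(chatbot, examples) -> float:
    """Fraction of examples where the chatbot mentions the expected product."""
    passed = sum(
        mentions_product(chatbot(ex["prompt"]), ex["expected_product"])
        for ex in examples
    )
    return passed / len(examples)

def toy_chatbot(prompt: str) -> str:
    """Hypothetical stand-in for the real system under test."""
    text = prompt.lower()
    if "running" in text:
        return "You might like the TrailRunner 2."
    if "rain" in text:
        return "Our StormShell jacket handles heavy rain."
    return "Sorry, I'm not sure what to recommend."

# A deliberately tiny eval set; grow or prune it as you learn what discriminates.
EXAMPLES = [
    {"prompt": "Shoes for trail running?", "expected_product": "TrailRunner 2"},
    {"prompt": "I need something for the rain", "expected_product": "StormShell"},
    {"prompt": "Best jacket for a wet hike?", "expected_product": "StormShell"},
    {"prompt": "Running in winter, what shoes?", "expected_product": "TrailRunner 2"},
    {"prompt": "Do you sell tents?", "expected_product": "BaseCamp tent"},
]

print(f"pass rate: {run_eval(toy_chatbot, EXAMPLES):.0%}")
```

Re-running the same script after each change to the system gives a quick regression signal while human review continues alongside it.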

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,693 followers

    Evaluating LLMs is hard. Evaluating agents is even harder.

    This is one of the most common challenges I see when teams move from using LLMs in isolation to deploying agents that act over time, use tools, interact with APIs, and coordinate across roles. These systems make a series of decisions, not just a single prediction. As a result, success or failure depends on more than whether the final answer is correct.

    Despite this, many teams still rely on basic task success metrics or manual reviews. Some build internal evaluation dashboards, but most of these efforts are narrowly scoped and miss the bigger picture.

    Observability tools exist, but they are not enough on their own. Google’s ADK telemetry provides traces of tool use and reasoning chains. LangSmith gives structured logging for LangChain-based workflows. Frameworks like CrewAI, AutoGen, and OpenAgents expose role-specific actions and memory updates. These are helpful for debugging, but they do not tell you how well the agent performed across dimensions like coordination, learning, or adaptability.

    Two recent research directions offer much-needed structure. One proposes breaking down agent evaluation into behavioral components like plan quality, adaptability, and inter-agent coordination. Another argues for longitudinal tracking, focusing on how agents evolve over time, whether they drift or stabilize, and whether they generalize or forget.

    If you are evaluating agents today, here are the most important criteria to measure:
    • Task success: Did the agent complete the task, and was the outcome verifiable?
    • Plan quality: Was the initial strategy reasonable and efficient?
    • Adaptation: Did the agent handle tool failures, retry intelligently, or escalate when needed?
    • Memory usage: Was memory referenced meaningfully, or ignored?
    • Coordination (for multi-agent systems): Did agents delegate, share information, and avoid redundancy?
    • Stability over time: Did behavior remain consistent across runs or drift unpredictably?

    For adaptive agents or those in production, this becomes even more critical. Evaluation systems should be time-aware, tracking changes in behavior, error rates, and success patterns over time. Static accuracy alone will not explain why an agent performs well one day and fails the next.

    Structured evaluation is not just about dashboards. It is the foundation for improving agent design. Without clear signals, you cannot diagnose whether failure came from the LLM, the plan, the tool, or the orchestration logic.

    If your agents are planning, adapting, or coordinating across steps or roles, now is the time to move past simple correctness checks and build a robust, multi-dimensional evaluation framework. It is the only way to scale intelligent behavior with confidence.
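One possible shape for such a multi-dimensional, time-aware record is sketched below. The dimension names mirror the criteria in the post; the 0-to-1 scoring scale, the `RunScore` class, the spread threshold and the sample scores are assumptions for illustration and do not come from any of the frameworks named above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Dimension names follow the criteria in the post; the 0-to-1 scale is an assumption.
DIMENSIONS = ("task_success", "plan_quality", "adaptation", "memory_usage", "coordination")

@dataclass
class RunScore:
    run_id: str
    scores: dict  # dimension name -> score in [0, 1], from a human or automated grader

def summarize(runs):
    """Per-dimension mean and spread across runs; spread is a crude stability signal."""
    summary = {}
    for dim in DIMENSIONS:
        values = [run.scores[dim] for run in runs]
        summary[dim] = {"mean": mean(values), "spread": pstdev(values)}
    return summary

def unstable_dimensions(summary, max_spread=0.2):
    """Dimensions whose run-to-run spread exceeds a (hypothetical) threshold."""
    return [dim for dim, stats in summary.items() if stats["spread"] > max_spread]

# Hypothetical scores for three runs of the same task on different days
runs = [
    RunScore("mon", {"task_success": 1.0, "plan_quality": 0.8, "adaptation": 0.7,
                     "memory_usage": 0.6, "coordination": 0.9}),
    RunScore("tue", {"task_success": 1.0, "plan_quality": 0.7, "adaptation": 0.6,
                     "memory_usage": 0.6, "coordination": 0.8}),
    RunScore("wed", {"task_success": 0.0, "plan_quality": 0.8, "adaptation": 0.2,
                     "memory_usage": 0.5, "coordination": 0.9}),
]
summary = summarize(runs)
print(unstable_dimensions(summary))
```

Keeping the dimensions separate is the point of the design: here the failed run coincides with a drop in adaptation while plan quality holds steady, which narrows down where to look.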

  • Aishwarya Srinivasan
    630,793 followers

    Most people evaluate LLMs by benchmarks alone. But in production, the real question is: how well do they perform?

    When you’re running inference at scale, these are the 3 performance metrics that matter most:

    1️⃣ Latency
    How fast does the model respond after receiving a prompt? There are two kinds to care about:
    → First-token latency: Time to start generating a response
    → End-to-end latency: Time to generate the full response
    Latency directly impacts UX for chat, speed for agentic workflows, and runtime cost for batch jobs. Even small delays add up fast at scale.

    2️⃣ Context Window
    How much information can the model remember, both from the prompt and prior turns? This affects long-form summarization, RAG, and agent memory. Models range from:
    → GPT-3.5 / LLaMA 2: 4k–8k tokens
    → GPT-4 / Claude 2: 32k–200k tokens
    → GPT-OSS-120B: 131k tokens
    Larger context enables richer workflows but comes with tradeoffs: slower inference and higher compute cost. Use compression techniques like attention sink or sliding windows to get more out of your context window.

    3️⃣ Throughput
    How many tokens or requests can the model handle per second? This is key when you’re serving thousands of requests or processing large document batches. Higher throughput = faster completion and lower cost.

    How to optimize based on your use case:
    → Real-time chat or tool use: prioritize low latency
    → Long documents or RAG: prioritize large context window
    → Agentic workflows: find a balance between latency and context
    → Async or high-volume processing: prioritize high throughput

    My 2 cents 🤌
    → Choose in-region, lightweight models for lower latency
    → Use 32k+ context models only when necessary
    → Mix long-context models with fast first-token latency for agents
    → Optimize batch size and decoding strategy to maximize throughput

    Don’t just pick a model based on benchmarks. Pick the right tradeoffs for your workload.
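First-token latency, end-to-end latency and token throughput can all be read off a single streamed response. The sketch below assumes the model is exposed as a Python iterator of tokens; `fake_stream` is a hypothetical stand-in that simulates delays with `time.sleep`, so the numbers it produces say nothing about any real model.

```python
import time

def measure_stream(token_stream):
    """Time one streamed response: first-token latency, end-to-end latency, tokens/s."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # moment the first token arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "first_token_latency_s": first_token_at - start,
        "end_to_end_latency_s": total,
        "tokens": count,
        "tokens_per_s": count / total if total > 0 else float("inf"),
    }

def fake_stream(n_tokens=20, startup_s=0.05, per_token_s=0.005):
    """Hypothetical token stream: a startup delay, then evenly paced tokens."""
    time.sleep(startup_s)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)

print(measure_stream(fake_stream()))
```

This measures one request in isolation. Serving throughput under load also depends on batching and concurrency, which a single-stream timer does not capture.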

  • Marianne Touchie

    Canada Research Chair in Sustainable Urban Housing, Associate Professor at University of Toronto

    4,185 followers

    What happens to Passive House energy targets after people move in? 🏢

    Our recent post‑occupancy evaluation (POE) of two Ontario multi‑unit residential buildings, one EnerPHit retrofit and one new Passive House, compared modelled performance with measured energy use over a full year. While both buildings achieved low energy use intensities, differences emerged when examining system-level performance, particularly ventilation fan energy, cooling demand, and plug loads. Total building energy closely matched predictions in the new build, but space conditioning was higher than anticipated, while the retrofit showed larger building-level energy use gaps driven by operational factors.

    The takeaway? POEs are essential. Commissioning quality, control strategies, and realistic modelling assumptions play a major role in whether high‑performance buildings deliver their intended outcomes in practice.

    Read the whole story here by lead author Yazan Zamel: https://lnkd.in/eXTweB23

    #BEIE_Lab #PassiveHouse #EnerPHit #BuildingPerformance #POE #HighPerformanceBuildings #MURBs
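A modelled-versus-measured comparison by end use can be expressed in a few lines. The sketch below only illustrates the idea of a per-system performance gap: the end-use names echo those in the post, but the function and every number are hypothetical and are not results from the study.

```python
def performance_gap(modelled, measured):
    """Percent gap per end use; positive means measured use exceeded the model."""
    return {
        end_use: 100.0 * (measured[end_use] - modelled[end_use]) / modelled[end_use]
        for end_use in modelled
    }

# Hypothetical annual energy use intensities in kWh/m2 (NOT values from the POE)
modelled = {"ventilation_fans": 5.0, "cooling": 4.0, "plug_loads": 20.0}
measured = {"ventilation_fans": 7.0, "cooling": 5.0, "plug_loads": 23.0}

for end_use, gap in performance_gap(modelled, measured).items():
    print(f"{end_use}: {gap:+.0f}%")
```

Breaking the gap out by end use matters because a building-level total can match predictions while individual systems over- and under-perform and cancel each other out.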

  • Chris Do

    Success requires all of you. I’ll make the introductions. Unbland™ Yourself. Reformed introvert, Professional Weir-Do on a mission to help you be more YOU. Get help with your personal brand → Content Lab.

    621,792 followers

    Stuck in an endless loop of client changes? Lost track of what revision this constitutes?

    Yeah. Been there. Done that.

    The secret? It's not about saying no. It's about saying yes to the right things upfront.

    Every project that goes sideways starts the same way: Vague agreements. Fuzzy boundaries. Good intentions. Six weeks later you're bleeding money and everyone's frustrated.

    Here's my framework after 30 years of running two 8-figure businesses:

    The SOW is your salvation. Not some boilerplate template. A real document that covers:
    • Exact deliverables (not "design work" but "3 homepage concepts, 2 rounds of revisions")
    • Hours of operation ("We respond M-F, 9-5 PST. Weekend requests get Monday responses")
    • Revision rounds spelled out ("Round 1 includes up to 5 changes. Round 2 includes 3.")
    • Feedback cycles defined ("48-hour turnaround for client feedback; otherwise the project may be delayed or additional fees may be incurred")

    But here's what most people miss: don't work on client notes immediately.

    Client sends 37 pieces of feedback at 11pm Friday? Producer sends conflicting notes from the CEO? Marketing wants one thing, sales wants another?

    Stop. Collect everything first. Resolve the conflicts. Get on the phone and discuss it with your client to get alignment. Separate the "have to haves" from the "nice to haves". Then present unified changes.

    "Based on all feedback received, here are the 8 changes we'll implement. This constitutes revision round 2 of 3."

    Watch how fast the random requests stop. No extra work that goes unappreciated. No more feelings of being taken advantage of.

    Communicating before the crisis prevents the crisis from happening.

    "Just so you know, we're entering round 2. You have one more included. After that, it's $X per additional round."

    No surprises. No awkward money conversations. No resentment.

    Scope creep isn't a them problem. It's a you problem. And that's good news, because that means you are in control. They're not trying to take advantage. They just don't know where the boundaries are because you never drew them.

    Draw the lines early. Communicate them clearly. Everyone wins.

    What's your most painful scope creep story? What boundary would've prevented it?

    Small Business Builders #projectmanagement #clientmanagement #businessgrowth

  • Armand Ruiz

    building AI systems @meta

    207,007 followers

    Evaluations, or “Evals”, are the backbone for creating production-ready GenAI applications.

    Over the past year, we’ve built LLM-powered solutions for our customers and connected with AI leaders, uncovering a common struggle: the lack of clear, pluggable evaluation frameworks. If you’ve ever been stuck wondering how to evaluate your LLM effectively, today's post is for you.

    Here’s what I’ve learned about creating impactful Evals:

    What Makes a Great Eval?
    - Clarity and Focus: Prioritize a few interpretable metrics that align closely with your application’s most important outcomes.
    - Efficiency: Opt for automated, fast-to-compute metrics to streamline iterative testing.
    - Representation Matters: Use datasets that reflect real-world diversity to ensure reliability and scalability.

    The Evolution of Metrics: From BLEU to LLM-Assisted Evals
    Traditional metrics like BLEU and ROUGE paved the way but often miss nuances like tone or semantics. LLM-assisted Evals (e.g., GPTScore, LLM-Eval) now leverage AI to evaluate itself, achieving up to 80% agreement with human judgments. Combining machine feedback with human evaluators provides a balanced and effective assessment framework.

    From Theory to Practice: Building Your Eval Pipeline
    - Create a Golden Test Set: Use tools like LangChain or RAGAS to simulate real-world conditions.
    - Grade Effectively: Leverage libraries like TruLens or LlamaIndex for hybrid LLM+human feedback.
    - Iterate and Optimize: Continuously refine metrics and evaluation flows to align with customer needs.

    If you’re working on LLM-powered applications, building high-quality Evals is one of the most impactful investments you can make. It’s not just about metrics; it’s about ensuring your app resonates with real-world users and delivers measurable value.
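A figure like "80% agreement with human judgments" can be checked on your own golden test set by comparing an LLM judge's labels with human labels item by item. The sketch below uses plain Python, not the API of any library named above. The labels are invented, and Cohen's kappa is added here as one common chance-corrected companion to raw agreement, not something the post prescribes.

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of items on which the LLM judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge, human):
    """Agreement corrected for chance, based on each rater's label frequencies."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case (one shared label); treated as 1.0 in this sketch
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels on a 10-item golden set
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]

print(f"agreement: {agreement_rate(judge, human):.0%}, kappa: {cohens_kappa(judge, human):.2f}")
```

Raw agreement can look high simply because one label dominates, which is why a chance-corrected figure is worth reporting next to it.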

  • Chris Carson FRICS, FAACE, FGPC, PSP, DRMP, CEP, CCM, PMP

    Enterprise Director of Program & Project Controls, and Vice President at Arcadis

    14,598 followers

    Glen Palmer, PSP, CFCC, FAACE and I are honored by AACE publishing another of our Top Ten series of papers in the Cost Engineering Journal. Resource management sits at the heart of project success and, too often, at the root of costly construction claims.

    Why Focus on Resources?
    Most construction schedules are built on assumptions about production rates, durations, and quantities. But when resource planning falls short, whether due to unrealistic manpower peaks, lack of skilled labor, or poor coordination, projects risk delays, cost overruns, and disputes. Rather than waiting for claims to arise, Palmer and Carson argue for a proactive approach: plan, validate, and monitor your resources from day one.

    Key Takeaways from the Top Ten Approaches:
    1. Validate Resources by Discipline: Go beyond surface-level schedule checks. Detailed resource validation, using field-experienced personnel, can identify unrealistic resource peaks and prevent unachievable schedules.
    2. Formalize Punch and Warranty List Management: Avoid never-ending completion and warranty periods by developing comprehensive, early punch lists and using structured warranty management systems.
    3. Check Resource Earning Curves: Ensure planned progress is actually achievable by comparing planned manpower curves and production rates to real-world constraints.
    4. Manage Schedule Compression: When compressing schedules, understand the risks and costs of acceleration and recovery. Use structured analysis and documentation to avoid disputes.
    5. Review General Conditions Labor: Monitor and budget field overhead costs carefully, and avoid relying on variable, hard-to-track level-of-effort activities.
    6. Use Constructability Reviews: Always have experienced field experts review “fast-tracked” project schedules to spot resource and constructability problems early.
    7. Address Trade Stacking and Overcrowding: Analyze crew concurrency and area usage to prevent inefficiencies from too many workers or trades in the same space.
    8. Specify Resource Requirements in Schedules: Include resource histograms and percent curves in scheduling specifications to enable thorough schedule reviews.
    9. Plan for Resource Availability: Evaluate the availability of skilled labor and specialty resources, especially on large or geographically constrained projects.
    10. Minimize Inefficiencies from Disrupted Trade Work: Align procurement, sequencing, and trade starts to reduce disruption, and use targeted planning to ensure work is completed efficiently on the first attempt.

    Conclusion:
    Resource-related claims are often avoidable with disciplined planning, honest schedule validation, and ongoing monitoring. By following these ten approaches, project teams can dramatically reduce the risk of disputes, keep projects on track, and protect both profit and reputation.
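The resource histograms and peak checks mentioned in the list above reduce to simple bookkeeping once activities carry crew sizes. The sketch below is a hypothetical illustration, not the authors' method or the output of any scheduling tool: the activity tuple layout, the availability limit and the sample schedule are all assumptions.

```python
def daily_headcount(activities):
    """Build a resource histogram: total crew on site per working day.

    activities: (name, start_day, end_day, crew_size) tuples, days inclusive.
    """
    histogram = {}
    for _name, start, end, crew in activities:
        for day in range(start, end + 1):
            histogram[day] = histogram.get(day, 0) + crew
    return histogram

def overloaded_days(histogram, available):
    """Days on which planned headcount exceeds the labor assumed to be available."""
    return sorted(day for day, crew in histogram.items() if crew > available)

# Hypothetical three-activity schedule sharing one work area
activities = [
    ("formwork", 1, 5, 8),
    ("rebar", 3, 7, 6),
    ("MEP rough-in", 5, 9, 10),
]
histogram = daily_headcount(activities)
print("peak headcount:", max(histogram.values()))
print("days over a 15-person limit:", overloaded_days(histogram, available=15))
```

Running the same check per discipline or per work area, instead of for the whole site, is what surfaces trade stacking that a site-wide total would hide.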

  • Arockia Liborious
    39,470 followers

    🔍 Diving into LLM System Metrics: What Really Matters

    After analyzing six months of LLM deployment data, here are the metrics that actually matter:
    ⚡ Reliability: 99.99% uptime, because enterprise solutions demand consistency
    ⏱️ Response Time: 500ms average, crucial for real-time applications
    📈 Scale: Processing 10B+ tokens weekly across enterprise workloads
    🔒 Security: 256-bit encryption, with <0.001% unauthorized access attempts
    💰 Efficiency: Adaptive token allocation reducing operational costs by 30%
    🧠 Intelligence: 5 specialized models, each learning from 1M+ daily interactions

    What stands out is how these metrics are evolving. While response time was the focus a couple of years back, we're seeing a clear shift toward efficiency and specialized performance metrics in 2025.

    💭 Curious to hear from other AI practitioners: Which metrics are you prioritizing for your LLM systems this year?
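Headline figures like these are easier to reason about once converted into operational terms. The sketch below is plain arithmetic on two numbers quoted in the post, assuming a 365-day year and load spread evenly across the week. Both are simplifications, and the function names are made up for the example.

```python
def allowed_downtime_minutes(uptime_pct, period_days=365):
    """Downtime budget implied by an uptime percentage over a period."""
    return (1 - uptime_pct / 100.0) * period_days * 24 * 60

def average_tokens_per_second(tokens_per_week):
    """Average rate implied by a weekly token volume, assuming an even load."""
    return tokens_per_week / (7 * 24 * 3600)

print(f"99.99% uptime allows about {allowed_downtime_minutes(99.99):.1f} minutes of downtime per year")
print(f"10B tokens per week averages about {average_tokens_per_second(10e9):,.0f} tokens per second")
```

Real traffic is bursty, so peak rates will sit well above the weekly average, and capacity planning should be based on the peaks.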

  • Suhail Diaz Valderrama MSc. MBA

    Director of Future Energies • Integrated Strategy & Asset Management • Driving Energy System Transformation • High-Impact Stakeholder Engagement • Advisory Board @ Khalifa University

    43,223 followers

    📢 A New Era for Energy Efficiency in Saudi Arabia: SEEC Releases Updated M&V User Guide

    Exciting news for energy transition experts! The Saudi Energy Efficiency Center (SEEC) has released the second version of its "Energy Saving Measurement & Verification (M&V) User Guide For the Kingdom of Saudi Arabia". This updated guide provides a robust framework for conducting M&V in energy efficiency projects, paving the way for a thriving ESCO sector and scaled-up ESPC implementation.

    Key takeaways from the report:
    1️⃣ Emphasizes alignment with the International Performance Measurement and Verification Protocol (IPMVP), ensuring best practices and global credibility for KSA energy saving reports.
    2️⃣ Outlines four detailed M&V options to address a diverse range of project complexities and circumstances.
    3️⃣ Highlights the importance of a well-defined M&V plan, including clear measurement boundaries, baseline definitions, adjustment methodologies, data analysis procedures, and uncertainty assessments.
    4️⃣ Addresses crucial issues such as metering equipment, data collection, and non-routine adjustments to ensure accurate and transparent saving calculations.
    5️⃣ Includes practical examples and illustrations to guide practitioners through the M&V process.

    Opportunities:
    1️⃣ Increased Transparency and Credibility: Standardized M&V practices will boost investor confidence in KSA's energy efficiency market, attracting further investments.
    2️⃣ Enhanced ESCO Sector Development: The guide provides a solid foundation for ESCOs to operate, fostering competition and innovation in delivering energy saving solutions.
    3️⃣ Scaled-up ESPC Implementation: Robust M&V will facilitate successful implementation of ESPCs, driving energy efficiency retrofits across various sectors.
    4️⃣ Improved Project Design and Performance: The guide emphasizes comprehensive project design and performance monitoring, maximizing energy saving potential.

    Challenges:
    1️⃣ Building M&V Capacity: Effective implementation of the guide requires training and development of qualified personnel with expertise in IPMVP and associated methodologies.
    2️⃣ Data Management and Analysis: Handling large data sets and applying advanced statistical techniques require investment in appropriate software and expertise.
    3️⃣ Adaptation to Specific Project Needs: Tailoring the M&V plan to individual project complexities and characteristics requires careful consideration and expertise.
    4️⃣ Enforcing M&V Standards: Ensuring widespread adoption and adherence to the guidelines across the industry requires strong collaboration between SEEC, ESCOs, and project stakeholders.

    This updated M&V User Guide is a significant step forward for Saudi Arabia's energy transition. It provides a clear path for achieving energy efficiency goals, promoting sustainability, and driving economic growth.

    #EnergyTransition #EnergyEfficiency #SaudiArabia #M&V #ESPC #ESCO #IPMVP #SEEC #Decarbonization
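The baseline-and-adjustment idea at the core of IPMVP-style M&V is that savings equal baseline energy, adjusted to reporting-period conditions, minus measured reporting-period energy. The sketch below illustrates one routine adjustment under stated assumptions and is not the procedure prescribed by the SEEC guide: it assumes a single driver (cooling degree days), a straight-line baseline model fitted by least squares, and invented monthly data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def avoided_energy(baseline_cdd, baseline_kwh, report_cdd, report_kwh):
    """Savings = baseline model evaluated at reporting conditions, minus measured use."""
    slope, intercept = fit_line(baseline_cdd, baseline_kwh)
    adjusted_baseline = sum(intercept + slope * cdd for cdd in report_cdd)
    return adjusted_baseline - sum(report_kwh)

# Hypothetical monthly data: cooling degree days and metered kWh
baseline_cdd = [100, 200, 300, 400]
baseline_kwh = [1500, 2000, 2500, 3000]   # pre-retrofit consumption
report_cdd = [150, 250, 350]
report_kwh = [1500, 1900, 2300]           # post-retrofit consumption

print(f"avoided energy: {avoided_energy(baseline_cdd, baseline_kwh, report_cdd, report_kwh):.0f} kWh")
```

Adjusting the baseline to reporting-period weather is what keeps a mild or harsh season from being mistaken for a change in efficiency. Non-routine adjustments and uncertainty assessment, both named in the guide's takeaways, are outside this sketch.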
