DEV Community: Bestony

Evaluation as a Business Imperative: The Survival Guide for Large Model Application Development

Bestony — Mon, 13 Jan 2025 02:01:58 +0000

Are we truly ready for large model application development, or are we still stuck in the mindset of “as long as it works”?

Over the past few decades, software engineering has focused on reducing system risk and uncertainty through various methodologies. We’ve developed numerous approaches and frameworks to drive rapid business growth: TDD, BDD, DDD — each guiding us to minimize project uncertainty and ensure system stability. However, the advent of large language models (LLMs) challenges this stability, introducing a new reality: not everything will be stable anymore.

The inherent complexity of LLMs means their behavior is not as stable or predictable as code-based logic. In traditional software engineering, with well-designed code, we expect a deterministic output for a given input. But with LLMs, we can’t be sure that the same input will always yield the same output. This introduces a crack in our traditional software engineering approach: while the logic of your system may be clear, the output can still be random, introducing instability.

This randomness is why development in the LLM application era differs from what we’re used to. We must now focus on model evaluation. Evaluation, once a peripheral concern for algorithm specialists, is now crucial for every LLM application developer. To put it bluntly, evaluation is the business.

Why Evaluate?

As a traditional engineer, I’m used to building systems that solve business problems based on understanding requirements. I once developed an AI document generation tool with my team, initially focusing only on functionality. It wasn’t until my leader asked about business impact and precision/recall that I realized the need for evaluation.

I had no concept of evaluation then, so it wasn’t part of my project flow. We assumed that if things were running and users gave positive feedback, all was well. But I had no idea about actual effectiveness. That experience led me to research evaluation, and it inspired this post. I hope you can avoid the mistakes I made.

Why We Didn’t Need Much Evaluation Before

Evaluation isn’t new, but in the past, we mostly used stable services like databases or third-party APIs. Their behavior was predictable, so we only cared about the function itself. LLMs are different; they are inherently a source of uncertainty.

Evaluation has existed in traditional search and machine learning, but it was often a separate module. We assumed these modules were reliable. Business engineering teams focused on feature availability rather than metrics like recall and precision.

The LLM era changes this. LLMs are no longer external modules but embedded within our systems. This shift requires us to integrate evaluation into every development stage, treating it as an integral part of the process rather than a separate concern. Instead of just focusing on feature availability, we must monitor model performance, data quality, and user experience. Our team composition also changes. If previously a typical ratio for product/engineering/testing was 1:5~10:1, it may now become 1:5~10:2. That extra person is there to handle the uncertainty LLMs bring — we need more resources to ensure model performance.

How Do We Evaluate Effectively?

If you agree with the previous points, then we have a common understanding: LLMs are not inherently stable, and we must invest additional effort to ensure their stability within our systems. With this, we can start designing our evaluation systems and plans:

Start with the End: Defining Business and Technical Metrics

The first step in evaluation is defining business metrics and model inputs and outputs. This is where “evaluation is the business” comes into play. If you nail this step, you’ve already captured 80% of the value. Since business metrics are unique, let’s focus on more general technical metrics.

If your system integrates LLMs, focus on:

Generation Quality: Assess the quality of LLM-generated content. There are existing automatic metrics, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), but the most reliable way is human review, evaluating content for effectiveness, fluency, coherence, and relevance.
Model Efficiency: Evaluate LLM inference speed, throughput, and resource consumption. If using cloud-based APIs, focus on inference speed and throughput. If using local models, also evaluate model size, memory usage, and computational resources. These metrics impact your system architecture and cost.
Model Safety: Assess the LLM’s ability to handle malicious requests, whether it generates harmful, inappropriate, or biased content, and whether it leaks any sensitive data. Without these capabilities, your application could face severe issues and potentially be shut down.
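
As a rough illustration of the Model Efficiency item, here is a minimal sketch for measuring latency and throughput. It assumes your model call can be wrapped in a plain function; `call_model`, `measure_latency`, and the stub below are hypothetical names for this example, not part of any particular SDK.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(call_model: Callable[[str], str], prompts: List[str]) -> Dict[str, float]:
    """Call the model once per prompt and summarize latency and throughput."""
    if not prompts:
        raise ValueError("need at least one prompt")
    latencies = []
    started = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        call_model(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    latencies.sort()
    # Nearest-rank 95th percentile over the sorted latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "requests_per_s": len(latencies) / elapsed if elapsed > 0 else 0.0,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    time.sleep(0.001)
    return prompt.upper()


stats = measure_latency(stub_model, ["hello", "world", "again"])
```

This measures sequential calls only. If your application sends requests concurrently, you would likely want to measure under that load instead.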

If your system uses RAG with a database, you also need to look into:

Precision: Assess the percentage of relevant documents among all retrieved documents. Low precision means the retrieval model returns too many unrelated documents. Increase similarity thresholds or refine the model.
Recall: Evaluate the percentage of relevant documents returned compared to all relevant documents. Low recall means the retrieval model is missing relevant documents. Decrease thresholds or optimize the model.
Hit Rate: Assess how often at least one relevant document is retrieved, measured across queries that cover different user intents. A low hit rate may indicate gaps in your knowledge base.
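
The three retrieval metrics above can be sketched in a few lines. This is a minimal example that assumes you already have, for each query, the list of retrieved document IDs and a hand-labeled set of relevant IDs; the function names and sample IDs are made up for illustration.

```python
from typing import Dict, List, Set


def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def hit_rate(results: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Share of queries with at least one relevant document retrieved."""
    if not results:
        return 0.0
    hits = sum(1 for query, docs in results.items() if set(docs) & gold.get(query, set()))
    return hits / len(results)


# Hypothetical evaluation data: query -> retrieved IDs, query -> relevant IDs.
results = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9"]}
gold = {"q1": {"d1", "d3", "d7"}, "q2": {"d8"}}

p = precision(results["q1"], gold["q1"])  # 2 of 4 retrieved are relevant
r = recall(results["q1"], gold["q1"])     # 2 of 3 relevant were retrieved
h = hit_rate(results, gold)               # only q1 has a hit
```

In this sample, raising the similarity threshold might drop `d2` and `d4` and improve precision, but it would not help recall, since `d7` was never retrieved.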

If you are building code generation tools, also track code execution success rate, etc.
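
For code generation, one simple way to approximate execution success rate is to run each generated snippet in a separate process and count clean exits. The sketch below assumes the generated code is Python and defines "success" as exit code 0 within a timeout; both are assumptions for this example, and the helper names are hypothetical. Generated code is untrusted, so a real setup would run it in a sandbox.

```python
import subprocess
import sys
from typing import List


def runs_successfully(source: str, timeout_s: float = 5.0) -> bool:
    """Run one snippet in a fresh interpreter; success means exit code 0."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


def execution_success_rate(snippets: List[str]) -> float:
    """Share of snippets that ran to completion without an error."""
    if not snippets:
        return 0.0
    return sum(runs_successfully(s) for s in snippets) / len(snippets)


# Hypothetical model outputs: two run cleanly, two fail.
generated = ["print(1)", "raise SystemExit(1)", "x = 1 + 1", "1 / 0"]
rate = execution_success_rate(generated)
```

A clean exit only shows that the code runs, not that it is correct, so this metric is best paired with test cases for the expected behavior.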

These are only starting points. You must tailor the metrics to your specific business. But in general, once you decide what to measure, you’re close to understanding your business. You’ll then choose suitable models and prompts and integrate them with engineering.

Clean Data: The Key to High-Quality Evaluation

After defining metrics, the next step is cleaning high-quality data for evaluation. Different evaluation objectives require different data sets. You will likely need to create your own datasets, using online data, manually curated data, or existing data with annotations.

Data cleaning is time-consuming. You need to remove errors, duplicates, incomplete records, and useless information, then format and normalize the data consistently. If your data contains sensitive information, be sure to anonymize it for privacy.
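
A minimal sketch of these cleaning steps, assuming the evaluation set is a list of question/answer records. The field names are hypothetical, and email masking stands in for anonymization here; real data usually needs more than one kind of masking.

```python
import re
from typing import Dict, List

# Simple email pattern used only for this illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def anonymize(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop incomplete and duplicate records, then normalize and anonymize."""
    seen = set()
    cleaned = []
    for record in records:
        question = normalize(record.get("question", ""))
        answer = normalize(record.get("answer", ""))
        if not question or not answer:
            continue  # incomplete record
        key = (question.lower(), answer.lower())
        if key in seen:
            continue  # duplicate, ignoring case and spacing
        seen.add(key)
        cleaned.append({"question": anonymize(question), "answer": anonymize(answer)})
    return cleaned


raw = [
    {"question": "  How do I   reset? ", "answer": "Mail bob@example.com"},
    {"question": "how do i reset?", "answer": "mail bob@example.com"},
    {"question": "Empty", "answer": ""},
]
dataset = clean(raw)
```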

Plan data collection and cleaning upfront to reduce pressure later and ensure adequate staffing.

Also, pay attention to your dataset’s quality (representativeness, accuracy, diversity, and completeness), scale, and data bias. Ensure your data doesn’t introduce biases that drive the model in the wrong direction.

Once you have quality data, the following steps are simpler: Maintain your data, update it regularly, adapt it to your business, and continuously evaluate your system and model to ensure metrics remain healthy.

Continuous Evaluation: A Standardized Process

Evaluation isn’t a one-off event. LLMs are constantly evolving, especially in cloud API scenarios, and your live data may also change. Therefore, you need continuous evaluation. Perform it regularly (e.g., weekly, monthly, or after significant releases) and integrate it into your project development workflow. This will help you identify issues and iterate quickly. Continuous evaluation isn’t optional; it’s essential for the LLM application era.
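
One way to fit continuous evaluation into a release workflow is a small regression check that compares the latest metrics with a stored baseline. This is only a sketch: it assumes higher is better for every metric, and the tolerance and the metric values are made up for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the metrics that are missing or dropped more than `tolerance`."""
    regressions = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or value < base_value - tolerance:
            regressions.append(name)
    return regressions


# Hypothetical numbers from the previous and the latest evaluation run.
baseline = {"precision": 0.80, "recall": 0.70, "hit_rate": 0.90}
current = {"precision": 0.81, "recall": 0.65, "hit_rate": 0.89}

failed = find_regressions(baseline, current)  # recall dropped by 0.05
```

Run on a schedule or after each release, a non-empty result can block the rollout or alert the team before users notice the drop.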

Evaluation as a Business Imperative

In the LLM era, evaluation is the business; it’s no longer optional, but essential for success. Without evaluation, your product is like a ship without a compass. As LLMs play an increasingly important role, they’re no longer an add-on, but the foundation of success. Thus, product managers and project members must understand business evaluation. A somewhat controversial statement is that if evaluation accounts for less than 30% of the business process, that business has at least a 50% optimization potential. Without evaluation, you won’t know your current status or limits, and therefore cannot improve.

Originally published on aistarter.dev on January 5, 2025, by Bestony.

I used 72 hours to replicate a ClubHouse

Bestony — Sat, 06 Feb 2021 14:04:20 +0000

In 2021, the first “war” among new social apps broke out when Clubhouse exploded in popularity overseas.

Part of the reason for this voice social app's rapid rise is Elon Musk, the CEO of Tesla and a major figure in tech circles, who recently opened a chat room (Room) on Clubhouse called “Elon Musk on Good Time”. The live room, which can hold up to 5,000 users, filled up instantly.

ClubHouse has thus become the focus of much discussion and analysis. However, many of you may not have tried the app yet, because a registration invitation code is hard to find. So, 72 hours ago, a developer took it upon himself to build NESHouse, an imitation of ClubHouse, and open-sourced the code.

Open source address: https://github.com/bestony/neshouse

Experience it at: https://neshouse.com/admin.html

The author of NESHouse, Bai Huancheng, is an engineer who also hosts a podcast and is the technical leader of the Linux.cn open source community. We caught up with him to talk about the process of replicating ClubHouse and what he thinks of this kind of application from a professional podcaster’s perspective.

Behind the 72-hour development challenge

Q: How did you come up with the idea of doing a 72-hour development challenge?

Bai Huancheng: I like to research new products, and when ClubHouse took off recently, I got an invitation code early and started using it. After using it for a while, I felt that ClubHouse itself was nothing extraordinary. The real problem with ClubHouse is that most people can’t get in. Since you can’t log in, why not make one yourself?

In addition, my partners on the JINJINLEDAO podcast and I wanted to use this piece of “performance art” to show that, now that cloud services are commonplace, the operational capabilities that help a product succeed may matter more than technical skills. As long as you have an idea, you can boldly put it into practice: build a minimal prototype on reliable cloud services to validate it.

The entrepreneurial joke of “just one programmer away” may not be so applicable today.

To put some pressure on myself, I publicly committed to a 72-hour deadline (why not 24 hours? Because I wasn’t sure I could do it in 24), so that I would be sure to finish the development in the given time.

I’ve long had a hackathon habit. When I worked at a company, I used to give myself hackathon time every Friday night to work on side projects, although I never took part in an organized hackathon because of time and location constraints.

I usually pick one idea from my own backlog of ideas at a set time (like Friday night or Saturday night), use it as the hackathon theme, and then implement the project overnight.

Q: What factors are considered in the process of technology selection?

Bai Huancheng: In terms of technology selection, I mainly consider two factors.

It must be fast: I want to go from 0 to 1 quickly, so speed is one of my core criteria. If it took me half a month, the exercise would be meaningless.

It must be new: In side projects and hackathons, I like to use a technology stack I have never used before, so that I force myself to learn something new in the shortest possible time and build momentum for later development.

Other aspects are less important to me. Resource consumption during hackathon development is limited, so cost is not much of a problem; it’s more about how to implement things quickly and well.

Q: How was the implementation of the audio interaction function considered in the NESHouse project? What kind of problems did you encounter?

Bai Huancheng: I chose the fastest method for the audio interaction; after all, I wanted to implement it in a short time.

I had researched some third-party real-time audio SDKs before, and found that the Agora API was relatively simple and clear, and the development cost was not high. To use an analogy, suppose every project of ours needs to drink water (real-time audio). What Agora provides is tap water: turn the faucet and the water comes out.

Without such an SDK, you would need to dig your own well and install your own pump. It’s not that you can’t get water that way; it’s just far more trouble than plugging into an existing SDK. Using the SDK also let me finish the integration faster and focus on the application logic.

For example, the code that hooks up audio listening in NESHouse is only 7 lines long.

In fact, implementing the audio interaction was not especially difficult; the main difficulty was adapting to different browsers and devices.

Because I was working on a web-side implementation, I relied on the browser’s compatibility with WebRTC. For example, during development I found that WeChat’s built-in browser requires the user to actively tap the page before audio playback can start, so I built a dedicated screen to handle joining on WeChat.

Podcasts and audio social in my eyes

Q: How is ClubHouse different from the traditional podcast idea? Is it an evolutionary form of podcasting?

Bai Huancheng: My own feeling about ClubHouse is that its original intention is probably to be an extension of the offline scene.

For example, before the epidemic I could go to an offline salon; now that I can’t attend in person, I can listen in on ClubHouse instead. It is time-bound, so I have to come to the House at a fixed time to hear what the Club is sharing. A podcast is different: it has no time limit, and I can come and listen at any time.

This time limit makes ClubHouse very live and requires you to be more engaged when using it. Podcasts, by contrast, don’t have as many restrictions. That said, ClubHouse can actually be used as a podcast; the theme may just change often, so if you don’t mind that, ClubHouse can also be a podcast.

Q: There is a perception that “ClubHouse has no technical barrier; its success is mainly a matter of operations and promotion”. How do you see the success of the app after 72 hours of development?

Bai Huancheng: Was it difficult to develop ClubHouse? Yes, it was. Are there any barriers to audio social? No, there are no barriers, because it can be built on Agora’s services.

The real barriers appear elsewhere. Early in development, you need to weigh the trade-offs: what do you want in the product, and what do you leave out? After you build the product, how do you get enough KOLs, such as Elon Musk, to join the community and share? How do you get more people to come in and take part? How do you raise enough money to cover the operating costs that come with a mass of users?

In contrast, I think these latter things are the more difficult ones.

Q: As a veteran podcast host and freelance developer, do you think the ClubHouse style of audio social networking will become a trend in China?

Bai Huancheng: I think it’s still difficult, because the ClubHouse style demands that people synchronize their schedules. It may slowly become a tool: when you need to hold an online salon, ClubHouse will be a good tool for it.

Author

Bai Huancheng, the author of NESHouse, is an engineer who also hosts a podcast and is the technical leader of the Linux.cn open source community. GitHub ID: bestony.

I build A Pornhub Flavour Logo Generator

Bestony — Tue, 26 Mar 2019 16:29:23 +0000

Hi, Developers,

I built a Pornhub-flavoured logo generator and deployed it on Netlify.

You can now view it at Logoly.

I also published it on GitHub. The source code is here, under the Do What The F*ck You Want To Public License.

If possible, could you give me a star on GitHub?