We compared Llama 2 7b, 13b and 70b (chat-hf fine-tuned) against OpenAI gpt-3.5-turbo and gpt-4. We used a 3-way verified, hand-labeled set of 373 news report statements and presented one correct and one incorrect summary of each. Each LLM had to decide which of the two was the factually correct summary.
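To make the setup concrete, here is a minimal sketch of how such a two-choice factuality check could be scored. It is not the harness we ran: the `Example` type, the prompt wording, the `choose` callback (a stand-in for whatever call sends a prompt to a model and returns its answer), and the random A/B ordering are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One labeled item (hypothetical structure, not the original dataset format)."""
    statement: str   # the news report statement
    correct: str     # the factually correct summary
    incorrect: str   # the factually incorrect summary


def build_prompt(ex: Example, correct_first: bool) -> str:
    """Lay out both summaries as options A and B (prompt wording is an assumption)."""
    a, b = (ex.correct, ex.incorrect) if correct_first else (ex.incorrect, ex.correct)
    return (
        f"Statement: {ex.statement}\n"
        f"A: {a}\n"
        f"B: {b}\n"
        "Which summary is factually correct? Answer A or B."
    )


def accuracy(
    examples: Sequence[Example],
    choose: Callable[[str], str],
    seed: int = 0,
) -> float:
    """Fraction of examples where `choose` picks the correct summary.

    `choose` is a placeholder for a model call: it takes the prompt text
    and returns "A" or "B". The option order is shuffled per example
    (an assumption) so a model cannot score well by always answering "A".
    """
    if not examples:
        raise ValueError("need at least one example")
    rng = random.Random(seed)
    hits = 0
    for ex in examples:
        correct_first = rng.random() < 0.5
        expected = "A" if correct_first else "B"
        answer = choose(build_prompt(ex, correct_first)).strip().upper()
        hits += answer == expected
    return hits / len(examples)
```

Running `accuracy` once per model with the same `examples` and `seed` gives directly comparable scores, since every model sees the same option order.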
<https://link.medium.com/ugIcBrTXxCb>
![Cover image for Llama-2-70b is almost as strong at factuality as gpt-4, and considerably better than gpt-3.5-turbo.](https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5yckexj17ycgszzyykn.jpg)