Meta releases efficiency-optimized Llama 3.3 70B large language model

Meta Platforms Inc. today introduced Llama 3.3 70B, the latest addition to its Llama series of open-source large language models.

The new model provides output quality similar to that of Llama 3.1 405B, the most advanced LLM in the series, while using a fraction of the hardware. The result is a significant drop in infrastructure expenses. Meta says that Llama 3.3 70B generates prompt responses nearly five times more cost-efficiently.

The model is based on an optimized version of the Transformer architecture, the neural network design that underpins most cutting-edge LLMs. When analyzing a set of data points, Transformer-based models use a so-called attention mechanism to determine which data points are most relevant to the task at hand. Meta swapped the default attention mechanism for an improved implementation, known as grouped-query attention, that lowers inference costs.
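Grouped-query attention lets several query heads share a single set of key and value heads, which shrinks the key-value cache the model keeps in memory while generating text. Below is a minimal NumPy sketch of the idea; the head counts and dimensions are illustrative, not Meta's actual configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k and v: (n_kv_heads, seq, d).
    Several query heads share each key/value head, shrinking the KV cache."""
    n_heads, seq_len, d = q.shape
    group = n_heads // n_kv_heads             # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                       # the KV head this query head reads
        scores = q[h] @ k[kv].T / np.sqrt(d)  # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = weights @ v[kv]
    return out

# Eight query heads sharing two KV heads: the key/value cache is 4x smaller
# than standard multi-head attention, where every query head has its own pair.
q = np.random.randn(8, 16, 64)
k = np.random.randn(2, 16, 64)
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 16, 64)
```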

The company’s engineers trained Llama 3.3 70B on a cluster of H100-80GB graphics cards from Nvidia Corp. The chips’ TDP, or thermal design power, a measure of the maximum power a processor is designed to draw, was set to the 700-watt maximum. Meta says that the LLM took 39.3 million graphics card-hours to train.
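Those two figures allow a quick back-of-the-envelope energy calculation. This is a sketch that assumes every GPU ran at its full 700-watt cap for the entire run, making the result an upper bound rather than a measured figure:

```python
gpu_hours = 39.3e6   # training time Meta reports for the run
tdp_kw = 0.7         # H100-80GB power cap, in kilowatts
energy_gwh = gpu_hours * tdp_kw / 1e6   # kWh -> GWh
print(f"upper-bound training energy: {energy_gwh:.1f} GWh")  # ~27.5 GWh
```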

The training dataset includes about 15 trillion tokens, units of data that each correspond to a few characters of text. Meta used information from the public web, as well as more than 25 million synthetic examples, AI-generated data points created specifically for LLM development purposes.
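For a sense of scale, here is a rough illustration using the common rule of thumb that one token averages about four characters of English text; the ratio is an assumption, not the Llama tokenizer's exact behavior.

```python
chars_per_token = 4            # rough rule of thumb, not an exact figure
corpus_tokens = 15e12          # the reported 15 trillion training tokens

text = "Meta trained Llama 3.3 70B on roughly 15 trillion tokens."
print(f"{len(text)} characters is about {len(text) / chars_per_token:.0f} tokens")

# At one byte per character, the corpus is on the order of tens of terabytes.
print(f"roughly {corpus_tokens * chars_per_token / 1e12:.0f} TB of raw text")
```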

After Meta completed the initial training process, it refined Llama 3.3 70B with several methods.

One of the techniques the company used is known as supervised fine-tuning. It involves providing a freshly developed LLM with additional datasets that it didn’t access during the initial training. Those datasets contain curated examples, such as prompts paired with high-quality reference responses, that make it easier for the LLM to find useful patterns.
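The core mechanic is next-token prediction over those curated prompt-and-response pairs. Here is a self-contained toy sketch of that training loop; the byte-level "tokenizer" and two-layer model are stand-ins for a real LLM, not Meta's actual pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 128  # treat ASCII bytes as the token vocabulary (toy stand-in)
model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# A curated prompt/response pair of the kind SFT datasets contain.
example = "Q: What is 2 + 2? A: 4"
ids = torch.tensor([[ord(c) % vocab for c in example]])

for _ in range(100):
    logits = model(ids[:, :-1])  # predict each next token in the sequence
    loss = F.cross_entropy(logits.reshape(-1, vocab), ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final next-token loss: {loss.item():.3f}")
```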

Meta also used another AI method known as reinforcement learning from human feedback, or RLHF. While an LLM is being trained this way, it receives automated feedback signals on how to improve the quality of its output. RLHF combines those automatically generated signals with ratings from human reviewers.
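A key ingredient is a reward model trained on human preference judgments, whose scores then steer the LLM's updates. Below is a minimal sketch of that preference-learning step, with random vectors standing in for real response representations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(8, 1)  # scores a response representation
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Stand-ins for responses human raters preferred vs. rejected.
chosen, rejected = torch.randn(64, 8), torch.randn(64, 8)

for _ in range(200):
    margin = reward_model(chosen) - reward_model(rejected)
    # Bradley-Terry loss: push preferred responses' scores above rejected ones'.
    loss = -F.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"mean score margin after training: {margin.mean().item():.2f}")
```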

After completing the development process, Meta compared Llama 3.3 70B with Llama 3.1 405B using 10 AI benchmarks. Llama 3.3 70B trailed its larger namesake by less than 2% in six of the tests and achieved higher scores in three. It also mostly outperformed OpenAI’s GPT-4o.

According to Meta, processing 1 million input tokens with Llama 3.1 405B costs $1, while generating 1 million output tokens requires $1.80’s worth of compute capacity. Llama 3.3 70B can manage the same tasks with 10 cents’ and 40 cents’ worth of infrastructure, respectively.
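Those per-token prices work out as follows; the 50/50 input/output split in the blended figure is an illustrative assumption, since real savings depend on a workload's mix:

```python
cost_405b = {"input": 1.00, "output": 1.80}  # $ per million tokens
cost_70b = {"input": 0.10, "output": 0.40}

print(f"input:  {cost_405b['input'] / cost_70b['input']:.0f}x cheaper")    # 10x
print(f"output: {cost_405b['output'] / cost_70b['output']:.1f}x cheaper")  # 4.5x

blend_405b = (cost_405b["input"] + cost_405b["output"]) / 2
blend_70b = (cost_70b["input"] + cost_70b["output"]) / 2
print(f"50/50 blend: {blend_405b / blend_70b:.1f}x cheaper")               # 5.6x
```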

Meta has made Llama 3.3 70B’s model weights available on Hugging Face.
