Chetan Menge

Posted on May 8 • Updated on May 10

101- LLM DBRX Instruct Model Serving- Saving Cost

#llm #databricks #beginners #cost

Started exploring and trying Databricks instruct LLM. Was going over the Databricks Marketplace and installed and served the model by following steps listed in the Sample provided Notebook.

Was able to serve model successfully and interacted with it very well by proving few prompts. Its was after couple of days, got realised when saw budget alert notification that, planned budget got exceeded way beyond.

Lesson Learned,

LLM Download from Market place is free
Serving LLM - Similar to cloud hosted resources cost saving, there is way to scale down served LLM endpoint when not in use
Model which is accessed from Marketplace can be serve using "Databricks Model Serving" approach which server Model as REST endpoint using serverless compute.

Please find below details with screenshot for reference, for downloading and serving DBRX Model.

Model Download

On Databricks Workspace portal, we can go to Marketplace and search for LLM. E.g search for DBRX models.

Model and its details will be shown as below,

You can select / Click on Get instant access, to download model into your environment.

Validation of Model in Unity Catalog

Once downloaded, model will be available in unity catalog as shown below

If its listed in unity catalog that means model got downloaded and available for use.

Serving Model thru Endpoint

You can go to unity catalog and select specific model e.g. dbrx_instruct. You can create the endpoint and server model by clicking the “Serve this model” button above in the model UI.

Below page will be prompted to select the configuration before serving the model

Saving Cost of Serving Model Endpoint

While serving the model , make sure to expand the Advance Configuration section, which has option of "Scale to Zero" Please refer below screenshot for the details.

If the "scale to zero" is not selected, the minimum charge will depend on the minimum provisioned concurrency specified by the chosen concurrency range.

If ‘scale to zero’ is selected, scale to zero happens automatically after 30 minutes of no requests, at which time the endpoint enters the fully scaled-to-zero (idle) state. You are not charged during this time period. When a new request is made, the endpoint exits this idle state and begins scaling up at which point you begin getting charged.

Reference :-

Model Serving Pricing | Databricks

Databricks Model Serving simplifies the deployment of machine learning models as APIs, enabling real-time predictions within seconds or milliseconds.

databricks.com

DEV Community

101- LLM DBRX Instruct Model Serving- Saving Cost

Lesson Learned,

Model Download

Validation of Model in Unity Catalog

Serving Model thru Endpoint

Saving Cost of Serving Model Endpoint

Reference :-

Model Serving Pricing | Databricks

Top comments (0)

Read next

Playwright Automation Commands

Introducing Krabber.net

LoReFT and pyreft for surgical fine-tuning

Data in: documents and indices