The Best LLM for Cyber Threat Intelligence: OpenAI, Anthropic, Groq

Nikoloz Kokhreidze

We compare AI services and LLMs for cyber threat intelligence to find the best one for speed, context and cost.


Imagine a project that uses AI language models to make threat information easier to understand. The goal is simple:

1. Create a structured way to store threat data (using JSON)
2. Feed in a threat report
3. Have the AI analyze the report and fill in the structured data
4. Use the structured data for further analysis

To make this work, we'll use Python and FastAPI to manage the process, and connect to AI services from companies like OpenAI, Anthropic, and Groq. Since analyzing threat reports can take some time, we'll use asynchronous functions (async) to keep things running smoothly.
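To make the later snippets easier to follow, here is a minimal sketch of how the asynchronous clients for the three services might be set up. The client variable names and the environment-variable approach are my own assumptions, not necessarily how the original project is wired.

import os
from openai import AsyncOpenAI
from anthropic import AsyncAnthropic
from groq import AsyncGroq

# Async clients for each provider (variable names are illustrative assumptions).
# API keys are read from environment variables to keep them out of the code.
async_openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
async_anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
async_groq_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])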

We want to make sure the AI gives us quick and accurate results, so we'll be testing several different AI language models:

- Two models from Anthropic (claude-3-opus-20240229 and claude-3-sonnet-20240229)
- Two models from OpenAI (gpt-4-0125-preview and gpt-3.5-turbo-0125)
- Two models running on Groq hardware (llama2-70b-4096 by Meta and mixtral-8x7b-32768 by Mistral AI)

We'll look at how fast each model is, how much it costs to use, and how well it understands the threat reports. The best model for this project will balance speed, price, and accuracy.

ℹ️
The code snippets in this post are redacted examples, included only to demonstrate the concepts being discussed. They're not meant to be complete, working code samples. If you'd like to see more comprehensive, detailed code examples that you can actually run and experiment with, please leave a comment on this post letting me know. I'd be happy to provide more in-depth code samples in a future post.

Preparing the JSON Structure and Functions

To identify the best AI service for our cybersecurity use case, we create a 43-line JSON structure containing threat information (attack type, TTPs, vulnerabilities, CVE IDs, etc.) and a Mandos Brief structure, with examples and details to assist the language model. By combining the JSON structure with a simple prompt to fill it out, we get a 1852-character, 191-word system message that sets clear expectations for the language model's output.
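For illustration, a heavily abbreviated sketch of what such a JSON skeleton could look like is shown below. The field names are assumptions for demonstration purposes and do not reproduce the actual 43-line structure used in the evaluation.

{
  "attack_type": "e.g. spear-phishing, supply chain compromise",
  "threat_actor": "name of the group, if attributed",
  "ttps": ["MITRE ATT&CK technique IDs, e.g. T1566.001"],
  "vulnerabilities": [
    {
      "cve_id": "CVE-YYYY-NNNNN",
      "description": "short description of the exploited vulnerability"
    }
  ],
  "recommendations": ["mitigations and detection guidance"],
  "brief_summary": "two to three sentence summary in the Mandos Brief style"
}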

Next, we provide content for the LLM to analyze and populate the JSON. We choose a joint cybersecurity advisory about APT28 from the FBI containing all the necessary items requested in the JSON. We copy the PDF body and save it as a file (.md or .txt), resulting in a 17499-character, 2234-word text.

With the content prepared, our next step is to create functions, starting with grab_content(). This async function is designed to consume a URL or file containing text and return the content, which we will use for both the system message and prompt.

ℹ️
I am using asynchronous functions in this post because my project has a much broader application. For the purposes of this blog, feel free to use functions that don't utilize asynchronous programming.
# Fetch content from a given URL or local file path.
import requests

async def grab_content(path_or_url):
    try:
        # Check if the input is a URL or a local file path
        if path_or_url.startswith(('http://', 'https://')):
            # The input is a URL; fetch the content from the web.
            # Note: requests is synchronous; in a fully async project you may prefer httpx or aiohttp.
            response = requests.get(path_or_url)
            response.raise_for_status()  # Raises an exception for HTTP errors
            content = response.text
        else:
            # The input is assumed to be a local file path; read the file content
            with open(path_or_url, 'r', encoding='utf-8') as file:
                content = file.read()

        return content
    except requests.RequestException as e:
        return f"Request failed: {e}"
    except FileNotFoundError as e:
        return f"File not found: {e}"
    except Exception as e:
        return f"An error occurred: {e}"

Next, we need to create functions for each AI service. To do this, we configure parameters such as temperature, top_p, frequency_penalty, presence_penalty, and max_tokens. For this evaluation, we set the temperature to 0 and use the same parameters across all AI services to keep the output as deterministic as possible and reduce the risk of hallucinations.

The example provided shows how to call the OpenAI API.

# Asynchronously calls the OpenAI API with the specified messages and model.
# async_openai_client is the AsyncOpenAI client initialized earlier.
async def call_openai_api_async(messages, model="gpt-4-0125-preview"):
    try:
        response = await async_openai_client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
            top_p=1,
            frequency_penalty=0.1,
            presence_penalty=0.1,
            max_tokens=2048
        )

        # The filled-out JSON returned by the model.
        incident_report = response.choices[0].message.content
        # Access usage information (prompt and completion token counts) using dot notation.
        usage_info = response.usage
        # The exact model that served the request, used for cost calculation.
        model_used = response.model

        print(usage_info)
        # ai_api_calculate_cost returns a dictionary with cost details.
        cost_data = ai_api_calculate_cost(usage_info, model=model_used)

        # Combine incident_report and cost_data
        combined_response = {
            "incidentReport": incident_report,
            "costData": cost_data
        }

        return combined_response

    except Exception as e:
        return f"An error occurred: {e}"

Let's break down how this function works. We supply the call_openai_api_async() function with the necessary messages and model parameters. This function sends an asynchronous request to the OpenAI API.

Once the API processes our request, it sends back a response. The call_openai_api_async() function parses this response and extracts two key pieces of information:
1. The filled-out JSON data
2. Usage information, which includes the number of tokens used for both the prompt and the response

The usage data is then passed to the ai_api_calculate_cost() function. This function takes the token usage information and calculates the cost in US dollars based on the pricing information provided by the AI service providers at the time of publishing (let me know in the comments if you want to see this function as well).

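The cost function itself isn't included in this post, but a minimal sketch of how ai_api_calculate_cost() could work is shown below. The per-token prices in the dictionary are illustrative placeholders, not the actual figures used in the evaluation; always check each provider's current pricing page.

# Illustrative sketch of a cost calculation helper.
# The prices below are placeholders (USD per 1K tokens); substitute the providers' current rates.
PRICING_PER_1K_TOKENS = {
    "gpt-4-0125-preview": {"prompt": 0.01, "completion": 0.03},      # placeholder values
    "gpt-3.5-turbo-0125": {"prompt": 0.0005, "completion": 0.0015},  # placeholder values
}

def ai_api_calculate_cost(usage_info, model):
    rates = PRICING_PER_1K_TOKENS.get(model)
    if rates is None:
        return {"error": f"No pricing configured for model: {model}"}

    prompt_cost = usage_info.prompt_tokens / 1000 * rates["prompt"]
    completion_cost = usage_info.completion_tokens / 1000 * rates["completion"]

    return {
        "model": model,
        "promptCost": round(prompt_cost, 6),
        "completionCost": round(completion_cost, 6),
        "totalCost": round(prompt_cost + completion_cost, 6),
    }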

Let's configure a function to trigger the process. While we could directly provide messages and config to call_openai_api_async(), we'll create a separate function as it will eventually serve as an API endpoint. This approach also allows our project to handle more extensive use cases than demonstrated in this example.

from fastapi import FastAPI, HTTPException

app = FastAPI()

# UrlBase is a simple Pydantic request model (its definition is omitted in this excerpt).
@app.post("/main/")
async def main(url: UrlBase):
    try:
        # Read system message (system prompt + JSON)
        system_content = await grab_content("system_message.md")

        # Read the article content
        article = await grab_content("test_article.md")

        # Messages for OAI and GROQ
        # NOTE: You will have to adapt this to the Anthropic API since it only recognizes "user" and "assistant" messages
        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": article}
        ]

        combined_response = await call_openai_api_async(messages)

        return combined_response

    except Exception as e:
        print(e)
        raise HTTPException(status_code=500, detail="There was an error on our side. Try again later.")
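Assuming the code lives in a module named main.py (my assumption, not stated in the original project), the app can be started locally with uvicorn main:app and the /main/ endpoint triggered with a POST request, for example from Postman, which we also use for timing in the evaluation below.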

Now that we have messages and calculation information, it's time to start evaluations.

Evaluation Methodology

Let's break down the evaluation criteria for the AI services:

  • Speed - We measure the time it takes from providing the article content to receiving the final results from the AI service. We won't stream the AI's response but will wait for the full response. We'll use Postman, the API testing tool, to get the time information.
  • Content Awareness - We assess how well the LLM recognizes the content and contextualizes it by filling out the JSON. I will manually review this.
  • Total Cost - We calculate how much we have to pay the AI service for the filled-out JSON. The total cost is the sum of the prompt and response costs.

Each test starts when we initiate the main() function and ends when we receive the response. We supply each LLM with the same prompt and article text. We then manually review the filled-out JSON and identify any shortcomings.
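If you want to capture timings in code rather than reading them from Postman, a minimal sketch like the one below (my own addition, not part of the original setup) can wrap an API call and record the elapsed wall-clock time:

import time

# Illustrative helper: wrap an API call and record the elapsed wall-clock time.
async def timed_call(messages):
    start = time.perf_counter()
    result = await call_openai_api_async(messages)
    elapsed_seconds = time.perf_counter() - start
    return {"elapsedSeconds": round(elapsed_seconds, 2), "response": result}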

Limitations and Caveats

Here are some important caveats and limitations to keep in mind as we evaluate the performance of these large language models (LLMs):
