Cybersecurity AI

The Best LLM for Cyber Threat Intelligence: OpenAI, Anthropic, Groq

We compare AI services and LLMs for cyber threat intelligence to find the best one for speed, context and cost.

21 min read

Imagine a project that uses AI language models to make threat information easier to understand. The goal is simple:

1. Create a structured way to store threat data (using JSON)
2. Feed in a threat report
3. Have the AI analyze the report and fill in the structured data
4. Use the structured data for further analysis

To make this work, we'll use Python and FastAPI to manage the process, and connect to AI services from companies like OpenAI, Anthropic, and Groq. Since analyzing threat reports can take some time, we'll use asynchronous functions (async) to keep things running smoothly.

We want to make sure the AI gives us quick and accurate results, so we'll be testing several different AI language models:

- Two models from Anthropic (claude-3-opus-20240229 and claude-3-sonnet-20240229)
- Two models from OpenAI (gpt-4-0125-preview and gpt-3.5-turbo-0125)
- Two models running on Groq hardware (llama2-70b-4096 by Meta and mixtral-8x7b-32768 by Mistral AI)

We'll look at how fast each model is, how much it costs to use, and how well it understands the threat reports. The best model for this project will balance speed, price, and accuracy.

ℹ️
The code snippets in this post are redacted examples, included only to demonstrate the concepts being discussed. They're not meant to be complete, working code samples. If you'd like to see more comprehensive, detailed code examples that you can actually run and experiment with, please leave a comment on this post letting me know. I'd be happy to provide more in-depth code samples in a future post.

Preparing the JSON Structure and Functions

To identify the best AI service for our cybersecurity use case, we create a 43-line JSON structure containing threat information (attack type, TTPs, vulnerabilities, CVE IDs, etc.) and the Mandos Brief structure, annotated with examples and details to guide the language model. Combining this JSON structure with a short prompt instructing the model to fill it out gives us a 1852-character, 191-word system message that sets clear expectations for the language model's output.
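While the full 43-line structure is redacted, an abbreviated sketch of its shape might look like this (field names and placeholder values here are illustrative, not the exact schema):

```json
{
  "threatReport": {
    "attackType": "e.g. botnet-enabled cyber espionage",
    "threatActors": ["APT28"],
    "ttps": ["T1110 Brute Force", "T1090 Proxy"],
    "vulnerabilities": [
      {"cveId": "CVE-2023-23397", "cvssScore": ""}
    ],
    "malware": [],
    "affectedCountries": [],
    "affectedIndustries": [],
    "mitigationInstructions": ""
  },
  "mandosBrief": {
    "title": "",
    "keyTakeaways": []
  }
}
```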

Next, we provide content for the LLM to analyze and populate the JSON. We choose a joint cybersecurity advisory about APT28 from the FBI containing all the necessary items requested in the JSON. We copy the PDF body and save it as a file (.md or .txt), resulting in a 17499-character, 2234-word text.

With the content prepared, our next step is to create functions, starting with grab_content(). This async function is designed to consume a URL or file containing text and return the content, which we will use for both the system message and prompt.

ℹ️
I am using asynchronous functions in this post because my project has a much broader application. For the purposes of this blog post, feel free to use functions that don't rely on asynchronous programming.
import requests

# Fetch content from a given URL or local file path.
async def grab_content(path_or_url):
    try:
        # Check whether the input is a URL or a local file path
        if path_or_url.startswith(('http://', 'https://')):
            # The input is a URL; fetch the content from the web.
            # Note: requests.get() is blocking -- swap in httpx or aiohttp
            # if you need the fetch itself to be fully asynchronous.
            response = requests.get(path_or_url)
            response.raise_for_status()  # Raises an exception for HTTP errors
            content = response.text
        else:
            # The input is assumed to be a local file path; read its content
            with open(path_or_url, 'r', encoding='utf-8') as file:
                content = file.read()
        
        return content
    except requests.RequestException as e:
        return f"Request failed: {e}"
    except FileNotFoundError as e:
        return f"File not found: {e}"
    except Exception as e:
        return f"An error occurred: {e}"

Next, we need to create functions for each AI service. To do this, we configure parameters such as temperature, top_p, frequency_penalty, presence_penalty, and max_tokens. For this evaluation, we set the same temperature and sampling parameters for all AI services to keep outputs deterministic and reduce hallucinations as much as possible.

The example provided shows how to call the OpenAI API.

# Asynchronously calls the OpenAI API with the specified messages and model.
async def call_openai_api_async(messages, model="gpt-4-0125-preview"):
    try:
        response = await async_openai_client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
            top_p=1,
            frequency_penalty=0.1,
            presence_penalty=0.1,
            max_tokens=2048
        )
        
        incident_report = response.choices[0].message.content
        # Access usage information using dot notation.
        usage_info = response.usage
        model_used = response.model
        
        print(usage_info)  # Debug: inspect token usage
        # Calculate the cost based on the usage;
        # ai_api_calculate_cost returns a dictionary with cost details.
        cost_data = ai_api_calculate_cost(usage_info, model=model_used)
        
        # Combine incident_report and cost_data
        combined_response = {
            "incidentReport": incident_report,
            "costData": cost_data
        }
        
        return combined_response
    
    except Exception as e:
        return f"An error occurred: {e}"

Let's break down how this function works. We supply the call_openai_api_async() function with the necessary messages and model parameters. This function sends an asynchronous request to the OpenAI API.

Once the API processes our request, it sends back a response. The call_openai_api_async() function parses this response and extracts two key pieces of information:
1. The filled-out JSON data
2. Usage information, which includes the number of tokens used for both the prompt and the response
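The filled-out JSON arrives as a string, and models sometimes wrap it in a markdown code fence. A small helper (my own addition for illustration, not part of the original project) can normalize that before further analysis:

```python
import json

def parse_incident_report(raw):
    """Parse the LLM's JSON output, tolerating a surrounding
    markdown code fence such as ```json ... ```."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (and optional language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)
```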

The usage data is then passed to the ai_api_calculate_cost() function, which converts the token counts into a cost in US dollars based on the pricing each AI service provider published at the time of writing (let me know in the comments if you want to see this function as well).
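As a rough illustration, ai_api_calculate_cost() might look something like the sketch below. The function body and rate table are my reconstruction, not the original implementation; the per-million-token rates reflect provider price lists at the time of writing and should be re-checked against current pricing:

```python
# Illustrative USD rates per million tokens -- verify against each
# provider's current price list before relying on these numbers.
PRICING = {
    "claude-3-opus-20240229":   {"prompt": 15.00, "completion": 75.00},
    "claude-3-sonnet-20240229": {"prompt": 3.00,  "completion": 15.00},
    "gpt-4-0125-preview":       {"prompt": 10.00, "completion": 30.00},
    "gpt-3.5-turbo-0125":       {"prompt": 0.50,  "completion": 1.50},
}

def ai_api_calculate_cost(usage_info, model):
    """Convert token usage into USD costs for the given model.
    usage_info may be an OpenAI-style usage object or a plain dict
    with prompt_tokens / completion_tokens."""
    if isinstance(usage_info, dict):
        prompt_tokens = usage_info["prompt_tokens"]
        completion_tokens = usage_info["completion_tokens"]
    else:
        prompt_tokens = usage_info.prompt_tokens
        completion_tokens = usage_info.completion_tokens

    rates = PRICING[model]
    prompt_cost = prompt_tokens * rates["prompt"] / 1_000_000
    completion_cost = completion_tokens * rates["completion"] / 1_000_000
    return {
        "promptCost": round(prompt_cost, 6),
        "completionCost": round(completion_cost, 6),
        "totalCost": round(prompt_cost + completion_cost, 6),
    }
```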

Let's configure a function to trigger the process. While we could directly provide messages and config to call_openai_api_async(), we'll create a separate function as it will eventually serve as an API endpoint. This approach also allows our project to handle more extensive use cases than demonstrated in this example.

@app.post("/main/")
async def main(url: UrlBase):
    try:
        # Read the system message (system prompt + JSON structure)
        system_content = await grab_content("system_message.md")

        # Read the article content
        article = await grab_content("test_article.md")
        
        # Messages for OpenAI and Groq.
        # NOTE: You will have to adapt this for the Anthropic API, since it
        # only recognizes "user" and "assistant" messages.
        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": article}
        ]
        
        combined_response = await call_openai_api_async(messages)
        
        return combined_response

    except Exception as e:
        print(e)
        raise HTTPException(status_code=500, detail="There was an error on our side. Try again later.")
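As the NOTE in the endpoint points out, Anthropic's Messages API only accepts "user" and "assistant" roles and takes the system prompt as a separate parameter. A small adapter along these lines (the function name is mine, not from the project) can reshape the OpenAI-style message list:

```python
def to_anthropic_messages(messages):
    """Split an OpenAI-style message list into the (system, messages)
    shape the Anthropic Messages API expects: the system prompt becomes
    a separate parameter, and only user/assistant turns stay in the list."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    chat = [m for m in messages if m["role"] in ("user", "assistant")]
    return system, chat

# Usage sketch with the Anthropic SDK (client setup omitted):
#   system, chat = to_anthropic_messages(messages)
#   response = await client.messages.create(
#       model="claude-3-opus-20240229", max_tokens=2048,
#       system=system, messages=chat)
```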

Now that we have messages and calculation information, it's time to start evaluations.

Evaluation Methodology

Let's break down the evaluation criteria for the AI services:

- Speed: the elapsed time from initiating the main() function to receiving the response.
- Cost: the total price of the API call, derived from prompt and completion token usage.
- Content Awareness: how accurately the model identifies and extracts key details from the report.

Each test supplies the LLM with the same system message and article text. We then manually review the filled-out JSON and identify any shortcomings.
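Timing can be captured with a simple wrapper around the call. This is a generic sketch using time.perf_counter, not necessarily how the original project measures it:

```python
import asyncio
import time

async def timed_call(coro):
    """Await a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage sketch (assumes call_openai_api_async from earlier):
#   response, seconds = await timed_call(call_openai_api_async(messages))
```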

Limitations and Caveats

Here are some important caveats and limitations to keep in mind as we evaluate the performance of these large language models (LLMs):

- The results reflect one specific prompt, set of instructions, and piece of content; different inputs could rank the models differently.
- Content awareness is judged manually, so that part of the assessment is inherently subjective.
- Measured speeds include network latency, which can vary between runs and providers.

AI Language Model Performance: Detailed Results

We send the same article content to each LLM and AI service and analyze the results. Here is how it goes down.

Evaluating Anthropic's Claude 3 Opus LLM for Cyber Threat Intelligence Effectiveness


Model: claude-3-opus-20240229

Speed (seconds): 52.30

Prompt Cost: $0.08853

Completion Cost: $0.070575

Total Cost: $0.159105

Content Awareness:

    "malware": [
      "Moobot botnet",
      "MASEPIE backdoor"
    ]
    "title": "Russian APT28 Compromises Ubiquiti Routers for Global Cyber Espionage",
    "keyTakeaways": [
      "APT28 has compromised Ubiquiti EdgeRouters worldwide to facilitate malicious cyber operations.",
      "The actors used the routers to harvest credentials, collect NTLMv2 digests, proxy traffic, and host malicious content.",
      "APT28 exploited the Microsoft Outlook vulnerability CVE-2023-23397 to collect NTLMv2 hashes.",
      "Mitigation requires factory reset of routers, firmware updates, credential changes, and firewall rules."
    ]
🔬
The Opus LLM exhibits strong content awareness as it accurately identifies key details from the report, such as the attack's global scope, the responsible threat actor group, the malware strains used, and the ATT&CK techniques employed. However, we notice that it fails to mention some additional affected countries and companies that are included in the report.

Evaluating Anthropic's Claude 3 Sonnet LLM for Cyber Threat Intelligence Effectiveness


Model: claude-3-sonnet-20240229

Speed (seconds): 26.83

Prompt Cost: $0.017706

Completion Cost: $0.014625

Total Cost: $0.032331

Content Awareness:

    "malware": [
      "Moobot OpenSSH trojan",
      "MASEPIE backdoor"
    ],
    "title": "Russian APT28 Compromises Ubiquiti Routers Globally",
    "keyTakeaways": [
      "Russian state hackers compromised Ubiquiti EdgeRouters to facilitate cyber espionage operations worldwide.",
      "The routers were used to harvest credentials, collect NTLMv2 hashes, proxy traffic, and host malicious infrastructure.",
      "A zero-day vulnerability in Microsoft Outlook (CVE-2023-23397) was exploited to leak NTLMv2 hashes.",
      "Impacted organizations span aerospace, energy, government, manufacturing, retail, and more across multiple countries."
    ]
🔬
Anthropic's Sonnet language model showcases robust content awareness: it precisely pinpoints essential information such as threat actor names, specific malware strains, and the ATT&CK techniques mentioned in the text. Despite this strength, it struggles with certain nuances, such as recognizing that the different threat actor names refer to a single group, and it overlooks one of the mitigation recommendations outlined in the report.

Evaluating OpenAI's GPT-4 LLM for Cyber Threat Intelligence Effectiveness


Model: gpt-4-0125-preview

Speed (seconds): 28.15

Prompt Cost: $0.05096

Completion Cost: $0.02046

Total Cost: $0.07142

Content Awareness:

"mitigationInstructions": "To remediate compromised EdgeRouters: perform a hardware factory reset, upgrade to the latest firmware version, change any default usernames and passwords, and implement strategic firewall rules on WAN-side interfaces. For CVE-2023-23397, update Microsoft Outlook and consider disabling NTLM or enabling server signing and Extended Protection for Authentication."
  },
    "title": "APT28 Exploits Compromised Ubiquiti EdgeRouters for Global Cyber Operations",
    "keyTakeaways": [
      "APT28 leverages compromised Ubiquiti EdgeRouters for malicious activities.",
      "CVE-2023-23397 exploited to collect NTLMv2 digests from targeted Outlook accounts.",
      "Recommend immediate remediation actions for EdgeRouter users to mitigate threats.",
      "Global industries targeted, highlighting the need for cross-sector cybersecurity vigilance."
    ]
🔬
GPT-4 shows a good grasp of the main points in the cyber threat intelligence report. However, there are a few details it overlooks, such as the full list of ATT&CK techniques and specific mitigation steps. This suggests that while GPT-4 has a solid foundation, there is still potential for it to enhance its ability to identify and include all the important information from the report.

Evaluating OpenAI's GPT-3.5 LLM for Cyber Threat Intelligence Effectiveness


Model: gpt-3.5-turbo-0125

Speed (seconds): 11.83

Prompt Cost: $0.002548

Completion Cost: $0.0010395

Total Cost: $0.003587

Content Awareness:

"mitigationInstructions": "Perform a hardware factory reset, upgrade to the latest firmware version, change default usernames and passwords, and implement strategic firewall rules on WAN-side interfaces."
    "title": "Russian State-Sponsored Cyber Actors Exploit Ubiquiti EdgeRouters",
    "keyTakeaways": [
      "Russian cyber actors used compromised Ubiquiti EdgeRouters for malicious cyber operations globally.",
      "Mitigate by performing a hardware factory reset, upgrading firmware, changing default credentials, and implementing firewall rules.",
      "CVE-2023-23397 is a critical vulnerability in Microsoft Outlook leaking Net-NTLMv2 hashes.",
      "APT28 actors utilized various techniques to collect credentials, proxy network traffic, and host malicious tools."
    ]
  }
🔬
GPT-3.5 showcases remarkable abilities in pulling out essential details from the cyber threat intelligence report, including the countries and companies impacted, the CVSS score, and the threat actors involved. However, it encounters challenges in offering comprehensive mitigation guidelines and recognizing all the ATT&CK techniques discussed in the report, emphasizing aspects where additional enhancements are possible.

Evaluating LLAMA LLM on Groq for Cyber Threat Intelligence Effectiveness


Model: llama2-70b-4096

Speed (seconds): 4.46

Prompt Cost: $0.0019138

Completion Cost: $0.00066

Total Cost: $0.002574

Content Awareness:

"mitigationInstructions": "Apply the latest software updates and security patches to vulnerable systems, and implement strategic firewall rules on WAN-side interfaces."
        "title": "Russian Cyber Actors Use Compromised Routers to Facilitate Cyber Operations",
        "keyTakeaways": [
            "Russian state-sponsored cyber actors have compromised Ubiquiti EdgeRouters worldwide to facilitate malicious cyber operations.",
            "The actors have used the compromised routers to collect credentials, proxy network traffic, and host spear-phishing landing pages and custom tools.",
            "The U.S. Department of Justice, including the FBI, and international partners have recently disrupted a GRU botnet consisting of compromised EdgeRouters."
        ]
    }
🔬
The LLAMA model struggles to accurately extract and represent the key details from the cyber threat intelligence report. While it manages to identify some relevant information like affected companies, threat actors, and mitigation instructions, its output is incomplete and contains inaccuracies, such as hallucinated CVSS scores and a truncated list of countries and ATT&CK techniques.

Evaluating Mixtral 8x7b LLM on Groq for Cyber Threat Intelligence Effectiveness


Model: mixtral-8x7b-32768

Speed (seconds): 3.76

Prompt Cost: $0.00176337

Completion Cost: $0.00026919

Total Cost: $0.002033

Content Awareness:

"mitigationInstructions": "Update Microsoft Outlook to the latest version and disable NTLM when feasible, or enable server signing and Extended Protection for Authentication configurations."
    },
        "title": "Russian Cyber Actors Use Compromised Routers for Global Operations",
        "keyTakeaways": [
            "Russian state-sponsored cyber actors have been using compromised Ubiquiti EdgeRouters globally for malicious operations.",
            "The targeted industries include Aerospace & Defense, Education, Energy & Utilities, Governments, Hospitality, Manufacturing, Oil & Gas, Retail, Technology, and Transportation.",
            "The vulnerability CVE-2023-23397 is a zero-day in Microsoft Outlook on Windows that allows the leakage of Net-NTLMv2 hashes to actor-controlled infrastructure.",
            "Update Microsoft Outlook to the latest version and disable NTLM when feasible, or enable server signing and Extended Protection for Authentication configurations."
        ]
🔬
The Mixtral 8x7b model shows a mixed performance in capturing key details from the cyber threat intelligence report. It accurately identifies some critical information like the CVSS score and version, and provides a good vulnerability description. However, the model misses several affected countries, companies, and mitigation instructions that are important to fully understand and address the cyber threat.

Comparative Analysis: Identifying the Best AI Language Models

Based on the evaluation results, let's highlight the best LLM for cyber threat intelligence for each category.

Fastest AI Language Model for Cyber Threat Intelligence

  1. Mixtral 8x7b LLM on Groq: 3.76 seconds
  2. LLAMA LLM on Groq: 4.46 seconds
  3. OpenAI's GPT-3.5: 11.83 seconds
  4. Anthropic's Claude 3 Sonnet: 26.83 seconds
  5. OpenAI's GPT-4: 28.15 seconds
  6. Anthropic's Claude 3 Opus: 52.30 seconds

Most Cost-Effective AI Language Model for Cyber Threat Intelligence

  1. Mixtral 8x7b LLM on Groq: $0.002033
  2. LLAMA LLM on Groq: $0.002574
  3. OpenAI's GPT-3.5: $0.003587
  4. Anthropic's Claude 3 Sonnet: $0.032331
  5. OpenAI's GPT-4: $0.07142
  6. Anthropic's Claude 3 Opus: $0.159105

AI Language Model with the Best Content Awareness for Cyber Threat Intelligence

  1. Anthropic's Claude 3 Opus: Accurately identified global scope, threat actor group, malware strains, and ATT&CK techniques. Missed some additional affected countries and companies.
  2. OpenAI's GPT-4: Solid understanding of key elements but missed some details like the complete list of ATT&CK techniques and certain mitigation instructions.
  3. Anthropic's Claude 3 Sonnet: Accurately identified key details like threat actors, malware strains, and ATT&CK techniques. Struggled with nuances and missed one mitigation recommendation.
  4. OpenAI's GPT-3.5: Impressive capabilities in extracting key information but struggled to provide complete mitigation instructions and identify all ATT&CK techniques.
  5. Mixtral 8x7b LLM on Groq: Mixed performance, accurately identified some critical information but missed several affected countries, companies, and mitigation instructions.
  6. LLAMA LLM on Groq: Struggled with accurately extracting and representing key details, output was incomplete and contained inaccuracies.

In summary, the Mixtral 8x7b LLM on Groq and LLAMA LLM on Groq excel in speed and cost, while Anthropic's Claude 3 Opus and Sonnet models demonstrate the best content awareness. OpenAI's GPT-4 and GPT-3.5 show solid performance in content awareness but lag behind in speed and cost compared to the Groq-based models.

Conclusion and Future Directions

In this evaluation, we compared the performance of large language models (LLMs) from Anthropic, OpenAI, Meta, and Mistral AI in the context of cyber threat intelligence. By assessing each model's speed, content awareness, and total cost, we gained valuable insights into their suitability for specific cybersecurity use cases.

The results show that while all models demonstrate a degree of content awareness, there are notable differences in their ability to accurately identify and contextualize key information from the provided threat report.

Anthropic's Claude 3 Opus and OpenAI's GPT-4 exhibit strong content awareness, while OpenAI's GPT-3.5 and Mixtral's 8x7b models provide accurate CVSS scores and versions, despite this information not being explicitly mentioned in the report.

In terms of speed and cost, LLMs running on Groq's custom chips (LLAMA and Mixtral) process the content significantly faster and at lower costs compared to their counterparts, making them attractive options for organizations looking to optimize their cybersecurity workflows.

However, it is crucial to consider the limitations and caveats of this evaluation, such as the specific prompt, instructions, and content used, as well as the subjective nature of the assessment and potential differences in latency.

Moving forward, further research and development in prompt engineering, model fine-tuning, and specialized training data could potentially enhance the performance of these language models in cybersecurity threat intelligence tasks.

Additionally, exploring hybrid approaches that combine the strengths of different models or leverage ensemble techniques could yield even more robust and accurate threat analysis capabilities.

FAQ

What are AI language models, and how can they be used for cyber threat intelligence?

AI language models, such as GPT-4, GPT-3.5, and Claude, are advanced machine learning models that can understand and generate human-like text. These models can be used to analyze cyber threat reports, extract relevant information, and provide structured data for further analysis, helping organizations better understand and respond to cyber threats.

How do AI language models compare to traditional methods of cyber threat intelligence?

Traditional methods of cyber threat intelligence often involve manual analysis of threat reports by cybersecurity experts. AI language models can automate this process, quickly analyzing large volumes of data and providing structured insights. This can help organizations save time and resources while improving the speed and accuracy of threat detection and response.

What factors should be considered when choosing an AI language model for cyber threat intelligence?

When selecting an AI language model for cyber threat intelligence, consider factors such as:

  • Speed: How quickly can the model analyze threat reports and provide insights?
  • Content Awareness: How accurately can the model identify and extract relevant information from threat reports?
  • Cost: What are the costs associated with using the model, including API usage and computational resources?
  • Integration: How easily can the model be integrated into your existing cybersecurity workflows and tools?
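To weigh these factors against the measured results above, a simple weighted scoring pass can help shortlist candidates. This is a hypothetical example, not part of the original project; the weights are arbitrary, and content awareness is qualitative, so it is left to manual judgment:

```python
# Measured results from this evaluation: (seconds, total USD cost).
RESULTS = {
    "mixtral-8x7b-32768":       (3.76,  0.002033),
    "llama2-70b-4096":          (4.46,  0.002574),
    "gpt-3.5-turbo-0125":       (11.83, 0.003587),
    "claude-3-sonnet-20240229": (26.83, 0.032331),
    "gpt-4-0125-preview":       (28.15, 0.07142),
    "claude-3-opus-20240229":   (52.30, 0.159105),
}

def rank_models(results, speed_weight=0.5, cost_weight=0.5):
    """Rank models by a weighted sum of normalized speed and cost.
    Lower raw values are better, so each value is normalized against
    the best (minimum) in its category; the best model scores 1.0."""
    min_speed = min(s for s, _ in results.values())
    min_cost = min(c for _, c in results.values())
    scores = {
        model: speed_weight * (min_speed / s) + cost_weight * (min_cost / c)
        for model, (s, c) in results.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```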

Can AI language models replace human cybersecurity experts?

AI language models are powerful tools that can augment and support the work of human cybersecurity experts, but they cannot replace them entirely (at least not yet). Human expertise is still essential for interpreting the insights provided by AI models, making strategic decisions, and handling complex or novel threats.

How can organizations ensure the security and privacy of data when using AI language models for cyber threat intelligence?

When using AI language models for cyber threat intelligence, organizations should:

  • Use secure API connections and authentication mechanisms to protect data in transit
  • Ensure that the AI service providers have robust security and privacy practices in place
  • Anonymize or pseudonymize sensitive data before feeding it into the AI models
  • Regularly monitor and audit AI model usage to detect and prevent unauthorized access or misuse

What are the limitations of using AI language models for cyber threat intelligence?

Some limitations of using AI language models for cyber threat intelligence include:

  • Dependence on the quality and relevance of the training data used to build the models
  • Potential for biases or inaccuracies in the model's outputs, especially for complex or novel threats
  • Limited ability to understand context or nuance in threat reports, which may require human interpretation
  • Possible vulnerability to adversarial attacks or manipulation of input data to deceive the models

How can organizations get started with using AI language models for cyber threat intelligence?

To get started with using AI language models for cyber threat intelligence, organizations can:

  • Identify the specific use cases and requirements for threat intelligence within their organization
  • Evaluate and select appropriate AI language models based on factors such as speed, content awareness, cost, and integration
  • Develop and test workflows for integrating AI language models into their existing cybersecurity processes and tools
  • Train and educate cybersecurity teams on how to effectively use and interpret the insights provided by AI language models
  • Continuously monitor and assess the performance of AI language models and iterate on their usage as needed

Whenever you're ready, there are 3 ways I can help you:

  1. Work with me. I love helping people! Let's discuss your challenges, career, or ask me anything about cybersecurity in 25 minutes.
  2. Explore solutions with me. Need cybersecurity strategy and execution for your startup or scale-up? Let's achieve tangible outcomes together.
  3. Looking for something different? Reach out.

If this sparked your interest, I'd love to hear from you in the comments. Stay tuned for more and consider following me on LinkedIn and X.

Nikoloz
