Anvilogic on Databricks Architecture

Anvilogic implementation on Databricks (AWS, Azure, GCP).

Anvilogic is a Databricks Technology Partner and Built-On Partner.

Architecture Diagram

Below is the generic architecture diagram showing how Anvilogic works on top of Databricks.

Note: This architecture supports Databricks on AWS, Azure, and GCP.

Diagram: Reference architecture for Anvilogic on Databricks (AWS, Azure, or GCP)

ETL Parsing & Normalization Process

The ETL process for moving data from raw format to Gold tables.


Frequently Asked Questions (FAQs)

Does it matter which public cloud I own?

No. Databricks is configured in the IaaS environment you already have and is available across AWS, Azure, and GCP.

Data that originates in IaaS and can be sent to cloud storage does not require a streaming tool; it can be onboarded to Databricks directly.

What type of Databricks Compute is required?

Anvilogic requires two types of Databricks compute to run:

  1. SQL Warehouse - Compute for ad-hoc queries to support search, hunting, and incident response (IR)

  2. All-Purpose/Job Compute - Used to execute scheduled detection workflow jobs and for our low-code detection builder

For both, you can use either serverless or classic compute; serverless is the default and is highly recommended for better performance, scalability, and cost efficiency.
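For illustration, here is a minimal sketch of an ad-hoc query against a SQL Warehouse using the open-source databricks-sql-connector package. The hostname, HTTP path, token, and table/column names are placeholders, and the query is a hypothetical hunt example rather than Anvilogic's actual schema.

```python
# Minimal sketch: ad-hoc query against a Databricks SQL Warehouse via the
# databricks-sql-connector package. Hostname, HTTP path, token, and the
# gold table/columns are placeholders, not Anvilogic's real schema.
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",          # placeholder
    access_token="<personal-access-token>",                  # placeholder
) as conn:
    with conn.cursor() as cur:
        # Hypothetical hunt query over the last 24 hours of endpoint events.
        cur.execute(
            "SELECT host, process_name, COUNT(*) AS cnt "
            "FROM gold.endpoint_process "
            "WHERE event_time >= current_timestamp() - INTERVAL 24 HOURS "
            "GROUP BY host, process_name "
            "ORDER BY cnt DESC LIMIT 20"
        )
        for row in cur.fetchall():
            print(row)
```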

How do detection use cases execute in Databricks?

Detections execute as jobs within a Workflow. Rules built on the Anvilogic platform are converted from a user-friendly SQL builder into PySpark functions that run on a defined schedule.
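As a rough illustration of what such a converted rule might look like when run as a scheduled PySpark job (the tables, columns, and detection logic below are hypothetical, not Anvilogic's actual generated code):

```python
# Illustrative only: a detection rule expressed as PySpark, run on a schedule.
# Tables, columns, and logic are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Look back over the assumed 15-minute scheduling interval.
events = (
    spark.table("gold.endpoint_process")
    .where(F.col("event_time") >= F.expr("current_timestamp() - INTERVAL 15 MINUTES"))
)

# Example detection logic: encoded PowerShell command lines.
hits = events.where(
    F.lower(F.col("process_name")).contains("powershell")
    & F.col("command_line").rlike(r"(?i)-enc(odedcommand)?\s")
)

# Append findings to a detection output table for downstream triage.
(hits.withColumn("detection_id", F.lit("example_rule_0001"))
     .write.mode("append")
     .saveAsTable("avl.detection_events"))
```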

How do I onboard logs from data center assets?

Datasets from assets hosted in a data center, or otherwise outside a public IaaS environment, require a solution to route that data to Databricks.

Data streaming tools (ex. Cribl, Fluentbit, Apache NiFi) can be used to send on-premises logs directly to Databricks.

You must have a data transport/streaming tool to send data to IaaS storage or to Anvilogic pipelines for ingestion.

Forwarding agents installed on endpoints also need to be reconfigured to send to the streaming tools for ingestion into Databricks.

Neither Databricks nor Anvilogic provides data streaming or endpoint agent technology.

How do you get data that originates in public cloud into Databricks?

Python Notebooks are used to collect data from storage and transform raw events into the AVL detection schema, preferably using Lakeflow Pipelines (formerly known as Delta Live Tables).
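A minimal sketch of that step, assuming JSON events land in cloud object storage and using Auto Loader inside a Lakeflow (Delta Live Tables) pipeline; the path and table name are placeholders:

```python
# Sketch of a Lakeflow/DLT ingestion step; runs inside a pipeline where the
# `spark` session is provided. Path and table name are placeholders.
import dlt

@dlt.table(name="bronze_cloudtrail", comment="Raw events loaded from storage")
def bronze_cloudtrail():
    # Auto Loader incrementally discovers new files as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://your-bucket/cloudtrail/")  # placeholder path
    )
```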

Can Anvilogic help with getting raw data into Databricks?

Yes. If you have a streaming tool (ex. Cribl, Fluentbit, Apache NiFi), you can send custom data sources directly to your primary cloud storage (ex. S3, Azure Blob Storage), and Anvilogic can orchestrate the ETL process into the correct schema and tables required for detection.

Does Anvilogic help with parsing and normalization of raw data into Databricks?

Yes, Anvilogic helps with all of the parsing and normalization of security-relevant data into the Anvilogic schema.

We have onboarding templates and configurations that help ensure the data you bring into Databricks is properly formatted for executing detections and performing triage, hunting, and response.

All data parsing, normalization, and enrichment is done in the Python Notebook section of the diagram above.
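For example, here is a hedged sketch of the parsing step: splitting a two-column Bronze table (time, raw) into structured Silver columns. The JSON layout and field names are assumptions for illustration only.

```python
# Sketch of parsing a raw two-column Bronze table into structured Silver
# columns. The JSON schema and field names are illustrative assumptions.
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

raw_schema = StructType([
    StructField("host", StringType()),
    StructField("process_name", StringType()),
    StructField("command_line", StringType()),
])

@dlt.table(name="silver_endpoint", comment="Parsed, structured endpoint events")
def silver_endpoint():
    return (
        dlt.read_stream("bronze_endpoint")
        .withColumn("parsed", F.from_json(F.col("raw"), raw_schema))
        .select("time", "parsed.*")
    )
```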

What is the difference between Bronze, Silver, and Gold tables?

Anvilogic leverages Databricks Lakeflow Pipelines to assist in the ETL process of parsing, normalization, and enrichment.

  • Bronze Tables - Unparsed and unstructured data, usually in 2 columns (time and raw).

  • Silver Tables - Parsed and structured data; this is usually where raw data is separated into multiple columns (normalization and enrichment can also occur here).

  • Gold Tables - Normalized and enriched security data feeds that have been organized into tables based on their security domain (ex. Endpoint, Cloud, Network, etc.).

Each feed can be customized based on your organization's preferences.
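Continuing the hypothetical pipeline sketched earlier, a Silver-to-Gold step might normalize and enrich an endpoint feed like this; the asset-inventory join and all names are illustrative, not Anvilogic's actual code.

```python
# Sketch of a Silver-to-Gold step: normalize and enrich an endpoint feed.
# Table names and the asset-inventory join are illustrative only.
import dlt

@dlt.table(name="gold_endpoint", comment="Normalized, enriched endpoint feed")
def gold_endpoint():
    events = dlt.read_stream("silver_endpoint")   # parsed, structured rows
    assets = dlt.read("silver_asset_inventory")   # static enrichment lookup
    return (
        events.join(assets, on="host", how="left")  # add owner/criticality
        .select("time", "host", "owner", "criticality",
                "process_name", "command_line")
    )
```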

Does Anvilogic have out-of-the-box integrations for specific vendors' alert sources?

Yes, Anvilogic provides out-of-the-box integrations for common vendor alerts and data collection from specific SaaS security tools (ex. CrowdStrike FDR).

Tools not listed in our integration marketplace can be sent through the Custom Data Integration pipeline as a self-service option.

What is the difference between raw data and alert data?

Raw data sources are events/telemetry generated by endpoints, tools, and appliances (ex. Windows Event Logs, EDR logs).

Alert data consists of curated signals from security tools (ex. Proofpoint alerts, antivirus alerts) that the vendor has already identified as suspicious or malicious.

Do you integrate with SOAR?

Yes, Anvilogic can integrate with most SOARs via REST API through either a push or a pull method.
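A hypothetical pull-model sketch using Python's requests library is shown below. The URLs, auth header, and payload shape are invented placeholders; consult your SOAR's and Anvilogic's actual API documentation.

```python
# Hypothetical pull-model SOAR integration. All URLs, headers, and payload
# fields are invented placeholders for illustration.
import requests

resp = requests.get(
    "https://anvilogic.example.com/api/v1/alerts",    # placeholder URL
    headers={"Authorization": "Bearer <api-token>"},  # placeholder token
    params={"since": "2024-01-01T00:00:00Z", "status": "open"},
    timeout=30,
)
resp.raise_for_status()

# Push each alert into the SOAR's ingestion endpoint. (A push model would
# invert this: alerts get posted to the SOAR via webhook as they fire.)
for alert in resp.json().get("alerts", []):
    requests.post(
        "https://soar.example.com/api/ingest",        # placeholder URL
        json=alert,
        timeout=30,
    ).raise_for_status()
```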

Does Anvilogic have a search user interface (UI) for Databricks?

Yes, Anvilogic has a search user interface (UI) that makes it easy to query data inside a Databricks catalog.

In addition, Anvilogic makes it easy to build repeatable detections that can execute on top of Databricks using a low-code UI builder.

Does Anvilogic have a data model? Does it work with OCSF?

Yes, Anvilogic has a data model and offers parsing and normalization code for any security data set that you want to use within the platform.

Yes, we can also work with OCSF data, and each data feed can be modified and controlled to fit your needs.
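As an illustration of what an OCSF mapping step could look like: the source field paths below (time, src_endpoint.ip, actor.process.name) follow the public OCSF schema, while the target table and column names are invented for this example.

```python
# Illustrative OCSF mapping sketch; target names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ocsf = spark.table("silver_ocsf_events")  # placeholder table name
normalized = ocsf.select(
    # OCSF timestamps are epoch milliseconds; convert to a Spark timestamp.
    (F.col("time") / 1000).cast("timestamp").alias("event_time"),
    F.col("src_endpoint.ip").alias("src_ip"),
    F.col("actor.process.name").alias("process_name"),
)
normalized.write.mode("append").saveAsTable("avl.normalized_events")  # placeholder
```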

Does Anvilogic support IOC collection & searching?

Yes, Anvilogic can onboard IOCs from your third-party threat intel tools (ex. ThreatConnect) and use that data to create new detections, conduct ongoing exposure checks across your data feeds, or enrich your alert output for triage analysts.
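A rough sketch of what such an exposure check could look like in PySpark; all table and column names here are hypothetical.

```python
# Sketch of an IOC exposure check: join onboarded indicators against a gold
# network feed. All table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

iocs = spark.table("avl.threat_intel_iocs")   # e.g., synced from ThreatConnect
network = spark.table("gold.network_traffic")

# Flag outbound connections to known-bad destination IPs.
exposures = (
    network.join(iocs, network.dest_ip == iocs.indicator_value, "inner")
    .where(iocs.indicator_type == "ip")
    .select("event_time", "src_ip", "dest_ip", "indicator_source")
)
exposures.write.mode("append").saveAsTable("avl.ioc_exposures")
```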
