Anvilogic on Databricks Architecture

Anvilogic implementation on Databricks (AWS, Azure, GCP).

Anvilogic is a Databricks Technology Partner and Built-On Partner.

Architecture Diagram

Below is the generic architecture diagram showing how Anvilogic works on top of Databricks.

Note: This architecture supports Databricks on AWS, Azure, and GCP.

Diagram: Reference architecture for Anvilogic on Databricks (AWS, Azure, or GCP)

ETL Parsing & Normalization Process

The ETL process for moving data from raw format to Gold tables.


Frequently Asked Questions (FAQs)

Does it matter which public cloud I own?

No. Databricks is configured in the IaaS environment you already have and is available across AWS, Azure, and GCP.

Data that originates in IaaS and can be sent to cloud storage does not require a streaming tool; it can be onboarded to Databricks directly.

What type of Databricks Compute is required?

Anvilogic requires two types of Databricks compute to run:

  1. SQL Warehouse - Compute for ad-hoc queries to support search, hunting, and incident response (IR)

  2. All-Purpose/Job Compute - Used to execute scheduled detection workflow jobs and for our low-code detection builder

For both, you can use either serverless or classic compute; serverless is the default and is highly recommended for better performance, scalability, and cost efficiency.
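For illustration, here is a minimal sketch of an ad-hoc query against a SQL Warehouse using the open-source databricks-sql-connector package. The hostname, HTTP path, token, and table/column names are placeholders, and the query is a hypothetical hunt example rather than Anvilogic's actual schema.

```python
# Minimal sketch: ad-hoc query against a Databricks SQL Warehouse via the
# databricks-sql-connector package. Hostname, HTTP path, token, and the
# gold table/columns are placeholders, not Anvilogic's real schema.
from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",          # placeholder
    access_token="<personal-access-token>",                  # placeholder
) as conn:
    with conn.cursor() as cur:
        # Hypothetical hunt query over the last 24 hours of endpoint events.
        cur.execute(
            "SELECT host, process_name, COUNT(*) AS cnt "
            "FROM gold.endpoint_process "
            "WHERE event_time >= current_timestamp() - INTERVAL 24 HOURS "
            "GROUP BY host, process_name "
            "ORDER BY cnt DESC LIMIT 20"
        )
        for row in cur.fetchall():
            print(row)
```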

How do detection use cases execute in Databricks?

Detections execute as jobs within a Workflow. Rules built on the Anvilogic platform are converted from a user-friendly SQL builder into PySpark functions that run on a defined schedule.
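As a rough illustration of what such a converted rule might look like when run as a scheduled PySpark job (the tables, columns, and detection logic below are hypothetical, not Anvilogic's actual generated code):

```python
# Illustrative only: a detection rule expressed as PySpark, run on a schedule.
# Tables, columns, and logic are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Look back over the assumed 15-minute scheduling interval.
events = (
    spark.table("gold.endpoint_process")
    .where(F.col("event_time") >= F.expr("current_timestamp() - INTERVAL 15 MINUTES"))
)

# Example detection logic: encoded PowerShell command lines.
hits = events.where(
    F.lower(F.col("process_name")).contains("powershell")
    & F.col("command_line").rlike(r"(?i)-enc(odedcommand)?\s")
)

# Append findings to a detection output table for downstream triage.
(hits.withColumn("detection_id", F.lit("example_rule_0001"))
     .write.mode("append")
     .saveAsTable("avl.detection_events"))
```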

How do I onboard logs from data center assets?

Datasets from assets hosted in a data center, or otherwise outside a public IaaS environment, require a solution to route that data to Databricks.

Data streaming tools (ex. Cribl, Fluentbit, Apache NiFi) can be used to send on-premises logs directly to Databricks.

You must have a data transport/streaming tool to send data to IaaS storage or to Anvilogic pipelines for ingestion.

Forwarding agents installed on endpoints also need to be reconfigured to send to the streaming tools for ingestion into Databricks.

Neither Databricks nor Anvilogic provides data streaming or endpoint agent technology.

How do you get data that originates in public cloud into Databricks?

Python Notebooks are used to collect data from storage and transform raw events into the AVL detection schema, preferably using Lakeflow Pipelines (formerly known as Delta Live Tables).
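A minimal sketch of that step, assuming JSON events land in cloud object storage and using Auto Loader inside a Lakeflow (Delta Live Tables) pipeline; the path and table name are placeholders:

```python
# Sketch of a Lakeflow/DLT ingestion step; runs inside a pipeline where the
# `spark` session is provided. Path and table name are placeholders.
import dlt

@dlt.table(name="bronze_cloudtrail", comment="Raw events loaded from storage")
def bronze_cloudtrail():
    # Auto Loader incrementally discovers new files as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://your-bucket/cloudtrail/")  # placeholder path
    )
```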

Can Anvilogic help with getting raw data into Databricks?

Yes. If you have a streaming tool (ex. Cribl, Fluentbit, Apache NiFi), you can send custom data sources directly to your primary cloud storage (ex. S3, Azure Blob Storage), and Anvilogic can orchestrate the ETL process into the correct schema and tables required for detection.

Does Anvilogic help with parsing and normalization of raw data into Databricks?

Yes, Anvilogic helps with all of the parsing and normalization of security-relevant data into the Anvilogic schema.

We have onboarding templates and configurations that help ensure the data you bring into Databricks is properly formatted for executing detections and performing triage, hunting, and response.

All data parsing, normalization, and enrichment is done in the Python Notebook section of the diagram above.
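For example, here is a hedged sketch of the parsing step: splitting a two-column Bronze table (time, raw) into structured Silver columns. The JSON layout and field names are assumptions for illustration only.

```python
# Sketch of parsing a raw two-column Bronze table into structured Silver
# columns. The JSON schema and field names are illustrative assumptions.
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

raw_schema = StructType([
    StructField("host", StringType()),
    StructField("process_name", StringType()),
    StructField("command_line", StringType()),
])

@dlt.table(name="silver_endpoint", comment="Parsed, structured endpoint events")
def silver_endpoint():
    return (
        dlt.read_stream("bronze_endpoint")
        .withColumn("parsed", F.from_json(F.col("raw"), raw_schema))
        .select("time", "parsed.*")
    )
```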

What is the difference between Bronze, Silver, and Gold tables?

Anvilogic leverages Databricks Lakeflow Pipelines to assist in the ETL process of parsing, normalization, and enrichment.

  • Bronze Tables - Unparsed and unstructured data, usually in 2 columns (time and raw).

  • Silver Tables - Parsed and structured data; this is usually where raw data is separated into multiple columns (normalization and enrichment can also occur here).

  • Gold Tables - Normalized and enriched security data feeds that have been organized into tables based on their security domain (ex. Endpoint, Cloud, Network, etc.).

Each feed can be customized based on your organization's preferences.
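Continuing the hypothetical pipeline sketched earlier, a Silver-to-Gold step might normalize and enrich an endpoint feed like this; the asset-inventory join and all names are illustrative, not Anvilogic's actual code.

```python
# Sketch of a Silver-to-Gold step: normalize and enrich an endpoint feed.
# Table names and the asset-inventory join are illustrative only.
import dlt

@dlt.table(name="gold_endpoint", comment="Normalized, enriched endpoint feed")
def gold_endpoint():
    events = dlt.read_stream("silver_endpoint")   # parsed, structured rows
    assets = dlt.read("silver_asset_inventory")   # static enrichment lookup
    return (
        events.join(assets, on="host", how="left")  # add owner/criticality
        .select("time", "host", "owner", "criticality",
                "process_name", "command_line")
    )
```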

Does Anvilogic have out-of-the-box integrations for specific vendors' alert sources?

Yes, Anvilogic provides out-of-the-box integrations for common vendor alerts and data collection from specific SaaS security tools (ex. CrowdStrike FDR).

Tools not listed in our integration marketplace can be sent through the Custom Data Integration pipeline as a self-service option.

What is the difference between raw data and alert data?

Raw data sources are events/telemetry generated by endpoints, tools, and appliances (ex. Windows Event Logs, EDR logs).

Alert data consists of curated signals from security tools (ex. Proofpoint alerts, antivirus alerts) that the vendor has already identified as suspicious or malicious.

Do you integrate with SOAR?

Yes, Anvilogic can integrate with most SOARs via REST API through either a push or a pull method.
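A hypothetical pull-model sketch using Python's requests library is shown below. The URLs, auth header, and payload shape are invented placeholders; consult your SOAR's and Anvilogic's actual API documentation.

```python
# Hypothetical pull-model SOAR integration. All URLs, headers, and payload
# fields are invented placeholders for illustration.
import requests

resp = requests.get(
    "https://anvilogic.example.com/api/v1/alerts",    # placeholder URL
    headers={"Authorization": "Bearer <api-token>"},  # placeholder token
    params={"since": "2024-01-01T00:00:00Z", "status": "open"},
    timeout=30,
)
resp.raise_for_status()

# Push each alert into the SOAR's ingestion endpoint. (A push model would
# invert this: alerts get posted to the SOAR via webhook as they fire.)
for alert in resp.json().get("alerts", []):
    requests.post(
        "https://soar.example.com/api/ingest",        # placeholder URL
        json=alert,
        timeout=30,
    ).raise_for_status()
```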

Does Anvilogic have a search user interface (UI) for Databricks?

Yes, Anvilogic has a search user interface (UI) that makes it easy to query data inside a Databricks catalog.

In addition, Anvilogic makes it easy to build repeatable detections that can execute on top of Databricks using a low-code UI builder.

Does Anvilogic have a data model? Does it work with OCSF?

Yes, Anvilogic has a data model and offers parsing and normalization code for any security data set that you want to use within the platform.

Yes, we can also work with OCSF data, and each data feed can be modified and controlled to fit your needs.
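As an illustration of what an OCSF mapping step could look like: the source field paths below (time, src_endpoint.ip, actor.process.name) follow the public OCSF schema, while the target table and column names are invented for this example.

```python
# Illustrative OCSF mapping sketch; target names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ocsf = spark.table("silver_ocsf_events")  # placeholder table name
normalized = ocsf.select(
    # OCSF timestamps are epoch milliseconds; convert to a Spark timestamp.
    (F.col("time") / 1000).cast("timestamp").alias("event_time"),
    F.col("src_endpoint.ip").alias("src_ip"),
    F.col("actor.process.name").alias("process_name"),
)
normalized.write.mode("append").saveAsTable("avl.normalized_events")  # placeholder
```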

Does Anvilogic support IOC collection & searching?

Yes, Anvilogic can onboard IOCs from your third-party threat intel tools (ex. ThreatConnect) and use that data to create new detections, conduct ongoing exposure checks across your data feeds, or enrich your alert output for triage analysts.
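A rough sketch of what such an exposure check could look like in PySpark; all table and column names here are hypothetical.

```python
# Sketch of an IOC exposure check: join onboarded indicators against a gold
# network feed. All table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

iocs = spark.table("avl.threat_intel_iocs")   # e.g., synced from ThreatConnect
network = spark.table("gold.network_traffic")

# Flag outbound connections to known-bad destination IPs.
exposures = (
    network.join(iocs, network.dest_ip == iocs.indicator_value, "inner")
    .where(iocs.indicator_type == "ip")
    .select("event_time", "src_ip", "dest_ip", "indicator_source")
)
exposures.write.mode("append").saveAsTable("avl.ioc_exposures")
```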
