Ingestion Engine

The first phase of the Hunters SOC pipeline is getting all of your organization’s security data to the Security Data Lake. The Ingestion Engine is responsible for this phase, which consists of collection (e.g. access to a security product’s REST API), transformation, and insertion into the data lake.

Security Data Lake

Hunters SOC utilizes a Security Data Lake for raw data storage and analysis.
Hunters currently supports Snowflake as a data lake, which can be either hosted on Hunters’ Snowflake account, or connected to the customer’s Snowflake account in “Partner Connect” mode.

Data Flow

A Data Flow defines how your data should be available to Hunters. It consists of three key components:

  • Data Type: Which data should be ingested. It defines the schema to which the data should conform.

  • Source: Where the data should be ingested from, including the source identity and properties necessary to access this source (such as host address, username and relevant access keys).

  • Sink: Where the data should be stored once ingested. By default, the ingested data is stored in a secure data lake accessible only to the Hunters SOC platform. A sales representative would be happy to elaborate on dedicated alternatives.
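The three components above can be sketched as a simple data model. This is an illustrative sketch only; the class and field names are hypothetical and do not reflect the actual Hunters configuration schema.

```python
from dataclasses import dataclass

# Hypothetical model of a Data Flow's three components.
# All names here are illustrative, not the real Hunters API.

@dataclass
class Source:
    kind: str        # e.g. "aws-s3", "rest-api", "syslog"
    host: str        # host address of the source
    username: str
    access_key: str  # credential needed to access the source

@dataclass
class Sink:
    kind: str        # e.g. "snowflake"
    database: str

@dataclass
class DataFlow:
    data_type: str   # the schema the ingested records must conform to
    source: Source
    sink: Sink

flow = DataFlow(
    data_type="aws_cloudtrail",
    source=Source(kind="aws-s3", host="s3.amazonaws.com",
                  username="ingest", access_key="<key>"),
    sink=Sink(kind="snowflake", database="RAW"),
)
print(flow.data_type)  # aws_cloudtrail
```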

All active data flows are monitored and displayed in the Data Flows tab in the Hunters portal, including the amount of raw data being ingested over time.

Data Type

A Data type represents the data structure of records of a certain type. Hunters supports a variety of data types, belonging to a wide range of security and monitoring products, from both cloud and on-premise ecosystems.

A Data type contains a superset of all the fields that could be received in events of that type, though not every field has to exist in every event. Under the hood, Hunters stores the collected data per data type and generates threat signals, enriches them, and enables advanced investigations based on the semantics of their data types. The more data types collected, the more comprehensive and accurate the threat hunting will be.
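The superset relationship can be illustrated with a minimal sketch: a data type declares every field an event might carry, while each event fills in only a subset. The schema and events below are made up for illustration.

```python
# Hypothetical data type: the superset of fields events of this
# type may carry (illustrative field names only).
SCHEMA = {"eventName", "eventTime", "sourceIPAddress", "userAgent"}

# Two events of the same data type, each with a different subset of fields.
events = [
    {"eventName": "StartInstances", "eventTime": "2014-03-06T21:22:54Z"},
    {"eventName": "StopInstances", "sourceIPAddress": "203.0.113.7"},
]

# Every event conforms to the schema without filling every field.
conforms = all(set(event) <= SCHEMA for event in events)
print(conforms)  # True
```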

Data types supported in Hunters are grouped by the Product they relate to. For example, the AWS Product consists of the CloudTrail, VPC Flow Logs and Config Snapshots data types, while Okta consists of many data types, such as users, groups, apps, users-to-groups, users-to-apps and groups-to-apps. Note that not every product consists of multiple data types; products like Windows Event Logs and osquery, for example, supply only a single data type.

A data type can be supplied from various Sources, such as Cloud Storage (AWS S3, Azure Blob Storage), a vendor's REST API or even syslog streams. In order to ship the events of a specific data type into Hunters from a specific source, a new Data Flow has to be configured.


Product is a logical grouping of Data Types. It usually corresponds to a commercial product like AWS, Azure, Okta, etc.


Sink is where the data resides after it has been ingested and is made available for the Hunters SOC platform. The Sink is provisioned as part of the infrastructure according to the particular Hunters SOC deployment (e.g., Snowflake Partner Connect).


Source is where data originates from. It can be an API, an S3 bucket, a Snowflake table, etc.

Connecting to a source usually requires credentials, such as an authentication token or a username and password combination.

Upon selecting a source, you must fill in the required credentials and test the connection.

Configuring a Data Flow with a file storage source (such as AWS S3 or Azure Blob Storage) adds a File Format attribute for each data type, since the Hunters SOC ingestion engine must be told how to decode the incoming files.

File Format

Data types come in a variety of formats, such as JSON, CSV, XML and Syslog. Each file format requires a different parsing strategy, and choosing the correct one is critical for Hunters to successfully structure the data type.

In this section, we will go over the supported file formats:


NDJSON

New line delimited JSON (NDJSON) is a common format used to group multiple JSON objects, one per line, in a single file.


{"text": "hello and welcome to Hunters.AI!", "name": "Hunters"}
{"text": "this is an NDJSON", "name": "example"}

As you can see from the above example, we have two JSON objects delimited by a newline. Note that even though there are multiple JSON objects in the file, they should all contain the same object structure.
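Because each line is an independent JSON object, NDJSON can be parsed by splitting on newlines and decoding each line separately. A minimal sketch using the example above:

```python
import json

# The NDJSON sample from above: one JSON object per line.
ndjson = (
    '{"text": "hello and welcome to Hunters.AI!", "name": "Hunters"}\n'
    '{"text": "this is an NDJSON", "name": "example"}'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
print(records[0]["name"])  # Hunters
print(records[1]["name"])  # example
```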

CSV with Header

Comma separated values, also known as CSV, is a tabular format similar to a database table. It contains columns and rows, where each column represents a single field, usually of a specific type. CSV files may or may not have a header row, typically the first row of the file, specifying the name of each column.
At Hunters, we currently support only CSV files which contain a header as the first row.

Additionally, a CSV may use a comma as its separator, as the name states, but may also use other separators, such as '|' or '^'. Field values that themselves contain the separator (for example, a row where the name is 'sandra,') must be escaped, traditionally with double quotes, to denote that the field contains the separator character. At Hunters we support any CSV-spec-compliant separator.


name, age, height
david, 35, 180
sandra, 30, 180
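The header-row and quote-escaping behavior described above can be sketched with Python's standard csv module. The data here is a variant of the sample table in which the first name contains an embedded comma, so it must be double-quoted:

```python
import csv
import io

# A CSV with a header row; "sandra," contains the separator,
# so it is escaped with double quotes.
data = 'name,age,height\n"sandra,",30,180\ndavid,35,180\n'

# DictReader uses the first row as the header, mapping each
# subsequent row to a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["name"])  # sandra,
print(rows[1]["age"])   # 35
```

For a non-comma separator such as '|', the same code works by passing `delimiter="|"` to `csv.DictReader`.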

AWS Format

AWS has a custom JSON layout used by its services that emit log data, such as AWS Config and CloudTrail. The format looks as follows:

    "Records": [{
        "eventVersion": "1.0",
        "userIdentity": {
            "type": "IAMUser",
            "principalId": "EX_PRINCIPAL_ID",
            "arn": "arn:aws:iam::123456789012:user/Alice",
            "accessKeyId": "EXAMPLE_KEY_ID",
            "accountId": "123456789012",
            "userName": "Alice"
        "eventTime": "2014-03-06T21:22:54Z",
        "eventSource": "",
        "eventName": "StartInstances",
        "awsRegion": "us-east-2",
        "sourceIPAddress": "",
        "userAgent": "ec2-api-tools",
        "requestParameters": {
            "instancesSet": {
                "items": [{
                    "instanceId": "i-ebeaf9e2"
        "responseElements": {
            "instancesSet": {
                "items": [{
                    "instanceId": "i-ebeaf9e2",
                    "currentState": {
                        "code": 0,
                        "name": "pending"
                    "previousState": {
                        "code": 80,
                        "name": "stopped"

The format contains the key "Records" as its single top-level field, whose value is a JSON array of event objects with varying structures.
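Parsing this layout amounts to decoding the file as one JSON document and unwrapping the "Records" envelope into individual events. A minimal sketch, using a trimmed-down payload in the same shape:

```python
import json

# A trimmed example of the AWS envelope: a single top-level
# "Records" key holding an array of event objects.
payload = json.loads(
    '{"Records": ['
    '{"eventName": "StartInstances", "awsRegion": "us-east-2"},'
    '{"eventName": "StopInstances", "awsRegion": "us-east-2"}'
    ']}'
)

# Unwrap the envelope; each element is one event record.
events = payload["Records"]
names = [event["eventName"] for event in events]
print(names)  # ['StartInstances', 'StopInstances']
```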