
# Prepare event data

Before creating an [Event Stitching](/identity-resolution/event-stitching.md) project, make sure your event models contain the fields DinMo needs to process events safely and explain the resulting event profile graph.

Event Stitching works best when each input model is an event table: one row represents one event that happened at a specific time.

## What you need

| Requirement            | Why it matters                                                                                                                         |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Event models           | Tables or models where each row is an event, such as page views, sessions, app events, purchases, or conversions.                      |
| Model primary key      | Stable identifier for one event row. DinMo uses the model primary key for idempotency and audit.                                       |
| Timestamp field        | Standard timestamp field on the model. DinMo uses it to order observations and evaluate stitching lifetime.                            |
| Event partition column | Date or timestamp column used to select complete event windows efficiently.                                                            |
| Event identifiers      | Columns that carry identity evidence, such as user ID, email, anonymous ID, cookie ID, device ID, session ID, click ID, or IP address. |
| Output permissions     | Permission to create or replace Event Stitching output tables in the configured output dataset or schema.                              |

## Good input models

Good Event Stitching inputs are behavioral tables:

* web events
* app events
* product usage events
* sessions
* purchases
* conversions
* support interactions
* campaign interactions

Avoid profile-like tables such as contacts, accounts, users, leads, subscribers, and customers. Event Stitching expects event-grain models.

## Model primary key

The model primary key must identify one event row inside a source model.

Good primary keys are:

* stable across reruns
* non-null
* unique inside the event model
* not derived from mutable fields

If two selected models can produce the same primary key value, DinMo still keeps them separate by source model.

Use this query to check for duplicate primary keys in one model:

```sql
select
  primary_key,
  count(*) as row_count
from `project.dataset.events`
where primary_key is not null
group by primary_key
having count(*) > 1
order by row_count desc
limit 100;
```
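The duplicate check above skips null keys, so it can be useful to summarize nulls and uniqueness in one pass. The following is a minimal sketch that reuses the placeholder table and `primary_key` column from the query above; adapt the names to your model.

```sql
-- Summary of primary key quality for one event model.
-- If the key is non-null and unique, null_key_count is 0 and
-- distinct_key_count equals row_count.
select
  count(*) as row_count,
  countif(primary_key is null) as null_key_count,
  count(distinct primary_key) as distinct_key_count
from `project.dataset.events`;
```

A gap between `row_count - null_key_count` and `distinct_key_count` means some non-null keys repeat, which the duplicate query above then lists by value.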

## Timestamp field

The model timestamp field should represent when the event happened, not when the row was loaded into the warehouse.

Check for:

* null timestamps
* future timestamps
* timestamps far outside the expected backfill range
* inconsistent timezone handling

This query reports the null count and the observed timestamp range:

```sql
select
  count(*) as row_count,
  countif(timestamp_field is null) as null_timestamp_count,
  min(timestamp_field) as min_event_timestamp,
  max(timestamp_field) as max_event_timestamp
from `project.dataset.events`;
```
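The checks listed in this section for future, out-of-range, and timezone-shifted timestamps can be sketched as follows. This assumes `timestamp_field` is a `TIMESTAMP` column; the `2020-01-01` lower bound is a hypothetical backfill start that you should replace with your own.

```sql
-- Count timestamps in the future or before the expected backfill start.
-- The lower bound is a hypothetical value used for illustration only.
select
  countif(timestamp_field > current_timestamp()) as future_timestamp_count,
  countif(timestamp_field < timestamp '2020-01-01 00:00:00+00') as before_backfill_start_count
from `project.dataset.events`;

-- Events per UTC hour of day. An activity peak at an unexpected hour can
-- hint that local times were stored as UTC, or the reverse.
select
  extract(hour from timestamp_field) as utc_hour,
  count(*) as event_count
from `project.dataset.events`
group by utc_hour
order by utc_hour;
```

The hour-of-day distribution is only a heuristic: it suggests where to look, and does not prove a timezone problem on its own.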

## Event partition column

The event partition column is the field DinMo uses to select bounded windows.

Use a column that:

* is present on every selected event model
* is a date or timestamp field
* follows the same calendar as the event timestamp
* lets DinMo select complete windows, usually days
* matches the physical partitioning or clustering strategy of the source table when possible

The best default is usually the event timestamp itself, or a derived event date that is physically partitioned in the warehouse.
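If your warehouse is BigQuery, one way to see how a source table is physically partitioned and clustered is to query the dataset's `INFORMATION_SCHEMA.COLUMNS` view. This is a sketch under that assumption; `project.dataset` and the `events` table name are placeholders, and other warehouses expose this metadata differently.

```sql
-- List columns of the events table with partitioning and clustering metadata.
-- is_partitioning_column is 'YES' for the partition column;
-- clustering_ordinal_position is non-null for clustering columns.
select
  column_name,
  data_type,
  is_partitioning_column,
  clustering_ordinal_position
from `project.dataset`.INFORMATION_SCHEMA.COLUMNS
where table_name = 'events'
order by ordinal_position;
```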

Check daily volume before creating the project:

```sql
select
  date(event_partition_column) as event_date,
  count(*) as event_count
from `project.dataset.events`
where date(event_partition_column) >= date_sub(current_date(), interval 30 day)
group by event_date
order by event_date desc;
```

Days with no events do not appear in this result, so look for missing dates rather than zero counts. If many days are missing, the project can still run, but the schedule and backfill window should match the real event cadence.
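To make empty days explicit, a variant of the daily volume query can join the counts to a generated calendar. This is a sketch in BigQuery SQL using the same placeholder table and column; `calendar` and `daily` are illustrative names.

```sql
-- Build one row per day for the last 30 days, then attach event counts.
-- Days without events appear with event_count = 0.
with calendar as (
  select event_date
  from unnest(
    generate_date_array(date_sub(current_date(), interval 30 day), current_date())
  ) as event_date
),
daily as (
  select
    date(event_partition_column) as event_date,
    count(*) as event_count
  from `project.dataset.events`
  where date(event_partition_column) >= date_sub(current_date(), interval 30 day)
  group by event_date
)
select
  calendar.event_date,
  coalesce(daily.event_count, 0) as event_count
from calendar
left join daily
  on calendar.event_date = daily.event_date
order by calendar.event_date desc;
```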

## Identifier fields

Identifiers are the values DinMo can use to connect events.

Common identifiers:

| Identifier                                    | Typical strength | Notes                                                                              |
| --------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------- |
| User ID, customer ID, account ID              | Strong           | Good anchor candidates when they are authenticated and stable.                     |
| Email, email hash, phone                      | Strong to medium | Good when standardized and governed.                                               |
| Anonymous ID, cookie ID, device ID, client ID | Medium to weak   | Useful for pre-login behavior; protect with lifetime and profile-per-value limits. |
| Session ID                                    | Weak             | Use for short windows only.                                                        |
| Click IDs                                     | Weak             | Useful for attribution windows; avoid long lifetimes.                              |
| IP address                                    | Weakest          | Use only with strict policy or for audit.                                          |

Do not map campaign fields, page URLs, product names, country, channel, `utm_*` fields, or free-text values as identifiers unless they are intentionally used as identity evidence.

Check identifier coverage before setup:

```sql
select
  count(*) as event_count,
  countif(user_id is not null and trim(cast(user_id as string)) != '') as user_id_count,
  countif(email is not null and trim(cast(email as string)) != '') as email_count,
  countif(anonymous_id is not null and trim(cast(anonymous_id as string)) != '') as anonymous_id_count,
  countif(session_id is not null and trim(cast(session_id as string)) != '') as session_id_count
from `project.dataset.events`
where date(event_partition_column) >= date_sub(current_date(), interval 30 day);
```
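Per-column coverage does not show how many events carry no identifier at all. The sketch below counts those events, assuming the same four example columns as the query above; add or remove columns to match the identifiers you plan to map.

```sql
-- Count events where every candidate identifier is null or blank.
-- nullif(..., '') turns blank strings into null so both cases are treated alike.
select
  count(*) as event_count,
  countif(
    nullif(trim(cast(user_id as string)), '') is null
    and nullif(trim(cast(email as string)), '') is null
    and nullif(trim(cast(anonymous_id as string)), '') is null
    and nullif(trim(cast(session_id as string)), '') is null
  ) as no_identifier_count
from `project.dataset.events`
where date(event_partition_column) >= date_sub(current_date(), interval 30 day);
```

Events with no usable identifier cannot contribute identity evidence, so a high `no_identifier_count` is worth investigating before setup.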

## Placeholder and polluted values

Bad values should be blocked before the first production run.

Common examples:

* `unknown`
* `undefined`
* `null`
* `none`
* `test`
* `00000000-0000-0000-0000-000000000000`
* empty strings after standardization

Find common low-quality values:

```sql
select
  lower(trim(cast(anonymous_id as string))) as value,
  count(*) as event_count
from `project.dataset.events`
where anonymous_id is not null
group by value
order by event_count desc
limit 100;
```

Add placeholder values to blocked values in the [Identifier policy](/identity-resolution/event-stitching/event-stitching-identifier-policy.md).
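To measure how often the example placeholders listed above actually occur, a targeted count can complement the top-values query. This sketch checks `anonymous_id` only and uses the example values from this page; extend the list with placeholders specific to your tracking setup.

```sql
-- Count events whose anonymous_id is one of the known placeholder values,
-- after the same lower/trim standardization used in the query above.
select
  lower(trim(cast(anonymous_id as string))) as value,
  count(*) as event_count
from `project.dataset.events`
where lower(trim(cast(anonymous_id as string))) in (
  'unknown',
  'undefined',
  'null',
  'none',
  'test',
  '00000000-0000-0000-0000-000000000000',
  ''
)
group by value
order by event_count desc;
```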

## Shared and corrupted weak identifiers

Weak identifiers can create unsafe connections when one value appears across many strong identities.

Before trusting a weak identifier, check whether it behaves like a shared or corrupted value:

```sql
select
  anonymous_id,
  count(*) as event_count,
  count(distinct user_id) as distinct_user_id_count
from `project.dataset.events`
where anonymous_id is not null
  and user_id is not null
group by anonymous_id
having count(distinct user_id) > 5
order by distinct_user_id_count desc, event_count desc
limit 100;
```

If this query returns many high-count values, configure a lower `Max profiles per value`, shorten the stitching lifetime, or block known bad values.
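When choosing a limit, it can help to see the whole distribution rather than only the worst values. The following sketch groups `anonymous_id` values by how many distinct `user_id` values they touch; `per_value` is an illustrative name, and the result is an input to your own judgment, not a recommended threshold.

```sql
-- For each anonymous_id, count the distinct user_ids seen with it,
-- then count how many anonymous_ids fall into each bucket.
with per_value as (
  select
    anonymous_id,
    count(distinct user_id) as distinct_user_id_count
  from `project.dataset.events`
  where anonymous_id is not null
    and user_id is not null
  group by anonymous_id
)
select
  distinct_user_id_count,
  count(*) as anonymous_id_count
from per_value
group by distinct_user_id_count
order by distinct_user_id_count;
```

A long tail of values linked to many user IDs suggests shared devices or corrupted values, while a distribution concentrated at 1 or 2 suggests the identifier behaves as expected.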

## Multiple event models

An Event Stitching project can process several event models from the same source.

For each selected model:

* choose an event partition column
* map available physical fields to logical identifiers
* use the same logical identifier when two models carry the same type of value
* leave unrelated fields unmapped

Example:

| Model         | Physical field | Logical identifier |
| ------------- | -------------- | ------------------ |
| `web_events`  | `user_id`      | User ID            |
| `web_events`  | `anonymous_id` | Anonymous ID       |
| `conversions` | `customer_id`  | User ID            |
| `conversions` | `email_hash`   | Email hash         |

This lets DinMo evaluate events from several models as one event profile graph.
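Before mapping two physical fields to the same logical identifier, it can be worth confirming that they carry the same kind of value. The sketch below checks how many `customer_id` values from `conversions` also appear as `user_id` in `web_events`, following the example mapping above. The fully qualified table names are hypothetical, and this is a sanity check you run yourself, not a description of how DinMo stitches events.

```sql
-- Overlap between conversions.customer_id and web_events.user_id.
-- Values are cast to string so differing column types still compare.
select
  count(distinct c.customer_id) as conversion_user_count,
  count(distinct w.user_id) as matched_in_web_events_count
from `project.dataset.conversions` as c
left join (
  select distinct user_id
  from `project.dataset.web_events`
  where user_id is not null
) as w
  on cast(c.customer_id as string) = cast(w.user_id as string)
where c.customer_id is not null;
```

A very low match count may mean the two fields use different ID spaces or formats, in which case mapping them to the same logical identifier would not connect events as intended.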

## Readiness checklist

Before creating the project, confirm:

* selected models are event-grain tables
* each model has a stable primary key
* model timestamp fields are populated and plausible
* each model has a date or timestamp event partition column
* the main identifiers have meaningful coverage
* weak identifiers do not obviously create large shared clusters
* placeholder values are known and can be blocked
* source tables are partitioned or clustered in a way that supports efficient window scans
* the output dataset or schema can be written by DinMo

Then continue with [Create an Event Stitching project](/identity-resolution/event-stitching/event-stitching-create-project.md).

