intake/README.md

88 lines
5.4 KiB
Markdown

# intake
Intake is an arbitrary feed aggregator that generalizes the concept of a feed. Rather than being restricted to parsing items out of an RSS feed, Intake provides a middle layer of executing arbitrary programs that conform to a JSON-based specification. An Intake source can parse an RSS feed, but it can also scrape a website without a feed, provide additional logic to filter or annotate feed items, or integrate with an API.
A basic demonstration in a VM can be run with `nixos-shell` using the `#demo` flake attribute.
## Feed source definitions
The base Intake directory is `$XDG_DATA_HOME/intake`. Each feed source's data is contained within a subdirectory of the base directory. The name of the feed source is the name of the subdirectory.
Feed source directories have the following structure:
```
intake
|- <source name>
| |- intake.json
| |- state
| |- <item id>.item
| |- <item id>.item
| |- ...
|- <source name>
| | ...
| ...
```
`intake.json` must be present; the other files are optional. Each `.item` file contains the data for one feed item. `state` provides a file for the feed source to write arbitrary data, e.g. JSON or binary data.
`intake.json` has the following structure:
```json
{
"action": {
"fetch": {
"args": ["program name", "and", "list", "of", "arguments"]
},
"<action name>": {
"args": "..."
}
},
"env": {
"...": "..."
},
"cron": "* * * * *"
}
```
Each key under `action` defines an action that can be taken for the source. A source must at least have a `fetch` action. If an action named `on_create` is defined for the source, it is executed once for an item when that item is created, that is, the first time the item is returned from the source.
Each key under `env` defines an environment variable that will be set when `fetch` or other actions are executed.
If `cron` is present, it must define a crontab schedule. Intake will automatically create crontab entries to update each source according to its cron schedule.
## Interface for source programs
Intake interacts with sources by executing the actions defined in the source's `intake.json`. The `fetch` action is required and used to check for new feed items when `intake update` is executed.
To execute an action, intake executes the command given by `args`. The process's working directory is set to the source's folder, i.e. the folder containing `intake.json`. The process's environment is as follows:
* intake's environment is inherited.
* `STATE_PATH` is set to the absolute path of `state`.
* Each key in `env` in `config.json` is passed with its value.
Anything written to `stderr` by the process will be captured and logged by Intake.
The `fetch` action is used to fetch the current state of the feed source. It receives no input and should write feed items to `stdout` as JSON objects, each on one line. All other actions are taken in the context of a single item. These actions receive the item as a JSON object on the first line of `stdin`. The process should write the item back to `stdout` with any changes as a result of the action.
An item must have a key under `action` with that action's name to support executing that action for that item. The value under that key may be any JSON structure used to manage the item-specific state.
All input and output is treated as UTF-8. If an item cannot be parsed or the exit code of the process is nonzero, Intake will consider the action to be a failure. No items or other feed changes will happen as a result of a failed action, except for changes to `state` done by the action process.
## Top-level item fields
| Field name | Specification | Description |
| ---------- | ------------- | ----------- |
| `id` | **Required** | A unique identifier within the scope of the feed source. |
| `created` | **Automatic** | The Unix timestamp at which intake first processed the item. |
| `active` | **Automatic** | Whether the item is active. Inactive items are not displayed in channels. |
| `title` | Optional | The title of the item. If an item has no title, `id` is used as a fallback title.
| `author` | Optional | An author name associated with the item. Displayed in the item footer.
| `body` | Optional | Body text of the item as raw HTML. This will be displayed in the item without further processing! Consider your sources' threat models against injection attacks.
| `link` | Optional | A hyperlink associated with the item.
| `time` | Optional | A time associated with the item, not necessarily when the item was created. Feeds sort by `time` when it is defined and fall back to `created`. Displayed in the item footer.
| `tags` | Optional | A list of tags that describe the item. Tags help filter feeds that contain different kinds of content.
| `tts` | Optional | The time-to-show of the item. An item with `tts` defined is hidden from channel feeds until the current time is after `created + tts`.
| `ttl` | Optional | The time-to-live of the item. An item with `ttl` defined is not deleted by feed updates as long as `created + ttl` is in the future, even if it is inactive.
| `ttd` | Optional | The time-to-die of the item. An item with `ttd` defined is deleted by feed updates if `created + ttd` is in the past, even if it is active.
| `action` | Optional | An object with keys for all supported actions. The schema of the values depends on the source.