
# How to onboard an existing host into pyinfra

There's no auto-generator that reads a running system and emits a
deploy.py: not in pyinfra, not in Ansible, not in the other mainstream
config-management tools. Separating what's intentional from what's
incidental takes human judgment. Below is the workflow that actually works.

## What pyinfra gives you for free

The **facts** API queries live state without writing any deploy:

```sh
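# "host" is a placeholder hostname; an @ prefix selects a connector (e.g. @local)
pyinfra host fact deb.DebPackages           # installed packages
pyinfra host fact apt.AptSources            # extra apt repos
pyinfra host fact systemd.SystemdStatus     # which units are running
pyinfra host fact systemd.SystemdEnabled    # which units are enabled
pyinfra host fact server.Crontab            # cron
pyinfra host fact server.Users              # users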
```

Output is JSON. There's no fact → op translator, but the data is easy
to grep through and tells you what's actually on the box.
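
As a small illustration of working with that data, the sketch below cross-references a saved `deb.DebPackages` result with the output of `apt-mark showmanual`, keeping only the packages someone asked for explicitly. It assumes the fact data is a mapping of package name to a list of installed versions; `manual_packages` and the sample data are hypothetical.

```python
import json


def manual_packages(fact_json: str, showmanual_output: str) -> dict:
    """Keep only the packages from the fact data that apt marks as manual."""
    # Assumption: the fact data maps package name -> list of installed versions.
    installed = json.loads(fact_json)
    manual = {line.strip() for line in showmanual_output.splitlines() if line.strip()}
    return {name: installed[name] for name in sorted(installed) if name in manual}


# Made-up sample data, for illustration only.
sample_fact = '{"libc6": ["2.35-0ubuntu3"], "nginx": ["1.18.0-6"], "htop": ["3.0.5-7"]}'
sample_manual = "htop\nnginx\n"

picked = manual_packages(sample_fact, sample_manual)
print(picked)
```

Dependencies such as `libc6` drop out, which is usually what you want before deciding what belongs in a deploy.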

## Practical workflow

1. Run a scanner against the live host that dumps the *meaningful*
   state — packages explicitly installed, enabled services, custom
   configs, dotfiles, sudoers fragments, fstab non-defaults. Skip the
   distro-default noise.
2. Hand-pick the bits worth managing into a new `<station>/deploy.py`.
3. Run it with `--dry` against the live box; every change pyinfra
   proposes is a place where the deploy and the box still disagree.
4. Iterate until `--dry` proposes no changes.
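
Step 2 can start from a generated draft that you then prune by hand. The sketch below renders a minimal `deploy.py` from lists you have already hand-picked. `render_deploy` and the sample package and service names are hypothetical, and the emitted `apt.packages` and `systemd.service` calls use only their basic arguments. It is not a fact-to-op translator; it only saves typing once the judgment calls are made.

```python
def render_deploy(packages: list, services: list) -> str:
    """Render a draft deploy.py from hand-picked packages and services."""
    lines = [
        "from pyinfra.operations import apt, systemd",
        "",
        "apt.packages(",
        '    name="Install hand-picked packages",',
        f"    packages={sorted(packages)!r},",
        ")",
    ]
    for service in sorted(services):
        lines += [
            "",
            "systemd.service(",
            f'    name="Enable and start {service}",',
            f"    service={service!r},",
            "    running=True,",
            "    enabled=True,",
            ")",
        ]
    return "\n".join(lines) + "\n"


# Made-up picks, for illustration only.
draft = render_deploy(["nginx", "htop"], ["nginx.service"])
compile(draft, "deploy.py", "exec")  # syntax check only; nothing is executed
print(draft)
```

Write the result to `<station>/deploy.py`, then continue with the `--dry` loop from steps 3 and 4.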

## One-shot scanner

```sh
ssh you@host '
echo "=== manually-installed packages ==="
apt-mark showmanual
echo "=== enabled services ==="
systemctl list-unit-files --state=enabled --type=service --no-legend | awk "{print \$1}"
echo "=== extra apt repos ==="
ls /etc/apt/sources.list.d/
echo "=== sudoers fragments ==="
ls /etc/sudoers.d/
echo "=== fstab non-default ==="
grep -vE "^#|^$|/proc|/sys|/dev/pts" /etc/fstab
echo "=== regular users (UID 1000-59999) ==="
getent passwd | awk -F: "\$3>=1000 && \$3<60000 {print \$1}"
echo "=== docker stacks ==="
ls /srv/docker/ 2>/dev/null
echo "=== home dotfiles ==="
sudo find /home -maxdepth 3 \( -name ".*rc" -o -name ".*config" \) 2>/dev/null
'
```

The scan surfaces roughly 80% of what's worth managing, and that
usually fits in about 30 lines of pyinfra. The remaining 20% is
case-by-case (kernel cmdline tweaks, hand-edited /etc files, dotfile
contents) and no scanner fully captures it; you write those parts as
you rediscover the box's quirks.
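
If the scanner output is saved to a file, the `=== name ===` headers make it easy to split into sections for review. A minimal sketch, assuming the header format printed by the script above; `parse_scan` and the sample text are hypothetical.

```python
import re

HEADER = re.compile(r"^=== (.+) ===$")


def parse_scan(text: str) -> dict:
    """Split scanner output into {section name: [non-empty lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A header starts a new, initially empty section.
            current = match.group(1)
            sections[current] = []
        elif line and current is not None:
            sections[current].append(line)
    return sections


# Made-up sample output, for illustration only.
sample = """=== manually-installed packages ===
htop
nginx
=== enabled services ===
nginx.service
ssh.service
=== docker stacks ===
"""

scan = parse_scan(sample)
for name, items in scan.items():
    print(f"{name}: {len(items)} item(s)")
```

Empty sections (here `docker stacks`) are kept, so you can tell "scanned and found nothing" apart from "never scanned".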

## What scanners can't catch

- The *intent* behind a config (why this `sysctl` value, why this
  cron entry).
- Hand-edited fragments mixed into otherwise-default files (e.g. a
  custom line in `/etc/ssh/sshd_config`).
- Out-of-band state: data in databases, container volumes,
  `/var/lib/<service>`.

For these, treat pyinfra as managing the *bones* (packages, services,
users, configs) and leave the data alone. Re-deploying onto fresh
hardware should give you a working system; restoring data is a
separate concern.
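
Hand-edited fragments can at least be located, even if their intent cannot. One approach is to diff the live file against a pristine copy of the packaged default, for example one taken from a freshly installed scratch machine. A minimal sketch using Python's `difflib`; `local_edits` and both sample texts are hypothetical.

```python
import difflib


def local_edits(pristine: str, live: str) -> list:
    """Return lines removed from ("- ") or added to ("+ ") the packaged default."""
    diff = difflib.ndiff(pristine.splitlines(), live.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Made-up file contents, for illustration only.
pristine = "#Port 22\n#PermitRootLogin prohibit-password\nUsePAM yes\n"
live = "#Port 22\nPermitRootLogin no\nUsePAM yes\nAllowUsers deploy\n"

edits = local_edits(pristine, live)
print("\n".join(edits))
```

Each `+` line is a candidate for a managed config line in the deploy; the reason it was added still has to come from whoever edited the file.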

## If "scan and reproduce" is a hard requirement

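The native answer is **NixOS**. The whole OS is declarative: the
machine is built from its configuration files, so the file is the
source of truth rather than something recovered by scanning. Note that
`nixos-generate-config` only detects hardware and filesystems and
writes a starter config; packages and services still have to be
declared by hand. Different paradigm, big switch, but the right
answer if reproducible-from-a-file is core to your workflow and you
aren't already deep in Ubuntu-land.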