Commit 3201376

many fixes
1 parent 21e3837 commit 3201376

29 files changed

Lines changed: 1810 additions & 1483 deletions

.github/workflows/ci.yml

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+name: CI
+
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+
+jobs:
+  test:
+    name: Python ${{ matrix.python-version }}
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: pip
+
+      - name: Install
+        run: python -m pip install -e ".[dev]"
+
+      - name: Test
+        run: pytest -q
+
+      - name: Ruff
+        run: ruff check .
+
+      - name: Black
+        run: black --check .
+
+      - name: Mypy
+        run: mypy subs_diff

.gitignore

Lines changed: 6 additions & 1 deletion
@@ -45,6 +45,11 @@ coverage.xml
 *.py,cover
 .hypothesis/
 .pytest_cache/
+.pytest_tmp/
+.pytest-run/
+test-tmp/
+.codex-pip-tmp/
+.codex-pip-cache/
 
 # Translations
 *.mo
@@ -93,4 +98,4 @@ Thumbs.db
 .dual-graph/
 
 # Subtitles
-*.srt
+*.srt

AGENTS.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+Core code lives in `subs_diff/`:
+- `cli.py` and `__main__.py` provide the CLI entry points.
+- `parser.py`, `align.py`, `heuristics.py`, `segments.py`, and `llm.py` implement parsing, matching, scoring, long-segment checks, and LLM verification.
+- `report.py` and `reporter.py` generate JSON/HTML reports.
+- Shared dataclasses and config types are in `types.py` and `config.py`.
+
+Tests live in `tests/` and follow the same feature split (for example, `tests/test_align.py`, `tests/test_cli_filters.py`).
+
+## Build, Test, and Development Commands
+- `pip install -e .` installs the package in editable mode.
+- `pip install -e ".[dev]"` installs development tools (`pytest`, `ruff`, `black`, `mypy`).
+- `python -m subs_diff compare --stt A.srt --ref B.srt --out report.json` runs the main compare flow.
+- `pytest -q` runs the test suite.
+- `pytest --cov=subs_diff --cov-report=html` runs tests with coverage output in `htmlcov/`.
+- `ruff check .` runs lint checks.
+- `black .` formats code.
+- `mypy subs_diff` runs strict type checking.
+
+## Coding Style & Naming Conventions
+- Python 3.10+ codebase; keep compatibility with versions listed in `pyproject.toml`.
+- Use 4-space indentation and max line length `100` (Black/Ruff config).
+- Use snake_case for functions/variables/modules; PascalCase for dataclasses/types.
+- Prefer explicit type annotations; `mypy` is configured with `strict = true`.
+- Keep modules focused; add new logic to existing domain modules before creating new top-level files.
+
+## Testing Guidelines
+- Framework: `pytest` (`tests/`, files named `test_*.py`).
+- Add tests for every behavior change, especially CLI flags and alignment heuristics.
+- Name tests by behavior, e.g. `test_compare_resumes_from_checkpoint`.
+- For bug fixes, add a regression test that fails before the fix.
+
+## Commit & Pull Request Guidelines
+Current history uses short, direct subjects (for example, `long segments detection`, `Update .gitignore`). Follow that style, but make subjects specific and actionable.
+
+- Commit message format: short imperative subject, optionally with scope (e.g. `align: tighten time window filter`).
+- PRs should include: purpose, key changes, test evidence (`pytest`/lint/type-check output), and sample CLI command(s) for manual verification.
+- Link related issues and attach report artifacts/screenshots when output format changes.
+
+## Security & Configuration Tips
+- Do not commit API keys or local config files.
+- Prefer CLI/config storage for secrets (`subs_diff config set ...`) and keep generated reports/debug logs out of commits unless needed for fixtures.
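
The Testing Guidelines in AGENTS.md ask for behavior-named tests and for a regression test with every bug fix. A minimal sketch of what that could look like follows. The helper `within_time_window` is hypothetical and not part of `subs_diff`; it stands in for any function whose boundary bug is being fixed.

```python
# Hypothetical helper (NOT from subs_diff): the "fixed" version of a
# time-window check that is assumed to have excluded the boundary before.
def within_time_window(start_a: float, start_b: float, window: float) -> bool:
    """Return True when two start times are at most `window` seconds apart."""
    # The assumed bug used `<`, which dropped pairs exactly on the boundary.
    return abs(start_a - start_b) <= window


# Regression test, named after the behavior it protects.
# With the assumed `<` bug, this assertion would fail.
def test_time_window_includes_boundary() -> None:
    assert within_time_window(10.0, 12.0, window=2.0)


# Companion test so the fix does not over-correct.
def test_time_window_rejects_distant_segments() -> None:
    assert not within_time_window(10.0, 12.5, window=2.0)
```

Saved as a `test_*.py` file, both functions would be collected by `pytest -q`.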

pyproject.toml

Lines changed: 1 addition & 3 deletions
@@ -40,14 +40,12 @@ dev = [
     "black>=23.0.0",
     "ruff>=0.1.0",
     "mypy>=1.5.0",
+    "types-tqdm>=4.67.0",
 ]
 
 [project.scripts]
 subs-diff = "subs_diff.cli:main"
 
-[project.entry-points."console_scripts"]
-subs-diff = "subs_diff.cli:main"
-
 [tool.setuptools.packages.find]
 where = ["."]
 include = ["subs_diff*"]

subs_diff/__init__.py

Lines changed: 4 additions & 4 deletions
@@ -3,14 +3,14 @@
 __version__ = "0.1.0"
 
 from subs_diff.types import (
-    Segment,
     Candidate,
-    Issue,
-    Severity,
     Category,
+    Config,
+    Issue,
     LLMVerdict,
     Report,
-    Config,
+    Segment,
+    Severity,
 )
 
 __all__ = [

subs_diff/__main__.py

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 """Entry point for python -m subs_diff."""
 
 import sys
+
 from subs_diff.cli import main
 
 if __name__ == "__main__":

subs_diff/align.py

Lines changed: 10 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,9 @@
11
"""Выравнивание сегментов и merge операций."""
22

33
from dataclasses import dataclass
4-
from typing import Iterable
54

6-
from subs_diff.types import Segment, MergedSegment, Candidate, SimilarityMetrics
7-
from subs_diff.heuristics import compute_similarity, is_candidate, RareTokenDetector
5+
from subs_diff.heuristics import compute_similarity, is_candidate
6+
from subs_diff.types import Candidate, MergedSegment, Segment, SimilarityMetrics
87

98

109
@dataclass
@@ -201,7 +200,7 @@ def align_segments(
201200
merged_a, merged_b, metrics = best_match
202201
aligned_pairs.append((merged_a, merged_b))
203202
if is_candidate(
204-
metrics,
203+
metrics,
205204
min_score=min_score,
206205
a_tokens=merged_a.tokens,
207206
b_tokens=merged_b.tokens,
@@ -268,7 +267,7 @@ def align_segments(
268267
merged_a = merge_segments([best_temporal_a])
269268
metrics = compute_similarity(merged_a, merged_b)
270269
if is_candidate(
271-
metrics,
270+
metrics,
272271
min_score=min_score,
273272
a_tokens=merged_a.tokens,
274273
b_tokens=merged_b.tokens,
@@ -290,8 +289,12 @@ def align_segments(
290289
),
291290
b_segment=merged_b,
292291
metrics=SimilarityMetrics(
293-
jaccard=0.0, char_3gram=0.0, levenshtein=0.0,
294-
length_ratio=0.0, rare_token_overlap=0.0, rare_token_missing=1.0,
292+
jaccard=0.0,
293+
char_3gram=0.0,
294+
levenshtein=0.0,
295+
length_ratio=0.0,
296+
rare_token_overlap=0.0,
297+
rare_token_missing=1.0,
295298
),
296299
is_forced_like=True,
297300
)
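
The `SimilarityMetrics` fields above include `jaccard` and `char_3gram`, but the diff does not show how `compute_similarity` derives them. The sketch below illustrates the two textbook measures those names suggest: Jaccard overlap on token sets, and Jaccard overlap on character 3-gram sets. The function names and the treatment of empty inputs are assumptions for illustration, not the project's implementation.

```python
def jaccard(a_tokens: list[str], b_tokens: list[str]) -> float:
    """Jaccard similarity of two token lists, compared as sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        # Assumption: two empty inputs count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """All contiguous character n-grams of `text` (empty if text is shorter than n)."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def char_3gram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character 3-gram sets."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        # Assumption: inputs too short to yield 3-grams count as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(jaccard(["a", "b", "c"], ["b", "c", "d"]))  # 2 shared of 4 distinct tokens
    print(char_3gram_similarity("abcd", "abce"))      # 1 shared of 3 distinct 3-grams
```

Under these definitions, the all-zero metrics in the last hunk correspond to a pair with nothing in common, which is consistent with the `is_forced_like=True` placeholder candidate.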

subs_diff/checkpoint.py

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
"""Checkpoint persistence for interrupted comparisons."""
2+
3+
import logging
4+
from pathlib import Path
5+
6+
from subs_diff.report import generate_report, load_report_json, save_report_json
7+
from subs_diff.types import Issue
8+
9+
logger = logging.getLogger(__name__)
10+
11+
12+
def save_checkpoint(
13+
issues: list[Issue],
14+
out_file: str | Path | None,
15+
processed: int = 0,
16+
total: int = 0,
17+
) -> None:
18+
"""Save partial comparison results in the report JSON shape."""
19+
if out_file is None:
20+
return
21+
22+
try:
23+
report = generate_report(
24+
issues=issues,
25+
stt_file="",
26+
ref_file="",
27+
config={"partial": True, "processed": processed, "total": total},
28+
)
29+
save_report_json(report, out_file)
30+
logger.info("Чекпоинт сохранён: %s/%s проблем обработано", processed, total)
31+
except Exception as exc:
32+
logger.error("Ошибка сохранения чекпоинта: %s", exc)
33+
34+
35+
def load_resume_checkpoint(out_file: str | Path) -> tuple[list[Issue], int]:
36+
"""
37+
Load a partial checkpoint from a report JSON file.
38+
39+
Returns:
40+
A pair of restored issues and processed candidate count. If the file is
41+
absent or is not a partial checkpoint, returns an empty checkpoint.
42+
"""
43+
path = Path(out_file)
44+
if not path.exists():
45+
return [], 0
46+
47+
report = load_report_json(path)
48+
cfg = report.metadata.config
49+
processed = int(cfg.get("processed", 0)) if cfg.get("partial") else 0
50+
return report.issues, processed
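
The checkpoint round trip can be sketched without the `subs_diff` package, using plain dictionaries and the standard `json` module. The names `save_partial` and `load_partial` and the JSON layout are assumptions that only mirror the `metadata.config` fields visible in the diff. The sketch follows the docstring of `load_resume_checkpoint`, so a report that is not marked `partial` yields an empty issue list; the committed code returns `report.issues` in that case as well.

```python
import json
from pathlib import Path
from typing import Any


def save_partial(
    path: Path, issues: list[dict[str, Any]], processed: int, total: int
) -> None:
    """Write issues plus progress counters in a report-like JSON shape (assumed layout)."""
    payload = {
        "metadata": {
            "config": {"partial": True, "processed": processed, "total": total}
        },
        "issues": issues,
    }
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")


def load_partial(path: Path) -> tuple[list[dict[str, Any]], int]:
    """Return (issues, processed), or an empty checkpoint if nothing can be resumed."""
    if not path.exists():
        return [], 0
    payload = json.loads(path.read_text(encoding="utf-8"))
    cfg = payload.get("metadata", {}).get("config", {})
    if not cfg.get("partial"):
        # A finished report is not a checkpoint, so start from scratch.
        return [], 0
    return payload.get("issues", []), int(cfg.get("processed", 0))
```

A caller resuming a run could skip the first `processed` candidates and extend the restored issue list with new results.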
