2025 12-December 09
Date: 2025 12-December 09
add github repo: https://github.com/City-of-Memphis-Wastewater/pdflinkcheck.git
How to add local SSH git credentials to a repository
PDF links: GoTo links (intra-document) vs URI links
**PDF files distinguish between two types of link annotations:**
1. **URI Links (What you extracted):** Links that point to a URI (Uniform Resource Identifier), typically external URLs (`http://`, `mailto:`, `tel:`) or external file paths (`file://`, `mhtml:file://`). These are stored with an `/A` (Action) dictionary containing a `/URI` key. **Your current code captures these.**
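2. **GoTo Links (What's missing):** Internal jump links that point to a **specific page and position** _within the same PDF document_. These are stored either with a `/Dest` (Destination) key directly on the annotation, or with an `/A` (Action) dictionary whose type is `/GoTo` and whose `/D` key holds the destination.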
The "just to jump to a section, figure, etc" links you're missing are almost certainly **GoTo Links (Destinations)**.
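The distinction above can be sketched as a small classifier. This is only an illustration: it assumes each link annotation has already been read into a plain Python dict keyed by PDF names (`/A`, `/S`, `/Dest`), which is a hypothetical representation and not necessarily how pdflinkcheck stores annotations. The function name `classify_link` and the sample data are also made up for the example.

```python
def classify_link(annot: dict) -> str:
    """Return 'uri', 'goto', or 'other' for a link annotation dictionary."""
    # A /Dest entry directly on the annotation is an internal jump
    # that has no action dictionary at all.
    if "/Dest" in annot:
        return "goto"
    action = annot.get("/A", {})
    subtype = action.get("/S")
    if subtype == "/URI":
        return "uri"
    if subtype == "/GoTo":
        # Internal jump expressed as an action; the target is under /D.
        return "goto"
    return "other"


# Hypothetical annotations, one per storage style.
links = [
    {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "http://www.flowserve.com/"}},
    {"/Subtype": "/Link", "/Dest": [15, "/XYZ", 0, 792, 0]},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "section-2.1"}},
]
kinds = [classify_link(a) for a in links]
print(kinds)  # ['uri', 'goto', 'goto']
```

A tool that only looks for `/A` with a `/URI` key would report the first entry and skip the other two, which matches the symptom described above.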
Focusing on link text is crucial for verification, and identifying link remnants is key to finding and fixing missing links.
Here is a plan outlining the steps to achieve both goals.
⚙️ Plan for Link Verification and Fixing
The plan is divided into three stages: Enhancement (getting the data you need), Verification (confirming existing links), and Remediation (identifying and fixing missing links).
Stage 1: Tool Enhancement (Get Link Text)
Your first goal is to modify your pdflinkcheck tool to capture the link text for all link types.
- Extract Anchor Text for Jumps: For internal jumps (GoTo Actions), instead of `(Rect: N/A)`, your tool must use the rectangle coordinates of the link annotation to extract the text string that underlies that area in the PDF content stream. This transforms your output from:
  `Page 3 (Rect: N/A): Jumps via Action/Dest to Page 16`
  to:
  `Page 3 (Introduction): Jumps via Action/Dest to Page 16`
- Extract Anchor Text for External URIs: Do the same for external links (HTTP/Mailto/Tel). This helps distinguish if the text is the URL itself (e.g., `www.flowserve.com/`) or a descriptive name (e.g., `Flowserve Homepage`).
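One way to get the anchor text is to intersect the link rectangle with the word boxes on the same page. The sketch below is library-agnostic and assumes you already have each word as an `(x0, y0, x1, y1, text)` tuple and each link rectangle as `(x0, y0, x1, y1)` in the same coordinate system; the function name `anchor_text` and the sample coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]        # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]   # (x0, y0, x1, y1, text)


def anchor_text(link_rect: Rect, words: Iterable[Word]) -> str:
    """Join the words whose center point falls inside the link rectangle."""
    lx0, ly0, lx1, ly1 = link_rect
    hits = []
    for x0, y0, x1, y1, text in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if lx0 <= cx <= lx1 and ly0 <= cy <= ly1:
            # Round y so words on the same line sort left to right.
            hits.append((round(y0), x0, text))
    hits.sort()
    return " ".join(text for _, _, text in hits)


# Hypothetical word boxes for one line of a page.
words = [
    (72, 100, 90, 112, "See"),
    (94, 100, 160, 112, "Introduction"),
    (164, 100, 180, 112, "on"),
    (72, 120, 120, 132, "Other"),
]
link_rect = (92, 98, 162, 114)
print(f"Page 3 ({anchor_text(link_rect, words)}): Jumps via Action/Dest to Page 16")
```

Testing the word center rather than requiring full containment is a deliberate choice here: link rectangles are often a little tighter or looser than the glyph boxes, so strict containment tends to drop the first or last word.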
Stage 2: Verification (Confirm Existing Links)
Once you have the text associations, you can focus on the correctness of the links you already found (the 322 existing links).
- Internal Jumps:
  - Action: Manually (or programmatically) check the link text against the actual destination page content.
  - Goal: Ensure that the link text (e.g., "Section 2.1: Equipment List") actually corresponds to the title on the target page (e.g., Page 16).
- External URIs (HTTP/HTTPS):
  - Action: Programmatically check the HTTP status code (this is what your tool likely does already).
  - New Goal: Group the URLs by the accompanying link text. If ten different sections link to `http://www.gsengr.com/` using the text "Guthrie Sales Engineering," a failure of the URI means you only have one issue to fix (the URL), but ten places to update (the source links).
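Both Stage 2 checks can be sketched without any network access. The example below assumes each extracted link is a dict with `page`, `text`, and `uri` keys; that record shape, the function names, and the page numbers are hypothetical. The internal-jump check is a plain case-insensitive substring match, which is a rough heuristic and not a guarantee that the target is correct.

```python
from collections import defaultdict


def group_sources_by_uri(links):
    """Map each URI to the (page, link text) pairs that point at it."""
    groups = defaultdict(list)
    for link in links:
        groups[link["uri"]].append((link["page"], link["text"]))
    return dict(groups)


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def jump_text_matches_target(link_text, target_page_text):
    """True if the link text appears somewhere in the target page's text."""
    return normalize(link_text) in normalize(target_page_text)


# Hypothetical extracted records.
external_links = [
    {"page": 4, "text": "Guthrie Sales Engineering", "uri": "http://www.gsengr.com/"},
    {"page": 57, "text": "Guthrie Sales Engineering", "uri": "http://www.gsengr.com/"},
    {"page": 9, "text": "www.flowserve.com/", "uri": "http://www.flowserve.com/"},
]
groups = group_sources_by_uri(external_links)
for uri, sources in groups.items():
    print(uri, "->", len(sources), "source link(s)")
```

With the sources grouped this way, each distinct URI only needs one status check, and a failure immediately lists every page that has to be updated.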
🔍 Stage 3: Remediation (Find and Fix Missing Links)
This stage addresses the core issue of missing links by looking for "remnants" and fixing them.
1. Identify Link Remnants
Link remnants are text strings that look like they should be links but were never converted to interactive PDF annotations. This is the best way to find links that were "simply never included."
- URLs/Domains in Text: Scan the full PDF text for patterns that match URLs or email addresses that are not present in your list of 75 External URIs.
  - Example Remnant: The text string "Please visit our website at www.westech-inc.com for details" exists, but there is no clickable link annotation over it on that page.
- Email Addresses in Text: Scan for `(word)@(word).(word)` patterns that don't have a `mailto:` association.
  - Example Remnant: The text says "Contact us at support@flowserve.com" but there is no clickable link.
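The remnant scan can be sketched with two regular expressions. The patterns below are intentionally loose approximations, not full URL or email grammars, and the names `URL_RE`, `EMAIL_RE`, `normalize_url`, and `find_remnants` are hypothetical. The sample email address is a placeholder.

```python
import re

# Loose patterns: anything starting with http(s):// or www., and word@word.word.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>()]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_url(url):
    """Lowercase, drop trailing punctuation, the scheme, and a trailing slash."""
    url = url.lower().rstrip(".,;:")
    url = re.sub(r"^https?://", "", url)
    return url.rstrip("/")


def find_remnants(page_text, known_uris):
    """Return URL-like and email-like strings with no matching link URI."""
    known = set()
    for uri in known_uris:
        if uri.lower().startswith("mailto:"):
            known.add(uri[len("mailto:"):].lower())
        else:
            known.add(normalize_url(uri))
    remnants = []
    for match in URL_RE.finditer(page_text):
        candidate = match.group().rstrip(".,;:")
        if normalize_url(candidate) not in known:
            remnants.append(candidate)
    for match in EMAIL_RE.finditer(page_text):
        if match.group().lower() not in known:
            remnants.append(match.group())
    return remnants


text = (
    "Please visit our website at www.westech-inc.com for details. "
    "Contact us at support@example.com. See also http://www.flowserve.com/ today."
)
remnants = find_remnants(text, ["http://www.flowserve.com/"])
print(remnants)  # ['www.westech-inc.com', 'support@example.com']
```

This compares against the whole document's URI list. A stricter version would compare page by page, since a URL can be linked on one page and left as plain text on another.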
2. Identify Internal Cross-Reference Remnants
This is typically a manual or highly context-specific task, but often the most important for O&M manuals:
- Cross-Reference Phrases: Search for common cross-reference phrases used throughout the document that are not currently associated with an internal jump.
  - Search Terms: "See Section X.Y," "Refer to Figure Z," "Referenced in Chapter 5," or "Detailed on Page 128."
  - Action: If the text says "Refer to Section 4.5.1" but it's not a hyperlink, this is a missing internal jump that needs to be added.
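A first pass at the cross-reference scan might look like the sketch below. It assumes you have the page text plus the anchor texts of the internal jumps already found on that page (from Stage 1). The phrase list mirrors the search terms above and only handles numeric targets; the names `XREF_RE` and `find_xref_remnants` are hypothetical, and the pattern will need tuning for a real manual.

```python
import re

# Lead-in phrase, then a target such as "Section 4.5.1" or "Page 128".
XREF_RE = re.compile(
    r"\b(?:see|refer to|referenced in|detailed on)\s+"
    r"((?:section|figure|chapter|page|table|appendix)\s+\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def find_xref_remnants(page_text, linked_texts):
    """Return cross-reference phrases whose target is not in any link text."""
    linked = [normalize(t) for t in linked_texts]
    remnants = []
    for match in XREF_RE.finditer(page_text):
        target = normalize(match.group(1))
        # The lookahead stops "figure 1" from matching inside "figure 12".
        pattern = re.escape(target) + r"(?!\.?\d)"
        if not any(re.search(pattern, anchor) for anchor in linked):
            remnants.append(match.group(0))
    return remnants


text = (
    "Refer to Section 4.5.1 for wiring. See Figure 12 and the pump curve "
    "detailed on Page 128."
)
missing = find_xref_remnants(text, ["See Figure 12"])
print(missing)  # ['Refer to Section 4.5.1', 'detailed on Page 128']
```

Anything this returns is only a candidate. As noted above, confirming that a phrase really should be a jump is still a manual, context-specific step.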
Summary of Fixes
| Issue | Found Via | Fix Action |
|---|---|---|
| Broken External Link | Stage 2 (HTTP Status Code) | Update or remove the URI. |
| Missing External Link | Stage 3 (URL Remnant Scan) | Add the missing Link Annotation over the existing text. |
| Incorrect Jump Target | Stage 2 (Verification) | Update the GoTo Action target to point to the correct page. |
| Missing Internal Jump | Stage 3 (Cross-Reference Scan) | Add the missing GoTo Action annotation over the existing text. |