2025 12-December 09

Date: 2025 12-December 09

add github repo: https://github.com/City-of-Memphis-Wastewater/pdflinkcheck.git

How to add local SSH git credentials to a repository

PDF links: GoTo links (intra-document) vs URI links

**PDF viewers distinguish between two types of link annotations:**

1. **URI Links (What you extracted):** Links that point to a URI (Uniform Resource Identifier), typically external URLs (`http://`, `mailto:`, `tel:`) or external file paths (`file://`, `mhtml:file://`). These are stored with an `/A` (Action) dictionary containing a `/URI` key. **Your current code captures these.**
    
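2. **GoTo Links (What's missing):** Internal jump links that point to a **specific page and position** _within the same PDF document_. These are stored either as a `/Dest` (Destination) entry directly on the annotation, or as an `/A` (Action) dictionary whose `/S` is `/GoTo` and whose `/D` entry holds the destination.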
    

The "just to jump to a section, figure, etc" links you're missing are almost certainly **GoTo Links (Destinations)**.
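The distinction above can be sketched as a small classifier. This is a minimal sketch, not pdflinkcheck's actual code: it assumes each link annotation has already been parsed into a plain Python dictionary keyed by PDF names, and `classify_link` and `sample_annots` are hypothetical names used only for illustration.

```python
def classify_link(annot: dict) -> str:
    """Return 'uri', 'goto', or 'other' for a parsed link annotation dictionary."""
    # A destination stored directly on the annotation is an internal jump.
    if "/Dest" in annot:
        return "goto"
    # Otherwise look at the action dictionary and its /S (action type) entry.
    action = annot.get("/A") or {}
    action_type = action.get("/S")
    if action_type == "/URI":
        return "uri"
    if action_type == "/GoTo":
        return "goto"
    return "other"


# Hypothetical sample data: one external link and two forms of internal jump.
sample_annots = [
    {"/Subtype": "/Link", "/A": {"/S": "/URI", "/URI": "https://example.com/"}},
    {"/Subtype": "/Link", "/Dest": [15, "/XYZ", 72, 700, 0]},
    {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "section.2"}},
]

for annot in sample_annots:
    print(classify_link(annot))
```

A scanner that only looks for `/URI` would report the first sample and silently skip the other two, which matches the symptom described above.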

Focusing on link text is crucial for verification, and identifying link remnants is key to finding and fixing missing links.

Here is a plan outlining the steps to achieve both goals.


The plan is divided into three stages: Enhancement (getting the data you need), Verification (confirming existing links), and Remediation (identifying and fixing missing links).

Stage 1 (Enhancement): Your first goal is to modify your pdflinkcheck tool to capture the link text for all link types.

  1. Extract Anchor Text for Jumps: For internal jumps (GoTo actions), instead of printing (Rect: N/A), your tool should use the link annotation's rectangle coordinates to extract the text that lies under that area of the page. This transforms your output from:

    Page 3 (Rect: N/A): Jumps via Action/Dest to Page 16

    to:

    Page 3 (Introduction): Jumps via Action/Dest to Page 16

  2. Extract Anchor Text for External URIs: Do the same for external links (HTTP/Mailto/Tel). This helps distinguish if the text is the URL itself (e.g., www.flowserve.com/) or a descriptive name (e.g., Flowserve Homepage).
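The rectangle-to-text step can be sketched as follows. This is a rough sketch under stated assumptions: it takes a list of word tuples shaped like `(x0, y0, x1, y1, text, ...)`, which mirrors what PyMuPDF's `page.get_text("words")` returns, but whether pdflinkcheck uses PyMuPDF is an assumption here. `text_in_rect` and `sample_words` are hypothetical names.

```python
from typing import Iterable, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def text_in_rect(words: Iterable[Sequence], rect: Rect) -> str:
    """Join the words whose center point falls inside the link rectangle."""
    x0, y0, x1, y1 = rect
    hits = []
    for wx0, wy0, wx1, wy1, text, *_ in words:
        center_x = (wx0 + wx1) / 2
        center_y = (wy0 + wy1) / 2
        if x0 <= center_x <= x1 and y0 <= center_y <= y1:
            hits.append(text)
    # Input order is kept, so the extractor's reading order is preserved.
    return " ".join(hits)


# Hypothetical words on one line of a page; the trailing numbers stand in
# for block, line, and word indices.
sample_words = [
    (72.0, 100.0, 130.0, 112.0, "See", 0, 0, 0),
    (134.0, 100.0, 210.0, 112.0, "Introduction", 0, 0, 1),
    (214.0, 100.0, 240.0, 112.0, "on", 0, 0, 2),
    (244.0, 100.0, 280.0, 112.0, "page", 0, 0, 3),
]

anchor = text_in_rect(sample_words, (132.0, 98.0, 212.0, 114.0))
print(f"Page 3 ({anchor}): Jumps via Action/Dest to Page 16")
```

Testing the word's center rather than requiring full containment tolerates link rectangles that are drawn slightly tighter than the text they cover.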

Stage 2 (Verification): Once you have the text associations, you can focus on the correctness of the links you already found (the 322 existing links).
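For external links, the summary table at the end of this note lists an HTTP status code check as the verification method. One possible sketch using only the standard library is below. `http_status` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response was received."""
    # Only http(s) links can be checked this way; mailto:, tel:, file: are skipped.
    if not url.lower().startswith(("http://", "https://")):
        return None
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions that carry the status code.
        return err.code
    except (urllib.error.URLError, TimeoutError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None
```

Some servers reject `HEAD` requests, so a 405 or 403 result is not proof that a link is broken; retrying such URLs with `GET` before reporting them is a reasonable refinement.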


Stage 3 (Remediation): This stage addresses the core issue of missing links by looking for "remnants" and fixing them.

Link remnants are text strings that look like they should be links but were never converted to interactive PDF annotations. This is the best way to find links that were "simply never included."
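A URL remnant scan could look roughly like this: find URL-shaped strings in the extracted page text and report those that do not correspond to any URI already captured from link annotations. This is a minimal sketch. The regular expression is deliberately loose, and `find_url_remnants` and `normalize` are hypothetical names.

```python
import re

# Matches strings that start like a URL and run until whitespace or a bracket.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)


def normalize(url):
    """Reduce a URL to a comparable form: no scheme, lowercase, no trailing slash."""
    url = url.strip().rstrip(".,;:")
    url = re.sub(r"^https?://", "", url, flags=re.IGNORECASE)
    return url.lower().rstrip("/")


def find_url_remnants(page_text, linked_uris):
    """Return URL-like strings in page_text that match no known link URI."""
    linked = {normalize(uri) for uri in linked_uris}
    remnants = []
    for match in URL_PATTERN.finditer(page_text):
        candidate = match.group().rstrip(".,;:")  # drop sentence punctuation
        if normalize(candidate) not in linked:
            remnants.append(candidate)
    return remnants


page_text = "Visit www.flowserve.com/ or https://example.com/manual.pdf for details."
linked_uris = ["http://www.flowserve.com/"]
print(find_url_remnants(page_text, linked_uris))
```

This compares strings only, not geometry, so a URL that is linked in one place and printed unlinked elsewhere on the same page would not be flagged. Checking each match's position against the link rectangles would close that gap.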

2. Identify Internal Cross-Reference Remnants

This is typically a manual or highly context-specific task, but it is often the most important one for O&M manuals.
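A first pass can still be automated by flagging phrases that read like cross-references and are not already covered by a link. The sketch below assumes the anchor text of existing links is available from Stage 1. The pattern covers only a few common wordings, and `find_xref_candidates` is a hypothetical name.

```python
import re

# Phrases such as "Section 4.2", "Figure 7-1", "Table 3", "Appendix B2", "page 12".
XREF_PATTERN = re.compile(
    r"\b(?:Section|Figure|Fig\.|Table|Appendix|Page)\s+[A-Z]?\d+(?:[.-]\d+)*",
    re.IGNORECASE,
)


def find_xref_candidates(page_text, linked_texts):
    """Return cross-reference phrases that are not the anchor text of any link."""
    linked = {text.strip().lower() for text in linked_texts}
    return [
        match.group()
        for match in XREF_PATTERN.finditer(page_text)
        if match.group().lower() not in linked
    ]


page_text = "Refer to Section 4.2 and Figure 7-1 for pump curves. See Table 3."
linked_texts = ["Section 4.2"]
print(find_xref_candidates(page_text, linked_texts))
```

The results are candidates only: captions and headings (for example the "Table 3" label above the table itself) will match too, so each hit still needs a manual look before a GoTo annotation is added.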

Summary of Fixes

| Issue | Found Via | Fix Action |
| --- | --- | --- |
| Broken External Link | Stage 2 (HTTP Status Code) | Update or remove the URI. |
| Missing External Link | Stage 3 (URL Remnant Scan) | Add the missing Link Annotation over the existing text. |
| Incorrect Jump Target | Stage 2 (Verification) | Update the GoTo Action target to point to the correct page. |
| Missing Internal Jump | Stage 3 (Cross-Reference Scan) | Add the missing GoTo Action annotation over the existing text. |