How do you programmatically select an area of a PDF and extract the text out of it? (VCL)

  • Hi,

    I know this should be possible, but I could not find out anywhere how to do it.

    I have thousands of PDFs that have their reference at the same place.

    I would like to be able to select this area and get the text to identify the reference of the PDF.

    Basically I would like to have something like :

    WPDF_GETTEXT( x, y , width , height, Page number) : string;


    How should I do this?

    (No user action required; this runs as a batch process.)

    Thanks

  • Hi,

    I loaded the new version and started working on it...

    {$I WPViewPDFINC.INC}

    Uses WPDF_ViewCommands, WPViewPDF4, WPViewPDF3,..

    ...

    WPViewPDF1.ViewerStart('wPDFView03.dll', 'aircraft data systems', 'yyyyyyyyyyyyyyyyy' , xxxxxxxxx);

    WPViewPDF1.MouseMode(wpLeftButton, wpmDrawCustom );

    WPViewPDF1.Command(COMPDF_SelectMode, wpmouse_DrawCustom );

    ...

    WPViewPDF1.command(COMPDF_GetTextSetOptions, 4+2); // Activate the filter

    WPViewPDF1.command(COMPDF_GetTextFilterRectX , StrToInt(Edit6.Text) ); // value: 36

    WPViewPDF1.command(COMPDF_GetTextFilterRectY , StrToInt(Edit7.Text) ); // value: 483

    WPViewPDF1.command(COMPDF_GetTextFilterRectX1, StrToInt(Edit8.Text) ); // value: 47

    WPViewPDF1.command(COMPDF_GetTextFilterRecty1, StrToInt(Edit9.Text) ); // value: 541

    RzMemo1.Lines.Add(WPViewPDF1.GetPageText(0));

    ...

    The coordinates were taken from a direct on-screen selection using the DoSelRectEvent event.

    Yet it returns the text of the whole page every time.

  • What is this?

    function GetPageText(PageNo: Integer; format: string = ''): AnsiString;

    { :: This function retrieves the ANSI text of a certain page in the range 0..PageCount-1<br>

    PageNo = -1 will read the complete text.<br>

    PageNo = -2 will read the selected pages.<br>

    PageNo = -3 will read the selected text<br>

    This formats are possible: RTF, ANSI, UNICODE, HTML.<br>

    For HTML it is possible to add a '=' and the path which will be used for the images: HTML=path

    }

    Should I use: RzMemo1.Lines.Add(WPViewPDF1.GetPageText(-3)); ?

  • Hi,

    I tried this:

    FormCreate :

    WPViewPDF1.ViewerStart('wPDFViewPlus04.dll', 'aircraft data systems', 'YYYY-YYYY-YYYY-YYYY' , XXXXXXX );

    WPViewPDF1.MouseMode(wpLeftButton, wpmDrawCustom );

    WPViewPDF1.Command(COMPDF_SelectMode, wpmouse_DrawCustom );

    And I get the same result.

    The whole page is recognized, whatever I pass as parameters:

    ButtonClick:

    WPViewPDF1.command(COMPDF_GetTextSetOptions, 4+2); // Activate the filter

    WPViewPDF1.command(COMPDF_GetTextFilterRectX , StrToInt(Edit6.Text) );

    WPViewPDF1.command(COMPDF_GetTextFilterRectY , StrToInt(Edit7.Text) );

    WPViewPDF1.command(COMPDF_GetTextFilterRectX1, StrToInt(Edit8.Text) );

    WPViewPDF1.command(COMPDF_GetTextFilterRecty1, StrToInt(Edit9.Text) );

    WPViewPDF1.AddHighlightRect(0, StrToInt(Edit6.Text), StrToInt(Edit7.Text), StrToInt(Edit8.Text), StrToInt(Edit9.Text) , 255, [wpAnnotAtFoundText] );

    RzMemo1.Lines.Add(WPViewPDF1.GetPageText(0));

  • What I have in my Installation and execution folders:

    wp_type1ttf.dll Version 2.10.0.0

    wp_type1ttf64.dll Version 2.4.6.2

    wpdecodejp.dll Version 1.2.0.1

    wpdecodejp64.dll Version 1.2.0.1

    wPDFView03.dll Version 3.28.4.2

    wPDFView03x64.dll Version 3.28.4.3

    wPDFViewPlus04.dll Version 4.8.0.2

    wPDFViewPlus04x64.dll Version 4.4.2.0

  • Access violation at address 0726446F in module 'wPDFViewPlus04.dll'. Read of address 00000000.

    This happens just after: WPViewPDF1.ViewerStart('wPDFViewPlus04.dll', 'aircraft data systems', 'YYYY-YYYY-YYYY-YYYY' , XXXXXXX );

    Coordinates:

    WPViewPDF1.command(COMPDF_GetTextSetOptions, 4+2); // Activate the filter

    WPViewPDF1.command(COMPDF_GetTextFilterRectX , 457 );

    WPViewPDF1.command(COMPDF_GetTextFilterRectY , 86 );

    WPViewPDF1.command(COMPDF_GetTextFilterRectX1, 100 );

    WPViewPDF1.command(COMPDF_GetTextFilterRecty1, 23 );

    WPViewPDF1.AddHighlightRect(0, StrToInt(Edit6.Text), StrToInt(Edit7.Text), StrToInt(Edit8.Text), StrToInt(Edit9.Text) , 255, [wpAnnotAtFoundText] );

    RzMemo1.Lines.Add(WPViewPDF1.GetPageText(0));

    • Official post

    Please note that the third and fourth COMPDF_GetTextFilterRect parameters are X1 and Y1 values (the opposite corner of the rectangle), not width and height.
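
    For callers who think in terms of x, y, width and height (as in the WPDF_GETTEXT signature wished for above), the conversion to corner coordinates can be sketched as follows. This is only an illustration: TFilterRect and CornersFromSize are hypothetical names that are not part of WPViewPDF, and the sample values assume that 457, 86, 100, 23 were meant as x, y, width, height.

    ```pascal
    program RectToCorners;

    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
    {$APPTYPE CONSOLE}

    type
      // Hypothetical helper record, not a WPViewPDF type.
      TFilterRect = record
        X, Y, X1, Y1: Integer;
      end;

    // Converts an x/y/width/height rectangle into the two corner points
    // that the COMPDF_GetTextFilterRect* commands expect.
    function CornersFromSize(X, Y, Width, Height: Integer): TFilterRect;
    begin
      Result.X  := X;
      Result.Y  := Y;
      Result.X1 := X + Width;   // right edge, not the width
      Result.Y1 := Y + Height;  // bottom edge, not the height
    end;

    var
      R: TFilterRect;
    begin
      R := CornersFromSize(457, 86, 100, 23);
      WriteLn(R.X, ' ', R.Y, ' ', R.X1, ' ', R.Y1); // prints: 457 86 557 109
    end.
    ```

    The resulting R.X, R.Y, R.X1 and R.Y1 would then be passed to COMPDF_GetTextFilterRectX, COMPDF_GetTextFilterRectY, COMPDF_GetTextFilterRectX1 and COMPDF_GetTextFilterRecty1 in place of the raw width and height.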

    I used this code to test: