3 Easy Ways to Extract Text from PDFs for Bard

Google’s AI, Bard, is a powerful tool that can answer your questions, generate text, and translate languages. However, it can’t read PDFs or text from webpages. If you want to use Bard with these types of documents, you’ll need to extract the text first. Then, the text can be copied into Bard’s prompt either manually or using Python for automation.

Google updates Bard frequently and keeps adding capabilities. At the time of writing, however, Bard cannot read PDFs or web URLs.

What are PDFs?

Portable Document Format (PDF) is a file format developed by Adobe Systems in 1993. PDF is a popular format for electronic documents because it can be viewed and printed on any device, regardless of the software or operating system being used. PDFs can contain text, images, and other multimedia content. They can also be secured with passwords to prevent unauthorized access.

According to a study by Statista, PDF is the third most popular file format on the web, after HTML and XHTML. PDFs are used by businesses, governments, and individuals all over the world to share documents, such as articles, reports, and books. They are also used to create online forms and surveys.

Types of PDFs This Process Will Work With

Before attempting the following methods, it is important to make sure the PDF contains text. Many PDFs are actually scanned documents and thus are images, not text. The methods listed below cannot extract text from images in the PDFs.
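One way to check whether a PDF has a text layer before trying the methods below is to run its pages through a text extractor and see whether anything comes back. The sketch below assumes PyPDF2 (the same package used later in this article). The helper names and the 20-character threshold are illustrative choices, not part of any library.

```python
def page_has_text(page, min_chars=20):
    """Return True if the page yields at least min_chars of extractable text.

    `page` is any object with an extract_text() method, such as a PyPDF2 page.
    The threshold is an arbitrary assumption meant to ignore stray characters.
    """
    text = page.extract_text() or ""
    return len(text.strip()) >= min_chars


def text_page_indexes(pages, min_chars=20):
    """Return the zero-based indexes of pages that contain extractable text."""
    return [i for i, page in enumerate(pages) if page_has_text(page, min_chars)]


def check_pdf(pdf_path):
    """Open a PDF with PyPDF2 and report which pages have a text layer."""
    # Imported here so the helpers above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    with_text = text_page_indexes(reader.pages)
    print(f"{len(with_text)} of {len(reader.pages)} pages contain extractable text")
    return with_text
```

If `check_pdf` returns an empty list, the file is most likely a scan and the methods below will not find any text in it.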

Types of Methods

In this article, we will cover both manual and automated methods.

It is important to note that Google limits the number of characters that can be entered into Bard. Some users have reported the limit to be around 4000 characters.

As a result, users may not be able to copy and paste long PDFs into Bard at this time.
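If a document exceeds the limit, one workaround is to split the text into pieces and submit them one at a time. The sketch below breaks text on word boundaries. The 4000-character default is taken from the user reports mentioned above and is an assumption, not an official figure, and `split_into_chunks` is an illustrative helper name.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into chunks of at most max_chars, breaking between words."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split into pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit, so close this chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

The instruction you add to the prompt counts toward the limit as well, so it may be worth passing a smaller `max_chars` to leave room for it.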

A manual method is typically used for small, infrequent jobs. The manual methods all rely on copy and paste. While it is possible to copy a large block of text from a document this way, it becomes unwieldy as the document size increases.

An automated method is better for larger jobs that may require text extraction from many PDFs. This method will utilize a programming language such as Python to repeat the text extraction process multiple times and then perform some action on the text before sending it to Bard.
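As a rough illustration of the automated approach, the loop below walks a folder and applies an extraction function to every PDF in it. The folder layout and the `extract_many` helper are assumptions made for this sketch. `extract_fn` stands in for any function that takes a file path and returns text, such as the `extract_text` function shown later in this article.

```python
from pathlib import Path


def extract_many(folder, extract_fn):
    """Apply extract_fn to every .pdf file in folder.

    Returns a dict mapping file name to extracted text. Files that fail
    are reported and skipped so one bad PDF does not stop the whole batch.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract_fn(str(pdf_path))
        except Exception as error:  # broad on purpose: extractors vary
            print(f"Skipping {pdf_path.name}: {error}")
    return results
```

Each value in the returned dictionary can then be turned into a prompt and submitted in turn.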

How to Extract Text From PDFs

Manual Methods

1: Copy And Paste

The easiest way to extract text from a PDF and get it into Bard’s prompt is to copy and paste. This option is best for small blocks of text.

If you have a PDF reader installed:

Open the document using the PDF reader
Select and copy the text of interest. If you want to select the entire document, use CTRL+A or Edit Menu > Select All. Then use Edit Menu > Copy.
Paste the text directly into the Bard prompt and add a sentence at the end asking Bard to summarize the text.

2: Use an Online PDF-to-Text Converter

For larger texts, another option is to use an online PDF-to-text converter such as OCR ONLINE.

Use this to convert the PDF to a Plain Text (txt) document. Download the text file it creates and copy the contents directly into Bard.

Be sure to add your prompt either before or after the text before submitting to Bard.
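If you would rather assemble the prompt in code than in a text editor, a few lines of Python can combine the converter's output with your instruction. This is only a sketch: `build_prompt`, `prompt_from_file`, and the file name `converted.txt` are hypothetical.

```python
def build_prompt(document_text, instruction, instruction_first=True):
    """Combine an instruction and document text into one prompt string."""
    document_text = document_text.strip()
    instruction = instruction.strip()
    if instruction_first:
        return f"{instruction}\n\n{document_text}"
    return f"{document_text}\n\n{instruction}"


def prompt_from_file(txt_path, instruction, instruction_first=True):
    """Read a converted .txt file and build a prompt from its contents."""
    with open(txt_path, encoding="utf-8") as handle:
        return build_prompt(handle.read(), instruction, instruction_first)
```

For example, `prompt_from_file("converted.txt", "Summarize the following text:")` returns a string that can be pasted into Bard as-is.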

Automated Method

3: Use Python To Extract Text From A PDF and Submit it to Bard

Python is an extremely powerful language. Here, we are going to use it to extract text from a PDF, create a custom prompt using the PDF text and then submit it to Bard.

This Python script uses the PyPDF2 and Selenium packages. To install, run the following:

pip install PyPDF2
pip install selenium

After the packages have been installed, you must get the authentication cookie value for access to Bard. Follow the instructions in this link to get the cookie value: How To Get Google User Authentication Cookie Value

Next, find a PDF file to test with. Here is a PDF document as an example: https://core.ac.uk/download/pdf/15614911.pdf

Download this document to your hard drive. Then copy the file path to the ‘PDF_TO_READ’ variable in the Python code.

Next, copy the authentication cookie value to the ‘COOKIE_VALUE’ variable.

This code performs the following operations:

Open the PDF
Extract all text from the page at index 40 (the 41st page, because page indexes start at 0)
Clean the extracted text by stripping whitespace from each line and joining the lines into a single block
Open Chrome Browser and insert Authentication Cookie for Bard
Create simple Prompt with extracted text
Submit prompt to Bard and return response
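The cleaning step can be tried on its own before involving a browser. The sketch below is a variant that joins lines with a single space, so that the last word of one line does not run into the first word of the next. `clean_text` is an illustrative name and the sample string is made up.

```python
def clean_text(raw_text):
    """Join the lines of extracted PDF text into a single line."""
    # Strip each line, drop empty ones, and join the rest with single spaces.
    lines = (line.strip() for line in raw_text.splitlines())
    return " ".join(line for line in lines if line)


sample = "  Portable Document \nFormat files  \n\n can span many lines. "
print(clean_text(sample))
# prints: Portable Document Format files can span many lines.
```

Removing the newlines matters for the automated method: a newline typed into Bard's input box is likely to be treated as the Enter key and submit the prompt early.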

Here is the final code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
from PyPDF2 import PdfReader


PDF_TO_READ = '15614911.pdf'
COOKIE_NAME = "__Secure-1PSID"
COOKIE_VALUE = "<paste your cookie value here>"

def extract_text(pdf_file):

    # Open PDF and create a document object
    pdfDoc = PdfReader(pdf_file)

    # Index a single page. Indexes start at 0, so 40 is the 41st page
    single_page = pdfDoc.pages[40]

    # Extract the text from that page
    text = single_page.extract_text()

    # Clean text by splitting at each newline.
    # Strip whitespace from each line and skip empty lines.
    # Join lines with a single space so words on adjacent lines stay separate
    cleaned_lines = [line.strip() for line in text.split('\n')]
    cleaned_text = ' '.join(line for line in cleaned_lines if line)

    return cleaned_text


def prompt_bard(web_driver, search_string):
    search_bar = web_driver.find_element(By.ID, "mat-input-0")
    search_bar.clear()

    # Enter Search String
    search_bar.send_keys(search_string)

    # Submit Search
    search_bar.send_keys(Keys.ENTER)

    # Allow time for Bard to respond
    time.sleep(10.0)

    # Grab response and return
    return web_driver.find_element(By.CLASS_NAME, "model-response-text").text


# Main Script
if __name__ == '__main__':

    # Defined before the try block so the finally clause can check it safely
    driver = None

    try:
        print("Connecting to ChromeDriver")
        # Selenium 4 takes the driver path through a Service object
        service = webdriver.ChromeService(executable_path='./chromedriver')
        driver = webdriver.Chrome(service=service)
        driver.implicitly_wait(5.0)

        # Load the Bard domain first. A cookie can only be added for the
        # domain the browser is currently on
        print("Opening Bard to set the cookie domain")
        driver.get("https://bard.google.com/u/1/")

        # Insert the authentication cookie into the browser
        print("Adding cookie")
        driver.add_cookie({
            "name": COOKIE_NAME,
            "value": COOKIE_VALUE
        })

        # After the cookie has been applied, reload the page to get access to Bard's input field
        print("Transferring to Bard site")
        driver.get("https://bard.google.com/u/1/")

        # Extract text from PDF
        pdf_text = extract_text(PDF_TO_READ)

        # Create search prompt
        prompt_text = f"Summarize the following text: {pdf_text}"

        # Capture the response and print
        bard_response = prompt_bard(driver, prompt_text)

        print("Bard Response:")
        print(bard_response)

    except Exception as e:
        # DO NOT DO THIS. Use proper exception handling!
        print(e)

    finally:
        print("Closing ChromeDriver")
        # Only quit if the browser actually started
        if driver is not None:
            driver.quit()
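
The fixed `time.sleep(10.0)` in `prompt_bard` can return too early for long answers or waste time on short ones. One alternative, sketched below, is to poll the response text until it stops changing. The helper name, the timing defaults, and the idea of treating an unchanged reading as a finished answer are all assumptions of this sketch rather than anything Bard documents.

```python
import time


def wait_for_stable_text(get_text, timeout=60.0, interval=1.0, stable_reads=3):
    """Poll get_text() until a non-empty value is unchanged for stable_reads polls in a row.

    get_text is any zero-argument callable returning the current text.
    Raises TimeoutError if the text has not settled within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    last_text = None
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            text = get_text()
        except Exception:
            text = ""  # element not on the page yet; keep polling
        if text and text == last_text:
            unchanged += 1
            if unchanged >= stable_reads:
                return text
        else:
            unchanged = 0
        last_text = text
        time.sleep(interval)
    raise TimeoutError("response text did not settle in time")
```

Inside `prompt_bard`, the sleep and the final lookup could then be replaced with a call such as `wait_for_stable_text(lambda: web_driver.find_element(By.CLASS_NAME, "model-response-text").text)`.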
