Amro Hendawi committed on
Commit 03b2221
• 2 Parent(s): 3ccc981 2846658

Merge pull request #11 from borhenryk/9-migrate-to-haystack-2.0
.gitignore CHANGED
@@ -2,4 +2,5 @@
  .vscode
  .idea
  *.pyc
- **/.DS_Store
+ **/.DS_Store
+ venv/
Dockerfile ADDED
@@ -0,0 +1,29 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     curl \
+     software-properties-common \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt .
+
+ RUN pip3 install -r requirements.txt
+
+ COPY . .
+
+ # extract version
+ COPY .git ./.git
+ RUN git rev-parse --short HEAD > revision.txt
+ RUN rm -rf ./.git
+
+ EXPOSE 8501
+
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+
+ ENV PYTHONPATH "${PYTHONPATH}:."
+
+ ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -9,70 +9,63 @@ app_file: app.py
 pinned: false
 ---
 
- # Template Streamlit App for Haystack Search Pipelines
+ # Document Insights - Extractive & Generative Methods using Haystack
 
- This template [Streamlit](https://docs.streamlit.io/) app is set up for simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to do QA with **Retrieval-Augmented Generation** or **Extractive QA**.
-
- See the ['How to use this template'](#how-to-use-this-template) instructions below to create a simple UI for your own Haystack search pipelines.
-
- Below you will also find instructions on how you could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
+ This template [Streamlit](https://docs.streamlit.io/) app is set up for
+ simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to
+ do QA with **Retrieval-Augmented Generation** or **Extractive QA**.
+
+ Below you will also find instructions on how you
+ could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
 
 ## Installation and Running
- To run the bare application which does _nothing_:
- 1. Install requirements: `pip install -r requirements.txt`
- 2. Run the streamlit app: `streamlit run app.py`
-
- This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll notice that the app will only show you instructions on what to edit.
 
- ### Optional Configurations
-
- You can set optional configurations to set the:
- - `--task` you want to start the app with: `rag` or `extractive` (default: rag)
- - `--store` you want to use: `inmemory`, `opensearch`, `weaviate` or `milvus` (default: inmemory)
- - `--name` you want to have for the app. (default: 'My Search App')
-
- E.g.:
-
- ```bash
- streamlit run app.py -- --store opensearch --task extractive --name 'My Opensearch Documentation Search'
- ```
-
- In a `.env` file, include all the config settings that you would like to use based on:
- - The DocumentStore of your choice
- - The Extractive/Generative model of your choice
-
- While the `/utils/config.py` will create default values for some configurations, others have to be set in the `.env` such as the `OPENAI_KEY`
-
- Example `.env`
-
- ```
- OPENAI_KEY=YOUR_KEY
- EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L12-v2
- GENERATIVE_MODEL=text-davinci-003
- ```
-
- ## How to use this template
- 1. Create a new repository from this template or simply open it in a codespace to start playing around 💙
- 2. Make sure your `requirements.txt` file includes the Haystack and Streamlit versions you would like to use.
- 3. Change the code in `utils/haystack.py` if you would like a different pipeline.
- 4. Create a `.env` file with all of your configuration settings.
- 5. Make any UI edits you'd like to and [share with the Haystack community](https://haystack.deepset.ai/community)
- 6. Run the app as shown in [installation and running](#installation-and-running)
+ ### Local development
+
+ To run the bare application which does _nothing_:
+
+ 1. Install requirements: `pip install -r requirements.txt`
+ 2. Run the streamlit app: `streamlit run app.py`
+
+ This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll
+ notice that the app will only show you instructions on what to edit.
+
+ ### Docker
+
+ To run the app in a Docker container:
+
+ 1. Build the Docker image: `docker build -t haystack-streamlit .`
+ 2. Run the Docker container: `docker run -p 8501:8501 haystack-streamlit` (make sure to bind any other ports you need)
+ 3. Open your browser and go to `http://localhost:8501`
 
 ### Repo structure
- - `./utils`: This is where we have 3 files:
-   - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it uses default values. An example of this is in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
-   - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and cache it, and `query()` which is the function called by `app.py` once a user query is received.
+
+ - `./utils`: This is where we have 3 files:
+   - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it
+     uses default values. An example of this is
+     in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
+   - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search
+     pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and
+     cache it, and `query()` which is the function called by `app.py` once a user query is received.
   - `ui.py`: Use this file for any UI and initial value setups.
- - `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search bar, a 'Run' button, and a response that you can highlight answers with.
+ - `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search
+   bar, a 'Run' button, and a response that you can highlight answers with.
+ - `requirements.txt`: This file includes the required libraries to run the Streamlit app.
+ - `document_qa_engine.py`: This file includes the QA pipeline with Haystack.
 
 ### What to edit?
+
 There are default pipelines both in `start_haystack_extractive()` and `start_haystack_rag()`
 
 - Change the pipelines to use the embedding models, extractive or generative models as you need.
- - If using the `rag` task, change the `default_prompt_template` to use one of our available ones on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
+ - If using the `rag` task, change the `default_prompt_template` to use one of our available ones
+   on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
+
+ ### Using local LLM models
+
+ To use the `local LLM` mode you can use [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/).
+ For more info on how to run the app with a local LLM model please refer to the documentation of the tool you are using.
+ The `local_llm` mode expects an API available at `http://localhost:1234/v1`.
 
 ## Pushing to Hugging Face Spaces 🤗
 
@@ -83,15 +76,19 @@ A few things to pay attention to:
 1. Create a New Space on Hugging Face with the Streamlit SDK.
 2. Create a Hugging Face token on your HF account.
 3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
- 4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for your HF Space too!
- 5. This readme is set up to tell HF spaces that it's using streamlit and that the app is running on `app.py`, make any changes to the frontmatter of this readme to display the title, emoji etc. you desire.
- 6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information, and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml) working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
+ 4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for
+    your HF Space too!
+ 5. This readme is set up to tell HF spaces that it's using streamlit and that the app is running on `app.py`, make any
+    changes to the frontmatter of this readme to display the title, emoji etc. you desire.
+ 6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information,
+    and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml)
+    working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
 
 ```yaml
 name: Sync to Hugging Face hub
 on:
   push:
-     branches: [main]
+     branches: [ main ]
 
   # to run this workflow manually from the Actions tab
   workflow_dispatch:
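
As a usage sketch (not part of the commit): the `local LLM` mode described in the new README expects an OpenAI-compatible server at `http://localhost:1234/v1`, which is the default address of LM Studio's local server. A minimal client request could look like this; the `/chat/completions` path and payload shape are the standard OpenAI-compatible contract, not something taken from this repo:

```python
import json
from urllib import request

# Default address expected by the app's local_llm mode
BASE_URL = "http://localhost:1234/v1"

def build_chat_request(question: str) -> request.Request:
    # OpenAI-compatible chat payload; most local servers ignore the model name
    payload = {
        "model": "local-model",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 100,
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is the candidate's current role?")
print(req.full_url)
```

Sending the request with `request.urlopen(req)` only succeeds when a local server is actually listening on port 1234.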
app.py CHANGED
@@ -1,284 +1,240 @@
- from utils.check_pydantic_version import use_pydantic_v1
- use_pydantic_v1()  # This function has to be run before importing haystack, as haystack requires pydantic v1 to run
-
-
- from operator import index
- import streamlit as st
- import logging
 import os
-
- from annotated_text import annotation
- from json import JSONDecodeError
- from markdown import markdown
- from utils.config import parser
- from utils.haystack import start_document_store, query, initialize_pipeline, start_preprocessor_node, start_retriever, start_reader
- from utils.ui import reset_results, set_initial_state
 import pandas as pd
- import haystack
-
- from datetime import datetime
- import streamlit.components.v1 as components
 import streamlit_authenticator as stauth
- import pickle
-
 from streamlit_modal import Modal
- import numpy as np
-
-
 
- names = ['mlreply']
- usernames = ['mlreply']
- with open('hashed_password.pkl', 'rb') as f:
-     hashed_passwords = pickle.load(f)
-
-
- # Whether the file upload should be enabled or not
- DISABLE_FILE_UPLOAD = bool(os.getenv("DISABLE_FILE_UPLOAD"))
-
-
- def show_documents_list(retrieved_documents):
-     data = []
-     for i, document in enumerate(retrieved_documents):
-         data.append([document.meta['name']])
-     df = pd.DataFrame(data, columns=['Uploaded Document Name'])
-     df.drop_duplicates(subset=['Uploaded Document Name'], inplace=True)
-     df.index = np.arange(1, len(df) + 1)
-     return df
-
- # Define a function to handle file uploads
- def upload_files():
-     uploaded_files = upload_container.file_uploader(
-         "upload", type=["pdf", "txt", "docx"], accept_multiple_files=True, label_visibility="hidden", key=1
-     )
-     return uploaded_files
-
-
- # Define a function to process a single file
- def process_file(data_file, preprocesor, document_store):
-     # read file and add content
-     file_contents = data_file.read().decode("utf-8")
-     docs = [{
-         'content': str(file_contents),
-         'meta': {'name': str(data_file.name)}
-     }]
-     try:
-         names = [item.meta.get('name') for item in document_store.get_all_documents()]
-         # if args.store == 'inmemory':
-         #     doc = converter.convert(file_path=files, meta=None)
-         if data_file.name in names:
-             print(f"{data_file.name} already processed")
-         else:
-             print(f'preprocessing uploaded doc {data_file.name}.......')
-             # print(data_file.read().decode("utf-8"))
-             preprocessed_docs = preprocesor.process(docs)
-             print('writing to document store.......')
-             document_store.write_documents(preprocessed_docs)
-             print('updating embedding.......')
-             document_store.update_embeddings(retriever)
-     except Exception as e:
-         print(e)
-
-
- # Define a function to upload the documents to haystack document store
- def upload_document():
-     if data_files is not None:
-         for data_file in data_files:
-             # Upload file
-             if data_file:
-                 try:
-                     # raw_json = upload_doc(data_file)
-                     # Call the process_file function for each uploaded file
-                     if args.store == 'inmemory':
-                         processed_data = process_file(data_file, preprocesor, document_store)
-                         # upload_container.write(str(data_file.name) + "    ✅ ")
-                 except Exception as e:
-                     upload_container.write(str(data_file.name) + "    ❌ ")
-                     upload_container.write("_This file could not be parsed, see the logs for more information._")
-
- # Define a function to reset the documents in haystack document store
- def reset_documents():
-     print('\nResetting documents list at ' + str(datetime.now()) + '\n')
-     st.session_state.data_files = None
-     document_store.delete_documents()
-
- try:
-     args = parser.parse_args()
-     preprocesor = start_preprocessor_node()
-     document_store = start_document_store(type=args.store)
-     document_store.get_all_documents()
-     retriever = start_retriever(document_store)
-     reader = start_reader()
     st.set_page_config(
-         page_title="MLReplySearch",
-         layout="centered",
         page_icon=":shark:",
         menu_items={
             'Get Help': 'https://www.extremelycoolapp.com/help',
             'Report a bug': "https://www.extremelycoolapp.com/bug",
             'About': "# This is a header. This is an *extremely* cool app!"
         }
     )
-     st.sidebar.image("ml_logo.png", use_column_width=True)
 
-     authenticator = stauth.Authenticate(names, usernames, hashed_passwords, "document_search", "random_text", cookie_expiry_days=1)
 
-     name, authentication_status, username = authenticator.login("Login", "main")
 
-     if authentication_status == False:
-         st.error("Username/Password is incorrect")
 
-     if authentication_status == None:
-         st.warning("Please enter your username and password")
 
-     if authentication_status:
 
-         # Sidebar for Task Selection
-         st.sidebar.header('Options:')
 
-         # OpenAI Key Input
-         openai_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password")
 
-         if openai_key:
-             task_options = ['Extractive', 'Generative']
-         else:
-             task_options = ['Extractive']
 
-         task_selection = st.sidebar.radio('Select the task:', task_options)
 
-         # Check the task and initialize pipeline accordingly
-         if task_selection == 'Extractive':
-             pipeline_extractive = initialize_pipeline("extractive", document_store, retriever, reader)
-         elif task_selection == 'Generative' and openai_key:  # Check for openai_key to ensure user has entered it
-             pipeline_rag = initialize_pipeline("rag", document_store, retriever, reader, openai_key=openai_key)
 
-         set_initial_state()
 
-         modal = Modal("Manage Files", key="demo-modal")
-         open_modal = st.sidebar.button("Manage Files", use_container_width=True)
-         if open_modal:
-             modal.open()
 
-         st.write('# ' + args.name)
-         if modal.is_open():
-             with modal.container():
-                 if not DISABLE_FILE_UPLOAD:
-                     upload_container = st.container()
-                     data_files = upload_files()
-                     upload_document()
-                     st.session_state.sidebar_state = 'collapsed'
-                 st.table(show_documents_list(document_store.get_all_documents()))
-
-         # File upload block
-         # if not DISABLE_FILE_UPLOAD:
-         #     upload_container = st.sidebar.container()
-         #     upload_container.write("## File Upload:")
-         #     data_files = upload_files()
-         # Button to update files in the documentStore
-         #     upload_container.button('Upload Files', on_click=upload_document, args=())
 
-         # Button to reset the documents in DocumentStore
-         st.sidebar.button("Reset documents", on_click=reset_documents, args=(), use_container_width=True)
 
-         if "question" not in st.session_state:
-             st.session_state.question = ""
-         # Search bar
-         question = st.text_input("Question", value=st.session_state.question, max_chars=100, on_change=reset_results, label_visibility="hidden")
 
-         run_pressed = st.button("Run")
 
-         run_query = (
-             run_pressed or question != st.session_state.question  # or task_selection != st.session_state.task
-         )
 
-         # Get results for query
-         if run_query and question:
-             if task_selection == 'Extractive':
-                 reset_results()
-                 st.session_state.question = question
-                 with st.spinner("🔎    Running your pipeline"):
-                     try:
-                         st.session_state.results_extractive = query(pipeline_extractive, question)
-                         st.session_state.task = task_selection
-                     except JSONDecodeError as je:
-                         st.error(
-                             "👓    An error occurred reading the results. Is the document store working?"
-                         )
-                     except Exception as e:
-                         logging.exception(e)
-                         st.error("🐞    An error occurred during the request.")
-
-             elif task_selection == 'Generative':
-                 reset_results()
-                 st.session_state.question = question
-                 with st.spinner("🔎    Running your pipeline"):
-                     try:
-                         st.session_state.results_generative = query(pipeline_rag, question)
-                         st.session_state.task = task_selection
-                     except JSONDecodeError as je:
-                         st.error(
-                             "👓    An error occurred reading the results. Is the document store working?"
-                         )
-                     except Exception as e:
-                         if "API key is invalid" in str(e):
-                             logging.exception(e)
-                             st.error("🐞    incorrect API key provided. You can find your API key at https://platform.openai.com/account/api-keys.")
-                         else:
-                             logging.exception(e)
-                             st.error("🐞    An error occurred during the request.")
-         # Display results
-         if (st.session_state.results_extractive or st.session_state.results_generative) and run_query:
-
-             # Handle Extractive Answers
-             if task_selection == 'Extractive':
-                 results = st.session_state.results_extractive
-
-                 st.subheader("Extracted Answers:")
-
-                 if 'answers' in results:
-                     answers = results['answers']
-                     treshold = 0.2
-                     higher_then_treshold = any(ans.score > treshold for ans in answers)
-                     if not higher_then_treshold:
-                         st.markdown(f"<span style='color:red'>Please note none of the answers achieved a score higher then {int(treshold) * 100}%. Which probably means that the desired answer is not in the searched documents.</span>", unsafe_allow_html=True)
-                     for count, answer in enumerate(answers):
-                         if answer.answer:
-                             text, context = answer.answer, answer.context
-                             start_idx = context.find(text)
-                             end_idx = start_idx + len(text)
-                             score = round(answer.score, 3)
-                             st.markdown(f"**Answer {count + 1}:**")
-                             st.markdown(
-                                 context[:start_idx] + str(annotation(body=text, label=f'SCORE {score}', background='#964448', color='#ffffff')) + context[end_idx:],
-                                 unsafe_allow_html=True,
-                             )
-                 else:
-                     st.info(
-                         "🤔 &nbsp;&nbsp; Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!"
-                     )
-
-             # Handle Generative Answers
-             elif task_selection == 'Generative':
-                 results = st.session_state.results_generative
-                 st.subheader("Generated Answer:")
-                 if 'results' in results:
-                     st.markdown("**Answer:**")
-                     st.write(results['results'][0])
-
-             # Handle Retrieved Documents
-             if 'documents' in results:
-                 retrieved_documents = results['documents']
-                 st.subheader("Retriever Results:")
-
-                 data = []
-                 for i, document in enumerate(retrieved_documents):
-                     # Truncate the content
-                     truncated_content = (document.content[:150] + '...') if len(document.content) > 150 else document.content
-                     data.append([i + 1, document.meta['name'], truncated_content])
-
-                 # Convert data to DataFrame and display using Streamlit
-                 df = pd.DataFrame(data, columns=['Ranked Context', 'Document Name', 'Content'])
-                 st.table(df)
- except SystemExit as e:
-     os._exit(e.code)
 import os
+ from dotenv import load_dotenv
 import pandas as pd
+ import streamlit as st
 import streamlit_authenticator as stauth
 from streamlit_modal import Modal
 
+ from utils import new_file, clear_memory, append_documentation_to_sidebar, load_authenticator_config, init_qa, \
+     append_header
+ from haystack.document_stores.in_memory import InMemoryDocumentStore
+ from haystack import Document
+
+ load_dotenv()
+
+ OPENAI_MODELS = ['gpt-3.5-turbo',
+                  "gpt-4",
+                  "gpt-4-1106-preview"]
+
+ OPEN_MODELS = [
+     'mistralai/Mistral-7B-Instruct-v0.1',
+     'HuggingFaceH4/zephyr-7b-beta'
+ ]
+
+
+ def reset_chat_memory():
+     st.button(
+         'Reset chat memory',
+         key="reset-memory-button",
+         on_click=clear_memory,
+         help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.",
+         disabled=False)
+
+
+ def manage_files(modal, document_store):
+     open_modal = st.sidebar.button("Manage Files", use_container_width=True)
+     if open_modal:
+         modal.open()
+
+     if modal.is_open():
+         with modal.container():
+             uploaded_file = st.file_uploader(
+                 "Upload a CV in PDF format",
+                 type=("pdf",),
+                 on_change=new_file(),
+                 disabled=st.session_state['document_qa_model'] is None,
+                 label_visibility="collapsed",
+                 help="The document is used to answer your questions. The system will process the document and store it in a RAG to answer your questions.",
+             )
+             edited_df = st.data_editor(use_container_width=True, data=st.session_state['files'],
+                                        num_rows='dynamic',
+                                        column_order=['name', 'size', 'is_active'],
+                                        column_config={'name': {'editable': False}, 'size': {'editable': False},
+                                                       'is_active': {'editable': True, 'type': 'checkbox',
+                                                                     'width': 100}}
+                                        )
+             st.session_state['files'] = pd.DataFrame(columns=['name', 'content', 'size', 'is_active'])
+
+             if uploaded_file:
+                 st.session_state['file_uploaded'] = True
+                 st.session_state['files'] = pd.concat([st.session_state['files'], edited_df])
+                 with st.spinner('Processing the CV content...'):
+                     store_file_in_table(document_store, uploaded_file)
+                     ingest_document(uploaded_file)
+
+
+ def ingest_document(uploaded_file):
+     if not st.session_state['document_qa_model']:
+         st.warning('Please select a model to start asking questions')
+     else:
+         try:
+             st.session_state['document_qa_model'].ingest_pdf(uploaded_file)
+             st.success('Document processed successfully')
+         except Exception as e:
+             st.error(f"Error processing the document: {e}")
+             st.session_state['file_uploaded'] = False
+
+
+ def store_file_in_table(document_store, uploaded_file):
+     pdf_content = uploaded_file.getvalue()
+     st.session_state['pdf_content'] = pdf_content
+     st.session_state.messages = []
+     document = Document(content=pdf_content, meta={"name": uploaded_file.name})
+     df = pd.DataFrame(st.session_state['files'])
+     df['is_active'] = False
+     st.session_state['files'] = pd.concat([df, pd.DataFrame(
+         [{"name": uploaded_file.name, "content": pdf_content, "size": len(pdf_content),
+           "is_active": True}])])
+     document_store.write_documents([document])
+
+
+ def init_session_state():
+     st.session_state.setdefault('files', pd.DataFrame(columns=['name', 'content', 'size', 'is_active']))
+     st.session_state.setdefault('models', [])
+     st.session_state.setdefault('api_keys', {})
+     st.session_state.setdefault('current_selected_model', 'gpt-3.5-turbo')
+     st.session_state.setdefault('current_api_key', '')
+     st.session_state.setdefault('messages', [])
+     st.session_state.setdefault('pdf_content', None)
+     st.session_state.setdefault('memory', None)
+     st.session_state.setdefault('pdf', None)
+     st.session_state.setdefault('document_qa_model', None)
+     st.session_state.setdefault('file_uploaded', False)
+
+
+ def set_page_config():
     st.set_page_config(
+         page_title="CV Insights AI Assistant",
         page_icon=":shark:",
+         initial_sidebar_state="expanded",
+         layout="wide",
         menu_items={
             'Get Help': 'https://www.extremelycoolapp.com/help',
             'Report a bug': "https://www.extremelycoolapp.com/bug",
             'About': "# This is a header. This is an *extremely* cool app!"
         }
     )
 
 
+ def update_running_model(api_key, model):
+     st.session_state['api_keys'][model] = api_key
+     st.session_state['document_qa_model'] = init_qa(model, api_key)
 
 
+ def init_api_key_dict():
+     st.session_state['models'] = OPENAI_MODELS + list(OPEN_MODELS) + ['local LLM']
+     for model_name in OPENAI_MODELS:
+         st.session_state['api_keys'][model_name] = None
 
 
+ def display_chat_messages(chat_box, chat_input):
+     with chat_box:
+         if chat_input:
+             for message in st.session_state.messages:
+                 with st.chat_message(message["role"]):
+                     st.markdown(message["content"], unsafe_allow_html=True)
 
+             st.chat_message("user").markdown(chat_input)
+             st.session_state.messages.append({"role": "user", "content": chat_input})
+             with st.chat_message("assistant"):
+                 response = st.session_state['document_qa_model'].process_message(chat_input)
+                 st.markdown(response)
+                 st.session_state.messages.append({"role": "assistant", "content": response})
 
 
+ def setup_model_selection():
+     model = st.selectbox(
+         "Model:",
+         options=st.session_state['models'],
+         index=0,  # default to the first model in the list, gpt-3.5-turbo
+         placeholder="Select model",
+         help="Select an LLM:"
+     )
 
+     if model:
+         if model != st.session_state['current_selected_model']:
+             st.session_state['current_selected_model'] = model
+             if model == 'local LLM':
+                 st.session_state['document_qa_model'] = init_qa(model)
 
+     api_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password",
+                                     disabled=st.session_state['current_selected_model'] == 'local LLM')
+     if api_key and api_key != st.session_state['current_api_key']:
+         update_running_model(api_key, model)
+         st.session_state['current_api_key'] = api_key
 
+     return model
 
 
+ def setup_task_selection(model):
+     # enable extractive and generative tasks if we're using a local LLM or an OpenAI model with an API key
+     if model == 'local LLM' or st.session_state['api_keys'].get(model):
+         task_options = ['Extractive', 'Generative']
+     else:
+         task_options = ['Extractive']
 
+     task_selection = st.sidebar.radio('Select the task:', task_options)
 
+     # TODO: Add the task selection logic here (initializing the model based on the task)
 
 
+ def setup_page_body():
+     chat_box = st.container(height=350, border=False)
+     chat_input = st.chat_input(
+         placeholder="Upload a document to start asking questions...",
+         disabled=not st.session_state['file_uploaded'],
+     )
+     if st.session_state['file_uploaded']:
+         display_chat_messages(chat_box, chat_input)
+
+
+ class StreamlitApp:
+     def __init__(self):
+         self.authenticator_config = load_authenticator_config()
+         self.document_store = InMemoryDocumentStore()
+         set_page_config()
+         self.authenticator = self.init_authenticator()
+         init_session_state()
+         init_api_key_dict()
+
+     def init_authenticator(self):
+         return stauth.Authenticate(
+             self.authenticator_config['credentials'],
+             self.authenticator_config['cookie']['name'],
+             self.authenticator_config['cookie']['key'],
+             self.authenticator_config['cookie']['expiry_days']
         )
 
+     def setup_sidebar(self):
+         with st.sidebar:
+             st.sidebar.image("resources/ml_logo.png", use_column_width=True)
+
+             # Sidebar for Task Selection
+             st.sidebar.header('Options:')
+             model = setup_model_selection()
+             setup_task_selection(model)
+             st.divider()
+             self.authenticator.logout()
+             reset_chat_memory()
+             modal = Modal("Manage Files", key="demo-modal")
+             manage_files(modal, self.document_store)
+             st.divider()
+             append_documentation_to_sidebar()
+
+     def run(self):
+         name, authentication_status, username = self.authenticator.login()
+         if authentication_status:
+             self.run_authenticated_app()
+         elif st.session_state["authentication_status"] is False:
+             st.error('Username/password is incorrect')
+         elif st.session_state["authentication_status"] is None:
+             st.warning('Please enter your username and password')
+
+     def run_authenticated_app(self):
+         self.setup_sidebar()
+         append_header()
+         setup_page_body()
+
+
+ app = StreamlitApp()
+ app.run()
 
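The `store_file_in_table` function in the new `app.py` keeps a `files` DataFrame with `name`, `content`, `size`, and `is_active` columns, deactivating previously uploaded files before appending the new one as active. A standalone sketch of that bookkeeping (the `register_file` helper name is illustrative, not from the commit):

```python
import pandas as pd

def register_file(files: pd.DataFrame, name: str, content: bytes) -> pd.DataFrame:
    # Mirror store_file_in_table: deactivate existing rows, append the new file as active
    files = files.copy()
    files["is_active"] = False
    new_row = pd.DataFrame([{"name": name, "content": content,
                             "size": len(content), "is_active": True}])
    return pd.concat([files, new_row], ignore_index=True)

files = pd.DataFrame(columns=["name", "content", "size", "is_active"])
files = register_file(files, "cv.pdf", b"%PDF-1.4 v1")
files = register_file(files, "cv_v2.pdf", b"%PDF-1.4 v2+")
print(files[["name", "is_active"]])
```

After two uploads, only the most recent file remains active, which matches how the modal's data editor then lets the user toggle `is_active` by hand.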
authenticator_config.yaml ADDED
@@ -0,0 +1,15 @@
+ credentials:
+   usernames:
+     mlreply:
+       email: mlreply@reply.de
+       failed_login_attempts: 0  # Will be managed automatically
+       logged_in: False  # Will be managed automatically
+       name: ML Reply
+       password: mlreply  # Will be hashed automatically
+ cookie:
+   expiry_days: 1
+   key: some_signature_key  # Must be string
+   name: some_cookie_name
+ #pre-authorized:
+ #  emails:
+ #    - melsby@gmail.com
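For orientation (not part of the commit): `init_authenticator` in the new `app.py` reads this file through `load_authenticator_config` and passes the nested `credentials` and `cookie` sections to `stauth.Authenticate`. A minimal sketch of that parsing with PyYAML; the inline `CONFIG` string abbreviates the file above, and the helper name mirrors the one in `utils`:

```python
import yaml

# Abbreviated copy of authenticator_config.yaml for illustration
CONFIG = """
credentials:
  usernames:
    mlreply:
      email: mlreply@reply.de
      name: ML Reply
      password: mlreply
cookie:
  expiry_days: 1
  key: some_signature_key
  name: some_cookie_name
"""

def load_authenticator_config(text: str) -> dict:
    # Parse the YAML into the nested dict structure stauth.Authenticate expects
    return yaml.safe_load(text)

config = load_authenticator_config(CONFIG)
print(config["cookie"]["name"])
print(config["credentials"]["usernames"]["mlreply"]["name"])
```

The `cookie.key` signs the session cookie, so it should be changed from the placeholder before deploying.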
document_qa_engine.py ADDED
@@ -0,0 +1,120 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List
2
+ from pypdf import PdfReader
3
+ from haystack.utils import Secret
4
+ from haystack import Pipeline, Document, component
5
+
6
+ from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
7
+ from haystack.components.writers import DocumentWriter
8
+ from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
9
+ from haystack.document_stores.in_memory import InMemoryDocumentStore
10
+ from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
11
+ from haystack.components.builders import PromptBuilder
12
+ from haystack.components.generators.chat import OpenAIChatGenerator, HuggingFaceTGIChatGenerator
13
+ from haystack.components.generators import OpenAIGenerator, HuggingFaceTGIGenerator
14
+ from haystack.document_stores.types import DuplicatePolicy
15
+
16
+ SENTENCE_RETREIVER_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
17
+
18
+ MAX_TOKENS = 500
19
+
20
+ template = """
21
+ As a professional HR recruiter, answer the question concisely in one or two sentences, using only the information given below.
22
+
23
+ Context:
24
+ {% for document in documents %}
25
+ {{ document.content }}
26
+ {% endfor %}
27
+
28
+ Question: {{question}}
29
+ Answer:
30
+ """
31
+
32
+
33
+ @component
34
+ class UploadedFileConverter:
35
+ """
36
+ A component to convert uploaded PDF files to Documents
37
+ """
38
+
39
+ @component.output_types(documents=List[Document])
40
+ def run(self, uploaded_file):
41
+ pdf = PdfReader(uploaded_file)
42
+ documents = []
43
+ # uploaded file name without the .pdf extension (rstrip would strip characters, not the suffix)
44
+ name = uploaded_file.name[:-4] if uploaded_file.name.lower().endswith('.pdf') else uploaded_file.name
45
+ for page in pdf.pages:
46
+ documents.append(
47
+ Document(
48
+ content=page.extract_text(),
49
+ meta={'name': name + f"_{page.page_number}"}))
50
+ return {"documents": documents}
51
+
52
+
53
+ def create_ingestion_pipeline(document_store):
54
+ doc_embedder = SentenceTransformersDocumentEmbedder(model=SENTENCE_RETREIVER_MODEL)
55
+ doc_embedder.warm_up()
56
+
57
+ pipeline = Pipeline()
58
+ pipeline.add_component("converter", UploadedFileConverter())
59
+ pipeline.add_component("cleaner", DocumentCleaner())
60
+ pipeline.add_component("splitter",
61
+ DocumentSplitter(split_by="passage", split_length=100, split_overlap=10))
62
+ pipeline.add_component("embedder", doc_embedder)
63
+ pipeline.add_component("writer",
64
+ DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
65
+
66
+ pipeline.connect("converter", "cleaner")
67
+ pipeline.connect("cleaner", "splitter")
68
+ pipeline.connect("splitter", "embedder")
69
+ pipeline.connect("embedder", "writer")
70
+ return pipeline
71
+
72
+
73
+ def create_query_pipeline(document_store, model_name, api_key):
74
+ prompt_builder = PromptBuilder(template=template)
75
+ if model_name == "local LLM":
76
+ generator = OpenAIGenerator(model=model_name,
77
+ api_base_url="http://localhost:1234/v1",
78
+ generation_kwargs={"max_tokens": MAX_TOKENS}
79
+ )
80
+ elif "gpt" in model_name:
81
+ generator = OpenAIGenerator(api_key=Secret.from_token(api_key), model=model_name,
82
+ generation_kwargs={"max_tokens": MAX_TOKENS}
83
+ )
84
+ else:
85
+ generator = HuggingFaceTGIGenerator(token=Secret.from_token(api_key), model=model_name,
86
+ generation_kwargs={"max_new_tokens": MAX_TOKENS}
87
+ )
88
+
89
+ query_pipeline = Pipeline()
90
+ query_pipeline.add_component("text_embedder",
91
+ SentenceTransformersTextEmbedder(model=SENTENCE_RETREIVER_MODEL))
92
+ query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
93
+ query_pipeline.add_component("prompt_builder", prompt_builder)
94
+ query_pipeline.add_component("generator", generator)
95
+
96
+ query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
97
+ query_pipeline.connect("retriever.documents", "prompt_builder.documents")
98
+ query_pipeline.connect("prompt_builder", "generator")
99
+
100
+ return query_pipeline
101
+
102
+
103
+ class DocumentQAEngine:
104
+ def __init__(self,
105
+ model_name,
106
+ api_key=None
107
+ ):
108
+ self.api_key = api_key
109
+ self.model_name = model_name
110
+ document_store = InMemoryDocumentStore()
111
+ self.chunks = []
112
+ self.query_pipeline = create_query_pipeline(document_store, model_name, api_key)
113
+ self.pdf_ingestion_pipeline = create_ingestion_pipeline(document_store)
114
+
115
+ def ingest_pdf(self, uploaded_file):
116
+ self.pdf_ingestion_pipeline.run({"converter": {"uploaded_file": uploaded_file}})
117
+
118
+ def process_message(self, query):
119
+ response = self.query_pipeline.run({"text_embedder": {"text": query}, "prompt_builder": {"question": query}})
120
+ return response["generator"]["replies"][0]
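A note on the converter above: it needs to drop the `.pdf` extension from the uploaded file name, and `str.rstrip` is a trap here because it treats its argument as a *set of characters* to strip, not a suffix. A minimal sketch of a safe helper (the function name `strip_pdf_suffix` is illustrative, not part of the codebase):

```python
def strip_pdf_suffix(filename: str) -> str:
    # Slice off a case-insensitively matched ".pdf" suffix;
    # leave other filenames untouched.
    return filename[:-4] if filename.lower().endswith('.pdf') else filename

# rstrip('.PDF') removes any trailing run of the characters '.', 'P', 'D', 'F',
# so it can eat into the name itself:
print("cv_FRED.PDF".rstrip('.PDF'))    # cv_FRE  (the final 'D' is also stripped)
print(strip_pdf_suffix("cv_FRED.PDF"))  # cv_FRED
```

On Python 3.9+ `filename.removesuffix('.pdf')` covers the lowercase case, but a case-insensitive check is needed for uppercase extensions like `.PDF`.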
generate_keys.py DELETED
@@ -1,15 +0,0 @@
1
- # -*- coding: utf-8 -*-
2
-
3
- import pickle
4
- from pathlib import Path
5
-
6
- import streamlit_authenticator as stauth
7
-
8
- names = ['mlreply']
9
- usernames = ['mlreply']
10
- passwords = ['mlreply1']
11
-
12
- hashed_passwords = stauth.Hasher((passwords)).generate()
13
-
14
- with open('hashed_password.pkl','wb') as f:
15
- pickle.dump(hashed_passwords, f)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
hashed_password.pkl DELETED
Binary file (78 Bytes)
 
requirements.txt CHANGED
@@ -1,10 +1,18 @@
1
- scikit-learn==1.3.2
2
- safetensors==0.3.3.post1
3
- farm-haystack[inference,weaviate,opensearch,file-conversion,pdf]==1.20.0
4
- milvus-haystack
5
- streamlit==1.23.0
6
- streamlit-authenticator==0.1.5
7
- streamlit_modal
8
- markdown
9
- st-annotated-text
10
- datasets
 
 
 
 
 
 
 
 
 
1
+ # Streamlit
2
+ streamlit~=1.32.2
3
+ streamlit-modal==0.1.2
4
+ streamlit-authenticator==0.3.2
5
+ streamlit-pdf-viewer==0.0.9
6
+
7
+ # LLM
8
+ haystack-ai~=2.0.0
9
+ sentence_transformers~=2.6.0
10
+
11
+ # Utils
12
+ pandas~=2.2.1
13
+ pypdf~=4.2.0
14
+ pytest~=8.1.1
15
+ python-dotenv~=1.0.1
16
+
17
+ # Dev Utils
18
+ watchdog
ml_logo.png β†’ resources/ml_logo.png RENAMED
File without changes
utils.py ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from document_qa_engine import DocumentQAEngine
2
+
3
+ import streamlit as st
4
+
5
+ import logging
6
+ from yaml import load, SafeLoader, YAMLError
7
+
8
+
9
+ def load_authenticator_config(file_path='authenticator_config.yaml'):
10
+ try:
11
+ with open(file_path, 'r') as file:
12
+ authenticator_config = load(file, Loader=SafeLoader)
13
+ return authenticator_config
14
+ except FileNotFoundError:
15
+ logging.error(f"File {file_path} not found.")
16
+ except YAMLError as error:
17
+ logging.error(f"Error parsing YAML file: {error}")
18
+
19
+
20
+ def new_file():
21
+ st.session_state['loaded_embeddings'] = None
22
+ st.session_state['doc_id'] = None
23
+ st.session_state['uploaded'] = True
24
+ clear_memory()
25
+
26
+
27
+ def clear_memory():
28
+ if st.session_state['memory']:
29
+ st.session_state['memory'].clear()
30
+
31
+
32
+ def init_qa(model, api_key=None):
33
+ print(f"Initializing QA with model: {model}")  # avoid logging the API key
34
+ return DocumentQAEngine(model, api_key=api_key)
35
+
36
+
37
+ def append_header():
38
+ _, header_container, _ = st.columns([0.25, 0.5, 0.25])
39
+ with header_container:
40
+ st.header('πŸ“„ Document Insights :rainbow[AI] Assistant πŸ“š', divider='rainbow')
41
+ st.text("πŸ“₯ Upload documents in PDF format. Get insights and ask questions.")
42
+
43
+
44
+ def append_documentation_to_sidebar():
45
+ with st.expander("Disclaimer"):
46
+ st.markdown(
47
+ """
48
+ :warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents solely
49
+ for the purpose of processing your request, and we **do not assume responsibility** for any subsequent use
50
+ or handling of the data submitted to third-party LLMs.
51
+ """)
52
+ with st.expander("Documentation"):
53
+ st.markdown(
54
+ """
55
+ Upload a CV as a PDF document. Once the spinner stops, you can start asking questions. The answers will
56
+ be displayed in the right column. The system answers your questions using the content of the document
57
+ and marks references in the PDF viewer.
58
+ """)