Cannot import NLTK stopwords after install #685
Closed · ksednew opened this issue Apr 19, 2018 · 22 comments
On a Mac using Python 3.6 and Anaconda. Have installed NLTK and used both command line and manual download of stop words. I see the stop word folder in NLTK folder, but cannot get it to load in my Jupyter notebook:
from nltk.corpus import stopwords
LookupError                               Traceback (most recent call last)
/anaconda3/lib/python3.6/site-packages/nltk/corpus/util.py in __load(self)
     79         except LookupError as e:
---> 80             try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
     81             except LookupError: raise e

/anaconda3/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
    672     resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673     raise LookupError(resource_not_found)
    674

LookupError:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

Searched in:
    - '/Users/ksednew/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/anaconda3/nltk_data'
    - '/anaconda3/lib/nltk_data'
During handling of the above exception, another exception occurred:

LookupError                               Traceback (most recent call last)
<ipython-input> in <module>()
      1 from nltk.corpus import stopwords
----> 2 stop = stopwords.words("english")
      3 def stopwords(x):
      4     x = re.sub("[^a-z\s]", " ", x.lower())
      5     x = [w for w in x.split()

/anaconda3/lib/python3.6/site-packages/nltk/corpus/util.py in __getattr__(self, attr)
    114             raise AttributeError("LazyCorpusLoader object has no attribute '__bases__'")
    115
--> 116         self.__load()
    117         # This looks circular, but its not, since __load() changes our
    118         # class to something new:

/anaconda3/lib/python3.6/site-packages/nltk/corpus/util.py in __load(self)
     79         except LookupError as e:
     80             try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
---> 81             except LookupError: raise e
     82
     83         # Load the corpus.

/anaconda3/lib/python3.6/site-packages/nltk/corpus/util.py in __load(self)
     76         else:
     77             try:
---> 78                 root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
     79             except LookupError as e:
     80                 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))

/anaconda3/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
    671     sep = '*' * 70
    672     resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673     raise LookupError(resource_not_found)
    674
    675

LookupError:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

Searched in:
    - '/Users/ksednew/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/anaconda3/nltk_data'
    - '/anaconda3/lib/nltk_data'
I have tried placing copies of the stopwords folder in various places (where it says it searched) as well as in the corpus folder, and still no luck. Any ideas?
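One common cause when copying the data manually (an assumption about this setup, but it matches the search behavior shown in the traceback) is dropping the stopwords folder directly into nltk_data instead of under its corpora subdirectory. A stdlib sketch of the layout the loader expects:

```python
import os

# NLTK resolves the resource as <search_path>/corpora/stopwords, so a
# manually downloaded folder must keep the intermediate "corpora" level.
home = os.path.expanduser("~")
expected = os.path.join(home, "nltk_data", "corpora", "stopwords", "english")
print(expected)  # the file the loader ultimately opens for English
```

If the folder sits at `~/nltk_data/stopwords` instead, the lookup fails exactly as shown above.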
We have official support for corpora, but I believe it does not function properly on Python 3.6. I will have to investigate.
Interesting, I thought on my other Mac running the same versions, it had worked, but I may be wrong.
I will add that I was not able to download the Stopwords corpora because of issues involving my company’s proxy:
>>> nltk.download('stopwords')
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:833)>
If you have ideas for that, maybe that would solve it. I’m wondering if I’m just dropping the manually downloaded version in the wrong place.
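When CERTIFICATE_VERIFY_FAILED is caused by a corporate proxy injecting its own certificate, a commonly used workaround is to swap in an unverified HTTPS context before downloading. This is a sketch, not an endorsement, since it disables certificate verification entirely:

```python
import ssl

# Replace the default HTTPS context with an unverified one so that
# urllib-based downloads (which nltk.download uses) skip certificate
# checks. Security tradeoff: only do this on a network you trust.
try:
    _unverified = ssl._create_unverified_context
except AttributeError:
    pass  # very old Pythons lack this attribute and verify by default
else:
    ssl._create_default_https_context = _unverified

# Then, in the same session:
# import nltk
# nltk.download('stopwords')
```

The cleaner fix is to install the proxy's CA certificate into the system or Python certificate store.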
Are there any updates to this? It seems no one ever commented as to whether this is a problem with the third party's support of Python 3.6 or something else.
Further, it sounds like this is a problem on your local machine @ksednew and I’m not certain how this is relevant to the buildpack.
Hemants-MacBook-Pro:TextSummarizer khemant$ python2
Python 2.7.15 (v2.7.15:ca079a3ea3, Apr 29 2018, 20:59:26)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 116, in __getattr__
    self.__load()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/util.py", line 81, in __load
    except LookupError: raise e
LookupError:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

Searched in:
    - '/Users/khemant/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/2.7/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/2.7/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/2.7/lib/nltk_data'
>>> nltk.download('stopwords')
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:726)>
False
This is really surprising.
@khemanta this is not running on Heroku. It seems this is an issue with your local installations. I cannot help you with this, but I suspect folks on StackOverflow can. Cheers
@ksednew Try downloading the averaged_perceptron_tagger module via nltk.download().
Also, restart the IDE after unzipping completes, then run your code.
From the comments, this sounds like a problem in local instances, which the buildpack can't address. Closing the issue.
I have the exact same error with Python 3.6; I downloaded the resources with the nltk.download method. Then when I deploy, I get the following errors:
2019-01-15T05:23:23.773231+00:00 app[web.1]: **********************************************************************
2019-01-15T05:23:23.773232+00:00 app[web.1]: Resource stopwords not found.
2019-01-15T05:23:23.773234+00:00 app[web.1]: Please use the NLTK Downloader to obtain the resource:
2019-01-15T05:23:23.773236+00:00 app[web.1]:
2019-01-15T05:23:23.773237+00:00 app[web.1]: >>> import nltk
2019-01-15T05:23:23.773239+00:00 app[web.1]: >>> nltk.download('stopwords')
2019-01-15T05:23:23.773241+00:00 app[web.1]:
2019-01-15T05:23:23.773242+00:00 app[web.1]: Searched in:
2019-01-15T05:23:23.773248+00:00 app[web.1]: - '/code/nltk_data'
2019-01-15T05:23:23.773249+00:00 app[web.1]: - '/usr/share/nltk_data'
2019-01-15T05:23:23.773251+00:00 app[web.1]: - '/usr/local/share/nltk_data'
2019-01-15T05:23:23.773252+00:00 app[web.1]: - '/usr/lib/nltk_data'
2019-01-15T05:23:23.773254+00:00 app[web.1]: - '/usr/local/lib/nltk_data'
2019-01-15T05:23:23.773255+00:00 app[web.1]: - '/usr/local/nltk_data'
2019-01-15T05:23:23.773257+00:00 app[web.1]: - '/usr/local/share/nltk_data'
2019-01-15T05:23:23.773259+00:00 app[web.1]: - '/usr/local/lib/nltk_data'
2019-01-15T05:23:23.773260+00:00 app[web.1]: **********************************************************************
2019-01-15T05:23:23.773261+00:00 app[web.1]:
@ksednew did you get any help with this issue? I'm having the same problem on my Linux OS.
Is anyone experiencing this when run on Heroku, or only locally?
I'm experiencing this when running on Heroku only. It works fine locally.
@raheebashraf Hi! Please could you open a new issue with steps to reproduce?
Experienced the same error while working on Google Colab.
@Jheel-patel Hi! Could you open a new issue with steps to reproduce?
I am using the code below to use stopwords through JupyterHub, which I have hosted on an AWS DLAMI Linux server.

python3 -m nltk.downloader stopwords
python3 -m nltk.downloader words
python3 -m nltk.downloader punkt

$ python3
>>> from nltk.corpus import words
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words("english"))
>>> print(stop_words)

This works fine while running in the Python terminal. But when I try the same in a Jupyter notebook, it fails with the error "Resource stopwords not found. Please use the NLTK Downloader to obtain the resource:", even though the download reports success:

$ python3
>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
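A likely explanation here (a guess, given the /root/nltk_data path in the output) is that the notebook kernel and the terminal run as different users or interpreters, so they search different nltk_data locations. A minimal stdlib check to run inside the notebook:

```python
import getpass
import sys

# The notebook kernel may use a different interpreter and user than the
# terminal, so nltk's default download location (~/nltk_data) can differ
# between the two environments.
print(sys.executable)     # interpreter used by this kernel
print(getpass.getuser())  # user whose home directory nltk will search
# import nltk; print(nltk.data.path)  # run where nltk is installed
```

If the two environments disagree, download the data with the notebook's own interpreter (or set the NLTK_DATA environment variable to a shared location).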
Well, I still have the same problem, but one way to work around it was to download the data to my local machine and then copy it into my container via the Dockerfile.
This solved the problem on Heroku.
Regarding the problem, I suspect that during the push process the content is downloaded and saved to a different path. That's the reason why the container couldn't find the nltk_data.
I will investigate this, and if I find a better solution I will update here.
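A minimal sketch of that Dockerfile workaround, assuming the corpora were downloaded locally into ./nltk_data and that the entry point is app.py (both names are hypothetical for this project):

```dockerfile
FROM python:3.6-slim

WORKDIR /app

# Install Python dependencies first to keep this layer cacheable
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the locally downloaded corpora instead of downloading at build
# time, and point NLTK at them explicitly
COPY nltk_data /app/nltk_data
ENV NLTK_DATA=/app/nltk_data

COPY . .
CMD ["python", "app.py"]
```

Setting NLTK_DATA removes any dependency on where the loader would otherwise search inside the container.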
Experienced the same error while working on Google Colab.

Run nltk.download('stopwords') in a separate cell just above the cell in which the error occurred. I was facing the same issue in Colab, but it's running now after doing this.
Experienced the same error while working on Google Colab.

Run nltk.download('stopwords') in a separate cell just above the cell in which the error occurred. I was facing the same issue in Colab, but it's running now after doing this.

It is working, thank you.
I am also facing the same issue. If anyone knows the solution, please let me know.
LookupError                               Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\nltk\corpus\util.py:84, in LazyCorpusLoader.__load(self)
     83 try:
---> 84     root = nltk.data.find(f"{self.subdir}/{zip_name}")
     85 except LookupError:

File ~\anaconda3\lib\site-packages\nltk\data.py:583, in find(resource_name, paths)
    582 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583 raise LookupError(resource_not_found)

LookupError:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

Attempted to load corpora/stopwords.zip/stopwords/

Searched in:
    - 'C:\Users\Abhishek Pandey/nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\share\nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\lib\nltk_data'
    - 'C:\Users\Abhishek Pandey\AppData\Roaming\nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'path_to_nltk_data'

During handling of the above exception, another exception occurred:

LookupError                               Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 sw = stopwords.words('english')

File ~\anaconda3\lib\site-packages\nltk\corpus\util.py:121, in LazyCorpusLoader.__getattr__(self, attr)
    118 if attr == "__bases__":
    119     raise AttributeError("LazyCorpusLoader object has no attribute '__bases__'")
--> 121 self.__load()
    122 # This looks circular, but its not, since __load() changes our
    123 # class to something new:
    124 return getattr(self, attr)

File ~\anaconda3\lib\site-packages\nltk\corpus\util.py:86, in LazyCorpusLoader.__load(self)
     84     root = nltk.data.find(f"{self.subdir}/{zip_name}")
     85 except LookupError:
---> 86     raise e
     88 # Load the corpus.
     89 corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)

File ~\anaconda3\lib\site-packages\nltk\corpus\util.py:81, in LazyCorpusLoader.__load(self)
     79 else:
     80     try:
---> 81         root = nltk.data.find(f"{self.subdir}/{self.__name}")
     82     except LookupError as e:
     83         try:

File ~\anaconda3\lib\site-packages\nltk\data.py:583, in find(resource_name, paths)
    581 sep = "*" * 70
    582 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583 raise LookupError(resource_not_found)

LookupError:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('stopwords')

For more information see: https://www.nltk.org/data.html

Attempted to load corpora/stopwords

Searched in:
    - 'C:\Users\Abhishek Pandey/nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\share\nltk_data'
    - 'C:\Users\Abhishek Pandey\anaconda3\lib\nltk_data'
    - 'C:\Users\Abhishek Pandey\AppData\Roaming\nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'path_to_nltk_data'
This buildpack does not use Anaconda and does not run on Windows, so it has nothing to do with the more recent issues posted to this thread.
If you’re having issues with NLTK or Anaconda, please first read their docs and failing that, follow any support/issue reporting processes they document instead:
https://www.nltk.org/
https://www.anaconda.com/
If someone has an issue with using NLTK on Heroku during a Heroku build (not locally), please open a support ticket (https://help.heroku.com).
I’m locking this issue now, since otherwise people finding this thread via search engines are just going to keep commenting here even though it has nothing to do with their problem.
heroku locked as spam and limited conversation to collaborators on Jan 27, 2023
Update
As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack: add an nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.
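For example, an nltk.txt covering the corpora used in this thread might look like this (the names are the standard NLTK package identifiers; include only what your app actually loads):

```
stopwords
punkt
```

The buildpack downloads each listed package at build time, so no post_compile hook is needed.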
Original Answer
Here’s a cleaner solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.
I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK-only installation.
The default heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:

# post_compile
#!/usr/bin/env bash
if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi
As you can see, it looks in your project directory for your own post_compile file in the bin directory, and runs it if it exists. You can use this hook to install the nltk data.
1. Create the bin directory in the root of your local project.

2. Add your own post_compile file to the bin directory:

# bin/post_compile
#!/usr/bin/env bash
if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi
echo "-----> Post-compile done"

3. Add your own install_nltk_data file to the bin directory:

# bin/install_nltk_data
#!/usr/bin/env bash
source $BIN_DIR/utils

echo "-----> Starting nltk data installation"

# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'

# Install the nltk data
# NOTE: The following command installs the stopwords corpora,
# so you may want to change it for your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords

# If using Textblob, use this instead:
# python -m textblob.download_corpora lite

# Open the NLTK_DATA directory
cd ${NLTK_DATA}

# Delete all of the zip files
find . -name "*.zip" -type f -delete

echo "-----> Finished nltk data installation"

4. Add nltk to your requirements.txt file (or textblob if you are using Textblob).

5. Commit all of these changes to your repo.

6. Set the NLTK_DATA environment variable on your heroku app:

$ heroku config:set NLTK_DATA='/app/nltk_data'

7. Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
I hope you found this helpful! Enjoy!
I am trying to run a webapp on Heroku using Flask. The webapp is programmed in Python with the NLTK (Natural Language Toolkit library).
One of the file has the following header:
import nltk, json, operator
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
When the webpage with the stopwords code is called, it produces the following error:
LookupError:
**********************************************************************
Resource 'corpora/stopwords' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/app/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
The exact code used:
#remove punctuation
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
data = toker.tokenize(data)
#remove stop words and digits
stopword = stopwords.words('english')
data = [w for w in data if w not in stopword and not w.isdigit()]
The webapp on Heroku doesn’t produce the Lookup error when stopword = stopwords.words('english')
is commented out.
The code runs without a glitch on my local computer. I have installed the required libraries on my computer using
pip install -r requirements.txt
The virtual environment provided by Heroku was running when I tested the code on my computer.
I have also tried the NLTK provided by two different sources, but the LookupError
is still there. The two sources I used are:
http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc4.zip
https://github.com/nltk/nltk.git
The problem is that the corpus (‘stopwords’ in this case) doesn’t get uploaded to Heroku. Your code works on your local machine because it already has the NLTK corpus. Please follow these steps to solve the issue.
- Create a new directory in your project (let's call it 'nltk_data').
- Download the NLTK corpus into that directory. You will have to configure that during the download.
- Tell nltk to look for this particular path: just add nltk.data.path.append('path_to_nltk_data') to the Python file that's actually using nltk.
- Now push the app to Heroku.
Hope that solves the problem. Worked for me!
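The append step above can be sketched like this, assuming a project-level nltk_data directory committed to the repo (the nltk calls are shown commented since they require the package to be installed):

```python
import os

# 'nltk_data' is the hypothetical directory created inside the project
# and filled by the downloader before pushing to Heroku.
project_root = os.getcwd()
local_data = os.path.join(project_root, "nltk_data")

# In the code that actually uses nltk:
# import nltk
# nltk.data.path.append(local_data)
# from nltk.corpus import stopwords
print(local_data)  # path the loader will now search in addition to the defaults
```

Using a path built from the project root (rather than a hard-coded absolute path) keeps the code working both locally and in the /app directory on Heroku.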