Pip install fasttext error

I pip installed fastText yesterday without any issues. When I tried the same today, it failed with the following error: "error: command 'gcc' failed with exit status 1". I assume ...

Hello, with the latest code (Jan 25, 2022), I'm having a problem installing fastText on Ubuntu 21.04.

@Celebio, is this related to the version of gcc?

Thank You
Mo

pip._internal.exceptions.InstallationSubprocessError: Command errored out with exit status 1: /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-5c2xoor2/install-record.txt --single-version-externally-managed --compile --install-headers /home/projects/projecta12/.projecta12venv/include/site/python3.9/fasttext Check the logs for full command output.
Removed build tracker: '/tmp/pip-req-tracker-a28l9v21'

Install Method
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

$ gcc --version
gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 10.3.0-1ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-10 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-10-gDeRY6/gcc-10-10.3.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-10-gDeRY6/gcc-10-10.3.0/debian/tmp-gcn/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-mutex
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)

(.projecta12venv) root@kbase:/home/projects/fastText# pip install . --verbose

Using pip 20.3.4 from /home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip (python 3.9)
Non-user install because user site-packages disabled
Created temporary directory: /tmp/pip-ephem-wheel-cache-sk7xeogb
Created temporary directory: /tmp/pip-req-tracker-a28l9v21
Initialized build tracking at /tmp/pip-req-tracker-a28l9v21
Created build tracker: /tmp/pip-req-tracker-a28l9v21
Entered build tracker: /tmp/pip-req-tracker-a28l9v21
Created temporary directory: /tmp/pip-install-0ri5v_cs
Processing /home/projects/fastText
Created temporary directory: /tmp/pip-req-build-b_xalqqz
Added file:///home/projects/fastText to build tracker ‘/tmp/pip-req-tracker-a28l9v21’
Running setup.py (path:/tmp/pip-req-build-b_xalqqz/setup.py) egg_info for package from file:///home/projects/fastText
Created temporary directory: /tmp/pip-pip-egg-info-2rel6zul
Running command python setup.py egg_info
running egg_info
creating /tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info
writing /tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/dependency_links.txt
writing requirements to /tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/requires.txt
writing top-level names to /tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/top_level.txt
writing manifest file ‘/tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/SOURCES.txt’
reading manifest file ‘/tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/SOURCES.txt’
reading manifest template ‘MANIFEST.in’
warning: no files found matching ‘PATENTS’
writing manifest file ‘/tmp/pip-pip-egg-info-2rel6zul/fasttext.egg-info/SOURCES.txt’
Source in /tmp/pip-req-build-b_xalqqz has version 0.9.2, which satisfies requirement fasttext==0.9.2 from file:///home/projects/fastText
Removed fasttext==0.9.2 from file:///home/projects/fastText from build tracker ‘/tmp/pip-req-tracker-a28l9v21’
Requirement already satisfied: numpy in /home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages (from fasttext==0.9.2) (1.21.3)
Requirement already satisfied: pybind11>=2.2 in /home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages (from fasttext==0.9.2) (2.9.0)
Requirement already satisfied: setuptools>=0.7.0 in /home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages (from fasttext==0.9.2) (44.1.1)
Created temporary directory: /tmp/pip-unpack-fujna6k4
Building wheels for collected packages: fasttext
Created temporary directory: /tmp/pip-wheel-r4fglmxi
Building wheel for fasttext (setup.py) … Destination directory: /tmp/pip-wheel-r4fglmxi
Running command /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-r4fglmxi
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/fasttext
copying python/fasttext_module/fasttext/FastText.py -> build/lib.linux-x86_64-3.9/fasttext
copying python/fasttext_module/fasttext/__init__.py -> build/lib.linux-x86_64-3.9/fasttext
creating build/lib.linux-x86_64-3.9/fasttext/util
copying python/fasttext_module/fasttext/util/util.py -> build/lib.linux-x86_64-3.9/fasttext/util
copying python/fasttext_module/fasttext/util/__init__.py -> build/lib.linux-x86_64-3.9/fasttext/util
creating build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/test_script.py -> build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/__init__.py -> build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/test_configurations.py -> build/lib.linux-x86_64-3.9/fasttext/tests
running build_ext
creating tmp
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c /tmp/tmph0_xxpxk.cpp -o tmp/tmph0_xxpxk.o -std=c++11
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c /tmp/tmppw8fp5jm.cpp -o tmp/tmppw8fp5jm.o -fvisibility=hidden
building ‘fasttext_pybind’ extension
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/python
creating build/temp.linux-x86_64-3.9/python/fasttext_module
creating build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext
creating build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext/pybind
creating build/temp.linux-x86_64-3.9/src
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pybind11/include -I/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pybind11/include -Isrc -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c python/fasttext_module/fasttext/pybind/fasttext_pybind.cc -o build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext/pybind/fasttext_pybind.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc: In lambda function:
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:346:35: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘std::vector::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
346 | for (int32_t i = 0; i < vocab_freq.size(); i++) {
| ~~^~~~~~~~~~~~~~~~~~~
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc: In lambda function:
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:360:35: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘std::vector::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
360 | for (int32_t i = 0; i < labels_freq.size(); i++) {
| ~~^~~~~~~~~~~~~~~~~~~~
x86_64-linux-gnu-gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
error: command ‘/usr/bin/x86_64-linux-gnu-gcc’ failed with exit code 1
error
ERROR: Failed building wheel for fasttext
Running setup.py clean for fasttext
Running command /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
running clean
removing 'build/temp.linux-x86_64-3.9' (and everything under it)
removing 'build/lib.linux-x86_64-3.9' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.9' does not exist -- can't clean it
removing 'build'
Failed to build fasttext
Installing collected packages: fasttext
Created temporary directory: /tmp/pip-record-5c2xoor2
Running command /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-5c2xoor2/install-record.txt --single-version-externally-managed --compile --install-headers /home/projects/projecta12/.projecta12venv/include/site/python3.9/fasttext
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/fasttext
copying python/fasttext_module/fasttext/FastText.py -> build/lib.linux-x86_64-3.9/fasttext
copying python/fasttext_module/fasttext/__init__.py -> build/lib.linux-x86_64-3.9/fasttext
creating build/lib.linux-x86_64-3.9/fasttext/util
copying python/fasttext_module/fasttext/util/util.py -> build/lib.linux-x86_64-3.9/fasttext/util
copying python/fasttext_module/fasttext/util/__init__.py -> build/lib.linux-x86_64-3.9/fasttext/util
creating build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/test_script.py -> build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/__init__.py -> build/lib.linux-x86_64-3.9/fasttext/tests
copying python/fasttext_module/fasttext/tests/test_configurations.py -> build/lib.linux-x86_64-3.9/fasttext/tests
running build_ext
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c /tmp/tmpxukhxpxd.cpp -o tmp/tmpxukhxpxd.o -std=c++11
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c /tmp/tmp5hn07uog.cpp -o tmp/tmp5hn07uog.o -fvisibility=hidden
building ‘fasttext_pybind’ extension
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/python
creating build/temp.linux-x86_64-3.9/python/fasttext_module
creating build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext
creating build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext/pybind
creating build/temp.linux-x86_64-3.9/src
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -ffile-prefix-map=/build/python3.9-FZ7wim/python3.9-3.9.5=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pybind11/include -I/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pybind11/include -Isrc -I/home/projects/projecta12/.projecta12venv/include -I/usr/include/python3.9 -c python/fasttext_module/fasttext/pybind/fasttext_pybind.cc -o build/temp.linux-x86_64-3.9/python/fasttext_module/fasttext/pybind/fasttext_pybind.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc: In lambda function:
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:346:35: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘std::vector::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
346 | for (int32_t i = 0; i < vocab_freq.size(); i++) {
| ~~^~~~~~~~~~~~~~~~~~~
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc: In lambda function:
python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:360:35: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘std::vector::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
360 | for (int32_t i = 0; i < labels_freq.size(); i++) {
| ~~^~~~~~~~~~~~~~~~~~~~
x86_64-linux-gnu-gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
error: command ‘/usr/bin/x86_64-linux-gnu-gcc’ failed with exit code 1
Running setup.py install for fasttext … error
ERROR: Command errored out with exit status 1: /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-5c2xoor2/install-record.txt --single-version-externally-managed --compile --install-headers /home/projects/projecta12/.projecta12venv/include/site/python3.9/fasttext Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/req/req_install.py", line 848, in install
    success = install_legacy(
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 223, in _main
    status = self.run(options, args)
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/cli/req_command.py", line 180, in wrapper
    return func(self, options, args)
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/commands/install.py", line 421, in run
    installed = install_given_reqs(
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/req/__init__.py", line 82, in install_given_reqs
    requirement.install(
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/req/req_install.py", line 866, in install
    six.reraise(*exc.parent)
  File "/usr/share/python-wheels/six-1.15.0-py2.py3-none-any.whl/six.py", line 703, in reraise
    raise value
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/operations/install/legacy.py", line 74, in install
    runner(
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/utils/subprocess.py", line 292, in runner
    call_subprocess(
  File "/home/projects/projecta12/.projecta12venv/lib/python3.9/site-packages/pip/_internal/utils/subprocess.py", line 261, in call_subprocess
    raise InstallationSubprocessError(proc.returncode, command_desc)
pip._internal.exceptions.InstallationSubprocessError: Command errored out with exit status 1: /home/projects/projecta12/.projecta12venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-b_xalqqz/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-5c2xoor2/install-record.txt --single-version-externally-managed --compile --install-headers /home/projects/projecta12/.projecta12venv/include/site/python3.9/fasttext Check the logs for full command output.
Removed build tracker: '/tmp/pip-req-tracker-a28l9v21'
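
A note on reading the log: the line "fatal error: Killed signal terminated program cc1plus" says the compiler process was killed from outside, not that it rejected the code. On Linux this is often the kernel's out-of-memory killer, so low memory may be a more likely cause than the gcc version. That diagnosis is an assumption; the log alone does not confirm it. A minimal sketch for checking memory before retrying the build, where parse_meminfo, enough_memory and the 2 GiB threshold are hypothetical choices for illustration:

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def enough_memory(meminfo, required_kb=2 * 1024 * 1024):
    """Return True if MemAvailable plus SwapFree reaches the threshold.

    The 2 GiB default is an assumed, illustrative figure, not a
    documented fastText requirement.
    """
    available = meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)
    return available >= required_kb


if __name__ == "__main__":
    path = Path("/proc/meminfo")  # exists on Linux only
    if path.exists():
        info = parse_meminfo(path.read_text())
        print("MemAvailable kB:", info.get("MemAvailable"))
        print("SwapFree kB:", info.get("SwapFree"))
        print("Looks sufficient:", enough_memory(info))
```

If memory does turn out to be short, adding swap space or building on a larger machine are the usual things to try.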

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Table of contents

  • Resources
    • Models
    • Supplementary data
    • FAQ
    • Cheatsheet
  • Requirements
  • Building fastText
    • Getting the source code
    • Building fastText using make (preferred)
    • Building fastText using cmake
    • Building fastText for Python
  • Example use cases
    • Word representation learning
    • Obtaining word vectors for out-of-vocabulary words
    • Text classification
  • Full documentation
  • References
    • Enriching Word Vectors with Subword Information
    • Bag of Tricks for Efficient Text Classification
    • FastText.zip: Compressing text classification models
  • Join the fastText community
  • License

Resources

Models

  • Recent state-of-the-art English word vectors.
  • Word vectors for 157 languages trained on Wikipedia and Crawl.
  • Models for language identification and various supervised tasks.

Supplementary data

  • The preprocessed YFCC100M data used in [2].

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We are continuously building and testing our library, CLI and Python bindings under various docker images using circleci.

Generally, fastText builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

  • (g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

  • Python 2.6 or newer
  • NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

  • Python version 2.7 or >=3.4
  • NumPy & SciPy
  • pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release in the usual place.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and introduction see python/README.md

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.
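
As an illustration of how the text output might be consumed, here is a minimal sketch of a reader for model.vec. It assumes the usual layout in which the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its components. parse_vec is a hypothetical helper, not part of fastText.

```python
def parse_vec(lines):
    """Parse .vec-style text lines into (dim, {word: [floats]}).

    Assumes a header line "<vocab_size> <dim>" followed by one
    "word v1 v2 ... v_dim" line per word.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Toy input made up for this example; a real file would be read with
# open("model.vec", encoding="utf-8").
toy = ["2 3", "king 0.1 0.2 0.3", "queen 0.4 0.5 0.6"]
dim, vectors = parse_vec(toy)
print(dim, sorted(vectors))
```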

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].
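
Vectors for out-of-vocabulary words are possible because, as described in [1], a word is represented through its character n-grams. The sketch below shows only the n-gram extraction step, under simplifying assumptions: it works on Python characters and returns plain strings, whereas the real implementation hashes n-grams into a fixed number of buckets (see the -bucket argument). char_ngrams is a hypothetical helper, not fastText's API.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The boundary markers let the model tell a prefix or suffix apart from the same characters in the middle of a word.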

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
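
To make the two metrics concrete, here is a minimal sketch that computes P@k and R@k from ranked predictions. It uses a pooled definition (correct labels among the top k, divided by the number of predicted labels for precision and by the number of true labels for recall); the details of fastText's own test command may differ. precision_recall_at_k and the toy labels are hypothetical.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """Compute pooled P@k and R@k.

    predicted: list of ranked label lists, best label first.
    gold: list of sets of true labels, aligned with predicted.
    """
    correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy data made up for this example.
predicted = [["__label__a", "__label__b"], ["__label__c", "__label__a"]]
gold = [{"__label__a"}, {"__label__a", "__label__b"}]
print(precision_recall_at_k(predicted, gold, k=1))
print(precision_recall_at_k(predicted, gold, k=2))
```

Raising k can only keep or increase the number of correct labels found, so recall never drops as k grows, while precision may fall.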

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. To reproduce the results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, such as test or predict, works the same way on quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

References

Please cite [1] if using this code for learning word representations or [2] if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

  • Facebook page: https://www.facebook.com/groups/1174547215919768
  • Google group: https://groups.google.com/forum/#!forum/fasttext-library
  • Contact: [email protected], [email protected], [email protected], [email protected]

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.

fastText is a library for efficient learning
of word representations and sentence classification.

In this document we present how to use fastText in python.

Table of contents

  • Requirements

  • Installation

  • Usage overview

  • Word representation model

  • Text classification model

  • IMPORTANT: Preprocessing data / encoding
    conventions

  • More examples

  • API

  • train_unsupervised parameters

  • train_supervised parameters

  • model object

Requirements

fastText builds on modern Mac OS and Linux
distributions. Since it uses C++11 features, it requires a compiler with
good C++11 support. You will need Python
(version 2.7 or ≥ 3.4), NumPy & SciPy, and pybind11.

Installation

To install the latest release, run:

$ pip install fasttext

or, to get the latest development version of fasttext, you can install
from our GitHub repository:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

In order to learn word vectors, as described in
Enriching Word Vectors with Subword Information [1],
we can use the fasttext.train_unsupervised function like this:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing UTF-8 encoded text.

The returned model object represents your learned model, and you can
use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save your trained model object by calling the function
save_model.

model.save_model("model_filename.bin")

and retrieve it later with the load_model function:

model = fasttext.load_model("model_filename.bin")

For more information about word representation usage of fastText, you
can refer to our word representations tutorial.

Text classification model

In order to train a text classifier using the method described here, we
can use the fasttext.train_supervised function like this:

import fasttext

model = fasttext.train_supervised('data.train.txt')

where data.train.txt is a text file containing one training sentence
per line along with its labels. By default, we assume that labels are
words that are prefixed by the string __label__.
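
As an illustration of this file format, here is a minimal sketch that builds training lines from labels and text and parses them back. The helper names format_example and parse_example and the sample sentence are hypothetical and not part of fastText; only the __label__ prefix convention comes from the description above.

```python
LABEL_PREFIX = "__label__"  # default label prefix described above


def format_example(labels, text):
    """Build one training line: prefixed labels followed by the sentence."""
    # Collapse whitespace so a stray newline cannot split one example in two.
    text = " ".join(text.split())
    prefixed = [LABEL_PREFIX + label for label in labels]
    return " ".join(prefixed + [text])


def parse_example(line):
    """Split a training line back into (labels, words)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, words


line = format_example(["baking", "equipment"], "Which baking dish is best ?")
print(line)
print(parse_example(line))
```

Writing one such line per example to data.train.txt should give a file in the shape train_supervised expects.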

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the
recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
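
To make the two metrics concrete, the sketch below computes precision and recall at k in plain Python. It assumes the usual definitions (correct predictions divided by the number of predictions, and by the number of gold labels, summed over all examples). The function name and the sample labels are made up for illustration, and this is not fastText's own code.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of label sets; predicted: list of ranked label lists."""
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold_labels)
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical gold labels and ranked predictions for two examples.
gold = [{"a", "b"}, {"c"}]
predicted = [["a", "c"], ["b", "c"]]
print(precision_recall_at_k(gold, predicted, k=1))
print(precision_recall_at_k(gold, predicted, k=2))
```

With several gold labels per example, recall at 1 can never reach 1.0, which is one reason to also look at larger values of k.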

We can also predict labels for a specific text :

model.predict("Which baking dish is best to bake a banana bread ?")

By default, predict returns only one label : the one with the
highest probability. You can also predict more than one label by
specifying the parameter k:

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict more than one sentence you can pass an array of
strings :

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)

Of course, you can also save and load a model to/from a file as in the
word representation usage.

For more information about text classification usage of fastText, you
can refer to our text classification tutorial.

Compress model files with quantization

When you want to save a supervised model file, fastText can compress it
to produce a much smaller model file at the cost of only a small loss in
performance.

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model
# (valid_data must hold the path to a validation file) :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")

model_filename.ftz will have a much smaller size than
model_filename.bin.

For further reading on quantization, you can refer to this paragraph
from our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In
particular our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for
Python2 and str for Python3. The passed text will be encoded as UTF-8 by
pybind11 before being passed to the fastText C++ library. This means it
is important to use UTF-8 encoded text when building a model. On
Unix-like systems you can convert text using iconv.
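
If iconv is not at hand, the same conversion can be sketched in Python. The function name convert_to_utf8, the file names and the Latin-1 source encoding below are assumptions chosen for the demo; you need to know the real source encoding of your data.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file as UTF-8, leaving line endings untouched."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a throwaway Latin-1 file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.latin1.txt")
    dst_path = os.path.join(tmp, "data.txt")
    with open(src_path, "wb") as f:
        f.write("caf\u00e9 cr\u00e8me\n".encode("latin-1"))
    convert_to_utf8(src_path, dst_path, "latin-1")
    with open(dst_path, "rb") as f:
        converted = f.read()

print(converted == "caf\u00e9 cr\u00e8me\n".encode("utf-8"))
```

Decoding with the wrong source encoding either raises an error or silently produces wrong characters, so it is worth checking a few lines of the output by eye.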

fastText will tokenize (split text into pieces) based on the following
ASCII characters (bytes). In particular, it is not aware of UTF-8
whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.

  • space

  • tab

  • vertical tab

  • carriage return

  • formfeed

  • the null character
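
A minimal sketch of this advice, assuming Python's str.isspace() is an acceptable definition of "Unicode whitespace" for your data: it replaces whitespace that is not in the list above with a plain space and then splits on the listed separators only. Both helper names are hypothetical, and split_tokens only imitates the rule described here, not the C++ tokenizer itself.

```python
# The ASCII separators listed above: space, tab, vertical tab,
# carriage return, formfeed and the null character.
ASCII_SEPARATORS = " \t\v\r\f\0"


def normalize_whitespace(text):
    """Replace other Unicode whitespace with a plain space; keep newlines."""
    out = []
    for ch in text:
        if ch.isspace() and ch != "\n" and ch not in ASCII_SEPARATORS:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


def split_tokens(line):
    """Split a single line on the ASCII separators only."""
    tokens, current = [], []
    for ch in line:
        if ch in ASCII_SEPARATORS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


raw = "deep\u00a0learning\u3000rocks"  # no-break space, ideographic space
print(split_tokens(raw))                        # one token: separators not seen
print(split_tokens(normalize_whitespace(raw)))  # three tokens
```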

The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX_LINE_SIZE constant as defined in the Dictionary header. This means
that if you have text that is not separated by newlines, such as the fil9
dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and the
EOS token will not be appended.
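
The rule can be pictured with the simplified sketch below. It is a model of the behavior described above, not a port of the C++ code: the exact chunk boundaries in the library may differ, empty lines are skipped here for brevity, and the values "</s>" for EOS and 1024 for MAX_LINE_SIZE are assumptions that should be checked against the Dictionary header.

```python
EOS = "</s>"  # assumed value of the EOS token


def iter_lines(text, max_line_size=1024):
    """Yield token lists, appending EOS only where a newline ended the line."""
    segments = text.split("\n")
    for i, segment in enumerate(segments):
        tokens = [t for t in segment.split(" ") if t]
        ends_with_newline = i < len(segments) - 1
        for start in range(0, len(tokens), max_line_size):
            chunk = tokens[start:start + max_line_size]
            is_last_chunk = start + max_line_size >= len(tokens)
            if is_last_chunk and ends_with_newline:
                chunk = chunk + [EOS]
            yield chunk


# A tiny max_line_size makes the chunking visible.
for line in iter_lines("a b c d e\nf g", max_line_size=2):
    print(line)
```

In this sketch only the chunk that actually ends at a newline receives the EOS token; chunks cut by the size limit do not.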

The length of a token is the number of UTF-8 characters, counted by
using the leading two bits of each byte to identify the subsequent
(continuation) bytes of a multi-byte sequence. Knowing this is especially
important when choosing the minimum and maximum length of subwords.
Further, the EOS token (as specified in the Dictionary header) is
considered a character and will not be broken into subwords.
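
The counting rule can be sketched as follows: in UTF-8, continuation bytes have the bit pattern 10xxxxxx, so a byte starts a new character exactly when its two leading bits are not 10. The helper names utf8_chars and char_ngrams are hypothetical, and the n-gram function only illustrates why lengths are counted in characters rather than bytes; it does not reproduce every detail of fastText's subword extraction.

```python
def utf8_chars(token):
    """Group the UTF-8 bytes of a token into one bytes object per character."""
    chars = []
    for byte in token.encode("utf-8"):
        if (byte & 0xC0) == 0x80 and chars:
            chars[-1] += bytes([byte])   # continuation byte: extend last char
        else:
            chars.append(bytes([byte]))  # leading byte: start a new char
    return chars


def char_ngrams(token, minn, maxn):
    """Character n-grams with lengths from minn to maxn, counted in characters."""
    chars = utf8_chars(token)
    ngrams = []
    for start in range(len(chars)):
        for n in range(minn, maxn + 1):
            if start + n > len(chars):
                break
            ngrams.append(b"".join(chars[start:start + n]).decode("utf-8"))
    return ngrams


word = "na\u00efve"  # 5 characters, 6 bytes in UTF-8
print(len(word.encode("utf-8")), len(utf8_chars(word)))
print(char_ngrams("h\u00e9llo", 3, 4))
```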

More examples

In order to have a better knowledge of fastText models, please consider
the main README and in particular the tutorials on our website.

You can find further Python examples in the doc folder.

As with any package you can get help on any Python function using the
help function.

For example:

>>> import fasttext
>>> help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

model object

The train_supervised, train_unsupervised and load_model functions
return an instance of the _FastText class, which we generally call the
model object.

This object exposes those training arguments as properties : lr,
dim, ws, epoch, minCount, minCountLabel, minn,
maxn, neg, wordNgrams, loss, bucket, thread,
lrUpdateRate, t, label, verbose, pretrainedVectors.
So model.wordNgrams will give you the max length of word ngram used
for training this model.

In addition, the object exposes several functions :

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model, reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words, labels return the words and labels from
the dictionary :

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

The object overrides __getitem__ and __contains__ functions in
order to return the representation of a word and to check if a word is
in the vocabulary.

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
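
For readers unfamiliar with these two special methods, the toy class below shows how such overrides map the bracket and "in" syntax onto ordinary method calls. ToyModel and its two-dimensional vectors are invented for illustration; it is not fastText's implementation and only mirrors the equivalences stated above.

```python
class ToyModel:
    """Toy stand-in for a model object; not fastText's implementation."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)  # word -> list of floats

    def get_words(self):
        return list(self._vectors)

    def get_word_vector(self, word):
        return self._vectors[word]

    def __getitem__(self, word):
        # model[word] delegates to get_word_vector(word)
        return self.get_word_vector(word)

    def __contains__(self, word):
        # word in model delegates to a lookup in get_words()
        return word in self.get_words()


toy = ToyModel({"king": [0.1, 0.2], "queen": [0.3, 0.4]})
print(toy["king"] == toy.get_word_vector("king"))
print("king" in toy, "pawn" in toy)
```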
