A high-level Python interface for PoS tagging Icelandic text using the IceBERT-PoS model with classical tokenization.
- Proper device handling (GPU) for tensors
# This package is currently not available on PyPI, so you need to install it directly from the source repository.
# Without PyTorch (lighter, but model inference won't work)
# Allows you to control the PyTorch version
pip install git+ssh://git@github.com/mideind/IceBERT-PoS.git@main # Installs directly from main branch
# With PyTorch support (required for model inference) - RECOMMENDED
pip install "git+ssh://git@github.com/mideind/IceBERT-PoS.git[torch]" # @main implied
Note: The [torch] extra is required for model inference. It is packaged as a separate extra so that the package does not pin a specific PyTorch version, which lets you install the latest version compatible with your system.
Package versioning is done via Git tags. To install a specific version, use the @ syntax:
pip install "git+ssh://git@github.com/mideind/IceBERT-PoS.git@<version>"
- v0.3.0: Loosen transformers version requirement from >=4.46.3,<5.0 to >=4.46.3,<6.0. In other words, add support for transformers version 5.
- v0.2.0: First stable release
- Classical Tokenization: Uses the Miðeind tokenizer for Icelandic tokenization
- Character Positions: Preserves exact character start/end positions in original text
- Sentence-Aware Processing: Maintains sentence boundaries and processes them in batches
- Dual Format Output: Provides both IFD tags and structured category/features
- Caller-owned Model: Load model once, reuse for multiple calls
- Batch Processing: Efficient processing of multiple sentences
After installation, you can use the icebert-pos command:
# Basic POS tagging with full IFD tags
icebert-pos "Þetta er stutt sýnidæmi."
# Þetta[fahen] er[sfg3en] stutt[lhensf] sýnidæmi[nhen].[pl]
# Get only POS categories (without detailed features)
icebert-pos --only-category "Þetta er stutt sýnidæmi."
# Þetta[fa] er[sf] stutt[l] sýnidæmi[n].[pl]
# Get structured json output
icebert-pos --json "Þetta er stutt sýnidæmi."
# [
#   [
#     {
#       "text": "Þetta",
#       "char_start": 0,
#       "char_end": 5,
#       "category": "fa",
#       "features": [
#         "neut",
#         "sing",
#         "nom"
#       ],
#       "ifd_tag": "fahen"
#     },
#     ...
#     {
#       "text": ".",
#       "char_start": 23,
#       "char_end": 24,
#       "category": "pl",
#       "features": [],
#       "ifd_tag": "pl"
#     }
#   ]
# ]
# Default behavior is to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens
icebert-pos "samskipta- og kynningarstýra"
# 3 tokens:
# samskipta-[kt] og[c] kynningarstýra[nven]
icebert-pos --keep-composite-tokens "samskipta- og kynningarstýra"
# 1 token:
# samskipta- og kynningarstýra[nven]
# Enable debug logging
icebert-pos --debug "Þetta er stutt sýnidæmi."
# lots of output
There are some additional command line options available; run icebert-pos --help to see them.
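The structured output of --json can be read from Python with the standard json module. The snippet below is a minimal sketch, assuming the output shape shown in the --json example (a list of sentences, each a list of token objects). The sample string is hand-copied from that example with the elided middle tokens left out, and the helper name ifd_tags is hypothetical, not part of the package.

```python
import json

# Sample copied from the --json example; the tokens elided there ("...")
# are simply omitted, so only the first and last token are present.
SAMPLE_OUTPUT = """
[
  [
    {"text": "Þetta", "char_start": 0, "char_end": 5, "category": "fa",
     "features": ["neut", "sing", "nom"], "ifd_tag": "fahen"},
    {"text": ".", "char_start": 23, "char_end": 24, "category": "pl",
     "features": [], "ifd_tag": "pl"}
  ]
]
"""


def ifd_tags(json_text):
    """Return one list of (text, ifd_tag) pairs per sentence (hypothetical helper)."""
    sentences = json.loads(json_text)
    return [[(tok["text"], tok["ifd_tag"]) for tok in sentence] for sentence in sentences]


pairs = ifd_tags(SAMPLE_OUTPUT)
print(pairs)
```

In practice the string would come from the command's standard output rather than a literal.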
from icebert_pos import pos_tag_text, TaggedToken
from transformers import AutoModel, AutoTokenizer
import torch
# Load model and tokenizer, you need to have trust_remote_code=True to load the custom model code.
# You can check the model repository for details: https://huggingface.co/mideind/IceBERT-PoS
model = AutoModel.from_pretrained("mideind/IceBERT-PoS", trust_remote_code=True)
# set the model to evaluation mode - otherwise the output will be stochastic
model.eval()
# place the model on the appropriate device (CPU/GPU)
model.to("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("mideind/IceBERT-PoS")
text = "Þetta er stutt sýnidæmi."
# POS tag text - returns List[List[TaggedToken]]
sentence_results = pos_tag_text(text, model, tokenizer)
assert sentence_results == [
    [
        TaggedToken(text="Þetta", char_start=0, char_end=5, category="fa", features=["neut", "sing", "nom"], ifd_tag="fahen"),
        TaggedToken(text="er", char_start=6, char_end=8, category="sf", features=["sing", "act", "3", "pres"], ifd_tag="sfg3en"),
        TaggedToken(text="stutt", char_start=9, char_end=14, category="l", features=["neut", "sing", "nom", "strong", "pos"], ifd_tag="lhensf"),
        TaggedToken(text="sýnidæmi", char_start=15, char_end=23, category="n", features=["neut", "sing", "nom"], ifd_tag="nhen"),
        TaggedToken(text=".", char_start=23, char_end=24, category="pl", features=[], ifd_tag="pl"),
    ]
]
# For processing multiple sentences efficiently
# The Miðeind tokenizer will split this string into 3 sentences and process them in batches
texts = ["Fyrsti texti.", "Annar texti.", "Þriðji texti."]
# The batching is done automatically by the pos_tag_text function and this will call model.forward twice
sentence_results = pos_tag_text("\n".join(texts), model, tokenizer, batch_size=2)
assert len(sentence_results) == 3 # Should return 3 sentences
from icebert_pos import (
    segment_text_to_sentences,
    prepare_sentence,
    batch_sentences,
    predict_sentences,
)
# Same example as before
text = "Þetta er stutt sýnidæmi."
# Segment text into sentences
sentences = segment_text_to_sentences(text)
# Prepare individual sentences
sentence_tensors = []
for sentence in sentences:
    tensors = prepare_sentence(sentence, model, tokenizer, truncate=True)
    sentence_tensors.append(tensors)
# Batch multiple sentences for efficient processing
batch_input_ids, batch_attention_mask, batch_word_mask = batch_sentences(
    sentence_tensors, tokenizer
)
# Get raw predictions
predictions = predict_sentences(
    batch_input_ids, batch_attention_mask, batch_word_mask, model
)
# predictions is List[List[Tuple[str, List[str]]]]
# - List of sentences
# - Each sentence has List of (category, features) tuples for each word
assert predictions == [
    [
        ("fa", ["neut", "sing", "nom"]),
        ("sf", ["sing", "act", "3", "pres"]),
        ("l", ["neut", "sing", "nom", "strong", "pos"]),
        ("n", ["neut", "sing", "nom"]),
        ("pl", []),
    ]
]
Basic token with text and position:
- text: The token text
- char_start: Start position in original text
- char_end: End position in original text
Collection of tokens representing a sentence:
- tokens: List of Token objects
Token with POS tagging information (extends Token):
- text: The token text
- char_start: Start position in original text
- char_end: End position in original text
- category: POS category (e.g., "fp", "sfg")
- features: List of morphological features (e.g., ["1", "sing", "nom"])
- ifd_tag: Full IFD POS tag (e.g., "fp1en", "sfg3en")
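Because each token carries its character offsets, tagged output can be mapped back onto the original string. The sketch below illustrates this with a local stand-in dataclass that mirrors the documented fields (it is not imported from icebert_pos) and a hypothetical render helper. It assumes char_end is exclusive, which is consistent with the example values (0 to 5 for "Þetta").

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaggedToken:
    """Local stand-in mirroring the documented fields; not the package's class."""
    text: str
    char_start: int
    char_end: int
    category: str
    features: List[str] = field(default_factory=list)
    ifd_tag: str = ""


def render(original: str, tokens: List[TaggedToken]) -> str:
    """Rebuild the text with [ifd_tag] after each token, keeping original spacing."""
    parts = []
    cursor = 0
    for tok in tokens:
        # Copy whatever sits between the previous token and this one (usually whitespace).
        parts.append(original[cursor:tok.char_start])
        parts.append(f"{original[tok.char_start:tok.char_end]}[{tok.ifd_tag}]")
        cursor = tok.char_end
    parts.append(original[cursor:])  # trailing text after the last token, if any
    return "".join(parts)


text = "Þetta er stutt sýnidæmi."
sentence = [
    TaggedToken("Þetta", 0, 5, "fa", ["neut", "sing", "nom"], "fahen"),
    TaggedToken("er", 6, 8, "sf", ["sing", "act", "3", "pres"], "sfg3en"),
    TaggedToken("stutt", 9, 14, "l", ["neut", "sing", "nom", "strong", "pos"], "lhensf"),
    TaggedToken("sýnidæmi", 15, 23, "n", ["neut", "sing", "nom"], "nhen"),
    TaggedToken(".", 23, 24, "pl", [], "pl"),
]
rendered = render(text, sentence)
print(rendered)
```

With the example sentence this yields the same bracketed form as the basic command line example, although the command line tool may well produce it differently.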
- pos_tag_text(text, model, tokenizer, batch_size=1, split_composite_tokens=True, truncate=False) - Main function for POS tagging
- segment_text_to_sentences(text, split_composite_tokens=True) - Segment text into sentences using classical tokenization
- batch_size: Number of sentences to process in each batch for efficiency (default: 1)
- split_composite_tokens: Whether to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens on whitespace (default: True)
- truncate: Whether to truncate input sequences that exceed the model's maximum length. If False, long sentences may cause errors (default: False)
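To make the effect of split_composite_tokens concrete, here is a small sketch of splitting a composite token on whitespace while keeping character offsets. It illustrates the idea only and is not the library's implementation; the function name split_on_whitespace and the tuple layout are assumptions made for this example.

```python
import re
from typing import List, Tuple


def split_on_whitespace(token_text: str, char_start: int) -> List[Tuple[str, int, int]]:
    """Split a composite token into (text, char_start, char_end) pieces.

    char_start is the offset of the composite token in the original text,
    so the returned offsets also refer to the original text.
    """
    return [
        (m.group(), char_start + m.start(), char_start + m.end())
        for m in re.finditer(r"\S+", token_text)
    ]


composite = "samskipta- og kynningarstýra"
pieces = split_on_whitespace(composite, 0)
for piece_text, start, end in pieces:
    print(piece_text, start, end)
```

The three pieces correspond to the three tokens in the default command line output for the same input.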
- prepare_sentence(sentence, model, tokenizer, truncate=False) - Prepare tensors for a single sentence
- batch_sentences(sentence_tensors, tokenizer) - Batch multiple sentence tensors
- predict_sentences(input_ids, attention_mask, word_mask, model) - Get raw predictions from model
When using the lower-level functions you can control more of the processing but will also need to handle device placement and batching manually.
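As an illustration of manual batching, the sketch below groups a list of items into chunks of at most batch_size. The chunked helper is hypothetical and not part of the package; plain strings stand in for the per-sentence tensors so that the example runs without a model.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Strings stand in for the tensors returned by prepare_sentence.
sentences = ["Fyrsti texti.", "Annar texti.", "Þriðji texti."]
batches = list(chunked(sentences, batch_size=2))
print(len(batches), batches)
```

Three items with a batch size of 2 give two chunks, which matches the two model.forward calls mentioned for the batch example. In a real pipeline each chunk of prepared tensors would presumably be passed to batch_sentences and then predict_sentences.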
MIT
Copyright (C) Miðeind ehf.