We ❤️ Open Source
We open-source our text toolkit, speech toolkit, models, and datasets.
End-to-End Toolkit for Text and Speech
Solving text and speech problems, powered by PyTorch.
📄 Text Modules
Text augmentation, tokenization, language modeling, spelling correction, normalization, Jawi, Kesalahan Tatabahasa (grammatical errors), text classification, embedding, text similarity, text tagging and parsing, summarization, translation, and zero-shot classification.
🗣️ Speech Modules
Language model, speech-to-text using RNNT, CTC, and Seq2Seq, forced alignment, end-to-end text-to-speech, speech classification, speaker diarization, and a streaming interface using PyAudio and TorchAudio.
🗄️ Open Dataset
A massive Malaysian dataset gathered using pseudolabeling and large language models.
Multi-Images Multi-Audio Multi-turn Multi-Modal
A groundbreaking bilingual multimodal large language model designed to comprehend multiple images, multiple audio inputs, and combinations of both within a single multi-turn session.
MaLLaM 🌙 Malaysian Foundation Language Model
MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation language model and a fully open research effort, trained on 349 GB of JSONL, equivalent to 90 billion tokens.
Bi-lingual, Multi-turn, RAG, 32k context length, Malaysian context
We finetuned Malaysian foundation models on up to 13 GB, equivalent to 2B tokens, of instruction data consisting of bilingual instructions in Malaysian context.
Malaysian Whisper
End-to-end speech-to-text and speech translation, finetuned on hyperlocal Malaysian context.
Malaysian Translation
Translates Malay, local Malay (social media text or local context), English, Manglish, Javanese, Banjarese, and Indonesian into a target language. It also preserves the structure of the text and translates only what needs translating, for example in text that contains programming code.