We ❤️ Open Source

We open-source our text toolkit, speech toolkit, models, and datasets.

242 Open Models
174 Open Datasets
1TB+ Total Dataset Size
🤗 HuggingFace

End-to-End Toolkit for Text and Speech

Solving text and speech problems, powered by PyTorch.

📄 Text Modules

Text augmentation, tokenization, language modeling, spelling correction, normalization, Jawi conversion, grammatical error correction (Kesalahan Tatabahasa), text classification, embedding, text similarity, text tagging and parsing, summarization, translation, and zero-shot classification.

Documentation

🗣️ Speech Modules

Language modeling, speech-to-text using RNNT, CTC and Seq2Seq, forced alignment, end-to-end text-to-speech, speech classification, speaker diarization, and a streaming interface using PyAudio and TorchAudio.

Documentation

🗄️ Open Dataset

Massive Malaysian datasets gathered using pseudolabeling and large language models.

GitHub Repository

Multi-Image, Multi-Audio, Multi-Turn, Multimodal

Groundbreaking bilingual multimodal large language model designed to understand multiple images, multiple audio clips, or a mix of both within a single multi-turn session.

MaLLaM 🌙 Malaysian Foundation Language Model

MaLLaM 🌙 (Malaysia Large Language Model) is a Malaysian foundation language model, fully open research, trained on 349GB of JSONL, equivalent to 90 billion tokens.
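To put the two corpus figures side by side, here is a minimal sketch that works out how many bytes of JSONL correspond to one token. It assumes decimal gigabytes (10**9 bytes); the variable names are ours, and the result includes JSON syntax overhead, so it is only a rough ratio and not a tokenizer statistic.

```python
# Figures quoted for the MaLLaM pretraining corpus.
# Assumption: "GB" means decimal gigabytes (10**9 bytes).
corpus_bytes = 349 * 10**9
corpus_tokens = 90 * 10**9

# Average bytes of raw JSONL (including JSON syntax) per token.
bytes_per_token = corpus_bytes / corpus_tokens

print(f"{bytes_per_token:.2f} bytes of JSONL per token")
```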

Bilingual, Multi-turn, RAG, 32k context length, Malaysian context

We fine-tuned the Malaysian foundation models on up to 13GB of instruction data, equivalent to 2B tokens, consisting of bilingual instructions in Malaysian context.

Malaysian Whisper

End-to-end speech-to-text and speech translation, fine-tuned on hyperlocal Malaysian context.

Malaysian Translation

Able to translate Malay, local Malay (social media text or local context), English, Manglish, Javanese, Banjarese and Indonesian into the target language. It is also able to keep the text structure as it is and translate only what needs translating, leaving content such as programming code untouched.
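The structure-preserving behaviour described above can be illustrated with a small sketch. This is not how the model works internally and it does not use any of our APIs: `translate_preserving_code` and `toy_translate` are hypothetical names, and the toy translator is a stand-in for a real model call. It only shows the intended outcome, where prose is translated and fenced code passes through unchanged.

```python
import re
from typing import Callable

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
# A fenced code block: opening fence, anything (non-greedy), closing fence.
CODE_BLOCK = re.compile(r"(`{3}.*?`{3})", re.DOTALL)


def translate_preserving_code(text: str, translate: Callable[[str], str]) -> str:
    """Translate prose segments and pass fenced code blocks through untouched."""
    out = []
    # Splitting on a capturing group keeps the code blocks in the result list.
    for part in CODE_BLOCK.split(text):
        if part.startswith(FENCE):
            out.append(part)             # code: keep verbatim
        elif part.strip():
            out.append(translate(part))  # prose: translate
        else:
            out.append(part)             # whitespace only: keep verbatim
    return "".join(out)


def toy_translate(segment: str) -> str:
    # Hypothetical stand-in for a real translation model call.
    return segment.replace("Contoh kod:", "Example code:")


source = f"Contoh kod:\n{FENCE}python\nprint('hai')\n{FENCE}\n"
result = translate_preserving_code(source, toy_translate)
print(result)
```

In this sketch the Malay sentence is replaced while the `print('hai')` line inside the fence is left exactly as written.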